HW5. Spatial data (2D, 3D, etc.) (due 7th of Oct, 23:59)
We will use the RGB pixel values of photos (any photos) as a source of multi-dimensional data. The example code extracts pixel blocks of sizes (1,1), (1,3), and (3,4) to make 3D, 9D, and 36D vectors from the provided image (a local filename or a URL from the internet). For example, the block size (3,4) extracts a 3x4 = 12-pixel mini image and flattens it into a 1x36 vector of RGB values. For the sake of testing, you can also use 2D data, or larger blocks for longer vectors, in the tasks below. Here are also example images for you to download and/or test: https://drive.google.com/drive/folders/1ahrn6c-aF1AQ_VKEdRxbG9-MwEIEiGel
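The block-to-vector step can be sketched as follows. This is a hypothetical helper (not the notebook's actual code), assuming the image has already been loaded into a NumPy RGB array, e.g. via Pillow:

```python
import numpy as np

# Hypothetical sketch: extract (h, w) pixel blocks from an RGB image array
# and flatten each block into a 1 x (h*w*3) vector, as described above.
def extract_block_vectors(img, h, w, step=None):
    """img: HxWx3 uint8 array; returns an (n_blocks, h*w*3) matrix."""
    step = step or h                                # non-overlapping by default
    H, W, _ = img.shape
    vectors = []
    for r in range(0, H - h + 1, step):
        for c in range(0, W - w + 1, step):
            block = img[r:r + h, c:c + w, :]        # (h, w, 3) mini image
            vectors.append(block.reshape(-1))       # flatten to h*w*3 values
    return np.array(vectors)

# e.g. a (3, 4) block gives 3*4 = 12 pixels -> 36-dimensional RGB vectors
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(60, 80, 3), dtype=np.uint8)  # stand-in photo
vecs = extract_block_vectors(img, 3, 4, step=3)
print(vecs.shape[1])  # 36
```

Replacing the random stand-in with a real photo array gives the same kind of vector matrix the notebook produces.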
Here is the code https://colab.research.google.com/drive/1-PX8IhXvmIYkpVdh6STJFNDG2xkrkGNI#scrollTo=WuUwI0M0GcQ_ that reads an image, creates the respective RGB vectors, calculates distances for a few random queries, and reports the k=4 nearest neighbors. If the data is too large or too “regular”, there is no need to read it all in; feel free to use the random skipping built into the code to read in less data. You can use the photos from the folder, or use them only as inspiration and instead find other examples (preferably your own) that help you understand how similarity search works. In your report, include the photo(s) you used as the source of data.
The second cell of the notebook contains examples of using existing libraries, as provided by ChatGPT. These also report some timings. Feel free to extend them.
EX1 (1P). Validate that the provided code does what it promises and that you can get it running (you can, of course, run it on your own computer). Hopefully it calculates the k-NN correctly; use k=10 for validation. Validate that the libraries return correct answers. Study the effect of the distance measure used: interpret the differences between the measures and when one or another should be preferred. Use at least two of the provided distances: Euclidean distance, Manhattan distance, cosine distance. In short: analyze and run the code, so that you can use it for the tasks below.
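For the distance comparison, the three measures named above can be written out in plain NumPy (assumed helper names, useful for cross-checking the library results), which also makes their differences concrete:

```python
import numpy as np

# The three distances mentioned in EX1, computed in plain NumPy so the
# library answers can be cross-checked against a simple reference.
def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def cosine_dist(a, b):
    # 1 - cosine similarity: ignores vector length and compares direction only,
    # so a pixel block and a uniformly brighter copy of it count as "close"
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([255.0, 0.0, 0.0])   # pure red pixel
b = np.array([128.0, 0.0, 0.0])   # darker red pixel
print(euclidean(a, b))    # 127.0
print(manhattan(a, b))    # 127.0
print(cosine_dist(a, b))  # 0.0 -- same direction, so cosine sees no difference
```

The last line illustrates the key interpretive difference: Euclidean and Manhattan react to brightness changes, while cosine distance only reacts to changes in hue/direction.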
EX2 (1P). Implement one (any) indexing method (explicit code), with two goals:
- Get it running correctly (at least in lower dimensions such as 3D, 6D, ...)
- Get it running faster than the full search; report whether and how you achieved this.
Note: you can use code generation tools to help you. But ultimately you are responsible for the produced code and how it works.
Report which method you chose for indexing and how it works.
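As one possible starting point (a sketch under assumed names, not the required solution), a simple grid/bucket index quantizes each coordinate into cells and scans only the cells near the query. Note that the 3^d neighbor-cell scan only pays off in low dimensions, which matches the first goal above:

```python
import numpy as np
from collections import defaultdict

# Sketch of one indexing method: a grid/bucket index. Each vector is hashed
# to a cell by integer-dividing its coordinates; a query scans only its own
# cell and the adjacent cells instead of the full dataset.
class GridIndex:
    def __init__(self, data, cell=64.0):
        self.data = np.asarray(data)
        self.cell = cell
        self.buckets = defaultdict(list)
        for i, v in enumerate(self.data):
            self.buckets[tuple((v // cell).astype(int))].append(i)

    def query(self, q, k=1):
        key = (q // self.cell).astype(int)
        # enumerate the query's cell plus all immediate neighbor cells (3^d)
        offsets = np.array(np.meshgrid(*([[-1, 0, 1]] * len(key)))).T.reshape(-1, len(key))
        cand = [i for off in offsets for i in self.buckets.get(tuple(key + off), [])]
        if not cand:                      # fall back to a full scan if cells are empty
            cand = range(len(self.data))
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return cand[np.argsort(d)[:k]]    # exact distances on the candidates only

rng = np.random.default_rng(1)
pts = rng.uniform(0, 256, size=(1000, 3))   # e.g. 1000 RGB pixel vectors
idx = GridIndex(pts)
print(idx.query(pts[0], k=1))  # index 0: the point is its own nearest neighbor
```

Speed comes from scanning a few buckets instead of all points; correctness should still be validated against the full search, since a true neighbor can in principle sit just outside the scanned cells.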
EX3 (1P). Explain how to implement a Random Projection tree (RP-tree) for the provided task. How large would it grow (how many nodes; estimate the space)? Describe what one node would look like and what should be stored in each tree node. Explain how the k-NN search would work for k=1, k=10, and k=100. What would the simplest implementation look like for you? Remember: there is an indexing phase, but also a usage phase supporting queries (which may not even be in the index).
Provide your own illustrations to explain how the RP-tree based search would work. You can also use the example shown in the lecture https://colab.research.google.com/drive/1CFJsdcQ9rQ3DSPpCF-B8TSJK6QmOOdnN as a basis. Feel free to use and modify it, if useful.
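To make the node structure concrete, here is a minimal RP-tree sketch (illustrative names and structure, not a reference implementation): each inner node stores a random unit direction and a median split threshold, and each leaf stores the indices of the points that landed in it:

```python
import numpy as np

# Minimal RP-tree sketch for EX3: inner nodes hold (direction, threshold),
# leaves hold point indices. Space is O(n) leaf entries plus O(n/leaf_size)
# inner nodes, each storing one d-dimensional direction vector.
class RPNode:
    def __init__(self):
        self.direction = None   # random projection vector (inner nodes only)
        self.threshold = None   # split value along that direction
        self.left = None
        self.right = None
        self.indices = None     # point ids (leaves only)

def build_rp_tree(data, indices, leaf_size=16, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    node = RPNode()
    if len(indices) <= leaf_size:
        node.indices = indices                 # leaf: store point ids only
        return node
    d = rng.normal(size=data.shape[1])
    d /= np.linalg.norm(d)                     # random unit direction
    proj = data[indices] @ d
    node.direction, node.threshold = d, np.median(proj)  # split at the median
    left = indices[proj <= node.threshold]
    right = indices[proj > node.threshold]
    if len(left) == 0 or len(right) == 0:      # degenerate split: stop here
        node.indices = indices
        return node
    node.left = build_rp_tree(data, left, leaf_size, rng)
    node.right = build_rp_tree(data, right, leaf_size, rng)
    return node

def query_leaf(node, q):
    # usage phase: any query vector (indexed or not) is routed to one leaf,
    # whose points become the candidates for an exact k-NN distance scan
    while node.indices is None:
        node = node.left if q @ node.direction <= node.threshold else node.right
    return node.indices

rng = np.random.default_rng(3)
data = rng.uniform(0, 256, size=(500, 9))      # e.g. 500 pixel-block vectors, 9D
root = build_rp_tree(data, np.arange(500))
candidates = query_leaf(root, data[0])
print(0 in candidates)  # True: an indexed point routes back to its own leaf
```

For larger k (k=10, k=100) a single leaf may not contain enough candidates, so a practical search would also visit neighboring leaves or build several trees and merge their candidate sets.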
EX4 (1P). Experiment with the code and methods: how large a dataset can be analyzed with these methods, and what is the effect of different indexing methods compared to no indexing? What are the effects of dimensionality and distance measures? Note that you can easily increase the volume and dimensionality of the data. Ask yourself 3-4 questions and provide answers to them. Report the chosen methods, the data sizes, and the speed of the methods. Argue which methods are your favorites, when, and why.
Note: you may also want to visualize the vectors (via PCA, or as line graphs, colors, mini-images, or rows). An additional point can be given for such extra effort.
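A timing harness for these experiments could look like the following sketch (names are illustrative; it assumes SciPy is available and uses its k-d tree as the example library index):

```python
import time
import numpy as np
from scipy.spatial import cKDTree   # assumed available; one example library index

# Sketch for EX4: time a brute-force scan against a k-d tree while varying
# data size and dimensionality.
def time_search(n, dim, n_queries=50, k=10, rng=None):
    if rng is None:
        rng = np.random.default_rng(2)
    data = rng.uniform(0, 256, size=(n, dim))
    queries = rng.uniform(0, 256, size=(n_queries, dim))

    t0 = time.perf_counter()
    for q in queries:                          # brute force: scan all distances
        order = np.argsort(np.linalg.norm(data - q, axis=1))[:k]
    brute = time.perf_counter() - t0

    t0 = time.perf_counter()
    tree = cKDTree(data)                       # index build time is included
    tree.query(queries, k=k)
    indexed = time.perf_counter() - t0
    return brute, indexed

for dim in (3, 36):                            # low vs high dimensionality
    b, i = time_search(20_000, dim)
    print(f"dim={dim}: brute {b:.3f}s, k-d tree (incl. build) {i:.3f}s")
```

Varying `n`, `dim`, and `n_queries` gives the data-size and dimensionality comparisons the task asks for; in high dimensions, expect the k-d tree's advantage to shrink.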
EX5 (1P). So far we have used image data merely as a source of experimental data. Discuss what applications and benefits such similarity searches would offer for image processing tasks. Identify some potential tasks and state which methods would be most useful. Think both in terms of single pixels and longer vectors across larger blocks/areas. How do the dimensionality and distance measure affect the expected results? Which other distance measures could be used, and what would their preferred use cases be?
EX6 (Bonus, 1P). Consider, instead of images, the vector embeddings used in modern language processing tasks and LLMs (word2vec, BERT, Transformers). Imagine a database of documents that serves as a source of “truth” for LLM-based chatbots: RAG, Retrieval-Augmented Generation. Based on your knowledge of k-NN search methods, estimate the requirements for search methods over high-dimensional data: how much document data, how many vectors, what dimensionalities, how fast the nearest-neighbor search should work, etc. Search the Internet for the use cases, data sizes, and efficiency and precision requirements for k-NN-like searches in higher-dimensional spaces.
Hint: You can use this link: https://medium.com/@bijit211987/rag-vs-vectordb-2c8cb3e0ee52