Discovering the potential of vector databases with LLMs: Unraveling approximate nearest neighbors, the technical path made easy!

Phanindra Reddy Madduru
3 min read · Jul 26, 2023


Hey there, data adventurers! 👋 Welcome to the thrilling world of vector databases and Large Language Models (LLMs) — where we discover the magic of finding friends in data land! 🌌

So, what’s the deal with vector databases? Well, they’re like treasure chests for recommender engines. Picture this: we want to recommend stuff to users, right? 🤔 So, we learn these cool vector representations of users and the things we want to recommend. These vectors are like secret codes that help us find similar items using a super speedy “approximate nearest neighbor search”! ✨
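Before we get to the "approximate" part, here's a tiny sketch of the exact version of the problem: a brute-force nearest-neighbor search in NumPy, with a made-up catalogue of random vectors (the sizes and data are purely illustrative). It compares the query against every single item, which is exactly what stops scaling:

```python
import numpy as np

# Hypothetical setup: 100k item vectors of dimension 128
rng = np.random.default_rng(0)
items = rng.normal(size=(100_000, 128)).astype(np.float32)
query = rng.normal(size=128).astype(np.float32)

# Exact (brute-force) search: squared Euclidean distance to every item
dists = ((items - query) ** 2).sum(axis=1)
top5 = np.argsort(dists)[:5]
print("closest item ids:", top5)
```

It works, but every query touches every vector, which is why the approximate tricks below exist.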

Now, here’s where LLMs step into the spotlight. With their arrival, it became a breeze to transform text documents into vector representations. These vectors capture the true meaning of the text, making it simpler to find documents that are like soulmates — I mean, semantically similar! ❤️
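To make that concrete, here's a hedged little sketch of turning text into vectors and comparing them. It assumes the sentence-transformers package with the all-MiniLM-L6-v2 model; any embedding model would play the same role:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: sentence-transformers is installed and the model can be downloaded
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Vector databases store embeddings for fast similarity search.",
    "My cat enjoys sleeping in cardboard boxes.",
]
query = "How do I search embeddings quickly?"

doc_vecs = model.encode(docs)      # shape: (2, 384)
query_vec = model.encode(query)    # shape: (384,)

# Cosine similarity: semantically related texts score higher
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(sims)  # the first document should score noticeably higher
```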

But wait, here’s the funny part — sometimes, being exactly precise isn’t crucial. That’s where Product Quantization (PQ) swoops in with its quirky charm! PQ lets us represent vectors approximately by clustering their pieces together. Instead of keeping track of every full vector, we index the centroids of these lively clusters. It’s like having a wild dance party and just remembering the center of all the action! 🎉

So, how do we make this happen? We start by dividing each vector into smaller sub-vectors, and then we throw a K-means party within each of these partitions. We’re not indexing the vectors themselves — oh no! For each sub-vector, we index the nearest cluster’s heart and soul — its centroid! 😍
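Here's a minimal sketch of that training-and-encoding step. It is not any particular library's implementation, just NumPy plus scikit-learn's KMeans, with illustrative sizes (8 partitions, 256 centroids each):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors: np.ndarray, n_partitions: int = 8, n_clusters: int = 256):
    """Split each vector into sub-vectors and run K-means within each partition."""
    n, d = vectors.shape
    sub_dim = d // n_partitions
    codebooks = []                                        # centroids, one set per partition
    codes = np.empty((n, n_partitions), dtype=np.int32)   # centroid IDs per vector
    for p in range(n_partitions):
        sub = vectors[:, p * sub_dim:(p + 1) * sub_dim]
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(sub)
        codebooks.append(km.cluster_centers_)
        codes[:, p] = km.labels_                          # index the centroid, not the vector
    return codebooks, codes

# Hypothetical data: 10k vectors of dimension 128
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 128)).astype(np.float32)
codebooks, codes = train_pq(vectors)
print(codes.shape)  # (10000, 8): each vector is now just 8 small centroid IDs
```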

Imagine having two clusters per partition and six vectors to party with — instead of storing six sub-vectors in each partition, we only keep two centroids, which is like squeezing three times the fun into the same space! And since the number of centroids stays fixed while the crowd grows, more vectors mean even more compression magic! 🌈 Each vector now just points to one cluster per partition and its centroid, like little arrows pointing to where the party’s at!
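As a rough back-of-the-envelope check (the numbers are made up, and real systems also store vector IDs and metadata), here's the kind of memory arithmetic that compression claim rests on:

```python
# Rough memory arithmetic for PQ compression (illustrative numbers only)
n_vectors = 1_000_000      # database size
d = 128                    # original dimension, stored as float32
n_partitions = 8           # sub-vectors per vector
n_clusters = 256           # centroids per partition -> one byte per code

original_bytes = n_vectors * d * 4                       # full float32 vectors
code_bytes = n_vectors * n_partitions * 1                # one 8-bit code per partition
codebook_bytes = n_partitions * n_clusters * (d // n_partitions) * 4

print(original_bytes / (code_bytes + codebook_bytes))    # roughly 60x smaller
```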

Now, let’s say we’re on the hunt for the nearest neighbors of a query vector. For each partition, we measure the squared Euclidean distance from the query’s sub-vector to every centroid, sum those distances along each database vector’s codes, and voilà! We return the vectors with the lowest summed squared Euclidean distances. It’s like finding your long-lost cousins — you just know when you’ve found the right ones! 👨‍👩‍👧‍👦
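Here's a minimal, self-contained sketch of that lookup step (often called asymmetric distance computation). Random codebooks and codes stand in for a trained index, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
n_vectors, n_partitions, n_clusters, sub_dim = 10_000, 8, 256, 16

# Stand-ins for a trained PQ index: centroids per partition + each vector's codes
codebooks = rng.normal(size=(n_partitions, n_clusters, sub_dim)).astype(np.float32)
codes = rng.integers(0, n_clusters, size=(n_vectors, n_partitions))

query = rng.normal(size=n_partitions * sub_dim).astype(np.float32)

# 1) Per partition: squared distance from the query's sub-vector to every centroid
tables = np.stack([
    ((codebooks[p] - query[p * sub_dim:(p + 1) * sub_dim]) ** 2).sum(axis=1)
    for p in range(n_partitions)
])  # shape: (n_partitions, n_clusters)

# 2) Each database vector's distance is the sum of its centroids' table entries
approx_dists = tables[np.arange(n_partitions), codes].sum(axis=1)
top5 = np.argsort(approx_dists)[:5]
print("approximate nearest neighbors:", top5)
```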

Best of all, we don’t have to visit every single vector in town — that’d take forever! Instead, we just dance our way through the clusters’ centroids. But hey, there’s a little trade-off between speed and accuracy. The more clusters we have, the more accurate the results, but it might take a bit longer to find our dance partners. So, it’s all about striking the perfect balance! ⚖️

Now, don’t get me wrong — this is still a bit of a brute force affair, since we still sweep through every vector’s codes and the number of clusters influences the algorithm’s performance. But, fear not! We can team up PQ with other algorithms, like an inverted file index (IVF) or a graph index such as HNSW, for super-duper speedy retrieval! 💨
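For instance, here's a hedged sketch of that pairing with the faiss library (assuming faiss-cpu is installed; the data and parameters are made up), which puts an inverted file index in front of the PQ codes:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n = 128, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype(np.float32)   # database vectors
xq = rng.normal(size=(5, d)).astype(np.float32)   # query vectors

nlist, m, nbits = 1024, 8, 8                      # IVF cells, PQ partitions, bits per code
quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for the IVF step
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                                   # learn IVF centroids + PQ codebooks
index.add(xb)                                     # encode and store all vectors
index.nprobe = 16                                 # how many IVF cells to visit per query

distances, ids = index.search(xq, 5)              # top-5 approximate neighbors per query
print(ids)
```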

So, my fellow data detectives, there you have it — the fantastical world of vector databases and LLMs! It’s like solving mysteries in a land filled with vectors and clusters, and we’re the fearless explorers on this thrilling adventure! 🕵️‍♂️🗺️

Now go forth and find your data friends with the power of approximate neighbors — because in data land, there’s always a party waiting to happen! 🎊💻
