Hierarchical Navigable Small Worlds Algorithm (HNSW)
Introduction
The Hierarchical Navigable Small Worlds (HNSW) algorithm is a vector similarity search algorithm that finds the (approximate) nearest neighbors of a given query vector. It is primarily based on graph structures and graph algorithms, and it was proposed as a better solution than classical approaches such as K Nearest Neighbors (KNN).
What are vectors?
The traditional vector definition of "physical quantities with both magnitude and direction" isn't very useful from a computer science point of view. Vectors in CS are essentially matrices (column matrices, to be more specific). The elements of the matrix represent the magnitude of the vector along each axis of an n-dimensional space, and the sign of each element tells us the direction in which the vector points along that axis. Matrices, in turn, can perform all kinds of operations on vectors, such as scaling them, rotating them, or adding them to other vectors. But in this blog, we aim to learn what we mean by vector similarity.
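As a quick illustration of these operations, here is a minimal sketch using NumPy (the vectors and the 90-degree rotation are made-up values for the example):

```python
import numpy as np

v = np.array([3.0, 4.0])          # a 2-D vector, i.e. a 2x1 column matrix
w = np.array([1.0, -2.0])

scaled = 2.0 * v                   # scaling: stretches the vector -> [6, 8]
summed = v + w                     # vector addition -> [4, 2]

theta = np.pi / 2                  # rotate by 90 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = rotation @ v             # matrix-vector product -> [-4, 3]

print(scaled, summed, rotated)
```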
Vector Similarity
This notion of "similarity" might be confusing to many of you. What does similarity mean, and why do we need it?
The way we define similarity is not always straightforward. But let us take a look at a brief example to understand the motivation behind why we are defining vector similarity.
In the example above, since it's a colored image, each pixel is a vector of 3 values: its red, green and blue components. The two parts highlighted in the image have roughly the same "shade" and look quite similar to our eyes, so we can already build a rough notion of similarity from here. Looking at the vectors for these two pixels, we see that their three entries are roughly equal. This gives a good idea of what it means for two vectors to be similar and how that translates into the real world. In the real world, however, it is not so simple, as equality is not the sole measure of similarity. Some of the other metrics used are:
- Cosine Similarity
- Euclidean distance
- Manhattan Distance
- Minkowski Distance
- Jaccard Coefficient
...and many more.
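To make the first three of these metrics concrete, here is a minimal sketch using NumPy (the two vectors are made-up values standing in for the pixel example above):

```python
import numpy as np

a = np.array([0.9, 0.1, 0.3])   # an RGB-like pixel vector (made-up values)
b = np.array([0.8, 0.2, 0.3])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)
manhattan_distance = np.sum(np.abs(a - b))

print(cosine_similarity, euclidean_distance, manhattan_distance)
```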
Where Can Similarity be Used?
Now, let us have a look at the types of datasets in which the concept of similarity is typically used.
Vector similarity is typically used on datasets that involve textual data, such as news articles, scientific papers, customer reviews, social media posts, and so on. For example:
In NLP, vector similarity is often used in tasks such as document classification, sentiment analysis, information retrieval, and question answering. For example, in document classification, vector similarity can be used to identify the most relevant category for a given document based on the similarity between the document's vector representation and the vector representation of each category. In sentiment analysis, vector similarity can be used to compare the sentiment of two different texts or to identify the most similar text in a dataset to a given query.
Vector similarity can also be used in image and audio processing tasks. In image processing, vector similarity can be used to compare how alike two images are, while in audio processing it can be used to compare two audio signals. But all of this depends on which technique we use to measure the similarity.
Now let us get our hands dirty with the intuition and the actual math behind HNSW, and see why we realised that we needed an algorithm like it.
Motivation for HNSW
Before HNSW, the main techniques used in the industry were KNN and A-KNN (Approximate KNN). The main problem with exact KNN is that it does not scale when the dataset is huge or when the dimensionality of the vectors is large, because every query has to be compared against every stored vector. We can speed the process up a little using PCA, but the results are not extremely impressive. To improve on this, Approximate KNN was introduced, which tolerates a small amount of error: a returned point p is accepted if d(q, p) ≤ (1 + ε) · d(q, p*), where q is the query, p* is the true nearest neighbor and ε ≥ 0 is the allowed error.
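For reference, this is what exact, brute-force KNN looks like; every query scans the whole dataset, which is exactly the cost that approximate methods try to avoid (a minimal sketch with NumPy; the random dataset and its dimensions are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 128))     # 10k vectors of dimension 128 (made up)
query = rng.normal(size=128)

def brute_force_knn(data, query, k=5):
    # Compute the Euclidean distance from the query to every vector: O(n * d).
    distances = np.linalg.norm(data - query, axis=1)
    # Return the indices of the k smallest distances.
    return np.argsort(distances)[:k]

print(brute_force_knn(data, query))
```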
To bypass all these problems, the HNSW algorithm can be used. It is based on an incrementally built, layered graph structure with skip connections, and, like A-KNN, it tolerates a small amount of error.
In the vast majority of studied graph algorithms, searching takes the form of greedy routing in k-NN graphs. We start the search at some entry point (it can be random or supplied by a separate algorithm) and iteratively traverse the graph. At each step of the traversal, the algorithm examines the distances from the query vector to the neighbors of the current base node and then selects as the next base node the adjacent node that minimizes the distance, while constantly keeping track of the best-discovered neighbors (a minimal sketch of this routing loop is given after the list below). The search terminates when some stopping condition is met. The problems with this approach are:
- The number of steps required might be extremely large.
- If the graph has low global connectivity, this method leads to poor results.
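Here is the minimal sketch of that greedy routing loop promised above, run over a naively built k-NN graph (the O(n^2) graph construction and the random data are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(500, 16))              # made-up dataset

def build_knn_graph(vectors, k=8):
    """Naively connect every node to its k nearest neighbors (O(n^2), illustration only)."""
    graph = {}
    for i, v in enumerate(vectors):
        dists = np.linalg.norm(vectors - v, axis=1)
        graph[i] = list(np.argsort(dists)[1:k + 1])   # skip the node itself
    return graph

def greedy_search(graph, vectors, query, entry_point):
    """Greedy routing: hop to whichever neighbor is closest to the query,
    and stop when no neighbor improves on the current node (a local minimum)."""
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:          # no neighbor is closer: stop
            return current, current_dist
        current, current_dist = best, best_dist

graph = build_knn_graph(vectors)
query = rng.normal(size=16)
print(greedy_search(graph, vectors, query, entry_point=0))
```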
To overcome these problems, researchers came up with navigable graphs, in which the number of hops during routing scales logarithmically or polylogarithmically with the network size. The main problem with this method is that:
- Since the scaling is only logarithmic/polylogarithmic, the performance is not up to the mark on small datasets.
P.S. These types of networks are called navigable small world graphs (NSW).
Following the problems associated with these algorithms, other solutions were proposed, such as Kleinberg's model and scale-free models, but each came with its own problems.
The Thinking Behind HNSW
Routing in such graphs has mainly two phases: zoom-out and zoom-in. If we start from a low-degree vertex and traverse the graph while the degree of the visited nodes gradually increases, until the length of a node's links roughly reaches the scale of the distance to the query, we are in the zoom-out phase. The problem is that we don't know when the node's degree should increase, so if this is not handled properly the search might get stuck in a false local minimum.
If we start from a high-degree vertex instead, we directly enter the zoom-in phase of the search. Research shows that using high-degree nodes as starting points is usually significantly better. However, this method does not scale well, as it has a polylogarithmic time complexity.
So to finally set everything straight, researchers invented the algorithm we have been building up to: HNSW. But before we get there, I need to introduce one more concept that will help in understanding it: the skip list.
Skip List
Let us say we have a sorted linked list. To search for a particular node, we would normally have to traverse the entire list element by element, which in the worst case takes O(n) time. A skip list adds extra, sparser levels of links on top of the list, so that the search can skip over many nodes at once and the process becomes much faster. Look at the figure below for a better understanding.
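A minimal sketch of the idea in Python (the random promotion of keys with probability 0.5 is an assumption of this example, not something specific to HNSW):

```python
import random

def build_skip_levels(sorted_keys, p=0.5, max_levels=4):
    """Level 0 holds every key; each higher level keeps roughly a fraction p
    of the keys from the level below, acting as an 'express lane'."""
    levels = [list(sorted_keys)]
    while len(levels) < max_levels:
        promoted = [k for k in levels[-1] if random.random() < p]
        if not promoted:
            break
        levels.append(promoted)
    return levels                      # levels[-1] is the sparsest level

def skip_search(levels, target):
    """Scan the sparsest level first, then drop down a level and resume from
    the last key reached, so large stretches of level 0 are never visited."""
    candidate = None                   # largest key <= target found so far
    for level in reversed(levels):
        i = level.index(candidate) if candidate is not None else 0
        while i < len(level) and level[i] <= target:
            candidate = level[i]
            i += 1
    return candidate                   # None means target is smaller than every key

levels = build_skip_levels(range(0, 100, 3))
print(skip_search(levels, 50))         # -> 48, the largest stored key <= 50
```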
The Algorithm
HNSW makes the search faster through the following process:
- The search starts from the uppermost layer, which only has the longest links.
- The algorithm greedily traverses the elements of that layer until a local minimum is reached.
- The search then switches to the layer below (which has shorter links), restarts from the element that was the local minimum in the previous layer, and the process repeats.
The maximum number of connections per element in all layers can be made constant, thus allowing a logarithmic complexity scaling of routing in a navigable small-world network.
For every element, we select an integer level l that defines the maximum layer to which the element belongs. If we give l an exponentially decaying probability distribution, we get a logarithmic scaling of the expected number of layers in the structure.
The search procedure is an iterative greedy search starting from the top layer and finishing at the zeroth layer.
The figure below illustrates the algorithm visually.
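Putting the steps above into code, here is a minimal sketch of the layered descent; the toy layer graphs and vectors are made up, and the per-layer routine is the same greedy search shown earlier:

```python
import numpy as np

def greedy_search(layer, vectors, query, entry_point):
    """Greedy routing inside one layer: move to the closest neighbor until stuck."""
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for neighbor in layer.get(current, []):
            d = np.linalg.norm(vectors[neighbor] - query)
            if d < current_dist:
                current, current_dist, improved = neighbor, d, True
    return current

def hnsw_style_search(layers, vectors, query, entry_point):
    """Descend from the sparsest (top) layer to layer 0, reusing each layer's
    local minimum as the entry point for the layer below."""
    for layer in layers:               # layers[0] is the top, layers[-1] is layer 0
        entry_point = greedy_search(layer, vectors, query, entry_point)
    return entry_point

# Toy example: 6 points on a line; the top layer links only far-apart nodes,
# the bottom layer links every consecutive pair (made-up structure).
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
layers = [
    {0: [3], 3: [0, 5], 5: [3]},                                   # top layer: long links
    {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]},  # layer 0: short links
]
print(hnsw_style_search(layers, vectors, np.array([3.6]), entry_point=0))  # -> 4
```

In the full algorithm each layer keeps a dynamic list of the best candidates (of size efSearch, discussed below) rather than a single best node, but the descent itself works exactly like this.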
NOTE: You might be wondering how should we select the nodes which we must insert in each layer. In the HNSW algorithm, the construction of the hierarchy is done in a sequential manner, where each element is inserted into the structure based on its similarity to existing elements. This can lead to a bias towards certain regions of the data space if the elements are inserted in a fixed order. To overcome this bias, the HNSW algorithm employs level randomization. Level randomization involves randomly selecting a starting level for each element during the insertion process. This starting level determines the level at which the element will be inserted into the structure. By introducing randomness in the starting level, the algorithm ensures that elements are inserted into different parts of the structure, leading to a more balanced representation of the data space.
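In the original paper this random level is drawn as l = floor(-ln(unif(0,1)) * mL), which gives exactly the exponentially decaying distribution described above. A minimal sketch (the choice mL = 1/ln(16) is an assumption made for the example):

```python
import math
import random

def sample_level(mL):
    # Exponentially decaying probability: most elements stay in layer 0,
    # exponentially fewer reach each higher layer.
    return int(-math.log(1.0 - random.random()) * mL)

mL = 1 / math.log(16)          # a common normalization choice (assumption for this sketch)
for n in (10_000, 100_000, 1_000_000):
    top = max(sample_level(mL) for _ in range(n))
    print(f"n = {n:>9,}  highest layer observed: {top}")   # grows roughly like log(n)
```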
Parameters of the Algorithm
The entire process consists of two operations, i.e. insertion and searching. These are controlled by the parameters efConstruction and efSearch, where ef stands for "exploratory factor". These two factors determine the trade-off between the construction time of the data structure and the searching time.
- efSearch: determines the number of nodes that the search algorithm will explore in the hierarchical graph when searching for nearest neighbors. A higher value of efSearch will lead to a more accurate search, but at the cost of increased query time.
- efConstruction: determines the number of nodes that the construction algorithm will explore during the construction of the HNSW graph. A higher value of efConstruction will result in a more accurate graph, but at the cost of increased construction time. Ideally, its value should be large enough to bring the K-ANN recall as close to 1 as possible (in reality, the recall flattens out after a certain point).
- mL: the maximum number of edges a vector will get in each layer of the graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be. (Implementations such as hnswlib expose this maximum-connections parameter as M.)
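These parameters map directly onto the knobs exposed by off-the-shelf HNSW libraries. As an illustration, here is a minimal sketch using the hnswlib Python package (assuming it is installed; hnswlib exposes the maximum-connections parameter as M, the construction factor as ef_construction, and the search factor via set_ef):

```python
import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
data = np.float32(np.random.random((num_elements, dim)))   # made-up dataset

# Build the index: ef_construction and M control graph quality vs. build cost.
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data)

# efSearch (called ef here) controls the accuracy/latency trade-off at query time.
index.set_ef(50)
labels, distances = index.knn_query(data[:5], k=3)
print(labels)
```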
Complexity of Construction
Inserting an element proceeds as a sequence of K-ANN searches, one per layer, and the number of layers an element is added to is, on average, a constant that depends on mL.
On average, this translates into a construction complexity of O(n log n).
Complexity of Searching
The probability that the next node belongs to the layer is p = exp(-mL), and the probability that a node cannot be reached within s steps is upper bounded by exp(-s·mL). The overall complexity of the search process is therefore O(log n).
Conclusion
Advantages
To sum up, HNSW is a very powerful algorithm with the following advantages:
- Fast query time
- Very scalable for high-dimensional vectors
- It beats the state-of-the-art performance of most algorithms on many benchmarks, such as the SIFT learn dataset and the benchmarks from the Non-Metric Space Library (nmslib).
Disadvantages
- It can consume a huge amount of memory, especially for large datasets, since it needs to store every node along with its links.
- As the dataset changes, the HNSW index may need to be rebuilt to maintain its accuracy. This can be a computationally expensive process, especially for large datasets.
Thanks for reading. :)
I hope you guys had an amazing read. Keep Learning !!