KGraph is a library for k-nearest neighbor (k-NN) graph construction and online k-NN search using a k-NN Graph as index. KGraph implements heuristic algorithms that are extremely generic and fast:
For best generality, the C++ API should be used. A python wrapper is provided under the module name pykgraph, which supports Euclidean and Angular distances on rows of NumPy matrices.
KGraph depends on a recent version of GCC with C++11 support, cmake and the Boost library. The package can be built and installed with ```sh cmake -DCMAKE_BUILD_TYPE=release . make sudo make install ```
A Makefile.plain is also provided in case cmake is not available.
The Python API can be installed with ``` python setup.py install ```
```python from numpy import random import pykgraph
dataset = random.rand(1000000, 16) query = random.rand(1000, 16)
index = pykgraph.KGraph(dataset, 'euclidean') index.build(reverse=-1) # index.save("index_file");
knn = index.search(query, K=10) # this uses all CPU threads knn = index.search(query, K=10, threads=1) # one thread, slower knn = index.search(query, K=1000, P=100) # search for 1000-nn, no need to recompute index. ```
Both index.build and index.search supports a number of optional keywords arguments to fine tune the performance. The default values should work reasonably well for many datasets. One exception is that reverse=-1 should be added if the purpose of building index is to speedup search, which is the typical case, rather than to obtain the k-NN graph itself.
Two precautions should be taken:
For performance considerations, the Python API does not support user-defined similarity function, as the callback function is invoked in such a high frequency that, if written in Python, speedup will inevitably be brought down. For the full generality, the C++ API should be used.
The KGraph C++ API is based on two central concepts: the index oracle and the search oracle. (Oracle is a fancy way of calling a user-defined callback function that behaves like a black box.) KGraph works solely with object IDs from 0 to N-1, and relies on the oracles to map the IDs to actual data objects and computes the similarity. To use KGraph, the user has to extend the following two abstract classes
```cpp class IndexOracle { public: // returns size of dataset virtual unsigned size () const = 0; // computes similarity of object i and j virtual float operator () (unsigned i, unsigned j) const = 0; };
class SearchOracle { public: /// Returns the size of the dataset. virtual unsigned size () const = 0; /// Computes similarity of query and object i. virtual float operator () (unsigned i) const = 0; }; ```
With the oracle classes defined, index construction and online search become straightfoward:
```cpp
KGraph *index = KGraph::create();
if (need_to_create_new_index) { MyIndexOracle oracle(...); // subclass of kgraph::IndexOracle KGraph::IndexParams params; params.reverse = -1; index->build(oracle, params); index->save("some_path"); } else { index->load("some_path"); }
MySearchOracle oracle(...); // subclass of kgraph::SearchOracle
KGraph::SearchParams params; params.K = K; vector<unsigned> knn(K); // to save K-NN ids. index->search(oracle, params, &knn[0]);
delete index; ``` Doxygen documentation