In Semantic search with embeddings, I described how to build semantic search systems (also called neural search). These systems are increasingly used as indexing techniques improve and representation learning gets better every year with new deep learning papers. The Medium post explains how to build them; this list is meant to reference the interesting resources on the topic so that anyone can quickly start building such systems.
Tutorials explain in depth how to build semantic search systems
- Semantic search with embeddings: end-to-end explanation of how to build semantic search pipelines
- Google Cloud embedding similarity system: use Google Cloud to build an embedding similarity system
- CVPR 2020 tutorial on image retrieval: end-to-end, in-depth tutorial focusing on images
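
As a rough illustration of what these tutorials build, here is a minimal sketch of the encode, index, query loop. It assumes a toy hashed bag-of-words encoder (`toy_encode`, a hypothetical stand-in for a real pretrained model) and an exact brute-force scan instead of an approximate index; none of these names come from the tutorials themselves.

```python
import math
import zlib

def toy_encode(text, dim=256):
    """Hypothetical stand-in for a pretrained encoder: hashed bag of words,
    L2-normalized so that a dot product equals cosine similarity."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is used because it is stable across runs (unlike hash()).
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def search(query, documents, embeddings, k=2):
    """Exact (brute-force) kNN: score every document against the query."""
    q = toy_encode(query)
    scores = [sum(a * b for a, b in zip(q, e)) for e in embeddings]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in ranked[:k]]

documents = [
    "a photo of a cat sleeping on a sofa",
    "stock market prices fell sharply today",
    "a dog playing fetch in the park",
]
# "Indexing" here is just storing one embedding per document.
embeddings = [toy_encode(d) for d in documents]
results = search("cat on a sofa", documents, embeddings, k=1)
```

In a real system the toy encoder would be replaced by one of the pretrained encoders listed below, and the brute-force scan by one of the approximate knn indices.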
Good datasets to build semantic search systems
- TensorFlow Datasets: building search systems only requires images or text, so many TF datasets are interesting in that regard
- Torchvision datasets: the datasets provided for vision are also interesting for this
Pretrained encoders make it possible to quickly build a new system without training
- Vision+Language
- CLIP encodes images and text in the same space
- Image
- Efficientnet b0 is a simple way to encode images
- DINO is an encoder trained with self-supervision that reaches high kNN classification performance
- Face embeddings compute face embeddings
- Text
- LaBSE is a BERT text encoder trained for similarity that puts sentences from 109 languages in the same space
- Misc
- Jina examples provide examples of how to use pretrained encoders to build search systems
- Vectorhub image, text, audio encoders
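
The "kNN classification performance" mentioned for DINO refers to classifying a new item by the labels of its nearest neighbors in embedding space, with no extra training. A minimal sketch, assuming the embeddings have already been produced by some pretrained encoder (the 3-dimensional vectors and labels below are made up for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_classify(query, train_embeddings, train_labels, k=3):
    """Majority vote among the k training embeddings most similar to the query."""
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine(query, train_embeddings[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up embeddings standing in for the output of a pretrained encoder.
train_embeddings = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.05, 0.1],
    [0.1, 0.9, 0.2], [0.0, 0.8, 0.3], [0.2, 0.95, 0.1],
]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_classify([0.7, 0.3, 0.0], train_embeddings, train_labels)
```

The better the encoder, the more often neighbors share a label, which is why this score is a common proxy for embedding quality.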
Similarity learning allows you to build new similarity encoders
- Fine tuning classification with keras enables adapting an existing image encoder to a custom dataset
- Fine tuning classification with hugging face makes it possible to adapt existing text encoders
- Lightly is a simple way to train image encoders with self supervision
- PyTorch BigGraph is a library to encode a graph as node and link embeddings
- RSVD is a Spark library to compute large-scale SVD
- GrokNet uses images, categories and many datasets to fine-tune product embeddings with many losses
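
To give an idea of what a similarity loss looks like, here is a sketch of the triplet margin loss, one common choice among many. The resources above do not necessarily use this exact loss; the function names, the margin value and the 2-dimensional vectors are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the positive is closer to the anchor than the negative
    by at least `margin`; grows linearly with the violation otherwise."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # should end up close to the anchor
negative = [0.0, 1.0]   # should end up far from the anchor

well_separated = triplet_loss(anchor, positive, negative)
violating = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```

Training an encoder then amounts to minimizing this quantity over many (anchor, positive, negative) triplets with a deep learning framework.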
Indexing and approximate knn: indexing makes it possible to create small indices encoding millions of embeddings that can be used to query the data in milliseconds
- Faiss: many aknn algorithms (ivf, hnsw, flat, gpu, …) in C++ with a Python interface
- Autofaiss: to use Faiss easily
- Nmslib: fast implementation of hnsw
- Annoy: an aknn algorithm by Spotify
- Scann: an aknn algorithm faster than hnsw, by Google
- Catalyzer: training the quantizer with backpropagation
- Hora: approximate knn implemented in Rust
- Search pipelines allow fast serving and customization of how the indices are queried
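
To show why an index can answer queries faster than a full scan, here is a toy sketch of the inverted file (ivf) idea in plain Python. It is not how Faiss or any library above is implemented: the centroids are simply sampled data points rather than trained with k-means, and the names `build_ivf`, `search_ivf`, `n_lists` and `nprobe` are chosen for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, points, n=1):
    """Indices of the n points closest to vec (exact scan)."""
    return sorted(range(len(points)), key=lambda i: sq_dist(vec, points[i]))[:n]

def build_ivf(vectors, n_lists, rng):
    """Assign each vector to its closest centroid: one inverted list per centroid.
    Simplification: centroids are sampled data points, not k-means centroids."""
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        lists[nearest(vec, centroids)[0]].append(idx)
    return centroids, lists

def search_ivf(query, vectors, centroids, lists, k=5, nprobe=2):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    candidates = [i for c in nearest(query, centroids, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids, lists = build_ivf(vectors, n_lists=20, rng=rng)

query = [rng.gauss(0, 1) for _ in range(8)]
approx = search_ivf(query, vectors, centroids, lists, k=5, nprobe=4)
exact = nearest(query, vectors, n=5)
# Fraction of the true nearest neighbors recovered by the approximate search.
recall = len(set(approx) & set(exact)) / len(exact)
# Probing every list degenerates into an exact search.
exhaustive = search_ivf(query, vectors, centroids, lists, k=5, nprobe=20)
```

Raising `nprobe` trades speed for recall; real libraries add vector compression on top to keep the index small.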
Companies: many companies are being built around semantic search systems
- Jina is building flexible pipelines to encode and search with embeddings
- Weaviate is building a cloud-native vector search engine
- Pinecone a startup building databases indexing embeddings
- Vector ai is building an encoder hub
- Milvus builds an end to end open source semantic search system
- FeatureForm's embeddinghub combining DB and KNN
- vespa knn-based managed retrieval engine
- Many other companies are using these systems and releasing open tools along the way; the list would be too long to include here (for example Facebook with Faiss and self-supervision, Google with Scann and thousands of papers, Microsoft with SPTAG, Spotify with Annoy, Criteo with RSVD, deepr, Autofaiss, …)