


awesome-semantic-search

In Semantic search with embeddings, I described how to build semantic search systems (also called neural search). These systems are used more and more, as indexing techniques improve and representation learning gets better every year with new deep learning papers. The Medium post explains how to build them, and this list references interesting resources on the topic so that anyone can quickly start building such systems.
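At its core, such a system encodes documents into vectors, encodes the query the same way, and ranks by cosine similarity. Here is a toy sketch of that loop in pure Python; the `toy_encoder` below is a hypothetical hash-based stand-in (a real system would use a pretrained encoder such as CLIP or LaBSE), and brute-force scanning replaces a real index:

```python
import math
import random

def toy_encoder(text, dim=64):
    """Hypothetical stand-in for a pretrained encoder: each word gets a
    deterministic pseudo-random vector (seeded by the word itself), so
    texts sharing words end up close in the embedding space."""
    vec = [0.0] * dim
    for word in text.lower().split():
        rng = random.Random(word)  # same word -> same component vector
        for i in range(dim):
            vec[i] += rng.uniform(-1, 1)
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalize

def cosine(a, b):
    # vectors are already unit-norm, so the dot product is the cosine
    return sum(x * y for x, y in zip(a, b))

docs = ["the cat sat on the mat", "deep learning for vision", "a cat and a dog"]
index = [(d, toy_encoder(d)) for d in docs]  # brute-force "index"

def search(query, k=2):
    q = toy_encoder(query)
    ranked = sorted(index, key=lambda de: -cosine(q, de[1]))
    return [d for d, _ in ranked[:k]]
```

For example, `search("cat")` ranks the two cat-related documents above the vision one. The resources below cover each piece of this loop done properly: real encoders, real indices, and full pipelines.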


  • Tutorials explain in depth how to build semantic search systems
  • Good datasets to build semantic search systems
    • Tensorflow datasets building search systems only requires images or text, so many tf datasets are interesting in that regard
    • Torchvision datasets the datasets provided for vision are also interesting for this
  • Pretrained encoders make it possible to quickly build a new system without training
    • Vision+Language
      • Clip encodes images and text in the same space
    • Image
      • Efficientnet b0 is a simple way to encode images
      • Dino is an encoder trained with self-supervision that reaches high knn classification performance
      • Face embeddings compute face embeddings
    • Text
      • Labse a bert text encoder trained for similarity that puts sentences from 109 languages in the same space
    • Misc
      • Jina examples provide examples of how to use pretrained encoders to build search systems
      • Vectorhub image, text, audio encoders
  • Similarity learning allows you to build new similarity encoders
  • Indexing and approximate knn: indexing makes it possible to create small indices encoding millions of embeddings, which can be queried in milliseconds
    • Faiss Many aknn algorithms (ivf, hnsw, flat, gpu, …) in c++ with a python interface
    • Autofaiss to use faiss easily
    • Nmslib fast implementation of hnsw
    • Annoy an aknn algorithm by spotify
    • Scann an aknn algorithm by google, faster than hnsw
    • Catalyzer training the quantizer with backpropagation
    • hora approximate knn implemented in rust
  • Search pipelines allow fast serving and customization of how the indices are queried
    • Milvus end to end similarity engine, on top of faiss and hnswlib
    • Jina flexible end to end similarity engine
    • Haystack question answering on text pipeline
  • Companies: many companies are being built around semantic search systems
    • Jina is building flexible pipelines to encode and search with embeddings
    • Weaviate is building a cloud-native vector search engine
    • Pinecone a startup building databases indexing embeddings
    • Vector ai is building an encoder hub
    • Milvus builds an end to end open source semantic search system
    • FeatureForm's embeddinghub combines a database with knn
    • vespa knn-based managed retrieval engine
    • Many other companies are using these systems and releasing open tools along the way; the full list would be too long to include here (for example facebook with faiss and self supervision, google with scann and thousands of papers, microsoft with sptag, spotify with annoy, criteo with rsvd, deepr, autofaiss, …)
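To give a flavor of what aknn libraries like faiss do under the hood, here is a toy inverted-file (IVF-style) index in pure Python. This is only a sketch of the idea behind faiss's IndexIVFFlat, not its actual implementation: vectors are assigned to their nearest coarse centroid, and a query scans only the cells of its closest centroids instead of the whole collection.

```python
def sq_dist(a, b):
    """Squared euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class ToyIVFIndex:
    """Toy inverted-file index: coarse-quantize each vector to a centroid,
    then at query time search only the nprobe closest cells."""
    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = {i: [] for i in range(len(centroids))}

    def add(self, vec, label):
        # assign the vector to the cell of its nearest centroid
        cell = min(range(len(self.centroids)),
                   key=lambda i: sq_dist(vec, self.centroids[i]))
        self.cells[cell].append((label, vec))

    def search(self, query, k=1, nprobe=1):
        # visit only the nprobe nearest cells, not the whole dataset
        probe = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(query, self.centroids[i]))[:nprobe]
        candidates = [item for i in probe for item in self.cells[i]]
        candidates.sort(key=lambda lv: sq_dist(query, lv[1]))
        return [label for label, _ in candidates[:k]]

# two hand-picked centroids splitting a tiny 2-d space
index = ToyIVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add([0.1, 0.2], "near origin")
index.add([9.8, 10.1], "far corner")
index.add([10.2, 9.9], "far corner 2")
```

A query near [10, 10] only scans the second cell, skipping the vectors near the origin entirely; that pruning is what lets real implementations answer queries over millions of embeddings in milliseconds, at the cost of possibly missing neighbors in unvisited cells (which nprobe trades off).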