You can interact with an embedding of the abstracts here.
To find related papers, we used the following methods:
- Load the dataset using code from this Kaggle kernel
- Tokenize the abstracts using scispaCy and vectorize them using sklearn's TfidfVectorizer
- Embed the vectors into a lower dimensional space using UMAP
- Cluster the embedding using HDBSCAN to find related abstracts
- Rank each point by its distance to a representative point of its cluster to find the most relevant documents in that cluster
- Use Bokeh to create widgets to visualize and interact with the embedding
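
The ranking step in the list above can be sketched in plain Python. This is a minimal illustration, not the code used for the project: the function name `rank_within_clusters`, the toy coordinates, and the choice of representative (the cluster member nearest the cluster mean) are all assumptions, since the list does not say how the representative point is picked. The only detail taken from HDBSCAN itself is that noise points carry the label `-1`.

```python
from collections import defaultdict
from math import dist


def rank_within_clusters(points, labels):
    """Return {label: [point indices, closest to the representative first]}."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # HDBSCAN marks noise points with -1; skip them
            members[label].append(idx)

    ranking = {}
    for label, idxs in members.items():
        # Assumption: the representative is the member nearest the cluster mean.
        dims = len(points[idxs[0]])
        mean = [sum(points[i][d] for i in idxs) / len(idxs) for d in range(dims)]
        rep = min(idxs, key=lambda i: dist(points[i], mean))
        # Sort members by Euclidean distance to the representative point.
        ranking[label] = sorted(idxs, key=lambda i: dist(points[i], points[rep]))
    return ranking


# Toy 2-D embedding standing in for the UMAP output (made-up values).
embedding = [
    (0.0, 0.0), (0.1, 0.0), (1.0, 0.0),  # cluster 0
    (5.0, 5.0), (5.0, 6.0), (5.0, 8.0),  # cluster 1
    (9.0, 9.0),                          # noise
]
cluster_labels = [0, 0, 0, 1, 1, 1, -1]

ranked = rank_within_clusters(embedding, cluster_labels)
print(ranked)
```

Each list starts with the representative itself, followed by the remaining members in order of increasing distance, so the first few indices are the abstracts most typical of that cluster. Swapping in a different representative (for example, a point HDBSCAN considers most central) only changes how `rep` is chosen.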