covid19-kaggle

You can interact with an embedding of the abstracts here

Methodology

To find related papers, we used the following steps:

1. Load the dataset using code from this Kaggle kernel.
2. Tokenize the abstracts using scispacy and vectorize them using sklearn's TfidfVectorizer.
3. Embed the vectors into a lower-dimensional space using UMAP.
4. Cluster the embedding using HDBSCAN to find related abstracts.
5. Rank each point by its distance to a representative point within its cluster to find relevant documents within that cluster.
6. Use Bokeh to create widgets to visualize and interact with the embedding.
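
The ranking step can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function name `rank_within_clusters`, the toy coordinates, and the choice of the cluster member nearest the centroid as the representative point are all assumptions, since the README does not say how the representative point is chosen. Label -1 is treated as noise, following HDBSCAN's convention.

```python
import math
from collections import defaultdict


def rank_within_clusters(points, labels):
    """Rank the points of each cluster by distance to a representative point.

    points: list of (x, y) embedding coordinates, e.g. a 2-D UMAP output.
    labels: one cluster label per point; -1 marks noise, as HDBSCAN does.
    Returns {label: [point indices, nearest to the representative first]}.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:  # noise points belong to no cluster
            members[label].append(idx)

    ranking = {}
    for label, idxs in members.items():
        # Assumption: the representative is the member closest to the centroid.
        cx = sum(points[i][0] for i in idxs) / len(idxs)
        cy = sum(points[i][1] for i in idxs) / len(idxs)
        rep = min(idxs, key=lambda i: math.dist(points[i], (cx, cy)))
        # Sort the cluster's members by distance to the representative point.
        ranking[label] = sorted(
            idxs, key=lambda i: math.dist(points[i], points[rep])
        )
    return ranking


# Toy 2-D embedding: two clusters and one noise point (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0),
          (10.0, 10.0), (10.0, 12.0), (50.0, 50.0)]
labels = [0, 0, 0, 1, 1, -1]
ranking = rank_within_clusters(points, labels)
```

Each list in `ranking` starts with the representative point itself (distance zero), so the first few indices of a cluster are the documents most typical of it.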