covid19-kaggle

Commits

List of commits on branch master.
Unverified
be8efc825c2b84575937c7d0e2e3c3f99690c55a

Update README

gclen committed 5 years ago
Unverified
aec439f9b44cc0194658c8880f5259fc33483db3

Update bokeh version

gclen committed 5 years ago
Unverified
1c4ba8826fc43164db538d61d09c2bb918413866

Make lots of smaller clusters

gclen committed 5 years ago
Unverified
950123fc8287706e45f8d00a0eab7de26ab631bf

Merge branch 'master' of github.com:gclen/covid19-kaggle

gclen committed 5 years ago
Unverified
477d2c93477608917b52e09039bfc07b53da480b

Make urls clickable

gclen committed 5 years ago
Verified
ebe01c475c8f83c1f917a2d3a25de1a2a1fa1380

Update README.md

gclen committed 5 years ago

README

You can interact with an embedding of the abstracts here

Methodology

To find related papers, we used the following steps:

  1. Load the dataset using code from this Kaggle kernel
  2. Tokenize the abstracts using scispaCy and vectorize them using scikit-learn's TfidfVectorizer
  3. Embed the vectors into a lower-dimensional space using UMAP
  4. Cluster the embedding using HDBSCAN to find related abstracts
  5. Rank each point by its distance to a representative point within its cluster to find the most relevant documents in that cluster
  6. Use Bokeh to create widgets to visualize and interact with the embedding
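
The ranking in step 5 can be sketched in plain Python. This is a minimal illustration, not the code from this repository: the README does not say how the representative point is chosen, so the sketch assumes it is the cluster centroid. The function name `rank_within_clusters` and the toy coordinates are hypothetical. The only library behaviour relied on is that HDBSCAN labels noise points `-1`.

```python
from collections import defaultdict
from math import dist
from statistics import fmean

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def rank_within_clusters(points, labels):
    """Return {cluster label: [(point index, distance), ...]}, nearest first.

    Assumption: the representative point of a cluster is its centroid
    (the coordinate-wise mean of its members). The README does not
    specify how the representative point is chosen.
    """
    # Group point indices by cluster label, skipping noise points.
    members = defaultdict(list)
    for index, label in enumerate(labels):
        if label != NOISE_LABEL:
            members[label].append(index)

    ranking = {}
    for label, indices in members.items():
        # Centroid: mean of each coordinate over the cluster's members.
        centroid = [fmean(axis) for axis in zip(*(points[i] for i in indices))]
        scored = [(i, dist(points[i], centroid)) for i in indices]
        ranking[label] = sorted(scored, key=lambda pair: pair[1])
    return ranking


# Toy 2-D embedding standing in for UMAP output (made-up values).
embedding = [
    (0.0, 0.0), (1.0, 0.0), (0.2, 0.0),   # cluster 0
    (5.0, 5.0), (5.0, 7.0), (5.0, 6.5),   # cluster 1
    (9.0, 9.0),                           # noise
]
cluster_labels = [0, 0, 0, 1, 1, 1, -1]

ranking = rank_within_clusters(embedding, cluster_labels)
for label, ranked in sorted(ranking.items()):
    print(label, [index for index, _ in ranked])
```

In a real run, `embedding` would be the UMAP output and `cluster_labels` the labels returned by HDBSCAN. Using an actual member of the cluster as the representative point, instead of the centroid, would only change how `centroid` is computed.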