very_long_scientific_papers


Commits

List of commits on branch master.
Verified
31830249bf80add8fc39ef5790bafbf4c5df4eef

Update README.md

gghomasHudson committed 3 years ago
Unverified
6a719aa0a4f0a9c38fc13a94ab397a34ef1b7657

Removed old data

gghomasHudson committed 3 years ago
Unverified
e4a1958d1104b3f15ab83b935ad9caf99d1a1d5a

Added extra docs

gghomasHudson committed 3 years ago
Unverified
8d5a4ab2c24b05fe547997dca7bee9c0636ce563

Matched abstracts

gghomasHudson committed 3 years ago
Unverified
8afc84d86c8db9a345e5cc1a301e4067a8656f5a

Add deduped

gghomasHudson committed 3 years ago
Verified
1f288d5fe51cedc4ecebd8525c24b0c22778e1a5

Update 0006012v1.abstract.txt

gghomasHudson committed 3 years ago


Very Long Scientific Papers

This repository contains the code and data for the Very Long Scientific Papers dataset, which is built from papers on arxiv.org. The data is stored under the final/test directory: each paper has a PAPER_ID.main.txt file holding the body text and a corresponding PAPER_ID.abstract.txt file holding its abstract.
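The file layout above suggests a simple way to read the dataset back in. The following is a minimal sketch, not code from this repository: the function name load_pairs, the UTF-8 encoding, and the choice to skip papers with a missing abstract file are all assumptions made for illustration.

```python
from pathlib import Path


def load_pairs(data_dir="final/test"):
    """Return {paper_id: (main_text, abstract_text)} for every paired file.

    Hypothetical helper: assumes UTF-8 text files named
    PAPER_ID.main.txt and PAPER_ID.abstract.txt in the same directory.
    """
    suffix = ".main.txt"
    pairs = {}
    for main_path in sorted(Path(data_dir).glob("*" + suffix)):
        # Strip the ".main.txt" suffix to recover the paper ID.
        paper_id = main_path.name[: -len(suffix)]
        abstract_path = main_path.with_name(paper_id + ".abstract.txt")
        if not abstract_path.exists():
            # Assumption: an unpaired body file is skipped, not an error.
            continue
        pairs[paper_id] = (
            main_path.read_text(encoding="utf-8"),
            abstract_path.read_text(encoding="utf-8"),
        )
    return pairs
```

Calling load_pairs() from the repository root would then give one (body, abstract) tuple per paper ID.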

Data gathering process

The data is gathered (main.py) using the following steps:

  • Search the arXiv API for any paper with the word "thesis" in the title
  • Download the source for these documents
  • Use engrafo to convert the source into HTML
  • Filter the HTML to remove math, images, etc.
  • Find the abstract and separate it from the body (if no abstract can be found, skip the document)
  • Convert to txt format
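Two of the steps above, the title search and the HTML filtering with abstract separation, can be sketched as follows. This is an illustration under stated assumptions and not the contents of main.py: it assumes the search uses the public arXiv API query endpoint with the ti: (title) field prefix, and that the converted HTML marks the abstract with a div of class ltx_abstract, as LaTeXML-style output typically does. The names build_search_url, AbstractSplitter and split_document are hypothetical.

```python
import urllib.parse
from html.parser import HTMLParser

ARXIV_API = "http://export.arxiv.org/api/query"
# Assumption: the abstract is wrapped in <div class="ltx_abstract">.
ABSTRACT_CLASS = "ltx_abstract"
# Elements whose text is dropped. <img> carries no text, so it needs no entry.
SKIPPED_TAGS = {"math", "figure", "table", "script", "style"}


def build_search_url(start=0, max_results=100):
    """Build an arXiv API query for papers with 'thesis' in the title."""
    params = {"search_query": "ti:thesis", "start": start,
              "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


class AbstractSplitter(HTMLParser):
    """Collect visible text, routing abstract text and body text separately."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.abstract_depth = 0  # > 0 while inside the abstract div
        self.main, self.abstract = [], []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs so the matching close ends the abstract.
            if self.abstract_depth or ABSTRACT_CLASS in classes:
                self.abstract_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "div" and self.abstract_depth:
            self.abstract_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        target = self.abstract if self.abstract_depth else self.main
        target.append(data)


def split_document(html):
    """Return (main_text, abstract_text), or None if no abstract is found."""
    parser = AbstractSplitter()
    parser.feed(html)
    parser.close()
    # Join fragments with spaces, then collapse runs of whitespace.
    abstract = " ".join(" ".join(parser.abstract).split())
    main = " ".join(" ".join(parser.main).split())
    if not abstract:
        return None  # mirrors the "skip document" rule in the list above
    return main, abstract
```

Joining fragments with a space keeps text from adjacent block elements apart, at the cost of occasionally inserting a space inside a word that is split by inline markup; a real pipeline might prefer a dedicated HTML library for this.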

To gather your own data, simply run main.py.