
pdfsummary


Commits

List of commits on branch master.
Verified
7469624b7be404e6a163b57a35b5ddff26d02fc2

Update README.md

archen2019 committed 5 years ago
Unverified
82d3b4ca3ade470610cb10026124cfdd373f4bca

Fixed supported for windows.

howard-yen committed 5 years ago
Verified
7905a80b0d70af308734085a3218543feab413ff

Update requirements.txt.

howard-yen committed 5 years ago
Unverified
a865ed3b869c79f4508a088a4ab234ac4ea7c2b3

Merge branch 'master' of https://github.com/archen2019/pdfsummary

howard-yen committed 5 years ago
Verified
ce7ea859e43253aabb55bd17fb2d70c70034358b

Update README.md

archen2019 committed 5 years ago
Unverified
b5967e022f5054b2a0f61fdf18cee81ada669fcc

Update requirements.txt.

howard-yen committed 5 years ago

README


pdfsummary

Python script that creates a text summary of a PDF file.

Created for HackPrinceton2019.

Installation

To install pdfsummary from GitHub:

git clone https://github.com/archen2019/pdfsummary

pdfsummary also needs the dependencies listed in requirements.txt. To install, run the following:

pip install -r requirements.txt

pdfsummary also requires tesseract, which is not a Python package and must be installed separately. To install it, follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki.

Usage

To run pdfsummary, enter the cloned folder and run run.py. When prompted, enter the file name of the PDF, the number of sentences to include in the summary, and the number of key phrases to extract.

$ cd pdfsummary
$ python run.py
File Name: [FILE-NAME]
Number of sentences in summary: [NUM-SENTENCES]
Number of key phrases: [NUM-PHRASES]

This creates four files in the directory of the original PDF:

  • keyphrases.txt: A text file containing the key phrases.
  • summary.txt: A text file containing the summary.
  • Summary.pdf: A PDF file containing the key phrases and the summary.
  • highlighted.pdf: A copy of the original PDF with the key phrases highlighted.
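
As a rough illustration of where these files end up, the sketch below derives the four output paths from the path of the input PDF. The function name output_paths is hypothetical and is not taken from run.py; the only things assumed from this README are the four file names and the fact that they are written next to the original PDF.

```python
from pathlib import Path
from typing import Dict

# File names as listed in this README.
OUTPUT_NAMES = ("keyphrases.txt", "summary.txt", "Summary.pdf", "highlighted.pdf")


def output_paths(pdf_path: str) -> Dict[str, Path]:
    """Map each output file name to its location beside the input PDF.

    Hypothetical helper for illustration only; run.py may build
    these paths differently.
    """
    out_dir = Path(pdf_path).parent
    return {name: out_dir / name for name in OUTPUT_NAMES}


if __name__ == "__main__":
    for name, path in output_paths("docs/paper.pdf").items():
        print(name, "->", path)
```

Because the names are fixed, running the script on a second PDF in the same directory would presumably overwrite the earlier outputs, so it may be worth keeping each PDF in its own folder.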

Methodology

  1. Use pdf2image to convert the PDF into PNG images.
  2. Use tesseract to extract text from the images and create a text-searchable copy of the original PDF.
  3. Process the extracted text to remove extra newlines and reconnect words hyphenated across line breaks.
  4. Use sumy to generate a summary of the processed text.
  5. Use pke to generate key phrases from the processed text.
  6. Create a PDF containing the key phrases and the summary.
  7. Highlight the key phrases in the text-searchable PDF.
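
Step 3 is the easiest to illustrate in isolation. The following is a minimal sketch of that kind of cleanup using only the standard library; it is not the code from this repository. The function name clean_ocr_text and the choice to treat blank lines as paragraph breaks are assumptions made for the example.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across lines and remove extra newlines.

    Illustrative sketch only; the repository's own processing may differ.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Reconnect words split by a hyphen at the end of a line:
    # "sum-\nmary" becomes "summary".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Assumption: one or more blank lines mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)

    # Inside a paragraph, the remaining newlines are just line wraps,
    # so collapse every run of whitespace into a single space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]

    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "The sum-\nmary is\nshort.\n\n\nNext para-\ngraph."
    print(clean_ocr_text(sample))
```

One limitation of this approach: a genuinely hyphenated compound that happens to wrap at its hyphen (for example "well-" followed by "known" on the next line) is also joined into a single word, since the text alone does not say which hyphens were introduced by line wrapping.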

Citations

Boudin, Florian. “Pke: An Open Source Python-Based Keyphrase Extraction Toolkit.” Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, 2016, pp. 69–73. ACLWeb, https://www.aclweb.org/anthology/C16-2015.