Python script that creates a text summary of a PDF file.
Created for HackPrinceton2019.
To install pdfsummary from GitHub:
$ git clone https://github.com/archen2019/pdfsummary
pdfsummary also needs the dependencies listed in requirements.txt. To install them, run the following:
$ pip install -r requirements.txt
tesseract is also a required dependency. To install tesseract, follow the instructions at https://github.com/tesseract-ocr/tesseract/wiki.
To run pdfsummary, enter the cloned folder and run the file run.py. When prompted, enter the file name of the PDF, the number of sentences in the summary, and the number of key phrases.
$ cd pdfsummary
$ python run.py
File Name: [FILE-NAME]
Number of sentences in summary: [NUM-SENTENCES]
Number of key phrases: [NUM-PHRASES]
This will create four files in the directory of the original PDF:
- keyphrases.txt: A text file containing the key phrases.
- summary.txt: A text file containing the summary.
- Summary.pdf: A PDF file containing the key phrases and the summary.
- highlighted.pdf: A PDF file containing the original PDF, with key phrases highlighted.
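To check that a run produced everything, the expected output paths can be derived from the location of the input PDF. The following is a minimal sketch, assuming the four file names listed above are written next to the input PDF; the helper name expected_outputs is hypothetical and not part of pdfsummary.

```python
from pathlib import Path
from typing import List

# File names taken from the list above.
OUTPUT_NAMES = ("keyphrases.txt", "summary.txt", "Summary.pdf", "highlighted.pdf")


def expected_outputs(pdf_path: str) -> List[Path]:
    """Return the paths where the four output files should appear.

    Hypothetical helper: it only builds paths, it does not run pdfsummary.
    """
    folder = Path(pdf_path).parent  # outputs land beside the original PDF
    return [folder / name for name in OUTPUT_NAMES]
```

After a run, something like `[p.name for p in expected_outputs("paper.pdf") if not p.exists()]` would list any outputs that are missing.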
pdfsummary works in the following steps:
- Use pdf2image to convert the PDF into PNG images.
- Use tesseract to extract text from the images and create a text-searchable copy of the original PDF.
- Process the text to remove extra newlines and reconnect hyphenated words.
- Use sumy to generate a summary of the processed text.
- Use pke to generate key phrases from the processed text.
- Create a PDF containing the key phrases and the summary.
- Highlight the key phrases in the text-searchable PDF.
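The text-processing step (removing extra newlines and reconnecting hyphenated words) could look roughly like the following. This is a minimal sketch using only the standard library, not the code in the repository; the function name clean_ocr_text and the regular expressions are assumptions made for illustration.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and unwrap lines.

    Illustrative only: blank lines are treated as paragraph breaks,
    and single newlines inside a paragraph are treated as line wraps.
    """
    # Normalise line endings so the patterns below only deal with "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # "summar-\nization" -> "summarization": drop a hyphen that sits
    # between two word characters and is followed by a line break.
    text = re.sub(r"(?<=\w)-\n[ \t]*(?=\w)", "", text)

    # Split on blank lines so paragraph boundaries survive the unwrap.
    paragraphs = re.split(r"\n\s*\n", text)

    # Inside each paragraph, collapse any run of whitespace to one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

One limitation of this approach: a genuinely hyphenated compound that happens to break at the end of a line (for example "text-" followed by "searchable") is also merged into a single word, so a dictionary check may be worth adding if that matters for the summary.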
Boudin, Florian. “Pke: An Open Source Python-Based Keyphrase Extraction Toolkit.” Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, The COLING 2016 Organizing Committee, 2016, pp. 69–73. ACLWeb, https://www.aclweb.org/anthology/C16-2015.