


Homework 3 - Which book would you recommend?


Task

The goal of this project was to experiment with crawling and parsing, and to gain confidence with search engine techniques: retrieving results for a conjunctive query, scoring results with cosine similarity and tf-idf, and so on. Finally, we were asked to write both a recursive and a dynamic programming algorithm for the problem of the longest increasing subsequence of a string.
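To illustrate the dynamic programming side of the task, here is a minimal O(n²) sketch that computes the length of the longest strictly increasing subsequence of a string; the function name and formulation are illustrative, not the actual homework code:

```python
def lis_length(s: str) -> int:
    """Length of the longest strictly increasing subsequence of s, via O(n^2) DP."""
    if not s:
        return 0
    # best[i] = length of the longest increasing subsequence ending at s[i]
    best = [1] * len(s)
    for i in range(1, len(s)):
        for j in range(i):
            if s[j] < s[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best)

# For example, "abac" contains the increasing subsequence "abc", so lis_length("abac") is 3.
```

The recursive variant explores the same subproblems top-down; the DP table above simply memoizes them bottom-up.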

We also decided to produce an original logo for our search engine! It does remind me of something, not sure what though...

Pronunciation: goo·gs

Usage

The repository includes requirements.txt, a file listing the packages to be installed using conda, like so:

conda install --file requirements.txt

Once the requirements are installed, the scripts should run without issues. Consider also creating a new environment, so that you don't have to untangle what was needed for this project from the rest of your setup once you're done. With conda, that's easily done with the following command:

conda create --name <env> --file requirements.txt

where you have to replace <env> with the name you want to give to the new environment. Remember to activate it with conda activate <env> before running the scripts.

Repo structure

The repository contains the following files and directories:

  1. data:

    This directory contains both the data retrieved just after the crawling part (parsed_books.tsv) and the data after the preprocessing and cleaning part (clean_data.csv).

  2. images:

    This directory contains the search engine logo and the images used in the recursive complexity proof; you can safely ignore it.

  3. indexes:

    This directory contains the pickle objects for the vocabulary, the inverted index dictionary and the tfidf inverted index.

  4. book_links.txt:

    A txt file containing the URLs of all the HTML pages for the books.

  5. data_collector.py:

    A Python script containing the functions to download the txt file and the html pages for the books.

  6. functions.py:

    A Python script containing all the functions used in the main.ipynb, apart from the data collection and parsing parts.

  7. main.ipynb:

    A Jupyter notebook which provides the solutions to all the homework questions. The notebook mostly contains the answers themselves; the only code included is for exercise 5, where the answer is the code itself.

  8. parser.py:

    A Python script containing the functions to parse the html pages and extract the tsv file.

  9. requirements.txt:

    A txt file containing the dependencies of the project; see the Usage section for details.
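The query flow the indexes directory supports (conjunctive filtering over an inverted index, then tf-idf/cosine ranking) can be sketched roughly as follows. The data structures and function names here are illustrative toy stand-ins, not the actual pickled objects or the code in functions.py:

```python
import math
from collections import Counter


def build_index(docs):
    """Build a toy tf-idf inverted index: term -> {doc_id: tf-idf weight}."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            weight = (count / len(tokens)) * math.log(n / df[term])
            index.setdefault(term, {})[doc_id] = weight
    return index


def conjunctive_search(index, query):
    """Docs containing *every* query term, ranked by cosine similarity."""
    terms = query.lower().split()
    postings = [set(index.get(t, {})) for t in terms]
    matches = set.intersection(*postings) if postings else set()
    # Document norms over all indexed terms; the query vector has uniform
    # weights, so it only rescales scores and is omitted from the ratio.
    norms = Counter()
    for term_postings in index.values():
        for doc_id, w in term_postings.items():
            norms[doc_id] += w * w
    scores = {d: sum(index[t][d] for t in terms) / (math.sqrt(norms[d]) or 1.0)
              for d in matches}
    return sorted(scores, key=scores.get, reverse=True)


# Tiny usage demo on made-up titles
docs = ["lord of the rings", "the hobbit", "lord of the flies"]
index = build_index(docs)
```

A query like "lord of" then returns only the documents containing both terms, ordered by cosine score; the real project applies the same idea to the crawled book data.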