


Homework 3 - Which book would you recommend?


Task

The goal of this project was to experiment with crawling and parsing, and to gain confidence with search engine techniques: retrieving results for a conjunctive query, scoring results with cosine similarity and tf-idf, and so on. Finally, we were asked to write both a recursive and a dynamic programming algorithm for the problem of the longest increasing subsequence of a string.
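To illustrate the dynamic programming side of the task, here is a minimal O(n²) sketch that computes the length of the longest strictly increasing subsequence of a string; the function name and formulation are illustrative, not the actual homework code:

```python
def lis_length(s: str) -> int:
    """Length of the longest strictly increasing subsequence of s, via O(n^2) DP."""
    if not s:
        return 0
    # best[i] = length of the longest increasing subsequence ending at s[i]
    best = [1] * len(s)
    for i in range(1, len(s)):
        for j in range(i):
            if s[j] < s[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best)

# For example, "abac" contains the increasing subsequence "abc", so lis_length("abac") is 3.
```

The recursive variant explores the same subproblems top-down; the DP table above simply memoizes them bottom-up.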

We also decided to produce an original logo for our search engine! It does remind me of something, not sure what though...

Pronunciation: goo·gs

Usage

The repository includes requirements.txt, a file listing the packages to be installed using conda, like so:

conda install --file requirements.txt

Once the requirements are installed, the scripts should run without issues. Consider also creating a new environment, so that you don't have to untangle what was needed for this project from the rest of your setup once you're done. With conda, that's easily done with the following command:

conda create --name <env> --file requirements.txt

where you have to replace <env> with the name you want to give to the new environment. Remember to activate it with conda activate <env> before running the scripts.

Repo structure

The repository contains the following files and directories:

  1. data:

    This directory contains both the data retrieved just after the crawling part (parsed_books.tsv) and the data after the preprocessing and cleaning part (clean_data.csv).

  2. images:

    This directory contains the search engine logo and the images used in the recursive complexity proof; you can safely ignore it.

  3. indexes:

    This directory contains the pickle objects for the vocabulary, the inverted index dictionary and the tfidf inverted index.

  4. book_links.txt:

    A txt file containing the URLs of all the HTML pages for the books.

  5. data_collector.py:

    A Python script containing the functions to download the txt file and the html pages for the books.

  6. functions.py:

    A Python script containing all the functions used in the main.ipynb, apart from the data collection and parsing parts.

  7. main.ipynb:

    A Jupyter notebook which provides the solutions to all the homework questions. The notebook mostly contains the answers themselves; the only code included is for exercise 5, where the answer is the code itself.

  8. parser.py:

    A Python script containing the functions to parse the html pages and extract the tsv file.

  9. requirements.txt:

    A txt file containing the dependencies of the project; see the Usage section for details.
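The query flow the indexes directory supports (conjunctive filtering over an inverted index, then tf-idf/cosine ranking) can be sketched roughly as follows. The data structures and function names here are illustrative toy stand-ins, not the actual pickled objects or the code in functions.py:

```python
import math
from collections import Counter


def build_index(docs):
    """Build a toy tf-idf inverted index: term -> {doc_id: tf-idf weight}."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            weight = (count / len(tokens)) * math.log(n / df[term])
            index.setdefault(term, {})[doc_id] = weight
    return index


def conjunctive_search(index, query):
    """Docs containing *every* query term, ranked by cosine similarity."""
    terms = query.lower().split()
    postings = [set(index.get(t, {})) for t in terms]
    matches = set.intersection(*postings) if postings else set()
    # Document norms over all indexed terms; the query vector has uniform
    # weights, so it only rescales scores and is omitted from the ratio.
    norms = Counter()
    for term_postings in index.values():
        for doc_id, w in term_postings.items():
            norms[doc_id] += w * w
    scores = {d: sum(index[t][d] for t in terms) / (math.sqrt(norms[d]) or 1.0)
              for d in matches}
    return sorted(scores, key=scores.get, reverse=True)


# Tiny usage demo on made-up titles
docs = ["lord of the rings", "the hobbit", "lord of the flies"]
index = build_index(docs)
```

A query like "lord of" then returns only the documents containing both terms, ordered by cosine score; the real project applies the same idea to the crawled book data.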