
rails-scrapper


Commits

List of commits on branch master.
4cd417c803ffe91988dd50544260df9db76c8e56

changes fast

committed 12 years ago

02b80d94463f375b41463362abaf19385ced1c8a

final

committed 12 years ago

8a064ec908a45603c6475e3ebd26120ae20175fb

Added sidekiq gem, and workers

llurraca committed 12 years ago

8c24ac913f6040bd17ace2e490be7fecc6c6fb55

Changes to mechanize, added fixes to crawler

committed 12 years ago

8b2f81fac06ea679b8c7c07bc928f3739f2fd4c3

Added Batch views and controller

llurraca committed 12 years ago

1f2beac8029792bb1465da68aa94ba4398bc7358

fixed crawler

llurraca committed 12 years ago

README

The README file for this repository.

== Andre Web Scrapper

This project lets the user crawl large websites to determine whether they are business sites by matching all of their internal links against a given set of keywords. It also keeps a list of the internal links that matched each keyword.
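The matching step can be sketched as follows. This is a minimal illustration using only the Ruby standard library, not the project's actual crawler: the method name `match_internal_links`, the choice to match keywords against the URL path, and the case-insensitive comparison are all assumptions made for the example. It expects the hrefs to have been extracted from the page already (for instance with Nokogiri or Mechanize).

```ruby
require "uri"

# Hypothetical helper (not from the project): maps each keyword to the
# internal links whose path contains it. `links` is a list of href strings
# already extracted from the page at `base_url`.
def match_internal_links(base_url, links, keywords)
  base = URI.parse(base_url)
  matches = Hash.new { |hash, key| hash[key] = [] }

  links.each do |href|
    begin
      absolute = URI.join(base_url, href) # resolve relative hrefs
    rescue URI::Error
      next # skip malformed hrefs
    end

    # Assumption: a link is "internal" when its host equals the site's host.
    next unless absolute.host == base.host

    path = absolute.path.to_s.downcase
    keywords.each do |keyword|
      matches[keyword] << absolute.to_s if path.include?(keyword.downcase)
    end
  end

  matches
end
```

Under these assumptions a site could be treated as a business site when the returned hash is non-empty; the real decision rule is not stated in this README.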

Sites that are not active are flagged as isActive = false; the flag is stored in the database and appears in the resulting Excel file.
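One possible way to derive the isActive flag is sketched below. The README does not say how "active" is determined, so everything here is an assumption: the method name `site_active?`, the use of a HEAD request, the 5-second timeout, and the rule that any response below HTTP 400 counts as active.

```ruby
require "net/http"
require "uri"

# Hypothetical check (assumption, not the project's rule): a site is active
# when it answers a HEAD request with a status code below 400.
def site_active?(url, timeout: 5)
  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) && uri.host # http/https only

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    http.head(uri.request_uri).code.to_i < 400
  end
rescue StandardError
  # Invalid URLs, DNS failures, refused connections, and timeouts all
  # count as "not active" in this sketch.
  false
end
```

The boolean result would then be written to the site's isActive column before the Excel export.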

== Core Functionalities

  • Support large input files, including Excel files with 200k sites
  • Support multiple keywords per batch crawl
  • Stop/Start/Resume the crawl process
  • Real-time stats
  • Multi-threading for maximum processing speed
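The multi-threading item can be illustrated with a small worker pool built on Ruby's core `Thread` and `Queue` classes. This is a stand-in sketch under stated assumptions, not the project's implementation (the commit log mentions Sidekiq workers); the name `process_in_threads` and the default of four workers are hypothetical.

```ruby
# Hypothetical helper: runs the given block once per item across
# `worker_count` threads and returns a hash of item => block result.
def process_in_threads(items, worker_count: 4, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  results = Queue.new # thread-safe collection of [item, result] pairs

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        begin
          item = queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break # no work left for this worker
        end
        results << [item, block.call(item)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end
```

In a crawl, `items` would be the site URLs of a batch and the block would fetch and classify one site. Because crawling is dominated by network waits, threads help even on Ruby interpreters with a global lock.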

== Additional Information

This is a Ruby on Rails web application running on Rails 3.2.8 and Ruby 1.9.3 (both the latest versions at the time of writing). The database will be MySQL.

  • Additional gems: Nokogiri, Mechanize, RSpec, Devise (more to be added)
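A Gemfile covering the stack described above might look like the fragment below. It is a sketch rather than the repository's actual Gemfile: only the Rails version comes from this README, the `mysql2` adapter and `rspec-rails` are assumed as the usual choices for MySQL and RSpec on Rails, and `sidekiq` is included because the commit log mentions it.

```ruby
# Gemfile (illustrative sketch; gem choices beyond the README are assumptions)
source "https://rubygems.org"

gem "rails", "3.2.8"
gem "mysql2"      # assumed MySQL adapter
gem "nokogiri"    # HTML parsing
gem "mechanize"   # page fetching and navigation
gem "devise"      # authentication
gem "sidekiq"     # background workers, per the commit log

group :development, :test do
  gem "rspec-rails" # assumed RSpec integration for Rails
end
```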