Commits

List of commits on branch master.
  • 76d871f4b1a78d18f8df63acf31c90d399f7f085 - adding build status (jjkamenik, committed 10 years ago)
  • e79e03686a2285e8acd2b095c5a595576d156023 - Adding travis CI (jjkamenik, committed 10 years ago)
  • c56fdea34629bc8d05c84d2744ec42878f92e21b - Adding a constructor and Valid member (jjkamenik, committed 10 years ago)
  • 2b5b9a17cb9e98102e9024c3aa142d21e1c011f0 - Adding the forgotten main code (jjkamenik, committed 10 years ago)
  • 084824f444f932e5bd7ddacecebdac9dfae5b1d2 - Adding a README (jjkamenik, committed 10 years ago)
  • 7dcb8f4067aaa6abd16fcc3c7ed274ced92992b2 - An http error wrapper (jjkamenik, committed 10 years ago)

README

The README file for this repository.

Crawler

[Travis CI build status badge]

A simple web crawler

Setup

You will need Go 1.3.

  1. Set up a workspace as per http://golang.org/doc/code.html#Workspaces.
  2. Download the code into the workspace.
  3. Get the required libraries.
  4. Test and build.
  5. Run.
# build go workspace
$ cd ~
$ mkdir go
$ export GOPATH=$HOME/go

# download code
$ mkdir -p go/src/github.com/jkamenik
$ cd go/src/github.com/jkamenik
$ git clone http://github.com/jkamenik/crawler

# get the libraries
$ cd ~/go/src/github.com/jkamenik/crawler
$ go get .

# test and build
$ go test .
$ go build

# run
$ ~/go/bin/crawler <args>

Challenge

The goal is to provide a tool that takes a single command-line argument, a URL, and determines the content of that URL by crawling it.

The following requirements apply to this challenge (a sketch of one possible approach follows the example output below):

  1. The tool must download the HTML.
  2. The tool must parse and print all of the links found in that HTML.
  3. The tool must accept an optional depth argument (default 2) that controls how many levels of pages it will crawl for links.
  4. The output should be the link's text followed by the link's URL (see the example below).
  5. A reasonable exit code must be returned if the main URL is not accessible; errors on second-level URLs can be ignored.
$ crawler http://somedomain.com
Home -> /
About Us -> /about_us.php
Careers -> http://otherdomain.com/somedomain.com
  Home -> http://somedomain.com
  Careers -> /somedomain.com
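
The repository's actual code is not reproduced here, but a minimal sketch of a tool meeting these requirements could look like the following. It assumes the golang.org/x/net/html package for link parsing; the -depth flag name, the crawl and linkText helpers, and the exact exit codes are illustrative choices rather than the project's own, and resolving relative links against the base URL is left out for brevity.

package main

import (
	"flag"
	"fmt"
	"net/http"
	"os"

	"golang.org/x/net/html"
)

func main() {
	depth := flag.Int("depth", 2, "how many levels of pages to crawl")
	flag.Parse()
	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: crawler [-depth n] <url>")
		os.Exit(1)
	}
	// A failure on the top-level URL is fatal; deeper errors are ignored.
	if err := crawl(flag.Arg(0), *depth, ""); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
}

// crawl downloads one page, prints its links as "text -> href", and recurses
// until depth is exhausted. indent grows by two spaces per level.
func crawl(url string, depth int, indent string) error {
	if depth <= 0 {
		return nil
	}
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return err
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					fmt.Printf("%s%s -> %s\n", indent, linkText(n), a.Val)
					crawl(a.Val, depth-1, indent+"  ") // second-level errors are ignored
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return nil
}

// linkText concatenates the text nodes directly under an anchor element.
func linkText(n *html.Node) string {
	text := ""
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if c.Type == html.TextNode {
			text += c.Data
		}
	}
	return text
}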

Extra credit (optional)

  • Parallelize the downloading, parsing, and collecting of links (see the sketch after this list)
  • Follow redirects of any page
  • Add debugging output that is off by default and can be enabled with "-v"
    • Control the level of debugging by repeating "-v" (e.g., "-vvvv")
  • Save the HTML in a folder matching the link title
  • Save any resources used by the page: CSS, JS, and images
    • Rewrite the links and references in the HTML to be relative file paths
  • Enable JavaScript using Selenium WebDriver or similar
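
For the parallelization item, one common Go approach is to fetch each page in its own goroutine and collect the discovered links over a channel. The Link type, crawlAll function, and fetchLinks parameter below are illustrative names, not part of this repository; the fetcher is stubbed so the sketch runs on its own.

package main

import (
	"fmt"
	"sync"
)

// Link pairs an anchor's text with its href, matching the required output.
type Link struct {
	Text, URL string
}

// crawlAll fetches every URL concurrently and merges the links it finds.
func crawlAll(urls []string, fetchLinks func(string) []Link) []Link {
	var wg sync.WaitGroup
	results := make(chan []Link, len(urls)) // buffered so no goroutine blocks on send
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			results <- fetchLinks(u) // download + parse happens in parallel
		}(u)
	}
	wg.Wait()
	close(results)
	var all []Link
	for links := range results {
		all = append(all, links...)
	}
	return all
}

func main() {
	// Stubbed fetcher; a real one would do the HTTP GET and HTML parse from
	// the main challenge.
	fake := func(u string) []Link { return []Link{{Text: "Home", URL: u}} }
	for _, l := range crawlAll([]string{"http://a.example", "http://b.example"}, fake) {
		fmt.Printf("%s -> %s\n", l.Text, l.URL)
	}
}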