A simple web crawler
You will need Go 1.3 or newer
- Set up a workspace as described at http://golang.org/doc/code.html#Workspaces.
- Download the code into the workspace
- Get the required libraries
- Test and build
- Run
# set up the Go workspace
$ cd ~
$ mkdir go
$ export GOPATH=$HOME/go
# download code
$ mkdir -p go/src/github.com/jkamenik
$ cd go/src/github.com/jkamenik
$ git clone http://github.com/jkamenik/crawler
# get the libraries
$ cd ~/go/src/github.com/jkamenik/crawler
$ go get .
# test and build
$ go test .
$ go install .
# run
$ ~/go/bin/crawler <args>
The goal is to provide a tool that takes a single command-line argument, a URL, and reports the links found on that page after crawling it; a rough sketch in Go follows the example output below.
The following requirements apply to this challenge:
- The tool must download the HTML
- The tool must parse and print all the links found in that HTML
- The tool must allow for an optional depth argument (default 2) which controls how many levels deep it will crawl for links.
- The output should be the link's text followed by the link's URL (see the example below).
- A reasonable exit code needs to be provided if the main URL is not accessible; 2nd level URL errors can be ignored.
$ crawler http://somedomain.com
Home -> /
About Us -> /about_us.php
Careers -> http://otherdomain.com/somedomain.com
Home -> http://somedomain.com
Careers -> /somedomain.com
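
As a rough illustration only, here is a minimal sketch of the required behaviour. It assumes the golang.org/x/net/html parser for link extraction (older Go releases used a different import path), and the names crawl, extractLinks, and nodeText are illustrative, not taken from the repository.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

type link struct{ text, href string }

func main() {
	depth := flag.Int("depth", 2, "how many levels of pages to crawl")
	flag.Parse()
	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: crawler [-depth n] <url>")
		os.Exit(1)
	}
	if err := crawl(flag.Arg(0), *depth); err != nil {
		// The main URL was not accessible: report it and exit non-zero.
		fmt.Fprintln(os.Stderr, "crawler:", err)
		os.Exit(1)
	}
}

// crawl downloads one page, prints its links, and recurses until depth reaches 0.
func crawl(url string, depth int) error {
	if depth <= 0 {
		return nil
	}
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	doc, err := html.Parse(resp.Body)
	if err != nil {
		return err
	}
	for _, l := range extractLinks(doc) {
		fmt.Printf("%s -> %s\n", l.text, l.href)
		// Relative hrefs would need resolving against the base URL (net/url)
		// before being fetched; errors below the top level are ignored.
		crawl(l.href, depth-1)
	}
	return nil
}

// extractLinks walks the parse tree and collects each anchor's text and href.
func extractLinks(n *html.Node) []link {
	var out []link
	if n.Type == html.ElementNode && n.Data == "a" {
		l := link{text: strings.TrimSpace(nodeText(n))}
		for _, a := range n.Attr {
			if a.Key == "href" {
				l.href = a.Val
			}
		}
		out = append(out, l)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		out = append(out, extractLinks(c)...)
	}
	return out
}

// nodeText concatenates all text nodes beneath n.
func nodeText(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	s := ""
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		s += nodeText(c)
	}
	return s
}
```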
Extra credit (optional)
- Parallelize the downloading, parsing, and collecting of links (see the sketch after this list)
- Follow redirects on any page
- Add debugging which is off by default and can be enabled with "-v"
- Control the level of debugging by repeating "-v" (i.e., "-vvvv")
- Save the HTML in a folder matching the link title
- Save any resources used by the page: CSS, JS, and images
- Rewrite the links and references in the HTML to be relative file paths
- Enable JavaScript, using Selenium WebDriver or similar
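
One possible shape for the parallel-download extra credit is sketched below: each URL is fetched in its own goroutine and the results are collected over a channel. The page type and the fetchAll function are hypothetical names for illustration, not the repository's API.

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"sync"
)

type page struct {
	url  string
	body []byte
	err  error
}

// fetchAll downloads every URL concurrently and returns the results once all
// goroutines have finished; failed pages carry their error instead of a body.
func fetchAll(urls []string) []page {
	results := make(chan page, len(urls))
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				results <- page{url: u, err: err}
				return
			}
			defer resp.Body.Close()
			body, err := ioutil.ReadAll(resp.Body)
			results <- page{url: u, body: body, err: err}
		}(u)
	}
	wg.Wait()
	close(results)

	var pages []page
	for p := range results {
		pages = append(pages, p)
	}
	return pages
}

func main() {
	for _, p := range fetchAll([]string{"http://example.com", "http://example.org"}) {
		if p.err != nil {
			fmt.Println(p.url, "failed:", p.err)
			continue
		}
		fmt.Println(p.url, "->", len(p.body), "bytes")
	}
}
```

The downloaded bodies could then be handed to the link-extraction step from the earlier sketch, keeping the parsing and collecting stages parallel as well.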