

ant (alpha) is a web crawler for Go.








Declarative

The package includes functions that scan data from the page into your structs or slices of structs. This reduces the noise and complexity in your source code.

You can also use a jQuery-like API that allows you to scrape complex HTML pages if needed.

var data struct { Title string `css:"title"` }
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
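
To illustrate how a declarative scan can discover which selector belongs to which field, here is a minimal sketch that reads the css struct tags with the standard reflect package. The selectors helper is hypothetical and not part of ant; it only collects the tags and does not query any HTML.

```go
package main

import (
	"fmt"
	"reflect"
)

// selectors is a hypothetical helper (not part of ant). It returns the
// `css` struct tag of every tagged field in v, keyed by field name.
// v must be a struct or a pointer to a struct.
func selectors(v any) map[string]string {
	out := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return out
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return out
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		// Fields without a css tag are skipped.
		if sel, ok := f.Tag.Lookup("css"); ok {
			out[f.Name] = sel
		}
	}
	return out
}

func main() {
	var data struct {
		Title string `css:"title"`
	}
	fmt.Println(selectors(&data)) // map[Title:title]
}
```

A real scanner would then run each selector against the parsed document and assign the matched text to the corresponding field.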

Headless

By default the crawler uses http.Client. If you're crawling SPAs, you can use the antcdp.Client implementation, which crawls pages with a headless Chrome browser.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})

Polite

The crawler automatically fetches and caches robots.txt, so that it does not cause problems for owners of small websites. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)
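
For readers unfamiliar with what a polite crawler checks, the following is a minimal sketch of a robots.txt test using only the standard library. The disallowed and allowed helpers are hypothetical and are not ant's implementation; they handle only Disallow rules in a "User-agent: *" group and ignore Allow rules, wildcards, and multi-agent groups.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// disallowed returns the Disallow path prefixes that apply to all user
// agents ("*"). It covers only a small subset of the robots.txt format.
func disallowed(robots string) []string {
	var prefixes []string
	applies := false
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		// Strip trailing comments.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		switch key {
		case "user-agent":
			applies = val == "*"
		case "disallow":
			if applies && val != "" {
				prefixes = append(prefixes, val)
			}
		}
	}
	return prefixes
}

// allowed reports whether path is free of every disallowed prefix.
func allowed(prefixes []string, path string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\n\nUser-agent: otherbot\nDisallow: /\n"
	rules := disallowed(robots)
	fmt.Println(allowed(rules, "/index.html")) // true
	fmt.Println(allowed(rules, "/private/a"))  // false
}
```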

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL reads. If you don't have enough resources, you can reduce the number of workers instead.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1) // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com") // 5 rps on amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*") // 5 rps on URLs starting with `amazon.com.`.
ant.LimitRegexp(5, `^apple\.com/iphone/*`) // 5 rps on URLs that match the regex.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.


Matchers

Another powerful interface is ant.Matcher, which allows you to define URL matchers. The matchers are called before URLs are queued.

ant.MatchHostname("amazon.com") // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")
ant.MatchRegexp(`amazon\.com/help/.+`)
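
As a rough illustration of filtering URLs before they are queued, the sketch below implements hostname and regexp matchers with the standard library. The matcher function type and the enqueue helper are hypothetical, and matching the regexp against host plus path is an assumption carried over from the note on limiters.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// matcher is a hypothetical stand-in for a URL matcher; it is not the
// ant.Matcher interface itself.
type matcher func(u *url.URL) bool

// matchHostname accepts URLs whose hostname equals host exactly.
func matchHostname(host string) matcher {
	return func(u *url.URL) bool { return u.Hostname() == host }
}

// matchRegexp accepts URLs whose host and path match expr.
func matchRegexp(expr string) matcher {
	re := regexp.MustCompile(expr)
	return func(u *url.URL) bool { return re.MatchString(u.Host + u.Path) }
}

// enqueue appends to queue only the candidates that parse as URLs and
// are accepted by m, mirroring the "match before queueing" behavior.
func enqueue(queue []*url.URL, m matcher, candidates ...string) []*url.URL {
	for _, raw := range candidates {
		u, err := url.Parse(raw)
		if err != nil || !m(u) {
			continue
		}
		queue = append(queue, u)
	}
	return queue
}

func main() {
	q := enqueue(nil, matchRegexp(`amazon\.com/help/.+`),
		"https://amazon.com/help/returns",
		"https://amazon.com/cart",
	)
	fmt.Println(len(q)) // 1
}
```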

Robust

The crawl engine automatically retries any error that implements a Temporary() bool method and returns true from it.

Because the standard library returns errors that implement that interface, the engine retries most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers that make this easy. Here's a full crawler that extracts quotes to stdout.

package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/yields/ant"
)

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}
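
To show the shape of JSON that the struct tags above imply, here is a small sketch that encodes a page value with encoding/json. The encodePage helper and the field values are placeholders, not real scraped data, and emitting one JSON object per page is an assumption about how the built-in JSON scraper behaves.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quote and page mirror the types from the crawler example; only the
// json tags matter for this sketch.
type quote struct {
	Text string   `json:"text"`
	By   string   `json:"by"`
	Tags []string `json:"tags"`
}

type page struct {
	Quotes []quote `json:"quotes"`
}

// encodePage is a hypothetical helper that renders one page as JSON.
func encodePage(p page) (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

func main() {
	// Placeholder values, not real scraped content.
	p := page{Quotes: []quote{
		{Text: "sample text", By: "sample author", Tags: []string{"sample"}},
	}}
	s, err := encodePage(p)
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	fmt.Println(s)
	// {"quotes":[{"text":"sample text","by":"sample author","tags":["sample"]}]}
}
```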

Testing

The anttest package makes it easy to test your scraper implementation. It fetches a page by URL, caches it in the OS's temporary directory, and reuses it on later runs.

Fetch relies on the cached file's modtime, and the file expires daily. You can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")
_, err := myscraper.Scrape(ctx, page)
assert.NoError(err)