
scrapy-boilerplate

public · 49 stars · 11 forks · 1 issue

Commits

List of commits on branch master.
9e6a9ca4f37ace4b5daad57bff5c075c50cd101b

examples: removed unneeded decorator.

rrmax committed 12 years ago
bf75c211066349df5e06ceaa8eda8fddd3006bee

added examples of multiple callbacks and inline callbacks.

rrmax committed 12 years ago
2441d5a78d81e9197a5e6890ae44a14759a853d1

fixed missing readme

rrmax committed 12 years ago
c319622db76ea99690eed9a588b035dd72923aa3

updated setup

rrmax committed 12 years ago
21bc1a33dfb03292c97102d0f4962cbca862c724

updated readme and examples.

rrmax committed 12 years ago
0dc7746743d39a3484599ec372f9baf74a533974

renamed ItemFactory to NewItem to match NewSpider function naming.

rrmax committed 12 years ago

README


==================
scrapy-boilerplate
==================

scrapy-boilerplate is a small set of utilities for Scrapy_ to simplify writing low-complexity spiders that are very common in small and one-off projects.

It requires Scrapy_ (>= 0.16) and has been tested with Python 2.7. Additionally, PyQuery_ is required to run the scripts in the examples_ directory.

.. note::

    The code is experimental, includes some magic under the hood, and might be hard to debug. If you are new to Scrapy_, don't use this code unless you are ready to debug errors that nobody has seen before.


Usage Guide
===========

Items
-----

Standard item definition:

.. code:: python

    from scrapy.item import Item, Field

    class BaseItem(Item):
        url = Field()
        crawled = Field()

    class UserItem(BaseItem):
        name = Field()
        about = Field()
        location = Field()

    class StoryItem(BaseItem):
        title = Field()
        body = Field()
        user = Field()

Becomes:

.. code:: python

    from scrapy_boilerplate import NewItem

    BaseItem = NewItem('url crawled')
    UserItem = NewItem('name about location', base_cls=BaseItem)
    StoryItem = NewItem('title body user', base_cls=BaseItem)
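
Assuming ``NewItem`` simply builds ordinary ``Item`` subclasses, as the equivalence above suggests, the generated classes can be instantiated and populated like any other Scrapy item. The snippet below is an illustrative sketch of that usage, not part of the library's documented API:

.. code:: python

    from scrapy_boilerplate import NewItem

    UserItem = NewItem('name about location')

    # Standard scrapy.item.Item behaviour: keyword construction and
    # dict-style field access (stock Scrapy, not boilerplate magic).
    user = UserItem(name='anna', location='somewhere')
    user['about'] = 'placeholder bio'
    print(user['name'])  # -> anna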

BaseSpider
----------

Standard spider definition:

.. code:: python

    from scrapy.spider import BaseSpider

    class MySpider(BaseSpider):
        name = 'my_spider'
        start_urls = ['http://example.com/latest']

        def parse(self, response):
            # do stuff
            pass

Becomes:

.. code:: python

    from scrapy_boilerplate import NewSpider

    MySpider = NewSpider('my_spider')

    @MySpider.scrape('http://example.com/latest')
    def parse(spider, response):
        # do stuff
        pass
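
To make the placeholder body a little more concrete, the sketch below fills in the callback with a PyQuery-based extraction that yields one of the items defined earlier. The ``@MySpider.scrape`` decorator and the ``(spider, response)`` signature come from the example above; the assumption that the callback can yield items like a regular Scrapy callback, as well as the CSS selectors and field names, are illustrative only:

.. code:: python

    from pyquery import PyQuery
    from scrapy_boilerplate import NewItem, NewSpider

    UserItem = NewItem('name about location')
    MySpider = NewSpider('my_spider')

    @MySpider.scrape('http://example.com/latest')
    def parse(spider, response):
        # Hypothetical selectors, chosen only to show a complete callback body.
        d = PyQuery(response.body)
        yield UserItem(
            name=d('.user .name').text(),
            about=d('.user .about').text(),
            location=d('.user .location').text(),
        )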

CrawlSpider
-----------

Standard crawl-spider definition:

.. code:: python

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MySpider(CrawlSpider):
        name = 'my_spider'
        start_urls = ['http://example.com']

        rules = (
            Rule(SgmlLinkExtractor('category\.php'), follow=True),
            Rule(SgmlLinkExtractor('item\.php'), callback='parse_item'),
        )

        def parse_item(self, response):
            # do stuff
            pass

Becomes:

.. code:: python

    from scrapy_boilerplate import NewCrawlSpider

    MySpider = NewCrawlSpider('my_spider')
    MySpider.follow('category.php')

    @MySpider.rule('item.php')
    def parse_item(spider, response):
        # do stuff
        pass
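
As with the plain spider, the rule callback can be written as an ordinary Scrapy callback body. The sketch below uses the stock ``HtmlXPathSelector`` API of that Scrapy generation to build a ``StoryItem``; the decorator usage comes from the example above, while the XPath expressions and the assumption that the callback can yield items are illustrative only:

.. code:: python

    from scrapy.selector import HtmlXPathSelector
    from scrapy_boilerplate import NewCrawlSpider, NewItem

    StoryItem = NewItem('title body user')
    MySpider = NewCrawlSpider('my_spider')
    MySpider.follow('category.php')

    @MySpider.rule('item.php')
    def parse_item(spider, response):
        # Hypothetical XPath expressions, shown only to make the callback concrete.
        # Note that extract() returns a list of matching strings.
        hxs = HtmlXPathSelector(response)
        yield StoryItem(
            title=hxs.select('//h1/text()').extract(),
            body=hxs.select('//div[@id="body"]/text()').extract(),
            user=hxs.select('//span[@class="author"]/text()').extract(),
        )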

Running Helpers
---------------

Single-spider running script:

.. code:: python

    # file: my-spider.py
    # imports omitted ...

    class MySpider(BaseSpider):
        # spider code ...
        pass

    if __name__ == '__main__':
        from scrapy_boilerplate import run_spider

        custom_settings = {
            # ...
        }
        spider = MySpider()
        run_spider(spider, custom_settings)
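
The ``custom_settings`` dict presumably takes ordinary Scrapy setting names that are applied before the crawl starts; that is an assumption about ``run_spider``, not documented behaviour. A filled-in version of the placeholder might look like this, with example values chosen purely for illustration:

.. code:: python

    if __name__ == '__main__':
        from scrapy_boilerplate import run_spider

        # Example values only; any standard Scrapy setting name should fit here.
        custom_settings = {
            'DOWNLOAD_DELAY': 0.5,
            'USER_AGENT': 'my-spider (+http://example.com)',
        }
        spider = MySpider()
        run_spider(spider, custom_settings)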

Multi-spider script with standard crawl command line options:

.. code:: python

    # file: my-crawler.py
    # imports omitted ...

    class MySpider(BaseSpider):
        name = 'my_spider'
        # spider code ...

    class OtherSpider(CrawlSpider):
        name = 'other_spider'
        # spider code ...

    if __name__ == '__main__':
        from scrapy_boilerplate import run_crawler, SpiderManager

        custom_settings = {
            # ...
        }

        SpiderManager.register(MySpider)
        SpiderManager.register(OtherSpider)

        run_crawler(custom_settings)

.. note:: See the examples_ directory for working code examples.

.. _Scrapy: http://www.scrapy.org
.. _PyQuery: http://pypi.python.org/pypi/pyquery
.. _examples: https://github.com/darkrho/scrapy-boilerplate/tree/master/examples