scrapy-selenium-demo

Commits

List of commits on branch master.
02f01d6b5636b31dc3d937995cc7093be4d440cb (Unverified)
added quitting chrome
harrywang committed 5 years ago

194778e1a30d48ea08dcc088ab7aa640acd3ac90 (Unverified)
added ProxyMesh support
harrywang committed 5 years ago

e15384c5f99b2534918477d7a2bafe8aa7f69d15 (Unverified)
first version
harrywang committed 5 years ago

910d37de3300980ee56beaeeb635dd5393c3f1ec (Verified)
Initial commit
harrywang committed 5 years ago

README

Scrapy + Selenium Demo

This repo contains the code for Part V of my tutorial: A Minimalist End-to-End Scrapy Tutorial (https://medium.com/p/11e350bcdec0).

The website to crawl is https://dribbble.com/designers, a page that loads more results through infinite scrolling.

I borrowed some code from "Web Scraping: A Less Brief Overview of Scrapy and Selenium, Part II" - many thanks to the author!

Setup

Tested with Python 3.6 in a virtual environment:

$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

ChromeDriver:

Download ChromeDriver from https://chromedriver.chromium.org/downloads

Note: the driver version must match the version of Chrome installed on your machine, or the spider will not start.

For example, this repo uses ChromeDriver 77.0.3865.40, which supports Chrome 77, so the installed Chrome must be version 77 (check it under Menu --> Chrome --> About Google Chrome).
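
The version requirement above can be checked with a small script. The following is only a sketch and not part of this repo: the helper names major_version and versions_match are hypothetical, and it assumes you paste in the text printed by `chromedriver --version` and the version shown in About Google Chrome. It compares major versions only.

```python
import re


def major_version(version_text):
    """Return the major number of the first dotted version found in the text."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_text)
    if match is None:
        raise ValueError("no version number found in %r" % version_text)
    return int(match.group(1))


def versions_match(driver_text, chrome_text):
    """True when ChromeDriver and Chrome report the same major version."""
    return major_version(driver_text) == major_version(chrome_text)


if __name__ == "__main__":
    # Sample strings for illustration; replace them with your own output.
    driver_text = "ChromeDriver 77.0.3865.40"
    chrome_text = "Google Chrome 77.0.3865.90"
    if versions_match(driver_text, chrome_text):
        print("ChromeDriver and Chrome major versions match")
    else:
        print("Version mismatch: download a matching ChromeDriver")
```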

Run

Run scrapy crawl dribbble, which should start an instance of Chrome and scroll to the bottom of the page automatically. The extracted data is logged to the console.
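
For readers curious how the automatic scrolling can work, here is a minimal sketch of a common approach. It may differ from the spider in this repo: the function name scroll_to_bottom and its parameters pause_seconds and max_rounds are assumptions made for illustration. It takes any Selenium WebDriver (or an object with the same execute_script method) and scrolls until the page height stops growing.

```python
import time


def scroll_to_bottom(driver, pause_seconds=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    max_rounds is a safety cap so a page that never stops loading
    cannot keep the loop running forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom, which triggers loading of more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause_seconds)  # give the new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is treated as the bottom
        last_height = new_height
    return rounds
```

With a real browser, call scroll_to_bottom(driver) after driver.get(...), and call driver.quit() in a finally block so that Chrome is closed even if the crawl fails.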

Use ProxyMesh with Scrapy

You must set the http_proxy environment variable, then activate the HttpProxyMiddleware.

For HTTP:

$ export http_proxy=http://USERNAME:PASSWORD@HOST:PORT

such as:

$ export http_proxy=http://harrywang:mypassword@us-wa.proxymesh.com:31280

For HTTPS:

For HTTPS requests, use IP authentication and remove USERNAME:PASSWORD@ from the http_proxy variable.
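
The two forms of the proxy URL described above (with credentials, and without them for IP authentication) can be composed as in the following sketch. The helper build_proxy_url is hypothetical and not part of this repo. Percent-encoding the credentials is an assumption that only matters when the username or password contains characters such as @ or :.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Build a value for the http_proxy environment variable.

    With no credentials the URL is suitable for IP authentication;
    with both credentials it embeds USERNAME:PASSWORD@ before the host.
    """
    if username is None and password is None:
        return "http://%s:%d" % (host, port)
    if username is None or password is None:
        raise ValueError("username and password must be given together")
    # Percent-encode so characters like '@' or ':' cannot break the URL.
    return "http://%s:%s@%s:%d" % (
        quote(username, safe=""),
        quote(password, safe=""),
        host,
        port,
    )


if __name__ == "__main__":
    # Prints the value to paste after `export http_proxy=`.
    print(build_proxy_url("us-wa.proxymesh.com", 31280, "harrywang", "mypassword"))
    print(build_proxy_url("us-wa.proxymesh.com", 31280))
```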

To activate the HttpProxyMiddleware, uncomment the following part in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}

Use ProxyMesh with Selenium

IP authentication must be set up first: add the IP address of the machine running this script to your ProxyMesh account. Then uncomment the following two lines in the spider file.

# PROXY = "us-wa.proxymesh.com:31280"
# chrome_options.add_argument('--proxy-server=%s' % PROXY)
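
Once uncommented, those two lines fit into driver creation roughly as in the sketch below. It is an illustration under assumptions, not the spider's actual code: the helpers proxy_argument and make_proxied_driver are hypothetical, the selenium package must be installed, and the options= keyword assumes a Selenium release that accepts it (older releases used chrome_options= instead).

```python
PROXY = "us-wa.proxymesh.com:31280"


def proxy_argument(proxy):
    """Format the Chrome command-line switch that routes traffic via a proxy."""
    return "--proxy-server=%s" % proxy


def make_proxied_driver(proxy=PROXY):
    """Start Chrome through Selenium with the proxy switch applied."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(proxy_argument(proxy))
    # Assumes chromedriver can be found by Selenium (for example, on PATH).
    return webdriver.Chrome(options=chrome_options)
```

Because the switch carries no username or password, the proxy can only accept the connection through IP authentication, which is why that step comes first.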