
Web Scraping: PyData Copenhagen

Tips and Tricks for web data collections using Python

From basic to advanced web scraping, these are my tips and tricks on how to gather, automate, and store web data using Python's rich ecosystem.

Do's and Don'ts of Web Scraping

  • Don't Be A Clown
  • Always read and understand 'Terms of Use'
  • Go Gentle
  • Be Open
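The "Go Gentle" rule above can be sketched in code: check a site's robots.txt before fetching anything. This is a minimal standard-library sketch; the rules below are hypothetical stand-ins for what you would normally load with `rp.set_url(...)` and `rp.read()` from a live site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed offline for illustration.
# In practice: rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
])

def allowed(url: str) -> bool:
    """Return True if a well-behaved scraper may fetch this URL."""
    return rp.can_fetch("MyScraper/1.0", url)

print(allowed("https://example.com/listings"))   # allowed
print(allowed("https://example.com/private/x"))  # disallowed
```

Pair this with a `time.sleep` between requests (honouring any `Crawl-delay`) and a descriptive `User-Agent` so site owners know who is visiting.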

Legality of Web Scraping

Twenty Years of Web Scraping and the Computer Fraud and Abuse Act

Victory! Ruling in hiQ v. Linkedin Protects Scraping of Public Data | Electronic Frontier Foundation

Talk

  • What I have learned, unlearned, and am still discovering in web scraping

  • Not everyone likes Chips & Fish: these are my opinions on dos and don'ts

Road Ahead:

Installation

Assuming that you have git and Anaconda or miniconda installed on your system:

Clone Repo:

git clone https://github.com/Proteusiq/Web-Scraping-PyData.git
cd Web-Scraping-PyData

Automatic: recreate the environment from the YAML file:

conda env create -f environment.yml
conda activate talks

Manually [only if the automatic creation above fails]:

conda create -n talks python=3.7 pandas requests beautifulsoup4 lxml selenium jupyterlab ipython

conda activate talks
conda install -c conda-forge nodejs
pip install requests_html tqdm fuzzywuzzy[speed] html5lib python-Levenshtein

Presentation

conda activate talks
cd Presentation
npm install # Needed only once
npm start

Notebooks

jupyter lab --port 8004

Navigate to the notebooks. They are numbered chronologically to walk through the presented tips and tricks.

Examples

Examples highlighting the use of the browser's Network tab to gather data.
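The Network-tab trick is to skip the HTML entirely and call the JSON endpoint the page itself uses. The endpoint and parameters below are hypothetical stand-ins for what you would copy from the browser's Network panel; the sketch only builds the request URL, it does not fetch anything.

```python
from urllib.parse import urlencode

# Hypothetical JSON endpoint, as it might appear in the Network panel.
BASE = "https://api.example.com/listings"

def build_url(page: int, page_size: int = 50) -> str:
    """Reconstruct the paginated API call a page fires behind the scenes."""
    params = {"page": page, "pageSize": page_size, "sort": "price"}
    return f"{BASE}?{urlencode(params)}"

print(build_url(1))
# https://api.example.com/listings?page=1&pageSize=50&sort=price
```

Once you have the URL pattern, looping over `page` gives you clean, structured JSON instead of brittle HTML parsing.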

Results:

Code examples:

  • bilbase.py and bilbase_api.py: how to write the same scraper with two different approaches
  • bolig_network.py: how to write a single script that captures almost all of Denmark's real-estate data
  • boliga_progress_bar.py: how to add a progress bar to web scraping
  • advance> run example.py: advanced web scraping. Build a friendly API: a single class to rule them all
  • Coming soon: logging, MongoDB, Celery, and more
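The progress-bar idea from boliga_progress_bar.py (which uses tqdm, already in the environment) can be shown with a stdlib-only sketch: wrap the iterable you are scraping and redraw a bar as each page completes. The page names here are illustrative.

```python
import sys

def progress(iterable, total):
    """Yield items while drawing a simple in-place progress bar."""
    for i, item in enumerate(iterable, start=1):
        bar = "#" * (20 * i // total)
        sys.stdout.write(f"\r[{bar:<20}] {i}/{total}")
        sys.stdout.flush()
        yield item
    sys.stdout.write("\n")

# Pretend each number is a results page being scraped.
pages = range(1, 6)
scraped = [f"page-{n}" for n in progress(pages, total=5)]
```

With tqdm the same thing is one line: `for page in tqdm(pages):` — the point is that the bar wraps the iterable, leaving the scraping loop untouched.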

You can run any example as:

cd examples
python bilbase.py
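The "single class to rule them all" idea from the advance example can be sketched as a registry: one scraper object, with per-site parsers plugged in via a decorator. Every name below is illustrative, not the repository's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Scraper:
    """One entry point; per-site parsers register themselves."""
    parsers: Dict[str, Callable[[str], List[dict]]] = field(default_factory=dict)

    def register(self, site: str):
        def wrap(func):
            self.parsers[site] = func
            return func
        return wrap

    def scrape(self, site: str, raw: str) -> List[dict]:
        return self.parsers[site](raw)

scraper = Scraper()

@scraper.register("bilbasen")
def parse_bilbasen(raw: str) -> List[dict]:
    # Stand-in parser: treat each line of `raw` as one scraped row.
    return [{"site": "bilbasen", "row": line} for line in raw.splitlines()]

rows = scraper.scrape("bilbasen", "BMW\nAudi")
```

Adding a new site is then just another `@scraper.register("...")` function; the calling code never changes.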

Release History

Coming Soon

  • 0.1.5

    • ADD: Micro-service repo
    • CHANGE: New examples and use of selenium
  • 0.1.4

    • CHANGE: Adding Node.js
    • ADD: Scraping using JavaScript
  • 0.0.1

    • Work in progress

Resources:

Awesome Web Scraping (Python)

Meta

Prayson Daniel – @proteusiq – praysonwilfred@gmail.com

Distributed under the MIT license. See LICENSE for more information.

https://github.com/praysondaniel/github-link

Contributing

  1. Fork it (https://github.com/Proteusiq/Web-Scraping-PyData/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request