Tips and Tricks for Web Data Collection Using Python
From basic to advanced web scraping, these are my tips and tricks on how to gather, automate, and store web data using Python's rich ecosystem.
- Don't Be A Clown
- Always read and understand 'Terms of Use'
- Go Gentle
- Be Open
Twenty Years of Web Scraping and the Computer Fraud and Abuse Act
Victory! Ruling in hiQ v. Linkedin Protects Scraping of Public Data | Electronic Frontier Foundation
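The "Go Gentle" rule above can be sketched in code: identify yourself with a User-Agent and pause between requests so you never hammer the target site. This is a minimal illustrative sketch, not code from this repo; the User-Agent string and the two-second delay are assumptions you should adapt to the site's Terms of Use and robots.txt.

```python
# A minimal "go gentle" sketch: reuse one session, announce who you
# are, and sleep between requests so the site is not overloaded.
import time
import requests

HEADERS = {"User-Agent": "data-collection-demo (contact: you@example.com)"}
DELAY_SECONDS = 2  # assumed polite delay; check the site's robots.txt

def fetch_gently(urls, delay=DELAY_SECONDS):
    """Fetch each URL in turn, pausing between requests."""
    pages = []
    with requests.Session() as session:
        session.headers.update(HEADERS)
        for url in urls:
            response = session.get(url, timeout=10)
            response.raise_for_status()  # fail loudly on 4xx/5xx
            pages.append(response.text)
            time.sleep(delay)  # be gentle: one request every few seconds
    return pages
```

A `requests.Session` also reuses the underlying TCP connection, which is both faster for you and lighter on the server.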
- What I have learned, unlearned, and am still discovering in web scraping
- Not everyone likes Chips & Fish: these are my opinions on do's and don'ts
Road Ahead:
- Basics: Leaving the Basics Behind
- Better: It's simple, but not always
- Advanced: Web Scraping Micro-services
Assuming that you have git and Anaconda or Miniconda installed on your system:
Clone Repo:
git clone https://github.com/Proteusiq/Web-Scraping-PyData.git
cd Web-Scraping-PyData
conda env create -f environment.yml
conda activate talks
If the automatic creation of the environment above fails, you can create it manually:
conda create -n talks python=3.7 pandas requests beautifulsoup4 lxml selenium jupyterlab ipython
conda activate talks
conda install -c conda-forge nodejs
pip install requests_html tqdm fuzzywuzzy[speed] html5lib python-Levenshtein
conda activate talks
cd Presentation
npm install # Needed only once
npm start
jupyter lab --port 8004
Navigate to the notebooks. They are chronologically numbered to walk through the presented tips and tricks.
Examples highlighting the use of the browser's Network tab to gather data.
Code examples:
- bilbase.py and bilbase_api.py: how to write the same scraper with two different approaches
- bolig_network.py: how to write a single script that captures almost all of Denmark's real-estate data
- boliga_progress_bar.py: how to add a progress bar to web scraping
- advance > example.py: advanced web scraping. Build a friendly API: a single class to rule them all. Coming soon: logging, MongoDB, Celery, and more
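A hypothetical sketch of the two approaches contrasted above: parsing the rendered HTML with BeautifulSoup versus reading the JSON endpoint the page itself calls (found via the browser's Network tab). The selector, the JSON keys, and the sample data below are made up for illustration; they are not taken from the repo's actual scripts.

```python
# Approach 1 vs. Approach 2: same data, two sources.
from bs4 import BeautifulSoup

def titles_from_html(html):
    """Approach 1: extract listing titles from raw HTML markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.listing-title")]

def titles_from_api(payload):
    """Approach 2: read the same titles from a JSON API response."""
    return [item["title"] for item in payload.get("listings", [])]

# Demo on fabricated data (no network call needed):
html_titles = titles_from_html('<h2 class="listing-title"> BMW 320d </h2>')
api_titles = titles_from_api({"listings": [{"title": "BMW 320d"}]})
```

The API approach is usually faster and more robust (no brittle CSS selectors), which is why it is worth checking the Network tab before reaching for an HTML parser.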
You can run any example as:
cd examples
python bilbase.py
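Adding a progress bar, as boliga_progress_bar.py does, boils down to wrapping the loop over URLs with tqdm. This sketch uses a fake fetcher so it runs without network access; the URL list and fetcher are placeholders, not the repo's real code.

```python
# Wrap any iterable of URLs with tqdm to get a live progress bar.
from tqdm import tqdm

def scrape_all(urls, fetch):
    """Fetch every URL, showing progress as pages complete."""
    results = []
    for url in tqdm(urls, desc="Scraping", unit="page"):
        results.append(fetch(url))
    return results

# Demo with a stand-in fetcher (swap in a real requests call):
pages = scrape_all(
    [f"https://example.com/page/{i}" for i in range(5)],
    fetch=lambda url: url.upper(),
)
```

Because tqdm only wraps the iterable, the scraping logic itself stays unchanged.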
Coming Soon
0.1.5
- ADD: Micro-service repo
- CHANGE: New examples and use of Selenium

0.1.4
- CHANGE: Adding Node.js
- ADD: Scraping using JavaScript

0.0.1
- Work in progress
Prayson Daniel – @proteusiq – praysonwilfred@gmail.com
Distributed under the MIT license. See LICENSE for more information.
https://github.com/praysondaniel/github-link
- Fork it (https://github.com/Proteusiq/Web-Scraping-PyData/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request