# Data-Science-From-Scratch

## Getting Data

### Reading Files

By reading from stdin and writing to stdout, it's easy to create Unix-like utilities for text processing and pipe them into each other. For example, to count the number of lines in a file that contain digits:

```sh
cat someFile.txt | python egrep.py "[0-9]" | python line_count.py
```
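The two scripts themselves aren't shown here; a minimal sketch of what they might contain (the implementations are assumptions based only on the command above):

```python
import re
import sys

def egrep(lines, pattern):
    """Yield only the lines that contain a match for the regex `pattern`."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

def line_count(lines):
    """Count the lines in an iterable."""
    return sum(1 for _ in lines)

# In egrep.py you would write:      for line in egrep(sys.stdin, sys.argv[1]): sys.stdout.write(line)
# In line_count.py you would write: print(line_count(sys.stdin))
```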

### Web Scraping

Be sure you've set up a virtual environment, then use BeautifulSoup, Requests, and html5lib. Please check a page's robots.txt and terms of service before you scrape it.

```python
import requests
from bs4 import BeautifulSoup
```

```python
source_code_of_a_webpage = BeautifulSoup(requests.get(url_of_page).text, 'html5lib')
```
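Once you have a soup object you can query it for tags. A small sketch (the function name is made up; it defaults to the stdlib `html.parser` for portability, but you can pass `'html5lib'` for messier pages as suggested above):

```python
from bs4 import BeautifulSoup

def scrape_paragraphs(html, parser="html.parser"):
    """Return the text of every <p> tag in the given HTML."""
    soup = BeautifulSoup(html, parser)
    return [p.get_text() for p in soup.find_all("p")]
```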

When you're working with JSON, deserialize it into a dictionary and work with that:

```python
import json
deserialized = json.loads(serialized_json)
```
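For example (the sample payload is made up):

```python
import json

# A made-up JSON payload for illustration
serialized_json = '{"name": "data-science", "stars": 27}'

deserialized = json.loads(serialized_json)  # -> a plain Python dict
print(deserialized["stars"])                # prints 27
```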

### Twitter API

Get credentials at https://apps.twitter.com . Please don't check your consumer key and secret key into your repo.

`Web-Scraping/twitter.py` takes the credentials as command-line arguments so they stay out of the source:

```sh
twitter.py CONSUMER_KEY SECRET_KEY
```
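One way to keep the keys out of the code is to read them from `sys.argv`, falling back to environment variables. This is a sketch of that pattern, not the repo's actual `twitter.py`; the function and variable names are assumptions:

```python
import os
import sys

def load_credentials(argv):
    """Return (consumer_key, secret_key) from the command line if given,
    otherwise from environment variables, so keys never live in source control.
    (A sketch of the pattern, not the repo's actual twitter.py.)"""
    if len(argv) >= 3:
        return argv[1], argv[2]
    return os.environ["CONSUMER_KEY"], os.environ["SECRET_KEY"]
```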

## Working with data

The scale of a feature shouldn't affect its predictive power, so it's usually a good idea to rescale your dataset so that each column has mean 0 and variance 1:

```python
return (data_matrix[i, j] - means[j]) / stdevs[j]
```
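That line is the core of a full rescaling routine; a self-contained sketch in plain Python (the helper names are assumptions, and columns with zero spread are left unchanged to avoid dividing by zero):

```python
def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    """Population standard deviation."""
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def rescale(data_matrix):
    """Return a copy of data_matrix (a list of rows) in which each column
    has mean 0 and standard deviation 1; zero-spread columns are left alone."""
    num_cols = len(data_matrix[0])
    means  = [mean([row[j] for row in data_matrix]) for j in range(num_cols)]
    stdevs = [stdev([row[j] for row in data_matrix]) for j in range(num_cols)]
    return [[(row[j] - means[j]) / stdevs[j] if stdevs[j] > 0 else row[j]
             for j in range(num_cols)]
            for row in data_matrix]
```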

You don't want your parser to crash when a value isn't in the expected format; wrap the conversion so failures become `None`:

```python
def try_or_none(f):
    """Wrap f so that bad inputs produce None instead of raising."""
    def f_or_none(x):
        try:
            return f(x)
        except:
            return None
    return f_or_none
```
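For example, applying it while parsing a row of strings (the row, the parsers, and `parse_row` are made up for illustration; `try_or_none` is repeated so the snippet runs on its own):

```python
# try_or_none as defined above, repeated so this snippet is self-contained
def try_or_none(f):
    def f_or_none(x):
        try:
            return f(x)
        except:
            return None
    return f_or_none

def parse_row(row, parsers):
    """Apply each parser to its field, turning failures into None."""
    return [try_or_none(parse)(value) for value, parse in zip(row, parsers)]

row = ["2015-06-01", "90.3", "n/a"]           # a made-up CSV row
parsed = parse_row(row, [str, float, float])  # -> ["2015-06-01", 90.3, None]
```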