# Data-Science-From-Scratch

## Getting Data

### Reading Files

By using stdin and stdout, it's easy to create Unix-like utilities for text processing and pipe them into each other. For example, to count the number of lines in a file that contain a digit:

    cat someFile.txt | python egrep.py "[0-9]" | python line_count.py
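The two scripts in that pipeline aren't shown in the README, but their cores might look like this (a sketch; the function names are assumptions, and in the real scripts each would read `sys.stdin` and write to `sys.stdout`):

```python
import re

def matching_lines(lines, pattern):
    """Yield only the lines matching the regex pattern (egrep.py's job)."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

def count_lines(lines):
    """Count the lines in an iterable such as sys.stdin (line_count.py's job)."""
    return sum(1 for _ in lines)
```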

### Web Scraping

Be sure you've set up a virtual environment, then just use BeautifulSoup, Requests, and html5lib. Please check a page's robots.txt and terms of service before you do something like this.

    import requests
    from bs4 import BeautifulSoup

    source_code_of_a_webpage = BeautifulSoup(requests.get(url_of_page).text, 'html5lib')
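Once the page is parsed, pulling data out is one-liner territory. For example, collecting every link (a sketch; `links_on_page` is a hypothetical helper, not something from this repo):

```python
import requests
from bs4 import BeautifulSoup

def links_on_page(html):
    """Return the href of every <a> tag in an HTML document."""
    soup = BeautifulSoup(html, 'html5lib')
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

# links = links_on_page(requests.get(url_of_page).text)
```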

When you're working with JSON, deserialize it into a dictionary and be happy:

    import json

    deserialized = json.loads(serialized_json)
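For example, round-tripping a small record (the values here are made up for illustration):

```python
import json

serialized_json = '{"name": "Data Science", "stars": 27, "topics": ["knn", "sql"]}'
deserialized = json.loads(serialized_json)   # now a plain dict
# json.dumps turns it back into a string, and the round trip is lossless
assert json.loads(json.dumps(deserialized)) == deserialized
```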

### Twitter API

Get some credentials at https://apps.twitter.com. Please don't check your consumer_key and secret_key into your repo; pass them in on the command line instead:

Web-Scraping/twitter.py

    twitter.py CONSUMER_KEY SECRET_KEY
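Inside the script, the keys can be read off the command line so they never appear in source control (a sketch; `read_credentials` is a hypothetical helper, and the real twitter.py may be organized differently):

```python
import sys

def read_credentials(argv):
    """Pull CONSUMER_KEY and SECRET_KEY off the command line so they
    never have to live in the source file."""
    if len(argv) < 3:
        raise SystemExit("usage: twitter.py CONSUMER_KEY SECRET_KEY")
    return argv[1], argv[2]

# consumer_key, secret_key = read_credentials(sys.argv)
```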

## Working with data

The size of the range of a feature should not affect its predictive power, so it's usually a good idea to rescale your dataset so that each feature has mean 0 and variance 1. For each entry, that means computing:

    return (data_matrix[i,j] - means[j]) / stdevs[j]
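That return line is the heart of a rescale step. A fuller sketch with NumPy (the means/stdevs names follow the snippet above; the zero-stdev guard is an addition, assuming columns are features):

```python
import numpy as np

def rescale(data_matrix):
    """Rescale each column to mean 0 and standard deviation 1.
    Columns with zero spread are left untouched to avoid dividing by 0."""
    means = data_matrix.mean(axis=0)
    stdevs = data_matrix.std(axis=0)
    safe_stdevs = np.where(stdevs == 0, 1, stdevs)
    return (data_matrix - means) / safe_stdevs
```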

You don't want your parser to break if some value is unexpected:

    def try_or_none(f):
        """Wrap f so that bad input yields None instead of an exception."""
        def f_or_none(x):
            try:
                return f(x)
            except:
                return None
        return f_or_none
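Typical use: parsing a CSV row where some fields may be malformed (`parse_row` is a hypothetical helper; `try_or_none` is restated, with the except clause returning None, so the snippet runs standalone):

```python
def try_or_none(f):
    """Wrap f so that bad input yields None instead of an exception."""
    def f_or_none(x):
        try:
            return f(x)
        except:
            return None
    return f_or_none

def parse_row(row, parsers):
    """Apply one parser per field, turning unparseable values into None."""
    return [try_or_none(parser)(value) for value, parser in zip(row, parsers)]
```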