DataGen

Commits

List of commits on branch master.
bb78836e0f4a159d33b4c7fa2204556afce7cba3

include wall clock time and speedup.

mmahmoud committed 13 years ago
fd7b8c7b5e1554e06845d0387275e27d04035b41

gzipfile.write() can return None, so default to 0

mmahmoud committed 13 years ago
9b511d4f85ba8a83f754fff18d8972dcdd8d9332

update decorator a bit for new return signature

mmahmoud committed 13 years ago
f41ea3998dfc7054b66d0a491fc2c55fcb4ae584

whoops, forgot a call to out_hook

mmahmoud committed 13 years ago
5ecebe535617c6572f1e88be021acf3bd39e20f5

removing non-markdown readme

mmahmoud committed 13 years ago
dc4eb84fa5890f8169d8db4cba24fb76248e27d7

Merge branch 'master' of github.com:makuro/DataGen

mmahmoud committed 13 years ago

README

DataGen

DataGen is a pretty small Python program used to generate pretty large datasets of random numbers for testing with MapReduce frameworks like Hadoop and Disco.

Beyond simply being concurrent, DataGen has had a disproportionate amount of work put into targeting S3, including boto integration and streaming gzip+md5 creation.
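To illustrate what "streaming gzip+md5 creation" can look like, here is a minimal sketch, not DataGen's actual code. It is written for Python 3 (the README targets Python 2.6+), and the names `HashingWriter` and `write_gzip_with_md5` are hypothetical. The idea is to hash the compressed bytes as they are written, so the MD5 of the finished file is available without reading it back.

```python
import gzip
import hashlib
import io


class HashingWriter:
    """Hypothetical file-like wrapper: hashes every byte passed through it."""

    def __init__(self, raw):
        self.raw = raw
        self.md5 = hashlib.md5()
        self.size = 0

    def write(self, data):
        # Called by GzipFile with already-compressed bytes.
        self.md5.update(data)
        self.size += len(data)
        return self.raw.write(data)

    def flush(self):
        self.raw.flush()


def write_gzip_with_md5(lines, raw):
    """Gzip `lines` into `raw`; return (md5 hex digest, compressed size)."""
    sink = HashingWriter(raw)
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        for line in lines:
            gz.write(line.encode("ascii"))
    # Closing the GzipFile wrote the trailer through `sink`, so the
    # digest now covers the complete compressed stream.
    return sink.md5.hexdigest(), sink.size


if __name__ == "__main__":
    buf = io.BytesIO()
    digest, size = write_gzip_with_md5(["1\t2\n", "3\t4\n"], buf)
    print(digest, size)
```

Here `raw` is any binary file object, such as one opened with `open(path, "wb")`. How DataGen passes the digest to boto is not shown in the README and is left out here.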

There are several configuration options, most of them available at the top of main.py. Python may not be the fastest language for CPU-bound tasks, but in this case, serviceable and readily available concurrency in the standard library outweighed other optimization strategies.

Usage

To run the program, navigate to DataGen/src and run something like:

python main.py 20000 25 test

Provided you are using Python 2.6 or higher, this should generate 25 gzipped TSV files of 20,000 lines each, all with names starting with 'test', in your current directory. If you have boto installed and your AWS environment configured properly, the files will be uploaded and removed as they are generated.
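As a rough illustration of what the command above does, here is a sketch of generating several gzipped TSV files concurrently using only the standard library. It is not the code in main.py. It is written for Python 3, and the column count, value range, file naming scheme, and the names `write_one`, `generate`, and `main` are all assumptions made for the example. The S3 upload step is omitted.

```python
import gzip
import os
import random
from concurrent.futures import ProcessPoolExecutor

COLUMNS = 3  # assumed; the README does not say how many columns a row has


def write_one(job):
    """Write one gzipped TSV file of random integers and return its path."""
    path, n_lines, seed = job
    rng = random.Random(seed)  # per-file generator so workers are independent
    with gzip.open(path, "wt", encoding="ascii", newline="\n") as out:
        for _ in range(n_lines):
            row = (str(rng.randrange(1_000_000)) for _ in range(COLUMNS))
            out.write("\t".join(row) + "\n")
    return path


def generate(n_lines, n_files, prefix, out_dir="."):
    """Create n_files files in parallel, one worker process per CPU by default."""
    jobs = [
        (os.path.join(out_dir, "%s_%04d.tsv.gz" % (prefix, i)), n_lines, i)
        for i in range(n_files)
    ]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(write_one, jobs))


def main(argv):
    """Mirror the argument order of the command above: lines, files, prefix."""
    n_lines, n_files, prefix = int(argv[1]), int(argv[2]), argv[3]
    for path in generate(n_lines, n_files, prefix):
        print(path)
```

To run this as a script, call `main(sys.argv)` under an `if __name__ == "__main__":` guard, which process pools require on platforms that start workers by re-importing the module.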

TODO

  • Try cStringIO-based streaming gzip
  • Try LFSR-based random number generation
  • PyPy tests
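For the LFSR item, a minimal sketch of what such a generator could look like, assuming a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11. The tap choice, the seed, and the name `lfsr16` are illustrative and not taken from the DataGen source. This configuration cycles through all 65,535 non-zero 16-bit states before repeating. It is far weaker statistically than Python's `random` module, so it would only suit test data where speed matters more than quality.

```python
def lfsr16(seed=0xACE1):
    """Yield successive states of a 16-bit Fibonacci LFSR (illustrative taps)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero; an all-zero LFSR never changes")
    while True:
        # XOR the tapped bits (16, 14, 13, 11 counted from the left)
        # to form the new high bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


if __name__ == "__main__":
    gen = lfsr16()
    print("\t".join(str(next(gen)) for _ in range(3)))
```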