Pyncd

A Python powered normalized compression distance (NCD) calculator.

All data are created equal but some data are more alike than others. The NCD expresses this alikeness with a similarity metric based on compression. It is a non-negative number 0 ≤ r ≤ 1 + ε representing how different the two files are, where ε is a small error caused by imperfections of real-world compressors. Smaller numbers represent more similar files. The NCD is parameter-free: it uses no features or background knowledge about the data, and it can be applied without changes to different areas and across area boundaries.
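
For reference, the paper listed under References defines the distance purely in terms of compressed lengths. Writing C(x) for the compressed size of x and xy for the concatenation of x and y (notation chosen here for illustration):

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

Identical inputs give a value near 0 because C(xx) is only slightly larger than C(x). Unrelated inputs give a value near 1 because C(xy) approaches C(x) + C(y).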

Usage

$ git clone git@github.com:alephmelo/pyncd.git
$ cd pyncd
$ python pyncd.py <file1> <file2>
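
As a rough sketch of what a two-file command like this might do internally (this is not the repository's actual pyncd.py): read both files as bytes, compress each one and their concatenation, and print the distance. The compressor (zlib), the function names `compressed_size`, `ncd`, and `main`, and the two-decimal output format are all assumptions made for illustration.

```python
import argparse
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (zlib is an assumed choice)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    # zlib output is never empty, so the denominator cannot be zero.
    return (cxy - min(cx, cy)) / max(cx, cy)


def main(argv=None) -> float:
    """Parse two file paths, print their NCD, and return it."""
    parser = argparse.ArgumentParser(description="NCD between two files")
    parser.add_argument("file1")
    parser.add_argument("file2")
    args = parser.parse_args(argv)
    # Binary mode: the NCD works on raw bytes, whatever the file format.
    with open(args.file1, "rb") as f1, open(args.file2, "rb") as f2:
        distance = ncd(f1.read(), f2.read())
    print(f"{distance:.2f}")
    return distance
```

Calling `main()` from an `if __name__ == "__main__":` guard would make the script read its two paths from the command line.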

Examples

Let's measure the normalized compression distance between a square image and itself.

x = open('examples/imgs/square.png', 'rb').read() # square image
y = open('examples/imgs/square.png', 'rb').read() # square image
$ python pyncd.py
0.02

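The NCD is almost 0, so the two files are very much alike. It is not exactly 0 even though both inputs are the same image, because real compressors are imperfect: compressing a file concatenated with itself still costs slightly more than compressing the file once, and how much more depends on the compression algorithm and the kind of file.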
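
To see this effect in isolation, the sketch below compares a byte string with itself under three standard-library compressors. The README does not say which compressor pyncd uses, so the selection of zlib, bz2, and lzma, as well as the names `ncd` and `self_distances`, are assumptions for illustration.

```python
import bz2
import lzma
import zlib

# Assumed set of compressors; any function mapping bytes to bytes would do.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def ncd(x: bytes, y: bytes, compress) -> float:
    """NCD of x and y using the given compress(bytes) -> bytes function."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def self_distances(data: bytes) -> dict:
    """NCD of data against itself, once per compressor."""
    return {name: ncd(data, data, fn) for name, fn in COMPRESSORS.items()}


if __name__ == "__main__":
    sample = b"All data are created equal but some data are more alike. " * 50
    for name, distance in self_distances(sample).items():
        print(f"{name}: {distance:.2f}")
```

The printed values typically differ from one compressor to the next, which is one reason the same pair of files can score slightly differently depending on the algorithm behind the NCD.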

Next, let's measure the distance between a square image and a rectangle image.

x = open('examples/imgs/square.png', 'rb').read() # square image
y = open('examples/imgs/rectangle.png', 'rb').read() # rectangle image
$ python pyncd.py
0.53

The NCD is 0.53: the files are noticeably less alike than in the first example.

Finally, let's measure the distance between a square image and a circle image.

x = open('examples/imgs/square.png', 'rb').read() # square image
y = open('examples/imgs/circle.png', 'rb').read() # circle image
$ python pyncd.py
0.80

The NCD between the files is now 0.80, the largest distance of the three examples.

These are good results considering that the NCD never looks inside the files: it relies only on their compressed sizes. Think big picture!

To-do

  • [ ] Implement setup.py
  • [x] Be like $ pyncd <file1> <file2>
  • [ ] Support directories and create distance matrix.

References

  • Rudi Cilibrasi and Paul M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.