
Commits

List of commits on branch master.
8064ed27ffb789e87bceea5a1d7b56110c02c00e

Added this random new file that Xcode autogenerated.

wadetregaskis committed 6 years ago

d9f9f8e08dccee19b9a73520355879c4df7fd0f4

Now wait until the I/O stream is _actually_ closed, when verifying duplicity of files, to signal the concurrency limiter.

wadetregaskis committed 6 years ago

ec085faa2f9998e9e20d1b667fd99fbf4eb207a8

Moved a debug log about when duplicity verification begins to be _after_ it passes the concurrency limiter; when it's _actually_ starting the verification operation.

wadetregaskis committed 6 years ago

8792f7c5173def61cc4a973fb092499d71984c87

Fixed estimatedTimeRemaining() to correctly handle a progress of zero.

wadetregaskis committed 6 years ago

32bbaa5797a0e193a75eab5a3712f3c1a47e8abc

File comparison now uses the same configurable concurrency settings as the hashing stage.

wadetregaskis committed 6 years ago

475b43f74e31156093741fd696dd989fff92fa32

Now explicitly close the I/O objects for files being compared in compareFiles().

wadetregaskis committed 6 years ago

README

cadiff - Content-addressing diff

This is essentially 'diff' in binary mode, except that it identifies files by their content, not their filename. It lets you compare two sets of data where file names or other metadata may have changed, but all you really care about is whether the actual data within each file is duplicated or not.
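As a rough illustration of the content-addressing idea (not cadiff's actual implementation, which is Objective-C in main.m), here is a minimal Python sketch. The function names `hash_file`, `index_by_content` and `missing_from` are hypothetical, and it hashes each file in full with SHA-256, which is a simplification: cadiff itself hashes only a subset of each file by default and then verifies candidate duplicates by comparing them.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's full contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_by_content(root):
    """Map content digest -> list of paths under root with that content."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[hash_file(path)].append(path)
    return index


def missing_from(source_root, dest_root):
    """Return sorted paths under source_root whose content is absent from dest_root."""
    dest_index = index_by_content(dest_root)
    missing = []
    for digest, paths in index_by_content(source_root).items():
        if digest not in dest_index:
            missing.extend(paths)
    return sorted(missing)
```

Calling `missing_from("/Volumes/SDCARD", "/Users/me/Pictures")` (both paths are placeholders) would list the files on the card that have no content-identical copy in the library, regardless of how the copies were renamed or moved.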

It's also an interesting exercise in how heavily you can optimise for I/O performance using libdispatch. The result is not pretty, but it is effective.
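The commit log mentions a "concurrency limiter" that is signalled only once a file's I/O is actually closed. The sketch below shows that general pattern using Python threads and a semaphore rather than libdispatch; it is an assumption about the shape of the technique, not a port of cadiff's code. The name `hash_all` and the `max_in_flight` parameter are hypothetical, and each file is read in one call for brevity.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


def hash_all(paths, max_in_flight=4):
    """Hash files concurrently with at most max_in_flight files open at once."""
    limiter = threading.Semaphore(max_in_flight)

    def hash_one(path):
        # The semaphore wraps the file's whole lifetime, so a slot is
        # released only after the file has really been closed.
        with limiter:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
        return path, digest

    # More worker threads than slots, so the semaphore (not the pool size)
    # is what bounds the number of files in flight.
    with ThreadPoolExecutor(max_workers=max_in_flight * 4) as pool:
        return dict(pool.map(hash_one, paths))
```

Keeping the limit separate from the pool size makes it a single tunable setting that other stages, such as file comparison, could share.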

I (Wade Tregaskis) wrote it because I needed a tool for checking that I'd copied all my photos off my SD cards, even when the copy was made many weeks earlier and the files had since been renamed and moved around.

NOTE: this does not show you the differences between the contents of two files. This is a different kind of diff, though something along those lines is on the TODO list, below.

License

Normal 2-clause BSD. See the top of main.m for details.

TODOs

  • Add the ability to tell you how two matches differ (e.g. how their metadata is different; different names, modification dates, etc).
  • Optimise the indexing (hashing) step by doing multiple passes, i.e. first hash only the first 4K of each file, then, for all the files whose hashes conflict, hash e.g. the next 1 MiB, and so on. This is a further refinement of the existing method of hashing only a subset of the file (1 MiB by default), and might reduce overall I/O further while dealing with hash conflicts gracefully.
  • Figure out why there appears to be some kind of serialisation occurring, in batches of concurrencyLimiter size.
  • Figure out why bandwidth utilisation of the SD card slot, as a percentage, seems inversely proportional to that of the internal disks, i.e. you can get 0 MB/s and 400 MB/s respectively, or 10 MB/s and 200 MB/s, or 20 MB/s and 20 MB/s, but not the obvious 20 MB/s and 400 MB/s.
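
The multi-pass hashing TODO above could be sketched roughly as follows. This is only an illustration in Python, not code from cadiff: the names `hash_range`, `refine_candidates` and `PASS_SIZES` are hypothetical, SHA-256 is an arbitrary choice, and the pass sizes simply mirror the 4K and 1 MiB figures from the TODO. Groups that survive every pass are still only candidates and would need a full comparison to confirm.

```python
import hashlib
from collections import defaultdict

# Hypothetical pass sizes: the first 4 KiB, then the next 1 MiB.
PASS_SIZES = [4 * 1024, 1024 * 1024]


def hash_range(path, offset, length):
    """Hash up to `length` bytes of a file starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()


def refine_candidates(paths, pass_sizes=PASS_SIZES):
    """Return groups of paths whose hashes still collide after every pass."""
    groups = [list(paths)]
    offset = 0
    for length in pass_sizes:
        next_groups = []
        for group in groups:
            if len(group) < 2:
                continue  # a unique file needs no further I/O
            buckets = defaultdict(list)
            for path in group:
                buckets[hash_range(path, offset, length)].append(path)
            next_groups.extend(buckets.values())
        groups = next_groups
        offset += length
    return [sorted(group) for group in groups if len(group) > 1]
```

Files that are already unique after the first 4 KiB are never read again, which is where the I/O saving would come from.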