GitXplorerGitXplorer
l

sumdiff

public
0 stars
0 forks
0 issues

Commits

List of commits on branch main.
Unverified
1dd6073b4d12272c67e6bfef6366e0180fd286f6

Update a few comments and address a TODO in kc.

llukesandberg committed 9 months ago
Unverified
10523efd5fbe4ac698962957ab7c7bacc7217e1c

Apply an insight from the hyyro paper to progressively restrict the range of matches that are considered

llukesandberg committed 9 months ago
Unverified
90247f3b22c40c22a787bdde713044b8524e42aa

Address a few TODOs

llukesandberg committed 9 months ago
Unverified
d53bea20d60956795b6c2bcc9e3c6bae72ec1c45

Fix a `cargo clippy` issue

llukesandberg committed 9 months ago
Unverified
352843488ca267d6c56ea3395b6ed8c2d2201941

Add an implementation of the O(NP) Wu-Manber-Meyers LCS algorithm

llukesandberg committed 9 months ago
Unverified
6631a57d02adb872d2032db8bba3207a0ed84abd

Ammend a few comments

llukesandberg committed 9 months ago

README

The README file for this repository.

sumdiff

A tool to summarize large diffs

To aid in large refactorings it is necessary to understand the diff being applied while avoiding the burden of reviewing every line.

Approach

The high level approach is to identify the crux of the changes and to ignore irrelevant textual details.

  1. Identify the changing parts of files (aka run diff)
  2. Tokenize changing parts to understand lexical context
    • requires a language sensitive lexer
    • and a fallback lexer for ambiguous inputs (or unsupported languages)
  3. Run diff again
  4. Break the edit script into parts
  5. Select and print the most popular edits that cover all changes.

Huh?

What we are attempting to do is identify logical and prevalent changes. For a set of scripted edits we can expect uninformity which should allow a small set of 'logical' edits to cover all the actual edits.

Examples

TODO(luke): get things working first

Implementation questions and decisions

Utf8 Validation?

Should we validate our inputs as utf-8? It seems reasonable however experience has shown that debugging a utf8 rejection error is difficult. So on the one hand we could just make sure the error messages are useful, but on the other we could declare that this isn't our problem.

For the purposes of a diff tool we should assume that the inputs are useful to someone (they bothered to ask us to compare them), so rejecting them isn't particularly useful.

Struct of arrays vs array of structs?

When 'parsing' files we want to track metadata bout the data such as file offset, token length, AST depth, etc. It would be tempting to construct a struct Token to represent this but instead we have chosen a 'struct of arrays' apporach and store metadata in parallel datastructures.

Many algorithm runtimes will be dominated by the cost of scanning through token lists, while querying data about offset and depth is rarer, by moving that data to be stored elsewhere we slightly complicate some output algorithms in exchange for improving memory locality of our diff algorithms.

Why compare Meyers, Dijkstra and the Kuo and Cross LCS algorithms?

The meyers and KC papers both have this fun detail where they criticize other algorithms, the criticisms are fair but tickled my curiousity. Furthermore, the Kuo and Cross algorithm facinated me.

  • Kuo and Cross: This paper had some clear opportunities to improve performance (binary search, counting sort, caching offsets, compressing matches), and required some additional work over the paper to actually produce an LCS instead of a length. By working on these I was able to drop an nlogn term from the complexity, mitigate the R term but was unable to escape the NL term. For standard edits (e.g. code diffs), the LCS is proportional to the length of the text so fundamentally we are still implementing a quadratic algorithm, though admittedly many of the constant factors are small.

  • Dijkstra: the Meyers paper discusses that improving the performance of dijkstra on edit graphs requires 'complex linked list and bucket queues'. Indeed the literature on these (Radix Queues, Fibonnaci Queues, Dials Algorithm) is complex. But edit graphs are very simple with all edge weights as 0 or 1, thus I was able to use a simple circular buffer (a VecDeque) as a priority queue, leading to a trivial O(1) solution. The remaining problem is simply the datastructure management. The size of our predecessor datastructures scale with the number of nodes visited. In principle this is O(ND) which is much worse than Meyers.

    • To do better I believe would require adopting more techniques from meyers, e.g. an explicit representation of the 'frontier diagonals'.

After working on all 3 it is clear that the criticisms of Meyers are completely accurate (except for the thing about 'complex priority queues'). The KC algorithm only becomes competitive when the alphabet size becomes large and the number of edits is also large (aka D is large), and for dijkstra to become competitive does require stealing representation techniques from Meyers.

The key insight of Meyers is to target D (the number of edits) instead of L (the LCS), which is useful for standard diff scenarios where our expecation is that D is small.

TODO

SIMD

The counting sort implementation requires a prefix-sum computation, this can be accomplished more quickly using SIMD a clear TODO, however since std::simd is unstable this is going to be complex.