From badiff
Jump to: navigation, search


byte array diff

A pure-java byte-level diffing library for working with large inputs, taking advantage of parallel processing during optimization. badiff is pretty fast:

Test output running on my desktop, diffing two 10MB files, the original randomly generated, and the target randomly distorted from the original.

Computing and applying diffs for random files of length:10485760
Fill time:135ms
Diff computation time:39s
Diff size:3535682
Diff apply time:77ms

Source is available on github, and can be cloned with:

 git clone badiff

badiff has no external dependencies, but requires Java 1.6 or higher. badiff is released under a BSD license.

Find it on maven:


Example usage

Diffing two files, and storing the result as a file

 File orig, target, diff;
 JFileChooser chooser = new JFileChooser();
 // prompt the user for the files to diff and file to save to
 if(chooser.showOpenDialog(null) != JFileChooser.APPROVE_OPTION)
 orig = chooser.getSelectedFile();
 if(chooser.showOpenDialog(null) != JFileChooser.APPROVE_OPTION)
 target = chooser.getSelectedFile();
 if(chooser.showSaveDialog(null) != JFileChooser.APPROVE_OPTION)
 diff = chooser.getSelectedFile();
 // do the diff
 FileDiff computed = FileDiffs.diff(orig, target);
 // move the generated diff file to the dest file location

Diffing two byte arrays and storing the result as a byte array

 ByteArrayDiffs badiff = new ByteArrayDiffs(); // ByteArrayDiffs can optionally specify a serializer
 String orig = "Hello world!";
 String target = "Hellish cruel world!";
 byte[] diff = badiff.diff(orig.getBytes(), target.getBytes()); // bidirectional diff
 byte[] udiff = badiff.udiff(orig.getBytes(), target.getBytes()); // unidirectional diff


badiff uses the O(ND) algorithm described in this paper:

By default, input is run through the diffing graph in chunks of 1KB. Increasing chunk size can potentially decrease the size of the resulting diff, but the cost grows with the square of the increase; a chunk size of 2KB takes 4 times longer to compute than a chunk size of 1KB. After chunked graphing is completed the resultant edit list is post-processed to remove some obvious artifacts of chunking, such as pairs of (INSERT,DELETE) operations with identical data that can potentially occur at chunk boundaries.

Chunked graphing is done in parallel by a thread pool sized to the number of available processor cores.

Post-processing is done serially rather than in parallel, so for large input sizes (>50MB) it may be advantageous to skip the post-processing.


badiff is copyrighted ©2014 Robin Kirkman