- M-A: [10:20] MS: The ability to detect inserted/removed blocks.
- M-A: [10:21] Oh, and whilst you're at it, a Hex editor that allows block insertion.
- Vitenka: [10:22] ability to do 'diff modulo endianness' would be helpful.
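One possible reading of 'diff modulo endianness' is to byte-swap every fixed-width word of one file before handing both files to the diff. The sketch below assumes that reading. The helper name `swap_words` and the default word width of 4 are made up for illustration and are not part of any tool mentioned on this page.

```python
def swap_words(data: bytes, width: int = 4) -> bytes:
    """Reverse the byte order inside each complete `width`-byte word.

    Hypothetical helper: a trailing partial word is copied unchanged.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    whole = len(data) - len(data) % width  # length covered by complete words
    out = bytearray()
    for i in range(0, whole, width):
        out += data[i:i + width][::-1]     # reverse this one word
    out += data[whole:]                    # leftover bytes, if any
    return bytes(out)
```

Diffing `swap_words(a)` against `b` would then ignore differences that are purely byte order, provided the word width is guessed correctly and the words are aligned to the start of the file.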
http://citeseer.ist.psu.edu/cache/papers/cs/31172/http:zSzzSzxmailserver.orgzSzdiff2.pdf/myers86ond.pdf
http://freshmeat.net/projects/xdiff-lib/
http://neil.fraser.name/writing/diff/ - I still have a long way to go to arrive at something useful. Good to have my suspicions about the Myers paper confirmed, though.
Source for a bytewise binary diff tool based on the Eugene Myers paper above is [here]. A Cygwin binary is [here]; I don't imagine the binary will work if you don't have Cygwin installed, but the source is portable. No particular guarantees are made about reliability, usefulness, clarity of source, etc. Shoving a GUI around it is left as an exercise for the Perl/Python+GTK gurus.
Todo / known problems:
- Currently, memory for the longest possible edit list is reserved. Should grow dynamically instead.
- Currently, memory for the biggest possible arrays for the edit graph progress states is reserved (though it is only initialised as necessary). These arrays are sparse and mostly unused, so we should be able to do much better.
- I have still not convinced myself that the algorithm is correct. The paper fails to mention the fact that the algorithm accesses uninitialised values of the progress state arrays V, leaves the derivations of the reverse snake search and the actual edit list as an exercise for the reader, and appears to flip between using N/M to mean "length" and "position of last character". The self-tests now pass and I *think* they give full code coverage, but I would be interested to hear from anyone who happens across a pair of files for which the output is incorrect.
- "assume files have opposite endianness" switch.
- Error handling for failed file seeks / reads.
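For anyone wanting to cross-check output, here is a minimal sketch of the basic forward-only greedy search, in Python rather than the tool's own source. It only returns the length of the shortest insert/delete script, not the script itself, and it is not the linear-space variant with the reverse snake search. The function name is made up for illustration. The progress state V is kept in a dict, so only diagonals that are actually visited take up memory, and V[1] is seeded so that the first step reads a defined value.

```python
def myers_edit_distance(a: bytes, b: bytes) -> int:
    """Number of single-byte inserts plus deletes needed to turn a into b."""
    n, m = len(a), len(b)  # n and m are lengths, not last positions
    # v[k] is the furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Extend from the neighbouring diagonal that has got further:
            # step down (insert from b) or step right (delete from a).
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the snake: a run of matching bytes costs nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    raise AssertionError("unreachable: d = n + m always suffices")
```

The result can be compared against `len(a) + len(b) - 2 * LCS(a, b)` computed by a slow but obviously correct dynamic programme, which makes a convenient oracle for small test files.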
Offset address confusion:
-1,42,77
00000020: .. .. .. .. .. .. .. .. .. .. a9 .. .. .. .. .. .......... .....
+1,43,77
00000040: .. .. .. .. .. .. .. .. .. .. .. .. .. a3 .. .. ............. ..
- For deletions, the deleted characters are displayed, so the dump is from file A. For insertions, the inserted characters are displayed, so the dump is from file B. Positions in both files are provided, so someone writing a smarter GUI can do something prettier. - MoonShadow
- I see. How about this one? --M-A
-1,1,1
00000000: .. 15 .. .. .. .. .. .. .. .. .. .. .. .. .. .. . ..............
+1,2,1
00000000: .. 38 .. .. .. .. .. .. .. .. .. .. .. .. .. .. .8..............
- The byte with the hex code 15, which is at offset 1 in file A, was deleted. The byte with the hex code 38, which is at offset 1 in file B, was inserted at file A's offset 2. Looks right to me. Well, assuming that's what actually happened. The offset given for each edit does not take other edits into account - this makes it easier to look up context for edits in an external hex editor, and adjusting for earlier edits is trivial for a GUI wrapper if it wants to. Though I should probably make it an option. - MoonShadow
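To make the offset convention concrete, here is a rough sketch of how a wrapper might replay such records to rebuild file B from file A. The record layout (sign, byte count, offset in A, offset in B) is inferred from the two examples above, not taken from the tool's source, and the names `parse_edit` and `apply_edits` are invented for illustration. Because every offset refers to the original files, the replay never has to adjust for earlier edits.

```python
def parse_edit(line: str):
    """Split a record such as '-1,42,77' into (sign, count, a_off, b_off)."""
    sign = line[0]
    if sign not in "+-":
        raise ValueError(f"unknown edit type: {line!r}")
    count, a_off, b_off = (int(field) for field in line[1:].split(","))
    return sign, count, a_off, b_off


def apply_edits(a: bytes, b: bytes, edits) -> bytes:
    """Rebuild B from A by replaying edits given in increasing order of A offset."""
    out = bytearray()
    pos = 0  # next byte of A that has not been copied or deleted yet
    for sign, count, a_off, b_off in edits:
        if a_off > pos:
            out += a[pos:a_off]  # copy the unchanged bytes before this edit
            pos = a_off
        if sign == "-":
            pos += count         # deletion: skip the bytes in A
        else:
            out += b[b_off:b_off + count]  # insertion: take the bytes from B
    out += a[pos:]               # unchanged tail of A
    return bytes(out)
```

For instance, the records `-1,1,1` and `+1,2,1` can be parsed with `parse_edit` and passed to `apply_edits` along with the two files.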