So, currently procrastinating.
I currently have automated backups running on my home computer, which generate grandfather-father-son style tar archives. That is, I have for example:
5989160960 Feb 13 02:15 0_1326245236.tar
3200050564 Feb 13 02:15 0_1326245236.tbz
686940160 Feb 17 01:27 1_1329442036.tar
412468098 Feb 17 01:27 1_1329442036.tbz
222167040 Mar 7 01:27 2_1331083636.tar
129352636 Mar 7 01:27 2_1331083636.tbz
Now, bzip2-ing them doesn't seem to be *that* effective. Moreover, the tar seems to blindly copy in an entire file if it has changed in the slightest. I back up a lot of files that get minor changes (such as mailboxes and log files), so some sort of binary diff should help out a lot.
Alas, xdelta3 doesn't seem to do a good job, as it only has a search window of 64MB, and therefore is very much a linear-only diff tool. What I want to create is a generic diff-compression utility that you give the file to compress, plus a few extra files for added context, and it will use redundancy from everywhere in those files, regardless of how big they are. The established (by rsync and rzip) method of doing this is with a rolling checksum. Those incremental backups should compress very nicely to a tiny fraction of their original size.
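(For reference, the rsync-style rolling checksum works by keeping two running sums that can be slid along the file one byte at a time in O(1), so a checksum can be computed at every offset cheaply. A minimal Java sketch - the 16-bit modulus, block handling, and names here are illustrative, not taken from the actual program:)

```java
// Illustrative rsync-style rolling checksum: two 16-bit sums, sliding in O(1).
public final class RollingChecksum {
    private static final int M = 1 << 16;
    private final int n;  // window (block) size
    private int a, b;

    public RollingChecksum(byte[] block) {
        n = block.length;
        for (int i = 0; i < n; i++) {
            int v = block[i] & 0xff;
            a = (a + v) % M;              // plain sum of bytes in the window
            b = (b + (n - i) * v) % M;    // position-weighted sum
        }
    }

    /** Slide the window one byte: drop the leaving byte, add the entering one. */
    public void roll(byte leaving, byte entering) {
        int out = leaving & 0xff, in = entering & 0xff;
        a = Math.floorMod(a - out + in, M);
        b = Math.floorMod(b - n * out + a, M);
    }

    /** Combined 32-bit value, used as the cheap first-stage filter. */
    public int value() { return (b << 16) | a; }
}
```

(A hit on this weak checksum is only a candidate; rsync confirms candidates with a strong hash before trusting them, and the same validation step is needed here.)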
So, I have a java program that will read in these files and print out the matches it finds. It is fairly quick, and it caches the regular checksums of the additional files in a file on disc so that this doesn't need to be rebuilt each time. For instance:
Starting diff with 3 input files
File 0, size 5989160960: 0_1326245236.tar
Loading checksums for 0_1326245236.tar
Reading checksums took 1950 ms
File 1, size 686940160: 1_1329442036.tar
Loading checksums for 1_1329442036.tar
Reading checksums took 269 ms
File 2, size 222167040: 2_1331083636.tar
Found match from location 3062 to file 1 location 3062 of length 1699
Found match from location 22372 to file 1 location 22372 of length 1333
Found match from location 66404 to file 1 location 66404 of length 1761
Found match from location 68166 to file 1 location 68166 of length 2643
Found match from location 76643 to file 1 location 76643 of length 1846
Found match from location 79715 to file 1 location 79715 of length 2358
Found match from location 100708 to file 1 location 100708 of length 2357
Found match from location 114532 to file 1 location 114532 of length 1332
Found match from location 185188 to file 1 location 185188 of length 1844
However, there is a slight problem, and that is that some of the mails in my mailboxes have base-64 encoded files, where those files were obviously full of nulls. The base-64 representation of a run of nulls goes something like this:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...for line after line.
This means that my program generates an extremely large number of matches all overlapping, like this:
Found match from location 1458932 to file 0 location 2378752 of length 1083
Found match from location 1458932 to file 0 location 80098304 of length 1083
Found match from location 1458932 to file 0 location 1802637312 of length 1083
Found match from location 1458932 to file 0 location 1804695552 of length 1083
Found match from location 1458932 to file 0 location 1805683712 of length 1083
Found match from location 1458932 to file 0 location 1830771712 of length 1083
Found match from location 1458932 to file 0 location 1880933376 of length 1083
Found match from location 1458932 to file 0 location 1887559680 of length 1083
Found match from location 1458932 to file 0 location 1888234496 of length 1083
Found match from location 1458932 to file 0 location 1952214016 of length 1083
Found match from location 1458932 to file 0 location 2009562112 of length 1083
Found match from location 1458932 to file 0 location 2055955456 of length 1083
Found match from location 1458932 to file 0 location 2074396672 of length 1083
(in fact, there are 7379 matches for location 1458932).
I am currently looking at methods of choosing the best overlap to use in the diff. Obviously the biggest would be best, but can I reduce the amount of work spent finding it?
- Sounds like the Google search term you're after is "minimal edit distance". See [here] for previous discussion and some source code :) --MoonShadow
- Probably not. The problem isn't finding the matches, it is simply that there are so many of them. The rolling hash algorithm enables me to find matches across the 3GB file without much in the way of trouble, which a dynamic programming approach would not. The problem is that there may be a much longer match in there somewhere. The naive approach would be to enumerate all the matches found, and cherry-pick the largest ones, but in order for the program to finish before the end of the universe I need to be able to suppress match generation for sections of the file that already have a suitable match, without missing the really useful large matches.
- The reason I am suggesting this approach is that finding the set of minimal edits locates exactly the smallest set of things that have changed (and therefore also the set of largest matches). What you describe is only a problem for you because you are doing something different to generate your diffs. The question you are asking is how to find the [longest common subsequence] of two strings. This problem is intimately related to that of the minimal edit distance, and is solved by similar algorithms in similar time and space; and since what you are actually trying to do is generate a diff, I suggest the minimal edit distance problem is the one you really want to solve, since otherwise you would have to do extra work to extract the diff (aka the edits) anyway. If you do elect to go for a solution for generating your diff that is not equivalent to solving the minimal edits problem - i.e. one that is not mathematically correct - that's fair enough, and I'm sure there are plenty of nice heuristics people could come up with, but you'll have to accept that your solution will sometimes - possibly often - miss the best matches; that's the tradeoff you're making. If that's the direction you want to go, I suggest you start with the simplest thing that could possibly work and get that running and debugged - maybe greedy selection, or just picking the first match found each time - and use a large corpus of test data to gather some stats so you have a baseline. You say you see lots of matches, but what does that translate to in practice? - just how bad would the result be if you select them cheaply/greedily/randomly? This is not at all clear; perhaps a simple solution is already good enough? If not, you are then in a good position to try different heuristics that people suggest and watch how the stats change.
There might also be completely different ways to address specific, particularly nasty corner cases; have you considered performing RLE, or static Huffman with English frequency tables, on your mailboxes before generating the diffs, for instance? --MoonShadow
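(For concreteness, the textbook dynamic programme for minimal edit distance looks like the sketch below. It is quadratic in the input lengths, which is exactly why it cannot be applied directly to multi-GB files; this is just the reference point the discussion above is comparing against.)

```java
// Classic minimal-edit-distance (Levenshtein) DP: O(n*m) time, O(m) space.
public final class EditDistance {
    public static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;  // "" -> t[0..j) costs j
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;  // s[0..i) -> "" costs i deletions
            for (int j = 1; j <= t.length(); j++) {
                int subst = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;  // reuse the two rows
        }
        return prev[t.length()];
    }
}
```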
- Yes, I'm basically saying finding the absolutely best solution is too costly, and I'm looking for shortcuts. In terms of general compression of text, the resulting diff files should be fairly amenable to bzip2. What I want to do is pick out the large chunks of file that are completely identical. The bits that have repetitions that are causing problems by creating too many matches are merely a problem for my algorithm - bzip2 will eat them for breakfast. In terms of heuristics, there are all sorts of possibilities, but a general "give up if it gets silly" rule seems appropriate. For instance, stop looking for matches if there are already x overlapping matches over the current area. Ah, as I suspected: having done this, grepping the output for any matches over 10MB gives three matches covering 118MB of the 211MB file (two of them overlap).
Found match from location 44142960 to file 1 location 203497840 of length 42241207
Found match from location 86384496 to file 1 location 294909808 of length 81713448
Found match from location 87920934 to file 0 location 3487263972 of length 49251701
- Oh, I see the problem - you've got a chunk of text that is repeated. What I think you need to do is advance the first file once you've found a positive match for that bit of text. --Vitenka
- Yes, that's a solution. However, what's to say that you won't find a 1kB match and miss the 40MB one by advancing over it?
- Could you explain how that would happen, maybe give an example? I'm struggling to imagine it. Surely if there is a 40Mb match, there is also a (40Mb-1kb) match? --MoonShadow
- Valid point. However, there may be 40960 individual consecutive 1k matches, which could mask a single 40M match.
- Doing a dictionary / runlength compression on both files first would also have this effect. --Vitenka
- Probably won't do this, as it would complicate matters quite a bit.
Right, back to the drawing board. There's a better way to do this. Up until now, I was validating and sizing each match as it was being discovered, which requires seeking to the point in both files and reading. Due to the number of false hash hits, this was causing a huge amount of disc activity (mostly slow seeks) to generate a huge number of matches that were to be thrown away. New method is to generate all the hash hits first, merge consecutive hits to generate larger potential matches, and then validate the potential matches in decreasing size order. As a match is validated, then all other hits totally overlapped by it can be thrown away, and all hits partially overlapping can be reduced in size. I may even not bother storing single hash hits - the running checksum block size is 1k, so I won't find any matches smaller than 2k, but that's fine.
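(The "merge consecutive hits" step might look something like the following: two hits can be merged when they point at the same source file, have the same offset difference between input and source, and abut exactly. The record layout and names here are invented for illustration, not taken from the actual program:)

```java
import java.util.*;

public class MatchMerger {
    /** A candidate match: input offset, source file id, source offset, length. */
    public record Hit(long in, int file, long src, long len) {}

    /** Merge hits that are consecutive in both the input and the source file. */
    public static List<Hit> merge(List<Hit> hits) {
        // Group by file and alignment (in - src), then order by input offset,
        // so mergeable hits end up adjacent.
        hits.sort(Comparator.comparingInt((Hit h) -> h.file())
                .thenComparingLong(h -> h.in() - h.src())
                .thenComparingLong(Hit::in));
        List<Hit> out = new ArrayList<>();
        Hit cur = null;
        for (Hit h : hits) {
            if (cur != null && h.file() == cur.file()
                    && h.in() - h.src() == cur.in() - cur.src()
                    && h.in() == cur.in() + cur.len()) {
                cur = new Hit(cur.in(), cur.file(), cur.src(), cur.len() + h.len());
            } else {
                if (cur != null) out.add(cur);
                cur = h;
            }
        }
        if (cur != null) out.add(cur);
        return out;
    }
}
```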
Having implemented the first part of this, it is a lot faster than before (so far), but still a bit pedestrian, taking 20 minutes to identify 258000 matches of size 4k and bigger, having done some low-hanging-fruit filtering to remove silly overlaps. The next stage is to stuff these all in a data structure and process them in decreasing size order. For this, I will need the collection sorted three ways - by size, start pos, and end pos. I won't need an R-tree, as I will never need to search for matches that are bigger than the search window.
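(One way of doing the overlap bookkeeping when accepting matches largest-first is an interval map over the input file: each candidate is trimmed against what is already covered, and anything trimmed to nothing is discarded. A sketch, assuming a TreeMap of disjoint [start, end) intervals; the matching trim on the source-file side of the match is omitted here:)

```java
import java.util.*;

public class GreedyCover {
    /** Disjoint [start, end) intervals of the input file already covered. */
    private final TreeMap<Long, Long> covered = new TreeMap<>();

    /**
     * Try to claim [start, end). Returns the trimmed interval actually
     * claimed, or null if the candidate is entirely covered already.
     */
    public long[] claim(long start, long end) {
        Map.Entry<Long, Long> prev = covered.floorEntry(start);
        if (prev != null && prev.getValue() > start) start = prev.getValue();
        Map.Entry<Long, Long> next = covered.ceilingEntry(start);
        if (next != null && next.getKey() < end) end = next.getKey();
        if (start >= end) return null;
        covered.put(start, end);
        return new long[] { start, end };
    }
}
```

(Because matches are processed in decreasing size order, a trimmed candidate never needs to grow back; it either shrinks or disappears, which is what makes the largest-first pass safe.)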
Okay (I am doing some real work, honest), having done that, the program now generates 458 non-overlapping matches which cover 156MB of the 211MB file, in 20 minutes. However, these matches are not validated against the actual file contents, and this will have to be done, although it will be a lot quicker than the previous validation palaver. I also have an idea of how to speed up the search stage significantly.
And now the program generates a diff containing 382 validated non-overlapping matches covering 160MB of the 211 MB file and 336 literal copy regions. The diff file is 51MB in size, and takes 9 minutes to create. Now I just need an un-diff program to reverse it and verify.
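(The un-diff side just replays the records: matched regions are copied out of the context files, literal regions out of the diff itself. A sketch only - the tag byte and field layout below are invented, since the real diff format isn't documented here:)

```java
import java.io.*;

public class UnDiff {
    /** Replay a diff stream: 'M' = match (file id, offset, length) copied from
     *  a context file; 'L' = literal (length, raw bytes) copied from the diff. */
    public static void apply(DataInputStream diff, RandomAccessFile[] sources,
                             OutputStream out) throws IOException {
        int tag;
        byte[] buf = new byte[8192];
        while ((tag = diff.read()) != -1) {
            if (tag == 'M') {
                int file = diff.readInt();
                long offset = diff.readLong();
                long len = diff.readLong();
                sources[file].seek(offset);
                while (len > 0) {
                    int n = sources[file].read(buf, 0, (int) Math.min(buf.length, len));
                    if (n < 0) throw new EOFException("source file too short");
                    out.write(buf, 0, n);
                    len -= n;
                }
            } else if (tag == 'L') {
                long len = diff.readLong();
                while (len > 0) {
                    int n = diff.read(buf, 0, (int) Math.min(buf.length, len));
                    if (n < 0) throw new EOFException("diff file truncated");
                    out.write(buf, 0, n);
                    len -= n;
                }
            } else {
                throw new IOException("bad record tag " + tag);
            }
        }
    }
}
```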
The program seems to spend most of its time searching the checksum lists for matches, which is an interpolation search for "extra info" files and a hash table lookup for the input file (it's easier to update a hash table as the input file is processed). There is the potential to improve the performance of this by storing up a suitable-sized list of checksums and check them all at once using a merge join instead of searching for each one individually, but it needs a little thought to allow the filtering that the program is currently using to stop the number of hits becoming silly.
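(The merge join itself is straightforward once both checksum lists are sorted; sketched here over bare int arrays, whereas the real lists would carry a file offset alongside each checksum:)

```java
import java.util.*;

public class ChecksumJoin {
    /** All (i, j) index pairs with a[i] == b[j]; both arrays must be sorted. */
    public static List<int[]> join(int[] a, int[] b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else {
                // Equal values: emit the cross product of the duplicate runs.
                int v = a[i];
                int j0 = j;
                while (j < b.length && b[j] == v) j++;
                while (i < a.length && a[i] == v) {
                    for (int k = j0; k < j; k++) out.add(new int[] { i, k });
                    i++;
                }
            }
        }
        return out;
    }
}
```

(One sequential pass over each sorted list replaces one search per checksum, which is where the hoped-for saving comes from; the duplicate-run handling is also a natural place to hang the "give up if it gets silly" filtering.)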
Yes, there was a bug in the program, so it was writing an incorrect diff file. However, now that bug is fixed and an "un-diff" program written, I can correctly retrieve the original 211MB file from the 51MB diff file (which takes a few seconds). This test case contains quite a lot of corner cases already; I can't think of any others to test out. I have yet to try the performance improvement described above, which shouldn't be too hard. Yes, a little filtering would be useful, but part of the purpose of filtering is to reduce the number of lookup operations being performed, which would no longer be relevant. The other purpose of filtering is to prevent memory being filled up with too many matches, so this will still need to be done.
And now it's five times faster (or more, depending), and it doesn't use huge amounts of RAM any more.