OpenFileDiff is a fileformat
(.openfilediff)
that describes the differences between 2 files in a storage efficient manner. Could be used for file versioning, or creating 'patch-files' for new software releases (incremental updates by modifying the files of the previous install).
The way it works is relatively simple, in a nutshell, the openfilediff fileformat starts with a series of integers, each integer is terminated by an ascii dash character b'-' (which works about the same as null terminated strings, except when decoding we just keep adding all bytes together until we find the terminator, b'-'). We call this part of the file the 'indexTable', because each integer represents an index of bytes1 or the second half of the openfilediff, which we call the payload. The payload are bytes taken from file 2 - and are all stored after the triple dash b'---'. The first 2 integers in the indexTable are always pointing to 2 indices in file 1, which we want to replace by the characters found using the next 2 indices thereafter, which point to the payload portion of the openfilediff. So if the .openfilediff file is b'0-5-0-2---hi', we should replace the characters between index 0 and 5 (0,1,2,3,4 but not the character at index 5) of file 1 with the characters at the indices 0-2 (so 0 & 1, as we always exclude the last index), where index 0 is where the payload portion starts, i.e. the first character after the triple dash b'---'. .openfilediff can also look something like b'0-4--10-11-0-1---a'. Here as you can see, we got a double slash after 0-4, this double slash means that we should replace the beforementioned range of indices with nothing at all. Then in this example, 10-11 tells us it to replace index 10 with whatever indices comes after - 0-1 i.e index 0 of the payload, which is b'a'. And that pretty much covers the entirety of the format. If you want to write your own implementation, I recommend taking a look at the source code and just rewriting it in your programming language of choice - I dont think its possible to optimize the algorithm much further than my implementation, at least not by any breakthrough margins. Since the indices of .openfilediff arent guaranteed to be counting upwards (although in practice, with the current implementation, it always does), there is also the possibility of reusing the same payload data several times by using the same indices of the payload several times wherever possible, however, the current implementation does not do this (PRs are welcome). Go ahead, try it! Simply drop in a file - then a similar but slightly modified file and it will create a
.openfilediff
version of the modified file. You can repeat by dragging the first file you dragged in once again, but then uploading the
.openfilediff
as the second file. You should then get back the second file you uploaded the first time.
Drop file here or click to select
Its current implementation consist of 2 simple Python functions (around 200 lines of code total)
getDiff(bytes1:bytes,bytes2:bytes)->bytearray
&
applyDiff(bytes1:bytes,diff:bytes)->bytearray
with no external dependencies. Example usage:
b1=b"1234567890987654321abcdefghijklmnopqrstuvwxyz" b2=b"1234567890987654321abcdefghijmeownopqrstuvwxyz" # Same as b1, except 'kl' got replaced with 'meow' diff=getDiff(b1,b2) B2=applyDiff(b1,diff) print(diff) # Would print: bytearray(b'\x1d- -\x00-\x04-,\x01-,\x01---meow') or bytearray(b'29-32-0-4---meow') if you use humanly readable numbers (there are 2 commented out functions 'intToBytes' & 'bytesToInt' that if you uncomment, getDiff/applyDiff will start using humanly readable numbers instead - however, doing so would also increase the diff file's size somewhat) # The numbers represent which indices we want to replace with what indices of the string that comes after the triple dashes ('---') right before, in our example, 'meow'. # So in our example We take the characters in the first string b1, replace characters between the indices 29 and 32 (that is, the characters on indices 29, 30 & 31 which are 'klm') with the characters coming after in the diff at indices 0 to 4 (0,1,2,3 but not 4) counting from the first character after the triple-dashes ('---'), which are 'meow'. print(len(b2)) # Would print: 46 print(len(diff))# Would print: 14 print(b2==B2) # Will always print true, regardless of the original contents of b1 & b2
Everything in this document and its associated repository (link below) is available under the permissive Apache 2.0. As such, feel free to use this dataformat & the encoder/decoder implementation as you see fit in any software for any use. Todo: * Benchmark with various input sizes (I suspect both the encoding & decoding algorithms are close close to O(n)) * Format this document a bit better
Written by Olliver Aira (@ItsCubeTime) 12/21/2024
⬩—❖—⬩
https://github.com/ItsCubeTime/openfilediff