Content Defined chunking (CDC)
One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. - wikipedia / HN / Borg
The simplest approach to calculate the dynamic chunks is to calculate the rolling hash and if it matches a pattern (like the lower N bits are all zeroes) then itβs a chunk boundary. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but nothing else.
see also
- How would you fingerprint a piece of data?
- Foundation - Introducing Content Defined Chunking (CDC)
- FastCDC: A Fast and Efficient Content-Defined Chunking Approach for Data Deduplication
- Content Defined Chunking
- CDC File Transfer - tools for synching and streaming files from Windows to Linux. They are based on Content Defined Chunking (CDC), in particular FastCDC, to split up files into chunks.
Written on September 9, 2020, Last update on January 9, 2023
hash