[s3ql] Optimizing deduplication
c***@iu.edu
2018-08-15 21:09:27 UTC
Hi, I am interested in learning a bit more about the architecture,
specifically what is the order of operations between splitting data into
chunks, compression, deduplication, and encryption?

I have a large amount of data that is very similar, but each file is only
somewhat compressible. My concern is that compression might have a negative
effect on the potential for efficient deduplication... is this possible?
Also, is the compression done separately on each chunk, or on a per-file
basis? Finally, will smaller chunks (max-file-size) produce better
deduplication (just at the expense of more network operations)? For what
it's worth, my data is mostly 3D numerical arrays where each file is a few
GB.

Thanks so much. Chris
--
You received this message because you are subscribed to the Google Groups "s3ql" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s3ql+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Nikolaus Rath
2018-08-16 08:22:28 UTC
Post by c***@iu.edu
Hi, I am interested in learning a bit more about the architecture,
specifically what is the order of operations between splitting data into
chunks, compression, deduplication, and encryption?
1. Split
2. Deduplication
3. Compression
4. Encryption
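
The order above can be sketched in a few lines of Python. This is only an illustration of the sequence, not S3QL's actual code: the names `store`, `backend` and `encrypt` are made up for the example, the choice of SHA-256 and zlib is an assumption, and `encrypt` is an identity stand-in because the standard library has no cipher.

```python
import hashlib
import zlib

def store(data, block_size, backend, encrypt=lambda b: b):
    """Split -> deduplicate -> compress -> encrypt. Returns the block keys."""
    keys = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]        # 1. split into fixed-size blocks
        key = hashlib.sha256(block).hexdigest()   # 2. dedup on the hash of the raw block
        if key not in backend:
            # 3. compress and 4. encrypt only blocks that are not stored yet
            backend[key] = encrypt(zlib.compress(block))
        keys.append(key)
    return keys

backend = {}  # hypothetical object store: key -> stored payload
a = store(b"A" * 4096 + b"B" * 4096, 4096, backend)
b = store(b"A" * 4096 + b"C" * 4096, 4096, backend)
print(len(backend))  # 3: the shared "A" block is stored once
```

Because the hash is taken over the raw block, neither compression nor encryption can change which blocks are recognised as duplicates.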
Post by c***@iu.edu
I have a large amount of data that is very similar, but each file is only
somewhat compressible. My concern is that compression might have a negative
effect on the potential for efficient deduplication... is this
possible?
Not unless you do the splitting into blocks after the compression.
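
A small experiment shows why the order matters. It is a sketch under stated assumptions: the two "files" are synthetic, zlib stands in for whatever compressor is configured, and the helper names `blocks` and `shared` are invented for the example.

```python
import hashlib
import random
import zlib

BLOCK = 4096

def blocks(data, size=BLOCK):
    return [data[i:i + size] for i in range(0, len(data), size)]

def shared(x, y):
    """Number of distinct block hashes the two inputs have in common."""
    hx = {hashlib.sha256(b).digest() for b in blocks(x)}
    hy = {hashlib.sha256(b).digest() for b in blocks(y)}
    return len(hx & hy)

rng = random.Random(0)
# 64 blocks of somewhat compressible data (16-symbol alphabet)
file1 = bytes(rng.choices(b"0123456789abcdef", k=64 * BLOCK))
# Same data, but the first 1000 bytes are overwritten in place
file2 = bytes(rng.getrandbits(8) for _ in range(1000)) + file1[1000:]

# Split first: only the block containing the change differs
shared_plain = shared(file1, file2)
# Compress the whole file first, then split the compressed stream
shared_compressed = shared(zlib.compress(file1), zlib.compress(file2))

print(shared_plain)       # 63 of 64 blocks still match
print(shared_compressed)  # far fewer, typically none
```

A local change early in the file alters the rest of the compressed stream, so fixed-size blocks cut from it no longer line up. Splitting before compressing avoids that.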
Post by c***@iu.edu
Also, is the compression done separately on each chunk, or on a per-file
basis?
Per chunk.
Post by c***@iu.edu
Finally, will smaller chunks (max-file-size) produce better
deduplication (just at the expense of more network operations)?
Whether it will result in more de-duplication depends on your data, but
it will certainly not result in less. It will, however, also increase
the size of your metadata DB.
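
To make the trade-off concrete, here is a rough sketch that counts stored bytes and block references for different block sizes. The data is synthetic, `dedup_stats` is a made-up helper, and treating the number of block references as a proxy for metadata size is an assumption.

```python
import hashlib
import random

def dedup_stats(files, block_size):
    """Return (bytes stored after dedup, number of block references)."""
    unique = {}
    refs = 0
    for data in files:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
            refs += 1  # every block of every file needs a metadata entry
    return sum(unique.values()), refs

rng = random.Random(0)
base = bytes(rng.getrandbits(8) for _ in range(65536))
# Second file: identical except for 16 bytes flipped at offset 10000
variant = base[:10000] + bytes(b ^ 0xFF for b in base[10000:10016]) + base[10016:]

stats = [dedup_stats([base, variant], size) for size in (65536, 16384, 4096)]
for size, (stored, refs) in zip((65536, 16384, 4096), stats):
    print(size, stored, refs)
# 65536 131072 2
# 16384 81920 8
# 4096 69632 32
```

In this example the stored bytes shrink as the block size shrinks, while the number of entries to track grows in proportion.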

Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«