According to Wikipedia, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. I didn’t get it at first; it sounded like a complicated technique… Yet after some hands-on experience, and after seeing what an incredible technology it is, let me explain it in a very simple way.
To understand what deduplication is, we should first take a look at data duplication! Imagine a mail system with 1000 users. User A sends a mail with a 1MB attachment to the other 999 users. This means the attachments take up about 1GB in total (999 copies in the inboxes + 1 copy in the Sent items). So every time a backup job runs, all 1000 identical files are saved.
- This is Data Duplication: an individual file has multiple DUPLICATES (or instances) on disk.
Now, let’s dig a little deeper!
When deduplication-capable backup software checks these files and sees ‘Hey, this file occurs 1000 times!’, it won’t simply transfer every copy. It will transfer the file once and replace the 999 other instances with a shortcut (a reference to that single stored copy)!
- This is data DE-duplication: all the duplicates are removed.
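To make the "one copy plus shortcuts" idea concrete, here is a minimal Python sketch. It is not how any particular backup product works: the `deduplicate` function, the `mailboxes` dictionary and the file names are all made up for illustration, and it assumes duplicates are detected by comparing SHA-256 hashes of whole files.

```python
import hashlib

def deduplicate(files):
    """Keep each distinct content once and point every file name at it.

    `files` is a hypothetical {name: bytes} mapping standing in for the
    files a backup job would read.
    """
    store = {}      # content hash -> the single stored copy
    shortcuts = {}  # file name -> content hash (the "shortcut")
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # only the first copy is kept
        shortcuts[name] = digest
    return store, shortcuts

# The mail example: 1000 mailboxes holding the same 1MB attachment
# (taking 1MB as 10**6 bytes for simplicity).
attachment = b"x" * 1_000_000
mailboxes = {f"user{i}/attachment.bin": attachment for i in range(1000)}

store, shortcuts = deduplicate(mailboxes)
stored_bytes = sum(len(data) for data in store.values())
print(len(store), "stored copy,", len(shortcuts), "shortcuts,", stored_bytes, "bytes")
```

Every file name can still be restored by following its shortcut into `store`, but the attachment itself is kept only once.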
But, what can Data Deduplication do for me?
Let us go back to our email; imagine you don’t have software capable of deduplication. The mail gets backed up; size on disk: 1GB. 7 days later, the next full backup job runs. Again 1GB on disk.
Now imagine you upgrade to de-duplication software.
The mail gets backed up; size on disk: 1MB. 7 days later, the next full backup job runs. Extra size needed on disk: 1MB.
Data de-duplication effectively reduces the storage used by these two backups from 2GB to 2MB! But storage space is not the only thing saved: the network carries less traffic too, because only one copy of the file has to be transferred!
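The numbers above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic under the same simplifications as the example: 1GB is rounded to 1000MB, and each full backup job is assumed to store one copy of the attachment. Software that also deduplicates across backup jobs could need even less for the second run.

```python
copies = 1000        # identical attachments across all mailboxes
attachment_mb = 1    # size of one attachment in MB
full_backups = 2     # the first backup plus the one 7 days later

# Without deduplication every copy is written in every backup job.
without_dedup_mb = copies * attachment_mb * full_backups

# With deduplication (as assumed above) each job stores a single copy.
with_dedup_mb = attachment_mb * full_backups

saving_ratio = without_dedup_mb // with_dedup_mb
print(without_dedup_mb, "MB without dedup,", with_dedup_mb, "MB with dedup,",
      f"{saving_ratio}x less storage")
```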
Image taken from http://www.enterprisestorageguide.com/how-data-deduplication-works