Deduplication in Open Source Backup Software
Introduction
With the rise of big data and the Internet of Things, IT organizations are responsible for an ever-increasing amount of data. One key component of any organization's IT infrastructure is its backup system. Every byte of data generated by a user is often stored redundantly in an array such as RAID and then backed up regularly, resulting in numerous copies of each file. In the world of open source backup utilities, most well-established, enterprise-grade tools back up based on file-level changes: a simple process compares the last-change timestamp of a file to the time of its last backup, and if the change is more recent, the file is flagged and backed up again (TechTarget, 2008). Small changes to a file therefore cause many duplicates of that file to be stored, each containing identical blocks from the unchanged portions. As organizations continue to expand their collection and storage of data, efficient and reliable backup solutions are crucial to keeping that data available. Enterprise organizations that have already invested in open-source backup software can benefit from newer features such as compression and data deduplication, yet open source backup utilities appear to lag behind commercial offerings in this respect (Dubash, 2010).
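As a rough illustration of this timestamp-driven approach, the sketch below reduces a file-level incremental pass to comparing each file's modification time against the time of the previous backup. It is a minimal sketch rather than the logic of any particular backup tool; the directory path and the one-week window are invented for the example.

```python
import os
import time


def files_changed_since(root, last_backup_time):
    """Walk a directory tree and flag every file whose modification time is
    newer than the previous backup, mirroring simple file-level incremental
    selection."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime > last_backup_time:
                changed.append(path)
    return changed


if __name__ == "__main__":
    # Hypothetical example: everything under /srv/data modified in the last week.
    one_week_ago = time.time() - 7 * 24 * 3600
    for path in files_changed_since("/srv/data", one_week_ago):
        print(path)
```

Because the whole file is flagged whenever its timestamp changes, even a one-byte edit causes the entire file to be copied again, which is the redundancy deduplication aims to remove.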
The open protocols and standards of non-proprietary software are an asset, but they also present a challenge when adopting new features into an application. Of the leading open source tools to date, Bacula and Bareos, which share a code base, offer no data deduplication functionality, and UrBackup suggests a hardware approach whose performance declines as the dataset grows beyond a terabyte (Bonwick, 2009; Raiber, 2016). Bareos provides a plugin environment (Bareos GmbH, 2016) for its client backup and restore component, allowing a data deduplication algorithm to be injected into the existing backup framework. This research project identifies the algorithm best suited to deduplication during backup operations and implements it as a plugin to the Bareos backup software, allowing a real-world comparison of the performance difference deduplication makes in a fully featured open source backup product.
Research Question: Can data deduplication be used to decrease the storage requirements for incremental backups without a significant increase in the time to create a backup?
Literature Review
Data deduplication was seen as a potential way to alleviate backup concerns arising from the data growth of the mid-2000s. Research at the time focused on what could be done to relieve the endless need for data backup and set the direction of future work. Deduplication was identified as an important factor in the future of disaster recovery because of its ability to save space during data replication (Maitland, Bigelow, & TechTarget, 2007). Following this identification of the need for further research, multiple studies (Li, Xu, Ng, & Lee, 2014; Mandal et al., 2016; Xia et al., 2014) examined candidate algorithms for performing deduplication. These studies identified and tested standalone approaches such as Ddelta (Xia et al., 2014) and hint-based block detection (Mandal et al., 2016), showing statistically that block-level deduplication could indeed save space while offering significantly better computational performance than traditional deduplication algorithms.
The results of these studies allowed data to be copied efficiently but did not integrate the algorithms into backup software. Building on this work identifying efficient deduplication algorithms, researchers have developed and tested backup systems (Keeling, 2014; Li et al., 2014; Zahed, Sabitha Rani, Saradhi, & Potluri, n.d.). Keeling's work was based on the rsync deduplication algorithm (Andrew Tridgell & Mackerras, 1996), which was used to create the BackUp and Restore Program (BURP) 2 (Keeling, 2014). The success of this standalone software shows that deduplication can be implemented in backup software. These researchers have demonstrated not only that deduplication can eliminate many redundant copies of data in backup scenarios, but also that it can be built into usable products. These efforts, however, have relied on ground-up implementations created to study the effectiveness of the algorithms, omitting other features whose absence could affect the results. The next step is to take this work and integrate it into a fully featured open-source backup solution, combining the current features of enterprise-level backup tools with the identified benefits of data deduplication.
Backup Software Deduplication
The sub-questions were broken down into three phases. The first phase involved the selection and performance comparison of deduplication algorithms; it identified one algorithm to be incorporated into Bareos. The second phase, answering sub-question three, resulted in the development of that algorithm as a Python-based plugin to the Bareos file daemon. The third and final phase compared the performance of the deduplication-enabled version of Bareos against the original in a real-world setting to determine whether the results support the hypothesis.
Research Approach
Phase 1, the comparison of deduplication algorithms, is a quasi-experimental design in which selected samples are used in a post-only test on each algorithm. The data were not random; they were selected as sample changes to data that could occur during normal operations, including modifications to Microsoft Word documents, MySQL backups, JPEG images and compressed files. Each data file was copied once and then underwent controlled changes. Each file set was then copied using one of the deduplication algorithms and performance metrics were collected. Using these metrics, the best algorithm was selected and implemented into the backup software. Phase 2 was the software development phase, producing the deduplication-enabled backup software for phase three. In Phase 3, the performance of backup operations was tested in a post-test randomized experiment. The enhanced and original backup software were run against random changes to data to measure the resulting backup size and the performance impact on the client. These backups ran as weekly incremental backups of a Linux-based NAS system holding a variety of files, including images, documents, MySQL binaries and compressed files.
Data
Five deduplication algorithms were eventually selected for phase 1 of the project, answering sub-question 1. These algorithms included:
- Rabin Fingerprinting (Rabin, 1981)
- Rolling Adler (A Tridgell, 1999)
- PLAIN (Spiridonov, Thaker, & Patwardhan, 2005)
- Two Threshold Two Divisor (Eshghi & Tang, n.d.)
- RSYNC (Andrew Tridgell & Mackerras, 1996)
Each of these algorithms was tested against five different files after a first copy was made, in order to observe the performance of calculating an incremental file set. A sketch of the rsync-style rolling Adler checksum that underpins two of the listed algorithms follows below.
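The following is a minimal sketch of the rolling Adler checksum used by Rolling Adler and RSYNC: two running counters over a sliding window that can be updated in constant time as the window advances by one byte. The 2 KB window and the packing of the counters into a single 32-bit value follow rsync's general scheme but are illustrative choices, not the implementation tested in this project.

```python
M = 1 << 16  # modulus used by the rsync-style weak checksum


def adler_init(block):
    """Compute the weak rolling checksum (a, b) over an initial window of bytes."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % M
        b = (b + (n - i) * byte) % M
    return a, b


def adler_roll(a, b, out_byte, in_byte, window):
    """Slide the window one byte: drop out_byte, append in_byte, in O(1)."""
    a = (a - out_byte + in_byte) % M
    b = (b - window * out_byte + a) % M
    return a, b


def weak_signatures(data, window=2048):
    """Yield (offset, checksum) pairs for every window position in `data` (bytes)."""
    if len(data) < window:
        return
    a, b = adler_init(data[:window])
    yield 0, (b << 16) | a
    for i in range(len(data) - window):
        a, b = adler_roll(a, b, data[i], data[i + window], window)
        yield i + 1, (b << 16) | a
```

The constant-time roll is what makes these checksums attractive for deduplication: every possible window position can be fingerprinted without rehashing the whole window each time.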
Since each of these algorithms had already been shown to reduce storage size, the measurements during this phase focused only on the performance impact on the client. The metrics measured were CPU usage, amount of disk I/O, total time to copy the file and maximum memory used. To vary the types of data, the following files were used:
- File 1 – A 275 KB Microsoft Word document with every "the" replaced with a random word 0-10 characters long.
- File 2 – A 10 MB MySQL database export with 10 random lines added to each of the 20 tables.
- File 3 – A 7.9 MB JPEG image with all black pixels changed to white.
- File 4 – A tar file containing files 1-3, created from the original files and then updated with the same changes applied to the files inside the tar.
The results for each algorithm are presented in the next section. In Phase 3, the deduplication version of Bareos performed weekly incremental backups alongside an unmodified version of the software. The size of the incremental file, the total time for the backup, the CPU usage and the total memory usage were all collected. The data used in this phase came from a production NAS hosting three websites, a git repository, a cloud-based file storage system and a number of MySQL databases for various small organizations.
Analysis
(Note: for the purposes of COMP695, all data in this analysis are fictional and in no way represent true performance metrics. The section does, however, present a realistic set of analyses that would be performed.)
Algorithm Comparison
The results of the algorithm performance comparison are shown in Figures 2-5. Each combination of algorithm and file was sampled three times to produce an average value used to answer sub-questions 1 and 2. The ranking of the algorithms varied by metric but stayed consistent across all files within a metric. This consistency indicates that none of the algorithms is specialized for one of the selected file types or changes, which simplifies the selection of the "best" algorithm for implementation. The initial comparisons identified Rolling Adler and Two Threshold Two Divisor as significantly better than the other algorithms on all metrics except memory usage. The main differences between these two algorithms were in the time to complete the copy and the amount of disk I/O. Since all other metrics were very similar, a shorter copy time was prioritized over lower disk I/O, and the Two Threshold Two Divisor algorithm was selected as the "best" algorithm, satisfying sub-question 2.
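To make the selected algorithm concrete, the sketch below implements the two-threshold, two-divisor chunking rule: once a chunk passes the minimum size, a match against the easier backup divisor is remembered as a fallback cut point, a match against the main divisor ends the chunk, and hitting the maximum size forces a cut at the remembered fallback. The parameters and the simple polynomial rolling hash are illustrative stand-ins, not the values or fingerprint function used in the actual plugin.

```python
# Illustrative TTTD parameters, roughly in the range used for ~1 KB average
# chunks; not the values used in the plugin built for this project.
T_MIN, T_MAX = 460, 2800       # minimum and maximum chunk size in bytes
D_MAIN, D_BACKUP = 540, 270    # main divisor and easier-to-hit backup divisor
WINDOW = 48                    # bytes covered by the rolling hash
MOD = (1 << 31) - 1
MULT = 263
POW_W = pow(MULT, WINDOW - 1, MOD)  # weight of the byte that leaves the window


def tttd_chunks(data):
    """Return chunk end offsets for `data` (bytes) using the two-threshold,
    two-divisor rule: prefer a main-divisor match, fall back to the last
    backup-divisor match (or a hard cut) when the maximum size is reached."""
    cuts = []
    start = 0        # first byte of the current chunk
    h = 0            # rolling hash over the last WINDOW bytes
    backup_cut = -1  # last position where the backup divisor matched
    i = start
    while i < len(data):
        length = i - start + 1
        if length <= WINDOW:
            h = (h * MULT + data[i]) % MOD
        else:
            h = ((h - data[i - WINDOW] * POW_W) * MULT + data[i]) % MOD
        if length >= T_MIN:
            if h % D_BACKUP == D_BACKUP - 1:
                backup_cut = i
            if h % D_MAIN == D_MAIN - 1:
                cut = i                                      # main divisor hit
            elif length >= T_MAX:
                cut = backup_cut if backup_cut >= 0 else i   # forced cut
            else:
                cut = -1
            if cut >= 0:
                cuts.append(cut + 1)
                start = cut + 1
                h = 0
                backup_cut = -1
                i = cut          # resume scanning right after the cut
        i += 1
    if start < len(data):
        cuts.append(len(data))   # final partial chunk
    return cuts
```

Because cut points depend on content rather than fixed offsets, an insertion near the start of a file only shifts boundaries locally, so most chunks in an edited file still match chunks from the previous backup.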
Backup Performance
Phase two of the project involved the implementation of the Two Threshold Two Divisor algorithm as a Python plugin to Bareos. This phase was purely software development and produced the deduplication plugin used in the final phase of the project. To answer the final two sub-questions, both the deduplication-enabled and the unmodified backup software were tested against the chosen data. The performance and disk size results are shown in Figures 6-9. The deduplication backup software produced incremental files that were on average 3.3 times smaller, answering sub-question 4. The remaining metric was the performance difference on the system. The data show an increase in all three performance metrics; however, even though CPU usage doubled, the overall impact on the system remained minimal, with under 5% of the CPU in use at any one time, and can therefore be considered to have no practical impact on the backup. The significant results concern the time to take a backup: the total time to complete a backup effectively doubled, and the memory required roughly tripled. Deduplication clearly slows the overall time to complete a backup and requires more memory, with only a negligible increase in CPU usage.
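The plugin's source is not reproduced in this paper, but the mechanism behind the smaller incremental files can be sketched as follows: each file is split by a content-defined chunker (such as the TTTD sketch above), and only chunks whose digests are not already in the store are written. The `backup_file`, `chunk_index` and `store_dir` names are hypothetical, and a real plugin would persist the index between backup runs.

```python
import hashlib
import os


def backup_file(path, chunk_index, store_dir, chunker):
    """Split one file with `chunker` (e.g. the TTTD sketch above), write only
    previously unseen chunks into `store_dir`, and return the file's recipe
    (ordered chunk digests) plus the number of new bytes actually stored.

    `chunk_index` is a set of hex digests already present in the chunk store;
    a real backup tool would persist it between incremental runs."""
    with open(path, "rb") as f:
        data = f.read()
    recipe = []
    new_bytes = 0
    prev = 0
    for end in chunker(data):
        chunk = data[prev:end]
        prev = end
        digest = hashlib.sha256(chunk).hexdigest()
        recipe.append(digest)
        if digest not in chunk_index:  # chunk not seen in any earlier backup
            chunk_index.add(digest)
            with open(os.path.join(store_dir, digest), "wb") as out:
                out.write(chunk)
            new_bytes += len(chunk)
    return recipe, new_bytes
```

The extra hashing, chunk lookups and per-chunk writes in a scheme like this are consistent with the observed trade-off: less data written overall, but more time and memory spent per backup.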
Research Ethics
This research project did not involve human subjects and did not undergo a formal research ethics review with Athabasca University. All data used during this project were sourced either from open sources or from the private property of the authors.
Plagiarism
This research has been completed entirely by the authors listed. All ideas used in this article are acknowledged and referenced in the reference section below.
Conclusion
The systematic analysis of the five sub-questions allowed for a step-by-step identification, comparison and implementation of deduplication in an existing open-source backup program, Bareos. Five deduplication algorithms were selected, and based on performance metrics gathered while running them, the Two Threshold Two Divisor algorithm was determined to have the least impact on the system. The Two Threshold Two Divisor algorithm was implemented as a Python plugin for Bareos, and a side-by-side comparison over three weeks was used to determine the effect deduplication has on a fully featured backup system. The comparison confirmed the hypothesis that deduplication does reduce incremental file size, but it also identified a performance cost for this reduction in storage. The deduplication algorithm decreased performance on all three measures, with two showing a significant impact: the average time to perform a backup doubled with deduplication enabled, and the required memory tripled. An organization deciding whether to enable deduplication in its backup software will need to weigh the benefit of reduced storage space against the significantly longer backup times and the higher memory utilization on each backup client.