
Niko's Project Corner

Cheap off-site backup at Amazon Glacier

Description: File organization, compression, hashing and uploading to S3.
Languages: Bash
Tags: AWS, Encryption, Backups
Duration: Summer 2014
Modified: 17th July 2014

In addition to a mirrored and check-summed ZFS-based backup server, I wanted to have backups outside my premises to be safer against hazards such as burglary, fire and water damage. ZFS can already survive a single disk failure and can repair silent data corruption, but for important memories that isn't a sufficient level of protection. My ever-growing data set is currently about 150k files with a total size of 520 GB. Amazon's Glacier seems to be the most cost-efficient solution, and it comes with sophisticated APIs and SDKs.

Currently the Glacier storage at EU Ireland is priced at 0.011 USD / GB / month, which is about 0.10 EUR / GB / year, or only 50 EUR / year for 500 GB of data. However, data retrieval cost calculations are quite non-trivial because of numerous quotas, and the pricing is based on the monthly peak retrieval rate. The best reference for this is the Unofficial Amazon AWS Glacier Calculator, which tells us that downloading 500 GB over two weeks costs 11.51 USD in Glacier retrieval fees and 59.88 USD in transfer costs, totalling roughly 71 USD or about 52 EUR. Halving the retrieval time doubles the retrieval cost, so it is important to split the data into smaller files and retrieve them in a carefully scheduled manner. This doesn't bother me too much because I hope I never need to resort to this; instead it is almost like write-only memory.
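
As a quick sanity check on these figures, the arithmetic can be reproduced with plain shell tools (a minimal sketch; bc is assumed to be available):

# Yearly storage cost: 500 GB at 0.011 USD / GB / month
echo "500 * 0.011 * 12" | bc    # about 66 USD / year, i.e. roughly 50 EUR
# One full 500 GB retrieval spread over two weeks: retrieval fee + transfer
echo "11.51 + 59.88" | bc       # 71.39 USD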

Pricing information can be expected to change quite rapidly, but I'm going to quote current prices for reference anyway. Perhaps the most popular service, Dropbox, charges about 500 EUR / year for 500 GB, as can be seen from Figure 1. Unlimited Dropbox for Business is 15 USD / month / user, but a minimum of 5 users is required. In any case Dropbox is meant for read-write data storage and sharing, not for an "offline-like", backup-only solution. I'm also unsure whether it supports user-configured encryption keys without the user having to keep both encrypted and decrypted copies of the data on his/her computer.

dropbox_pricing
Figure 1: Dropbox pricing for 500 GB is about 500 EUR / year, roughly ten times the price of Glacier. However it has ready-made desktop and mobile applications, no retrieval pricing and other advantages.

Another interesting service is Backblaze, which offers unlimited storage for 50 USD / year, which is very competitive. Unfortunately they don't support Linux or Unix, which I want to use for my main storage to get the benefits of ZFS. Also their clients are based on a consumer-friendly synchronization principle, so there is less fine-grained control over what is actually happening. They also compare themselves to their competitors, and the quoted prices can be seen in Figure 2.

blackblaze_cmp
Figure 2: Backblaze offers competitive pricing when compared to its consumer-friendly peers.

Given all these more or less ready-made solutions, I still ended up choosing the Amazon Glacier service. I was lured in by the AWS Free Tier offer and the familiar by-engineers-for-engineers approach. It also gives me good first-hand experience with the Simple Storage Service (S3), whose life-cycle management I use to move files to their final destination: Amazon Glacier.

I naturally organize my files by topic and date, but for my pre-existing files I also had to split the data into smaller subsets. The initial step was to organize the data into folders of about 50 - 100 GB, each of which is further split into partitions of about 10 GB. Each file in these subsets is then SHA-1 check-summed, and the resulting listing is stored in Dropbox and other services. Based on this listing I can search for files by their (full) file name or by their actual contents (via the checksum).
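
Generating such a listing needs nothing beyond standard tools; a minimal sketch (the directory name is only illustrative):

# SHA-1 checksum every file in one ~10 GB subset into a listing file
find photos_2014_q1/ -type f -print0 | xargs -0 sha1sum > photos_2014_q1.sha1.txt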

data_chunk
Figure 3: Files are divided into subsets of about 10 GB, zipped and then split into 500 MB chunks. These chunks are then uploaded to S3/Glacier and checksummed.

After creating the individual partitions, the actual compression and encryption can be done. The bash script is based on piping the zip and OpenSSL Linux utilities together, and the encryption method used is AES-256 in CBC mode with a random salt. It is crucial not to forget the password for these backups, because there is no recovery method other than brute force.

Since the data in question is mostly RAW camera files, it isn't very compressible, so the resulting zip files are roughly 10 GB each as well. Before uploading, these are split into 500 MB chunks, as illustrated in Figure 3. This enables easier upload retries and a fine-tuned Glacier retrieval rate. The resulting zip files and their chunks of 500 MB or less are also SHA-1 checksummed. Amazon reports so-called ETag hashes, which are based on the MD5 hashes of the uploaded parts. This more or less proves that Amazon received an intact copy of the file over the internet.
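
For reference, the ETag of a multi-part upload is the MD5 of the concatenated binary MD5 digests of the individual parts, with the part count appended after a dash. A hand-rolled sketch of the same calculation (file names are illustrative, 7 MB parts assumed):

# Split one 500 MB chunk into the same 7 MB parts the uploader uses
split -d --bytes=7M subset.zip.aes_p00 /tmp/etag_part_
# MD5 each part, concatenate the raw digests, MD5 again, append "-<part count>"
for p in /tmp/etag_part_*; do md5sum "$p" | cut -c1-32; done | xxd -r -p \
    | md5sum | sed "s/ .*/-$(ls /tmp/etag_part_* | wc -l)/"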

My scripts utilize the standard AWS CLI and the s3md5 command-line utilities, so no Perl or PHP coding is required (except for the Python abort script to clean up aborted multi-part uploads, which is based on this blog post and boto). The zip, encrypt, split and checksum commands look like this:

zip -q -r - path/to/subset | openssl aes-256-cbc -salt -k password > subset.zip.aes
split -d --bytes=500M subset.zip.aes subset.zip.aes_p
sha1sum subset.zip.aes* | tee subset.sha1.txt
ls subset.zip.aes_p* | xargs -n1 ./s3md5.sh 7 | tee subset.etag.txt
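
The clean-up of aborted multi-part uploads (handled by the Python boto script mentioned above) could also be sketched with the plain AWS CLI; the bucket and key names here are illustrative:

# List any incomplete multi-part uploads left behind by failed transfers
aws s3api list-multipart-uploads --bucket my-backup-bucket
# Abort one of them using the Key and UploadId reported by the listing
aws s3api abort-multipart-upload --bucket my-backup-bucket \
    --key subset.zip.aes_p00 --upload-id "<UploadId>"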

Because I have 2 x 250 GB SSD drives in a RAID-0 configuration (via ZFS), even the Core i5 processor is able to calculate checksums in parallel at 1 GB/s (8 Gb/s)! Thus multiple passes over the data are fast.

The current version of s3md5.sh can only process individual files, so xargs is used to generate this sequence of program executions. The parameter "7" tells s3md5.sh that files are uploaded in 7 MB parts. It also gave wrong results if the file happened to split evenly into x MB parts, in which case the last part is a zero-length string which needs to be ignored. I'll be sending a pull request with these patches later. These commands can be used to confirm that the files will decrypt correctly:

cat subset.zip.aes_p* > subset.zip.aes.tmp
openssl aes-256-cbc -d -salt -in subset.zip.aes.tmp -k password > /dev/null
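
The SHA-1 listing generated earlier can be verified in the same spot-check fashion, assuming the checksummed files are present alongside the listing:

# Re-check the archive and its chunks against the stored checksums
sha1sum -c subset.sha1.txt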

Naturally I have written nicer scripts to wrap these commands, to enable easy iteration over files and so on. The actual upload uses the aws s3 cp --storage-class REDUCED_REDUNDANCY command, which is wrapped in the trickle command to limit the upload bandwidth. The S3 bucket is configured to move files to Glacier storage after one day, so the lower storage redundancy is typically sufficient. Reduced redundancy provides 99.99% durability, whereas Glacier guarantees 99.999999999%. If a file goes missing within the one-day period it can easily be re-uploaded (assuming this doesn't go unnoticed). Overall AWS Glacier seems like a good solution if you want to have full control over all aspects of your storage "medium". However, it takes extra care to properly organize your data and to be sure to use the correct encryption keys, and you have to hope that there is no in-RAM corruption if you aren't using ECC RAM (I'm not).
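
For reference, such a wrapped upload looks roughly like this; the bucket name, key prefix and bandwidth cap are illustrative, and the one-day Glacier transition is an ordinary S3 life-cycle rule configured on the bucket:

# Upload one 500 MB chunk with reduced redundancy, capped at ~1000 KB/s upstream
trickle -u 1000 aws s3 cp subset.zip.aes_p00 s3://my-backup-bucket/subset/ \
    --storage-class REDUCED_REDUNDANCY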

This project was initially set up to archive my old files reliably and cheaply to the cloud, and it succeeds well in this task. After finishing the upload of the 520 GB of data, I need to extend these scripts to handle automatic incremental backups of still-changing fresh files. A simple solution would be to use an Elastic Block Store volume at Elastic Compute Cloud, and to use the BitTorrent Sync P2P software to upload encrypted incremental ZIP files. This is a bit less of a concern because I typically still have a copy of those files on my laptop as well.
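
A minimal sketch of how such an encrypted incremental ZIP could be produced (entirely hypothetical at this point; the stamp file marking the previous run is my own addition):

# Archive only files changed since the last run, encrypted the same way as before
find path/to/fresh_files -type f -newer last_backup.stamp -print \
    | zip -q - -@ | openssl aes-256-cbc -salt -k password > incremental.zip.aes
touch last_backup.stamp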

