You might have data that is rarely, if ever, needed, but that you can't delete. You may want to remove it from the cluster storage to save on disk usage fees. Below are two approaches we suggest to achieve this.
We recommend you have two high-quality copies of all original data and difficult-to-reproduce data, and that they reside in different physical locations.
A regular USB hard drive you bought on Amazon does NOT count as a high-quality copy.
We suggest you implement two of the approaches described below, or something similar.
Purchase a small, good-quality desktop RAID system to store your data. Typically this will be called a NAS (Network-Attached Storage), and you can configure it with as many drives as you need. Buy 3.5" enterprise-class (aka server-class) drives and set them up in a redundant RAID configuration (RAID level 5 at least; level 6 would be better). This means that if one of the disks in the system fails, the others will maintain the data and you can replace the bad drive without losing any data. However, you must have someone check on the system periodically to verify its condition, and you should set up email and other alerts so it tells you when there's an issue. Plan on replacing drives: failure becomes more likely as drives age, and it is safest to assume a drive will not last much beyond 5 years of service.
We've had good experiences with the Synology DS (DiskStation) series of RAID systems, for example the DS416. These products have a good user interface and support connection over the network via NFS (Linux/OSX), iSCSI (Linux/OSX), rsync (Linux/OSX), SFTP, and Windows File Services. However, they don't support use as a directly-connected drive over USB.
A: The key is to use RAID level 1 or higher, so that you have redundancy if one of the drives fails. See here: https://tierradatarecovery.co.uk/dummies-guide-to-raid/
A two-bay system will work depending on what “multiple terabytes” means. If it means 3TB, you could put two 4TB drives in there and have a RAID 1 system with 4TB of total storage. But big drives cost more, so it might be better to get a unit with more bays and use smaller drives, e.g. a 4-bay system with four 2TB drives in a RAID 5 configuration will get you 6TB of storage and still allow for one drive to fail without losing data. If you want more peace of mind, get enough bays and large enough drives to run RAID 6, so that two drives can fail at the same time.
No, you want two drives at a minimum so you can at least do RAID 1. You can start with two drives, and then add more and expand the RAID volume later. (At least with the Synology systems) you can start with RAID 1 and then switch to RAID 5 or 6. Also, each drive is limited to the size of the smallest drive in the RAID, so if you start with 2TB drives, you'll want to expand in the future with 2TB drives (or larger drives, but only 2TB of each one will get used).
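The sizing examples above follow a simple rule that can be sketched in a few lines of Python. This is only an illustration: the function name `usable_capacity_tb` is made up for this page, it assumes classic RAID 1/5/6 where every drive contributes only as much as the smallest drive, and it ignores filesystem overhead, so the formatted capacity of a real system will be somewhat lower.

```python
def usable_capacity_tb(drive_sizes_tb, raid_level):
    """Approximate usable capacity (in TB) of a classic RAID 1, 5, or 6 volume.

    Hypothetical helper for illustration only. Assumes every drive is
    limited to the size of the smallest drive in the array.
    """
    # Minimum number of drives each RAID level needs.
    min_drives = {1: 2, 5: 3, 6: 4}
    if raid_level not in min_drives:
        raise ValueError("only RAID levels 1, 5, and 6 are handled here")

    n = len(drive_sizes_tb)
    if n < min_drives[raid_level]:
        raise ValueError(
            f"RAID {raid_level} needs at least {min_drives[raid_level]} drives"
        )

    smallest = min(drive_sizes_tb)
    if raid_level == 1:
        # Mirroring: every drive holds the same data.
        return smallest

    # RAID 5 gives up one drive's worth of space to parity, RAID 6 two.
    parity_drives = 1 if raid_level == 5 else 2
    return (n - parity_drives) * smallest


print(usable_capacity_tb([4, 4], raid_level=1))        # two 4TB drives, RAID 1
print(usable_capacity_tb([2, 2, 2, 2], raid_level=5))  # four 2TB drives, RAID 5
print(usable_capacity_tb([2, 2, 2, 2], raid_level=6))  # four 2TB drives, RAID 6
print(usable_capacity_tb([2, 2, 4], raid_level=5))     # the 4TB drive only contributes 2TB
```

The first two calls reproduce the 4TB and 6TB figures from the examples above; the last one shows why mixing in a larger drive does not add capacity under this rule.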
Some real-world hard drive reliability stats: https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/
A: A typical undergrad or grad student in the sciences should be able to set up and maintain the system with the help of the documentation and Google. We have a couple of Synology DiskStation systems and their interface is very good overall, and reasonably easy to learn while still being powerful. It's all GUI-controlled.
You can set up most (or maybe all) systems to send you email alerts (and maybe text alerts) when there's a problem, but you still want to have a regular schedule for manually checking in on it, say every month, or every two months at most. The manual check takes just a couple of minutes: log in and confirm that there are no warnings or issues that you may have missed because of alert or email problems.
If your data is less than a few hundred gigabytes, you might want to use archive-quality Blu-ray discs. This is a somewhat newer option. Writable Blu-ray discs (BD-R) come in 25GB, 50GB and 100GB sizes.
We suggest you make two copies of critical data and store them in separate buildings.
Be sure to get “M-Disc” labeled discs. These are considered archive quality. And make sure you get a Blu-ray writer that supports M-Disc discs. I purchased one for my personal archiving recently, the LG Electronics External Blu-ray Disc Rewriter BE14NU40.
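To get a rough idea of how many discs to buy for the Blu-ray option, the arithmetic can be sketched as below. The function name `discs_needed` and the `fill_fraction` parameter are assumptions made for this illustration: the usable space on a disc is a bit less than its nominal size, and 0.9 is just a cautious guess, not a measured figure.

```python
import math


def discs_needed(data_gb, disc_gb, copies=2, fill_fraction=0.9):
    """Estimate how many BD-R discs are needed to archive a data set.

    Hypothetical helper for illustration only.
    data_gb:       size of the data to archive, in GB
    disc_gb:       nominal disc size (25, 50, or 100)
    copies:        number of full copies to burn (two are suggested above)
    fill_fraction: assumed usable share of the nominal disc size
    """
    usable_gb = disc_gb * fill_fraction
    discs_per_copy = math.ceil(data_gb / usable_gb)
    return discs_per_copy * copies


# Example: 300GB of data, two copies, on each disc size.
for size in (25, 50, 100):
    print(f"{size}GB discs: {discs_needed(300, size)}")
```

For data sets where the count runs into dozens of discs, the RAID option described earlier is probably less work to manage.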
NOTE 8/2017: PMACS has new options for storage. In particular, the "Research Commodity Storage" may be of use to cluster users because of its stated ability to conform to HIPAA compliance needs. We have not had time to investigate this ourselves. You are welcome to contact PMACS about this and ask for our help to figure out whether the new services are usable by cluster users. http://www.med.upenn.edu/pmacsnewsletter/#PMACSStorageServices
This is a service that provides very easy access to a modern robot-controlled high-availability tape archiving system. It provides a simple filesystem-view interface with simple file retrieval. Custom Linux commands are provided for the user to make their archive copies. Note that this is an archiving service, and is not meant to be a regular backup service. You are able to retrieve files, but such retrievals are expected to be rare.
Pricing is $0.015/GB/mo = $0.18/GB/year = $180/TB/year. This is a great price!
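To estimate the bill for your own data, the conversion above can be written out as a short Python sketch. The rate is the $0.015/GB/mo figure quoted on this page and may have changed since; the function name `archive_cost_usd` is made up for this example, and 1TB is taken as 1000GB to match the $180/TB/year figure.

```python
PRICE_PER_GB_MONTH = 0.015  # USD, as quoted on this page; check current pricing
GB_PER_TB = 1000            # decimal terabytes, matching $180/TB/year


def archive_cost_usd(size_tb, years=1.0):
    """Estimated archive cost in USD for size_tb terabytes kept for `years` years.

    Hypothetical helper for illustration only.
    """
    size_gb = size_tb * GB_PER_TB
    return size_gb * PRICE_PER_GB_MONTH * 12 * years


# Example: 5TB kept for 3 years.
print(f"${archive_cost_usd(5, years=3):,.2f}")
```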
Your data is stored on mirrored tapes, meaning there is a redundant copy on a different set of tapes. However, both copies reside in the same physical system, so a catastrophic event that destroys the system or the data center will wipe out all your data stored there.
HIPAA-protected data: The system is not yet HIPAA-compliant.
STATUS UPDATE 3/2/2017: PMACS has had to change systems because of a loss of vendor support. The newer system is expected to be ready in a month or two, but HIPAA compliance is still on the to-do list.
In order to create a user account PMACS needs this information:
User Info:
PI Info:
Contact: pmacshpc@med.upenn.edu
For more information, see PMACS HPC Services and HPC:Archive System Wiki