Research Data Preservation

Last updated March 11, 2024

Table of Contents

1 Backing up data
2 Archiving and compressing data
3 CARC Cold Storage System

Although the /home1, /project, and /project2 file systems have some data recovery capabilities, the Center for Advanced Research Computing (CARC) strongly encourages you to also back up your data elsewhere. There are a few different backup locations to consider:

Local storage (e.g., external drive)
Cloud storage
Research data repositories

1 Backing up data

A good plan for backups is the 3-2-1 strategy:

3 copies of data
2 different media (e.g., devices or file systems)
1 copy off-site (e.g., cloud storage)

See our Transferring Research Data guides for instructions on how to use:

External tools with additional features, such as deduplication and compression, include:

Alias commands or backup scripts can help semi-automate backups. It is important to regularly test that backups are accessible and functional. A good rule of thumb is to test every three months.
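As a minimal sketch of a semi-automated backup, the function below archives a directory into a timestamped .tar.gz file and then lists the archive to confirm that it can be read back. The function name backup_dir, the file naming scheme, and the demo paths are illustrative assumptions, not tools provided by CARC.

```shell
# Hypothetical helper: archive SRC into DEST as <name>-<timestamp>.tar.gz,
# then confirm the archive can be read back. Prints the archive path.
backup_dir() {
    src=$1
    dest=$2
    name=$(basename "$src")
    archive="$dest/$name-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest" || return 1
    # -C changes into the parent directory so the archive stores relative paths
    tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
    # Listing the contents reads the whole archive: a basic readability test
    tar -tzf "$archive" > /dev/null || return 1
    printf '%s\n' "$archive"
}

# Demo in a temporary directory (all paths are placeholders)
work=$(mktemp -d)
mkdir -p "$work/mydata"
echo "sample results" > "$work/mydata/results.txt"
backup_file=$(backup_dir "$work/mydata" "$work/backups")
echo "Created $backup_file"
```

A function like this could be wrapped in an alias or called from a scheduled job. Restoring a file from the archive from time to time is a stronger test than listing its contents.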

Research data repositories, such as OSF, Zenodo, Harvard Dataverse, and Dryad, are a special type of cloud storage intended for sharing research data with the wider research community. These services typically have an API that can be used at the command line to upload data directly from CARC systems.

CARC also offers its own research data repository for long-term archival storage (see the CARC Cold Storage System section below).

As part of the process of backing up data, you can also create a single archive file containing multiple files and directories using tar (see the Archiving and compressing data section below). This may be useful for versioning and organizing backups.

2 Archiving and compressing data

Archiving and compressing data can help simplify data organization and save storage space, such as after a project is completed and the associated data are not needed in the immediate future. This is also useful for packaging project data in order to distribute them to other researchers. You can use a combination of the programs tar for archiving data and gzip or xz for compressing data.

2.1 Archiving with tar

To create an archive file from a directory of files, use the tar command. For example:

tar -cvf <filename>.tar <dir>

To add multiple directories and files, list their paths at the end of the command. To have tar verify the archive after writing it, add the -W option.

To extract the archive, use the -x option instead of the -c option. For example:

tar -xvf <filename>.tar

Note that the .tar file will be larger in size than the sum of all the files being archived, primarily because of the added file headers in the archive file. Enter man tar or tar --help for more information and to view all available options.
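The following sketch illustrates the points above in a temporary directory: it archives two directories in one command, lists the archive, extracts it elsewhere, and compares the archive size with the total size of the input files. The directory and file names are placeholders chosen for this example.

```shell
# Demo: archive two small directories, list the archive, and compare sizes
work=$(mktemp -d)
cd "$work" || exit 1
mkdir dir1 dir2
echo "alpha" > dir1/a.txt
echo "beta" > dir2/b.txt

# Several paths can be given after the archive name
tar -cf demo.tar dir1 dir2

# -t lists the archive contents without extracting
tar -tf demo.tar

# -C extracts into a different directory instead of the current one
mkdir restored
tar -xf demo.tar -C restored

# Compare the total size of the input files with the archive size
files_bytes=$(($(cat dir1/a.txt dir2/b.txt | wc -c)))
tar_bytes=$(($(wc -c < demo.tar)))
echo "files: $files_bytes bytes, archive: $tar_bytes bytes"
```

For very small inputs like these, the headers and padding dominate the archive size. For large files the overhead is proportionally small.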

2.2 Compressing with gzip

To compress data using gzip, use:

gzip -v <filename>

This creates a .gz file. Including the -v option (verbose mode) will print the compression ratio. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time, add the -9 option.

To uncompress a .gz file, add the -d option:

gzip -dv <filename>.gz

Enter gzip --help to view all available options. In addition, the pigz module is a parallel implementation of gzip that provides faster compression and uncompression times: module load pigz. It can be used as a drop-in replacement for gzip commands.
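Note that gzip replaces the original file with the .gz file by default. The sketch below uses the -c option to write the compressed data to standard output instead, which keeps the original so that the default level and the -9 level can be compared. The sample file is generated only for illustration.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
# Generate a compressible sample file (numbers 1 to 20000, one per line)
seq 1 20000 > sample.txt

# -c writes to standard output, so the original file is kept
gzip -c sample.txt > sample-default.txt.gz
gzip -9 -c sample.txt > sample-max.txt.gz

# Compare sizes in bytes
orig_bytes=$(($(wc -c < sample.txt)))
gz_bytes=$(($(wc -c < sample-default.txt.gz)))
gz9_bytes=$(($(wc -c < sample-max.txt.gz)))
echo "original: $orig_bytes, default level: $gz_bytes, level 9: $gz9_bytes"

# Uncompress to a new file and confirm it matches the original
gzip -dc sample-max.txt.gz > roundtrip.txt
cmp sample.txt roundtrip.txt && echo "round trip OK"
```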

2.3 Compressing with xz

For better compression ratios or for maximum compression, use xz instead of gzip. With xz you can also use multiple cores to speed up the compression time. For example, to compress using 4 cores, add the -T4 option:

xz -v -T4 <filename>

This creates a .xz file. Including the -v option (verbose mode) will print compression progress and related information. Compression levels range from 0 to 9, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time and memory required, add the -9 option.

To uncompress an .xz file, add the -d option:

xz -dv -T4 <filename>.xz

Enter man xz or xz -H for more information and to view all available options.
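As a small sketch of a few other xz options, the example below keeps the original file with -k, prints size information with -l, and tests the compressed file with -t. The sample file and the choice of two threads are assumptions for this example, and the demo is skipped if xz is not installed.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
seq 1 20000 > sample.txt

if command -v xz > /dev/null 2>&1; then
    # -k keeps the original file; -T2 requests two threads
    xz -k -T2 sample.txt
    # -l prints the compressed size, uncompressed size, and ratio
    xz -l sample.txt.xz
    # -t tests the integrity of the compressed file without extracting it
    xz -t sample.txt.xz && xz_status="ok"
else
    xz_status="skipped"
    echo "xz not found; skipping demo"
fi
```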

2.4 Archiving and compressing with tar

You can also archive and compress with one command using tar with the -z option, which compresses the archive with gzip. For example:

tar -czvf <filename>.tar.gz <dir>

Alternatively, to use xz to compress, use the -J option instead. In contrast to using gzip or xz separately, tar does not delete the source files by default. Add the --remove-files option to do so.

To uncompress and unarchive in one command, use the -x option:

tar -xvf <filename>.tar.gz

This extracts the contents of the archive into the current directory. tar automatically detects which uncompression program to use. Note that the command will not automatically delete the compressed archive file after extraction.
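The round trip can be sketched as follows, using a placeholder directory named project. The example lists the compressed archive before extracting it and uses -C to extract into a separate directory rather than the current one.

```shell
work=$(mktemp -d)
cd "$work" || exit 1
mkdir project
echo "run 1" > project/log.txt

# Create a gzip-compressed archive; the source directory is left in place
tar -czf project.tar.gz project

# List the contents without extracting; tar detects the compression itself
tar -tf project.tar.gz

# Extract into a different directory instead of the current one
mkdir extracted
tar -xf project.tar.gz -C extracted
cmp project/log.txt extracted/project/log.txt && echo "contents match"
```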

Software for Linux is typically distributed as a .tar.gz file, so a command like the above extracts the source code or binary files into the current directory.

2.5 Archiving and compressing before transferring files

Creating and compressing a single archive file can be useful before transferring data to or from CARC systems, especially for directories with a large number of files (e.g., > 1000, regardless of the total size of those files). Each file has associated metadata, which can slow down the transfer. Compressing files reduces the amount of data that needs to be transferred. However, it takes time to compress and uncompress files, so the total transfer time may not necessarily decrease depending on factors like network speeds. With fast network speeds, relative to total transfer size, it is typically not worth compressing files (e.g., when on campus).
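One way to decide whether bundling is worthwhile is to count the files in a directory first. The sketch below does this with two hypothetical helpers, count_files and should_bundle. The default threshold of 1000 reflects the rough figure given above and is not a hard limit.

```shell
# Hypothetical helper: print the number of regular files under a directory
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: succeed if the directory holds more files than the
# threshold (default 1000, the rough figure mentioned above)
should_bundle() {
    [ "$(count_files "$1")" -gt "${2:-1000}" ]
}

# Demo with a small directory and a lowered threshold of 3
work=$(mktemp -d)
mkdir "$work/many"
for i in 1 2 3 4 5; do
    echo "$i" > "$work/many/file$i.txt"
done

if should_bundle "$work/many" 3; then
    tar -czf "$work/many.tar.gz" -C "$work" many
    echo "Bundled $(count_files "$work/many") files into many.tar.gz"
fi
```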

3 CARC Cold Storage System

The CARC Cold Storage System is intended for long-term (e.g., more than 5 years) storage of large data sets (TB to PB scale). It is a fee-based service platform at a current rate of $20/TB/year. Usage updates are sent out monthly, with payment for the year due midway through the fiscal year (July 1 - June 30). Usage will be calculated on a daily average to account for adding and removing data throughout the year.
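To get a rough sense of how a daily average translates into an annual fee, the sketch below averages one usage value per day and multiplies by the rate. The exact billing formula is an assumption here, as are the helper name estimate_cost and the example usage pattern. Treat the result as an estimate only.

```shell
# Hypothetical estimate: average daily usage (TB) times the rate per TB per year.
# Input on standard input: one line per day with that day's usage in TB.
estimate_cost() {
    awk -v rate="${1:-20}" '
        { total += $1; days++ }
        END { if (days > 0) printf "%.2f\n", (total / days) * rate }
    '
}

# Example: 10 TB stored for the first 182 days, then 20 TB for the next 183 days
cost=$(awk 'BEGIN { for (d = 1; d <= 365; d++) print (d <= 182 ? 10 : 20) }' |
    estimate_cost 20)
echo "Estimated annual cost: \$$cost"
```

In this example the daily average is 5480 / 365, or about 15.01 TB, so the estimate is about $300.27 rather than the $400 that the end-of-year usage alone would suggest.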

PIs must request a cold storage allocation via the CARC user portal. For more information on requesting allocations, see our Request a New Allocation user guide.

Cold storage is not intended as a system for frequently backing up and retrieving data. Copying data into and out of cold storage is notably slower than on the other available file systems.

CARC’s Cold Storage System preserves one copy of the stored data in one location with no regularly performed data integrity checks. PIs interested in multiple copies of their data and integrity checks should use the USC Digital Repository for their data archiving needs instead. Please submit a help ticket and the CARC team will assist you in facilitating this service.

3.1 Accessing cold storage

Cold storage is only accessible from the login nodes, including discovery[1-2], endeavour[1-2], and knoll.

CARC systems (including the Cold Storage System) require users to connect to the USC Secure Wireless network or secure wired network while on campus. For off-campus users, a connection to a USC VPN is required. Instructions for connecting to a USC VPN can be found here.

3.2 Adding data to cold storage

From a login node, use the command arcput to add a file to cold storage. This command only accepts files, not directories or links. Group read permission is required for the file that you want to archive. The file must be larger than 10 MB to avoid slowing down the Cold Storage System with small files.

To upload files smaller than 10 MB, use tar to bundle them into a single archive file. tar can also bundle directories into a single archive file. The arcput command returns instantly; however, data copying continues in the background on the Cold Storage System.

arcput can be used repeatedly to upload multiple files, even if the previous files have not finished copying yet. Each additional arcput request is queued in the background.

To check if the file has finished copying to cold storage, use the arcls command described in the next section.

Do not delete files before they have finished copying to cold storage with arcput. You risk permanently losing your data. Check that arcput has successfully completed with arcls before deleting data from the source directory.
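The requirements above (a regular file, larger than 10 MB, with group read permission) can be checked before calling arcput. The sketch below is a hypothetical pre-check. The function name can_arcput is illustrative, and treating 10 MB as 10 * 1024 * 1024 bytes is an assumption. It does not call arcput itself, which is only available on CARC login nodes.

```shell
# Hypothetical pre-check mirroring the arcput requirements described above.
# Assumption: "10 MB" is treated here as 10 * 1024 * 1024 bytes.
MIN_BYTES=$((10 * 1024 * 1024))

can_arcput() {
    f=$1
    # Must be a regular file, not a directory or a symbolic link
    if [ -L "$f" ] || [ ! -f "$f" ]; then
        echo "not a regular file: $f" >&2
        return 1
    fi
    # Must be larger than the minimum size
    if [ "$(($(wc -c < "$f")))" -le "$MIN_BYTES" ]; then
        echo "too small for cold storage: $f" >&2
        return 1
    fi
    # Must be group readable: the fifth character of the ls mode string is "r"
    case $(ls -ld -- "$f") in
        ????r*) return 0 ;;
        *) echo "missing group read permission: $f" >&2; return 1 ;;
    esac
}

# Demo with a placeholder file of 11 MiB
work=$(mktemp -d)
dd if=/dev/zero of="$work/big.tar" bs=1048576 count=11 2> /dev/null
chmod 640 "$work/big.tar"
if can_arcput "$work/big.tar"; then
    echo "ready: arcput $work/big.tar"   # on a CARC login node, run arcput here
fi
```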

3.3 Listing data stored on cold storage

arcls displays a list of all the files stored on the Cold Storage System for which you have group read permission.

$ arcls
{'filename': '/home1/ttrojan/mpi_io.exe', 'uid': 123456, 'gid': 0001, 'size': 112120, 'timestamp': '2023-03-22 11:11:44.146184'}
{'filename': '/project2/hpcroot/ttrojan/test01', 'uid': 123456, 'gid': 0001, 'size': 4194304000, 'timestamp': '2021-09-30 22:14:28'}

Each variable is described below:

Variable	Description
`filename`	The name of the file, listed as its full directory path.
`uid`	The number representing your user ID (e.g., ttrojan).
`gid`	The number representing your group ID (e.g., ttrojan_123).
`size`	The size of the file in bytes.
`timestamp`	The time and date that the file was originally created.
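Because arcls prints one record per line, standard text tools can pull out individual fields. The sketch below defines a hypothetical helper, arc_size, that reads arcls output and prints the recorded size for a given path. It assumes the output format matches the sample above and is demonstrated on those sample lines rather than on live output.

```shell
# Hypothetical helper: read arcls output on standard input and print the
# recorded size in bytes for the given full path (prints nothing if absent).
arc_size() {
    grep -F "'filename': '$1'," |
        sed -n "s/.*'size': \([0-9][0-9]*\),.*/\1/p" |
        head -n 1
}

# Sample lines copied from the arcls output shown above
sample="{'filename': '/home1/ttrojan/mpi_io.exe', 'uid': 123456, 'gid': 0001, 'size': 112120, 'timestamp': '2023-03-22 11:11:44.146184'}
{'filename': '/project2/hpcroot/ttrojan/test01', 'uid': 123456, 'gid': 0001, 'size': 4194304000, 'timestamp': '2021-09-30 22:14:28'}"

stored=$(printf '%s\n' "$sample" | arc_size /project2/hpcroot/ttrojan/test01)
echo "size recorded in cold storage: $stored bytes"
# On a login node, the helper could read live output: arcls | arc_size <full path>
```

Comparing the printed value with the size of the source file (for example, from wc -c) is one possible extra check before deleting the source.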

To check your current usage on the Cold Storage System, use arcquota.

$ arcquota
{'gid': 0001, 'size': 22332195885, 'num': 8, 'sizelimit': 1099511627776000, 'numlimit': 1000000}

Each variable is described below:

Variable	Description
`gid`	The number representing your group ID (e.g., ttrojan_123).
`size`	The sum of all your files in bytes.
`num`	The number of your files stored in cold storage.
`sizelimit`	The maximum amount of data that can be uploaded to the Cold Storage System (in bytes).
`numlimit`	The maximum number of files that can be uploaded to the Cold Storage System.
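The size value is reported in bytes, which is hard to read at this scale. As a sketch, the hypothetical helper quota_gib below converts it to GiB (bytes divided by 1024^3), using the sample arcquota output shown above. Whether billing uses decimal or binary units is not stated here, so the conversion is for readability only.

```shell
# Hypothetical helper: read arcquota output on standard input and print the
# usage in GiB (bytes / 1024^3) with two decimal places.
quota_gib() {
    sed -n "s/.*'size': \([0-9][0-9]*\),.*/\1/p" |
        awk '{ printf "%.2f\n", $1 / (1024 * 1024 * 1024) }'
}

# Sample line copied from the arcquota output shown above
sample="{'gid': 0001, 'size': 22332195885, 'num': 8, 'sizelimit': 1099511627776000, 'numlimit': 1000000}"
used=$(printf '%s\n' "$sample" | quota_gib)
echo "cold storage usage: $used GiB"
# On a login node, the helper could read live output: arcquota | quota_gib
```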

Currently, there is no quota on how much data you can store in the Cold Storage System. CARC reserves the right to implement a quota system at any time.

3.4 Retrieving data from cold storage

Use the command arcget file_full_path to restore a file to its original path, or arcget file_full_path alternative_file_path to restore it to a different location. The command returns instantly; however, data copying continues in the background on the Cold Storage System. Check the target directory to confirm that the file has finished restoring. Alternatively, the file will no longer appear in the Cold Storage System when you run arcls (see the Listing data stored on cold storage section above).

3.5 Deleting data on cold storage

Use the command arcdel full_file_path to delete a file stored on the Cold Storage System. Files deleted from cold storage cannot be retrieved. Before deleting, ensure that important data has been retrieved and successfully restored (see the Retrieving data from cold storage section above).