Managing Files Using the Command Line

The following sections describe how to use command-line tools to manage files on CARC systems. To manage files with a graphical user interface, you can use the features available with CARC OnDemand or an SFTP GUI app.

Sensitive data

Currently, CARC systems do not support the use or storage of sensitive data. If your research work includes sensitive data, including but not limited to HIPAA-, FERPA-, or CUI-regulated data, see our Secure Computing user guides or contact us at carc-support@usc.edu before using our systems.

Organizing files

Project files should be kept in a consistent directory structure so that they stay organized, documented, and findable. This may include, for example, having separate directories for raw data, processed data, and code.

To list files and directories, use the ls command. For example, to list files in long format for the current directory use:

ls -l

For other directories, add the directory path to the command. Enter man ls or ls --help for more information and to view all available options.

To create a directory, use the mkdir command:

mkdir directory_name

Enter man mkdir or mkdir --help for more information and to view all available options.
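As an illustration of the layout suggested above, the following sketch creates a small project skeleton in one step. The directory names (my_project, data/raw, data/processed, code, docs) are hypothetical, and a temporary directory stands in for a real project path.

```shell
# Hypothetical project skeleton; a temporary directory stands in for a real path
project_root="$(mktemp -d)/my_project"

# -p creates missing parent directories and does not fail if a directory already exists
mkdir -p "$project_root/data/raw" \
         "$project_root/data/processed" \
         "$project_root/code" \
         "$project_root/docs"

# A short README documents what each directory is for
printf 'data/raw: original data\ndata/processed: derived data\ncode: analysis scripts\n' \
    > "$project_root/README.txt"

# List the resulting structure recursively
ls -R "$project_root"
```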

To copy files or directories, use the cp command:

cp /source/path /destination/path

For example, to copy a directory on /scratch1 to /project, use:

cp -r /scratch1/ttrojan/dir /project/ttrojan_123/

The -r option, recursive mode, is needed when copying directories. To print a log of the copying, add the -v option, which enables verbose mode. To copy multiple files or directories to the same destination, simply include additional source paths in the command. Enter man cp or cp --help for more information and to view all available options.

Note: Do not use the -a or -p options with cp if you are copying into a CARC project directory, because this will likely result in incorrect group ownership of files that will produce a "disk quota exceeded" error.
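A minimal sketch of copying several directories at once and then checking the result. Temporary directories stand in for the /scratch1 and /project paths, and the directory and file names are hypothetical.

```shell
# Temporary directories stand in for source and destination file systems
src="$(mktemp -d)"
dest="$(mktemp -d)"
mkdir -p "$src/results" "$src/logs"
echo "example output" > "$src/results/output.txt"
echo "log line" > "$src/logs/run.log"

# Copy two directories to one destination: -r recurses, -v prints each copied path.
# Neither -a nor -p is used, so ownership, mode, and timestamps are not preserved.
cp -rv "$src/results" "$src/logs" "$dest/"

# diff -r exits with status 0 only if the directory trees have identical contents
diff -r "$src/results" "$dest/results" && diff -r "$src/logs" "$dest/logs" && echo "copies match"
```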

To move files or directories (i.e., copy and also remove the files from the source), use the mv command instead:

mv /source/path /destination/path

To rename files, you can also use the mv command:

mv /source/filename.txt /source/newfilename.txt

Note: Do not use mv if you are moving files from a home or scratch directory into a CARC project directory. This results in incorrect group ownership of files that will produce a "disk quota exceeded" error. Use cp -r instead to copy the files and then use rm to remove the source files.
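The copy-then-remove approach described in the note could be sketched as follows. The verification step with diff is an added precaution, not a CARC requirement, and the temporary directories are stand-ins for real home, scratch, and project paths.

```shell
# Temporary directories stand in for a scratch directory and a project directory
src_parent="$(mktemp -d)"
dest="$(mktemp -d)"
mkdir -p "$src_parent/dir"
echo "data" > "$src_parent/dir/file.txt"

# Copy, verify that both trees are identical, and only then remove the source
if cp -r "$src_parent/dir" "$dest/" && diff -r "$src_parent/dir" "$dest/dir"; then
    rm -r "$src_parent/dir"
    echo "copy verified; source removed"
else
    echo "copy could not be verified; source left in place" >&2
fi
```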

If you are backing up and syncing a directory, use an rsync command. For example:

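rsync -rlt /source/dir/ /destination/dir/

The -r option copies directories recursively (without it, rsync skips directories), -l copies symbolic links as links, and -t preserves modification times so that unchanged files can be skipped on later runs. The trailing slash on the source path copies the contents of the directory rather than the directory itself. Rsync will copy only files that are new or have changed in the source directory. Enter man rsync or rsync --help for more information and to view all available options.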

Note: Do not use the -a or -p options with rsync if you are copying into a CARC project directory. This generally results in incorrect group ownership of files that will produce a "disk quota exceeded" error. Use rsync -rlt instead.
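Before syncing, it can help to preview what rsync would do. The sketch below assumes rsync is installed and uses temporary directories in place of real paths; the -n (dry run) and -i (itemize changes) options are standard rsync options.

```shell
# Temporary directories stand in for the source and backup locations
src="$(mktemp -d)"
dest="$(mktemp -d)"
echo "v1" > "$src/a.txt"

# Preview only: -n makes no changes, -i lists each file that would be transferred
rsync -rlt -n -i "$src/" "$dest/"

# Perform the sync for real
rsync -rlt "$src/" "$dest/"

# A repeated run should have nothing new to transfer for unchanged files
rsync -rlt -i "$src/" "$dest/"
```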

To delete files or directories, use the rm command:

rm /path/to/file

For example, to delete a directory, use:

rm -r /scratch1/ttrojan/dir

The -r option, recursive mode, is needed to remove directories. To remove multiple files or directories, simply add additional paths to the command. Enter man rm or rm --help for more information and to view all available options.
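Because rm -r cannot be undone, a cautious pattern is to review the target first and to guard against an empty path variable. This is a sketch with a hypothetical directory name created under a temporary directory.

```shell
# Hypothetical directory to delete, created under a temporary directory
target="$(mktemp -d)/old_results"
mkdir -p "$target"
touch "$target/a.txt" "$target/b.txt"

# Review what is about to be deleted
ls -l "$target"

# ${target:?} makes the shell stop with an error if the variable is unset or empty,
# so rm never runs with an empty path; -- marks the end of options
rm -r -- "${target:?}"
```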

Checking file disk usage

To check the disk usage of files and directories, use the du -h command:

du -h /path/to/file

Please note that all file systems run ZFS, which compresses files, so the file size on disk may be smaller than the actual file size (on your local computer, for example). The du --apparent-size -h command reports the uncompressed file size. Alternatively, the ls -lh command should give the same result for individual files. Enter man du or du --help for more information and to view all available options.

To list the files or subdirectories in the current directory and sort by size, enter the command cdiskusage. This is a convenience script that uses the du command. Please note that it may take a long time to run for large directories (e.g., the root of a project directory).
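For comparison, a similar sorted listing can be produced with standard GNU tools. This is only a sketch, not the cdiskusage script itself, and it assumes GNU du and sort (for the --max-depth, --apparent-size, and -h options). The directory and file names are hypothetical.

```shell
# Build a small example tree under a temporary directory
dir="$(mktemp -d)"
mkdir -p "$dir/small" "$dir/large"
head -c 1024 /dev/zero > "$dir/small/file.bin"      # 1 KiB file
head -c 1048576 /dev/zero > "$dir/large/file.bin"   # 1 MiB file

# Size on disk versus apparent (uncompressed) size for the whole tree
du -sh "$dir"
du -sh --apparent-size "$dir"

# Totals for each subdirectory, sorted from smallest to largest;
# sort -h understands human-readable suffixes such as K and M
du -h --max-depth=1 "$dir" | sort -h
```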

Sharing files

The /project directories are the best place to share files. By default, the members of a project group will have full read, write, and execute permissions for all files in a project directory (i.e., permissions set to 770 = drwxrwx---).

You can check the current permissions for a file or directory with the command ls -l /path/to/file.

When sharing your files, please keep the following in mind:

  • Never set the permissions of your directories to 777 (drwxrwxrwx), which means that any other user on CARC systems can access, modify, and delete your files.
  • Do not share or change the permissions of your /home1 directory and its subdirectories. If something goes wrong, you may be blocked from logging in because SSH requires strict permissions for logging in.
  • Granting other users read permission for your files (r--) and read and execute permissions (r-x) for your directories is typically sufficient for sharing. Granting write permission can result in modified or deleted files, so only provide write permission when actually needed.

You can change file and directory permissions using a chmod command.

For example, to provide read and execute permissions but not write permission (r-x) to a project subdirectory for your project group, use:

chmod 750 /project/ttrojan_123/dir

If the subdirectory is actually located within another subdirectory, note that the group would also need read and execute permission to the full hierarchy of subdirectories. Granting write permission to a directory allows users to create, modify, or delete files in that directory, also depending on individual file permissions. Enter man chmod or chmod --help for more information and to view all available options.
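To illustrate the point about the directory hierarchy, the sketch below sets group read and execute permissions on each level of a nested directory and read-only group access on a file, then checks the result. The paths are hypothetical and live under a temporary directory; stat -c is the GNU form of the command.

```shell
# Hypothetical nested directory under a temporary directory
project_dir="$(mktemp -d)"
mkdir -p "$project_dir/shared/results"
echo "data" > "$project_dir/shared/results/table.csv"

# Directories: owner rwx, group r-x, others none (750) at every level being shared
chmod 750 "$project_dir/shared" "$project_dir/shared/results"

# File: owner rw-, group r--, others none (640)
chmod 640 "$project_dir/shared/results/table.csv"

# Check the resulting permissions
ls -ld "$project_dir/shared" "$project_dir/shared/results"
ls -l "$project_dir/shared/results/table.csv"
stat -c '%a %n' "$project_dir/shared/results/table.csv"
```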

Backing up files

Although the /home1 and /project file systems have some file recovery capabilities, we encourage you to also back up your files elsewhere. There are a few different backup locations to consider:

  • Local storage (e.g., external drive)
  • Cloud storage
  • Research data repositories

To transfer files to local or cloud storage, see our guide for Transferring Files Using the Command Line. Rsync is useful for syncing to a backup directory on local storage, and Rclone works similarly for cloud storage. For large transfers to local or cloud storage, Globus can sync two directories in a similar manner. However, these tools do not necessarily keep multiple versions of a backup by default. Tools with more features designed for backups, such as versioning, deduplication, and compression, include rdiff-backup, Borg, Kopia, and Restic.

Research data repositories, such as OSF, Zenodo, Harvard Dataverse, and Dryad, are a special type of cloud storage intended for sharing research data with the wider research community. These services typically have an API that can be used at the command line to upload files directly from CARC systems.

For long-term archival storage, also consider using a research data repository. For private archival storage, you can also consult the USC Digital Repository.

Alias commands or backup scripts can help semi-automate backups.
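As one possible shape for such a script, the sketch below defines a hypothetical shell function, backup_dir, that writes a timestamped .tar.gz archive of a directory. The function name, file naming scheme, and paths are all assumptions for illustration.

```shell
# Hypothetical helper: archive a directory into a timestamped .tar.gz file
# Usage: backup_dir <source_dir> <backup_root>
backup_dir() {
    src_dir="$1"
    backup_root="$2"
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$backup_root/$(basename "$src_dir")-$stamp.tar.gz"
    mkdir -p "$backup_root"
    # -C switches to the parent directory so the archive stores relative paths
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" && echo "$archive"
}

# Example run against a temporary directory
work="$(mktemp -d)"
mkdir -p "$work/project"
echo "notes" > "$work/project/notes.txt"
latest_backup="$(backup_dir "$work/project" "$work/backups")"

# List the contents of the new backup
tar -tzf "$latest_backup"
```

A function like this could be placed in a shell startup file and wrapped in an alias for a frequently backed-up directory. It does not copy the archive off-site, so it covers only part of a backup plan.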

A good plan for backups is the 3-2-1 strategy:

  • 3 copies of data
  • 2 different media (e.g., devices or file systems)
  • 1 copy off-site (e.g., cloud storage)

Also make sure to periodically test that backups are accessible and functional. A good rule of thumb is to test every three months.

Archiving and compressing files

Archiving and compressing files can help simplify file organization and save storage space, such as after a project is completed and the associated files are not needed in the immediate future. This is also useful for packaging project files in order to distribute them to other researchers, for example. You can use a combination of the programs tar for archiving files and gzip or xz for compressing files.

Archiving with tar

To create an archive file from a directory of files, use the tar command. For example:

tar -cvf <filename>.tar <dir>

To add multiple directories and files, simply add the paths to these directories and files in the command. To verify the archive against the source files after it is written, add the -W option.

To extract the archive, use the -x option instead of the -c option. For example:

tar -xvf <filename>.tar

Note that the .tar file will be larger in size than the sum of all the files being archived, primarily because of the added file headers in the archive file. Enter man tar or tar --help for more information and to view all available options.
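A short round trip can confirm that an archive contains what you expect before the originals are removed. The sketch below uses hypothetical file names under a temporary directory, lists the archive with the -t option, and extracts it into a separate directory with the -C option.

```shell
# Example directory to archive, under a temporary directory
work="$(mktemp -d)"
mkdir -p "$work/dir" "$work/restore"
echo "alpha" > "$work/dir/a.txt"
echo "beta" > "$work/dir/b.txt"

# Create the archive; -C makes the stored paths relative to $work
tar -cvf "$work/dir.tar" -C "$work" dir

# List the archive contents without extracting anything
tar -tvf "$work/dir.tar"

# Extract into a different directory instead of the current one
tar -xvf "$work/dir.tar" -C "$work/restore"

# Compare the extracted copy with the original
diff -r "$work/dir" "$work/restore/dir" && echo "archive round trip OK"
```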

Compressing with gzip

To compress files using gzip, use:

gzip -v <filename>

This will create a .gz file. Including the -v option, verbose mode, will print the compression ratio. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time, add the -9 option.

To uncompress a .gz file, add the -d option:

gzip -dv <filename>.gz

Enter gzip --help to view all available options. In addition, the pigz module is a parallel implementation of gzip that provides faster compression and uncompression times: module load pigz. It can be used as a drop-in replacement for gzip commands.
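The sketch below walks through a gzip round trip on a hypothetical text file, using the standard -t (test integrity) and -l (list sizes and ratio) options along the way. Note that gzip replaces the input file with the .gz file, so a separate copy is kept here only for the final comparison.

```shell
# Create a compressible example file under a temporary directory
work="$(mktemp -d)"
seq 1 10000 > "$work/data.txt"
cp "$work/data.txt" "$work/original.txt"

# Compress: data.txt is replaced by data.txt.gz; -v prints the compression ratio
gzip -v "$work/data.txt"

# Test the integrity of the compressed file, then list its sizes and ratio
gzip -t "$work/data.txt.gz" && echo "integrity OK"
gzip -l "$work/data.txt.gz"

# Uncompress: data.txt.gz is replaced by data.txt
gzip -dv "$work/data.txt.gz"

# Confirm that the round trip reproduced the original bytes
cmp "$work/data.txt" "$work/original.txt" && echo "round trip OK"
```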

Compressing with xz

For better compression ratios or for maximum compression, use xz instead of gzip. With xz you can also use multiple cores to speed up the compression time. For example, to compress using 4 cores, add the -T4 option:

xz -v -T4 <filename>

This will create a .xz file. Including the -v option, verbose mode, will print compression progress and related information. Compression levels range from 0 to 9, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time and memory required, add the -9 option.

To uncompress an .xz file, add the -d option:

xz -dv -T4 <filename>.xz

Enter man xz or xz -H for more information and to view all available options.

Archiving and compressing with tar

You can also archive and compress with one command using tar with the -z option, which compresses the archive with gzip. For example:

tar -czvf <filename>.tar.gz <dir>

Alternatively, to use xz to compress, use the -J option instead. In contrast to using gzip or xz separately, tar does not delete the source files by default. Add the --remove-files option to do so.

To uncompress and unarchive in one command, use the -x option:

tar -xvf <filename>.tar.gz

This will extract the contents of the archive into the current directory. tar automatically detects which uncompression program to use. Note that it does not delete the compressed archive file after extracting the files.

Software for Linux is typically distributed as a .tar.gz file, so a command like the above will extract the source code or binary files into the current directory.

Archiving and compressing before transferring files

Creating and compressing a single archive file can be useful before transferring files to or from CARC systems, especially for directories with a large number of files (e.g., > 1000, regardless of the total size of those files). Each file has associated metadata, which slows down the transfer. Compressing files will reduce the amount of data that needs to be transferred. However, it takes time to compress and uncompress files, so the total transfer time may not necessarily decrease depending on factors like network speeds. With fast network speeds, relative to total transfer size, it is typically not worth compressing files (e.g., when on campus).
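To decide whether bundling is worthwhile, you can count the files first and then create a single archive. The sketch below uses a hypothetical directory of 50 small files. Recording a checksum with sha256sum is an optional extra step, not something this guide requires, for confirming that the archive arrived intact after a transfer.

```shell
# Create a directory with many small files under a temporary directory
work="$(mktemp -d)"
mkdir -p "$work/many_files"
for i in $(seq 1 50); do
    echo "record $i" > "$work/many_files/file_$i.txt"
done

# Count the regular files that would otherwise be transferred one by one
file_count="$(find "$work/many_files" -type f | wc -l)"
echo "files to transfer: $file_count"

# Bundle and compress them into a single archive
tar -czf "$work/many_files.tar.gz" -C "$work" many_files

# Record a checksum next to the archive (relative file name inside the checksum file)
(cd "$work" && sha256sum many_files.tar.gz > many_files.tar.gz.sha256)

# After transferring both files, the same check can be run at the destination
(cd "$work" && sha256sum -c many_files.tar.gz.sha256)
```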
