Managing Files Using the Command Line
The following sections describe how to use command-line tools to manage files on CARC systems. To manage files with a graphical user interface, you can use the features available with CARC OnDemand or an SFTP GUI app.
Sensitive data
Currently, CARC systems do not support the use or storage of sensitive data. If your research work includes sensitive data, including but not limited to HIPAA-, FERPA-, or CUI-regulated data, see our Secure Computing user guides or contact us at carc-support@usc.edu before using our systems.
Organizing files
Organize project files in a directory structure so that they stay documented and findable. For example, keep separate directories for raw data, processed data, and code.
To list files and directories, use the ls
command. For example, to list files in long format for the current directory use:
ls -l
For other directories, add the directory path to the command. Enter man ls
or ls --help
for more information and to view all available options.
To create a directory, use the mkdir
command:
mkdir directory_name
Enter man mkdir
or mkdir --help
for more information and to view all available options.
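As a minimal sketch of the layout described above, the following creates separate directories for raw data, processed data, and code. The directory names and the temporary location are assumptions for illustration; on CARC systems you would use a path under your own project directory instead.

```shell
# Hypothetical layout; a temporary directory stands in for a real project path.
project_root=$(mktemp -d)

# -p creates parent directories as needed and does not fail if they already exist.
mkdir -p "$project_root/data/raw" \
         "$project_root/data/processed" \
         "$project_root/code"

# List the result in long format to confirm the structure.
ls -l "$project_root" "$project_root/data"
```

Using mkdir -p makes the command safe to rerun, which is convenient if it is kept in a setup script.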
To copy files or directories, use the cp
command:
cp /source/path /destination/path
For example, to copy a directory on /scratch1 to /project, use:
cp -r /scratch1/ttrojan/dir /project/ttrojan_123/
The -r
option, recursive mode, is needed when copying directories. To print a log of the copying, add the -v
option, which enables verbose mode. To copy multiple files or directories to the same destination, simply include additional source paths in the command. Enter man cp
or cp --help
for more information and to view all available options.
Note: Do not use the -a or -p options with cp if you are copying into a CARC project directory, because this will likely result in incorrect group ownership of files, which will produce a "disk quota exceeded" error.
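The following sketch copies a directory recursively without the -a or -p options and then checks the copy with diff -r. The paths are temporary stand-ins for the /scratch1 and /project locations in the example above, and the file names are hypothetical.

```shell
workdir=$(mktemp -d)
src="$workdir/scratch/dir"     # stand-in for /scratch1/ttrojan/dir
dest="$workdir/project"        # stand-in for /project/ttrojan_123
mkdir -p "$src/sub" "$dest"
echo "sample" > "$src/a.txt"
echo "nested" > "$src/sub/b.txt"

# Recursive, verbose copy. Without -a or -p, the source's ownership
# and timestamps are not carried over to the copies.
cp -rv "$src" "$dest/"

# diff -r exits with status 0 only if both trees have identical content.
if diff -r "$src" "$dest/dir"; then
    copy_ok=yes
else
    copy_ok=no
fi
echo "copy verified: $copy_ok"
```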
To move files or directories (i.e., copy and also remove the files from the source), use the mv
command instead:
mv /source/path /destination/path
To rename files, you can also use the mv
command:
mv /source/filename.txt /source/newfilename.txt
Note: Do not use mv if you are moving files from a home or scratch directory into a CARC project directory. This results in incorrect group ownership of files, which will produce a "disk quota exceeded" error. Use cp -r instead to copy the files and then use rm to remove the source files.
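One way to follow the copy-then-remove approach is sketched here: the source is removed only after the copy succeeds and matches. The paths are temporary stand-ins for a home or scratch directory and a project directory, and the file names are hypothetical.

```shell
workdir=$(mktemp -d)
src="$workdir/home/results"    # stand-in for a directory in home or scratch
dest="$workdir/project"        # stand-in for a CARC project directory
mkdir -p "$src" "$dest"
echo "output" > "$src/run1.txt"

# Copy first, compare the two trees, and remove the source only if both steps succeed.
if cp -r "$src" "$dest/" && diff -r "$src" "$dest/results" > /dev/null; then
    rm -r "$src"
    echo "moved $src to $dest/results"
else
    echo "copy failed or differs; source left in place" >&2
fi
```

Checking the copy before running rm guards against losing files if the copy is interrupted, for example by a quota error.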
If you are backing up and syncing a directory, use an rsync
command. For example:
rsync -rlt /source/dir/ /destination/dir/
Rsync will copy only files that are new or have changed in the source directory. Enter man rsync
or rsync --help
for more information and to view all available options.
Note: Do not use the
-a
or-p
options withrsync
if you are copying into a CARC project directory. This generally results in incorrect group ownership of files that will produce a "disk quota exceeded" error. Use the optionsrsync -rlt
instead.
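A short sketch of syncing with the -rlt options follows, with a dry run first to preview the transfer. The directories are temporary stand-ins created for illustration, and the script checks that rsync is installed before using it.

```shell
workdir=$(mktemp -d)
src="$workdir/source"
dest="$workdir/backup"
mkdir -p "$src" "$dest"
echo "v1" > "$src/a.txt"

if command -v rsync > /dev/null 2>&1; then
    # -r recurses into directories, -l copies symlinks as symlinks,
    # -t preserves modification times. With --dry-run, -v lists what
    # would be transferred without changing anything.
    rsync -rlt --dry-run -v "$src/" "$dest/"

    # The actual sync. The trailing slash on the source means
    # "the contents of this directory" rather than the directory itself.
    rsync -rlt "$src/" "$dest/"
else
    echo "rsync not found on this machine" >&2
fi
```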
To delete files or directories, use the rm
command:
rm /path/to/file
For example, to delete a directory, use:
rm -r /scratch1/ttrojan/dir
The -r
option, recursive mode, is needed to remove directories. To remove multiple files or directories, simply add additional paths to the command. Enter man rm
or rm --help
for more information and to view all available options.
Checking file disk usage
To check the disk usage of files and directories, use the du -h
command:
du -h /path/to/file
Please note that all file systems run ZFS, which compresses files, so the size of a file on disk may be smaller than its actual size (on your local computer, for example). Using the du --apparent-size -h
command will give the uncompressed file size. Alternatively, the ls -lh
command should give the same result. Enter man du
or du --help
for more information and to view all available options.
To list the files or subdirectories in the current directory and sort by size, enter the command cdiskusage
. This is a convenience script that uses the du
command. Please note that it may take a long time to run for large directories (e.g., the root of a project directory).
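The following sketch shows a comparable listing built directly from du and sort. It is not the cdiskusage script itself, only an illustration of the same idea using GNU coreutils options; the directories and files are temporary and hypothetical.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/small" "$workdir/large"
head -c 1024 /dev/zero > "$workdir/small/a.bin"
head -c 1048576 /dev/zero > "$workdir/large/b.bin"

# Disk usage of each immediate subdirectory, largest first.
# sort -h understands human-readable sizes such as 4.0K and 1.1M.
du -h --max-depth=1 "$workdir" | sort -rh

# The same listing using apparent (uncompressed) sizes. On a compressing
# file system the two listings can differ noticeably.
du -h --apparent-size --max-depth=1 "$workdir" | sort -rh
```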
Sharing files
The /project directories are the best place to share files. By default, the members of a project group will have full read, write, and execute permissions for all files in a project directory (i.e., permissions set to 770 = drwxrwx---).
You can check the current permissions for a file or directory with the command ls -l /path/to/file
.
When sharing your files, please keep the following in mind:
- Never set the permissions of your directories to 777 (drwxrwxrwx), which means that any other user on CARC systems can access, modify, and delete your files.
- Do not share or change the permissions of your /home1 directory and its subdirectories. If something goes wrong, you may be blocked from logging in, because SSH enforces strict permission requirements.
- Granting other users read permission for your files (
r--
) and read and execute permissions (r-x
) for your directories is typically sufficient for sharing. Granting write permission can result in modified or deleted files, so only provide write permission when actually needed.
You can change file and directory permissions using a chmod
command.
For example, to provide read and execute permissions but not write permission (r-x
) to a project subdirectory for your project group, use:
chmod 750 /project/ttrojan_123/dir
If the subdirectory is actually located within another subdirectory, note that the group would also need read and execute permission to the full hierarchy of subdirectories. Granting write permission to a directory allows users to create, modify, or delete files in that directory, also depending on individual file permissions. Enter man chmod
or chmod --help
for more information and to view all available options.
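As an illustration of the hierarchy point above, this sketch sets group read and execute permissions on a nested subdirectory and on each of its parents, then prints the resulting modes. The directory and file names are hypothetical, and stat -c is the GNU coreutils form of the command.

```shell
workdir=$(mktemp -d)
shared="$workdir/project/results/shared"
mkdir -p "$shared"

# Group gets read and execute (r-x) on the shared subdirectory.
chmod 750 "$shared"

# Each parent directory also needs group r-x, or the group cannot reach the subdirectory.
chmod 750 "$workdir/project" "$workdir/project/results"

# Group gets read-only access (r--) to a file inside it.
echo "data" > "$shared/table.csv"
chmod 640 "$shared/table.csv"

# Print octal mode, symbolic mode, and name for each path.
stat -c '%a %A %n' "$shared" "$shared/table.csv"
```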
Backing up files
Although the /home1 and /project file systems have some file recovery capabilities, we encourage you to also back up your files elsewhere. There are a few different backup locations to consider:
- Local storage (e.g., external drive)
- Cloud storage
- Research data repositories
To transfer files to local or cloud storage, see our guide for Transferring Files Using the Command Line. Rsync is useful for syncing to a backup directory on local storage, and Rclone works similarly for cloud storage. For large transfers to local or cloud storage, Globus can sync two directories in a similar manner. However, these tools do not necessarily keep versioned backups by default. Tools with more features designed for backups, such as deduplication and compression, include rdiff-backup, Borg, Kopia, and Restic.
Research data repositories, such as OSF, Zenodo, Harvard Dataverse, and Dryad, are a special type of cloud storage intended for sharing research data with the wider research community. These services typically have an API that can be used at the command line to upload files directly from CARC systems.
For long-term archival storage, also consider using a research data repository. For private archival storage, you can also consult the USC Digital Repository.
Alias commands or backup scripts can help semi-automate backups.
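A backup script can be as small as one shell function. The sketch below defines a hypothetical helper named backup_dir that writes a dated .tar.gz archive of a directory into a backup location; the function name, archive naming scheme, and paths are all assumptions for illustration.

```shell
# Hypothetical helper: archive a directory into a dated .tar.gz file.
# Usage: backup_dir <source directory> <backup location>
backup_dir() {
    src_dir=$1
    backup_root=$2
    stamp=$(date +%Y-%m-%d)
    archive="$backup_root/$(basename "$src_dir")_$stamp.tar.gz"
    mkdir -p "$backup_root"
    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" && echo "$archive"
}

# Example run against temporary directories.
workdir=$(mktemp -d)
mkdir -p "$workdir/analysis"
echo "notes" > "$workdir/analysis/README.txt"
created=$(backup_dir "$workdir/analysis" "$workdir/backups")
echo "created $created"
```

Because the date is part of the file name, running the function on different days keeps earlier archives instead of overwriting them.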
A good plan for backups is the 3-2-1 strategy:
- 3 copies of data
- 2 different media (e.g., devices or file systems)
- 1 copy off-site (e.g., cloud storage)
Also make sure to test periodically that backups are accessible and functional. A good rule of thumb is to test every three months.
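One possible way to test a backup is to compare checksums, sketched here with sha256sum from GNU coreutils. The directories, file names, and manifest file name are hypothetical, and a plain cp stands in for whatever tool produced the backup.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/backup"
echo "alpha" > "$workdir/data/a.txt"
echo "beta"  > "$workdir/data/b.txt"
cp -r "$workdir/data/." "$workdir/backup/"   # stand-in for the real backup step

# Record checksums of the originals, using relative paths so the
# manifest can be checked against either directory tree.
(cd "$workdir/data" && find . -type f -exec sha256sum {} + > "$workdir/manifest.sha256")

# Check the backup against the manifest. The exit status is nonzero
# if any file is missing or its content differs.
if (cd "$workdir/backup" && sha256sum -c --quiet "$workdir/manifest.sha256"); then
    backup_ok=yes
else
    backup_ok=no
fi
echo "backup matches originals: $backup_ok"
```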
Archiving and compressing files
Archiving and compressing files can help simplify file organization and save storage space, such as after a project is completed and the associated files are not needed in the immediate future. This is also useful for packaging project files in order to distribute them to other researchers, for example. You can use a combination of the programs tar
for archiving files and gzip
or xz
for compressing files.
Archiving with tar
To create an archive file from a directory of files, use the tar
command. For example:
tar -cvf <filename>.tar <dir>
To add multiple directories and files, simply add the paths to these directories and files in the command. To check the integrity of the files, add the -W
option.
To extract the archive, use the -x
option instead of the -c
option. For example:
tar -xvf <filename>.tar
Note that the .tar file will be larger in size than the sum of all the files being archived, primarily because of the added file headers in the archive file. Enter man tar
or tar --help
for more information and to view all available options.
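The following sketch runs through a full cycle with tar: create an archive, list its contents, and extract it into a separate directory. The file and directory names are hypothetical and everything happens in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project_dir" "$workdir/restore"
echo "one" > "$workdir/project_dir/file1.txt"
echo "two" > "$workdir/project_dir/file2.txt"

# Create the archive. -C makes tar store paths relative to $workdir
# instead of embedding the full path.
tar -cvf "$workdir/project_dir.tar" -C "$workdir" project_dir

# List the contents without extracting anything.
tar -tvf "$workdir/project_dir.tar"

# Extract into a different directory rather than the current one.
tar -xvf "$workdir/project_dir.tar" -C "$workdir/restore"
```

Listing with -t before extracting is a quick way to see where files will land.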
Compressing with gzip
To compress files using gzip
, use:
gzip -v <filename>
This will create a .gz file. Including the -v
option, verbose mode, will print the compression ratio. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time, add the -9
option.
To uncompress a .gz file, add the -d
option:
gzip -dv <filename>.gz
Enter gzip --help
to view all available options. In addition, the pigz
module is a parallel implementation of gzip
that provides faster compression and uncompression times: module load pigz
. It can be used as a drop-in replacement for gzip
commands.
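As a brief illustration, this sketch compresses a sample file with gzip, inspects the result, and restores it. The file name and contents are made up for the example.

```shell
workdir=$(mktemp -d)
# Repetitive text compresses well, which makes the ratio easy to see.
yes "sample line of text" | head -n 10000 > "$workdir/data.txt"
orig_sum=$(cksum < "$workdir/data.txt")

gzip -v "$workdir/data.txt"       # replaces data.txt with data.txt.gz
gzip -l "$workdir/data.txt.gz"    # lists compressed size, uncompressed size, and ratio
gzip -dv "$workdir/data.txt.gz"   # restores data.txt and removes the .gz file

new_sum=$(cksum < "$workdir/data.txt")
echo "checksums match: $([ "$orig_sum" = "$new_sum" ] && echo yes || echo no)"
```

Note that gzip replaces the original file by default. In gzip 1.6 and later, adding the -k option keeps the original alongside the .gz file.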
Compressing with xz
For better compression ratios or for maximum compression, use xz
instead of gzip
. With xz
you can also use multiple cores to speed up the compression time. For example, to compress using 4 cores, add the -T4
option:
xz -v -T4 <filename>
This will create a .xz file. Including the -v
option, verbose mode, will print compression progress and related information. There are 9 levels of compression, with 9 being the highest/slowest level and 6 being the default. The default is typically the best value to use with respect to the compression/time tradeoff. To maximize compression, at the expense of compression time and memory required, add the -9
option.
To uncompress an .xz file, add the -d
option:
xz -dv -T4 <filename>.xz
Enter man xz
or xz -H
for more information and to view all available options.
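To see how the two programs compare on your own data, a sketch like the following reports compressed sizes without modifying the original file. The sample file is hypothetical, and results depend heavily on the type of data, so this is a way to measure rather than a statement of what to expect.

```shell
workdir=$(mktemp -d)
yes "sample line of text" | head -n 10000 > "$workdir/data.txt"

# -c writes the compressed stream to standard output and leaves the input file in place.
orig_size=$(wc -c < "$workdir/data.txt")
gz_size=$(gzip -c "$workdir/data.txt" | wc -c)
echo "original: $orig_size bytes, gzip: $gz_size bytes"

if command -v xz > /dev/null 2>&1; then
    xz_size=$(xz -c -T4 "$workdir/data.txt" | wc -c)
    echo "xz: $xz_size bytes"
else
    echo "xz not found on this machine" >&2
fi
```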
Archiving and compressing with tar
You can also archive and compress with one command using tar
with the -z
option, which uses gzip
compression. For example:
tar -czvf <filename>.tar.gz <dir>
Alternatively, to use xz
to compress, use the -J
option instead. In contrast to using gzip
or xz
separately, tar
does not delete the source files by default. Add the --remove-files
option to do so.
To uncompress and unarchive in one command, use the -x
option:
tar -xvf <filename>.tar.gz
This will extract the contents of the archive into the current directory. tar
will automatically detect which decompression program to use. Note that it does not delete the compressed archive file after extracting the files.
Software for Linux is typically distributed as a .tar.gz file, so a command like the above will extract the source code or binary files into the current directory.
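The one-step approach can be sketched as follows: create a compressed archive, list it, and extract it into a chosen directory. The names are hypothetical and a temporary directory is used throughout.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/results" "$workdir/unpacked"
echo "value" > "$workdir/results/out.txt"

# Archive and compress in one step with gzip (-z).
# Replacing -z with -J would use xz instead.
tar -czvf "$workdir/results.tar.gz" -C "$workdir" results

# List the contents of the compressed archive.
tar -tzf "$workdir/results.tar.gz"

# Extract into a specific directory with -C instead of the current directory.
tar -xvf "$workdir/results.tar.gz" -C "$workdir/unpacked"
```

After these commands, both the source directory and the archive still exist, since tar removes neither by default.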
Archiving and compressing before transferring files
Creating and compressing a single archive file can be useful before transferring files to or from CARC systems, especially for directories with a large number of files (e.g., > 1000, regardless of the total size of those files). Each file has associated metadata, which slows down the transfer. Compressing files will reduce the amount of data that needs to be transferred. However, it takes time to compress and uncompress files, so the total transfer time may not necessarily decrease depending on factors like network speeds. With fast network speeds, relative to total transfer size, it is typically not worth compressing files (e.g., when on campus).
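A simple way to apply the guideline above is to count the files in a directory and bundle it only when the count is large. The sketch below uses the 1000-file figure as a threshold; the directory name and generated files are hypothetical.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/many_files"
i=1
while [ "$i" -le 1200 ]; do
    echo "$i" > "$workdir/many_files/file_$i.txt"
    i=$((i + 1))
done

threshold=1000   # taken from the guideline above
file_count=$(find "$workdir/many_files" -type f | wc -l)

if [ "$file_count" -gt "$threshold" ]; then
    # Many small files: bundle them into a single archive before transferring.
    tar -czf "$workdir/many_files.tar.gz" -C "$workdir" many_files
    echo "bundled $file_count files into one archive for transfer"
else
    echo "$file_count files; transferring the directory as-is is probably fine"
fi
```

On a fast network, dropping the -z option creates an uncompressed archive, which still avoids the per-file overhead while skipping the compression time.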