Why use the command line? #
For most users of modern computers, the computer is driven by touch (including mouse): we click on programs to start them, we click on or drag files to open or process them, we click on menu items to select operations to perform and the data on which to perform them, we click on buttons to answer questions, and so forth.
This has the advantage that it is easy to learn, easy to operate, and easy to visualize (e.g., through physical metaphors like documents and folders and windows and menus).
But this graphical user interface (GUI) approach to interacting with computers has several disadvantages:
- The available operations are limited to those specifically chosen by the system’s or application’s programmers.
- The operations cannot be combined into new operations.
- The operations cannot be parameterized for generalization.
- The operations cannot easily be recorded for later recollection, reuse, modification, or automation.
- All that clicking and dragging is slow!
The command line offers an alternative way to interact with the system. We enter textual commands that run programs with a chosen configuration on specified data.
This has some disadvantages:
- It has a nontrivial learning curve.
- There are many commands and configuration options, many not particularly mnemonic.
- Many commands focus on processing text files.
But this approach has many compensating advantages:
- Operations may be easily combined to form new and useful operations.
- Very complex commands can be built from several simple pieces.
- Commands can be recorded for later recollection, reuse, modification, or automation.
- Commands can be parameterized for more general uses.
- Once mastered, working on the command line is fast.
These are much like the arguments for using R instead of a point-and-click statistics package.
Soon we will learn how to write our own command-line programs, but first we’ll learn how the command line works and its basic features.
Example use case #
Suppose you have this problem: Read a file of text, determine the n
most frequently used words, and print out a sorted list of those
words along with their frequencies. What can you do?
1. Find an existing program to solve this problem.
   Maybe it exists, but for many specialized tasks you are out of luck. You might end up moving to an environment like Excel and hacking it together.
2. Write a specialized program.
   Donald Knuth wrote a 10-page Pascal program for this task. It was intended as an example of literate programming, simultaneously easy for humans and computers to read and reason about.
3. Use the command line directly.
   Doug McIlroy wrote this simple chain of commands that does the task in a few moments:
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Each of the steps in this chain is a program that does one basic operation on its input. McIlroy’s “script” configures these programs with command-line options and creates a pipeline to chain them together into a more complex command.
All three approaches are viable. But we want to convince you of the power of #3 for many needs. By combining simple components that do one thing well, we get a vast assortment of useful tools.
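To make the pipeline concrete, here is a minimal sketch that wraps it in a shell function and runs it on a line of made-up text. The function name top_words and the sample sentence are invented for illustration; the commands themselves are the ones in McIlroy's chain, with the word count n passed as the function's first argument.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the pipeline above; the name top_words is ours.
# Usage: top_words N < file
top_words() {
    tr -cs 'A-Za-z' '\n' |   # turn every run of non-letters into one newline
        tr 'A-Z' 'a-z' |     # lowercase everything
        sort |               # bring identical words together
        uniq -c |            # collapse duplicates, prefixing each with its count
        sort -rn |           # highest counts first
        sed "${1}q"          # print the first N lines, then quit
}

# Made-up sample text, just to exercise the function.
sample='the cat and the dog and the bird'
result=$(printf '%s\n' "$sample" | top_words 2)
printf '%s\n' "$result"
```

Each stage is replaceable on its own: swapping the last step for a different filter, for example, changes how many lines are kept without touching the counting logic.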
The shell #
A shell is a program that provides a command-line interface for running other programs, configuring them with options, and doing useful things with their inputs and outputs.
Shells provide powerful built-in tools for routing and transforming data and for controlling other programs. They are accompanied by a full suite of other programs for manipulating text. Shell commands can be executed interactively or stored in files (called shell scripts) that can be executed as programs themselves.
Mastering a shell will make you a more powerful and more efficient computer user.
Choosing a shell #
There are many different shells available, differing slightly in their syntax and in the features they provide: bash, sh, ksh, zsh, fish, tcsh, ash, dash, pdksh, ….
bash is a common default shell on Linux, while the latest versions of macOS use zsh. We recommend you stick with the default initially. Most shells are quite similar and support the same syntax, with only minor differences for advanced features.
Your chosen shell (which I’ll assume for now is bash) starts for you whenever you open a terminal. You can also start a new shell explicitly by typing the corresponding command on the command line, e.g.,
bash
Automation is your friend #
If you do it once, smile.
If you do it twice, wince.
If you do it thrice, automate it.
Shell scripts – files containing a sequence of shell commands that can themselves take arguments – provide a useful tool for recording and automating the steps you take in a data analysis.
For example, in this course, we will often automate:
- Tests. Most test frameworks provide tools to run all tests in a directory or in certain files
- Data cleaning tools that need to be used on new data files:
Rscript clean.R < big_data_dump.csv > clean_data.csv
- Solvers and analyses that have to run on various input files:
python ingest_crimes.py -s 2501 data/homicides-2014.txt
- Plot and result generation for reproducible research:
Rscript make_paper_plots.R
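As a rough sketch of what "commands that take arguments" can look like, the function below repeats one step for every file named on its command line. The function name count_rows and the CSV files are invented for this example; in a real script file, "$@" would hold the script's own arguments in the same way.

```shell
#!/usr/bin/env bash
# Sketch of a reusable step: count the data rows (lines after the header)
# in each file given as an argument. All names here are illustrative.
count_rows() {
    for f in "$@"; do
        # tail -n +2 skips the header line; wc -l counts what remains;
        # tr -d ' ' strips the padding some versions of wc add.
        rows=$(tail -n +2 "$f" | wc -l | tr -d ' ')
        echo "$f has $rows data row(s)"
    done
}

# Made-up input files in a throwaway directory.
workdir=$(mktemp -d)
printf 'id,value\n1,10\n2,20\n' > "$workdir/a.csv"
printf 'id,value\n1,5\n' > "$workdir/b.csv"

report=$(cd "$workdir" && count_rows a.csv b.csv)
echo "$report"
rm -r "$workdir"
```

Once a step like this is written down, running it on next month's files is a matter of changing the arguments, not retyping the commands.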
The basics of shell commands #
(Note: These commonly used commands tend to have rather terse names. There is some mnemonic value here, but it takes some getting used to.)
The shell prompt #
When the shell is ready to accept input from the user, it displays a prompt. The prompt is a string, such as ‘% ’, with a recognizable pattern.
Here’s my prompt as I write this:
:documents/Lectures: 6552%
^ ^^
| ||
+----- Prompt ----------+Command starts here
This prompt string contains special “control characters” that cause the shell to display information about its current state: in this case, a width-limited display of the current directory and the command number.
Now, start up a shell on your computer and follow along.
You will see a prompt something like the one below, though it may vary:
username@LAPTOP:~$
Enter a command and hit Enter, like
echo "Hello, world"
Next, try commands like date, whoami, or cal. Each command prints its output on the lines following the prompt and then a new prompt is given.
This pattern – called a Read-Evaluate-Print Loop or REPL – is the same as what you get when running R or Python interactively; only the nature and syntax of the commands is different.
Shell commands are used to control the operating system and to run other programs.
Navigating the file system #
The file and directory tree #
The files on your computer are arranged in a hierarchical directory structure. Directories (aka folders) contain files and other directories, which in turn contain files and other directories, and so on. The files are thus arranged in a tree, and at the root of that tree is the root directory.
/ <-- name of root directory
(Note: directories are actually a special type of file listing information about the files “contained” in the directory.)
The current working directory #
At any point in time when the shell is running, it keeps track of the directory where you are currently working. Unsurprisingly, this is called the current working directory.
Type the command pwd at the shell prompt. This stands for “print working directory”.
You will see output with a form something like this:
/Users/genovese/class/s750
This is called a path for the directory.
Paths #
Since the files are arranged in a tree, we can uniquely specify a file by describing the path from the root directory to the file.
Let’s break down the pathname /Users/genovese/class/s750.
- The initial / denotes the root directory.
- Users is a directory contained within the root directory.
- genovese is a directory within Users.
- class is a directory within genovese.
- s750, which is within class, is the current working directory.
This gives the path from root to the file of interest.
A pathname starting with the root / is called an absolute path.
We can also specify a relative path, which is defined relative to the current directory.
Example relative pathnames (on my machine) with that same working directory:
- entrypoints
  A file in the current working directory, with absolute path /Users/genovese/class/s750/entrypoints.
- course-materials/lectures/week1/week1T.org
  The file for this lecture; the absolute pathname is /Users/genovese/class/s750/course-materials/lectures/week1/week1T.org.
- ./src/scomp-exercise-mode.el
  In pathnames, . (a single dot) is a special notation for the current directory, and .. (two dots) is a special notation for the parent of the current directory. This represents the file with absolute path /Users/genovese/class/s750/src/scomp-exercise-mode.el.
- ../218/info/syll.pdf
  A file obtained by first moving up towards the root and then down a sibling branch of the tree: /Users/genovese/class/218/info/syll.pdf
Listing files: The ls command #
From within your current directory, type the command ls.
I see the following output:
2018-notes documents problem-bank-source
NOTES entrypoints resources
Project-Links misc roster
course-materials old src
dev problem-bank style
You will see a similar listing. These are the files contained within your current working directory.
Now, we will add an option to the command to change its behavior. Type ls -l. (There is a space between the ls and the -l, that is, “minus ell”.)
This is a long listing, indicating a variety of information about the files including access permissions, owner, modification time, and name.
Next, type ls -lrtF. You will see two differences: the ordering of the files – which should now be in chronological order – and a special character attached to each name depending on file type, e.g., / for directories.
Note: options are specified by an initial -. Short (one character) option names, like -l or -F, can often be combined into one string. You will get the same thing if you type ls -l -r -t -F.
Next, pick the name of one directory shown in the previous listing and type ls DIR where you replace DIR with that directory name. For instance, I typed ls resources. What do you see?
Here, resources is an argument to the command. The ls command can take any number of pathnames as arguments.
Finally, try combining some options and arguments in an ls command.
Changing directories: The cd command #
We can change the working directory with the cd (“change directory”) command. In its most common usage, it takes a single directory pathname as an argument and changes the current working directory to the given one.
Try cd .. to move to the parent directory. Follow that with a pwd and ls to look around at your new location.
Try a few more cd commands, changing your working directory up or down as you see fit. You can give cd either relative or absolute pathnames; try some of each.
If you type cd with no arguments, it moves you to a special directory called your home directory, which is the root of the subtree containing the files you “own”. We will play with that shortly.
Try this:
1. Move to your home directory by typing cd at the prompt.
2. List the files in that directory by typing ls -F at the prompt.
3. Pick a directory, one of the files listed that ends with /.
4. Type cd DIRECTORY, replacing DIRECTORY with the name you picked in step 3.
5. Type pwd and then ls to look at the contents of that directory.
6. Move to the parent directory by typing cd .. at the prompt.
7. Type pwd again to confirm that you are back.
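If you would rather experiment without wandering around your real files, the following sketch builds a small throwaway tree and reaches the same files by relative and absolute paths. The directory and file names are invented to mirror the examples above, and mktemp -d is used only to get a scratch location.

```shell
#!/usr/bin/env bash
# Build a small throwaway tree (names are invented) to experiment with paths.
root=$(mktemp -d)
mkdir -p "$root/class/s750/src" "$root/class/218/info"
touch "$root/class/s750/src/notes.txt" "$root/class/218/info/syll.pdf"

cd "$root/class/s750"          # make s750 the current working directory

# The same file reached three ways: relative, relative with ./, and absolute.
ls src/notes.txt
ls ./src/notes.txt
ls "$root/class/s750/src/notes.txt"

# .. climbs to the parent (class); from there we descend a sibling branch.
sibling=$(ls ../218/info/syll.pdf)
echo "$sibling"

# After cd .. the working directory is the parent, class.
cd ..
here=$(basename "$(pwd)")
echo "$here"

cd / && rm -r "$root"          # leave the scratch tree before deleting it
```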
Examining file contents: file, cat, and more or less #
So far, we’ve seen the names of the files and various metadata about them. We are usually more interested in their contents.
There are various commands to examine file contents; here are two that are especially good for text files.
The cat command (short for concatenate) prints out the contents of all files given as arguments (in order).
If I type cat entrypoints NOTES from my previous working directory, it prints the concatenated contents of both files in that order to the screen.
But cat has a few other tricks up its sleeve.
The file command outputs a description of the file type for any pathnames given as arguments. Try it with . and .. and one other non-directory file as arguments.
Use ls -lh and the file command to find one or more smallish (text) files in your directory tree, and cat their contents to the screen. For example, when I type file entrypoints, I get the output
entrypoints: ASCII text
telling me it is a text file, and ls -lh entrypoints NOTES displays
-rw-r--r-- 1 genovese staff 12K Oct 28 2017 NOTES
-rw-r--r-- 1 genovese staff 228 Oct 26 2017 entrypoints
telling me that NOTES has size 12 kilobytes and entrypoints 228 bytes.
If you cat a long file, the output will all scroll by. This has its uses, but more often you want to see the contents a little at a time. Like cat, the more and less commands (whichever you have) take pathnames as arguments but scroll the contents at your discretion. Try it on a longer file to see what happens; use space to scroll another page (though there are other commands).
Ex: less course-materials/lectures/week1/week1T.org
Manipulating files and directories (cp, mv, rm, mkdir, rmdir, chmod) #
These common commands help you manage your files:
Command | What it does |
---|---|
cp | copy files to new locations |
mv | move (rename) files |
rm | remove (delete) files |
mkdir | create a new directory |
rmdir | remove an empty directory |
chmod | change a file’s access permissions |
Common forms:
- cp file newfile
  copy file to newfile, overwriting any existing newfile unless the -i option is given.
- cp file1 file2 ... fileN dir
  copy files to directory dir
- mv file newfile
  rename file to newfile, overwriting any existing newfile unless the -i option is given.
- mv file1 file2 ... fileN dir
  move files to directory dir
- rm file1 file2 ... fileN
  remove files (with the -i option, will ask to confirm)
- mkdir dir1 dir2 ... dirN
  create new directories with the given paths
- rmdir dir1 dir2 ... dirN
  remove empty directories
- chmod +x file
  mark file as executable, so that the shell can run it as a program
Try mkdir foobarzap. Type ls -F to see that it has been created. Then type rmdir foobarzap to eliminate it. If your directory were not empty, you would need to remove its contents before doing rmdir.
Be careful with the rm command, especially with wildcards. You cannot undo a delete! The shell does not have a Recycle Bin. For this reason, I always use rm -i, which asks me before deleting anything.
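A safe way to get a feel for these commands is to run them inside a scratch directory. The sketch below does that; the file and directory names are invented, and everything happens under a directory created by mktemp -d so none of your own files are touched.

```shell
#!/usr/bin/env bash
# Everything happens inside a fresh temporary directory.
sandbox=$(mktemp -d)
cd "$sandbox"

echo "draft" > report.txt        # create a small file to play with
cp report.txt backup.txt         # copy: both files now exist
mv report.txt final.txt          # rename: report.txt is gone, final.txt appears
mkdir archive
mv backup.txt archive/           # move a file into a directory
after_moves=$(ls -F)             # directories are shown with a trailing /
echo "$after_moves"

rm final.txt archive/backup.txt  # delete files (there is no undo)
rmdir archive                    # works only because archive is now empty
leftover=$(ls -A)                # should list nothing at all

cd / && rmdir "$sandbox"         # the sandbox itself is empty, so rmdir works
```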
Shell commands and arguments #
Shell commands are the names of executable files – aka programs or apps – on the file system. Shell commands start with the command name followed by zero or more arguments. The arguments are separated by spaces. (Some commands, like git, include a subcommand name after the command name, such as git branch or git commit, but otherwise look similar.)
We saw the echo command used above. Here are a few more examples.
Try these at your shell:
echo Hello, World
which git
ls
ls -A
ls -F
ls -lrt ..
ls -1F /bin/ls /bin/bash /usr/local/bin/git
In all of these, the first word names the command, and what follows is a mixture of options and other arguments. Some commands require arguments, some work with none.
By convention, arguments that begin with - set configuration options. Compare the results of ls, ls -l (minus ell), ls -lr (minus ell, r), and ls -lrt (minus ell, r, t). What do the included ell, r, and t appear to do?
Because of historical developments, there are a variety of ways that programs represent these options. The most common current conventions (POSIX) are:
- A single - starts a short option or flag, which is named by a single letter. A short option may itself require arguments, which must immediately follow the option (either within the same argument or as the next argument). For instance:
    dvips -o foo.ps foo.dvi
    sort -k6,7 file
    sort -k 1,3 file
    perl -iorig program.pl
  A flag, on the other hand, represents an on or off switch. It does not take arguments, so multiple flags can be combined in one string, as in -lrt above.
- A double -- starts a long option, which is named by a word or hyphen-separated phrase. Any arguments required by the long option are specified after an ‘=’ in the option string or in the next argument. (Different commands and options may differ in whether they support ‘=’ or next argument or both.)
    sort --key=6,7 --output=sorted.out file
    git branch --contains 0fcea97
    grep --ignore-case --max-count 3 pattern file
- Options may be given in any order and may appear multiple times.
- All options should appear before any required arguments to the command. (This is sometimes violated, e.g., git.)
- Some specific options like --help and --version are conventionally supplied to help the user understand how to use the command. Try git --help or git --version or grep --help for instance. Sadly, this is not uniformly supported. Try -h or -help as alternatives.
- A double -- by itself represents the end of optional arguments; everything that follows is a required argument.
- A single - by itself usually denotes a special input/output channel, standard input or standard output, which we’ll cover below.
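The short-option conventions above can be tried out with the shell's built-in getopts command. The following is only a sketch: the function greet and its options -v (a flag) and -n NUM (an option taking an argument) are invented for illustration, and getopts handles short options only, not long ones.

```shell
#!/usr/bin/env bash
# Sketch: parse -v (a flag) and -n NUM (an option with an argument),
# followed by ordinary arguments. All names here are illustrative.
greet() {
    verbose=0
    count=1
    OPTIND=1                          # reset so the function can be called again
    while getopts "vn:" opt; do       # "n:" means -n requires an argument
        case "$opt" in
            v) verbose=1 ;;
            n) count=$OPTARG ;;
            *) echo "usage: greet [-v] [-n NUM] name..." >&2; return 2 ;;
        esac
    done
    shift $((OPTIND - 1))             # drop the options; "$@" is now the names
    for name in "$@"; do
        i=0
        while [ "$i" -lt "$count" ]; do
            if [ "$verbose" -eq 1 ]; then
                printf 'Hello, %s!\n' "$name"
            else
                printf '%s\n' "$name"
            fi
            i=$((i + 1))
        done
    done
}

greet -v -n 2 World      # options given separately
greet -vn2 World         # the same options combined into one string
greet -- -n              # after --, "-n" is treated as an ordinary argument
```

The first two calls print the same two lines, which is the "flags can be combined" rule in action; the third prints -n itself because -- ended option processing.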
For assignments, you will often write shell commands to run your programs, and we will ask that you follow these conventions. The notes on writing command-line programs discuss how to do this.
Finding commands #
If I want to add commands to the system, how does the shell find them?
The key point is that commands are represented by files, so the shell just needs to look for commands in special places in the file system.
Fortunately, you can tell it where to look and how to prioritize the different locations using the shell variable PATH.
PATH is a string that contains a list of directories separated by colons. (Why colons? Don’t ask.)
When the shell sees a name N in the command position of a line, the shell looks for an executable file named N in the first directory on the path. If it finds it, it executes it. Otherwise, it moves on to the second directory of the path. And so on, along the entire list.
Here’s my default PATH, for instance:
.:./bin:../bin:/Users/genovese/bin:/usr/local/priority/bin:/usr/local/bin:/Library/TeX/texbin:/usr/bin:/bin:/usr/X11R6/bin:/usr/sbin:/sbin
It’s worth mentioning at this point that . and .. are special directory names that represent, respectively, the current directory and the parent directory of the current directory. The special name ~ represents the user’s home directory. So /Users/genovese/bin/foo can be written as ~/bin/foo.
In bash, you can set your path by changing the PATH variable:
export PATH=/foo/bar:/blah/zap/bin:/usr/local/bin
and you can prepend or append to the path by
export PATH=/high/priority:$PATH
export PATH=$PATH:/fallback/place
To make this permanent, put this line in the file .profile in your home directory.
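To watch the search happen, the sketch below creates a tiny command in a scratch directory, prepends that directory to PATH, and then runs the command by name. The command name hello-demo is made up, and command -v is used here to ask the shell which file it would run.

```shell
#!/usr/bin/env bash
# Create a throwaway directory holding one tiny command (hello-demo is invented).
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from my own command"\n' > "$bindir/hello-demo"
chmod +x "$bindir/hello-demo"     # mark the file as executable

# Prepend the directory so the shell searches it first.
old_path=$PATH
PATH="$bindir:$PATH"
export PATH

found=$(command -v hello-demo)    # which file does the shell find for this name?
output=$(hello-demo)              # run it by name; no pathname needed
echo "$found"
echo "$output"

# Undo the change and clean up.
PATH=$old_path
rm -r "$bindir"
```

Because the change was made only in this shell session, it disappears when the shell exits, which is why a permanent change belongs in a startup file.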
A few comments on names #
- Try to avoid spaces in file and directory names, since spaces separate arguments in shell commands. If you have to provide a path as an argument to a command, it may be misinterpreted if it contains spaces. Hyphens are a good alternative.
- Avoid using special characters such as *?&:$!#()[]<>{} in your file and directory names because these characters have other uses on the command line.
- On non-Windows systems, file names can be quite long (commonly up to 255 characters per name).
- By default, Macs do not distinguish case in names, though they record the case as you name it. foo.txt is the same as FOO.txt.
A useful tool #
The site https://explainshell.com/ provides a useful tool for understanding the structure of shell commands.
Enter a bash command into the given text field, and it will identify each of the components of the command with useful explanation and help for what the command, options, and arguments mean and expect.
Keep that handy as you learn the shell.
Let’s try a couple examples from what we’ve seen so far…
Patterns and variables #
Patterns, quotes, and substitutions #
The shell has a rich and powerful pattern language for matching filenames. It is extremely useful but takes some time to learn. The best way is to learn by doing. We have time only for a few highlights, but check out the documentation on bash (or whatever shell you use) for more details.
Illustrative examples:
ls foo*.pdf
ls foo?.txt
ls ./*/tests*/foo[2-3A-E]*.txt
ls dir ./*/A*[0-9][0-9].*
ls -d ./**/*[0-9]*
ls ../foo/bar/abc{def,ghi,jkl}qrs.txt
This isn’t just for ls: any command can accept wildcard arguments like this.
We can also protect our arguments from this expansion in various ways:
- Single quotes 'foo' prevent expansion within them.
- Double quotes "foo" allow variable expansion and special characters, so "foo bar" is treated as one argument, not two.
- Command substitution produces the output of a command as a string, e.g., cat $(echo entrypoints).
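Since the best way to learn patterns is by doing, here is a small sketch to run. It creates a few invented file names in a scratch directory and uses echo to show what each pattern expands to before any command sees it.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory with a few invented file names.
demo=$(mktemp -d)
cd "$demo"
touch foo1.txt foo2.txt foo10.txt bar.txt

star=$(echo foo*.txt)          # * matches any run of characters (three files)
question=$(echo foo?.txt)      # ? matches exactly one character (not foo10.txt)
bracket=$(echo foo[2-9].txt)   # [...] matches one character from the set
quoted=$(echo 'foo*.txt')      # single quotes: the pattern is left alone

# Double quotes still allow command substitution and variable expansion.
n=$(ls | wc -l | tr -d ' ')
message="This directory holds $n files"

printf '%s\n' "$star" "$question" "$bracket" "$quoted" "$message"
cd / && rm -r "$demo"
```

Note that the shell, not echo or ls, does the matching: the command only ever receives the list of names that the pattern expanded to.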
Environments, variables, configuration (printenv, set, export) #
The shell also maintains some state in the form of variables that you can get and set. Most important are variables that can be accessed by programs run in the shell; these are called environment variables.
Type printenv to list the names and values of the current environment variables.
The three most important for common use are PATH, which determines where the shell looks for commands; HOME, which records the pathname of your home directory; and the prompt (PS1 in bash).
We saw PATH and the home directory above. Try one of these to set the prompt and see how it changes:
export PS1=":\W: \!$ "
export PS1=":\[\033[0;34m\]\W\[\033[0m\]: \[\033[37m\]\!\[\033[0m\]$ "
export PS1=":\[\033[0;34m\]\w\[\033[0m\]: \[\033[37m\]\!\[\033[0m\]$ "
The export command tells the shell to make the variable part of the environment. You typically use it in your .profile when setting variables such as PATH and PS1.
export PS1=":\W: \!$ "
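The difference that export makes is easiest to see by asking a separate program what it can see. In this sketch the variable names COURSE and SECTION are invented, and sh -c is used simply as a convenient way to start a child program.

```shell
#!/usr/bin/env bash
# COURSE and SECTION are invented variable names used only for this sketch.
unset COURSE SECTION
COURSE=s750            # a plain shell variable: visible only in this shell
export SECTION=A       # an exported variable: passed on to programs we run

# sh -c starts a separate program. The single quotes stop *this* shell from
# expanding the variables, so the child shell does the lookup itself.
child_view=$(sh -c 'echo "COURSE=[$COURSE] SECTION=[$SECTION]"')
echo "$child_view"     # COURSE is empty in the child; SECTION is not

export COURSE          # export an existing variable without changing its value
child_view2=$(sh -c 'echo "COURSE=[$COURSE]"')
echo "$child_view2"
```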
Other commonly used shell commands #
Now that we have an idea of how the shell and shell commands work, let’s look at some of the other commonly used commands.
Finding things #
- grep (“globally search a regular expression and print”) finds lines in its input whose contents match a pattern
- find finds files matching specified criteria
There’s a lot to both of these commands, but even at their simplest, they are useful.
Let’s look at grep shell and grep -i shell applied to this file.
Here’s how I would find “org” files in my lectures directory containing the word shell:
find .. -type f -name '*.org' -exec grep -q shell '{}' \; -exec ls -ld '{}' \;
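Here is a self-contained variation to experiment with, run on invented sample files rather than my lecture directory. It uses grep -c to count matching lines and grep -l to print only the names of matching files; both are standard grep options, though they are not the ones used in the command above.

```shell
#!/usr/bin/env bash
# Invented sample files in a throwaway directory.
demo=$(mktemp -d)
printf 'The Shell is a program\nit runs other programs\nshell scripts automate\n' > "$demo/week1.org"
printf 'no match here\n' > "$demo/week2.org"
printf 'shell, but not an org file\n' > "$demo/notes.txt"

exact=$(grep -c shell "$demo/week1.org")      # case-sensitive count of matching lines
anycase=$(grep -ci shell "$demo/week1.org")   # -i ignores case, so "Shell" counts too
echo "case-sensitive: $exact, case-insensitive: $anycase"

# Names of .org files under $demo that contain the word "shell".
# find selects the files; grep -l prints a file's name if it has a match.
hits=$(find "$demo" -type f -name '*.org' -exec grep -l shell '{}' \;)
echo "$hits"

rm -r "$demo"
```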
Text processing (sort, uniq, wc, head, tail, diff, grep) #
There are a variety of commands for processing text files. They can be combined to produce sophisticated operations.
Command | What it does | Useful options |
---|---|---|
sort | sort lines of input | -n , -r , -k |
uniq | remove or count duplicates | -c , -d |
wc | count lines, words, and characters of input | -l , -w , -c |
head | output first N lines of input | -N |
tail | output last N lines of input | -N |
diff | output differences between two files | |
grep | output lines of input that match pattern | -v , -e |
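As a small sketch of how these combine, the commands below rank a made-up table of names and scores. The data are invented, and head -n 2 is used as the portable spelling of the -N form shown in the table.

```shell
#!/usr/bin/env bash
# A made-up table: one name and one score per line.
scores='alice 82
bob 95
carol 78
dave 91
erin 88'

# -k2 sorts on the second field, -n compares numerically, -r reverses the order.
top2=$(printf '%s\n' "$scores" | sort -k2 -n -r | head -n 2)
lowest=$(printf '%s\n' "$scores" | sort -k2 -n | head -n 1)
count=$(printf '%s\n' "$scores" | wc -l | tr -d ' ')   # tr strips wc's padding

echo "top two:"
echo "$top2"
echo "lowest: $lowest"
echo "entries: $count"
```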
Working with commands (which, type, alias, history, man, help) #
Many commands help us use other commands.
Command | What it does | Basic form or example |
---|---|---|
which | find pathname of command | which command |
type | indicate kind of command | type command , type diff |
alias | set alias for command string | alias ls='ls -F' |
history | list previous commands in session | history |
man | complete command documentation | man command |
help | documentation on builtin commands | help builtin |
Try:
which ls
type diff
alias ls='ls -F'
man grep
help type
Input and output #
Standard input, standard output, standard error #
Command-line programs have three special input and output streams associated with them:
- standard input (stdin, STDIN) – by default the input the user types
- standard output (stdout, STDOUT) – by default echoes to the terminal
- standard error (stderr, STDERR) – by default echoes to the terminal
For example, when I type echo hello world, the string “hello world” (with a newline) is written to the standard output (the terminal).
When I type cat (with no arguments), it waits for the input I type and echoes that to the terminal, passing its standard input to its standard output each time I hit Enter. Try it. (To quit the program, type Ctrl-C.)
The third stream, standard error, is used to separate warnings and other special messages from a program’s output. It is by default directed to the terminal, but by changing where that points, we can keep our results and our diagnostics from interfering with each other.
What makes these streams so useful in the shell is that we can redirect each of them to other places.
Redirection and pipelines #
You can arbitrarily redirect standard input, output, and error to other files or devices.
For example,
./maze-solver < maze.txt > solution.txt
runs the program maze-solver, reading input from the file maze.txt as though it were typed at the keyboard and directing its output to the file solution.txt instead of the terminal (overwriting that file in the process).
For any command (with or without options and arguments) that reads from standard input and writes to standard output, we can redirect in many ways, including
- command < input.file
  Gets standard input from the contents of input.file rather than the keyboard.
- command > output.file
  Puts standard output into output.file rather than the screen. It overwrites the file’s contents (careful) or creates the file if it does not exist.
- command >> output.file
  Puts standard output into output.file rather than the screen, but appends this output to the file’s existing contents. If the file does not exist, it is created before redirecting the output.
- command | command2
  This is called a pipeline (or pipe for short). The standard output of command is connected to the standard input of command2. That is, command2 gets as input the output of command.
  Multiple pipes can be chained together, and the first (last) command in the chain can use input (output) redirection.
  With a few new commands, try
    ls -l | grep '^d' | wc -l
    echo "Hello, world" | cat | tr 'a-z' 'A-Z'
  The first counts the number of directories in the working directory. What did the second do?
  Use ls to make sure there are no files named FOO or BAR in your current directory. (If so, make up two names that are not used.) Consider these commands:
    echo "Hello, world" > FOO
    cat < FOO | tr 'a-z' 'A-Z' > BAR
    cat BAR
    rm FOO BAR
  What do the first three commands do? Feel free to try them. (The last command deletes the files FOO and BAR.)
- command 2> file
  Direct standard error to file.
There are other combinations as well, like 2>&1 (stderr to stdout).
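To see the streams being separated, the sketch below defines a tiny stand-in program that writes one line to standard output and one to standard error, then redirects them in a few different ways. The function report and the file names are invented; /dev/null is the standard place to send output you want to discard.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)

# A tiny stand-in program (invented) that writes one line to each output stream.
report() {
    echo "result: 42"                  # goes to standard output
    echo "warning: check input" >&2    # goes to standard error
}

report > "$demo/out.txt" 2> "$demo/err.txt"   # split the two streams
report > "$demo/both.txt" 2>&1                # send stderr wherever stdout goes
report >> "$demo/out.txt" 2> /dev/null        # append stdout, discard stderr

out_lines=$(wc -l < "$demo/out.txt" | tr -d ' ')    # two results: one written, one appended
err_text=$(cat "$demo/err.txt")                     # only the warning
both_lines=$(wc -l < "$demo/both.txt" | tr -d ' ')  # result and warning together
echo "out.txt: $out_lines lines; err.txt: $err_text; both.txt: $both_lines lines"

rm -r "$demo"
```

In the second call the order matters: 2>&1 copies wherever standard output points at that moment, so it comes after the > redirection.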
If you’re writing a command in your favorite programming language, you have ways to read from standard input and write to standard output; check the notes on writing command-line programs to see examples.
Iteration, conditionals, and other programming constructs #
Besides commands, the shell also provides a variety of programming constructs including iteration (for), conditionals (if), functions, nested commands, and a variety of simple data structures.
This makes it possible to create quite elaborate shell scripts, which can be useful for automation and other tasks.
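For a taste of these constructs, here is a minimal sketch that combines a function, an if test, and a for loop. The function name describe and the file names are invented; the tests [ -d ... ] and [ -f ... ] ask whether a path is a directory or a regular file.

```shell
#!/usr/bin/env bash
# Classify a few invented paths using a function, an if test, and a for loop.
demo=$(mktemp -d)
mkdir "$demo/data"
touch "$demo/notes.txt"

describe() {
    if [ -d "$1" ]; then          # is the argument a directory?
        echo "directory"
    elif [ -f "$1" ]; then        # is it a regular file?
        echo "file"
    else
        echo "missing"
    fi
}

summary=""
for name in data notes.txt nothing-here; do
    # Command substitution captures the function's output for each name.
    summary="$summary$name:$(describe "$demo/$name") "
done
echo "$summary"

rm -r "$demo"
```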
In general, above a certain level of complexity, I prefer to write programs in a scripting language instead, but your mileage may vary.
The lecture on writing command-line programs explains how to write programs in your favorite programming language that work at the command line.
Configuration #
We have seen PS1 and PATH as ways to customize the shell’s behavior. But much more is possible.
Each shell maintains a collection of variables that describe its environment. You can customize these variables, define new commands, define aliases to make typing more convenient, and do much more so that every time you start the shell you have it configured as you prefer.
For bash, the .profile and .bashrc files in your home directory are automatically read for such customizations and configurations. (The latter is read by interactive shells; the former is read when a login shell starts.)
Remote shells #
The Secure Shell (SSH) protocol lets you log in to a remote computer and use its shell. The Statistics Department maintains a set of servers for use by students and faculty to do their work on – if you need 32GB of memory and eight CPUs for your analysis, a department server may be right for you.
For Mac and Linux users, an SSH client is already provided. Just type
ssh yourusername@hydra1.stat.cmu.edu
and you’ll be connected to the shell on hydra1.
On Windows, Git Bash provides an SSH client, which can be run in the exact same way.
Resources #
If you are new to the command line, you should try one of these:
- Software Carpentry has a Unix Shell Tutorial
- LearnShell.org is an interactive tutorial you can run in your browser, with exercises
- ExplainShell is a useful tool for understanding command lines: give it any command and it will explain what each part of it means
There is also a cli-murders homework exercise to grow your understanding of the command line and the tools it provides.