The Shell

— Christopher Genovese and Alex Reinhart

Why use the command line? #

For most users of modern computers, the computer is driven by touch (including the mouse): we click on programs to start them, we click on or drag files to open or process them, we click on menu items to select operations to perform and the data on which to perform them, we click on buttons to answer questions, and so forth.

This has the advantage that it is easy to learn, easy to operate, and easy to visualize (e.g., through physical metaphors like documents and folders and windows and menus).

But this graphical user interface (GUI) approach to interacting with computers has several disadvantages:

  • The available operations are limited to those specifically chosen by the system’s or application’s programmers.
  • The operations cannot be combined into new operations.
  • The operations cannot be parameterized for generalization.
  • The operations cannot easily be recorded for later recollection, reuse, modification, or automation.
  • All that clicking and dragging is slow!

The command line offers an alternative way to interact with the system. We enter textual commands that run programs with a chosen configuration on specified data.

This has some disadvantages:

  • It has a nontrivial learning curve.
  • There are many commands and configuration options, many not particularly mnemonic.
  • Many commands focus on processing text files.

But this approach has many compensating advantages:

  • Operations may be easily combined to form new and useful operations.
  • Very complex commands can be built from several simple pieces.
  • Commands can be recorded for later recollection, reuse, modification, or automation.
  • Commands can be parameterized for more general uses.
  • Once mastered, working on the command line is fast.

These are much like the arguments for using R instead of a point-and-click statistics package.

Soon we will learn how to write our own command-line programs, but first we’ll learn how the command line works and its basic features.

Example use case #

Suppose you have this problem: Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies. What can you do?

  1. Find an existing program to solve this problem

    Maybe it exists, but for many specialized tasks you are out of luck. You might end up moving to an environment like Excel and hacking it together.

  2. Write a specialized program

    Donald Knuth wrote a 10-page Pascal program for this task. It was intended as an example of literate programming, simultaneously easy for humans and computers to read and reason about.

  3. Use the command line directly.

    Doug McIlroy wrote this simple chain of commands that does the task in a few moments:

    tr -cs A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -rn |
    sed ${1}q
    

    Each of the steps in this chain is a program that does one basic operation on its input. McIlroy’s “script” configures these programs with command-line options and creates a pipeline to chain them together into a more complex command.

All three approaches are viable. But we want to convince you of the power of #3 for many needs. By combining simple components that do one thing well, we get a vast assortment of useful tools.
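
To experiment with this pipeline, it helps to wrap it in a function. The following is a minimal sketch, assuming a POSIX-style shell; the function name top_words and the sample sentence are illustrative assumptions, not part of McIlroy's script. The function's first argument plays the role of ${1} in the chain above.

```shell
# top_words N: read text on standard input and print the N most
# frequent words with their counts. (Hypothetical helper name.)
top_words() {
    tr -cs 'A-Za-z' '\n' |   # turn every run of non-letters into one newline
    tr 'A-Z' 'a-z' |         # fold everything to lower case
    sort |                   # bring identical words next to each other
    uniq -c |                # collapse each run into "count word"
    sort -rn |               # largest counts first
    sed "${1}q"              # print N lines, then quit
}

# Try it on a one-line sample: "the" occurs 3 times, "and" twice.
printf 'the cat and the dog and the bird\n' | top_words 2
```

The width of the count column printed by uniq -c differs between systems, so the spacing you see may vary.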

The shell #

A shell is a program that provides a command-line interface for running other programs, configuring them with options, and doing useful things with their inputs and outputs.

Shells provide powerful built-in tools for routing and transforming data and for controlling other programs. They are accompanied by a full suite of other programs for manipulating text. Shell commands can be executed interactively or stored in files (called shell scripts) that can be executed as programs themselves.

Mastering a shell will make you a more powerful and more efficient computer user.

Choosing a shell #

There are many different shells available, differing slightly in their syntax and in the features they provide: bash, sh, ksh, zsh, fish, tcsh, ash, dash, pdksh, ….

bash is a common default shell on Linux, while the latest versions of macOS use zsh. We recommend you stick with the default initially. Most shells are quite similar and support the same syntax, with only minor differences for advanced features.

Your chosen shell (which I’ll assume for now is bash) starts for you whenever you open a terminal. You can also start a new shell explicitly by typing the corresponding command on the command line, e.g.,

bash

Automation is your friend #

If you do it once, smile.

If you do it twice, wince.

If you do it thrice, automate it.

Shell scripts – files containing a sequence of shell commands that can themselves take arguments – provide a useful tool for recording and automating the steps you take in a data analysis.

For example, in this course, we will often automate:

  • Tests. Most test frameworks provide tools to run all tests in a directory or in certain files
  • Data cleaning tools that need to be used on new data files:
    Rscript clean.R < big_data_dump.csv > clean_data.csv
    
  • Solvers and analyses that have to run on various input files:
    python ingest_crimes.py -s 2501 data/homicides-2014.txt
    
  • Plot and result generation for reproducible research:
    Rscript make_paper_plots.R
    

The basics of shell commands #

(Note: These commonly used commands tend to have rather terse names. There is some mnemonic value here, but it takes some getting used to.)

The shell prompt #

When the shell is ready to accept input from the user it displays a prompt. The prompt is a string, such as ‘%’, with a recognizable pattern.

Here’s my prompt as I write this:

:documents/Lectures: 6552%
^                         ^^
|                         ||
+-----  Prompt  ----------+Command starts here

This prompt string contains special “control characters” that cause the shell to display information about its current state: in this case, a width-limited display of the current directory and the command number.

Now, start up a shell on your computer and follow along.

You will see a prompt something like that below, though it may vary:

username@LAPTOP:~$

Enter a command and hit Enter, like

echo "Hello, world"

Next, try commands like date, whoami, or cal. Each command prints its output on the lines following the prompt and then a new prompt is given.

This pattern – called a Read-Evaluate-Print Loop or REPL – is the same as what you get when running R or Python interactively; only the nature and syntax of the commands is different.

Shell commands are used to control the operating system and to run other programs.

The file and directory tree #

The files on your computer are arranged in a hierarchical directory structure. Directories (aka folders) contain files and other directories, which in turn contain files and other directories, and so on. The files are thus arranged in a tree, and at the root of that tree is the root directory.

/   <-- name of root directory

(Note: directories are actually a special type of file listing information about the files “contained” in the directory.)

The current working directory #

At any point in time when the shell is running, it keeps track of the directory where you are currently working. Unsurprisingly, this is called the current working directory.

Type the command pwd at the shell prompt. This stands for “print working directory”.

You will see output with a form something like this:

/Users/genovese/class/s750

This is called a path for the directory.

Paths #

Since the files are arranged in a tree, we can uniquely specify a file by describing the path from the root directory to the file.

Let’s break down the pathname /Users/genovese/class/s750.

  1. The initial / denotes the root directory.
  2. Users is a directory contained within the root directory.
  3. genovese is a directory within Users.
  4. class is a directory within genovese.
  5. s750, which is within class, is the current working directory.

This gives the path from root to the file of interest.

A pathname starting with the root / is called an absolute path. We can also specify a relative path, which is defined relative to the current directory.

Example relative pathnames (on my machine) with that same working directory:

  1. entrypoints

    A file in the current working directory, with absolute path /Users/genovese/class/s750/entrypoints.

  2. course-materials/lectures/week1/week1T.org

    The file for this lecture; the absolute pathname is /Users/genovese/class/s750/course-materials/lectures/week1/week1T.org.

  3. ./src/scomp-exercise-mode.el

    In pathnames, . (a single dot) is a special notation for the current directory, and .. (two dots) is a special notation for the parent of the current directory.

    This represents the file with absolute path /Users/genovese/class/s750/src/scomp-exercise-mode.el.

  4. ../218/info/syll.pdf

    A file obtained by first moving up towards the root and then down a sibling branch of the tree: /Users/genovese/class/218/info/syll.pdf
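
To see relative paths resolve, you can rebuild a small piece of this tree in a scratch location. This is a sketch under stated assumptions: the directory names are copied from the examples above, while the scratch directory created by mktemp is arbitrary and will differ on every run.

```shell
# Build a tiny tree that mirrors the example layout.
scratch=$(mktemp -d)
mkdir -p "$scratch/class/s750/src" "$scratch/class/218/info"

cd "$scratch/class/s750"    # make s750 the current working directory
pwd                         # its absolute path

# Resolve each relative path by changing directory in a subshell
# (the parentheses) and asking pwd where we ended up.
(cd . && pwd)               # . is the working directory itself
(cd ./src && pwd)           # down one level
(cd .. && pwd)              # up to the parent directory, class
(cd ../218/info && pwd)     # up one level, then down a sibling branch
```

Because each cd runs inside parentheses, the working directory of the main shell stays at s750 throughout.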

Listing files: The ls command #

From within your current directory, type the command ls.

I see the following output:

2018-notes              documents               problem-bank-source
NOTES                   entrypoints             resources
Project-Links           misc                    roster
course-materials        old                     src
dev                     problem-bank            style

You will see a similar listing. These are the files contained within your current working directory.

Now, we will add an option to the command to change its behavior. Type ls -l. (There is a space between the ls and the -l, that is “minus ell”.)

This is a long listing, indicating a variety of information about the files including access permissions, owner, modification time, and name.

Next, type ls -lrtF. You will see two differences: the ordering of the files – which should now be in chronological order – and a special character is attached to the name depending on file type, e.g., / for directories.

Note: options are specified by an initial -. Short (one character) option names, like -l or -F, can often be combined into one string. You will get the same thing if you type ls -l -r -t -F.

Next, pick the name of one directory shown in the previous listing and type ls DIR where you replace DIR with that directory name. For instance, I typed ls resources. What do you see? Here, resources is an argument to the command. The ls command can take any number of pathnames as arguments.

Finally, try combining some options and arguments in an ls command.
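
If you would rather experiment somewhere harmless, here is a small sketch that builds a scratch directory and runs the ls variants discussed above. The file and directory names are made up for illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir resources
touch notes.txt resources/a.txt resources/b.txt

ls                  # names only
ls -F               # the directory is marked as resources/
ls -l -r -t -F      # four separate short options...
ls -lrtF            # ...or the same four combined into one string
ls -F . resources   # two pathname arguments: list both directories
```

The two long listings should be identical, since combining short options does not change their meaning.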

Changing directories: The cd command #

We can change the working directory with the cd (“change directory”) command. In its most common usage, it takes a single directory pathname as an argument and changes the current working directory to the given one.

Try cd .. to move to the parent directory. Follow that with a pwd and ls to look around at your new location.

Try a few more cd commands, changing your working directory up or down as you see fit. You can give cd either relative or absolute pathnames; try some of each.

If you type cd with no arguments, it moves you to a special directory called your home directory, which is the root of the subtree containing the files you “own”. We will play with that shortly.

Try this:

  1. Move to your home directory by typing cd at the prompt.
  2. List the files in that directory by typing ls -F at the prompt.
  3. Pick a directory, one of the files listed that ends with /.
  4. Type cd DIRECTORY, replacing DIRECTORY with the name you picked in step 3.
  5. Type pwd and then ls to look at the contents of that directory.
  6. Move to the parent directory with cd .. (two dots).
  7. Type pwd again to confirm that you are back.

Examining file contents: file, cat, and more or less #

So far, we’ve seen the names of the files and various metadata about them. We are usually more interested in their contents.

There are various commands to examine file contents; here are two that are especially good for text files.

The cat command (short for concatenate) prints out the contents of all files given as arguments (in order).

If I type cat entrypoints NOTES from my previous working directory, it prints the concatenated contents of both files in that order to the screen. But cat has a few other tricks up its sleeve.

The file command outputs a description of the file type for any pathnames given as arguments. Try it with . and .. and one other non-directory file as arguments.

Use ls -lh and the file command to find one or more smallish (text) files in your directory tree, and cat their contents to the screen. For example, when I type file entrypoints, I get the output

entrypoints: ASCII text

telling me it is a text file, and ls -lh entrypoints NOTES displays

-rw-r--r-- 1 genovese staff 12K Oct 28  2017 NOTES
-rw-r--r-- 1 genovese staff 228 Oct 26  2017 entrypoints

telling me that NOTES has size 12 kilobytes and entrypoints 228 bytes.

If you cat a long file, the output will all scroll by. This has its uses, but more often you want to see the contents a little at a time. Like cat, the more and less commands (whichever you have) take pathnames as arguments but scroll the contents at your discretion. Try it on a longer file to see what happens; use space to scroll another page (though there are other commands).

Ex: less course-materials/lectures/week1/week1T.org.

Manipulating files and directories (cp, mv, rm, mkdir, rmdir, chmod) #

These common commands help you manage your files:

Command   What it does
cp        copy files to new locations
mv        move (rename) files
rm        remove (delete) files
mkdir     create a new directory
rmdir     remove an empty directory
chmod     change a file’s access permissions

Common forms:

cp file newfile
    copy file to newfile, overwriting any existing newfile unless the -i option is given.
cp file1 file2 ... fileN dir
    copy the files into directory dir
mv file newfile
    rename file to newfile, overwriting any existing newfile unless the -i option is given.
mv file1 file2 ... fileN dir
    move the files into directory dir
rm file1 file2 ... fileN
    remove files (with -i option, will ask to confirm)
mkdir dir1 dir2 ... dirN
    create new directories with the given paths
rmdir dir1 dir2 ... dirN
    remove empty directories
chmod +x file
    mark file as an executable program

Try mkdir foobarzap. Type ls -F to see that it has been created. Then type rmdir foobarzap to eliminate it.

If your directory were not empty, you would need to remove its contents before doing rmdir.

Be careful with the rm command, especially with wildcards. You cannot undo a delete! The shell does not have a Recycle Bin. For this reason, I always use rm -i, which asks me before deleting anything.
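
Here is a short sketch that exercises these commands safely inside a scratch directory. The file names are invented for the example, and the > used to create the first file is output redirection, which is covered later in these notes.

```shell
scratch=$(mktemp -d)
cd "$scratch"

mkdir drafts                         # create a directory
echo "first version" > report.txt    # create a small file to work with
cp report.txt report-backup.txt      # copy: both names now exist
mv report.txt drafts/                # move the original into drafts
mv report-backup.txt report-v1.txt   # rename the copy
ls -F . drafts                       # look at the result

rm drafts/report.txt report-v1.txt   # delete both files (there is no undo)
rmdir drafts                         # works only because drafts is now empty
```

Swapping the last two commands would make rmdir fail, because drafts would still contain a file.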

Shell commands and arguments #

Shell commands are the names of executable files – aka programs or apps – on the file system. Shell commands start with the command name followed by zero or more arguments. The arguments are separated by spaces. (Some commands, like git, include a subcommand name after the command name, such as git branch or git commit, but otherwise look similar.)

We saw the echo command used above. Here are a few more examples. Try these at your shell:

  • echo Hello, World
  • which git
  • ls
  • ls -A
  • ls -F
  • ls -lrt ..
  • ls -1F /bin/ls /bin/bash /usr/local/bin/git

In all of these, the first word names the command, and what follows is a mixture of options and other arguments. Some commands require arguments, some work with none.

By convention, arguments that begin with - set configuration options. Compare the results of ls, ls -l (minus ell), ls -lr (minus ell, r), and ls -lrt (minus ell, r, t). What do the included ell, r, and t appear to do?

Because of historical developments, programs represent these options in a variety of ways. The most common current convention (POSIX) is:

  • A single - starts a short option or flag, which is named by a single letter. A short option may itself require arguments, which must immediately follow the option (either within the same argument or as the next argument). For instance

    dvips -o foo.ps foo.dvi
    sort -k6,7 file
    sort -k 1,3 file
    perl -iorig program.pl
    

    A flag, on the other hand, represents an on or off switch. It does not take arguments, so multiple flags can be combined in one string, as in -lrt above.

  • A double -- starts a long option, which is named by a word or hyphen-separated phrase. Any arguments required by the long option are specified after an ‘=’ in the option string or in the next argument. (Different commands and options may differ in whether they support ‘=’ or next argument or both.)

    sort --key=6,7 --output=sorted.out file
    git branch --contains 0fcea97
    grep --ignore-case --max-count 3 pattern file
    
  • Options may be given in any order and may appear multiple times.

  • All options should appear before any required arguments to the command. (This is sometimes violated, e.g., git.)

  • Some specific options like --help and --version are conventionally supplied to help the user understand how to use the command.

    Try git --help or git --version or grep --help for instance.

    Sadly, this is not uniformly supported. Try -h or -help as alternatives.

  • A double -- by itself marks the end of the options; everything that follows is treated as an ordinary argument, even if it begins with -.

  • A single - by itself usually denotes a special input/output channel, standard input or standard output, which we’ll cover below.

For assignments, you will often write shell commands to run your programs, and we will ask that you follow these conventions. The notes on writing command-line programs discuss how to do this.
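
The conventions above can be tried out on a small file. This is a sketch, not a complete tour: the file fruit.txt and its contents are invented, and the long-option line assumes a sort that accepts long options, as GNU sort does.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'pear 2\napple 10\nfig 1\n' > fruit.txt

# Short options: -n is a flag, while -k takes an argument that may be
# attached to the option or given as the next argument.
sort -n -k2,2 fruit.txt
sort -n -k 2,2 fruit.txt

# The same request with long options (assumes GNU-style long options).
sort --numeric-sort --key=2,2 fruit.txt

# A bare -- ends the options, so a name that starts with - is
# treated as a file name rather than as an option.
touch -- -odd-name.txt
ls -- -odd-name.txt
rm -- -odd-name.txt
```

Each sort command orders the lines by the number in the second column, so fig comes first and apple last.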

Finding commands #

If I want to add commands to the system, how does the shell find them?

The key point is that commands are represented by files, so the shell just needs to look for commands in special places in the file system.

Fortunately, you can tell it where to look and how to prioritize the different locations using the shell variable PATH.

PATH is a string that contains a list of directories separated by :’s. (Why colons? Don’t ask.)

When the shell sees a name N in the command position of a line, the shell looks for an executable file named N in the first directory on the path. If it finds it, it executes it. Otherwise, it moves on to the second directory of the path. And so on, along the entire list.

Here’s my default PATH, for instance:

.:./bin:../bin:/Users/genovese/bin:/usr/local/priority/bin:/usr/local/bin:/Library/TeX/texbin:/usr/bin:/bin:/usr/X11R6/bin:/usr/sbin:/sbin

It’s worth mentioning at this point that . and .. are special directory names that represent, respectively, the current directory and the parent directory of the current directory. The special name ~ represents the user’s home directory. So /Users/genovese/bin/foo can be written as ~/bin/foo.

In bash, you can set your path by changing the PATH variable:

export PATH=/foo/bar:/blah/zap/bin:/usr/local/bin

and you can prepend or append to the path by

export PATH=/high/priority:$PATH
export PATH=$PATH:/fallback/place

To make this permanent, put this line in the file .profile in your home directory.
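
To watch the search in action, the sketch below creates a tiny command in a scratch directory and puts that directory at the front of the path. The command name greet-demo is hypothetical, and the change to PATH lasts only for the current shell session.

```shell
# Show the current search path, one directory per line.
printf '%s\n' "$PATH" | tr ':' '\n'

# Create a scratch directory holding one small executable script.
bindir=$(mktemp -d)
cat > "$bindir/greet-demo" <<'EOF'
#!/bin/sh
echo "hello from greet-demo"
EOF
chmod +x "$bindir/greet-demo"    # mark the file as executable

# Prepend the directory so the shell searches it first.
export PATH="$bindir:$PATH"

command -v greet-demo            # where the shell finds the command
greet-demo                       # run it by name alone
```

Without the chmod +x step the shell would skip the file, since it looks only for executable files.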

A few comments on names #

  • Try to avoid spaces in file and directory names, since spaces separate arguments in shell commands. If you have to provide a path as an argument to a command, it may be misinterpreted if it contains spaces. Hyphens are a good alternative.
  • Avoid using special characters such as *?&:$!#()[]<>{} in your file and directory names because these characters have other uses on the command line.
  • On non-Windows systems, file names can be quite long (typically up to 255 characters for each name).
  • By default, macOS does not distinguish case in names, though it records the case you use when naming a file. foo.txt is the same as FOO.txt.

A useful tool #

The site https://explainshell.com/ provides a useful tool for understanding the structure of shell commands.

Enter a bash command into the given text field, and it will identify each of the components of the command with useful explanation and help for what the command, options, and arguments mean and expect.

Keep that handy as you learn the shell.

Let’s try a couple examples from what we’ve seen so far…

Patterns and variables #

Patterns, quotes, and substitutions #

The shell has a rich and powerful pattern language for matching filenames. It is extremely useful but takes some time to learn. The best way is to learn by doing. We have time only for a few highlights, but check out the documentation on bash (or whatever shell you use) for more details.

Illustrative examples:

  • ls foo*.pdf
  • ls foo?.txt
  • ls ./*/tests*/foo[2-3A-E]*.txt
  • ls dir ./*/A*[0-9][0-9].*
  • ls -d ./**/*[0-9]*
  • ls ../foo/bar/abc{def,ghi,jkl}qrs.txt

This isn’t just for ls: any command can accept wildcard arguments like this.

We can also protect our arguments from this expansion in various ways:

  • Single quotes 'foo' prevent expansion within them
  • Double quotes "foo" allow variable expansion and special characters, so "foo bar" is treated as one argument, not two
  • Command substitution produces the output of a command as a string, e.g., cat $(echo entrypoints).
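
A quick way to see what a pattern expands to is to hand it to echo, which simply prints the arguments it receives. The sketch below does this in a scratch directory; all of the file names are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch foo1.txt foo2.txt foobar.txt fooA.pdf "two words.txt"

echo foo*.txt        # * matches any string: all three foo .txt files
echo foo?.txt        # ? matches exactly one character: foo1.txt foo2.txt
echo foo[0-9]*       # a character class followed by anything
echo 'foo*.txt'      # single quotes: the pattern is printed literally
echo "$HOME/*.txt"   # double quotes: $HOME expands, but * does not

ls "two words.txt"   # quoted, so the space does not split the argument

name=$(echo foo1.txt)   # command substitution captures a command's output
ls "$name"
```

The shell performs the expansion before the command runs, so echo and ls never see the wildcard characters themselves unless they are quoted.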

Environments, variables, configuration (printenv, set, export) #

The shell also maintains some state in the form of variables that you can get and set. Most important are variables that can be accessed by programs run in the shell; these are called environment variables.

Type printenv to list the names and values of the current environment variables.

The three most important for common use are the PATH, which determines where the shell looks for commands, HOME which records the pathname of your home directory, and the prompt (PS1 in bash).

We set PATH above. Try one of these to set the prompt and see how it changes:

export PS1=":\W: \!$ "
export PS1=":\[\033[0;34m\]\W\[\033[0m\]: \[\033[37m\]\!\[\033[0m\]$ "
export PS1=":\[\033[0;34m\]\w\[\033[0m\]: \[\033[37m\]\!\[\033[0m\]$ "

The export command tells the shell to make the variable part of the environment. You typically use it in your .profile when setting variables such as the PATH and PS1.

export PS1=":\W: \!$ "
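
The difference that export makes can be seen by asking a child process what it sees. This is a minimal sketch; the variable name DEMO_GREETING is made up for the example.

```shell
unset DEMO_GREETING        # start from a clean slate
DEMO_GREETING="hello"      # a plain shell variable, not yet exported

# A child shell does not see it, so the fallback text is printed.
sh -c 'echo "child sees: ${DEMO_GREETING:-nothing}"'

export DEMO_GREETING       # make it part of the environment

# Now the child inherits the variable.
sh -c 'echo "child sees: ${DEMO_GREETING:-nothing}"'

printenv DEMO_GREETING     # look up a single environment variable
echo "$HOME"               # expand a variable in the shell itself
```

The first child prints "child sees: nothing" and the second prints "child sees: hello".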

Other commonly used shell commands #

Now that we have an idea of how the shell and shell commands work, let’s look at some of the other commonly used commands.

Finding things #

  • grep (“globally search a regular expression and print”) finds lines in its input whose contents match a given pattern
  • find finds files matching specified criteria

There’s a lot to both of these commands, but even at their simplest, they are useful.

Let’s look at grep shell and grep -i shell applied to this file.

Here’s how I would find “org” files in my lectures directory containing the word shell:

find .. -type f -name '*.org' -exec grep -q shell '{}' \; -exec ls -ld '{}' \;
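
Here is a smaller sketch of the same idea that you can run anywhere. It builds a scratch directory with a few invented files and uses grep -l, which prints the names of matching files, in place of the two -exec clauses above.

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir week1 week2
printf 'The Shell\nbash is a shell\nno match here\n' > week1/notes.org
printf 'Nothing relevant\n' > week2/notes.org
printf 'a shell in a text file\n' > week2/extra.txt

grep shell week1/notes.org         # case-sensitive: one matching line
grep -i shell week1/notes.org      # -i ignores case: two matching lines
grep -c -i shell week1/notes.org   # -c prints the number of matching lines

find . -type f -name '*.org'       # every .org file below this directory

# Keep only the .org files that contain the word "shell".
find . -type f -name '*.org' -exec grep -l shell '{}' \;
```

The last command prints only ./week1/notes.org: the other .org file has no match, and extra.txt is excluded by the -name test.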

Text processing (sort, uniq, wc, head, tail, diff, grep) #

There are a variety of commands for processing text files. They can be combined to produce sophisticated operations.

Command   What it does                               Useful options
sort      sort lines of input                        -n, -r, -k
uniq      remove or count adjacent duplicates        -c, -d
wc        count lines, words, and bytes              -l, -w, -c
head      output first N lines of input              -N
tail      output last N lines of input               -N
diff      output differences between two files
grep      output lines of input that match pattern   -v, -e
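
The sketch below runs each of these commands on a small invented file, fruit.txt. It uses pipes (|) and redirection (< and >), which are explained in the section on input and output.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'pear\napple\npear\nfig\napple\npear\n' > fruit.txt

wc -l < fruit.txt            # number of lines in the file: 6
sort fruit.txt | uniq        # distinct lines: apple, fig, pear
sort fruit.txt | uniq -c     # each distinct line with its count
sort fruit.txt | uniq -c | sort -rn | head -n 1   # the most common line
tail -n 2 fruit.txt          # the last two lines: apple, pear
grep -v pear fruit.txt       # lines that do not match: apple, fig, apple

sort fruit.txt > sorted.txt
# diff exits with a non-zero status when the files differ, which is
# expected here; "|| true" keeps that from stopping a strict script.
diff fruit.txt sorted.txt || true
```

Note that uniq only collapses adjacent duplicates, which is why the input is sorted first.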

Working with commands (which, type, alias, history, man, help) #

Many commands help us use other commands

Command   What it does                        Basic form or example
which     find pathname of command            which command
type      indicate kind of command            type command, type diff
alias     set alias for command string        alias ls='ls -F'
history   list previous commands in session   history
man       complete command documentation      man command
help      documentation on builtin commands   help builtin

Try:

  • which ls
  • type diff
  • alias ls='ls -F'
  • man grep
  • help type

Input and output #

Standard input, standard output, standard error #

Command-line programs have three special input and output streams associated with them:

  • standard input (stdin, STDIN) – by default the input the user types
  • standard output (stdout, STDOUT) – by default echoes to the terminal
  • standard error (stderr, STDERR) – by default echoes to the terminal

For example, when I type, echo hello world, the string “hello world” (with a newline) is written to the standard output (the terminal).

When I type cat (with no arguments), it waits for input and, each time I hit Enter, passes the line from its standard input to its standard output, echoing it to the terminal. Try it. (To quit the program, type Ctrl-D to signal the end of input, or Ctrl-C to interrupt it.)

The third stream, standard error, is used to separate warnings and other special messages from a program’s output. It is by default directed to the terminal, but by changing where that points, we can keep our results and our diagnostics from interfering with each other.

What makes these streams so useful in the shell is that we can redirect each of them to other places.

Redirection and pipelines #

You can arbitrarily redirect standard input, output, and error to other files or devices.

For example,

./maze-solver < maze.txt > solution.txt

runs the program maze-solver, reading input from the file maze.txt as though it were typed at the keyboard and directing its output to the file solution.txt instead of the terminal (overwriting that file in the process).

For any command (with or without options and arguments) that reads from standard input and writes to standard output, we can redirect in many ways, including

  • command < input.file

    Gets standard input from input.file contents rather than the keyboard.

  • command > output.file

    Puts standard output into output.file rather than the screen. It overwrites the file’s contents (careful) or creates the file if it does not exist.

  • command >> output.file

    Puts standard output into output.file rather than the screen, but appends this output to the file’s existing contents. If the file does not exist, it is created before redirecting the output.

  • command | command2

    This is called a pipeline (or pipe for short). The standard output of command is connected to the standard input of command2. That is, command2 gets as input the output of command.

    Multiple pipes can be chained together, and the first (last) command in the chain can use input (output) redirection.

    With a few new commands, try

    ls -l | grep '^d' | wc -l
    echo "Hello, world" | cat | tr 'a-z' 'A-Z'
    

    The first counts the number of directories in the working directory. What did the second do?

    Use ls to make sure there are no files named FOO or BAR in your current directory. (If so, make up two names that are not used.)

    Consider these commands:

    echo "Hello, world" > FOO
    cat < FOO | tr 'a-z' 'A-Z' > BAR
    cat BAR
    rm FOO BAR
    

    What do the first three commands do? Feel free to try them. (The last command deletes the files FOO and BAR.)

  • command 2> file

    Direct standard error to file.

There are other combinations as well, like 2>&1 (stderr to stdout).
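
To see the streams being separated, we need a command that writes to both standard output and standard error. The sketch below defines a small shell function for that purpose; report is a hypothetical stand-in for a real program, and the file names are arbitrary.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# report writes one line to standard output and one to standard error.
report() {
    echo "result: 42"
    echo "warning: check your input" >&2
}

report > out.txt 2> err.txt      # the two streams go to separate files
report >> out.txt 2> /dev/null   # append stdout; discard stderr
report > both.txt 2>&1           # send stderr to the same place as stdout

cat out.txt                      # two "result" lines
cat err.txt                      # one "warning" line
wc -l < both.txt                 # both.txt holds 2 lines

tr 'a-z' 'A-Z' < out.txt | head -n 1   # redirect input, then pipe onward
```

In the third call the order matters: 2>&1 must come after > both.txt so that standard error follows standard output into the file.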

If you’re writing a command in your favorite programming language, you have ways to read from standard input and write to standard output; check the notes on writing command-line programs to see examples.

Iteration, conditionals, and other programming constructs #

Besides commands, the shell also provides a variety of programming constructs including iteration (for), conditionals (if), functions, nested commands, and a variety of simple data structures.

This makes it possible to create quite elaborate shell scripts, which can be useful for automation and other tasks.
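
As a small taste of these constructs, here is a sketch that combines a function, a conditional, and a loop. The function name describe, the file names, and the two-line threshold are all illustrative assumptions.

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'a\nb\nc\n' > long.txt
printf 'a\n' > short.txt

# describe FILE: report whether FILE has more than two lines.
describe() {
    # Count the lines; tr strips the padding some versions of wc add.
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ "$lines" -gt 2 ]; then
        echo "$1 is long (line count: $lines)"
    else
        echo "$1 is short (line count: $lines)"
    fi
}

# Iterate over every .txt file in the current directory.
for f in *.txt; do
    describe "$f"
done
```

Inside the function, "$1" refers to the function's first argument, just as it refers to a script's first argument in a shell script.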

In general, above a certain level of complexity, I prefer to write programs in a scripting language instead, but your mileage may vary.

The lecture on writing command-line programs explains how to write programs in your favorite programming language that work at the command line.

Configuration #

We have seen PS1 and PATH as ways to customize the shell’s behavior. But much more is possible.

Each shell maintains a collection of variables that describe its environment. You can customize these variables, define new commands, define aliases to make typing more convenient, and do much more so that every time you start the shell you have it configured as you prefer.

For bash, the .profile and .bashrc files in your home directory are automatically read for such customizations and configurations. (The latter is read by interactive shells; the former is read when a login shell starts.)

Remote shells #

The Secure Shell (SSH) protocol lets you log in to a remote computer and use its shell. The Statistics Department maintains a set of servers for use by students and faculty to do their work on – if you need 32GB of memory and eight CPUs for your analysis, a department server may be right for you.

For Mac and Linux users, an SSH client is already provided. Just type

ssh yourusername@hydra1.stat.cmu.edu

and you’ll be connected to the shell on hydra1.

On Windows, Git Bash provides an SSH client, which can be run in the exact same way.

Resources #

If you are new to the command line, you should try one of these:

  • Software Carpentry has a Unix Shell Tutorial
  • LearnShell.org is an interactive tutorial you can run in your browser, with exercises
  • ExplainShell is a useful tool for understanding command lines: give it any command and it will explain what each part of it means

There is also a cli-murders homework exercise to grow your understanding of the command line and the tools it provides.