Version Control

In even a moderately sized software project, there are many associated files, and they change quickly. Version control is a way to keep track of those changes so that the history of development can be recovered.

But it does much more. Version control enables:

working on code simultaneously with other team members
creating “branches” of code, to try new or experimental features
distributing code over the web for varied access
tagging parts of the history for special use
examining changes in the history in minute detail
stashing some changes/ideas for later and recovering them

Today we will talk about Git, a popular and powerful distributed version control system. There are dozens of others, most notably Mercurial and Subversion.

Why version control? #

One of the key themes of this course is that code should be understandable to others. Next week we’ll talk about the basic rules of clear, readable, reusable code, such as formatting and style. But there’s another piece to it: a reader must understand why your code is the way it is. They must be able to deduce its history.

And even if you’re not writing for an audience, there’s a saying I heard somewhere: The person who knows the most about your code is you six months ago, and you don’t reply to email.

So even when you’re working alone, version control provides many features:

Keep a complete record of changes. You can always revert back to a previous version or recall what you changed.
Store a message with every change, so the rationale for changes is always recorded.
Mark snapshots of your code: “the version submitted to JASA” or “the version used for my conference presentation”.
Easily distribute your code or back it up with a Git hosting service (like GitHub or Bitbucket).

Git concepts #

You have a folder with some code (R files, Python files, whatever) in it, plus maybe data, documentation, an analysis report – just about anything. This folder is your repository. You’d like to make snapshots of this folder: this is the version where I fixed that nasty bug, this is the version where I added new diagnostic plots, this is the version with the updated dataset.

You’d also like to be able to compare the snapshots – what changed between last week and now? – and to run previous snapshots of the code, if needed.

In Git, these snapshots are called commits. Each commit is the state of the code at a certain time. Git tracks all the files in your folder to see which have changed and what’s eligible to be committed. Every commit has a parent – the commit from which it came.

When you’re ready to make a snapshot, you tell Git which changes you want to include in the snapshot, and write a commit message, describing the snapshot. Git then records it all permanently.

This record of snapshots can be shared with others, who can clone your files and snapshots to have their own copy to work with.

Later, you can ask Git to check out a specific snapshot, and it will change all the files in the repository to match that snapshot of the code. Then you can ask Git to take you right back to where you were.

Crucially, Git can store parallel histories: if two different people have copies of the same repository and make changes independently, or if one person tries two different ways of changing the same code, snapshots of their work can exist in parallel, and the work can be reconciled (“merged”) by Git later.

These parallel lines of development are called branches, and can be given names.

Git works by creating a special hidden folder (called .git/) inside the folder you’re taking snapshots of. All snapshot history, commit information, and other data is stored in this folder. No other websites or services are necessary, though you can push your snapshots to a server (we’ll return to that later).
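
To make the snapshot idea concrete, here is a minimal sketch that builds a throwaway repository, takes two snapshots, visits the older one, and comes back. The temporary directory, the file name analysis.R, and the "Demo User" identity are placeholders invented for illustration; nothing here touches the course repository.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q                               # creates the hidden .git/ folder
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'x <- 1' > analysis.R
git add analysis.R                        # choose what goes into the snapshot
git commit -q -m 'Add analysis script'    # record the first snapshot
first=$(git rev-parse HEAD)               # the commit's hash

echo 'x <- 2' > analysis.R
git add analysis.R
git commit -q -m 'Change starting value'  # second snapshot; its parent is $first
parent=$(git rev-parse HEAD^)

git switch -q --detach "$first"           # check out the older snapshot
old=$(cat analysis.R)
git switch -q main                        # and go right back
new=$(cat analysis.R)
echo "old: $old | new: $new"
```

Everything Git recorded lives under "$demo/.git"; deleting that folder would leave only the current files.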

Some Git terminology #

Git is not particularly friendly to new users. You’ll need some terminology:

Repository: a directory (local or remote) containing tracked files
Commit: a project snapshot
Branch: a movable pointer to a commit, usually representing a specific line of development
Tag: a metadata object attached to commits
HEAD: a pointer to the local branch you are currently on
Tree: an object that maps file names to stored data (like file snapshots)

Three important trees:

Working directory: the files you see
Index: the staging area for accumulating changes
HEAD: a pointer to the latest commit
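
One way to see the three trees is to ask Git to compare them. The sketch below (in a throwaway repository, with a made-up file notes.txt and a placeholder identity) uses git diff to compare the working directory with the index, and git diff --staged to compare the index with HEAD.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'one' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'

echo 'two' >> notes.txt                   # changes the working directory only
unstaged=$(git diff --name-only)          # working directory vs. index
staged=$(git diff --staged --name-only)   # index vs. HEAD

git add notes.txt                         # copy the change into the index
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --staged --name-only)

echo "before add: unstaged=[$unstaged] staged=[$staged]"
echo "after add:  unstaged=[$unstaged_after] staged=[$staged_after]"
```

Before git add, the change shows up only in the plain git diff; afterwards it shows up only in git diff --staged, which is why a freshly staged file seems to disappear from git diff.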

A simple Git workflow illustrated #

Sometimes, learning Git can feel like this:

Let’s learn by doing instead.

Install Git and configure it according to the system setup instructions.

Cloning the Repo #

Clone (download) an existing repository:

cd ~/s750
git clone https://github.com/36-750/git-demo.git
cd git-demo/            # move into the cloned repository
git status              # check the status
ls -a                   # observe the .git hidden directory

Open the git-demo folder in Explorer or Finder or whatever you use on your computer to find files. Look – it’s the same stuff.

A Quick Look at the Plumbing #

Every Git object (a blob, tree, commit, or tag) is stored as a piece of data in the .git directory, indexed by its hash value.

ls .git
ls .git/objects
echo 'We can add objects at will' | git hash-object -w --stdin  # creates a new object
# prints the hash => 312b5389b6c43a4bf8867f6c607fcf2efd67824f
ls .git/objects/31   # the first two hex digits of the hash name the directory
git cat-file -p 312b5389b6c43a4bf8867f6c607fcf2efd67824f   # use your hash if different
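
Where does the hash come from? For a blob in a default (SHA-1) repository, it is the SHA-1 of a short header, "blob", the size in bytes, and a NUL byte, followed by the content. The sketch below recomputes it by hand and compares with git hash-object. It assumes the sha1sum utility is available (on macOS, shasum would play the same role).

```shell
set -eu
content='version 1'
size=$(( ${#content} + 1 ))               # +1 for the trailing newline echo adds

# Hash "blob <size>\0<content>\n" ourselves...
by_hand=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

# ...and let Git hash the same content (no -w, so nothing is stored).
by_git=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "by hand: $by_hand"
echo "by git:  $by_git"
```

Because the name depends only on the content, two files with identical content share one stored object.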

HEAD points to a commit, usually by way of a ref:

cat .git/HEAD             # shows a "ref", a friendly alias for a commit
cat .git/refs/heads/main  # this contains the (less friendly) hash
git cat-file -p 655eca13a3622464056a0d5a1d60cccd582aebe6  # what is in the commit? (use the hash printed above if different)

We can see the tree at the current main branch

git cat-file -p 'main^{tree}'    # Don't forget the quotes

Immutable structure sharing in action

echo 'version 1' > test.txt
git hash-object -w test.txt  #=> 83baae61804e65cc73a7201a7252750c76066a30
echo 'version 2' > test.txt
git hash-object -w test.txt  #=> 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
find .git/objects -type f    # both versions are there!
rm test.txt
git status                   # Not a visible change because no file
git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
cat test.txt
git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
cat test.txt
rm test.txt

Basic Operations #

Make a branch and switch to it.

git branch your-clever-name-here  # Create a branch
git branch                        # Look at available branches, main still current
git switch your-clever-name-here  # Switch to the branch
git status

We are now working on the branch, but since we haven’t changed it, main and your-clever-name-here are pointing to the same commit.
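
You can check this claim directly with git rev-parse, which prints the commit a name points to. A small sketch in a throwaway repository (the branch name feature, the file, and the identity are made up for illustration):

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'hello' > README.md
git add README.md
git commit -q -m 'Add README'

git branch feature                        # a new pointer, no new commit
main_id=$(git rev-parse main)
feature_id=$(git rev-parse feature)       # same hash as main

git switch -q feature
echo 'more' >> README.md
git commit -q -am 'Extend README'         # only the current branch moves
main_after=$(git rev-parse main)
feature_after=$(git rev-parse feature)

echo "main:    $main_after"
echo "feature: $feature_after"
```

Creating a branch costs almost nothing: it is just a second name for an existing commit until you commit on it.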

Advance the your-clever-name-here branch by two commits

git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
git status       # see what is highlighted
git switch main  # test.txt is 'untracked', so when we switch it 'follows'
git switch your-clever-name-here
git add test.txt # staging test.txt
git status       # note the difference
git commit -m 'meta: added a test version descriptor'
git status
git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
git add test.txt # staging test.txt, you need to stage again (cf. Three Trees)
git diff --staged  # Look at the staged differences (plain git diff shows nothing once the change is staged)
git commit -m 'meta: updated test version descriptor'
git status
git log --oneline --color --graph --branches

Change your work but try to switch before committing

echo 'version 3' > test.txt
git switch main  # oops
git commit -a    # -a stages the modified tracked file; try out your editor config, enter a commit message, and close the file
git switch main  # good now
git log --oneline --color --graph --branches

Look at your working directory in this branch. Switch back and forth and compare. Go back to main.

Create a new branch named YOURNAME-demo off main and switch to it. (Replace YOURNAME with your GitHub username.)
```
git branch YOURNAME-demo
git switch YOURNAME-demo
```
This will be our reference branch for this exercise, playing the role that main would play in practice. (This avoids conflicts.)
Make a new branch and switch to it.

Make some changes to your repository. Add files, edit something, whatever. Stage and commit the files. Do it one more time, two commits on the new branch.

Switch back to YOURNAME-demo.
```
git log --oneline --color --graph --branches
```
Merge the your-clever-name-here branch back into YOURNAME-demo. This should have no conflicts.

git merge will merge the specified branch into the current branch.
```
git switch YOURNAME-demo
git status  # check we are where we think we are!!
# if we were working with a remote, git pull to ensure we have the latest version
git merge your-clever-name-here
git log --oneline --color --graph --branches
```
You could delete the old branch if you like with git branch -d, though you do not have to. For today, keep it for later.
Make a new branch off YOURNAME-demo, switch to it, and change an existing file, commit. Switch back to YOURNAME-demo and change that same file. Commit. Merge the new branch into YOURNAME-demo. Suppose you called the branch conflict.
```
git switch YOURNAME-demo
git status
git merge conflict
# Surprise... conflict?
```
Look at the conflicting file and edit out the conflict markers, keeping the content you want. Then stage the edited file and commit; the merge will complete. If you would rather start over, you can abort a merge with git merge --abort.
```
git log --oneline --color --graph --branches
```
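
If you want to rehearse a conflict without touching your real branches, the following sketch stages one in a throwaway repository. The file settings.txt, its contents, and the identity are invented; only the branch name conflict echoes the exercise above.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'color = "red"' > settings.txt
git add settings.txt
git commit -q -m 'Add settings'

git switch -q -c conflict                 # change the line on one branch...
echo 'color = "blue"' > settings.txt
git commit -q -am 'Use blue'

git switch -q main                        # ...and differently on the other
echo 'color = "green"' > settings.txt
git commit -q -am 'Use green'

# The merge stops with a non-zero exit status and leaves markers in the file.
if git merge conflict >/dev/null 2>&1; then merged_cleanly=yes; else merged_cleanly=no; fi
markers=$(grep -c '^<<<<<<<' settings.txt)

echo 'color = "blue"' > settings.txt      # resolve by hand: markers removed
git add settings.txt                      # staging marks the conflict resolved
git commit -q --no-edit                   # finish the merge, default message
parents=$(git log -1 --format=%P | wc -w) # a merge commit has two parents
echo "clean: $merged_cleanly, markers: $markers, parents: $parents"
```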
In YOURNAME-demo, add some changes to the index, so they are staged to be committed. Then commit the changes.
```
git status
git add file_you_changed.py
git commit
```
Make more changes and stage them.

Look at differences

git log --oneline --color --graph --branches
git diff 3597a84 e3f8f5d           # replace any two hashes you want to compare
git diff # how is this diff different :?)

Commit if you like.

Push this branch to the remote repository (on GitHub)
```
git push --set-upstream origin YOURNAME-demo
```
The --set-upstream option is only necessary once, to tell Git that the “upstream” for this branch – the remote location for it – is the corresponding branch on GitHub, which will be created automatically.
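
To watch the upstream link being created without involving GitHub, this sketch uses a bare repository on the local disk as a stand-in for the remote. The paths, the branch name demo-branch, and the identity are assumptions made for the demo.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
git init -q --bare "$demo/remote.git"     # plays the role of GitHub
git init -q "$demo/work"
cd "$demo/work"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"
git remote add origin "$demo/remote.git"

echo 'draft' > report.txt
git add report.txt
git commit -q -m 'Add report'
git switch -q -c demo-branch

git push -q --set-upstream origin demo-branch   # first push: name the upstream
upstream=$(git rev-parse --abbrev-ref 'demo-branch@{upstream}')

echo 'final' > report.txt
git commit -q -am 'Finish report'
git push -q                               # later pushes need no arguments
remote_tip=$(git --git-dir="$demo/remote.git" rev-parse demo-branch)
echo "upstream: $upstream"
```
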
Switch back to the main branch:
```
git switch main
ls
```
Now look at the files in your repository. Notice they’ve all changed back to what they looked like before you switched to your branch.

Look at the status

git status
git log --oneline --color --graph --branches

Make a pull request on GitHub

https://github.com/36-750/git-demo

This is similar to the workflow we’ll use for submitting homework. We’ll provide you a script that automates part of it, but it’s good to know basic Git operations so you know what’s going on when you switch branches and submit homework.

Work Trees #

Working with branches can sometimes be tricky because one has changes in progress that keep you from switching branches when you want to.

git worktree lets you have several branches checked out at once by attaching additional linked working trees, each in its own directory, to the same repository. Start in main.

Pick two of the branches you created in the demo above. I’ll call them foo and bar. Create work trees
```
mkdir ../worktrees-git-demo
git worktree add ../worktrees-git-demo/foo foo
git worktree add ../worktrees-git-demo/bar bar
```
Branches foo and bar are now checked out in those directories.

Move to one and commit some changes

cd ../worktrees-git-demo/foo   # I like to keep my worktrees elsewhere but YMMV
# make and commit changes
git status

If you go back to the main git-demo directory and try to git switch foo, Git won’t let you; foo is already spoken for.

cd ~/s750/git-demo
git worktree remove ../worktrees-git-demo/foo
git switch foo
ls  # Notice the changes (in whatever way)
git switch main
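
The same sequence can be tried end to end in a scratch repository. In this sketch the directory names (repo, wt-foo), the files, and the identity are invented; foo is just the example branch name used above.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
mkdir "$demo/repo"
cd "$demo/repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'start' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'
git branch foo

git worktree add ../wt-foo foo >/dev/null 2>&1   # foo is checked out in ../wt-foo

# While the worktree exists, the main directory cannot switch to foo.
if git switch foo >/dev/null 2>&1; then blocked=no; else blocked=yes; fi

(
    cd ../wt-foo                          # commit on foo from the worktree
    echo 'written in the worktree' > extra.txt
    git add extra.txt
    git commit -q -m 'Add extra notes'
)

git worktree remove ../wt-foo             # the commits stay; the directory goes
git switch -q foo
seen=$(cat extra.txt)
echo "blocked: $blocked, extra.txt: $seen"
```

The commit made in the worktree belongs to the shared repository, so it is still on foo after the worktree directory is removed.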

This is a useful workflow for your homework. For each branch you are working on, create a worktree and work there. E.g.,

cd ~/s750
mkdir exercises  # directory for your active exercises
cd assignments-YOURNAME
git worktree add ../exercises/ex-name ex-name
cd ../exercises/ex-name
# work on it

When done, remove the worktree. Create it again if you need to work on the exercise.

Git-related services #

Git repositories can be “pushed” and “pulled” from one computer to another. Various services have sprouted up around this, offering web-based Git hosting services with bug trackers, code review features, and all sorts of goodies. GitHub is the biggest and most popular; Bitbucket is an alternative, and GitLab is an open-source competitor.

We’ll use GitHub in this course. You should already have an account set up.

(You can upgrade your GitHub account with a free educational discount at education.github.com, including unlimited private repositories. It’s not required for this class, but may be useful to you later.)

Many of these online services provide handy features for collaborating on code. GitHub, for example, has “pull requests”: if you’ve pushed a branch to a repository, you can request that its changes be merged into another branch. GitHub provides a convenient view of all the changes, lets people leave comments, and then has a big button to do the merge. We’ll see how to do this shortly.

Collaboration in Git #

Git is built to allow multiple people to collaborate on code, using this system of snapshots and branches.

Multiple people can have copies of a repository, and can all “push” their separate snapshots to a server, on separate branches of development. These branches can then be merged to incorporate the changes made in each.

Sometimes this is easy to do, and Git can do it automatically. Sometimes each branch contains changes to the same parts of files, and Git doesn’t know which changes you want to keep; in this case, it asks you to manually resolve the merge conflict.

An activity #

Let’s go back to that commit graph we were looking at. Suppose we merged the experimental branch into main:

git switch main
git merge experimental

The commit af1200ec is a merge commit, the snapshot of the code after experimental has been merged to main. The merge commit has two parents.
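
To confirm that a merge commit really has two parents, you can ask git log for the parent hashes. The sketch below rebuilds a tiny version of the situation in a throwaway repository; the files and identity are made up, and the hashes you get will differ from the ones mentioned above.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'base' > base.txt
git add base.txt
git commit -q -m 'Add base'

git switch -q -c experimental
echo 'idea' > idea.txt
git add idea.txt
git commit -q -m 'Try an idea'
exp_tip=$(git rev-parse HEAD)

git switch -q main
echo 'fix' > fix.txt
git add fix.txt
git commit -q -m 'Fix something on main'
main_tip=$(git rev-parse HEAD)

git merge --no-edit experimental >/dev/null   # histories diverged: real merge
parents=$(git log -1 --format=%P)             # parent hashes of the merge commit
echo "parents: $parents"
git log --graph --oneline
```

The first parent is the previous tip of main and the second is the tip of experimental.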

Git can show you a graph of commits with the command

git log --graph --oneline --color

I’d like you to recreate the graph above by making commits and branches. Team up with someone next to you to plan how to do this. Start from the main branch:

git switch main

To make a graph just starting from main, not including previous commits, try

git log --graph --oneline --color e625475.. --

which tells Git to graph the commits since e625475, the most recent I made on the main branch. It should be empty at first, since there is nothing new on main.

Good Git habits #

Write good commit messages! #

Commit messages should explain what you changed and why you changed it, so you understand your code later, and collaborators understand the purpose of your changes. An example Git log entry:

commit c52d398a97ba4a4c933945d3045cd69cbf7f9de0
Author: Alex Reinhart <areinhar@stat.cmu.edu>
Date:   Tue May 24 16:31:23 2016 -0400

    Fix error in intensity bounds calculation

    We incorrectly assumed that if the query node entirely preceded the
    data node, the bounds had to be zero. This is not the case: the
    background component is constant in time.

    Instead, have t_bounds return negative bounds when this occurs, which
    cause the foreground component only to be estimated as zero.

    Update kde_test to remove the fudge factor and instead test based on
    the desired eps, not the (max - min) reported by tree_intensity. I
    suspect rounding problems mean the (max - min) is sometimes zero,
    despite the reported value being very slightly different from what
    it should be.

Notice:

The first line is a short summary (subject line)
There are blank lines between paragraphs and after the summary line
The message describes the reasoning for the change
The message is written in the imperative mood (“Fix error” instead of “Fixed error”)
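
For a multi-paragraph message you do not have to go through an editor: git commit -F - reads the message from standard input. A sketch with an invented change and message (the file totals.R, its content, and the identity are placeholders):

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'total <- sum(x[1:n])' > totals.R
git add totals.R

# Subject line, blank line, then the reasoning.
git commit -q -F - <<'EOF'
Fix off-by-one error in total

The loop stopped one element early, so the last observation was
never included in the total.
EOF

subject=$(git log -1 --format=%s)         # just the subject line
body=$(git log -1 --format=%b)            # everything after the blank line
echo "subject: $subject"
```

Tools such as git log --oneline show only the subject, which is why the blank line after it matters.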

More advanced branching #

If you want to try a new algorithm, reorganize your code, or make changes separately from someone else working on the same code, make a branch.

We used a branch above to make a pull request. But what if the main branch is edited at the same time we’re working on our own separate branch?

Fundamental branch operations: merging and rebasing.

Make another change on main and commit it
Back to your branch
```
git switch your-clever-name-here
```
Rebase on main
```
git rebase main
```
Check the log now – we have all of main’s commits.
Alternatively, we can merge instead of rebasing. Merging takes another branch’s changes, applies them to the current branch, and (unless a fast-forward is possible) records the result as a new merge commit.
```
git switch main
git merge your-clever-name-here
```
In this case, the merge was a “fast-forward”: there were no other changes in main, so no merge commit was necessary.
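
Here is the rebase-then-merge sequence as one self-contained sketch, so you can inspect the result without disturbing your own branches. The branch name topic, the files, and the identity are invented for the demo.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

echo 'base' > base.txt
git add base.txt
git commit -q -m 'Add base'

git switch -q -c topic
echo 'topic work' > topic.txt
git add topic.txt
git commit -q -m 'Add topic work'

git switch -q main                        # main moves on in the meantime
echo 'main work' > main.txt
git add main.txt
git commit -q -m 'Add main work'
main_tip=$(git rev-parse main)

git switch -q topic
git rebase -q main                        # replay topic's commit on top of main
fork=$(git merge-base main topic)         # now equal to the tip of main

git switch -q main
git merge -q --ff-only topic              # main just moves forward
parent_count=$(git log -1 --format=%P | wc -w)
git log --graph --oneline
```

The --ff-only flag makes Git refuse instead of creating a merge commit, which is a handy check that the rebase did what you expected.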

Use git blame to find out why changes were made #

If you ever wonder why a line of code is the way it is, use git blame.

git blame fileHomework.py

Use “git show” to see the responsible commit.
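
As a small worked example, the sketch below changes one line of a file in a second commit, then uses git blame to find which commit last touched that line and git show to read its message. The file homework.py, its contents, and the identity are placeholders.

```shell
set -eu
demo=$(mktemp -d)                         # throwaway folder (placeholder location)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity, this repo only
git config user.email "demo@example.com"

printf 'rate = 0.5\nsteps = 10\n' > homework.py
git add homework.py
git commit -q -m 'Add simulation settings'

printf 'rate = 0.5\nsteps = 100\n' > homework.py
git commit -q -am 'Increase steps so the chain converges'
second=$(git rev-parse HEAD)

# --porcelain puts the full commit hash first; -L 2,2 restricts blame to line 2.
culprit=$(git blame --porcelain -L 2,2 homework.py | head -n 1 | cut -d' ' -f1)
reason=$(git show -s --format=%s "$culprit")   # subject of the responsible commit
echo "line 2 last changed in $culprit: $reason"
```

This only pays off if the commit message explains the reasoning, which is one more argument for writing good ones.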

RStudio #

For those of you using RStudio, there is a handy graphical interface built in, once you create an RStudio project for your work. We recommend learning the command-line basics first, however.

Resources #

Some links to helpful resources:

Learn Git Branching is an interactive tutorial, with diagrams
The Software Carpentry Git lesson walks you through basic Git operations on your computer
ProGit is a full book about Git that you can read online
The Git Reference Manual lists every single Git command and its options
GitHub’s Git Cheat Sheets explain some of the common commands
Dangit, Git! explains how to fix common Git problems you may encounter
RStudio Git integration documentation

Git has extensive (but not always helpful) manual pages, e.g. man git-merge.

We recommend trying the git-challenge homework assignment and referring to the resources above as you do it, so you understand each step you’re taking.
