Version Control

— Christopher Genovese and Alex Reinhart

Even a moderately sized software project has many associated files, and they change quickly. Version control is a way to keep track of those changes so that the history of development can be recovered.

But it does much more. Version control enables:

  • working on code simultaneously with other team members
  • creating “branches” of code, to try new or experimental features
  • distributing code over the web for varied access
  • tagging parts of the history for special use
  • examining changes in the history in minute detail
  • stashing some changes/ideas for later and recovering them

Today we will talk about Git, a popular and powerful distributed version control system. There are dozens of others, most notably Mercurial and Subversion.

Why version control? #

One of the key themes of this course is that code should be understandable to others. Next week we’ll talk about the basic rules of clear, readable, reusable code, such as formatting and style. But there’s another piece to it: a reader must understand why your code is the way it is. They must be able to deduce its history.

And even if you’re not writing for an audience, there’s a saying I heard somewhere: The person who knows the most about your code is you six months ago, and you don’t reply to email.

So even when you’re working alone, version control provides many features:

  • Keep a complete record of changes. You can always revert to a previous version or recall what you changed.
  • Store a message with every change, so the rationale for changes is always recorded.
  • Mark snapshots of your code: “the version submitted to JASA” or “the version used for my conference presentation”.
  • Easily distribute your code or back it up with a Git hosting service (like GitHub or Bitbucket).

Git concepts #

You have a folder with some code (R files, Python files, whatever) in it, plus maybe data, documentation, an analysis report – just about anything. This folder is your repository. You’d like to make snapshots of this folder: this is the version where I fixed that nasty bug, this is the version where I added new diagnostic plots, this is the version with the updated dataset.

You’d also like to be able to compare the snapshots – what changed between last week and now? – and to run previous snapshots of the code, if needed.

In Git, these snapshots are called commits. Each commit is the state of the code at a certain time. Git tracks all the files in your folder to see which have changed and what’s eligible to be committed. Every commit except the very first has a parent – the commit from which it came.

When you’re ready to make a snapshot, you tell Git which changes you want to include in the snapshot, and write a commit message, describing the snapshot. Git then records it all permanently.

This record of snapshots can be shared with others, who can clone your files and snapshots to have their own copy to work with.

Later, you can ask Git to check out a specific snapshot, and it will change all the files in the repository to match that snapshot of the code. Then you can ask Git to take you right back to where you were.
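
The snapshot cycle described above can be sketched in a scratch repository. This is only an illustration: the file name, its contents, and the commit messages are made up, and `git init -b` assumes Git 2.28 or newer.

```shell
# Scratch repository; every name below is hypothetical.
demo=$(mktemp -d) && cd "$demo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"

echo 'first draft' > notes.txt
git add notes.txt                   # choose what goes into the snapshot
git commit -q -m 'Add notes'        # snapshot 1

echo 'second draft' > notes.txt
git commit -q -am 'Revise notes'    # snapshot 2 (-a stages tracked files)

git switch -q --detach HEAD~1       # check out the earlier snapshot
old=$(cat notes.txt)
git switch -q main                  # go right back to where we were
new=$(cat notes.txt)
echo "then: $old / now: $new"
```

The files on disk change to match whichever snapshot is checked out, and switching back to main restores the latest version.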

Crucially, Git can store parallel histories: if two different people have copies of the same repository and make changes independently, or if one person tries two different ways of changing the same code, snapshots of their work can exist in parallel, and the work can be reconciled (“merged”) by Git later.

These parallel lines of development are called branches, and can be given names:

[Figure: /ox-hugo/branches.svg, a commit graph with named branches]

Git works by creating a special hidden folder (called .git/) inside the folder you’re taking snapshots of. All snapshot history, commit information, and other data is stored in this folder. No other websites or services are necessary, though you can push your snapshots to a server (we’ll return to that later).

Some Git terminology #

Git is not particularly friendly to new users. You’ll need some terminology:

Repository
a directory (local or remote) containing tracked files
Commit
a project snapshot
Branch
a movable pointer to a commit, usually representing a specific line of development
Tag
a metadata object attached to commits
HEAD
a pointer to the local branch you are currently on
Tree
an object that maps file names to stored data (like file snapshots)

Three important trees:

Working directory
the files you see
Index
the staging area for accumulating changes
HEAD
a pointer to the latest commit
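
To make the three trees concrete, here is a small sketch in which one file has a different version in each of them. The repository, file name, and contents are invented for illustration, and `git init -b` assumes Git 2.28 or newer.

```shell
# Scratch repository; file name and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"

echo 'one' > f.txt
git add f.txt
git commit -q -m 'Add f.txt'        # HEAD now records "one"

echo 'two' > f.txt
git add f.txt                       # the index now holds "two"
echo 'three' > f.txt                # the working directory holds "three"

in_head=$(git show HEAD:f.txt)      # version in the latest commit
in_index=$(git show :f.txt)         # version in the staging area
in_worktree=$(cat f.txt)            # version on disk
echo "HEAD=$in_head index=$in_index working=$in_worktree"

git diff --stat                     # working directory vs. index
git diff --staged --stat            # index vs. HEAD
```

Plain `git diff` compares the working directory with the index, while `git diff --staged` compares the index with HEAD, which is why staging a change makes it vanish from one and appear in the other.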

A simple Git workflow illustrated #

Sometimes, learning Git can feel like this:

[Image: /ox-hugo/memorizing-git-commands.jpg]

Let’s learn by doing instead.

Install Git and configure it according to the system setup instructions.

Cloning the Repo #

  1. Clone (download) an existing repository:

    cd ~/s750
    git clone https://github.com/36-750/git-demo.git
    cd git-demo/            # move into the cloned repository
    git status              # check the status
    ls -a                   # observe the .git hidden directory
    
  2. Open the git-demo folder in Explorer or Finder or whatever you use on your computer to find files. Look – it’s the same stuff.

A Quick Look at the Plumbing #

  1. Every Git object (blob, tree, commit, or tag) is stored in the .git directory, indexed by its hash value.

    ls .git
    ls .git/objects
    echo 'We can add objects at will' | git hash-object -w --stdin  # creates a new object
    # prints the hash => 312b5389b6c43a4bf8867f6c607fcf2efd67824f
    ls .git/objects/31   # the first two hex digits of the hash name a directory
    git cat-file -p 312b5389b6c43a4bf8867f6c607fcf2efd67824f   # use your hash if different
    
  2. HEAD points to a commit through a ref

    cat .git/HEAD             # shows a "ref", a friendly alias for a commit
    cat .git/refs/heads/main  # this contains the (less friendly) hash
    git cat-file -p 655eca13a3622464056a0d5a1d60cccd582aebe6  # what is in the commit? (use the hash printed above if different)
    
  3. We can see the tree at the current main branch

    git cat-file -p 'main^{tree}'    # Don't forget the quotes
    
  4. Immutable structure sharing in action

    echo 'version 1' > test.txt
    git hash-object -w test.txt  #=> 83baae61804e65cc73a7201a7252750c76066a30
    echo 'version 2' > test.txt
    git hash-object -w test.txt  #=> 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
    find .git/objects -type f    # both versions are there!
    rm test.txt
    git status                   # nothing to report: the objects remain, but no file is left in the working directory
    git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
    cat test.txt
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
    cat test.txt
    rm test.txt
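
Where do these hashes come from? As a rough sketch, assuming a repository in Git's default SHA-1 object format and a system with `sha1sum` (GNU coreutils; `shasum` is the usual macOS counterpart), a blob's name is the SHA-1 of a short header followed by the file contents:

```shell
# Scratch repository so the example does not touch git-demo.
repo=$(mktemp -d) && cd "$repo"
git init -q

# A blob is named by the SHA-1 of "blob <size>\0<content>".
# 'version 1' plus its trailing newline is 10 bytes.
by_hand=$(printf 'blob 10\0version 1\n' | sha1sum | cut -d' ' -f1)
by_git=$(echo 'version 1' | git hash-object --stdin)

echo "by hand: $by_hand"
echo "by git:  $by_git"
```

Both lines should print the same value, and it should match the hash shown for 'version 1' above. Because the name depends only on the contents, identical files are stored once no matter how many commits include them.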
    

Basic Operations #

  1. Make a branch and switch to it.

    git branch your-clever-name-here  # Create a branch
    git branch                        # Look at available branches, main still current
    git switch your-clever-name-here  # Switch to the branch
    git status
    

    We are now working on the branch, but since we haven’t changed it, main and your-clever-name-here are pointing to the same commit.

  2. Advance the your-clever-name-here branch by two commits

    git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
    git status       # see what is highlighted
    git switch main  # test.txt is 'untracked', so when we switch it 'follows'
    git switch your-clever-name-here
    git add test.txt # staging test.txt
    git status       # note the difference
    git commit -m 'meta: added a test version descriptor'
    git status
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
    git add test.txt # staging test.txt, you need to stage again (cf. Three Trees)
    git diff --staged  # look at the staged differences (plain git diff shows nothing once staged)
    git commit -m 'meta: updated test version descriptor'
    git status
    git log --oneline --color --graph --branches
    
  3. Change your work but try to switch before committing

    echo 'version 3' > test.txt
    git switch main  # oops
    git commit -a    # -a stages the tracked change; try out your editor config, enter a commit message, and close the file
    git switch main  # good now
    git log --oneline --color --graph --branches
    

    Look at your working directory in this branch. Switch back and forth and compare. Go back to main.

  4. Create a new branch named YOURNAME-demo off main and switch to it. (Replace YOURNAME with your github username.)

    git branch YOURNAME-demo
    git switch YOURNAME-demo
    

    This will be our reference branch for this exercise, playing the role that main would play in practice. (Avoids conflicts.)

  5. Make a new branch and switch to it.

    Make some changes to your repository. Add files, edit something, whatever. Stage and commit the files. Do it one more time, two commits on the new branch.

    Switch back to YOURNAME-demo.

    git log --oneline --color --graph --branches
    
  6. Merge the your-clever-name-here branch back into YOURNAME-demo. This should have no conflicts.

    git merge will merge the specified branch into the current branch.

    git switch YOURNAME-demo
    git status  # check we are where we think we are!!
    # if we were working with a remote, git pull to ensure we have the latest version
    git merge your-clever-name-here
    git log --oneline --color --graph --branches
    

    You could delete the old branch if you like with git branch -d, though you do not have to. For today, keep it for later.

  7. Make a new branch off YOURNAME-demo, switch to it, change an existing file, and commit. Switch back to YOURNAME-demo, change that same file in a different way, and commit. Then merge the new branch into YOURNAME-demo. Suppose you called the branch conflict.

    git switch YOURNAME-demo
    git status
    git merge conflict
    # Surprise... conflict?
    

    Look at the conflicting file. Edit out the conflict markers, stage the edited file with git add, and commit; the commit concludes the merge. If you would rather back out, you can abort the merge with git merge --abort.

    git log --oneline --color --graph --branches
    
  8. In YOURNAME-demo, add some changes to the index, so they are staged to be committed. Then commit the changes.

    git status
    git add file_you_changed.py
    git commit
    
  9. Make more changes and stage them.

  10. Look at differences

    git log --oneline --color --graph --branches
    git diff 3597a84 e3f8f5d           # replace with any two hashes you want to compare
    git diff # how is this diff different :?)
    

    Commit if you like.

  11. Push this branch to the remote repository (on GitHub)

    git push --set-upstream origin YOURNAME-demo
    

    The --set-upstream option is only necessary once, to tell Git that the “upstream” for this branch – the remote location for it – is the corresponding branch on GitHub, which will be created automatically.

  12. Switch back to the main branch:

    git switch main
    ls
    

    Now look at the files in your repository. Notice they’ve all changed back to what they looked like before you switched to your branch.

  13. Look at the status

    git status
    git log --oneline --color --graph --branches
    
  14. Make a pull request on GitHub

    https://github.com/36-750/git-demo

This is similar to the workflow we’ll use for submitting homework. We’ll provide you a script that automates part of it, but it’s good to know basic Git operations so you know what’s going on when you switch branches and submit homework.
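
One idea running through the steps above is that a branch is just a movable pointer to a commit. A minimal sketch of that, using a scratch repository with made-up names (and assuming Git 2.28 or newer for `git init -b`):

```shell
# Scratch repository; branch and file names are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'hello' > README.txt
git add README.txt
git commit -q -m 'Add README'

git branch topic                    # a new pointer to the same commit
if [ "$(git rev-parse main)" = "$(git rev-parse topic)" ]; then
  same=yes
else
  same=no
fi

git switch -q topic
echo 'more' >> README.txt
git commit -q -am 'Extend README'   # only topic moves forward
ahead=$(git rev-list --count main..topic)   # commits on topic that main lacks

git switch -q main
git merge -q --ff-only topic        # main simply catches up to topic
echo "same=$same ahead=$ahead"
git log --oneline --graph --branches
```

Creating the branch copies nothing; committing moves only the branch you are on, and the merge here just moves main forward to the same commit.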

Work Trees #

Working with branches can be tricky when changes in progress keep you from switching branches when you want to.

git-worktree lets you work on several branches at once by checking each one out in its own linked working directory, outside the main one. Start in main.

  1. Pick two of the branches you created in the demo above. I’ll call them foo and bar. Create work trees

    mkdir ../worktrees-git-demo
    git worktree add ../worktrees-git-demo/foo foo
    git worktree add ../worktrees-git-demo/bar bar
    

    Branches foo and bar are now checked out in those directories.

  2. Move to one and commit some changes

    cd ../worktrees-git-demo/foo   # I like to keep my worktrees elsewhere but YMMV
    # make and commit changes
    git status
    

    If you go back to the main git-demo directory and try to git switch foo, Git won’t let you; foo is already checked out in the worktree.

    cd ~/s750/git-demo
    git worktree remove ../worktrees-git-demo/foo
    git switch foo
    ls  # Notice the changes (in whatever way)
    git switch main
    

This is a useful workflow for your homework. For each branch you are working on, create a worktree and work there. E.g.,

cd ~/s750
mkdir exercises  # directory for your active exercises
cd assignments-YOURNAME
git worktree add ../exercises/ex-name ex-name
cd ../exercises/ex-name
# work on it

When done, remove the worktree. Create it again if you need to work on the exercise.
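
The whole worktree life cycle can be tried safely in a throwaway repository. The sketch below uses invented directory and branch names and assumes Git 2.28 or newer for `git init -b`.

```shell
# Throwaway repository and worktree; all names are hypothetical.
base=$(mktemp -d) && cd "$base"
git init -q -b main repo && cd repo
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'start' > notes.txt
git add notes.txt
git commit -q -m 'Add notes'
git branch foo

git worktree add ../foo-tree foo    # check out foo in a sibling directory
git worktree list                   # shows both working directories

# foo is checked out in the worktree, so switching to it here is refused.
if git switch foo 2>/dev/null; then blocked=no; else blocked=yes; fi

git worktree remove ../foo-tree     # the branch itself is untouched
git switch -q foo                   # now the switch succeeds
echo "blocked=$blocked, now on $(git branch --show-current)"
```

Removing a worktree deletes only the extra directory; the branch and its commits stay in the repository, so it can be added again later.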

Git repositories can be “pushed” and “pulled” from one computer to another. Various services have sprouted up around this, offering web-based Git hosting services with bug trackers, code review features, and all sorts of goodies. GitHub is the biggest and most popular; Bitbucket is an alternative, and GitLab is an open-source competitor.

We’ll use GitHub in this course. You should already have an account set up.

(You can upgrade your GitHub account with a free educational discount at education.github.com, including unlimited private repositories. It’s not required for this class, but may be useful to you later.)

Many of these online services provide handy features for collaborating on code. GitHub, for example, has “pull requests”: if you’ve pushed a branch to a repository, you can request that its changes be merged into another branch. GitHub provides a convenient view of all the changes, lets people leave comments, and then has a big button to do the merge. We’ll see how to do this shortly.

Collaboration in Git #

Git is built to allow multiple people to collaborate on code, using this system of snapshots and branches.

Multiple people can have copies of a repository, and can all “push” their separate snapshots to a server, on separate branches of development. These branches can then be merged to incorporate the changes made in each.
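
Pushing does not require a hosting service, so this can be tried offline. In the sketch below a bare repository in a temporary directory stands in for the server; the names alice, bob, and alice-demo are hypothetical, and `git init -b` assumes Git 2.28 or newer.

```shell
# A local bare repository plays the role of the server.
base=$(mktemp -d) && cd "$base"
git init -q --bare -b main server.git

# First collaborator: create work and publish it.
git init -q -b main alice && cd alice
git config user.name "Alice"
git config user.email "alice@example.com"
git remote add origin "$base/server.git"
echo 'shared work' > report.txt
git add report.txt
git commit -q -m 'Add report'
git push -q --set-upstream origin main

git switch -q -c alice-demo         # a separate line of development
echo 'more' >> report.txt
git commit -q -am 'Extend report'
git push -q --set-upstream origin alice-demo

# Second collaborator: clone and see both branches.
cd "$base"
git clone -q "$base/server.git" bob
cd bob
git branch -r                       # lists origin/main and origin/alice-demo
```

Each collaborator has a full copy of the history, and the branches pushed by one show up as origin/... branches for the other.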

Sometimes this is easy to do, and Git can do it automatically. Sometimes each branch contains changes to the same parts of files, and Git doesn’t know which changes you want to keep; in this case, it asks you to manually resolve the merge conflict.
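
Here is a small sketch of a merge conflict and its manual resolution, in a scratch repository. The file, its contents, and the branch name are made up, and `git init -b` assumes Git 2.28 or newer.

```shell
# Scratch repository; names and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'color = red' > settings.txt
git add settings.txt
git commit -q -m 'Add settings'

git switch -q -c conflict
echo 'color = blue' > settings.txt
git commit -q -am 'Use blue'

git switch -q main
echo 'color = green' > settings.txt
git commit -q -am 'Use green'

# Both branches changed the same line, so the merge stops with a conflict.
if git merge conflict >/dev/null 2>&1; then merged=clean; else merged=conflicted; fi
grep -c '^<<<<<<<' settings.txt     # counts the conflict-marker lines Git wrote

echo 'color = green' > settings.txt # resolve by hand: here we keep one side
git add settings.txt                # staging marks the conflict as resolved
git commit -q -m 'Merge conflict branch, keeping green'
echo "merge was $merged"
```

Git only decides that the file is resolved when you stage it; what the resolved file should contain is a human decision.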

An activity #

Let’s go back to that commit graph we were looking at. Suppose we merged the experimental branch into main:

git switch main
git merge experimental

[Figure: /ox-hugo/branches-merged.svg, the commit graph after the merge]

The commit af1200ec is a merge commit, the snapshot of the code after experimental has been merged to main. The merge commit has two parents.
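
You can check the two-parent claim yourself. The sketch below builds a tiny diverged history in a scratch repository (file names are invented; `git init -b` assumes Git 2.28 or newer) and counts the parents of the resulting merge commit.

```shell
# Scratch repository; file names are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'base' > base.txt
git add base.txt
git commit -q -m 'Add base'

git switch -q -c experimental
echo 'idea' > idea.txt
git add idea.txt
git commit -q -m 'Try an idea'

git switch -q main
echo 'fix' > fix.txt
git add fix.txt
git commit -q -m 'Fix a bug'

# The histories have diverged, so this creates a real merge commit.
git merge -q -m 'Merge experimental' experimental
git log --graph --oneline

# rev-list --parents prints the commit followed by each of its parents.
words=$(git rev-list --parents -n 1 HEAD | wc -w)
echo "parent count: $((words - 1))"
```

`git cat-file -p HEAD` shows the same thing in raw form: a merge commit has two `parent` lines.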

Git can show you a graph of commits with the command

git log --graph --oneline --color

I’d like you to recreate the graph above by making commits and branches. Team up with someone next to you to plan how to do this. Start from the main branch:

git switch main

To make a graph just starting from main, not including previous commits, try

git log --graph --oneline --color e625475.. --

which tells Git to graph the commits since e625475, the most recent commit I made on the main branch. It should be empty at first, since there is nothing new on main.

Good Git habits #

Write good commit messages! #

Commit messages should explain what you changed and why you changed it, so you understand your code later, and collaborators understand the purpose of your changes. An example Git log entry:

commit c52d398a97ba4a4c933945d3045cd69cbf7f9de0
Author: Alex Reinhart <areinhar@stat.cmu.edu>
Date:   Tue May 24 16:31:23 2016 -0400

    Fix error in intensity bounds calculation

    We incorrectly assumed that if the query node entirely preceded the
    data node, the bounds had to be zero. This is not the case: the
    background component is constant in time.

    Instead, have t_bounds return negative bounds when this occurs, which
    cause the foreground component only to be estimated as zero.

    Update kde_test to remove the fudge factor and instead test based on
    the desired eps, not the (max - min) reported by tree_intensity. I
    suspect rounding problems mean the (max - min) is sometimes zero,
    despite the reported value being very slightly different from what
    it should be.

Notice:

  • The first line is a short summary (subject line)
  • There are blank lines between paragraphs and after the summary line
  • The message describes the reasoning for the change
  • The message is written in the imperative mood (“Fix error” instead of “Fixed error”)
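
A message in this shape can be written without opening an editor. In this sketch (scratch repository, invented file and message text; `git init -b` assumes Git 2.28 or newer) each `-m` option becomes its own paragraph:

```shell
# Scratch repository; file and message text are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'x = 1' > model.py
git add model.py

# The first -m is the subject line; Git puts a blank line before the next.
git commit -q \
  -m 'Add initial model parameters' \
  -m 'Start with x = 1 so later commits have a baseline to compare against.'

subject=$(git log -1 --format=%s)   # subject line only
body=$(git log -1 --format=%b)      # everything after the blank line
echo "$subject"
git log -1 --format=%B              # the full message as stored
```

For longer explanations, plain `git commit` opens your editor, which is usually more comfortable than stacking `-m` options.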

More advanced branching #

If you want to try a new algorithm, reorganize your code, or make changes separately from someone else working on the same code, make a branch.

We used a branch above to make a pull request. But what if the main branch is edited at the same time we’re working on our own separate branch?

Fundamental branch operations: merging and rebasing.

  1. Make another change on main and commit it

  2. Back to your branch

    git switch your-clever-name-here
    
  3. Rebase on main

    git rebase main
    
  4. Check the log now – we have all of main’s commits.

  5. Alternatively, we can merge instead of rebasing. Merging takes another branch’s changes, applies them to the current branch, and saves the result as a new commit.

    git switch main
    git merge your-clever-name-here
    

    In this case, the merge was a “fast-forward”: after the rebase, your branch already contained every commit on main, so main could simply move forward and no merge commit was necessary.
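
The rebase-then-merge sequence above can be replayed in a scratch repository to see that the resulting history is linear. Branch and file names below are invented, and `git init -b` assumes Git 2.28 or newer.

```shell
# Scratch repository; branch and file names are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
echo 'a' > a.txt
git add a.txt
git commit -q -m 'Add a'

git switch -q -c topic
echo 'b' > b.txt
git add b.txt
git commit -q -m 'Add b on topic'

git switch -q main
echo 'c' > c.txt
git add c.txt
git commit -q -m 'Add c on main'    # main and topic have now diverged

git switch -q topic
git rebase -q main                  # replay topic's commit on top of main
git switch -q main
git merge -q --ff-only topic        # main just moves forward
git log --oneline --graph

merges=$(git rev-list --merges --count main)
echo "merge commits on main: $merges"
```

The `--ff-only` flag makes Git refuse the merge rather than create a merge commit, which is a handy way to confirm that a rebase did what you expected.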

Use git blame to find out why changes were made #

If you ever wonder why a line of code is the way it is, use git blame.

git blame fileHomework.py

Use “git show” to see the responsible commit.
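
As a sketch of how the two commands fit together, the following builds a two-commit history in a scratch repository and traces one line back to the commit that last changed it. The file name, contents, and messages are made up; `git init -b` assumes Git 2.28 or newer.

```shell
# Scratch repository; file name and contents are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q -b main                 # -b needs Git 2.28 or newer
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'alpha = 1\nbeta = 2\n' > params.py
git add params.py
git commit -q -m 'Add parameters'

printf 'alpha = 1\nbeta = 3\n' > params.py
git commit -q -am 'Raise beta to 3 after recalibration'

git blame -s params.py              # one abbreviated commit hash per line

# --porcelain prints the full hash first; -L 2,2 restricts blame to line 2.
culprit=$(git blame -L 2,2 --porcelain params.py | head -n 1 | cut -d' ' -f1)
git show --stat "$culprit"          # the message explaining that change
```

Line 1 is attributed to the first commit and line 2 to the second, and `git show` on the second displays the commit message recording why beta changed.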

RStudio #

For those of you using RStudio, there is a handy graphical interface built in, once you create an RStudio project for your work. We recommend learning the command-line basics first, however.

Resources #

Some links to helpful resources:

Git has extensive (but not always helpful) manual pages, e.g. man git-merge.

We recommend trying the git-challenge homework assignment and referring to the resources above as you do it, so you understand each step you’re taking.