In even a moderately sized software project, there are many associated files, and they change quickly. Version control is a way to keep track of those changes so that the history of development can be recovered.
But it does much more. Version control enables:
- working on code simultaneously with other team members
- creating “branches” of code, to try new or experimental features
- distributing code over the web for varied access
- tagging parts of the history for special use
- examining changes in the history in minute detail
- stashing some changes/ideas for later and recovering them
Today we will talk about Git, a popular and powerful distributed version control system. There are dozens of others, most notably Mercurial and Subversion.
Why version control? #
One of the key themes of this course is that code should be understandable to others. Next week we’ll talk about the basic rules of clear, readable, reusable code, such as formatting and style. But there’s another piece to it: a reader must understand why your code is the way it is. They must be able to deduce its history.
And even if you’re not writing for an audience, there’s a saying I heard somewhere: The person who knows the most about your code is you six months ago, and you don’t reply to email.
So even when you’re working alone, version control provides many features:
- Keep a complete record of changes. You can always revert to a previous version or recall what you changed.
- Store a message with every change, so the rationale for changes is always recorded.
- Mark snapshots of your code: “the version submitted to JASA” or “the version used for my conference presentation”.
- Easily distribute your code or back it up with a Git hosting service (like GitHub or Bitbucket).
Git concepts #
You have a folder with some code (R files, Python files, whatever) in it, plus maybe data, documentation, an analysis report – just about anything. This folder is your repository. You’d like to make snapshots of this folder: this is the version where I fixed that nasty bug, this is the version where I added new diagnostic plots, this is the version with the updated dataset.
You’d also like to be able to compare the snapshots – what changed between last week and now? – and to run previous snapshots of the code, if needed.
In Git, these snapshots are called commits. Each commit is the state of the code at a certain time. Git tracks all the files in your folder to see which have changed and what’s eligible to be committed. Every commit has a parent – the commit from which it came.
When you’re ready to make a snapshot, you tell Git which changes you want to include in the snapshot, and write a commit message, describing the snapshot. Git then records it all permanently.
This record of snapshots can be shared with others, who can clone your files and snapshots to have their own copy to work with.
Later, you can ask Git to check out a specific snapshot, and it will change all the files in the repository to match that snapshot of the code. Then you can ask Git to take you right back to where you were.
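As a minimal sketch of these ideas, the commands below make two snapshots in a scratch repository, confirm that the second commit's parent is the first, check out the earlier snapshot, and then return. The temporary directory, the file name analysis.R, and the "Demo User" identity are hypothetical and only serve the illustration.

```shell
# Scratch repository in a temporary directory (all names here are hypothetical).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the first branch "main"
git config user.name "Demo User"
git config user.email "demo@example.com"

# First snapshot.
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
first=$(git rev-parse HEAD)

# Second snapshot; its parent is the first commit.
echo "plot(x)" >> analysis.R
git add analysis.R
git commit -q -m "Add diagnostic plot"
parent=$(git rev-parse HEAD^)

# Check out the first snapshot: the file goes back to one line.
git switch -q --detach "$first"
old_lines=$(wc -l < analysis.R | tr -d ' ')

# Return to the latest snapshot on main: the file has two lines again.
git switch -q main
new_lines=$(wc -l < analysis.R | tr -d ' ')

echo "lines in old snapshot: $old_lines, lines now: $new_lines"
```

Checking out a bare commit hash leaves you on no branch at all (a "detached HEAD"), which is why the sketch switches back to main by name afterwards.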
Crucially, Git can store parallel histories: if two different people have copies of the same repository and make changes independently, or if one person tries two different ways of changing the same code, snapshots of their work can exist in parallel, and the work can be reconciled (“merged”) by Git later.
These parallel lines of development are called branches, and can be given names.
Git works by creating a special hidden folder (called .git/) inside the folder you’re taking snapshots of. All snapshot history, commit information, and other data is stored in this folder. No other websites or services are necessary, though you can push your snapshots to a server (we’ll return to that later).
Some Git terminology #
Git is not particularly friendly to new users. You’ll need some terminology:
- Repository: a directory (local or remote) containing tracked files
- Commit: a project snapshot
- Branch: a movable pointer to a commit, usually representing a specific line of development
- Tag: a metadata object attached to commits
- HEAD: a pointer to the local branch you are currently on
- Tree: an object that maps file names to stored data (like file snapshots)
Three important trees:
- Working directory: the files you see
- Index: the staging area for accumulating changes
- HEAD: a pointer to the latest commit
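One way to watch a change move through the three trees is to compare them with git diff. This is only a sketch in a scratch repository; the file name notes.txt and the identity are hypothetical. Plain git diff compares the working directory with the index, and git diff --staged compares the index with HEAD.

```shell
# Scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# The change exists only in the working directory.
echo "two" >> notes.txt
unstaged_before=$(git diff --name-only)          # working directory vs. index
staged_before=$(git diff --staged --name-only)   # index vs. HEAD

# Staging copies the change into the index.
git add notes.txt
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --staged --name-only)

# Committing moves HEAD forward; all three trees now agree.
git commit -q -m "Extend notes"
git status --short   # prints nothing when the trees match
```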
A simple Git workflow illustrated #
Sometimes, learning Git can feel like this:
Let’s learn by doing instead.
Install Git and configure it according to the system setup instructions.
Cloning the Repo #
-
Clone (download) an existing repository:

    cd ~/s750
    git clone https://github.com/36-750/git-demo.git
    cd git-demo/   # move into the cloned repository
    git status     # check the status
    ls -a          # observe the .git hidden directory
-
Open the git-demo folder in Explorer or Finder or whatever you use on your computer to find files. Look – it’s the same stuff.
A Quick Look at the Plumbing #
-
Every Git object (blob, tree, commit, or tag) is stored as a piece of data in the .git directory, indexed by its hash value.

    ls .git
    ls .git/objects
    echo 'We can add objects at will' | git hash-object -w --stdin   # creates a new object
    # prints the hash => 312b5389b6c43a4bf8867f6c607fcf2efd67824f
    ls .git/objects/31   # the first two hex digits of the hash name a directory
    git cat-file -p 312b5389b6c43a4bf8867f6c607fcf2efd67824f   # use your hash if different
-
HEAD points to a commit by way of a ref:

    cat .git/HEAD              # shows a "ref", a friendly alias for a commit
    cat .git/refs/heads/main   # this contains the (less friendly) hash
    git cat-file -p 655eca13a3622464056a0d5a1d60cccd582aebe6   # what is in the commit? (use the hash printed above if yours differs)
-
We can see the tree at the current main branch:

    git cat-file -p 'main^{tree}'   # Don't forget the quotes
-
Immutable structure sharing in action:

    echo 'version 1' > test.txt
    git hash-object -w test.txt   #=> 83baae61804e65cc73a7201a7252750c76066a30
    echo 'version 2' > test.txt
    git hash-object -w test.txt   #=> 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
    find .git/objects -type f     # both versions are there!
    rm test.txt
    git status                    # no visible change: test.txt was never tracked
    git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
    cat test.txt
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
    cat test.txt
    rm test.txt
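The hashes above are not arbitrary: for a blob, the object ID is the SHA-1 of a short header ("blob", the size in bytes, and a NUL byte) followed by the file contents. The sketch below recomputes the ID of the 'version 1' file by hand. It assumes a repository using Git's default SHA-1 object format and that the sha1sum tool (GNU coreutils) is installed; on other systems a different SHA-1 utility would be needed.

```shell
# Ask Git for the blob ID of the content "version 1\n" (no repository needed).
git_id=$(printf 'version 1\n' | git hash-object --stdin)

# Compute the same ID by hand: SHA-1 of "blob <size>\0<content>".
# "version 1" plus its trailing newline is 10 bytes.
manual_id=$(printf 'blob 10\0version 1\n' | sha1sum | cut -d' ' -f1)

echo "git hash-object: $git_id"
echo "manual SHA-1:    $manual_id"
```

Because the ID depends only on the content, two files with identical contents share one stored object, which is what makes the structure sharing above possible.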
Basic Operations #
-
Make a branch and switch to it.

    git branch your-clever-name-here   # Create a branch
    git branch                         # Look at available branches, main still current
    git switch your-clever-name-here   # Switch to the branch
    git status

We are now working on the branch, but since we haven’t changed it, main and your-clever-name-here are pointing to the same commit.
-
Advance the your-clever-name-here branch by two commits:

    git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
    git status         # see what is highlighted
    git switch main    # test.txt is 'untracked', so when we switch it 'follows'
    git switch your-clever-name-here
    git add test.txt   # staging test.txt
    git status         # note the difference
    git commit -m 'meta: added a test version descriptor'
    git status
    git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
    git add test.txt   # staging test.txt, you need to stage again (cf. Three Trees)
    git diff --staged  # look at the staged differences (plain git diff is empty once staged)
    git commit -m 'meta: updated test version descriptor'
    git status
    git log --oneline --color --graph --branches
-
Change your work but try to switch before committing:

    echo 'version 3' > test.txt
    git switch main    # oops
    git add test.txt   # stage the change, otherwise there is nothing to commit
    git commit         # try out your editor config, enter a commit message and close the file
    git switch main    # good now
    git log --oneline --color --graph --branches

Look at your working directory in this branch. Switch back and forth and compare. Go back to main.
-
Create a new branch named YOURNAME-demo off main and switch to it. (Replace YOURNAME with your GitHub username.)

    git branch YOURNAME-demo
    git switch YOURNAME-demo

This will be our reference branch for this exercise, playing the role that main would play in practice. (This avoids conflicts.)
-
Make a new branch and switch to it.
Make some changes to your repository. Add files, edit something, whatever. Stage and commit the files. Do it one more time, so there are two commits on the new branch.
Switch back to YOURNAME-demo.
git log --oneline --color --graph --branches
-
Merge the your-clever-name-here branch back into YOURNAME-demo. This should have no conflicts. git merge will merge the specified branch into the current branch.

    git switch YOURNAME-demo
    git status   # check we are where we think we are!!
    # if we were working with a remote, git pull to ensure we have the latest version
    git merge your-clever-name-here
    git log --oneline --color --graph --branches

You could delete the old branch if you like with git branch -d, though you do not have to. For today, keep it for later.
-
Make a new branch off YOURNAME-demo, switch to it, change an existing file, and commit. Switch back to YOURNAME-demo and change that same file. Commit. Merge the new branch into YOURNAME-demo. Suppose you called the branch conflict.

    git switch YOURNAME-demo
    git status
    git merge conflict   # Surprise... conflict?

Look at the conflicting file. Edit out the markers, stage the edited file, and commit it; the commit completes the merge. You can instead abort a merge with git merge --abort.

    git log --oneline --color --graph --branches
-
In YOURNAME-demo, add some changes to the index, so they are staged to be committed. Then commit the changes.

    git status
    git add file_you_changed.py
    git commit
-
Make more changes and stage them.
-
Look at differences

    git log --oneline --color --graph --branches
    git diff 3597a84 e3f8f5d   # replace with any two hashes you want to compare
    git diff                   # how is this diff different :?)

Commit if you like.
-
Push this branch to the remote repository (on GitHub)

    git push --set-upstream origin YOURNAME-demo

The --set-upstream option is only necessary once, to tell Git that the “upstream” for this branch – the remote location for it – is the corresponding branch on GitHub, which will be created automatically.
-
Switch back to the main branch:

    git switch main
    ls
Now look at the files in your repository. Notice they’ve all changed back to what they looked like before you switched to your branch.
-
Look at the status

    git status
    git log --oneline --color --graph --branches
-
Make a pull request on GitHub
This is similar to the workflow we’ll use for submitting homework. We’ll provide you a script that automates part of it, but it’s good to know basic Git operations so you know what’s going on when you switch branches and submit homework.
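If you would like to rehearse the merge-conflict step without touching the class repository, the following sketch reproduces one in a scratch repository. The file name shared.txt, the branch name conflict, and the identity are hypothetical stand-ins for whatever you used above.

```shell
# Scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "alpha" > shared.txt
git add shared.txt
git commit -q -m "Add shared file"

# Change the file on a branch...
git switch -q -c conflict
echo "beta from branch" > shared.txt
git commit -q -am "Change shared file on conflict branch"

# ...and change the same line differently on main.
git switch -q main
echo "beta from main" > shared.txt
git commit -q -am "Change shared file on main"

# The merge stops and reports a conflict (non-zero exit status).
git merge conflict > /dev/null 2>&1
merge_status=$?

# Git has written conflict markers into the file.
markers=$(grep -c '^<<<<<<<' shared.txt)

# Resolve by writing the content we want, then stage and commit.
echo "beta from both" > shared.txt
git add shared.txt
git commit -q -m "Merge branch 'conflict'"

git --no-pager log --oneline --graph
```

In real work you would edit the file by hand between the markers rather than overwrite it; the overwrite here just keeps the sketch short.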
Work Trees #
Working with branches can sometimes be tricky because changes in progress can keep you from switching branches when you want to. git-worktree lets you work on multiple branches simultaneously by checking each one out in its own linked working tree in another directory. Start in main.
-
Pick two of the branches you created in the demo above. I’ll call them foo and bar. Create work trees:

    mkdir ../worktrees-git-demo
    git worktree add ../worktrees-git-demo/foo foo
    git worktree add ../worktrees-git-demo/bar bar

Branches foo and bar are now checked out in those directories.
-
Move to one and commit some changes

    cd ../worktrees-git-demo/foo   # I like to keep my worktrees elsewhere but YMMV
    # make and commit changes
    git status

If you go back to the main git-demo directory and try to git switch foo, it won’t let you; foo is already spoken for.

    cd ~/s750/git-demo
    git worktree remove ../worktrees-git-demo/foo
    git switch foo
    ls   # Notice the changes (in whatever way)
    git switch main
This is a useful workflow for your homework. For each branch you are working on, create a worktree and work there. E.g.,
cd ~/s750
mkdir exercises # directory for your active exercises
cd assignments-YOURNAME
git worktree add ../exercises/ex-name ex-name
cd ../exercises/ex-name
# work on it
When done, remove the worktree. Create it again if you need to work on the exercise.
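Here is the whole worktree cycle as a self-contained sketch in a scratch repository: add a linked working tree for a branch, commit there, see that the branch cannot be switched to elsewhere while it is checked out, then remove the tree. The directory names, the branch foo, and the identity are hypothetical.

```shell
# Scratch area with a repository inside it (hypothetical names throughout).
demo=$(mktemp -d)
mkdir "$demo/repo"
cd "$demo/repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "start" > README.txt
git add README.txt
git commit -q -m "Initial commit"
git branch foo

# Check out branch foo in a linked working tree next to the repository.
git worktree add ../foo-tree foo > /dev/null 2>&1

# Commit on foo from inside the linked tree.
cd ../foo-tree
echo "work on foo" > foo.txt
git add foo.txt
git commit -q -m "Add foo work"
cd ../repo

# foo is checked out in the linked tree, so switching to it here is refused.
git switch foo > /dev/null 2>&1
switch_status=$?

# Remove the linked tree; the commit stays on the branch.
git worktree remove ../foo-tree
git switch -q foo
ls
```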
Git-related services #
Git repositories can be “pushed” and “pulled” from one computer to another. Various services have sprouted up around this, offering web-based Git hosting services with bug trackers, code review features, and all sorts of goodies. GitHub is the biggest and most popular; Bitbucket is an alternative, and GitLab is an open-source competitor.
We’ll use GitHub in this course. You should already have an account set up.
(You can upgrade your GitHub account with a free educational discount at education.github.com, including unlimited private repositories. It’s not required for this class, but may be useful to you later.)
Many of these online services provide handy features for collaborating on code. GitHub, for example, has “pull requests”: if you’ve pushed a branch to a repository, you can request that its changes be merged into another branch. GitHub provides a convenient view of all the changes, lets people leave comments, and then has a big button to do the merge. We’ll see how to do this shortly.
Collaboration in Git #
Git is built to allow multiple people to collaborate on code, using this system of snapshots and branches.
Multiple people can have copies of a repository, and can all “push” their separate snapshots to a server, on separate branches of development. These branches can then be merged to incorporate the changes made in each.
Sometimes this is easy to do, and Git can do it automatically. Sometimes each branch contains changes to the same parts of files, and Git doesn’t know which changes you want to keep; in this case, it asks you to manually resolve the merge conflict.
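To see the push and fetch cycle without a hosting service, a bare repository on your own disk can stand in for the server. The sketch below is only an illustration under that assumption; the collaborators "Alice" and "Bob", the file report.txt, and the branch bob-feature are all hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Alice starts the project.
git init -q alice
cd alice
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice"
git config user.email "alice@example.com"
echo "shared analysis" > report.txt
git add report.txt
git commit -q -m "Start report"
cd ..

# A bare repository (history only, no working files) plays the server.
git clone -q --bare alice server.git
git -C alice remote add origin "$work/server.git"

# Bob clones it, works on his own branch, and pushes that branch.
git clone -q server.git bob
cd bob
git config user.name "Bob"
git config user.email "bob@example.com"
git switch -q -c bob-feature
echo "Bob's section" >> report.txt
git commit -q -am "Add Bob's section"
git push -q --set-upstream origin bob-feature > /dev/null 2>&1
cd ..

# Alice fetches Bob's branch from the server and merges it.
cd alice
git fetch -q origin
git merge -q origin/bob-feature   # no competing changes, so this fast-forwards
lines=$(wc -l < report.txt | tr -d ' ')
echo "report.txt now has $lines lines"
```

With GitHub in place of server.git, the merge step is what the pull request button does for you.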
An activity #
Let’s go back to that commit graph we were looking at. Suppose we merged the experimental branch into main:

    git switch main
    git merge experimental

The commit af1200ec is a merge commit, the snapshot of the code after experimental has been merged to main. The merge commit has two parents.
Git can show you a graph of commits with the command

    git log --graph --oneline --color

I’d like you to recreate the graph above by making commits and branches. Team up with someone next to you to plan how to do this. Start from the main branch:

    git switch main

To make a graph just starting from main, not including previous commits, try

    git log --graph --oneline --color e625475.. --

which tells Git to graph the commits since e625475, the most recent I made on the main branch. It should be empty at first, since there is nothing new on main.
Good Git habits #
Write good commit messages! #
Commit messages should explain what you changed and why you changed it, so you understand your code later, and collaborators understand the purpose of your changes. An example Git log entry:
    commit c52d398a97ba4a4c933945d3045cd69cbf7f9de0
    Author: Alex Reinhart <areinhar@stat.cmu.edu>
    Date:   Tue May 24 16:31:23 2016 -0400

        Fix error in intensity bounds calculation

        We incorrectly assumed that if the query node entirely preceded the
        data node, the bounds had to be zero. This is not the case: the
        background component is constant in time.

        Instead, have t_bounds return negative bounds when this occurs, which
        cause the foreground component only to be estimated as zero.

        Update kde_test to remove the fudge factor and instead test based on
        the desired eps, not the (max - min) reported by tree_intensity. I
        suspect rounding problems mean the (max - min) is sometimes zero,
        despite the reported value being very slightly different from what
        it should be.
Notice:
- The first line is a short summary (subject line)
- There are blank lines between paragraphs and after the summary line
- The message describes the reasoning for the change
- The message is written in the imperative mood (“Fix error” instead of “Fixed error”)
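You do not need an editor to write a message in this shape. As a small sketch: each -m option to git commit becomes its own paragraph, with a blank line between them, so the first -m is the subject line. The file bounds.R, the placeholder body text, and the identity below are hypothetical.

```shell
# Scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "bounds <- 0" > bounds.R
git add bounds.R

# First -m: subject line. Later -m options: body paragraphs.
git commit -q \
  -m "Fix error in intensity bounds calculation" \
  -m "Explain here why the old behavior was wrong." \
  -m "Explain here what the change does instead."

subject=$(git log -1 --format=%s)                        # %s is the subject line
paragraphs=$(git log -1 --format=%b | grep -c '^Explain') # %b is the body
git --no-pager log -1
```

For anything longer than a couple of sentences, plain git commit (which opens your editor) is usually more comfortable.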
More advanced branching #
If you want to try a new algorithm, reorganize your code, or make changes separately from someone else working on the same code, make a branch.
We used a branch above to make a pull request. But what if the main branch is edited at the same time we’re working on our own separate branch?
Fundamental branch operations: merging and rebasing.
-
Make another change on main and commit it
-
Back to your branch

    git switch your-clever-name-here

-
Rebase on main

    git rebase main
-
Check the log now – we have all of main’s commits.
-
Alternately, we can merge instead of rebasing. Merging takes another branch’s changes, applies them to the current branch, and, when the two histories have diverged, saves the result as a new merge commit.

    git switch main
    git merge your-clever-name-here

In this case, the merge was a “fast-forward”: there were no other changes in main, so no merge commit was necessary.
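The rebase-then-merge sequence can be replayed in a scratch repository to check both claims: after the rebase, the branch's commit sits directly on top of main's latest commit, and the merge that follows is a fast-forward. The branch name feature, the file names, and the identity are hypothetical.

```shell
# Scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base commit"

# Work on a branch...
git switch -q -c feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# ...while main moves ahead too.
git switch -q main
echo "main work" > main.txt
git add main.txt
git commit -q -m "More work on main"
main_tip=$(git rev-parse main)

# Rebase: replay the branch's commit on top of main's latest commit.
git switch -q feature
git rebase -q main
feature_parent=$(git rev-parse feature^)

# main has nothing that feature lacks, so the merge just moves main forward.
git switch -q main
git merge -q --ff-only feature
tip_parents=$(git log -1 --format=%P | wc -w | tr -d ' ')

git --no-pager log --graph --oneline
```

The --ff-only option makes Git refuse the merge rather than create a merge commit, which is a convenient way to verify that a fast-forward is what actually happened.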
Use git blame to find out why changes were made #
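If you ever wonder why a line of code is the way it is, use git blame. For each line of the file, it lists the commit that last changed that line.

    git blame fileHomework.py

Pass the hash of the responsible commit to “git show” to see its full change and commit message.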
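As a sketch of that two-step lookup, the commands below find the commit that last touched one line and then print its subject. The file homework.py, its contents, and the identity are hypothetical; the --porcelain format is used because its first line begins with the full commit hash, which is easy to extract.

```shell
# Scratch repository (hypothetical names throughout).
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'x = 1\ny = 2\n' > homework.py
git add homework.py
git commit -q -m "Add homework file"

printf 'x = 1\ny = 3\n' > homework.py
git commit -q -am "Correct value of y"

# Which commit last changed line 2? (-L 2,2 restricts blame to that line.)
commit=$(git blame --porcelain -L 2,2 homework.py | head -n 1 | cut -d' ' -f1)

# Show that commit's subject; -s suppresses the diff.
subject=$(git show -s --format=%s "$commit")
echo "line 2 was last changed by: $subject"
```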
RStudio #
For those of you using RStudio, there is a handy graphical interface built in, once you create an RStudio project for your work. We recommend learning the command-line basics first, however.
Resources #
Some links to helpful resources:
- Learn Git Branching is an interactive tutorial, with diagrams
- The Software Carpentry Git lesson walks you through basic Git operations on your computer
- ProGit is a full book about Git that you can read online
- The Git Reference Manual lists every single Git command and its options
- GitHub’s Git Cheat Sheets explain some of the common commands
- Dangit, Git! explains how to fix common Git problems you may encounter
- RStudio Git integration documentation
Git has extensive (but not always helpful) manual pages, e.g. man git-merge.
We recommend trying the git-challenge homework assignment and referring to the resources above as you do it, so you understand each step you’re taking.