Two common paradigms for structuring programs differ (roughly speaking) in whether they focus on verbs or on nouns. These are functional programming and object-oriented programming respectively. Today we’ll focus on nouns.
Programming with nouns means organizing our code around pieces of data with specific associated behaviors. An object is exactly that: a collection of data with defined methods which operate on that data. By choosing our objects carefully and defining interfaces for their behaviors, they can make our code more generalizable while limiting the spread of complexity.
Good object-oriented code improves development like interchangeable parts improve manufacturing. A key concept here is encapsulation.
A forest of trees #
Let’s start with an example. I’d like to create a binary tree, as we discussed earlier. (The example would work equally well with other data structures we will discuss.)
One way would be to define the tree recursively and pass around the root node. Different functions operating on that tree would manipulate it manually, by recursing through the tree and performing their operations.
But this has the disadvantage that the operations are tied to the internal structure. If I want to switch to a different kind of tree – say, a red-black tree instead of an ordinary binary tree – all my code needs to be updated. If I change the representation of my data, using arrays to store node values and connections, all my code needs to be updated. The complexity of my implementation has leaked into the rest of my program, and everything is specialized to this specific implementation.
Classes, attributes, and methods #
I can create a class specifying a binary tree:
class BinaryTree:
def __init__(self, nodes):
"""Construct a binary tree."""
self.value = nodes.pop()
for node in nodes:
self.insert(node)
def insert(self, value):
"""Insert a node into the tree."""
if value < self.value:
if hasattr(self, "left"):
self.left.insert(value)
else:
self.left = BinaryTree([value])
else:
...
def delete(self, value):
...
def search(self, value):
...
def inorder(self):
"""Iterate over the tree in sorted order."""
...
The class specifies a constructor (__init__
) which creates a binary tree
and the methods that operate upon it. Note that each method takes the
parameter self
. When I call delete
on a binary tree object, the delete
method gets that object as self
so it may operate upon it.
Then I can create and use binary trees:
# calls the __init__ method to construct a tree
b = BinaryTree([2, 4, 7, 1, 16, 3])
b.search(4) # True
b.insert(5)
b.search(5) # True
b2 = BinaryTree([-3, 2, 17])
b2.search(5) # False
# same thing:
BinaryTree.search(b2, 5)
Here b
and b2
are instances of the BinaryTree
class. Each has
separate data and does not affect the other.
The attributes of objects are also available:
b.left # the left child
These can be manipulated by other code, though it’s often best to make changes through methods instead of direct access.
Interchangeable trees #
I want to change my binary tree implementation. I’d like to use a red-black tree, which automatically rebalances the tree so searching always takes O(log n) time. This requires changing the insertion and deletion operations, but a red-black tree is still a binary tree, so searching behaves the same way.
Classes can inherit methods and data from other classes:
class RedBlackTree(BinaryTree):
# No need to define __init__ or search -- they're inherited from BinaryTree
def insert(self, value): # these override the methods from BinaryTree
...
def delete(self, value):
...
Now RedBlackTree
is a subclass of BinaryTree
. If I call insert
on a
RedBlackTree
instance, I get the method defined specifically for it; if
there is no method defined for it, I get the parent class’s method.
Some languages allow multiple inheritance: a class can inherit from several classes simultaneously. This should be used with caution. It is often difficult to figure out which of several inherited methods are being used, if the parent classes happen to have methods of the same name. Conflicts can cause great confusion.
There is a relationship between superclasses and subclasses:
b = RedBlackTree([1, 2, 4])
isinstance(b, BinaryTree) # True
isinstance(b, RedBlackTree) # True
issubclass(RedBlackTree, BinaryTree) # True
Because a RedBlackTree
is a BinaryTree
, any code that expects a
BinaryTree
can work just as well with a RedBlackTree
, or an AVLTree
, or
whatever other tree types I might define. I can swap them out as
necessary. The implementation details of the tree are hidden within the
class, and code outside it does not need to know.
Encapsulation #
Objects provide encapsulation: the data and methods for a single object are wrapped up in one place. All the complexity of the implementation of that object is contained, and other code need only interact through it with its interface.
This can limit bugs. If other code tries to directly manipulate the tree
instead of calling insert
, it may accidentally break the rules of the
red-black tree or leave the object in an inconsistent state. By only
operating through the public interface (the methods), we ensure that the
object is always valid and its invariants are maintained.
(Some languages, like Java and C++, allow more formal enforcement of this by marking some data and methods as private, only accessible to methods of that class. Outside code cannot see or modify private class data.)
More importantly, this lets me change implementation details of BinaryTree
without changing any of the code that uses trees. If I want to change how it
stores trees, its traversal strategies, and so on, I don’t have edit any
other code.
SOLID design #
- Single responsibility
- Each class should have a single responsibility
- Open/closed
- Open for extension, closed for modification. Write classes that can easily be extended without being modified.
- Liskov substitution
- You should be able to substitute a subclass for a class without breaking the program
- Interface segregation
- Specific interfaces for specific tasks are usually better than one giant interface to do everything for all users
- Dependency inversion
- Depend on the abstractions, not the concrete details. Some languages make this explicit with abstract classes and interfaces
Examples of object usage #
Everything #
In some languages, everything is an object. In Python, for example, lists,
dictionaries, strings and so on all have methods, like foo.sort()
. A great
deal of syntax is just sugar for method calls:
class defaultdict(dict):
def __init__(self, default):
self.default = default
def __getitem__(self, key):
if not key in self:
self[key] = self.default
return super(defaultdict, self).__getitem__(key)
foo = defaultdict(14)
foo[142] # 14
foo[142] = 7
...
Square brackets turn into a __getitem__
call, which can implement arbitrary
behavior. (Don’t get too creative – it should match the usual interface of
indexing.) Even accesses to instance data (e.g. foo.bar
) turn into calls to
__getattr__
if the attribute isn’t found. See Python’s data model documentation.
(A more robust defaultdict is provided in the collections
module.)
In R, many things are S3 objects, and you use them without even realizing it. We’ll return to S3 objects soon.
Iterators #
Iterators are a generic interface to iterating over sequences. In Python, iterators are implemented using objects with just two methods:
__iter__
- Returns the iterator object
__next__
- Returns the next item from the sequence. Raises the
StopIteration
exception if there are no more elements.
__next__
can do arbitrarily complicated calculations, so we could define an
iterator which produces elements from any sequence we’d like. Most commonly,
iterators iterate over collections.
For example, our binary tree could produce preorder, inorder, and postorder iterators. A dictionary could provide iterators over its keys and values. An infinite sequence could be represented as an iterator – the iterator returns one element at a time, so it need not fit in memory. A file object can produce an iterator over lines or characters in the file. A database query can return an iterator which produces the query results.
Here’s a very simple iterator object:
class Naturals:
def __init__(self):
self.count = 0
def __iter__(self):
return self
def __next__(self):
self.count += 1
return self.count
for
loops automatically support iterators, so we could write:
for element in Naturals():
do_stuff(element)
foo = Naturals()
next(foo) # 1
next(foo) # 2
foo.__next__() # same thing
You could define iterators for many objects: maybe your BinaryTree
could have
a method that returns an iterator that iterates through the nodes in a
specific order, for example.
In R up until a couple years ago, when we write for (ii in 1:1000)
, the :
operator literally allocated enough memory for 1000 numbers and wrote the
numbers 1 to 1000 in the space, before the loop even starts; an iterator,
like Python’s range(1000)
, just stores the current number and counts up on
each iteration (each call to __next__
).
Exceptions and conditions #
In Python, exceptions are objects. In R, conditions are S3 objects. You can define new types of exceptions and conditions by defining new classes:
class InputError(Exception):
def __init__(self, expr, msg):
self.expr = expr
self.msg = msg
raise InputError(foo, bar)
S3 objects in R are just lists with a class attribute, so we can create a constructor:
input_error <- function(text) {
msg <- paste0("Input error: ", text)
structure(
list(message = msg, text = text, call = sys.call(-1)),
class = c("input_error", "error", "condition")
)
}
The structure
function takes its argument and adds the supplied attributes to
it. (A better way to make custom condition in R is to use rlang’s abort()
function, which does the work for you without fiddling with lists and
classes.)
S3 classes in R #
R uses a rather different system from Python or Java for its object-oriented programming. You’ve probably interacted with this system without even knowing it.
R actually has foure (or more!) different systems for object-oriented programming: S3, S4, RC, and R6 classes are the biggest. S3 classes are what you’ll encounter most often, though S4 has many nice features as well.
S3 classes don’t have formal definitions, like in Python, where we state what
methods and data are supported by every instance of the class. To make a new
instance of a class, we take some object (a list, say, containing the data)
and set its class
attribute:
node <- structure(list(left = stuff, right = other_stuff, value = 7),
class = "tree")
## or
node <- list(left = stuff, right = other_stuff, value = 7)
class(node) <- "tree"
The class
function returns the class of any variable:
> class(iris)
[1] "data.frame"
In modern R packages, it is a good idea to register the S3 classes
in your package using S3method(fun, Class, method)
, where fun
is a
generic function, Class
is the class name, and method
is typically
fun.Class
. It’s a good idea to also define an S4 method (see below)
for this function
setMethod(fun, Class, method)
In addition, the NAMESPACE
field in the package DESCRIPTION
should contain lines like
S3method(fun, class)
exportMethods(fun)
The latter is for the S4 version of the method and is not needed
without the setMethod
call.
So when building an S3 class in R, we take a few steps:
- What data does this object need to contain? How should it be stored? Anything that creates an object should make data in this form (a list, a data frame, a matrix, whatever).
- What operations should I be able to perform on this data? Plan out the methods.
- Write a function that creates objects of this class: a function that takes
the right data and builds the list, data frame, or whatever that you want,
then sets its
class
. - Write methods (generic functions).
Generic functions #
Notice we didn’t specify the methods of our tree
class above. Instead, R uses
generic functions: functions which behave differently for different classes
of arguments.
Many built-in functions are generic. For example, if I ask R for the print
function:
> print
function (x, ...)
UseMethod("print")
<bytecode: 0x7fc3a14eb428>
<environment: namespace:base>
UseMethod
looks up which print
method to call, based on what class x
is. What
are the options?
> methods(print)
[1] print.acf*
[2] print.anova*
[3] print.aov*
[4] print.aovlist*
[5] print.ar*
[6] print.Arima*
[7] print.arima0*
[8] print.AsIs
[9] print.aspell*
[10] print.aspell_inspect_context*
[11] print.bibentry*
...
[189] print.xtabs*
To create a method for your new class, just use the right name:
print.tree <- function(tree) {
...
}
print(t)
If you’re creating a brand-new function and want it to be generic, just use
UseMethod
:
## we add the ... argument so methods can take more arguments
## if desired
inorder <- function(x, ...) {
UseMethod("inorder")
}
inorder.binarytree <- function(x) {
## do some stuff
}
inorder.redblacktree <- function(x) {
## do different stuff
}
## Called for classes without an inorder method
## otherwise defined
inorder.default <- function(x) {
## do stuff
}
## Now we can just do
some_tree <- RedBlackTree(a_bunch_of_data)
inorder(some_tree)
Example: Model fit objects #
In R, the results of most model fits are objects containing slots (list
entries) for the data, parameters, diagnostics, and so on. Methods (like
residuals
, confint
, and plot
) are defined to operate on these objects.
For example, you can think of lm
as the constructor of lm
objects:
fit <- lm(dist ~ speed, data = cars)
fit$residuals # access an attribute
predict(fit) # call a method
Using a dollar sign to access attributes means you’re reaching into the
internals of the object. Using methods like predict()
and residuals()
is
using methods.
If you fit a new kind of model, you can easily implement the same methods so your results can be manipulated the same way. When you write code that uses model fit objects to do stuff, you should generally prefer to use the methods instead of accessing the attributes; some models store their internal data in different ways, but provide methods that process it appropriately to be consistent with all the other model objects.
Example: The hyperreal numbers #
S3 methods are single-dispatch: based on the class of the first argument to the method, R figures out which method you want to call. But this isn’t the only way to do it.
Suppose I would like to model the hyperreal numbers. I won’t go into great detail, but the hyperreals offer a rigorous definition of infinitesimals, the “dx"s you often handwaved away in introductory calculus. A hyperreal number has a real part (or standard part) and an infinitesimal part.
So we may define an S4 class in R:
setClass("hyperreal", slots = c(x = "numeric", dx = "numeric"))
This states that the class “hyperreal” has two slots: x
and dx
. Slots
contain data; each instance of a hyperreal can have different values in those
slots.
I can define a function to create a hyperreal from its real (“standard”) and infinitesimal parts:
hyper <- function(x, dx) {
new("hyperreal", x = x, dx = dx)
}
The new
function constructs a new instance of an S4 object. Slots can be
accessed with the @
operator: foo@dx
.
This is a bit like a list and the $
operator, except we declare which slots an
object has in advance.
Methods and multiple dispatch #
S4 uses multiple dispatch. You can define generic functions: functions which behave differently depending on the types of their arguments. Instead of depending on the type of one object, like in our Python or S3 code, they can depend on the types of as many arguments as you’d like. Julia uses a similar system.
For example, to define hyperreal arithmetic, we might do
setMethod("+", signature(e1 = "hyperreal", e2 = "hyperreal"),
function(e1, e2) {
hyper(e1@x + e2@x, e1@dx + e2@dx)
})
setMethod("+", signature(e1 = "hyperreal", e2 = "numeric"),
function(e1, e2) {
hyper(e1@x + e2, e1@dx)
})
...
By specifying a signature, I’m saying that if +
is called with two
hyperreals, I add them one way, but if it’s called with a hyperreal and an
ordinary number, I add them a different way. I can create as many different
methods as I’d like. I can also handle other mathematical functions on
hyperreals:
setMethod("sin", signature(x="hyperreal"),
function(x) {
hyper(sin(x@x), cos(x@x) * x@dx)
})
setMethod("cos", signature(x="hyperreal"),
function(x) {
hyper(cos(x@x), - sin(x@x) * x@dx)
})
...
Polymorphism #
Once I have defined these methods, any functions which work on numerics also work on hyperreals:
foo <- function(x) {
sin(x)^2 + 3*x^2 + log(x) - 4
}
foo
is polymorphic or generic: it operates on any type which implements
the required operations. Then I have
> foo(4)
[1] 45.95904
> foo(hyper(4, 1))
An object of class "hyperreal"
Slot "x":
[1] 45.95904
Slot "dx":
[1] 25.23936
It just so happens that 25.23936 is the exact numerical derivative of foo
when evaluated at x=4. By defining a new object and methods upon it, I can
get exact numerical derivatives of any function which uses ordinary
arithmetic and mathematical functions. (Notice that this is not the secant
method or any other approximation method.)
S4 classes in R #
S4 is a more formal generic-function framework for objects and classes that provides helper functions for defining generic functions and their methods. S4 also allows multiple dispatch; that is, the method called can be determined by any or all of the arguments to the method, not just the first.
Example: Monoids #
Simple version
setClass("Monoid")
setGeneric("munit", function(monoid) { standardGeneric("munit") })
setGeneric("mdot", function(monoid, x, y) { standardGeneric("mdot") })
Plays nicely with S3:
munit <- function(monoid, ...) # S3 generic for S3 dispatch
UseMethod("munit")
setGeneric("munit") # S4 generic for S4 dispatch, default is S3 generic
munit.Monoid <- function(monoid, ...) {} # S3 method for S3 class
setMethod("munit", "Monoid", munit.Monoid) # S4 method for S4 class
Inheritance:
SumM <- setClass("SumM", contains="Monoid")
Sum <- SumM()
setMethod("munit", signature("SumM"), function(monoid) 0)
setMethod("mdot", signature("SumM", "numeric", "numeric"), function(monoid, x, y) x + y)
The contains
field here can be “ANY” to denote a generic superclass;
it can also be a vector of classes, though some care is needed
with multiple superclasses.
In this simple class, we have no data. In general, we use the slots
argument
to setClass
, which must be named, giving a named vector with the fields
and their classes. (If you just give a vector of names, the classes
are taken as "ANY"
by default.)
setClass("point2d", slots=c(x="numeric", y="numeric"))
The methods here have a trivial signature, given as a single
string. The signature
argument to setMethod
can indicate the
classes of the method arguments that are used to dispatch. The
single string case is just single dispatch. Usually, this is
either a single string or a vector of strings covering the first
n arguments to the function. But see method.skeleton
for
generating nontrivial signatures
R OOP summary #
R has three different systems for object-oriented programming:
- S3
- The oldest and simplest system, built on lists (usually). An object is just a variable that’s been labeled as having a certain class. Generic functions can be written to operate on different classes. Commonly used in base R.
- S4
- A more sophisticated system with inheritance, multiple dispatch, and more formality. (I used S4 to define the hyperreals.) Less common, but used when needed, such as in the Matrix package.
- RC
- “Reference classes” behave more like objects in Python or Java, with
methods called as
object$method(foo)
. RC objects are passed by reference, meaning that they are not copied on modification like most R types. - R6
- A lighter weight and higher performance version of RC. Has reference semantics, public/private methods, properties (called active bindings), and other features familiar in other languages. (See r6.r-lib.org for details.)
Usually you will use and interact with S3 classes. They’re great when you have some objects – like model fits, estimates, data tables, data structures, or whatever – that have common operations defined for them, and should be easily passed to functions which don’t care about their internal implementation.
See help on Methods_Details
, Methods_for_S3
, and Classes_Details
A brief R6 example #
Queue <- R6Class("Queue",
public = list(
# The constructor accessed with $new()
initialize = function(...) {
for (item in list(...)) {
self$add(item)
}
},
# Public Methods
add = function(x) {
private$queue <- c(private$queue, list(x))
invisible(self) # Makes the method chainable q$add()$add()...
},
remove = function() {
if (private$length() == 0) return(NULL)
# Can use private$queue for explicit access
head <- private$queue[[1]]
private$queue <- private$queue[-1]
head
}
),
private = list(
queue = list(), # The underlying representation
length = function() base::length(private$queue)
)
)
q <- Queue$new(5, 6, "foo")
Classes with reference semantics are great for mutable things:
the fields can be mutated by the methods by using the <<-
operator, without having to copy the data. This is great if you
want to pass something to a function that operates on it without
copying it.
Resources #
Principles #
- Design Patterns: Elements of Reusable Object-Oriented Software, by Gamma, Helm, Johnson, and Vlissides. Based on C++, but widely applicable.
- Refactoring: Improving the Design of Existing Code, by Fowler, on improving and redesigning object-oriented code.
- Growing Object-Oriented Software, Guided by Tests, by Freeman and Pyrce, on test-driven development for object-oriented programs.
- Software Architecture in Practice, by Bass, Clements, and Kazman.
Implementation #
- For R, Hadley Wickham’s Advanced R has a detailed chapter on OOP, but does not describe when or why it is useful.
- John Chambers’ Programming with Data is a detailed introduction to S4, with many examples of its use. (Written originally for S instead of R. This book introduced S4 classes to the world.)
- The Python tutorial has a long section on classes.