Database Fundamentals

— Alex Reinhart and Christopher Genovese

Where do you store your data? #

You have a data set that you want to use in your work. How do you store it so that you can use it most effectively?

A few common scenarios:

  • Keep data in an ASCII file (e.g., CSV file)

    • Pros: Easy to read, easy to edit, easy to archive and transfer
    • Cons: No checking of data, low data density, hard to search/query, must keep in sync with version used in software, requires separate file for documentation that must be kept in sync
    • Questions: 1. If you read the data into an R data frame and change it, which is the authoritative copy of the data?
      1. If you make a mistake editing the file, what happens?
      2. What is needed to put a comma in a CSV field?
  • Keep data in an encoded file

    • Pros: High data density (i.e., compact)
    • Cons: Same problems as for ASCII file but more intense because you need a program to read the data
  • Keep data in an R Data Frame (or analogous data structure)

    • Pros: Easy to use from a program, can attach metadata (e.g., documentation)
    • Cons: Not persistent, requires translation to use from another platform
  • Keep data in memory

    • Pros: Very fast access, flexible structuring of the data
    • Cons: Not persistent, not accessible in parallel

A database is a structure for organizing information that can be efficiently accessed, updated, and managed – at scale. It is designed to address the many problems of these more informal approaches.

There are several different models for constructing databases, many of them new and optimized for large data sets in some way. But relational databases (RDBs) have been the dominant model since the 1970s and still remains popular and important. So we will discuss them first.

Relational databases tend to be design-first systems. First, you specify a schema for your data, and then you enter data that conforms to that schema. A properly designed schema can provide very flexible and powerful queries.

Relational Databases (to be called RDBs or databases for the next few classes) are commonly manipulated and queried using a (mostly) standardized language called SQL, which stands for Structured Query Language.

We will be using a powerful, open-source RDB system called PostgreSQL (aka “postgres”). It is fast, flexible, reliable, and can handle large data sets. It is highly compliant to the ANSI-standard for SQL and has some nice extensions. And it has been used successfully for many production systems over many years.

How do you use your data? #

With rich data sets, we have the potential to answer many different kinds of questions. As we work with the data, we construct new view of the data, new summaries, new statistics and then pose new questions.

This involves cycles of update and query. The basic operations on our data involve adding variables, changing values, creating summaries, selecting data that meet certain criteria, adding or removing cases that meet those criteria, and establishing relationships between different entities in our data.

These fundamental operations are supported in many frameworks and platforms, often with different syntax. For example, in R, the dplyr package gives a set of operations for updating and querying a data frame. But the ideas and the cycle of operation is similar.

We will look at SQL, a key language for expressing these operations, in detail, but keep in mind that the concepts are quite general and broadly applicable.

A few database concepts #

Data types #

The type of a piece of data describes the set of possible values that data can have and the operations that can apply to it.

In an RDB, we specify the type of each data attribute in advance. Postgres, for instance, supports a wide variety of data types, including:

  • Numeric Types, such as integers, fixed-precision floating point numbers, arbitrary precision real numbers, and auto-incrementing integer (serial).
  • Text, including fixed-length and arbitrary character strings.
  • Monetary values
  • Date and Time Stamps
  • Boolean values
  • Geometric types, such as points, lines, shapes
  • Elements in sets
  • JSON structures

See the Postgres documentation on “Data Types” for details and for more examples.

Tables (relations, schemas, entities) #

The basic unit of data storage in an RDB is the table. Tables are also sometimes called relations, schemas, and entities in an RDB context.

A table is defined by its attributes, or columns, each of which has a name and a type.

Each row of a table defines a mapping from attribute names to values. Unlike data frames in R or Python, the rows do not have an order; there is no notion of row indices unless you create a column for that purpose.

id time persona element latency score feedback
17 2015-07-11 09:42:11 3271 97863 329.4 240 Consider…
18 2015-07-11 09:48:37 3271 97864 411.9 1000
19 2015-07-08 11:22:01 499 104749 678.2 750 The mean is…
22 2015-07-30 08:44:22 6742 7623 599.7 800 Try to think of…
24 2015-08-04 23:56:33 1837 424933 421.3 0 Please select…
32 2015-07-11 10:11:07 499 97863 702.1 820 What does the…
99 2015-07-22 16:11:27 24 88213 443.0 1000

What are the attribute names and types for this table?

Relationships between tables #

We can think of tables as representing some entity that we are modeling in our problem. For example, each row of Events represents a single “event” of some sort; each persona in the Personae table represents a single student in a single class (in a specified term).

We link tables to define relationships among entities.

For example, each persona is linked to many events, while each event has a single associated persona and element.

A good design of the database tables can make it more efficient to query these relationships.

Unique, primary, and foreign keys #

It is valuable (even necessary) in practice for each row of a database table to be distinct. To that end, it is common to define a unique key – one or more attributes whose collective values uniquely identify every row.

In the Events table above, id is a unique key consisting of a single attribute.

There may be more than one unique key in a table, some resulting from the joint values of several attributes. One of these keys is usually chosen as the primary key – the key that is used in queries and in other tables to identify particular rows.

In the Events table above, id is also the primary key for the table. In practice, the primary key is often an auto-incrementing, or serial, integer like this.

When a table’s primary key is used as an attribute in another table, it acts as a link to a row in the first table. A key used in this way is called a foreign key. Columns that store foreign keys are used for linking and cross-referencing tables efficiently.

In the Events table above, the persona and element attributes are foreign keys, referencing other tables, which I have not shown you.

ACID guarantees #

An RDB stores our data, and we read and operate on that data through requests sent to the database. These requests are formally called transactions.

Modern RDBs may receive many transactions at once, often operating on the same pieces of data. Particular care is needed to ensure that transactions are performed reliably and consistently.

For example, consider what would happen in the following cases:

  • A transaction for a commercial payment is transfering money from your bank account and to another account. But the process ends after the money is deduced from one account but before adding it to the other.

  • A similar transaction completes just before the power goes out in the server room

  • A similar transaction completes even though you don’t have enough money in your account to make the payment.

These are all boundary cases, but they can happen. And if they do, the viability of the entire system can be compromised.

So, RDBs are designed to make several strong guarantees about their performance, the so-called ACID guarantees:

  • Atomic

    A transaction either succeeds entirely or fails leaving the database unchanged.

  • Consistent

    A transaction must change the database in a way that maintains all defined rules and constraints.

  • Isolated

    Concurrent execution of transactions results in a transformation that would be obtained if the transactions were executed serially.

  • Durable

    Once a transaction is committed, it remains so even in the face of crashes, power loss, and other errors.

This is another advantage of RDBs over ad hoc data storage.