Where do you store your data? #
You have a data set that you want to use in your work. How do you store it so that you can use it most effectively?
A few common scenarios:
- 
Keep data in a text file (e.g., CSV file) - Pros: Easy to read, easy to edit, easy to archive and transfer
- Cons: No checking of data, low data density, hard to search/query, must keep in sync with version used in software, requires separate file for documentation that must be kept in sync
- Questions: 1. If you read the data into an R data frame and change it, which is the authoritative copy of the data? 2. If you make a mistake editing the file, what happens? 3. What is needed to put a comma or newline in a CSV field?
 
- 
Keep data in an encoded file (e.g., .Rdataor.rds, HDF5, Numpy.npy)- Pros: High data density (i.e., compact)
- Cons: Same problems as for text file but more intense because you need a program to read the data
 
- 
Keep data in an R/Python data frame - Pros: Easy to use from a program, can attach metadata (e.g., documentation)
- Cons: Not persistent, requires translation to use from another platform
 
- 
Keep data in memory - Pros: Very fast access, flexible structuring of the data
- Cons: Not persistent, not accessible in parallel
 
A database is a structure for organizing information that can be efficiently accessed, updated, and managed – at scale. It is designed to address the many problems of these more informal approaches.
Relational Databases have been the dominant model for databases since the 1970s and are still very important. There are many new types of databases now available that offer various trade-offs in performance and use.
Relational databases tend to be design-first systems:
- Design a schema for your data
- Check that your schema satisfies the properties of normalization
- Create tables (aka relations) corresponding to the entities in your schema
- Enter data into those tables.
A properly designed schema allows flexible and powerful queries.
Relational Databases (to be called RDBs or databases for the next few classes) are commonly manipulated and queried using a (mostly) standardized language called SQL, which stands for Structured Query Language.
We will be using a powerful, open-source RDB system called PostgreSQL (aka “postgres”). It is fast, flexible, reliable, and can handle large data sets. It is highly compliant to the ANSI-standard for SQL and has some nice extensions. And it has been used successfully for many production systems over many years.
How do you use your data? #
With rich data sets, we have the potential to answer many different kinds of questions. As we work with the data, we construct new view of the data, new summaries, new statistics and then pose new questions.
This involves cycles of update and query. The basic operations on our data involve adding variables, changing values, creating summaries, selecting data that meet certain criteria, adding or removing cases that meet those criteria, and establishing relationships between different entities in our data.
These fundamental operations are supported in many frameworks and
platforms, with varying syntax. Example: dplyr in R allows
updating and querying a data frame.
Even though we will focus on SQL, the concepts apply quite broadly.
A few database concepts #
The client-server model #
Most (not all!) SQL database systems are based on a client-server model.
- Server
- A process running continuously on some server. Accepts requests (potentially from many users at the same time) to update and query data, and stores the canonical version of the database.
- Client
- Any program that can connect to the server, send queries, and receive results from it.
The client and server need not be on the same computer, or even on the same continent. Often, companies with large databases will have a central database server with huge hard drives and lots of memory; business systems (like an inventory tracker or logistics system) will send their queries to this database to keep it up-to-date, while analysts will sit on their laptops and send queries to produce reports for their bosses.
A key feature of the client-server model is that the server can serve multiple clients at the same time. The server goes to extraordinary efforts to ensure that it can process the analyst’s queries while simultaneously recording every new sale on the company’s website – and while keeping all results consistent and complete.
Tables (relations, schemas, entities) #
The basic unit of data storage in an RDB is the table. Tables are also sometimes called relations, schemas, and entities in an RDB context. There may be many tables in a single database.
A table is defined by its attributes, or columns, each of which has a name and a type.
Each row of a table defines a mapping from attribute names to values. Unlike data frames in R or Python, the rows do not have an order; there is no notion of row indices unless you create a column for that purpose.
| id | time | persona | element | latency | score | feedback | 
|---|---|---|---|---|---|---|
| 17 | 2015-07-11 09:42:11 | 3271 | 97863 | 329.4 | 240 | Consider… | 
| 18 | 2015-07-11 09:48:37 | 3271 | 97864 | 411.9 | 1000 | |
| 19 | 2015-07-08 11:22:01 | 499 | 104749 | 678.2 | 750 | The mean is… | 
| 22 | 2015-07-30 08:44:22 | 6742 | 7623 | 599.7 | 800 | Try to think of… | 
| 24 | 2015-08-04 23:56:33 | 1837 | 424933 | 421.3 | 0 | Please select… | 
| 32 | 2015-07-11 10:11:07 | 499 | 97863 | 702.1 | 820 | What does the… | 
| 99 | 2015-07-22 16:11:27 | 24 | 88213 | 443.0 | 1000 | 
What are the attribute names and types for this table?
Data types #
The type of a piece of data describes the set of possible values that data can have and the operations that can apply to it.
In an RDB, we specify the type of each data attribute in advance. Postgres, for instance, supports a wide variety of data types, including:
- Numeric Types, such as integers, fixed-precision floating point
numbers, arbitrary precision real numbers, and auto-incrementing
integer (serial).
- Text, including fixed-length and arbitrary character strings.
- Monetary values
- Date and Time Stamps
- Boolean values
- Geometric types, such as points, lines, shapes
- Elements in sets
- JSON structures
When you can, use the built-in types. For example, store dates and times using the date and time types, not as strings. Databases usually support functions to operate directly on these types and will ensure they are valid (i.e. you can’t enter February 29, 2015 at 15:44pm).
See the Postgres documentation on “Data Types” for details and for more examples.
Relationships between tables #
We can think of tables as representing some entity that
we are modeling in our problem. For example, each row
of Events represents a single “event” of some sort;
each persona in the Personae table represents a single
student in a single class (in a specified term).
We link tables to define relationships among entities.
For example, each persona is linked to many events, while each event
has a single associated persona and element.
A good design of the database tables can make it more efficient to query these relationships.
Unique, primary, and foreign keys #
It is valuable (even necessary) in practice for each row of a database table to be distinct. To that end, it is common to define a unique key – one or more attributes whose collective values uniquely identify every row.
In the Events table above, id is a unique key consisting of a single
attribute.
There may be more than one unique key in a table, some resulting from the joint values of several attributes. One of these keys is usually chosen as the primary key – the key that is used in queries and in other tables to identify particular rows.
In the Events table, id is also the primary key for the table.
In practice, the primary key is often an auto-incrementing, or
serial, integer.
When a table’s primary key is used as an attribute in another table, it acts as a link to a row in the first table.
A key used in this way is called a foreign key. Columns that store foreign keys are used for linking and cross-referencing tables efficiently.
In the Events table above, the persona and element attributes
are foreign keys, referencing other tables, which I have not
shown you.
ACID guarantees #
An RDB stores our data, and we read and operate on that data through requests sent to the database. These requests are formally called transactions.
Modern RDBs may receive many transactions at once, often operating on the same pieces of data. Particular care is needed to ensure that transactions are performed reliably and consistently.
For example, consider what would happen in the following cases:
- 
A transaction for a commercial payment is transfering money from your bank account and to another account. But the process ends after the money is deducted from one account but before adding it to the other. 
- 
A similar transaction completes just before the power goes out in the server room 
- 
A similar transaction completes even though you don’t have enough money in your account to make the payment. 
These are all boundary cases, but they can happen. And if they do, the viability of the entire system can be compromised.
So, RDBs are designed to make several strong guarantees about their performance, the so-called ACID guarantees:
- 
Atomic A transaction either succeeds entirely or fails leaving the database unchanged. 
- 
Consistent A transaction must change the database in a way that maintains all defined rules and constraints. 
- 
Isolated Concurrent execution of transactions results in a transformation that would be obtained if the transactions were executed serially. 
- 
Durable Once a transaction is committed, it remains so even in the face of crashes, power loss, and other errors. 
This is another advantage of RDBs over ad hoc data storage.