SQL

— Christopher Genovese and Alex Reinhart

Introducing SQL and Postgres #

Getting Started #

Use instruction page to connect to the SQL server. Invoke psql at the shell prompt to start the PostgreSQL REPL.

Getting Help #

Type \? at the prompt to get a list of meta-commands (these are psql, not SQL commands).

A few of these are quite common:

  • \h provides help on an SQL command or lists available commands
  • \d list or describe tables, views, and sequences
  • \l lists databases
  • \c connect to a different database
  • \i read input from a file (like source)
  • \o send query output to a file or pipoe
  • \! execute a shell command
  • \cd change directory
  • \q quit psql

Commands and Files #

Update the documents repository from github. There are several files in Activities/sql that you will find useful for the next couple classes:

  • events.csv
  • events.sql
  • commands-1.sql through commands-3.sql

The last of these are text files containing SQL commands that you can copy and paste into the prompt to save typing. Of course, typing the commands is fine too and is not a bad way to get a feel for how the commands work.

Entering SQL Statements #

SQL consists of a sequence of statements.

Each statement is built around a specific command, with a variety of modifiers and optional clauses.

SQL statements can span several lines, and all SQL statements end in a semi-colon (;).

Keep in mind: strings are delimited by single quotes ‘like this’, not double quotes “like this”.

SQL comments are lines starting with --.

To get help:

  • You can get brief help on any SQL command with \h <command>.
  • You can get detailed and helpful information on any aspect of postgres through the online documentation.
  • The stat server is running version 10.5, that that will be updated if needed.

A Simple Example #

Try the following (or copy it from the given file).

create table products (
       product_id SERIAL PRIMARY KEY,
       name text,
       price numeric CHECK (price > 0),
       sale_price numeric CHECK (sale_price > 0),
       CHECK (price > sale_price)
);

Then type \d at the prompt. You should see the table.

Next, we will enter some data.

INSERT INTO products (name, price, sale_price) values ('furby', 100, 95);
insert into products (name, price, sale_price)
       values ('frozen lunchbox', 10, 8),
              ('uss enterprise', 12, 11),
              ('spock action figure', 8, 7),
              ('slime', 1, 0.50);

Do the following, one at a time.

select * from products;
select name, price from products;
select name as product, price as howmuch from products;

Discussion…

Making Tables #

Creating Tables #

We use the CREATE TABLE command. In it’s most basic form, it looks like

create table NAME (attribute1 type1, attribute2 type2, ...);

A simple version of the previous products table is:

create table products (
       product_id integer,
       name text,
       price real,
       sale_price real
);

This gets the idea, but a few wrinkles are nice. Here’s the fancy version again:

create table products (
       product_id SERIAL PRIMARY KEY,
       name text,
       price numeric CHECK (price > 0),
       sale_price numeric CHECK (sale_price > 0),
       CHECK (price > sale_price)
);

Discussion, including

  • Column product_id is automatically set when we add a row.
  • We have told postgres that product_id is the primary key.
  • Columns price and sale_price must satisfy some constraints.
  • What happens if we try to add data that violates those constraints? Try this:

    insert into products (name, price, sale_price)
           values ('kirk action figure', 50, 52);
    • There are two kinds of constraints here: constraints on columns and constraints on the table. Which are which?

    Here’s an alternative approach to making the products table:

    create table products (
       product_id SERIAL PRIMARY KEY,
       label text UNIQUE NOT NULL CHECK (char_length(label) > 0),
       price numeric CHECK (price >= 0),
       discount numeric DEFAULT 0.0 CHECK (discount >= 0),
       CHECK (price > discount)
    );

Notice that there are a variety of functions that postgres offers for operating on the different data types. For instance, char_length() returns the length of a string.

Now, which one of these will work?

insert into products (label, price)
       values ('kirk action figure', 50);
insert into products (price, discount)
       values (50, 42);
insert into products (label, price, discount)
       values ('', 50, 42);

Altering Tables #

The ALTER TABLE command allows you to change a variety of table features. This includes adding and removing columns, renaming attributes, changing constraints or attribute types, and setting column defaults. See the full documentation for more.

A few examples using the most recent definition of products:

  • Let’s rename product_id to just id for simplicity.

    alter table products
          rename product_id to id;
    • Let’s add a brand_name column.
    alter table products add brand_name text DEFAULT 'generic' NOT NULL;
  • Let’s drop the discount column

    alter table products drop discount;
    • Let’s set a default value for brand_name.
    alter table products
          alter brand_name SET DEFAULT 'generic';

Deleting Tables #

The command is DROP TABLE.

drop table products;
drop table products cascade;

Try it, then type \d at the prompt.

Working with CRUD #

The four most basic operations on our data are

  • Create
  • Read
  • Update
  • Delete

collectively known as CRUD operations.

In SQL, these correspond to the four core commands INSERT, SELECT, UPDATE, and DELETE.

To start our exploration, let’s create a table.

create table events (
       id SERIAL PRIMARY KEY,
       moment timestamp DEFAULT 'now',
       persona integer NOT NULL,
       element integer NOT NULL,
       score integer NOT NULL DEFAULT 0 CHECK (score >= 0 and score <= 1000),
       hints integer NOT NULL DEFAULT 0 CHECK (hints >= 0),
       latency real,
       answer text,
       feedback text
);

Note: Later on, persona and element will be foreign keys, but for now, they will just be arbitrary integers.

INSERT #

The basic template is

INSERT INTO <tablename> (<column1>, ..., <columnk>)
       VALUES (<value1>, ..., <valuek>)
       RETURNING <expression|*>;

where the RETURNING clause is optional. If the column names are excluded, then values for all columns must be provided. You can use DEFAULT in place of a value for a column with a default setting.

You can also insert multiple rows at once

INSERT INTO <tablename> (<column1>, ..., <columnk>)
       VALUES (<value11>, ..., <value1k>),
              (<value21>, ..., <value2k>),
              ...
              (<valuem1>, ..., <valuemk>);

Examples #

First, copy data from events.csv into the events table:

\COPY events FROM '/home/areinhar/events.csv'
       WITH DELIMITER ',';
SELECT setval('events_id_seq', 1001, false);

You should replace the first string by the correct path to the events.csv file on your computer.

insert into events (persona, element, score, answer, feedback)
       values (1211, 29353, 824, 'C', 'How do the mean and median differ?');
insert into events (persona, element, score, answer, feedback)
       values (1207, 29426, 1000, 'A', 'You got it!')
       RETURNING id;
insert into events (persona, element, score, answer, feedback)
       values (1117, 29433,  842, 'C', 'Try simplifying earlier.'),
              (1199, 29435,    0, 'B', 'Your answer was blank'),
              (1207, 29413, 1000, 'C', 'You got it!'),
              (1207, 29359,  200, 'A', 'A square cannot be negative')
       RETURNING *;

Try inserting a few valid rows giving latencies but not id or feedback. Find the value of the id’s so inserted.

SELECT #

The SELECT command is how we query the database. It is versatile and powerful command.

The simplest query is to look at all rows and columns of a table:

select * from events;

The * is a shorthand for “all columns.”

Selects can include expressions, not just column names, as the quantities selected. And we can use as clauses to name (or rename) the results.

select 1 as one;
select ceiling(10*random()) as r;
select 1 as ones from generate_series(1,10);
select min(r), avg(r) as mean, max(r) from
       (select random() as r from generate_series(1,10000)) as _;
select timestamp '2015-01-22 08:00:00' + random() * interval '64 days'
       as w from generate_series(1,10);

Notice how we used a select to create a virtual table and then selected from it.

Most importantly, we can qualify our queries with conditions that refine the selection. We do this with the WHERE clause, which accepts a logical condition on any expression and selects only those rows that satisfy the condition. The conditional expression can include column names (even temporary ones) as variables.

select * from events where id > 20 and id < 40;

As we will see more next time, we can also order the output using the ORDER BY clause and group rows for aggregation using the GROUP BY clause values over groups.

select score, element from events
    where persona = 1202 order by element, score;
select count(answer) from events where answer = 'A';

select element, count(answer) as numAs
       from events where answer = 'A'
       group by element
       order by numAs;
select persona, avg(score) as mean_score
       from events
       group by persona
       order by mean_score;

Examples #

Try to craft selects in events for the following:

  1. List all event ids for events taking place after 20 March 2015 at 8am. (Hint: > and < should work as you hope.)
  2. List all ids, persona, score where a score > 900 occurred.
  3. List all persona (sorted numerically) who score > 900. Can you eliminate duplicates here? (Hint: Consider SELECT DISTINCT)
  4. Can you guess how to list all persona whose average score > 600. You will need to do a GROUP BY as above. (Hint: use HAVING instead of WHERE for the aggregate condition.)
  5. Produce a table showing how many times each instructional element was practiced.

UPDATE #

The UPDATE command allows us to modify existing entries in any way we like. The basic syntax looks like this

UPDATE table
    SET col1 = expression1,
        col2 = expression2,
        ...
    WHERE condition;

The UPDATE command can update one or more columns and can have a RETURNING clause like INSERT.

Examples #

create table gems (label text DEFAULT '',
                   facets integer DEFAULT 0,
                   price money);

insert into gems (select '', ceiling(20*random()+1), money '1.00' from generate_series(1,20) as k);

update gems set label = ('{thin,quality,wow}'::text[])[ceil(random()*3)];

update gems set label = 'thin'
       where facets < 10;
update gems set label = 'quality',
                price = 25.00 + cast(10*random() as numeric)
       where facets >= 10 and facets < 20;
update gems set label = 'wow', price = money '100.00'
       where facets >= 20;

select * from gems;

Try it (in the events table):

  1. Set answer for id > 800 to a random letter A through D.
  2. Update the scores to subtract 50 points for every hint taken when id > 800. Check before and after to make sure it worked.

DELETE #

The DELETE command allows you to remove rows from a table that satisfy a condition. The basic syntax is:

DELETE FROM table WHERE condition;

Example:

delete from gems where facets < 5;
delete from events where id > 1000 and answer = 'B';

Try to delete a few selected rows in one of your existing tables. (Remember: you can do \d at the prompt to check the table list.)

Activity #

Here, we will do some brief practice with CRUD operations by generating a table of random data and playing with it.

  1. Create a table rdata with five columns: one integer column id, two text columns a and b, one date moment, and one numeric column x.

  2. Use a SELECT command with the generate_series function to display the sequence from 1 to 100.

  3. Use a SELECT command with the random() function converted to text (via random()::text) and the md5 function to create a random text string. The md5 function takes in a string and turns it into a long string of random-looking text:

    areinhar=> SELECT md5('foobar');
                   md5
    ----------------------------------
     3858f62230ac3c915f300c664312c63f
    (1 row)
    1. Use a SELECT command to choose a random element from a fixed array of strings. A fixed text array can be obtained with ('{X,Y,Z}'::text[]) and then indexed using the ceil (ceiling) and random functions to make a selection.

    (FYI, ('{X,Y,Z}'::text[])[1] would give ‘X’. SQL is 1-indexed.)

    1. SELECT a random date in 2017. You can do this by adding an integer to date '2017-01-01'. For instance, try
    select date '2017-01-01' + 7 as random_date;

    For a non-integer type, append ::integer to convert it to an integer.

  4. Use INSERT to populate the rdata table with 101 rows, where the id goes from 1 to 100, a is random text, b is random choice from a set of strings (at least three in size), moment contains random days in 2017, and x contains random real numbers in some range.

  5. Use SELECT to display rows of the table for which b is equal to a particular choice.

  6. Use SELECT with either the ~* or ilike operators to display rows for which a matches a specific pattern, e.g.,

    select * from rdata where a ~* '[0-9][0-9][a-c]a';
    1. Use SELECT with the overlaps operator on dates to find all rows with moment in the month of November.

    2. Use UPDATE to set the value of b to a fixed choice for all rows that are divisible by 3 and 5.

    3. Use DELETE to remove all rows for which id is even and greater than 2. (Hint: % is the mod operator.)

    4. Use a few more DELETE’s (four more should do it) to remove all rows where id is not prime.

    Joins and Foreign Keys #

    As we will see shortly, principles of good database design tell us that tables represent distinct entities with a single authoritative copy of relevant data. This is the DRY principle in action, in this case eliminating data redundancy.

    An example of this in the events table are the persona and element columns, which point to information about students and components of the learning environment. We do not repeat the student’s information each time we refer to that student. Instead, we use a link to the student that points into a separate Personae table.

    But if our databases are to stay DRY in this way, we need two things:

    1. A way to define links between tables (and thus define relationships between the corresponding entities).

    2. An efficient way to combine information across these links.

    The former is supplied by foreign keys and the latter by the operations known as joins. We will tackle both in turn.

    Foreign Keys #

    A foreign key is a field (or collection of fields) in one table that uniquely specifies a row in another table. We specify foreign keys in Postgresql using the REFERENCES keyword when we define a column or table. A foreign key that references another table must be the value of a unique key in that table, though it is most common to reference a primary key.

    Example:

    create table countries (
       country_code char(2) PRIMARY KEY,
       country_name text UNIQUE
    );
    insert into countries
    values ('us', 'United States'), ('mx', 'Mexico'), ('au', 'Australia'),
         ('gb', 'Great Britain'), ('de', 'Germany'), ('ol', 'OompaLoompaland');
    select * from countries;
    delete from countries where country_code = 'ol';
    
    create table cities (
       name text NOT NULL,
       postal_code varchar(9) CHECK (postal_code <> ''),
       country_code char(2) REFERENCES countries,
       PRIMARY KEY (country_code, postal_code)
    );

Foreign keys can also be added (and altered) as table constraints that look like FOREIGN KEY (<key>) references <table>.

Now try this

insert into cities values ('Toronto', 'M4C185', 'ca'), ('Portland', '87200', 'us');

Notice that the insertion did not work – and the entire transaction was rolled back – because the implicit foreign key constraint was violated. There was no row with country code ‘ca’.

So let’s fix it. Try it!

insert into countries values ('ca', 'Canada');
insert into cities values ('Toronto', 'M4C185', 'ca'), ('Portland', '87200', 'us');
update cities set postal_code = '97205' where name = 'Portland';

Joins #

Suppose we want to display features of an event with the name and course of the student who generated it. If we’ve kept to DRY design and used a foreign key for the persona column, this seems inconvenient.

That is the purpose of a join. For instance, we can write:

select personae.lastname, personae.firstname, events.score, events.moment
       from events
       join personae on persona = personae.id
       where moment > timestamp '2015-03-26 08:00:00'
       order by moment;

Joins incorporate additional tables into a select. This is done by appending to the from clause:

from <table> join <table> on <condition> ...

where the on condition specifies which rows of the different tables are included. And within the select, we can disambiguate columns by referring them to by <table>.<column>. Look at the example above with this in mind.

We will start by seeing what joins mean in a simple case.

create table A (id SERIAL PRIMARY KEY, name text);
insert into A (name)
       values ('Pirate'),
              ('Monkey'),
              ('Ninja'),
              ('Flying Spaghetti Monster');

create table B (id SERIAL PRIMARY KEY, name text);
insert into B (name)
       values ('Rutabaga'),
              ('Pirate'),
              ('Darth Vader'),
              ('Ninja');
select * from A;
select * from B;

Let’s look at several kinds of joins. (There are others, but this will get across the most common types.)

Inner Join #

An inner join produces the rows for which attributes in both tables match. (If you just say JOIN in SQL, you get an inner join; the word INNER is optional.)

select * from A INNER JOIN B on A.name = B.name;

Full Outer Join #

A full outer join produces the full set of rows in all tables, matching where possible but null otherwise.

select * from A FULL OUTER JOIN B on A.name = B.name;

Left Outer Join #

A left outer join produces all the rows from A, the table on the “left” side of the join operator, along with matching rows from B if available, or null otherwise. (LEFT JOIN is a shorthand for LEFT OUTER JOIN in postgresql.)

select * from A LEFT OUTER JOIN B on A.name = B.name;

Set Difference #

Exercise: Give a selection that gives all the rows of A that are not in B.

select * from A LEFT OUTER JOIN B on A.name = B.name where B.id IS null;

Symmetric Difference #

Exercise: Select the rows of A not in B and the rows of B not in A.

select * from A FULL OUTER JOIN B on A.name = B.name
    where B.id IS null OR A.id IS null;

A simple join exercise #

Using the cities and countries tables we created earlier, do the following:

  1. List city name, postal code, and country name for each city.
  2. List city name, country, and address as a valid string.

A bigger join exercise #

Now let’s create a venues table as well:

create table venues (
       id SERIAL PRIMARY KEY,
       name varchar(255),
       street_address text,
       type char(7) CHECK (type in ('public', 'private')) DEFAULT 'public',
       postal_code varchar(9),
       country_code char(2),
       FOREIGN KEY (country_code, postal_code)
         REFERENCES cities (country_code, postal_code) MATCH FULL
);
insert into venues (name, postal_code, country_code)
  values ('Crystal Ballroom', '97205', 'us'),
         ('Voodoo Donuts', '97205', 'us'),
         ('CN Tower', 'M4C185', 'ca');
update venues set type = 'private' where name = 'CN Tower';
select * from venues;

Now create a social_events with a few social events:

create table social_events (
       id SERIAL PRIMARY KEY,
       title text,
       starts timestamp DEFAULT timestamp 'now' + interval '1 month',
       ends timestamp DEFAULT timestamp 'now' + interval '1 month' + interval '3 hours',
       venue_id integer REFERENCES venues (id)
);
insert into social_events (title, venue_id) values ('LARP Club', 3);
insert into social_events (title, starts, ends)
  values ('Fight Club', timestamp 'now' + interval '12 hours', timestamp 'now' + interval '16 hours');
insert into social_events (title, venue_id)
  values ('Arbor Day Party', 1), ('Doughnut Dash', 2);
select * from social_events;

Exercise: List a) all social events with a venue with the venue names, and b) all social events with venue names even if missing.

When we know we will search on certain fields regularly, it can be helpful to create an index, which speeds up those particular searches.

create index social_events_title  on social_events using hash(title);
create index social_events_starts on social_events using btree(starts);

select * from social_events where title = 'Fight Club';
select * from social_events where starts >= '2015-11-28';

Database Schema Design Principles #

The key design principle for database schema is to keep the design DRY – that is, eliminate data redundancy. The process of making a design DRY is called normalization, and a DRY database is said to be in “normal form.”

The basic modeling process:

  1. Identify and model the entities in your problem
  2. Model the relationships between entities
  3. Include relevant attributes
  4. Normalize by the steps below

Example #

Consider a database to manage songs:

Album         Artist              Label     Songs
------------- ------------------- --------- ----------------------------------
Talking Book  Stevie Wonder       Motown    You are the sunshine of my life,
                                            Maybe your baby, Superstition, ...
Miles Smiles  Miles Davis Quintet Columbia  Orbits, Circle, ...
Speak No Evil Wayne Shorter       Blue Note Witch Hunt, Fee-Fi-Fo-Fum, ...
Headhunters   Herbie Hancock      Columbia  Chameleon, Watermelon Man, ...
Maiden Voyage Herbie Hancock      Blue Note Maiden Voyage
American Fool John Couger         Riva      Hurts so good, Jack & Diane, ...
...

This seems fine at first, but why might this format be problematic or inconvenient?

  • Difficult to get songs from a long list in one column
  • Same artist has multiple albums
  • “Best Of” albums

  • A few thoughts:

    • What happens if an artist changes names partway through his or her career (e.g., John Cougar)?
    • Suppose we want mis-spelled ``Herbie Hancock” and wanted to update it. We would have to change every row corresponding to a Herbie Hancock album.
    • Suppose we want to search for albums with a particular song; we have to search specially within the list for each album.

The schema here can be represented as

Album (artist, name, record_label, song_list)

where Album is the entity and the labels in parens are its attributes.

To normalize this design, we will add new entities and define their attributes so each piece of data has a single authoritative copy.

Step 1. Give Each Entity a Unique Identifier #

This will be its primary key, and we will call it id here.

Key features of a primary key are that it is unique, non-null, and it never changes for the lifetime of the entity.

Step 2. Give Each Attribute a Single (Atomic) Value #

What does each attribute describe? What attributes are repeated in Albums, either implicitly or explicitly?

Consider the relationship between albums and songs. An album can have one or more songs; in other words, the attribute song_list is non-atomic (it is composed of other types, in this case a list of text strings). The attribute describes a collection of another entity – Song.

So, we now have two entities, Album and Song. How do we express these entities in our design? It depends on our model. Let’s look at two ways this could play out.

  1. Assume (at least hypothetically) that each song can only appear on one album. Then Album and Song would have a one-to-many relationship.

    • Album(id, title, label, artist)
    • Song(id, name, duration, album_id)

    Question: What do our CREATE TABLE commands look like under this model?

  2. Alternatively, suppose our model recognizes that while an album can have one or more songs, a song can also appear on one or more albums (e.g., a greatest hits album). Then, these two entities have a many-to-many relationship.

    This gives us two entities that look like:

    • Album(id, title, label, artist)
    • Song(id, name, duration)

    This is fine, but it doesn’t seem to capture that many-to-many relationship. How should we capture that?

    • An answer: This model actually describes a new entity – Track. The schema looks like:
      • Album(id, title, label, artist)
      • Song(id, name, duration)
      • Track(id, song_id, album_id, index)

Step 3. Make All Non-Key Attributes Dependent Only on the Primary Key #

This step is satisfied if each non-key column in the table serves to describe what the primary key identifies.

Any attributes that do not satisfy this condition should be moved to another table.

In our schema of the last step (and in the example table), both the artist and label field contain data that describes something else. We should move these to new tables, which leads to two new entities:

  • Artist(id, name)
  • RecordLabel (id, name, street_address, city, state_name, state_abbrev, zip)

Each of these may have additional attributes. For instance, producer in the latter case, and in the former, we may have additional entities describing members in the band.

Step 4. Make All Non-Key Attributes Independent of Other Non-Key Attributes #

Consider RecordLabel. The state_name, state_abbrev, and zip code are all non-key fields that depend on each other. (If you know the zip code, you know the state name and thus the abbreviation.)

This suggests to another entity State, with name and abbreviation as attributes. And so on.

Exercise #

Convert this normalized schema into a series of CREATE TABLE commands.