Using SQL from Code

— Christopher Genovese and Alex Reinhart

SQL in Code #

It’s nice to be able to type queries into psql and see results, but most often you’d like to do more than that. You’re not just making a database to run handwritten queries – you’re using it to store data for a big project, and that data then needs to be used to fit models, make plots, prepare reports, and all sorts of other useful things. Or perhaps your code is generating data which needs to be stored in a database for later use.

Regardless, you’d like to run queries inside R, Python, or your preferred programming language, and get the results back in a form that can easily be manipulated and used.

Fortunately, PostgreSQL – and most other SQL database systems – use the client-server model of database access. The database is a server, accessible to any program on the local machine (like the psql client) and even to programs on other machines, if the firewall allows it.

Setting Your Password #

You must first set your Postgres password. When you run psql on sculptor, for instance, it knows who you are – you’re logged in via SSH. But when you connect from an R or Python script on your laptop, Postgres has no idea who you are. We need to set a password so Postgres can recognize you.

You only need to do this once. If you’ve set a password once, you don’t need to do it again unless you want to change your password.

Run psql, and at the prompt, type

\password

psql will ask you to Enter new password:. Enter a password you won’t forget. The password will not show up as you type, so don’t be confused that letters or stars don’t appear.

psql will ask you to confirm your new password, and then will return you to the prompt. You can now connect with your username and the password you just set, following the instructions below.

Do not use the same password as for your machine or other databases

Be sure to remember/record the password you type (e.g., in a password manager)

SQL in R #

The RPostgreSQL package provides the interface you need to connect to Postgres from within R. There are similar packages for other database systems, all using a similar interface called DBI, so you can switch to MySQL or MS SQL without changing much code.

To start using Postgres from within R, you need to create a connection object, which represents your connection to the server.

library(RPostgreSQL)

con <- dbConnect(PostgreSQL(), user = "yourusername", password = "yourpassword",
                 dbname = "yourusername", host = "sculptor.stat.cmu.edu")

con now represents the connection to Postgres. Queries can be sent over this connection. You can connect to multiple different databases and send them different queries.

To send a query, use dbSendQuery:

result <- dbSendQuery(con, "SELECT persona, score FROM events WHERE ...")

result is an object representing the result, but does not load the actual results all at once. If the query result is very big, you may want to only look at chunks of it at a time; otherwise, you can load the whole thing into a data frame. dbFetch loads the requested number of rows from the result, or defaults to loading the entire result if you’d prefer, all in a data frame.

data <- dbFetch(result) # load all data

data <- dbFetch(result, n = 10) # load only ten rows

dbClearResult(result)

As a shortcut, dbGetQuery runs a query, fetches all of its results, and clears the result, all in one step.

SQL in Python #

Psycopg is a popular PostgreSQL package for Python. It has a different interface: since Python doesn’t have native data frames, you can instead iterate over the result rows, where each row is a tuple of the columns. To connect:

import psycopg2

conn = psycopg2.connect(host="sculptor.stat.cmu.edu", database="yourusername",
                        user="yourusername", password="yourpassword")

cur = conn.cursor()

baz = "walrus"
spam = "penguin"

cur.execute("INSERT INTO foo (bar, baz, spam) "
            "VALUES (17, %s, %s)", (baz, spam))

Notice the use of interpolation to put variables into the query – see the Safe SQL section below.

If we do a SELECT, we can get the results with a for loop or the fetchone and fetchmany methods:

cur.execute("SELECT * FROM events")

# iterating:
for row in cur:
    print(row)

# instead, one at a time:
row = cur.fetchone()

The execute method is used regardless of the type of query.

If you use pandas for your data frames, you can also convert the results of any query directly into a pandas data frame:

import pandas as pd

d = pd.read_sql_query("SELECT foo, bar FROM events WHERE id = %(id)s",
                      conn, params = {'id': 17})

Storing Your Password #

The code above stores your password right in the source file. This is a bad idea. If the code is ever shared with anyone, posted online, or otherwise revealed, anyone who sees it now has your database username and password and can view or modify any of your data. If you commit the file to Git, your password is now in your Git history forever. Fortunately, there are ways to work around this.

#

Run this R code:

file.edit(file.path("~", ".Rprofile"))

This will create a file called ~/.Rprofile in your home directory and open it for editing. In this file, write something like

DB_USER <- "yourusername"
DB_PASSWORD <- "yourpassword"

Save and close the file. Start a new R session. The DB_USER and DB_PASSWORD variables will be defined in any R script you run, so you can use them in your code. And since the .Rprofile is not in your assignments repository, you won’t accidentally commit it to your Git history.

Python #

Python doesn’t have something like ~/.Rprofile. Instead, when you do an assignment that requires SQL access, create a separate file credentials.py defining your username and password in variables. You can import credentials and then use credentials.DB_USER and credentials.DB_PASSWORD in your other code files.

To avoid accidentally committing credentials.py, create (or modify, if it exists) a file called .gitignore in the root of your assignments repository. Add the line credentials.py to it. This will make Git ignore any files called credentials.py, so you don’t accidentally commit them.

You can use the cryptography or bcrypt packages to encrypt your stored password data as well, though the keys will need to be accessible in most cases.

SQL in Unit Tests #

Question: If you have secret SQL credentials, but your unit tests need to check functions that access the database, how can your tests run on our “public” server? How can anyone else run your tests?

Answer: They can’t. This is fine for our class; in a business environment, you would use a secret-management system to ensure code can always access the passwords it needs, or set up special test databases for development. But we’re keeping things simple here.

So instead you should do the following:

  1. Ensure SQL access is limited to the functions that really need it. Most of your code doesn’t need to access SQL – it just needs to be provided the right data in a convenient form. Only a few functions need to run SQL queries and put the data into the right form for the rest of your code to use. If you follow this design principle, most of your functions can still be tested.
  2. Set your tests to be skipped if the credentials are not available.

For point 2, most unit testing frameworks provide ways of marking tests as being skipped in certain circumstances. For example, in Python:

import pytest
import psycopg2

try:
    import credentials
except ImportError as e:
    no_database = True
else:
    no_database = False

@pytest.mark.skipif(no_database, reason="No database credentials available")
def test_some_database_thingy():
    conn = psycopg2.connect(...)

    # do stuff

In R:

library(testthat)

test_that("database access works", {
  skip_if_not(exists("DB_USER"))

  # do some database stuff
})

Practicing Safe SQL #

Suppose you’ve loaded some data from an external source – a CSV file, input from a user, from a website, another database, wherever. You need to use some of this data to do a SQL query.

result <- dbSendQuery(paste0("SELECT * FROM users WHERE username = '", username, "' ",
                             "AND password = '", password, "'"))

Now suppose username is the string “’; DROP TABLE users;–”. What does the query look like before we send it to Postgres?

SELECT * FROM users
WHERE username = ''; DROP TABLE users; -- AND password = 'theirpassword'

We have injected a new SQL statement, which drops the table. Because -- represents a comment in SQL, the commands following are not executed.

/ox-hugo/xkcd-327.png

(source: xkcd)

Less maliciously, the username might contain a single quote, confusing Postgres about where the string ends and causing syntax errors. Or any number of other weird characters which mess up the query. Clever attackers can use SQL injection to do all kinds of things: imagine if the password variable were foo' OR 1=1 – we’d be able to log in without knowing the right password!

We need a better way of writing queries with parameters determined by the code. Fortunately, database systems provide parametrized queries, where the database software is explicitly told “this is an input, with this value” so it knows not to treat it as SQL syntax. For example:

username <- "'; DROP TABLE users;--"
password <- "walruses"

query <- sqlInterpolate(con,
                        "SELECT * FROM users WHERE username = ?user AND password = ?pass",
                        user = username, pass = password)

users <- dbGetQuery(con, query)

Strings of the form ?var are replaced with the corresponding var in the arguments, but with any special characters escaped so they do not affect the meaning of the query. In this example, query is now

SELECT * FROM users WHERE username = '''; DROP TABLE users;--'
AND password = 'walruses'

Note how the single quote at the beginning of username is doubled there: that’s a standard way of escaping quotation marks, so Postgres recognizes it’s a quote inside a string, not the boundary of the string.

psycopg2 provides similar facilities:

cur.execute("SELECT * FROM users "
            "WHERE username = %(user)s AND password = %(pass)s",
            {"user": username, "pass": password})

You should always use this approach to insert data into SQL queries. You may think it’s safe with your data, but at the least opportune moment, you’ll encounter nasal demons.