SQL in Code #
It’s nice to be able to type queries into psql
and see results, but most often
you’d like to do more than that. You’re not just making a database to run
handwritten queries – you’re using it to store data for a big project, and
that data then needs to be used to fit models, make plots, prepare reports,
and all sorts of other useful things. Or perhaps your code is generating data
which needs to be stored in a database for later use.
Regardless, you’d like to run queries inside R, Python, or your preferred programming language, and get the results back in a form that can easily be manipulated and used.
Fortunately, PostgreSQL – and most other SQL database systems – use the
client-server model of database access. The database is a server, accessible
to any program on the local machine (like the psql
client) and even to
programs on other machines, if the firewall allows it.
Setting Your Password #
You must first set your Postgres password. When you run psql
on sculptor
, for instance, it
knows who you are – you’re logged in via SSH. But when you connect from an R
or Python script on your laptop, Postgres has no idea who you are. We need to
set a password so Postgres can recognize you.
You only need to do this once. If you’ve set a password once, you don’t need to do it again unless you want to change your password.
Run psql
, and at the prompt, type
\password
psql
will ask you to Enter new password:
. Enter a password you
won’t forget. The password will not show up as you type, so don’t
be confused that letters or stars don’t appear.
psql
will ask you to confirm your new password, and then will return you to
the prompt. You can now connect with your username and the password you just
set, following the instructions below.
Do not use the same password as for your machine or other databases
Be sure to remember/record the password you type (e.g., in a password manager)
SQL in R #
The RPostgreSQL package provides the interface you need to connect to Postgres from within R. There are similar packages for other database systems, all using a similar interface called DBI, so you can switch to MySQL or MS SQL without changing much code.
To start using Postgres from within R, you need to create a connection object, which represents your connection to the server.
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user = "yourusername", password = "yourpassword",
dbname = "yourusername", host = "sculptor.stat.cmu.edu")
con
now represents the connection to Postgres. Queries can be sent over this
connection. You can connect to multiple different databases and send them
different queries.
To send a query, use dbSendQuery
:
result <- dbSendQuery(con, "SELECT persona, score FROM events WHERE ...")
result
is an object representing the result, but does not load the actual
results all at once. If the query result is very big, you may want to only
look at chunks of it at a time; otherwise, you can load the whole thing into a
data frame. dbFetch
loads the requested number of rows from the result, or
defaults to loading the entire result if you’d prefer, all in a data frame.
data <- dbFetch(result) # load all data
data <- dbFetch(result, n = 10) # load only ten rows
dbClearResult(result)
As a shortcut, dbGetQuery
runs a query, fetches all of its results, and clears
the result, all in one step.
SQL in Python #
Psycopg is a popular PostgreSQL package for Python. It has a different interface: since Python doesn’t have native data frames, you can instead iterate over the result rows, where each row is a tuple of the columns. To connect:
import psycopg2
conn = psycopg2.connect(host="sculptor.stat.cmu.edu", database="yourusername",
user="yourusername", password="yourpassword")
cur = conn.cursor()
baz = "walrus"
spam = "penguin"
cur.execute("INSERT INTO foo (bar, baz, spam) "
"VALUES (17, %s, %s)", (baz, spam))
Notice the use of interpolation to put variables into the query – see the Safe SQL section below.
If we do a SELECT
, we can get the results with a for
loop or the fetchone
and
fetchmany
methods:
cur.execute("SELECT * FROM events")
# iterating:
for row in cur:
print(row)
# instead, one at a time:
row = cur.fetchone()
The execute
method is used regardless of the type of query.
If you use pandas for your data frames, you can also convert the results of any query directly into a pandas data frame:
import pandas as pd
d = pd.read_sql_query("SELECT foo, bar FROM events WHERE id = %(id)s",
conn, params = {'id': 17})
Storing Your Password #
The code above stores your password right in the source file. This is a bad idea. If the code is ever shared with anyone, posted online, or otherwise revealed, anyone who sees it now has your database username and password and can view or modify any of your data. If you commit the file to Git, your password is now in your Git history forever. Fortunately, there are ways to work around this.
R #
Run this R code:
file.edit(file.path("~", ".Rprofile"))
This will create a file called ~/.Rprofile
in your home directory and open
it for editing. In this file, write something like
DB_USER <- "yourusername"
DB_PASSWORD <- "yourpassword"
Save and close the file. Start a new R session. The DB_USER
and DB_PASSWORD
variables will be defined in any R script you run, so you can use them in
your code. And since the .Rprofile
is not in your assignments repository,
you won’t accidentally commit it to your Git history.
Python #
Python doesn’t have something like ~/.Rprofile
. Instead, when you do an
assignment that requires SQL access, create a separate file credentials.py
defining your username and password in variables. You can import credentials
and then use credentials.DB_USER
and credentials.DB_PASSWORD
in your other
code files.
To avoid accidentally committing credentials.py
, create (or modify, if it
exists) a file called .gitignore
in the root of your assignments
repository.
Add the line credentials.py
to it. This will make Git ignore any files
called credentials.py
, so you don’t accidentally commit them.
You can use the cryptography
or bcrypt
packages to encrypt your stored
password data as well, though the keys will need to be accessible
in most cases.
SQL in Unit Tests #
Question: If you have secret SQL credentials, but your unit tests need to check functions that access the database, how can your tests run on our “public” server? How can anyone else run your tests?
Answer: They can’t. This is fine for our class; in a business environment, you would use a secret-management system to ensure code can always access the passwords it needs, or set up special test databases for development. But we’re keeping things simple here.
So instead you should do the following:
- Ensure SQL access is limited to the functions that really need it. Most of your code doesn’t need to access SQL – it just needs to be provided the right data in a convenient form. Only a few functions need to run SQL queries and put the data into the right form for the rest of your code to use. If you follow this design principle, most of your functions can still be tested.
- Set your tests to be skipped if the credentials are not available.
For point 2, most unit testing frameworks provide ways of marking tests as being skipped in certain circumstances. For example, in Python:
import pytest
import psycopg2
try:
import credentials
except ImportError as e:
no_database = True
else:
no_database = False
@pytest.mark.skipif(no_database, reason="No database credentials available")
def test_some_database_thingy():
conn = psycopg2.connect(...)
# do stuff
In R:
library(testthat)
test_that("database access works", {
skip_if_not(exists("DB_USER"))
# do some database stuff
})
Practicing Safe SQL #
Suppose you’ve loaded some data from an external source – a CSV file, input from a user, from a website, another database, wherever. You need to use some of this data to do a SQL query.
result <- dbSendQuery(paste0("SELECT * FROM users WHERE username = '", username, "' ",
"AND password = '", password, "'"))
Now suppose username
is the string “’; DROP TABLE users;–”. What does the
query look like before we send it to Postgres?
SELECT * FROM users
WHERE username = ''; DROP TABLE users; -- AND password = 'theirpassword'
We have injected a new SQL statement, which drops the table. Because --
represents a comment in SQL, the commands following are not executed.
(source: xkcd)
Less maliciously, the username might contain a single quote, confusing
Postgres about where the string ends and causing syntax errors. Or any number
of other weird characters which mess up the query. Clever attackers can use
SQL injection to do all kinds of things: imagine if the password
variable were
foo' OR 1=1
– we’d be able to log in without knowing the right password!
We need a better way of writing queries with parameters determined by the code. Fortunately, database systems provide parametrized queries, where the database software is explicitly told “this is an input, with this value” so it knows not to treat it as SQL syntax. For example:
username <- "'; DROP TABLE users;--"
password <- "walruses"
query <- sqlInterpolate(con,
"SELECT * FROM users WHERE username = ?user AND password = ?pass",
user = username, pass = password)
users <- dbGetQuery(con, query)
Strings of the form ?var
are replaced with the corresponding var
in the
arguments, but with any special characters escaped so they do not affect the
meaning of the query. In this example, query
is now
SELECT * FROM users WHERE username = '''; DROP TABLE users;--'
AND password = 'walruses'
Note how the single quote at the beginning of username
is doubled there:
that’s a standard way of escaping quotation marks, so Postgres recognizes it’s
a quote inside a string, not the boundary of the string.
psycopg2 provides similar facilities:
cur.execute("SELECT * FROM users "
"WHERE username = %(user)s AND password = %(pass)s",
{"user": username, "pass": password})
You should always use this approach to insert data into SQL queries. You may think it’s safe with your data, but at the least opportune moment, you’ll encounter nasal demons.