## Advanced Example: Text Processing #

The purpose of this example is to give a flavor for how Postgres can be used for novel data types, like text data. While you may want some of this logic in your program, there can be value in keeping it close to the data as well.

Load extensions fuzzystrmatch, cube, and pg_tgrm by doing the following postgres commands:

create extension fuzzystrmatch;
create extension cube;
create extension pg_trgm;

Get the file ClassFiles/week4/movies_data.sql from the documents repository and import it into your running psql with \i movies_data.sql or something similar.

The LIKE (and ILIKE for case-insensitive matching) is a simple SQL facility for text searching. Try this:

select title from movies where title ilike 'stardust%';

This searches for text that matches a specified pattern: in this case, the word stardust followed by any number of other characters. In these patterns % and _ are wildcard characters: % matches any number (zero or more) of characters and _ matches any single character. The following forces at least one character after the word stardust:

select title from movies where title ilike 'stardust_%';

These are useful commands, but the types of patterns they can capture are rather simple. A more powerful pattern language is given by regular expressions. We’ll see some examples today and talk about them in more detail as we proceed through the semester.

The regular expression matching operator is ~, which is preceded by ! to negate the search and followed by * to make the search case insensitive. Not especially mnemonic!

select count(*) from movies
where title !~* '^the.*';

This counts how many movies (in table movies) do not start with the word ‘the’. The following counts the movies that have the word ‘the’ anywhere in the title, where the case of the word is ignored:

select count(*) from movies
where title ~* 'the';

Compare what happens when we don’t ignore case:

select count(*) from movies
where title ~ 'the';

Regular expressions offer a rich variety of possible patterns; PostgreSQL supports essentially the POSIX version of regular expressions.

We can make searches like this faster by creating an index:

create index movie_title_pattern on movies (lower(title) text_pattern_ops);

This optimizes title searches for case insensitive pattern matching.

The Levenshtein distance measures how far apart two words are lexically, effectively counting the number of simple steps (character insertions, deletions, swaps, etc) required to transform one word into another. We can compute this with the levenshtein() function:

select levenshtein('guava', 'guano');
levenshtein('bat','fat')  as fat,
levenshtein('bat','bad')  as bad;

This allows us to fuzzy” search our text data:

select movie_id, title from movies
where levenshtein(lower(title), lower('a hard day nght')) <= 3;
select movie_id, title from movies
where levenshtein(lower(title), lower('a hard day nght')) <= 8;

A trigram is a group of three consecutive characters taken from a string:

select show_trgm('Avatar');

We can create a trigram index for faster string searching based on the number of matching trigrams.

create index movies_title_trigram on movies
using gist(title gist_trgm_ops);
select title from movies where title % 'Avatre';

It will be faster even if not too noticeably for this data size.

We can also do full-text searches using a bag of words style match with the @@ operator. For example:

select title from movies where title @@ 'night & day';

This uses as its lexicon a large dictionary in a specified language (English by default on our server). The bag of words and query vector are stored in data types ts_vector and ts_query, respectively. We can see the representations using some helper functions:

select to_tsvector('A Hard Day''s Night'), to_tsquery('english', 'night & day');

Notice that the ‘A’ is missing in the first column: simple, high-frequency words (called ‘stop words’) are dropped by default. It is possible to configure the list of stop words use (or even make it empty, as is more current common practice).

We can create indexes to speed this search:

explain select * from movies where title @@ 'night & day';
create index movies_title_searchable on movies
using gin(to_tsvector('english', title));
explain select * from movies where title @@ 'night & day';
explain select * from movies
where to_tsvector('english', title) @@ 'night & day';

The index does nothing (but add overhead) for the first select because we did not specify the english ts-vector directly. The last search makes it clear, and postgres uses the index to reduce cost substantially.

Finally, we look at multi-dimensional attributes. The genre column of movies gives a score to each movie in several categories of movie genres as defined by the table genres.

Using the cube extension, we can search this attribute in useful ways:

select * from genres;
select * from movies where title @@ 'star & wars';
select name, cube_ur_coord('(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)', position) as score
from genres g
where cube_ur_coord('(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)', position) > 0;

This lists what genres the movie Star Wars belongs to (and with what scores).

Now we can find movies that have similar genre assignments:

select title,
cube_distance(genre, '(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)') as dist
from movies
ORDER BY dist
LIMIT 16;

This gives the 16 movies closest in genre to ‘Star Wars’.

We can also look at movies in a bounding box using the cube_enlarge() function and the @> special contains operator. The following gives the 10 closest movies to ‘Mad Max’ in a size 5 cube (in the 18-dimensional genre space) around that movies genre vector:

select m.movie_id, m.title
from movies as m,
(select genre, title from movies
where title = 'Mad Max') as s
where cube_enlarge(s.genre, 5, 18) @> m.genre AND
s.title != m.title
order by cube_distance(m.genre, s.genre)
limit 10;

And there’s much more in this vein…

## Subqueries #

You can use select query in ()’s within a WHERE clause:

select title, starts from social_events where
venue_id in (select id from venues where name ~ 'room');

There are various other functions/operators that can be used on subqueries as well, such as in, not in, exists, any, all, and some.

For example, this looks like an inner join:

select title, starts from social_events where
exists (select 1 from venues where id = social_events.venue_id);

How would you do this with a join?

## Aggregate Functions and Grouping #

Aggregate functions operate on one or more attributes to produce a summary value. Examples: count, max, min, sum.

Add Pittsburgh (and perhaps some other cities) to the cities table. Add two or more Pittsburgh venues and one or more social events at each of those venues.

Let’s count the number of social events at one of those venues:

select count(*) from social_events where venue_id = 4;

Exercise: Given a city name (like ‘Pittburgh’), count the total number of social events within that city.

We often want to apply aggregate functions not just to whole columns but to groups of rows within columns. This is the province of the GROUP BY clause.

We could write

select count(*) from events where venue_id = 1;
select count(*) from events where venue_id = 2;
select count(*) from events where venue_id = 3;
select count(*) from events where venue_id = 4;

But that is tedious and gives four separate scalars.

Instead, we use the GROUP BY modifier:

select venue_id, count(*) from events group by venue_id;

You can apply conditions on grouped queries. Instead of WHERE for those conditions, you use HAVING, with otherwise the same syntax. Short version: WHERE select rows, and HAVING selects groups.

select venue_id, count(*) from events group by venue_id
having venue_id is not NULL;
select venue_id, count(*) from events group by venue_id
having count(*) > 1 AND venue_id is not NULL;

## Window Functions #

Aggregate functions let you summarize mulitiple rows, but they summarize with a single value for each group. Window functions are similar except they let us tag each row with the summary.

(Make sure you have at least one venue with more than one social event in the example to follow. For instance,

insert into social_events (title, venue_id)
values ('Valentine''s Day Party', 1), ('April Fool''s Day Party', 1);

Notice the double “ to escape the single quote.)

Now, compare

select venue_id, count(*) from social_events group by venue_id
order by venue_id;
select venue_id, count(*) OVER (PARTITION BY venue_id)
from social_events order by venue_id;