Advanced Example: Text Processing #
The purpose of this example is to give a flavor for how Postgres can be used for novel data types, like text data. While you may want some of this logic in your program, there can be value in keeping it close to the data as well.
Load extensions fuzzystrmatch
, cube
, and pg_tgrm
by doing the
following postgres commands:
create extension fuzzystrmatch;
create extension cube;
create extension pg_trgm;
Get the file Activities/sql/movies_data.sql
from the documents
repository
and import it into your running psql with \i movies_data.sql
or something
similar.
The LIKE
(and ILIKE
for case-insensitive matching) is a simple SQL
facility for text searching. Try this:
select title from movies where title ilike 'stardust%';
This searches for text that matches a specified pattern: in this case,
the word stardust followed by any number of other characters.
In these patterns %
and _
are wildcard characters: %
matches any
number (zero or more) of characters and _
matches any single character.
The following forces at least one character after the word stardust:
select title from movies where title ilike 'stardust_%';
These are useful commands, but the types of patterns they can capture are rather simple. A more powerful pattern language is given by regular expressions. We’ll see some examples today and talk about them in more detail as we proceed through the semester.
The regular expression matching operator is ~
, which is
preceded by !
to negate the search and followed by *
to
make the search case insensitive. Not especially mnemonic!
select count(*) from movies
where title !~* '^the.*';
This counts how many movies (in table movies
) do not start
with the word ’the’. The following counts the movies
that have the word ’the’ anywhere in the title,
where the case of the word is ignored:
select count(*) from movies
where title ~* 'the';
Compare what happens when we don’t ignore case:
select count(*) from movies
where title ~ 'the';
Regular expressions offer a rich variety of possible patterns; PostgreSQL supports essentially the POSIX version of regular expressions.
We can make searches like this faster by creating an index:
create index movie_title_pattern on movies (lower(title) text_pattern_ops);
This optimizes title searches for case insensitive pattern matching.
The Levenshtein distance measures how far apart two words are
lexically, effectively counting the number of simple steps
(character insertions, deletions, swaps, etc) required to transform
one word into another. We can compute this with the levenshtein()
function:
select levenshtein('guava', 'guano');
select levenshtein('bat','fads') as fads,
levenshtein('bat','fad') as fad,
levenshtein('bat','fat') as fat,
levenshtein('bat','bad') as bad;
This allows us to ``fuzzy’’ search our text data:
select movie_id, title from movies
where levenshtein(lower(title), lower('a hard day nght')) <= 3;
select movie_id, title from movies
where levenshtein(lower(title), lower('a hard day nght')) <= 8;
A trigram is a group of three consecutive characters taken from a string:
select show_trgm('Avatar');
We can create a trigram index for faster string searching based on the number of matching trigrams.
create index movies_title_trigram on movies
using gist(title gist_trgm_ops);
select title from movies where title % 'Avatre';
It will be faster even if not too noticeably for this data size.
We can also do full-text searches using a bag of words style match
with the @@
operator.
For example:
select title from movies where title @@ 'night & day';
This uses as its lexicon a large dictionary in a specified
language (English by default on our server).
The bag of words and query vector are stored in data types
ts_vector
and ts_query
, respectively. We can see the
representations using some helper functions:
select to_tsvector('A Hard Day''s Night'), to_tsquery('english', 'night & day');
Notice that the ‘A’ is missing in the first column: simple, high-frequency words (called ‘stop words’) are dropped by default. It is possible to configure the list of stop words use (or even make it empty, as is more current common practice).
We can create indexes to speed this search:
explain select * from movies where title @@ 'night & day';
create index movies_title_searchable on movies
using gin(to_tsvector('english', title));
explain select * from movies where title @@ 'night & day';
explain select * from movies
where to_tsvector('english', title) @@ 'night & day';
The index does nothing (but add overhead) for the first select because we did not specify the english ts-vector directly. The last search makes it clear, and postgres uses the index to reduce cost substantially.
Finally, we look at multi-dimensional attributes.
The genre
column of movies
gives a score to each
movie in several categories of movie genres as defined
by the table genres
.
Using the cube
extension, we can search this attribute
in useful ways:
select * from genres;
select * from movies where title @@ 'star & wars';
select name, cube_ur_coord('(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)', position) as score
from genres g
where cube_ur_coord('(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)', position) > 0;
This lists what genres the movie Star Wars belongs to (and with what scores).
Now we can find movies that have similar genre assignments:
select title,
cube_distance(genre, '(0, 7, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 10, 0, 0, 0)') as dist
from movies
ORDER BY dist
LIMIT 16;
This gives the 16 movies closest in genre to ‘Star Wars’.
We can also look at movies in a bounding box using the
cube_enlarge()
function and the @>
special contains operator.
The following gives the 10 closest movies to ‘Mad Max’ in
a size 5 cube (in the 18-dimensional genre space) around
that movies genre vector:
select m.movie_id, m.title
from movies as m,
(select genre, title from movies
where title = 'Mad Max') as s
where cube_enlarge(s.genre, 5, 18) @> m.genre AND
s.title != m.title
order by cube_distance(m.genre, s.genre)
limit 10;
And there’s much more in this vein…
Subqueries #
You can use select query in ()
’s within a WHERE
clause:
select title, starts from social_events where
venue_id in (select id from venues where name ~ 'room');
There are various other functions/operators that can be used on
subqueries as well, such as in
, not in
, exists
, any
, all
, and some
.
For example, this looks like an inner join:
select title, starts from social_events where
exists (select 1 from venues where id = social_events.venue_id);
How would you do this with a join?
Window Functions #
Aggregate functions let you summarize mulitiple rows, but they summarize with a single value for each group. Window functions are similar except they let us tag each row with the summary.
(Make sure you have at least one venue with more than one social event in the example to follow. For instance,
insert into social_events (title, venue_id)
values ('Valentine''s Day Party', 1), ('April Fool''s Day Party', 1);
Notice the double ’’ to escape the single quote.)
Now, compare
select venue_id, count(*) from social_events group by venue_id
order by venue_id;
select venue_id, count(*) OVER (PARTITION BY venue_id)
from social_events order by venue_id;