kivikakk.ee

After many discussions recently at work with my coworker who does our webdev and DBA about how crap MySQL is, I decided to finally check out the alternative which I’ve known since the start was better (yet never have bothered [dared?] to try): PostgreSQL.

A brief overview of PostgreSQL

Postgres is a generally better polished piece of software (often compared to , with less rubbish surrounding its licensing (fuck you, MySQL AB Oracle), better extensibility (PL/*), better standards compliance, foreign-key constraints (which are often forced in software instead in MySQL), a much richer type set .. sold you yet? You can have frickin’ multi-dimensional arrays as column types, and query on subelements! Amazing.

Of course, if you install the postgresql and postgresql-client packages, you’ll find you have an installation which you can do very little with. (no way to get in by invoking psql).

Postgres does its authentication—by default—a bit differently. After installing the server, you’ll have a new postgres user on your system; if you su into it (perhaps via a superuser account), you’ll now be the superuser for your database server, too, by virtue of being logged into that account. Postgres—again, by default—ties the database accounts to the system ones.

We can now use createuser to make an account for our regular user. Since I’m doing this on a development machine (my laptop), I’ll just create a superuser account for it. My account’s called “celtic”, so it’s just a matter of entering createuser -s celtic. Read createuser’s man page for more.

Exiting out of those subshells and back as our regular user, we now can create ourselves a database to toy with, using createdb. All Postgres tools that operate on databases will default to using your username as a database name (i.e. celtic’s default database is named celtic). createdb is no different, so you can just invoke it with no arguments and it’ll make your default database after a short delay.

You can now invoke psql, and get a lovely prompt:

(celtic) ~:6320 (0)% psql
psql (8.4.5)
Type "help" for help.

celtic=#

The word celtic at the prompt represents the database I’m using, and the # is because I’m a superuser (non-superusers see >).

Now that we can connect to our database and mess around with stuff, let’s take a look at the SQL understood by Postgres.

It differs in small ways from MySQL. Table and column names lose their capitalisation if you don’t quote them, for example:

celtic=# create table XYZabc ();
CREATE TABLE
celtic=# \d
        List of relations
 Schema |  Name  | Type  | Owner  
--------+--------+-------+--------
 public | xyzabc | table | celtic
(1 row)

celtic=# create table "XYZabc" ();
CREATE TABLE
celtic=# \d
        List of relations
 Schema |  Name  | Type  | Owner  
--------+--------+-------+--------
 public | XYZabc | table | celtic
 public | xyzabc | table | celtic
(2 rows)

celtic=#

Quoting is another good point. Backticks aren’t used for identifiers; instead, use double-quotes (!). Single-quotes quote strings. This may take a moment to get used to.

Another thing is Postgres’s explicit use of sequences. In MySQL, if you want a unique numeric ID for a table, you’d probably just add a column like abcId INT PRIMARY KEY NOT NULL AUTO_INCREMENT, with a few caveats: there can be only one AUTO_INCREMENT per table, it must be indexed, they don’t work if you try to use negative numbers with them, and it has to be an integer or floating point number column.

Postgres has a construct that looks quite similar; abcId SERIAL PRIMARY KEY NOT NULL. In this case, SERIAL is the “type” of this column, but what it ends up looking like is quite different:

celtic=# create table abc (abcID serial primary key not null);
NOTICE:  CREATE TABLE will create implicit sequence "abc_abcid_seq"
for serial column "abc.abcid"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"abc_pkey" for table "abc"
CREATE TABLE
celtic=# \d
             List of relations
 Schema |     Name      |   Type   | Owner  
--------+---------------+----------+--------
 public | abc           | table    | celtic
 public | abc_abcid_seq | sequence | celtic
(2 rows)

celtic=# \d abc
                           Table "public.abc"
 Column |  Type   |                      Modifiers                      
--------+---------+-----------------------------------------------------
 abcid  | integer | not null default nextval('abc_abcid_seq'::regclass)
Indexes:
    "abc_pkey" PRIMARY KEY, btree (abcid)

celtic=# \d abc_abcid_seq
        Sequence "public.abc_abcid_seq"
    Column     |  Type   |        Value        
---------------+---------+---------------------
 sequence_name | name    | abc_abcid_seq
 last_value    | bigint  | 1
 start_value   | bigint  | 1
 increment_by  | bigint  | 1
 max_value     | bigint  | 9223372036854775807
 min_value     | bigint  | 1
 cache_value   | bigint  | 1
 log_cnt       | bigint  | 1
 is_cycled     | boolean | f
 is_called     | boolean | f

celtic=#

That’s a lot to digest!

When we create the table, PgSQL tells us it’s creating an implicit sequence called abc_abcid_seq for the serial column abc.abcid. Okay. The PRIMARY KEY invocation is also expanded into creation of an index; note that Postgres tells you these things explicitly. This is also a nice reminder that these things can be done manually, too.

When we detail the database with \d, we now see there are two relations; our table, and its sequence. Note that the sequence isn’t really explicitly bound to the table in any way, it’s just named so you can tell it belongs to it.

Inspecting the table, we can see that abcid is now actually an integer column marked not null. It also has a default value, nextval('abc_abcid_seq'::regclass). This is actually a function call—Postgres supports arbitrary expressions (subject to some caveats) for calculating default values, not just NOW().

I won’t delve into the function call too much at this stage, but nextval() is an Postgres-supplied function which takes a handle to a relation (which should be a sequence), which is of type regclass, and returns the next value the sequence should generate, as well as causing it to advance its internal pointer such that it’ll return the following number next time.

The paamayim nekudotayim :: is the cast operator, and in this case casts the string-literal 'abc_abcid_seq' to a regclass. Technically it’s not necessary, as it gets casted implicitly for you anyway:

celtic=# select nextval('abc_abcid_seq');
 nextval
---------
       1
(1 row)

celtic=# select nextval('abc_abcid_seq');
 nextval
---------
       2
(1 row)

celtic=#

.. but I suppose the system is being explicit, given that it’s the one generating the values.

So that’s how the serial column “type” works in Postgres. You can use these for anything, and not necessarily tied to only one table, or any table at all; perhaps you create a standalone sequence (CREATE SEQUENCE xyzzy), then use it in some functions. There’s a family of functions for manipulating its value manually, so you have complete control of how the sequence is generated. It’s a really neat mechanism, and I’m looking forward to exploiting it in my own applications.

Enter PG’OCaml

Now, let’s say we wanted to use this amazing database system in our very own OCaml application. The first thing I did was look for a Debian package called libpostgresql-ocaml-dev, and lo and behold, there’s one! I installed it, and searching for docs for it, I stumbled upon the homepage of PG’OCaml. Normally I eschew the “non-standard” library, but I was pleasantly surprised to find that this was also in Debian: libpgocaml-ocaml-dev! Let’s install that instead.

Its killer feature? PG’OCaml type-checks your SQL statements at compile-time to make a strong, type-safe SQL library with syntax extensions so you do no manual (and unsafe) casting of data, nor do you accidentally pass the wrong type of data when trying to concatenate a barely-injection-safe query together.

The “downside” (if you can call it that) is that PG’OCaml needs access to your database at compile-time to find out the structure of your database. This is rarely an issue. The amazing thing is that, not only does it make sure that you’re working with the right types and doing all the ugly conversion under-the-table for you, it also carries the amazing feeling of using a safe static-type-inferring language right through to your database layer: make a change to your database (that mangles your code without you realising), and you’ll now get a compile-error when you recompile! Of course, the caveat is that you have to recompile to get the knowledge upfront, but you’ll get runtime errors as well, just like you would with any other database connector.

Let’s begin coding with PG’OCaml:

let report (name,age) =
  Printf.printf "%s is %d years old\n%!" name age

let dbh = PGOCaml.connect ()
in
let results =
  PGSQL (dbh) "select name, age from tblpeople"
in
List.iter report results

This is deceptively simple: we connect to a database (where are the connection settings?!), query it using PGSQL, and then List.iter on the supposedly well-formed and well-typed result?!

The answer, of course, is yes.

To get this running, try compiling it using ocamlfind with a line like this:

ocamlfind ocamlc -linkpkg -package pgocaml.syntax -syntax camlp4o test.ml -o test

Bam! Compile error:

ERROR: 42P01: relation "tblpeople" does not exist
File "test.ml", line 7, characters 2-47:
Camlp4: Uncaught exception: PGOCaml_generic.Make(Thread).PostgreSQL_Error ("ERROR: 42P01: relation \"tblpeople\" does not exist", [(83 | CstTag84, "ERROR"); (67 | CstTag68, "42P01"); (77 | CstTag78, "relation \"tblpeople\" does not exist"); (80 | CstTag81, "23"); (70 | CstTag71, "parse_relation.c"); (76 | CstTag77, "885"); (82 | CstTag83, "parserOpenTable")])

File "test.ml", line 1, characters 0-1:
Error: Preprocessor error

There’s a lengthy error courtesy of camlp4, but the first line is the important one; the table doesn’t exist! We’re informed at compile-time. Saviour!

Let’s fix this in psql:

celtic=# create table tblpeople (name varchar primary key not null, age int not null);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "tblpeople_pkey" for table "tblpeople"
CREATE TABLE
celtic=# insert into tblpeople values ('Mysterious', 21), ('Anneli', 20), ('Bjoerk', 45);
INSERT 0 3
celtic=#

Compiling again gives us a slightly different error:

File "test.ml", line 9, characters 17-24:
Error: This expression has type
         (string * int32) list PGOCaml.monad PGOCaml.monad =
           (string * int32) list
       but an expression was expected of type (string * int) list

Well, you can’t get everything right the first time. We defined tblpeople.age as an int in Postgres, which is 4 bytes long. An OCaml int, on the other hand, is either 31 or 63 bits long (depending on your machine architecture) due to their tagged pointer representation, and PG’OCaml will not sacrifice the accuracy silently.

Since we’re not concerned about the last bit (we hope no one should live that long ..), just fix the report function:

let report (name,age) =
  Printf.printf "%s is %d years old\n%!" name (Int32.to_int age)

(Edit: as Eric pointed out in the comments, an easier way is to just print the int32 itself using %ld in the format string!)

Recompiling gives great success. Now run:

(celtic) pgo:6427 (0)% ./test
Mysterious is 21 years old
Anneli is 20 years old
Bjoerk is 45 years old
(celtic) pgo:6428 (0)%

Just like that, it works! The types were all inferred, the database connection was smooth.. it all just worked.

PG’OCaml details

Connection strings, connection options

PG’OCaml looks to your environment to determine which database to use, both when compiling and running. The environment variables are the same used by the Postgres tools themselves, so this is a great help: set PGDATABASE, PGHOST, PGHOST, etc. if need be.

You can also override options manually in the code itself, and they’re specified separately (and possibly extraneously) for the compile-time stage, and run-time.

For compile-time, you can set options by using connection strings just before a query itself:

let results =
  PGSQL (dbh) "database=mydb" "user=jenkins" "select name, age from tblpeople"

These are honoured at compile-time when type-checking (and type-determining, I suppose) each individual PG’OCaml expression, and override environment variable settings.

These are not to be found at run-time—indeed, the information is relevant and necessary only once at runtime: when connecting. For that reason, the PGOCaml.connect function has this type-signature:

val connect : ?host:string ->
       ?port:int ->
       ?user:string ->
       ?password:string ->
       ?database:string ->
       ?unix_domain_socket_dir:string ->
       unit ->
       'a t monad

Thus, to have the program use the same (forced) connection settings at runtime:

let dbh = PGOCaml.connect ~user:"jenkins" ~database:"mydb" ()

There’s obviously an advantage to these being separate, as it means computed/inputted values can be used for the connection at runtime (i.e. in production), a concept which makes little sense at compile-time. I tend to use environment variables to control the compile-time settings.

PGOCaml

The module name in OCaml is PGOCaml, not PGOcaml. This tripped me up.

Parameterised fields

They’re easy! (surprise!) Let’s say you have a value (id : int) which you want to match on an integer column in the database:

let result =
  PGSQL (dbh) "SELECT data FROM tblentities WHERE entityid=$id"

It’s that simple! Right? Wrong! OCaml’s int and Postgres’s integer don’t mix! Simply fixed, however:

let pid = Int32.of_int id
in
let result =
  PGSQL (dbh) "SELECT data FROM tblentities WHERE entityid=$pid"

It’s even better, though, if you just agree to use int32s elsewhere, though!

The exact same syntax works for INSERTs, too. There’s also a nifty feature: use $?name instead of $name, and it’ll type as an option type—Some x will become x (converted appropriately), but None will render as NULL. Nifty!

For more reading on PG’OCaml

See the PG’OCaml website, and this informative tutorial by Dario Teixeira. I strongly suggest you read both, as there’s a lot more to both PostgreSQL and PG’OCaml.