After many recent discussions at work with my coworker (who does our webdev and DBA work) about how crap MySQL is, I decided to finally check out the alternative I’ve known since the start was better (yet have never bothered [dared?] to try): PostgreSQL.
A brief overview of PostgreSQL
Postgres is a generally better-polished piece of software, with less rubbish surrounding its licensing (fuck you, MySQL AB, or rather Oracle these days), better extensibility (PL/*), better standards compliance, foreign-key constraints (which in MySQL are often enforced in application software instead), a much richer type set .. sold you yet? You can have frickin’ multi-dimensional arrays as column types, and query on subelements! Amazing.
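For instance, here’s a quick sketch of an array column (the table and data are made up, not from my session):

create table grids (name text primary key, cells integer[][]);
insert into grids values ('demo', '{{1,2},{3,4}}');
select cells[2][1] from grids where name = 'demo';  -- second row, first element: 3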
Of course, if you install the postgresql and postgresql-client packages, you’ll find you have an installation which you can do very little with (there’s no way to get in by invoking psql).
Postgres does its authentication a bit differently by default. After installing the server, you’ll have a new postgres user on your system; if you su into it (perhaps via a superuser account), you’ll now be the superuser for your database server, too, by virtue of being logged into that account. Postgres, again by default, ties the database accounts to the system ones.
We can now use createuser to make an account for our regular user. Since I’m doing this on a development machine (my laptop), I’ll just create a superuser account for it. My account’s called “celtic”, so it’s just a matter of entering createuser -s celtic. Read createuser’s man page for more.
Exiting out of those subshells and back as our regular user, we can now create ourselves a database to toy with, using createdb. All Postgres tools that operate on databases default to using your username as the database name (i.e. celtic’s default database is named celtic). createdb is no different, so you can just invoke it with no arguments and it’ll make your default database after a short delay.
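In other words, the whole dance boils down to something like this (account names as above):

# as the postgres system user:
createuser -s celtic
# back as your own user:
createdb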
You can now invoke psql, and get a lovely prompt:
(celtic) ~:6320 (0)% psql
psql (8.4.5)
Type "help" for help.
celtic=#
The word celtic at the prompt represents the database I’m using, and the # is because I’m a superuser (non-superusers see >).
Now that we can connect to our database and mess around with stuff, let’s take a look at the SQL understood by Postgres.
It differs in small ways from MySQL. Table and column names lose their capitalisation if you don’t quote them, for example:
celtic=# create table XYZabc ();
CREATE TABLE
celtic=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+-------+--------
public | xyzabc | table | celtic
(1 row)
celtic=# create table "XYZabc" ();
CREATE TABLE
celtic=# \d
List of relations
Schema | Name | Type | Owner
--------+--------+-------+--------
public | XYZabc | table | celtic
public | xyzabc | table | celtic
(2 rows)
celtic=#
Quoting is another difference. Backticks aren’t used for identifiers; instead, use double-quotes (!). Single-quotes quote strings. This may take a moment to get used to.
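A quick sketch, carrying on from the tables above: double quotes name an identifier exactly as written, while single quotes make a string literal.

select * from "XYZabc";           -- the mixed-case table we just created
select 'XYZabc' as just_a_string; -- a plain six-character string value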
Another thing is Postgres’s explicit use of sequences. In MySQL, if you want a unique numeric ID for a table, you’d probably just add a column like abcId INT PRIMARY KEY NOT NULL AUTO_INCREMENT, with a few caveats: there can be only one AUTO_INCREMENT per table, it must be indexed, it doesn’t work with negative numbers, and it has to be an integer or floating-point column.
Postgres has a construct that looks quite similar: abcId SERIAL PRIMARY KEY NOT NULL. In this case, SERIAL is the “type” of the column, but what it ends up looking like is quite different:
celtic=# create table abc (abcID serial primary key not null);
NOTICE: CREATE TABLE will create implicit sequence "abc_abcid_seq"
for serial column "abc.abcid"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
"abc_pkey" for table "abc"
CREATE TABLE
celtic=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------+----------+--------
public | abc | table | celtic
public | abc_abcid_seq | sequence | celtic
(2 rows)
celtic=# \d abc
Table "public.abc"
Column | Type | Modifiers
--------+---------+-----------------------------------------------------
abcid | integer | not null default nextval('abc_abcid_seq'::regclass)
Indexes:
"abc_pkey" PRIMARY KEY, btree (abcid)
celtic=# \d abc_abcid_seq
Sequence "public.abc_abcid_seq"
Column | Type | Value
---------------+---------+---------------------
sequence_name | name | abc_abcid_seq
last_value | bigint | 1
start_value | bigint | 1
increment_by | bigint | 1
max_value | bigint | 9223372036854775807
min_value | bigint | 1
cache_value | bigint | 1
log_cnt | bigint | 1
is_cycled | boolean | f
is_called | boolean | f
celtic=#
That’s a lot to digest!
When we create the table, PgSQL tells us it’s creating an implicit sequence called abc_abcid_seq for the serial column abc.abcid. Okay. The PRIMARY KEY invocation is also expanded into the creation of an index; note that Postgres tells you these things explicitly. This is also a nice reminder that these things can be done manually, too.
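Roughly speaking (a sketch, not the exact internal expansion), the serial shorthand amounts to something like this:

create sequence abc_abcid_seq;
create table abc (
    abcid integer not null default nextval('abc_abcid_seq') primary key
);
-- serial also marks the sequence as belonging to the column:
alter sequence abc_abcid_seq owned by abc.abcid;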
When we detail the database with \d, we now see there are two relations: our table, and its sequence. Note that the sequence is a relation in its own right; beyond being marked as owned by the serial column, it’s really just the naming that tells you it belongs to the table.
Inspecting the table, we can see that abcid is now actually an integer column marked not null. It also has a default value, nextval('abc_abcid_seq'::regclass). This is actually a function call: Postgres supports arbitrary expressions (subject to some caveats) for calculating default values, not just NOW().
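For example (a made-up sketch), a default can be any expression you like, including function calls:

create table logins (
    username  text        not null,
    logged_at timestamptz not null default now(),
    token     text        not null default md5(random()::text)
);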
I won’t delve into the function call too much at this stage, but nextval() is a Postgres-supplied function which takes a handle to a relation (which should be a sequence), of type regclass, and returns the next value the sequence should generate, advancing its internal pointer so that it’ll return the following number next time.
The paamayim nekudotayim :: is the cast operator; in this case it casts the string literal 'abc_abcid_seq' to a regclass. Technically it’s not necessary, as it gets cast implicitly for you anyway:
celtic=# select nextval('abc_abcid_seq');
nextval
---------
1
(1 row)
celtic=# select nextval('abc_abcid_seq');
nextval
---------
2
(1 row)
celtic=#
.. but I suppose the system is being explicit, given that it’s the one generating the values.
So that’s how the serial column “type” works in Postgres. You can use sequences for anything; they’re not necessarily tied to only one table, or to any table at all. Perhaps you create a standalone sequence (CREATE SEQUENCE xyzzy), then use it in some functions. There’s a family of functions for manipulating a sequence’s value manually, so you have complete control over how its values are generated. It’s a really neat mechanism, and I’m looking forward to exploiting it in my own applications.
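A sketch of that family of functions (the values in the comments assume a freshly created sequence):

create sequence xyzzy;
select nextval('xyzzy');      -- 1
select currval('xyzzy');      -- still 1: the last value generated in this session
select setval('xyzzy', 100);  -- the next nextval('xyzzy') will return 101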
Enter PG’OCaml
Now, let’s say we wanted to use this amazing database system in our very own OCaml application. The first thing I did was look for a Debian package called libpostgresql-ocaml-dev, and lo and behold, there’s one! I installed it, and searching for docs for it, I stumbled upon the homepage of PG’OCaml. Normally I eschew the “non-standard” library, but I was pleasantly surprised to find that this was also in Debian: libpgocaml-ocaml-dev! Let’s install that instead.
Its killer feature? PG’OCaml type-checks your SQL statements at compile-time. Its syntax extension makes it a strong, type-safe SQL library: you do no manual (and unsafe) casting of data, and you can’t accidentally pass the wrong type of data while concatenating together a barely-injection-safe query.
The “downside” (if you can call it that) is that PG’OCaml needs access to your database at compile-time to find out its structure. This is rarely an issue. The amazing thing is that it not only makes sure you’re working with the right types, doing all the ugly conversion under the table for you; it also carries the feeling of a safe, type-inferring language right through to your database layer. Make a change to your database that mangles your code without you realising, and you’ll get a compile error when you next build. The caveat, of course, is that you have to recompile to get that knowledge up front; at runtime you’ll still get errors, just like you would with any other database connector.
Let’s begin coding with PG’OCaml:
let report (name,age) =
  Printf.printf "%s is %d years old\n%!" name age

let dbh = PGOCaml.connect ()
in
let results =
  PGSQL (dbh) "select name, age from tblpeople"
in
List.iter report results
This is deceptively simple: we connect to a database (where are the connection settings?!), query it using PGSQL, and then List.iter on the supposedly well-formed and well-typed result?!
The answer, of course, is yes.
To get this running, try compiling it using ocamlfind with a line like this:
ocamlfind ocamlc -linkpkg -package pgocaml.syntax -syntax camlp4o test.ml -o test
Bam! Compile error:
ERROR: 42P01: relation "tblpeople" does not exist
File "test.ml", line 7, characters 2-47:
Camlp4: Uncaught exception: PGOCaml_generic.Make(Thread).PostgreSQL_Error ("ERROR: 42P01: relation \"tblpeople\" does not exist", [(83 | CstTag84, "ERROR"); (67 | CstTag68, "42P01"); (77 | CstTag78, "relation \"tblpeople\" does not exist"); (80 | CstTag81, "23"); (70 | CstTag71, "parse_relation.c"); (76 | CstTag77, "885"); (82 | CstTag83, "parserOpenTable")])
File "test.ml", line 1, characters 0-1:
Error: Preprocessor error
There’s a lengthy error courtesy of camlp4, but the first line is the important one; the table doesn’t exist! We’re informed at compile-time. Saviour!
Let’s fix this in psql:
celtic=# create table tblpeople (name varchar primary key not null, age int not null);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "tblpeople_pkey" for table "tblpeople"
CREATE TABLE
celtic=# insert into tblpeople values ('Mysterious', 21), ('Anneli', 20), ('Bjoerk', 45);
INSERT 0 3
celtic=#
Compiling again gives us a slightly different error:
File "test.ml", line 9, characters 17-24:
Error: This expression has type
(string * int32) list PGOCaml.monad PGOCaml.monad =
(string * int32) list
but an expression was expected of type (string * int) list
Well, you can’t get everything right the first time. We defined tblpeople.age as an int in Postgres, which is 4 bytes long. An OCaml int, on the other hand, is either 31 or 63 bits long (depending on your machine architecture) due to its tagged pointer representation, and PG’OCaml will not silently sacrifice the accuracy.
Since we’re not concerned about the last bit (we hope no one should live that long ..), just fix the report function:
let report (name,age) =
  Printf.printf "%s is %d years old\n%!" name (Int32.to_int age)
(Edit: as Eric pointed out in the comments, an easier way is to just print the int32 itself using %ld in the format string!)
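That version would look something like this:

let report (name,age) =
  Printf.printf "%s is %ld years old\n%!" name age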
Recompiling gives great success. Now run:
(celtic) pgo:6427 (0)% ./test
Mysterious is 21 years old
Anneli is 20 years old
Bjoerk is 45 years old
(celtic) pgo:6428 (0)%
Just like that, it works! The types were all inferred, the database connection was smooth.. it all just worked.
PG’OCaml details
Connection strings, connection options
PG’OCaml looks to your environment to determine which database to use, both when compiling and when running. The environment variables are the same ones used by the Postgres tools themselves, which is a great help: set PGDATABASE, PGHOST, PGUSER, etc. if need be.
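For example, to point the compile-time checking at a particular database, something like this should do (reusing the compile line from earlier):

PGDATABASE=celtic ocamlfind ocamlc -linkpkg -package pgocaml.syntax -syntax camlp4o test.ml -o test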
You can also override options manually in the code itself; they’re specified separately (and possibly redundantly) for the compile-time stage and for run-time.
For compile-time, you can set options by using connection strings just before a query itself:
let results =
  PGSQL (dbh) "database=mydb" "user=jenkins" "select name, age from tblpeople"
These are honoured at compile-time when type-checking (and type-determining, I suppose) each individual PG’OCaml expression, and override environment variable settings.
These are not consulted at run-time; indeed, at run-time the information is relevant and necessary only once, when connecting. For that reason, the PGOCaml.connect function has this type signature:
val connect : ?host:string ->
              ?port:int ->
              ?user:string ->
              ?password:string ->
              ?database:string ->
              ?unix_domain_socket_dir:string ->
              unit ->
              'a t monad
Thus, to have the program use the same (forced) connection settings at runtime:
let dbh = PGOCaml.connect ~user:"jenkins" ~database:"mydb" ()
There’s obviously an advantage to these being separate, as it means computed/inputted values can be used for the connection at runtime (i.e. in production), a concept which makes little sense at compile-time. I tend to use environment variables to control the compile-time settings.
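A sketch of that idea (the command-line handling here is entirely made up):

let dbh =
  (* hypothetical: the database name arrives as the first argument *)
  let database = Sys.argv.(1) in
  PGOCaml.connect ~database ()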
PGOCaml
The module name in OCaml is PGOCaml, not PGOcaml. This tripped me up.
Parameterised fields
They’re easy! (surprise!) Let’s say you have a value (id : int) which you want to match on an integer column in the database:
let result =
  PGSQL (dbh) "SELECT data FROM tblentities WHERE entityid=$id"
It’s that simple! Right? Wrong! OCaml’s int and Postgres’s integer don’t mix! Simply fixed, however:
let pid = Int32.of_int id
in
let result =
  PGSQL (dbh) "SELECT data FROM tblentities WHERE entityid=$pid"
It’s even better, though, if you just agree to use int32s everywhere else!
The exact same syntax works for INSERTs, too. There’s also a nifty feature: use $?name instead of $name, and it’ll type as an option type. Some x will become x (converted appropriately), but None will render as NULL. Nifty!
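A sketch of that (the nickname column here is hypothetical; it isn’t in the schema created above):

let insert_person dbh name age nickname =
  (* nickname : string option; passing None stores a NULL *)
  PGSQL (dbh) "INSERT INTO tblpeople (name, age, nickname)
               VALUES ($name, $age, $?nickname)"

So insert_person dbh "Bjoerk" 45l None would leave the nickname NULL, while Some "Bee" would store the string.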
For more reading on PG’OCaml
See the PG’OCaml website, and this informative tutorial by Dario Teixeira. I strongly suggest you read both, as there’s a lot more to both PostgreSQL and PG’OCaml.