psycopg3 and adaptation choices

2020-11-07

The psycopg3 project is currently under active development.

Sponsoring this project will help achieve swift completion and ensure the maintenance of psycopg2, psycopg3 and other related projects.

If you use Python and PostgreSQL, and you would like to support the creation of the most advanced adapter between the two systems, please consider becoming a sponsor. Thank you!

One of the main behavioural differences between psycopg2 and 3, and a big enought cause of incompatibility to require a "3", is the use of server-side binding vs. client-side (psycopg2 predates the FE-BE protocol supporting it). Passing Python objects to the server as separate arguments leads sometimes to a behaviour of the server different to what people are used to using psql and psycopg2. An example: what do you think

cursor.execute("select %s, %s", ["hello", 10])

should fetch? I assume it would be desirable to have a record ("hello", 10) back - consistently to what you would return if you merged argument client-side and queried for select 'hello', 10. However keeping this behaviour predictable proved already tricky. When passing arguments to merge server-side, the client must pass their string (or binary) representation and, optionally, the oid of their types. The most intuitive thing to do would be to associate the text type oid (25) to Python strings. This works well for queries like the above one, but it proves too strict a type definition in context where a cast from text is not allowed. For instance this will fail:

cursor.execute("create table testjson(data jsonb)")
cursor.execute("insert into testjson (data) values (%s)", ["{}"])

The latter will raise the same exception you would get in psql if you execute insert into testjson (data) values ('{}'::text). The reason why a values ('{}') works as expected in psql is because the untyped literal is interpreted as unknown, and the server can always cast from unknown to any other type. If we wanted to insert the data above the query should be rewritten as:

cursor.execute("insert into testjson (data) values (%s::jsonb)", ["{}"])

About a week ago I started the task to write a Django backend for psycopg3: running its test suite is proving a great way to observe the behaviour of server-side binding in a plethora of different environments. The problem of the non-implicit cast from text to jsonb made porting the backend from psycopg2 to 3 quite a tricky matter.

In order to provide an experience more similar to the use of psql and of psycopg2, one possibility is to not pass an oid type for the strings, leaving it unknown. When I tried with that approach, passing strings to jsonb fields (which is not necessarily what you would do, but it's what Django does, probably to uniform JSON dumping across different adapter) became immediately much simpler. So it seems a good choice, but not an overall win: our "hello world" query:

cursor.execute("select %s, %s", ["hello", 10])

fails with PostgreSQL 9.6 and older versions, returning an error such as could not determine data type of parameter $1. PostgreSQL 10 and following are more permissive and convert unknown to text on output.

It seems like there is a choice to make here: mine at the moment is to keep the latter behaviour, both because that's what allows to write the simpler code with the majority of the supported PostgreSQL versions. People using 9.6 and previous version would need to adjust to:

cursor.execute("select %s::text, %s", ["hello", 10])

but the use case of inserting data into tables seems a more important use case than parroting back a string after a database round-trip.

Choices to cast Python str type:

To unknown:

  • pro: passed without casting to most data types and functions
  • con: requires a cast to roundtrip back with PostgreSQL 9.6 and previous

To text:

  • pro: roundtrips with PostgreSQL 9.6 and previous
  • con: may require several casts for most of data type, both in insert and functions

unknown seems the winner.


What about the numbers? That's a problem thorny in a different way. Python integers are of arbitrary size, so they map better to a subset of the numeric type than to int4 or int8. However there is a similar problem, where seemingly familiar expression don't behave as expected. If this works in psql:

select data -> 'a' -> 2 from testjson;

a similar statement with 2 as a parameter will fail with an error like operator does not exist: jsonb -> numeric. Passing unknown as oid would work, but now our greeting would return instead ('hello', '10'), with the number converted to string. This is probably too much of a departure from the expected behaviour; at least, unlike the test-to-json case, there is an automatic cast from numeric to integer, so an insert into integer_field works as expected. Only operators and functions whose arguments are declared integer and there is no numeric version trip into an error: JSON's ->, binary operators such as >> and &, and few others:

piro=# select current_date, current_date + 1;
 current_date |  ?column?
---------------+------------
 2020-11-07   | 2020-11-08
(1 row)

piro=# select current_date + 1::numeric;
ERROR:  operator does not exist: date + numeric
LINE 1: select current_date + 1::numeric;
                            ^
HINT:  No operator matches the given name and argument types. You might
need to add explicit type casts.

Note that when Postgres says integer it means 4 bytes signed: defaulting the Python int to the bigint oid solves no problem - the same functions would fail the same way, and defaulting it to int4 I feel it would overflow too easily in a 64 bits world.

How to fix that, if you are an end user? As the hint suggests, the user would have to either add a cast to the query, which would look like:

cur.execute("select current_date + %s::int", [offset])

or to specify an object that would get converted in Postgres to the oid of integer: in psycopg3 I'm testing with adding subclasses of int called Int4, Int8 etc. to map more closely on the Postgres types:

cur.execute("select current_date + %s", [Int4(offset)])

Choices to cast Python int type:

To unknown:

  • +1: can express the full range of the unbounded Python int
  • +2: can be passed without cast to most data types and functions
  • -2: it round-trips back to string, error on PostgreSQL < 10

To numeric:

  • +1: can express the full range of the unbounded Python int
  • +1: can be passed without cast to most data types
  • -1: requires a cast for some functions
  • -1: it round-trips to 'Decimal'
  • +0.5: it can round-trip back to int, with additional care - more to follow

To int8:

  • -0.5: can't express values not fitting in 64 bits (relatively limited use case: if the target column is numeric then it would be wise for the user to pass a Decimal)
  • +1: can be passed without cast to most data types
  • -1: requires a cast for some functions
  • +1: it round-trips back to int

To int4:

  • -1: limited range: couldn't be used to pass a value not fitting into 32 bytes from Python to a bigint.
  • +2: can be passed without cast to most data types and functions
  • +1: it round-trips back to int

Winner: probably numeric, with int8 runner-up?

What about the observation on numeric? If we dump int -> numeric to the db, and load back numeric -> Decimal from it, we end up with integers round-tripping to Decimal, which could easily create errors in Python contexts which are not ready to deal with fixed-point arithmetic. However, upon receiving a numeric from the database, we can check what number it is: if it has no decimal digit it can be returned to Python as int, if it has any decimal digit it must be returned as Decimal.

This mechanism can be as trivial as looking if there is a '.' in the data returned by the database; however in many cases the is job made simpler (or at least more efficient) by the presence of the numeric modifiers: numeric comes in three flavours:

  • numeric: arbitrary precision (number of digits) and scale (number of digits after the decimal point),
  • numeric(n): limited precision, no digit after the decimal point (equivalent to numeric(n, 0)),
  • numeric(n, m): limited precision, fixed number of digits after the decimal point.

The modifier information is returned in many contexts (among which the most important: selecting data from tables): if we know the scale we can decide upfront to load the entire column as int if the scale is 0, Decimal otherwise. If the modifier is unknown we can look at the presence of the dot.

Returning an int in a context where Decimal are expected doesn't seem a big problem: throwing an integer to a fixed-point calculation doesn't wreak havoc has it does throwing a Decimal in a context of flaot calculations. Running the entire Django test suite after this change caused no test to fail, which gives me some comfort.


Being the adaptation system flexible, users are able to override these choices, but I don't expect many people wanting to do it, and doing it process-wise might cause interoperation problems across libraries. The behaviour out-of-the-box is obviously important and I would like to get the trade-offs right.

What do you think?

Note

This article was modified after experimenting with the numeric to int strategy. Thank you for rubber-ducking. 🦆