How to use DuckDB’s ST_Read
function to read and convert zipped shapefiles
Today I learnt …
DuckDB is a (relatively) new RDBMS designed for in-process analytical queries, but really it’s useful for all kinds of things. As a database it’s most similar to SQLite, but DuckDB’s killer feature for me is that it comes with a spatial extension. That makes it really approachable for geospatial data processing — think PostgreSQL + PostGIS but without the hassle of running a server.
DuckDB’s spatial extension provides an ST_Read
function that lets you read spatial data from various file formats as if they were database tables. All that’s need to query a GeoJSON file named source.geojson
is:
INSTALL spatial; -- You only need to install it once.
LOAD spatial; -- But you need to load it each session.
SELECT * FROM ST_Read('source.geojson');
Under the hood, ST_Read
uses GDAL, which means DuckDB supports a huge number of vector file formats. It also means that DuckDB supports GDAL’s more esoteric features, such as virtual file systems. The /vsizip/
prefix, for example, lets you SELECT
from a file within a ZIP file, decompressing it on-the-fly. Shapefiles are often distributed as ZIP files (because a ‘shapefile’ is usually several files), and it can be annoying to need two steps to get to the file. But using /vsizip/
lets you query geospatial data directly from a ZIP file. Here’s how you could query a shapefile named bar.shp
within a ZIP file named foo.zip
, removing any features that don’t cover a point in Australia:
SELECT
*
FROM
ST_Read('/vsizip/foo.zip/bar.shp')
WHERE
ST_Covers(geom, ST_Point(128.28, -21.69));
DuckDB’s spatial extension includes a COPY
function to export geospatial tables to different formats. This works on ST_Read
virtual tables, which means you can convert a zipped shapefile to, for example, a GeoJSON file in a single query:
COPY ⟨
SELECT * FROM ST_Read('/vsizip/foo.zip/bar.shp')
⟩
TO 'some_file.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON');
And of course the SELECT
sub-query can be as complicated as you need it to be, filtering out data as you go.