I found myself in a situation where I had to check a few million lines of web logs in Parquet files, and I had already seen the rise of DuckDB in a few articles.
DuckDB is a high-performance, embedded analytical SQL database. It is designed for fast and efficient querying of large datasets, including those stored in file formats like Parquet, which DuckDB can read natively, without any conversion step.
It is time to try it out!
Installing DuckDB works as expected; there is an installer for most platforms:
brew install duckdb
Version:
❯ duckdb --version
v0.7.1 b00b93f0b1
The command-line interface is pretty similar to sqlite3, because DuckDB's shell builds on top of the SQLite shell.
❯ duckdb
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D
The files are located in a single folder, one file for each day.
❯ exa -ah *.parq
2023-03-01.gz.parq 2023-03-13.gz.parq
2023-03-02.gz.parq 2023-03-14.gz.parq
2023-03-03.gz.parq 2023-03-15.gz.parq
2023-03-04.gz.parq 2023-03-16.gz.parq
2023-03-05.gz.parq 2023-03-17.gz.parq
2023-03-06.gz.parq 2023-03-18.gz.parq
2023-03-07.gz.parq 2023-03-19.gz.parq
2023-03-08.gz.parq 2023-03-20.gz.parq
2023-03-09.gz.parq 2023-03-21.gz.parq
2023-03-10.gz.parq 2023-03-22.gz.parq
2023-03-11.gz.parq 2023-03-23.gz.parq
2023-03-12.gz.parq 2023-03-24.gz.parq
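Before querying all of them at once, it is easy to sanity-check a single day. A small sketch: read_parquet does not care about the .parq extension, and it should also accept an explicit list of files:

-- row count for a single day's file
SELECT COUNT(*) FROM read_parquet('2023-03-01.gz.parq');

-- read_parquet should also take an explicit list of files
SELECT COUNT(*) FROM read_parquet(['2023-03-01.gz.parq', '2023-03-02.gz.parq']);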
Let's just get the names and types of the fields:
D DESCRIBE SELECT * FROM read_parquet('*.gz.parq');
┌─────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ iso-datetime │ VARCHAR │ YES │ │ │ │
│ x-edge-location │ VARCHAR │ YES │ │ │ │
│ sc-bytes │ BIGINT │ YES │ │ │ │
│ c-ip │ VARCHAR │ YES │ │ │ │
│ cs-method │ VARCHAR │ YES │ │ │ │
│ cs-host │ VARCHAR │ YES │ │ │ │
│ cs-uri-stem │ VARCHAR │ YES │ │ │ │
│ sc-status │ VARCHAR │ YES │ │ │ │
│ cs-referer │ VARCHAR │ YES │ │ │ │
│ cs-user-agent │ VARCHAR │ YES │ │ │ │
│ cs-uri-query │ VARCHAR │ YES │ │ │ │
│ cs-cookie │ VARCHAR │ YES │ │ │ │
│ x-edge-result-type │ VARCHAR │ YES │ │ │ │
│ x-edge-request-id │ VARCHAR │ YES │ │ │ │
│ x-host-header │ VARCHAR │ YES │ │ │ │
│ cs-protocol │ VARCHAR │ YES │ │ │ │
│ cs-bytes │ BIGINT │ YES │ │ │ │
│ time-taken │ DOUBLE │ YES │ │ │ │
│ x-forwarded-for │ VARCHAR │ YES │ │ │ │
│ ssl-protocol │ VARCHAR │ YES │ │ │ │
│ ssl-cipher │ VARCHAR │ YES │ │ │ │
│ x-edge-response-result-type │ VARCHAR │ YES │ │ │ │
│ cs-protocol-version │ VARCHAR │ YES │ │ │ │
│ fle-status │ VARCHAR │ YES │ │ │ │
│ fle-encrypted-fields │ VARCHAR │ YES │ │ │ │
│ c-port │ VARCHAR │ YES │ │ │ │
│ time-to-first-byte │ DOUBLE │ YES │ │ │ │
│ x-edge-detailed-result-type │ VARCHAR │ YES │ │ │ │
│ sc-content-type │ VARCHAR │ YES │ │ │ │
│ sc-content-len │ BIGINT │ YES │ │ │ │
│ sc-range-start │ VARCHAR │ YES │ │ │ │
│ sc-range-end │ VARCHAR │ YES │ │ │ │
├─────────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 32 rows 6 columns │
└───────────────────────────────────────────────────────────────────────────────────┘
D
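DuckDB can also go one level deeper and show the raw Parquet schema or row-group metadata; if I read the docs right, something along these lines works:

-- low-level Parquet schema of a single file
SELECT * FROM parquet_schema('2023-03-01.gz.parq');

-- per-row-group metadata (row counts, compression, sizes)
SELECT * FROM parquet_metadata('2023-03-01.gz.parq');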
I would like to generate reader statistics for the URLs that start with /posts. DuckDB has amazing support for working with strings beyond the basic SQL features. One way of filtering results is to use a GLOB; a couple of alternatives are shown after the result below.
D SELECT "cs-uri-stem" FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*' LIMIT 10;
┌──────────────────────────────────────────────────────────────────┐
│ cs-uri-stem │
│ varchar │
├──────────────────────────────────────────────────────────────────┤
│ /posts/2021/03/22/using-ldap-in-docker-with-caching/ │
│ /posts/2021/03/22/using-ldap-in-docker-with-caching │
│ /posts/2020/05/08/why-i-chose-fsharp-for-our-aws-lambda-project/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
│ /posts/2023/02/28/using-python-3.11-with-aws-lambda/ │
├──────────────────────────────────────────────────────────────────┤
│ 10 rows │
└──────────────────────────────────────────────────────────────────┘
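GLOB is not the only option; the same prefix filter can be written with a plain LIKE or a regular expression. Two equivalent sketches:

-- LIKE with a trailing % gives the same prefix match
SELECT "cs-uri-stem" FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" LIKE '/posts/%' LIMIT 10;

-- or with a regular expression
SELECT "cs-uri-stem" FROM read_parquet('*.gz.parq')
WHERE regexp_matches("cs-uri-stem", '^/posts/') LIMIT 10;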
I would like the statistics to ignore the trailing /, and I am only interested in the last part of the URL:
SELECT COUNT(1) AS cnt,
split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*'
GROUP BY url
HAVING url <> ''
ORDER BY cnt DESC LIMIT 15;
┌────────┬──────────────────────────────────────────────────────────┐
│ cnt │ url │
│ int64 │ varchar │
├────────┼──────────────────────────────────────────────────────────┤
│ 109023 │ using-llama-with-m1-mac │
│ 4372 │ using-python-3.11-with-aws-lambda │
│ 711 │ misusing-ninja │
│ 287 │ beyond-the-borrow-checker │
│ 142 │ getting-started-with-firecracker-on-raspberry-pi │
│ 135 │ new-tools │
│ 93 │ compressing-data-with-parquet │
│ 90 │ running-asp.net-web-application-with-falco-on-aws-lambda │
│ 89 │ using-ldap-in-docker-with-caching │
│ 83 │ diving-into-firecracker-with-alpine │
│ 80 │ compressing-aws-s3-logs-after-getting-hackernewsed │
│ 30 │ why-i-chose-fsharp-for-our-aws-lambda-project │
│ 25 │ how-long-will-the-worlds-uranium-supplies-last │
│ 23 │ matching-binary-patterns │
│ 18 │ freenas-11.3-upgrade-issues │
├────────┴──────────────────────────────────────────────────────────┤
│ 15 rows 2 columns │
└───────────────────────────────────────────────────────────────────┘
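The magic number 6 in split_part comes from the URL layout: splitting /posts/YYYY/MM/DD/slug on / produces an empty first element, so the slug is the sixth part. A quick check against one of the paths from above:

-- rtrim drops the trailing slash, split_part (1-indexed) picks the slug
SELECT split_part(rtrim('/posts/2023/02/28/using-python-3.11-with-aws-lambda/', '/'), '/', 6) AS url;
-- url = using-python-3.11-with-aws-lambda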
I was wondering what DuckDB does under the hood:
EXPLAIN ...
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ TOP_N │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Top 15 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ count(1) DESC │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ cnt │
│ 1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count_star() │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ url │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│((cs-uri-stem ~~~ '/posts/*│
│ ') AND (COALESCE │
│(array_extract(string_... │
│-uri-stem, '/'), '/'),... │
│ ) != '')) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ cs-uri-stem │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
└───────────────────────────┘
The plan also shows that the Parquet scan only touches the cs-uri-stem column, so none of the other fields are ever read. DuckDB is pretty great at reading Parquet files, but how about writing?
D COPY (SELECT * FROM read_parquet('*.gz.parq'))
TO 'output.parquet' (FORMAT PARQUET, CODEC 'ZSTD');
100% ▕████████████████████████████████████████████████████████████▏
It does not disappoint.
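The same COPY trick can also materialize just the filtered subset into its own file, and reading it back is a quick way to double-check the round trip. A sketch, with posts.parquet as a made-up output name:

COPY (
    SELECT * FROM read_parquet('*.gz.parq')
    WHERE "cs-uri-stem" GLOB '/posts/*'
) TO 'posts.parquet' (FORMAT PARQUET, CODEC 'ZSTD');

-- read it back to confirm the filtered row count
SELECT COUNT(*) FROM read_parquet('posts.parquet');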
What about performance? There is a built-in way to analyze it.
EXPLAIN ANALYZE
COPY (SELECT * FROM read_parquet('*.gz.parq'))
TO 'output.parquet' (FORMAT PARQUET, CODEC 'ZSTD');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 4.91s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
D EXPLAIN ANALYZE
SELECT COUNT(1) AS cnt,
split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*'
GROUP BY url
HAVING url <> ''
ORDER BY cnt DESC LIMIT 15;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE
SELECT
COUNT(1) AS cnt,
split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*'
GROUP BY url
HAVING url <> ''
ORDER BY cnt DESC LIMIT 15;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.0727s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ TOP_N │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Top 15 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ count(1) DESC │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 15 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ cnt │
│ 1 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 46 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ count_star() │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 46 │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ url │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 115262 │
│ (0.07s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│((cs-uri-stem ~~~ '/posts/*│
│ ') AND (COALESCE │
│(array_extract(string_... │
│-uri-stem, '/'), '/'),... │
│ ) != '')) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 115262 │
│ (0.11s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ cs-uri-stem │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 563607 │
│ (0.04s) │
└───────────────────────────┘
This is insane performance for an M1 Mac on an average-sized dataset.
DuckDB is a really nice way of working with this data locally, without too much hassle, and the performance you can get is amazing.