How fast DuckDB can query Parquet files?

2023/03/26

#duckdb #query #parquet #sql

Table of contents

LLaMA

How fast DuckDB can query Parquet files?

I found mysqlf in the situation that I had to check few million lines of web logs in Parquet files and I have seen the raise of DuckDB in few articles already.

DuckDB is a high-performance, embedded analytical SQL database. It is designed to support efficient and fast querying of large datasets, including those stored in various file formats like Parquet. DuckDB can read Parquet files natively, without the need for conversion.

It is time to try it out!

Getting started

Installing DuckDB works as expected, there is an installer for most platforms:

brew install duckdb

Version:

❯ duckdb --version
v0.7.1 b00b93f0b1

Querying a bunch of files

The user interface is pretty similar to Sqlite3, because DuckDB builds on the top os Sqlite.

❯ duckdb
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

The files are located in a folder. We have a file for each day.

❯ exa -ah *.parq
2023-03-01.gz.parq  2023-03-13.gz.parq
2023-03-02.gz.parq  2023-03-14.gz.parq
2023-03-03.gz.parq  2023-03-15.gz.parq
2023-03-04.gz.parq  2023-03-16.gz.parq
2023-03-05.gz.parq  2023-03-17.gz.parq
2023-03-06.gz.parq  2023-03-18.gz.parq
2023-03-07.gz.parq  2023-03-19.gz.parq
2023-03-08.gz.parq  2023-03-20.gz.parq
2023-03-09.gz.parq  2023-03-21.gz.parq
2023-03-10.gz.parq  2023-03-22.gz.parq
2023-03-11.gz.parq  2023-03-23.gz.parq
2023-03-12.gz.parq  2023-03-24.gz.parq

Lets just get the name and type of the fields:

D DESCRIBE SELECT * FROM read_parquet('*.gz.parq');
┌─────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
         column_name          column_type   null      key    default   extra  
           varchar              varchar    varchar  varchar  varchar  varchar 
├─────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
 iso-datetime                 VARCHAR      YES                                
 x-edge-location              VARCHAR      YES                                
 sc-bytes                     BIGINT       YES                                
 c-ip                         VARCHAR      YES                                
 cs-method                    VARCHAR      YES                                
 cs-host                      VARCHAR      YES                                
 cs-uri-stem                  VARCHAR      YES                                
 sc-status                    VARCHAR      YES                                
 cs-referer                   VARCHAR      YES                                
 cs-user-agent                VARCHAR      YES                                
 cs-uri-query                 VARCHAR      YES                                
 cs-cookie                    VARCHAR      YES                                
 x-edge-result-type           VARCHAR      YES                                
 x-edge-request-id            VARCHAR      YES                                
 x-host-header                VARCHAR      YES                                
 cs-protocol                  VARCHAR      YES                                
 cs-bytes                     BIGINT       YES                                
 time-taken                   DOUBLE       YES                                
 x-forwarded-for              VARCHAR      YES                                
 ssl-protocol                 VARCHAR      YES                                
 ssl-cipher                   VARCHAR      YES                                
 x-edge-response-result-type  VARCHAR      YES                                
 cs-protocol-version          VARCHAR      YES                                
 fle-status                   VARCHAR      YES                                
 fle-encrypted-fields         VARCHAR      YES                                
 c-port                       VARCHAR      YES                                
 time-to-first-byte           DOUBLE       YES                                
 x-edge-detailed-result-type  VARCHAR      YES                                
 sc-content-type              VARCHAR      YES                                
 sc-content-len               BIGINT       YES                                
 sc-range-start               VARCHAR      YES                                
 sc-range-end                 VARCHAR      YES                                
├─────────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
 32 rows                                                                 6 columns 
└───────────────────────────────────────────────────────────────────────────────────┘
D

I would like to generate reader statistics by the URLs that start with posts. DuckDB has amazing support for working with strings beyond the basic SQL features. One way of filtering results is to use a GLOB.

D SELECT "cs-uri-stem" FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*' LIMIT 10;
┌──────────────────────────────────────────────────────────────────┐
                           cs-uri-stem                            
                             varchar                              
├──────────────────────────────────────────────────────────────────┤
 /posts/2021/03/22/using-ldap-in-docker-with-caching/             
 /posts/2021/03/22/using-ldap-in-docker-with-caching              
 /posts/2020/05/08/why-i-chose-fsharp-for-our-aws-lambda-project/ 
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
 /posts/2023/02/28/using-python-3.11-with-aws-lambda              
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
 /posts/2023/02/28/using-python-3.11-with-aws-lambda/             
├──────────────────────────────────────────────────────────────────┤
                             10 rows                              
└──────────────────────────────────────────────────────────────────┘

I would like to have the statistics so that the closing / does not matter and I am only interested in the last bit of the URL"

SELECT COUNT(1) AS cnt,
split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*'
GROUP BY url
HAVING url <> ''
ORDER BY cnt DESC LIMIT 15;

┌────────┬──────────────────────────────────────────────────────────┐
  cnt                              url                            
 int64                           varchar                          
├────────┼──────────────────────────────────────────────────────────┤
 109023  using-llama-with-m1-mac                                  
   4372  using-python-3.11-with-aws-lambda                        
    711  misusing-ninja                                           
    287  beyond-the-borrow-checker                                
    142  getting-started-with-firecracker-on-raspberry-pi         
    135  new-tools                                                
     93  compressing-data-with-parquet                            
     90  running-asp.net-web-application-with-falco-on-aws-lambda 
     89  using-ldap-in-docker-with-caching                        
     83  diving-into-firecracker-with-alpine                      
     80  compressing-aws-s3-logs-after-getting-hackernewsed       
     30  why-i-chose-fsharp-for-our-aws-lambda-project            
     25  how-long-will-the-worlds-uranium-supplies-last           
     23  matching-binary-patterns                                 
     18  freenas-11.3-upgrade-issues                              
├────────┴──────────────────────────────────────────────────────────┤
 15 rows                                                 2 columns 
└───────────────────────────────────────────────────────────────────┘

I was wondering what DuckDB does under the hood:

EXPLAIN ...
┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
           TOP_N           
                
           Top 15          
                
       count(1) DESC       
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
         PROJECTION        
                
            cnt            
             1             
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
       HASH_GROUP_BY       
                
             #0            
        count_star()       
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
         PROJECTION        
                
            url            
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
           FILTER          
                
((cs-uri-stem ~~~ '/posts/*│
│      ') AND (COALESCE     
(array_extract(string_...  
-uri-stem, '/'), '/'),...  
         ) != ''))         
                
           EC: 0           
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
        PARQUET_SCAN       
                
        cs-uri-stem        
                
           EC: 0           
└───────────────────────────┘

DuckDB is pretty great at reading Parquet files but how about writing?

D COPY (SELECT * FROM read_parquet('*.gz.parq'))
TO 'output.parquet' (FORMAT PARQUET, CODEC 'ZSTD');
100% ▕████████████████████████████████████████████████████████████▏

It does not dissapoint.

What about the performance? There is a built in way to analyze performance.

EXPLAIN ANALYZE
COPY (SELECT * FROM read_parquet('*.gz.parq'))
TO 'output.parquet' (FORMAT PARQUET, CODEC 'ZSTD');
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 4.91s         ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
D EXPLAIN ANALYZE
SELECT COUNT(1) AS cnt,
split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
FROM read_parquet('*.gz.parq')
WHERE "cs-uri-stem" GLOB '/posts/*'
GROUP BY url
HAVING url <> ''
ORDER BY cnt
DESC LIMIT 15;

┌─────────────────────────────┐
│┌───────────────────────────┐│
│└───────────────────────────┘│
└─────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE
SELECT
    COUNT(1) AS cnt,
    split_part(rtrim("cs-uri-stem", '/'), '/', 6) AS url
    FROM read_parquet('*.gz.parq')
    WHERE "cs-uri-stem" GLOB '/posts/*'
    GROUP BY url
    HAVING url <> ''
    ORDER BY cnt DESC LIMIT 15;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 0.0727s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
      EXPLAIN_ANALYZE      
                
             0             
          (0.00s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
           TOP_N           
                
           Top 15          
                
       count(1) DESC       
                
             15            
          (0.00s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
         PROJECTION        
                
            cnt            
             1             
                
             46            
          (0.00s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
       HASH_GROUP_BY       
                
             #0            
        count_star()       
                
             46            
          (0.01s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
         PROJECTION        
                
            url            
                
           115262          
          (0.07s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
           FILTER          
                
((cs-uri-stem ~~~ '/posts/*│
│      ') AND (COALESCE     
(array_extract(string_...  
-uri-stem, '/'), '/'),...  
         ) != ''))         
                
           EC: 0           
                
           115262          
          (0.11s)          
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
        PARQUET_SCAN       
                
        cs-uri-stem        
                
           EC: 0           
                
           563607          
          (0.04s)          
└───────────────────────────┘

This is insane performance for a M1 Mac for an average dataset.

It is a really nice way of working with a local tool without too much hassle. The performance that you can get is amazing.