Skip to main content

Introduction

This document is an attempt to do some typical table data manipulation using PRQL and DuckDB.

PRQL is a in development modern language for data manipulation, and can be compiled to SQL. DuckDB is a modern in-memory SQL OLAP (very fast) database management system.

Motivation

CLI, JavaScript, Python, R, etc., we can combine PRQL compiler and DuckDB in various places to manipulate data with PRQL.

At the same time, new features are being added to DuckDB’s SQL on a constant basis, and the syntax is becoming more friendly.

So, I thought that by comparing PRQL and the latest DuckDB SQL (and other query libraries), we could clarify the missing features of the PRQL and highlight the latest features of DuckDB.

Content and Credit

The content of each data manipulation is based on the awesome Modern Polars book by Kevin Heavey. And furthermore its origin is Modern Pandas by Tom Augsburger.

Each query is side-by-side with that of the original Python Polars one and the equivalent operations in PRQL, DuckDB SQL, and dplyr (R).

Since this document focuses on the differences between the grammars of the different languages and does not go into detail about the processing details, I suggest that you also check out the original excellent artcles and book.

important

The versions we are currently using are as follows:

  • PRQL: 0.13.0
  • DuckDB: v1.1.0

Running the Code on IPython

As mentioned above, PRQL and DuckDB can run on a variety of languages. This section describes, as an example, the use of IPython for easy interactive execution of PRQL, SQL, and Polars.

Setup

This command installs the necessary Python packages.

Terminal
python -m pip install pyprql polars[pyarrow]

Then, we can start ipython.

Run load_ext magic to activate PRQL magic (from pyprql) and SQL magic (from jupysql).

IPython
%load_ext pyprql.magic
%load_ext sql

Use SQL magic connects to DuckDB in-memory database. Note that PRQL magic shares the connection with SQL magic.

IPython
%sql duckdb://

To fix the SQL dialect used inside PRQL magic, we need to set the target with config magic. By setting "sql.duckdb" here, we can specify to the PRQL compiler that it should generate optimized SQL for DuckDB.

IPython
%config PrqlMagic.target = "sql.duckdb"

All available targets can be found in the prqlc.get_targets function.

IPython
import prqlc

print(prqlc.get_targets())
['sql.any', 'sql.ansi', 'sql.bigquery', 'sql.clickhouse', 'sql.duckdb', 'sql.generic', 'sql.glaredb', 'sql.mssql', 'sql.mysql', 'sql.postgres', 'sql.sqlite', 'sql.snowflake']

Combine DuckDB and Polars

By setting autopolars config, the results of PRQL and SQL executions can be converted to polars.DataFrame.

IPython
%config PrqlMagic.autopolars = True
%config SqlMagic.autopolars = True

Also, since DuckDB can execute queries against polars.DataFrame and polars.LazyFrame etc., these can be referenced directly from PRQL or SQL, as shown below.

IPython
import polars as pl

lf = pl.LazyFrame({"a": list(range(1, 6))})
important

With JupySQL 0.10.14 and DuckDB 1.1, the following setting is needed to scan all frames in Python.

IPython
%%sql
set python_scan_all_frames=true
IPython
%%prql
from lf
derive b = a * 5
take 3
shape: (3, 2)
a b
i64 i64
1 5
2 10
3 15

License

This website’s content is licensed under the MIT license.