Skip to main content

Define Functions

note

This page is based on the chapter “Method Chaining” of the Modern Polars book.

To eliminate the need to copy and paste the same code, it is convenient to make the process a function.

PRQL can chain functions by | (pipe operator), DuckDB SQL can chain functions by . (DuckDB >= 0.8, duckdb/duckdb#6725), and R can chain functions by |> (pipe operator, R >= 4.1). They are defined so that the previous value in the chain becomes the first (DuckDB SQL and R) or the last (PRQL) parameter of the function.

In Python, we can chain methods using ..

Preparing Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (zipped CSV file) and write the data to a Parquet file.

This document uses R here, but we can do it in another language, or, manually download and unzip and create the Parquet file (with DuckDB CLI).

Code
# Create "data" directory, download the zip file into the directory, and create a Parquet file.
data_dir <- "data"
dest <- file.path(data_dir, "flights.csv.zip")
csv_name <- "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
csv_path <- file.path(data_dir, csv_name)
parquet_path <- file.path(data_dir, "flights.parquet")

if (!fs::file_exists(parquet_path)) {
if (!fs::file_exists(dest)) {
fs::dir_create(data_dir)
curl::curl_download(
"https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
dest,
handle = curl::new_handle(ssl_verifypeer = FALSE)
)
}

unzip(dest, csv_name, exdir = data_dir)
duckdb:::sql(glue::glue("COPY (FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET)"))
}

Load the Data

After the Parquet file is ready, load it into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

CREATE TABLE tab AS SELECT * FROM 'data/flights.parquet'
FROM tab LIMIT 5
YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOutWheelsOffWheelsOnTaxiInCRSArrTimeArrTimeArrDelayArrDelayMinutesArrDel15ArrivalDelayGroupsArrTimeBlkCancelledCancellationCodeDivertedCRSElapsedTimeActualElapsedTimeAirTimeFlightsDistanceDistanceGroupCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayFirstDepTimeTotalAddGTimeLongestAddGTimeDivAirportLandingsDivReachedDestDivActualElapsedTimeDivArrDelayDivDistanceDiv1AirportDiv1AirportIDDiv1AirportSeqIDDiv1WheelsOnDiv1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumcolumn109
2022111452022-01-14YX20452YXN119HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241221-300-11200-1259281249135241352135644001300-13590088956313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111562022-01-15YX20452YXN122HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241214-1000-11200-12591912331323513521328-2400-21300-13590088745013232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111672022-01-16YX20452YXN412YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-125916123413271213521339-1300-11300-13590088815313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111712022-01-17YX20452YXN405YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241217-700-11200-12593212491345161352140199001300-135900881045613232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111822022-01-18YX20452YXN420YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-12591112291317613521323-2900-21300-13590088654813232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA

Functions

Define some functions and see if they work.

important

PRQL currently does not allow functions to be stored in a separate session or file, and they must always be defined with the main query. (PRQL/prql#1803)

Extract City Names

PRQL does not currently have the ability to apply a function to multiple columns at once. (PRQL/prql#2386)

The others can apply a function to multiple columns at once, but DuckDB SQL differs in that the column names are updated and that behavior cannot be prevented.

let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"

from tab
select {
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName
}
take 5
OriginCityNameDestCityName
ColumbusWashington
ColumbusWashington
ColumbusWashington
ColumbusWashington
ColumbusWashington

Timestamp Calculation

PRQL and SQL can only define functions that return a single column.

The column name (here FlightDate) cannot be used in the function definition of DuckDB SQL.

note

::VARCHARs in TRY_CAST are workarounds for a bug in DuckDB 0.9.1. (duckdb/duckdb#9380)

let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}::VARCHAR, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}::VARCHAR, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""

from tab
select {
FlightDate,
DepTimeOld = DepTime
}
derive {
DepTime = (time_to_datetime DepTimeOld)
}
take 5
FlightDateDepTimeOldDepTime
2022-01-1412212022-01-14 12:21:00
2022-01-1512142022-01-15 12:14:00
2022-01-1612182022-01-16 12:18:00
2022-01-1712172022-01-17 12:17:00
2022-01-1812182022-01-18 12:18:00

Use Functions in the Query

dplyr and Polars allow column names to be predefined as vectors or lists, which can then be referenced in the query.

let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"

let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}::VARCHAR, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}::VARCHAR, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""

from tab
select {
Dest,
Tail_Number,
IATA_CODE_Reporting_Airline,
CancellationCode,
DepTime = time_to_datetime DepTime,
ArrTime = time_to_datetime ArrTime,
CRSArrTime = time_to_datetime CRSArrTime,
CRSDepTime = time_to_datetime CRSDepTime,
FlightDate,
Flight_Number_Reporting_Airline,
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName,
Origin,
DepDelay
}
take 5
DestTail_NumberIATA_CODE_Reporting_AirlineCancellationCodeDepTimeArrTimeCRSArrTimeCRSDepTimeFlightDateFlight_Number_Reporting_AirlineOriginCityNameDestCityNameOriginDepDelay
DCAN119HQYX2022-01-14 12:21:002022-01-14 13:56:002022-01-14 13:52:002022-01-14 12:24:002022-01-144879ColumbusWashingtonCMH-3
DCAN122HQYX2022-01-15 12:14:002022-01-15 13:28:002022-01-15 13:52:002022-01-15 12:24:002022-01-154879ColumbusWashingtonCMH-10
DCAN412YXYX2022-01-16 12:18:002022-01-16 13:39:002022-01-16 13:52:002022-01-16 12:24:002022-01-164879ColumbusWashingtonCMH-6
DCAN405YXYX2022-01-17 12:17:002022-01-17 14:01:002022-01-17 13:52:002022-01-17 12:24:002022-01-174879ColumbusWashingtonCMH-7
DCAN420YXYX2022-01-18 12:18:002022-01-18 13:23:002022-01-18 13:52:002022-01-18 12:24:002022-01-184879ColumbusWashingtonCMH-6