Skip to main content

Define Functions

note

This page is based on the chapter “Method Chaining” of the Modern Polars book.

To eliminate the need to copy and paste the same code, it is convenient to make the process a function.

PRQL can chain functions by | (pipe operator), DuckDB SQL can chain functions by . (DuckDB >= 0.8, duckdb/duckdb#6725), and R can chain functions by |> (pipe operator, R >= 4.1). They are defined so that the previous value in the chain becomes the first (DuckDB SQL and R) or the last (PRQL) parameter of the function.

In Python, we can chain methods using ..

Preparing Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (zipped CSV file) and write the data to a Parquet file.

This document uses R here, but we can do it in another language, or, manually download and unzip and create the Parquet file (with DuckDB CLI).

Code
R
# Create "data" directory, download the zip file into the directory, and create a Parquet file.
data_dir <- "data"
dest <- file.path(data_dir, "flights.csv.zip")
csv_name <- "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
csv_path <- file.path(data_dir, csv_name)
parquet_path <- file.path(data_dir, "flights.parquet")

if (!fs::file_exists(parquet_path)) {
if (!fs::file_exists(dest)) {
fs::dir_create(data_dir)
curl::curl_download(
"https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
dest,
handle = curl::new_handle(ssl_verifypeer = FALSE)
)
}

unzip(dest, csv_name, exdir = data_dir)
duckdb:::sql(glue::glue("COPY (FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET)"))
}

Load the Data

After the Parquet file is ready, load it into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

SQL
CREATE TABLE tab AS SELECT * FROM 'data/flights.parquet'
SQL
FROM tab LIMIT 5
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number Flight_Number_Reporting_Airline OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCityName OriginState OriginStateFips OriginStateName OriginWac DestAirportID DestAirportSeqID DestCityMarketID Dest DestCityName DestState DestStateFips DestStateName DestWac CRSDepTime DepTime DepDelay DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk TaxiOut WheelsOff WheelsOn TaxiIn CRSArrTime ArrTime ArrDelay ArrDelayMinutes ArrDel15 ArrivalDelayGroups ArrTimeBlk Cancelled CancellationCode Diverted CRSElapsedTime ActualElapsedTime AirTime Flights Distance DistanceGroup CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay FirstDepTime TotalAddGTime LongestAddGTime DivAirportLandings DivReachedDest DivActualElapsedTime DivArrDelay DivDistance Div1Airport Div1AirportID Div1AirportSeqID Div1WheelsOn Div1TotalGTime Div1LongestGTime Div1WheelsOff Div1TailNum Div2Airport Div2AirportID Div2AirportSeqID Div2WheelsOn Div2TotalGTime Div2LongestGTime Div2WheelsOff Div2TailNum Div3Airport Div3AirportID Div3AirportSeqID Div3WheelsOn Div3TotalGTime Div3LongestGTime Div3WheelsOff Div3TailNum Div4Airport Div4AirportID Div4AirportSeqID Div4WheelsOn Div4TotalGTime Div4LongestGTime Div4WheelsOff Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum column109
2022 1 1 14 5 2022-01-14 YX 20452 YX N119HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1221 -3 0 0 -1 1200-1259 28 1249 1352 4 1352 1356 4 4 0 0 1300-1359 0 0 88 95 63 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 15 6 2022-01-15 YX 20452 YX N122HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1214 -10 0 0 -1 1200-1259 19 1233 1323 5 1352 1328 -24 0 0 -2 1300-1359 0 0 88 74 50 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 16 7 2022-01-16 YX 20452 YX N412YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 16 1234 1327 12 1352 1339 -13 0 0 -1 1300-1359 0 0 88 81 53 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 17 1 2022-01-17 YX 20452 YX N405YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1217 -7 0 0 -1 1200-1259 32 1249 1345 16 1352 1401 9 9 0 0 1300-1359 0 0 88 104 56 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 18 2 2022-01-18 YX 20452 YX N420YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 11 1229 1317 6 1352 1323 -29 0 0 -2 1300-1359 0 0 88 65 48 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Functions

Define some functions and see if they work.

important

PRQL currently does not allow functions to be stored in a separate session or file, and they must always be defined with the main query. (PRQL/prql#1803)

Extract City Names

PRQL does not currently have the ability to apply a function to multiple columns at once. (PRQL/prql#2386)

The others can apply a function to multiple columns at once, but DuckDB SQL differs in that the column names are updated and that behavior cannot be prevented.

PRQL
let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"

from tab
select {
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName
}
take 5
OriginCityName DestCityName
Columbus Washington
Columbus Washington
Columbus Washington
Columbus Washington
Columbus Washington

Timestamp Calculation

PRQL and SQL can only define functions that return a single column.

The column name (here FlightDate) cannot be used in the function definition of DuckDB SQL.

PRQL
let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""

from tab
select {
FlightDate,
DepTimeOld = DepTime
}
derive {
DepTime = (time_to_datetime DepTimeOld)
}
take 5
FlightDate DepTimeOld DepTime
2022-01-14 1221 2022-01-14 12:21:00
2022-01-15 1214 2022-01-15 12:14:00
2022-01-16 1218 2022-01-16 12:18:00
2022-01-17 1217 2022-01-17 12:17:00
2022-01-18 1218 2022-01-18 12:18:00

Use Functions in the Query

dplyr and Polars allow column names to be predefined as vectors or lists, which can then be referenced in the query.

PRQL
let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"

let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""

from tab
select {
Dest,
Tail_Number,
IATA_CODE_Reporting_Airline,
CancellationCode,
DepTime = time_to_datetime DepTime,
ArrTime = time_to_datetime ArrTime,
CRSArrTime = time_to_datetime CRSArrTime,
CRSDepTime = time_to_datetime CRSDepTime,
FlightDate,
Flight_Number_Reporting_Airline,
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName,
Origin,
DepDelay
}
take 5
Dest Tail_Number IATA_CODE_Reporting_Airline CancellationCode DepTime ArrTime CRSArrTime CRSDepTime FlightDate Flight_Number_Reporting_Airline OriginCityName DestCityName Origin DepDelay
DCA N119HQ YX 2022-01-14 12:21:00 2022-01-14 13:56:00 2022-01-14 13:52:00 2022-01-14 12:24:00 2022-01-14 4879 Columbus Washington CMH -3
DCA N122HQ YX 2022-01-15 12:14:00 2022-01-15 13:28:00 2022-01-15 13:52:00 2022-01-15 12:24:00 2022-01-15 4879 Columbus Washington CMH -10
DCA N412YX YX 2022-01-16 12:18:00 2022-01-16 13:39:00 2022-01-16 13:52:00 2022-01-16 12:24:00 2022-01-16 4879 Columbus Washington CMH -6
DCA N405YX YX 2022-01-17 12:17:00 2022-01-17 14:01:00 2022-01-17 13:52:00 2022-01-17 12:24:00 2022-01-17 4879 Columbus Washington CMH -7
DCA N420YX YX 2022-01-18 12:18:00 2022-01-18 13:23:00 2022-01-18 13:52:00 2022-01-18 12:24:00 2022-01-18 4879 Columbus Washington CMH -6