Define Functions
This page is based on the chapter “Method Chaining” of the Modern Polars book.
To eliminate the need to copy and paste the same code, it is convenient to make the process a function.
PRQL can chain functions by |
(pipe operator), DuckDB SQL can chain
functions by .
(DuckDB >= 0.8,
duckdb/duckdb#6725), and R
can chain functions by |>
(pipe operator, R >= 4.1). They are defined
so that the previous value in the chain becomes the first (DuckDB SQL
and R) or the last (PRQL) parameter of the function.
In Python, we can chain methods using .
.
Preparing Data
Since the PRQL and SQL results shown on this page are after being
converted to R DataFrame via knitr, they have been converted from DuckDB
types to R types. So NULL
in DuckDB is shown as NA
.
Download
Download the data to be analysis (zipped CSV file) and write the data to a Parquet file.
This document uses R here, but we can do it in another language, or, manually download and unzip and create the Parquet file (with DuckDB CLI).
- R
- Shell
Code
# Create "data" directory, download the zip file into the directory, and create a Parquet file.
data_dir <- "data"
dest <- file.path(data_dir, "flights.csv.zip")
csv_name <- "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
csv_path <- file.path(data_dir, csv_name)
parquet_path <- file.path(data_dir, "flights.parquet")
if (!fs::file_exists(parquet_path)) {
if (!fs::file_exists(dest)) {
fs::dir_create(data_dir)
curl::curl_download(
"https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
dest,
handle = curl::new_handle(ssl_verifypeer = FALSE)
)
}
unzip(dest, csv_name, exdir = data_dir)
duckdb:::sql(glue::glue("COPY (FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET)"))
}
This is a sample command to download the zipped CSV file and covert it to a Parquet file.
On Windows, the unzip
command is not available by default, so use
something like
Expand-Archive -Path data/flights.csv.zip -DestinationPath data
in
PowerShell insead.
mkdir data
curl -sL https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip -o data/flights.csv.zip
unzip -d data data/flights.csv.zip
duckdb -c "COPY (FROM read_csv_auto('data/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv')) TO 'data/flights.parquet' (FORMAT PARQUET)"
Load the Data
After the Parquet file is ready, load it into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.
- DuckDB
- R DataFrame
- Python polars.LazyFrame
CREATE TABLE tab AS SELECT * FROM 'data/flights.parquet'
FROM tab LIMIT 5
Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline | DOT_ID_Reporting_Airline | IATA_CODE_Reporting_Airline | Tail_Number | Flight_Number_Reporting_Airline | OriginAirportID | OriginAirportSeqID | OriginCityMarketID | Origin | OriginCityName | OriginState | OriginStateFips | OriginStateName | OriginWac | DestAirportID | DestAirportSeqID | DestCityMarketID | Dest | DestCityName | DestState | DestStateFips | DestStateName | DestWac | CRSDepTime | DepTime | DepDelay | DepDelayMinutes | DepDel15 | DepartureDelayGroups | DepTimeBlk | TaxiOut | WheelsOff | WheelsOn | TaxiIn | CRSArrTime | ArrTime | ArrDelay | ArrDelayMinutes | ArrDel15 | ArrivalDelayGroups | ArrTimeBlk | Cancelled | CancellationCode | Diverted | CRSElapsedTime | ActualElapsedTime | AirTime | Flights | Distance | DistanceGroup | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | FirstDepTime | TotalAddGTime | LongestAddGTime | DivAirportLandings | DivReachedDest | DivActualElapsedTime | DivArrDelay | DivDistance | Div1Airport | Div1AirportID | Div1AirportSeqID | Div1WheelsOn | Div1TotalGTime | Div1LongestGTime | Div1WheelsOff | Div1TailNum | Div2Airport | Div2AirportID | Div2AirportSeqID | Div2WheelsOn | Div2TotalGTime | Div2LongestGTime | Div2WheelsOff | Div2TailNum | Div3Airport | Div3AirportID | Div3AirportSeqID | Div3WheelsOn | Div3TotalGTime | Div3LongestGTime | Div3WheelsOff | Div3TailNum | Div4Airport | Div4AirportID | Div4AirportSeqID | Div4WheelsOn | Div4TotalGTime | Div4LongestGTime | Div4WheelsOff | Div4TailNum | Div5Airport | Div5AirportID | Div5AirportSeqID | Div5WheelsOn | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum | column109 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022 | 1 | 1 | 14 | 5 | 2022-01-14 | YX | 20452 | YX | N119HQ | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1221 | -3 | 0 | 0 | -1 | 1200-1259 | 28 | 1249 | 1352 | 4 | 1352 | 1356 | 4 | 4 | 0 | 0 | 1300-1359 | 0 | 0 | 88 | 95 | 63 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 15 | 6 | 2022-01-15 | YX | 20452 | YX | N122HQ | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1214 | -10 | 0 | 0 | -1 | 1200-1259 | 19 | 1233 | 1323 | 5 | 1352 | 1328 | -24 | 0 | 0 | -2 | 1300-1359 | 0 | 0 | 88 | 74 | 50 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 16 | 7 | 2022-01-16 | YX | 20452 | YX | N412YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1218 | -6 | 0 | 0 | -1 | 1200-1259 | 16 | 1234 | 1327 | 12 | 1352 | 1339 | -13 | 0 | 0 | -1 | 1300-1359 | 0 | 0 | 88 | 81 | 53 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 17 | 1 | 2022-01-17 | YX | 20452 | YX | N405YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1217 | -7 | 0 | 0 | -1 | 1200-1259 | 32 | 1249 | 1345 | 16 | 1352 | 1401 | 9 | 9 | 0 | 0 | 1300-1359 | 0 | 0 | 88 | 104 | 56 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 18 | 2 | 2022-01-18 | YX | 20452 | YX | N420YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1218 | -6 | 0 | 0 | -1 | 1200-1259 | 11 | 1229 | 1317 | 6 | 1352 | 1323 | -29 | 0 | 0 | -2 | 1300-1359 | 0 | 0 | 88 | 65 | 48 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
library(dplyr, warn.conflicts = FALSE)
df <- duckdb:::sql("FROM 'data/flights.parquet'")
df |> head(5)
Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline | DOT_ID_Reporting_Airline | IATA_CODE_Reporting_Airline | Tail_Number | Flight_Number_Reporting_Airline | OriginAirportID | OriginAirportSeqID | OriginCityMarketID | Origin | OriginCityName | OriginState | OriginStateFips | OriginStateName | OriginWac | DestAirportID | DestAirportSeqID | DestCityMarketID | Dest | DestCityName | DestState | DestStateFips | DestStateName | DestWac | CRSDepTime | DepTime | DepDelay | DepDelayMinutes | DepDel15 | DepartureDelayGroups | DepTimeBlk | TaxiOut | WheelsOff | WheelsOn | TaxiIn | CRSArrTime | ArrTime | ArrDelay | ArrDelayMinutes | ArrDel15 | ArrivalDelayGroups | ArrTimeBlk | Cancelled | CancellationCode | Diverted | CRSElapsedTime | ActualElapsedTime | AirTime | Flights | Distance | DistanceGroup | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | FirstDepTime | TotalAddGTime | LongestAddGTime | DivAirportLandings | DivReachedDest | DivActualElapsedTime | DivArrDelay | DivDistance | Div1Airport | Div1AirportID | Div1AirportSeqID | Div1WheelsOn | Div1TotalGTime | Div1LongestGTime | Div1WheelsOff | Div1TailNum | Div2Airport | Div2AirportID | Div2AirportSeqID | Div2WheelsOn | Div2TotalGTime | Div2LongestGTime | Div2WheelsOff | Div2TailNum | Div3Airport | Div3AirportID | Div3AirportSeqID | Div3WheelsOn | Div3TotalGTime | Div3LongestGTime | Div3WheelsOff | Div3TailNum | Div4Airport | Div4AirportID | Div4AirportSeqID | Div4WheelsOn | Div4TotalGTime | Div4LongestGTime | Div4WheelsOff | Div4TailNum | Div5Airport | Div5AirportID | Div5AirportSeqID | Div5WheelsOn | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum | column109 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022 | 1 | 1 | 14 | 5 | 2022-01-14 | YX | 20452 | YX | N119HQ | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1221 | -3 | 0 | 0 | -1 | 1200-1259 | 28 | 1249 | 1352 | 4 | 1352 | 1356 | 4 | 4 | 0 | 0 | 1300-1359 | 0 | 0 | 88 | 95 | 63 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 15 | 6 | 2022-01-15 | YX | 20452 | YX | N122HQ | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1214 | -10 | 0 | 0 | -1 | 1200-1259 | 19 | 1233 | 1323 | 5 | 1352 | 1328 | -24 | 0 | 0 | -2 | 1300-1359 | 0 | 0 | 88 | 74 | 50 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 16 | 7 | 2022-01-16 | YX | 20452 | YX | N412YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1218 | -6 | 0 | 0 | -1 | 1200-1259 | 16 | 1234 | 1327 | 12 | 1352 | 1339 | -13 | 0 | 0 | -1 | 1300-1359 | 0 | 0 | 88 | 81 | 53 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 17 | 1 | 2022-01-17 | YX | 20452 | YX | N405YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1217 | -7 | 0 | 0 | -1 | 1200-1259 | 32 | 1249 | 1345 | 16 | 1352 | 1401 | 9 | 9 | 0 | 0 | 1300-1359 | 0 | 0 | 88 | 104 | 56 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||||||||||||||||||||||
2022 | 1 | 1 | 18 | 2 | 2022-01-18 | YX | 20452 | YX | N420YX | 4879 | 11066 | 1106606 | 31066 | CMH | Columbus, OH | OH | 39 | Ohio | 44 | 11278 | 1127805 | 30852 | DCA | Washington, DC | VA | 51 | Virginia | 38 | 1224 | 1218 | -6 | 0 | 0 | -1 | 1200-1259 | 11 | 1229 | 1317 | 6 | 1352 | 1323 | -29 | 0 | 0 | -2 | 1300-1359 | 0 | 0 | 88 | 65 | 48 | 1 | 323 | 2 | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
import polars as pl
lf = pl.scan_parquet("data/flights.parquet")
lf.fetch(5)
Year | Quarter | Month | DayofMonth | DayOfWeek | FlightDate | Reporting_Airline | DOT_ID_Reporting_Airline | IATA_CODE_Reporting_Airline | Tail_Number | Flight_Number_Reporting_Airline | OriginAirportID | OriginAirportSeqID | OriginCityMarketID | Origin | OriginCityName | OriginState | OriginStateFips | OriginStateName | OriginWac | DestAirportID | DestAirportSeqID | DestCityMarketID | Dest | DestCityName | DestState | DestStateFips | DestStateName | DestWac | CRSDepTime | DepTime | DepDelay | DepDelayMinutes | DepDel15 | DepartureDelayGroups | DepTimeBlk | TaxiOut | … | Div1TotalGTime | Div1LongestGTime | Div1WheelsOff | Div1TailNum | Div2Airport | Div2AirportID | Div2AirportSeqID | Div2WheelsOn | Div2TotalGTime | Div2LongestGTime | Div2WheelsOff | Div2TailNum | Div3Airport | Div3AirportID | Div3AirportSeqID | Div3WheelsOn | Div3TotalGTime | Div3LongestGTime | Div3WheelsOff | Div3TailNum | Div4Airport | Div4AirportID | Div4AirportSeqID | Div4WheelsOn | Div4TotalGTime | Div4LongestGTime | Div4WheelsOff | Div4TailNum | Div5Airport | Div5AirportID | Div5AirportSeqID | Div5WheelsOn | Div5TotalGTime | Div5LongestGTime | Div5WheelsOff | Div5TailNum | column109 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | i64 | i64 | i64 | i64 | date | str | i64 | str | str | i64 | i64 | i64 | i64 | str | str | str | str | str | i64 | i64 | i64 | i64 | str | str | str | str | str | i64 | str | str | f64 | f64 | f64 | i64 | str | f64 | … | f64 | f64 | str | str | str | i64 | i64 | str | f64 | f64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str |
2022 | 1 | 1 | 14 | 5 | 2022-01-14 | "YX" | 20452 | "YX" | "N119HQ" | 4879 | 11066 | 1106606 | 31066 | "CMH" | "Columbus, OH" | "OH" | "39" | "Ohio" | 44 | 11278 | 1127805 | 30852 | "DCA" | "Washington, DC" | "VA" | "51" | "Virginia" | 38 | "1224" | "1221" | -3.0 | 0.0 | 0.0 | -1 | "1200-1259" | 28.0 | … | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | null |
2022 | 1 | 1 | 15 | 6 | 2022-01-15 | "YX" | 20452 | "YX" | "N122HQ" | 4879 | 11066 | 1106606 | 31066 | "CMH" | "Columbus, OH" | "OH" | "39" | "Ohio" | 44 | 11278 | 1127805 | 30852 | "DCA" | "Washington, DC" | "VA" | "51" | "Virginia" | 38 | "1224" | "1214" | -10.0 | 0.0 | 0.0 | -1 | "1200-1259" | 19.0 | … | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | null |
2022 | 1 | 1 | 16 | 7 | 2022-01-16 | "YX" | 20452 | "YX" | "N412YX" | 4879 | 11066 | 1106606 | 31066 | "CMH" | "Columbus, OH" | "OH" | "39" | "Ohio" | 44 | 11278 | 1127805 | 30852 | "DCA" | "Washington, DC" | "VA" | "51" | "Virginia" | 38 | "1224" | "1218" | -6.0 | 0.0 | 0.0 | -1 | "1200-1259" | 16.0 | … | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | null |
2022 | 1 | 1 | 17 | 1 | 2022-01-17 | "YX" | 20452 | "YX" | "N405YX" | 4879 | 11066 | 1106606 | 31066 | "CMH" | "Columbus, OH" | "OH" | "39" | "Ohio" | 44 | 11278 | 1127805 | 30852 | "DCA" | "Washington, DC" | "VA" | "51" | "Virginia" | 38 | "1224" | "1217" | -7.0 | 0.0 | 0.0 | -1 | "1200-1259" | 32.0 | … | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | null |
2022 | 1 | 1 | 18 | 2 | 2022-01-18 | "YX" | 20452 | "YX" | "N420YX" | 4879 | 11066 | 1106606 | 31066 | "CMH" | "Columbus, OH" | "OH" | "39" | "Ohio" | 44 | 11278 | 1127805 | 30852 | "DCA" | "Washington, DC" | "VA" | "51" | "Virginia" | 38 | "1224" | "1218" | -6.0 | 0.0 | 0.0 | -1 | "1200-1259" | 11.0 | … | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | "" | null | null | "" | null | null | "" | "" | null |
Functions
Define some functions and see if they work.
PRQL currently does not allow functions to be stored in a separate session or file, and they must always be defined with the main query. (PRQL/prql#1803)
Extract City Names
PRQL does not currently have the ability to apply a function to multiple columns at once. (PRQL/prql#2386)
The others can apply a function to multiple columns at once, but DuckDB SQL differs in that the column names are updated and that behavior cannot be prevented.
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"
from tab
select {
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName
}
take 5
OriginCityName | DestCityName |
---|---|
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
CREATE MACRO extract_city_name(col) AS regexp_replace(col, ',.*', '')
The function works as follows:
SELECT
extract_city_name(COLUMNS(['OriginCityName', 'DestCityName']))
FROM
tab
LIMIT 5
OriginCityName | DestCityName |
---|---|
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
extract_city_name <- function(col) stringr::str_remove(col, ",.*")
The function works as follows:
df |>
head(5) |>
mutate(
across(c(OriginCityName, DestCityName), extract_city_name),
.keep = "none"
)
OriginCityName | DestCityName |
---|---|
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
Columbus | Washington |
def extract_city_name(cols: list[str]) -> pl.Expr:
return pl.col(cols).str.replace(",.*", "")
The function works as follows:
lf.fetch(5).select(extract_city_name(["OriginCityName", "DestCityName"]))
OriginCityName | DestCityName |
---|---|
str | str |
"Columbus" | "Washington" |
"Columbus" | "Washington" |
"Columbus" | "Washington" |
"Columbus" | "Washington" |
"Columbus" | "Washington" |
Timestamp Calculation
PRQL and SQL can only define functions that return a single column.
The column name (here FlightDate
) cannot be used in the function
definition of DuckDB SQL.
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""
from tab
select {
FlightDate,
DepTimeOld = DepTime
}
derive {
DepTime = (time_to_datetime DepTimeOld)
}
take 5
FlightDate | DepTimeOld | DepTime |
---|---|---|
2022-01-14 | 1221 | 2022-01-14 12:21:00 |
2022-01-15 | 1214 | 2022-01-15 12:14:00 |
2022-01-16 | 1218 | 2022-01-16 12:18:00 |
2022-01-17 | 1217 | 2022-01-17 12:17:00 |
2022-01-18 | 1218 | 2022-01-18 12:18:00 |
CREATE MACRO time_to_datetime(date, string) AS
date::TIMESTAMP +
TRY_CAST(regexp_replace(string, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace(string, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
The function works as follows:
SELECT
FlightDate,
time_to_datetime(FlightDate, DepTime) AS DepTime,
DepTime AS DepTimeOld
FROM
tab
LIMIT 5
FlightDate | DepTime | DepTimeOld |
---|---|---|
2022-01-14 | 2022-01-14 12:21:00 | 1221 |
2022-01-15 | 2022-01-15 12:14:00 | 1214 |
2022-01-16 | 2022-01-16 12:18:00 | 1218 |
2022-01-17 | 2022-01-17 12:17:00 | 1217 |
2022-01-18 | 2022-01-18 12:18:00 | 1218 |
A function returns a single column.
time_col <- function(col) {
col |>
case_match(
"2400" ~ "0000",
.default = col
) |>
(\(x) {
(stringr::str_sub(x, 1, 2) |> lubridate::hours()) +
(stringr::str_sub(x, 3, 4) |> lubridate::minutes())
})()
}
A function returns a DataFrame.
time_to_datetime <- function(df, columns) {
df |>
mutate(across(all_of({{ columns }}), \(x) FlightDate + time_col(x)))
}
The function works as follows:
df |>
head(5) |>
select(FlightDate, DepTime) |>
mutate(DepTimeOld = DepTime) |>
time_to_datetime("DepTime")
FlightDate | DepTime | DepTimeOld |
---|---|---|
2022-01-14 | 2022-01-14 12:21:00 | 1221 |
2022-01-15 | 2022-01-15 12:14:00 | 1214 |
2022-01-16 | 2022-01-16 12:18:00 | 1218 |
2022-01-17 | 2022-01-17 12:17:00 | 1217 |
2022-01-18 | 2022-01-18 12:18:00 | 1218 |
A function returns a single polars.Expr
.
def time_col(col: str) -> pl.Expr:
col_expr = pl.col(col)
return (
pl.when(col_expr == "2400")
.then(pl.lit("0000"))
.otherwise(col_expr)
.str.strptime(pl.Time, "%H%M", strict=False)
.alias(col)
)
A function returns a list of polars.Expr
.
def time_to_datetime(columns: list[str]) -> list[pl.Expr]:
date_col = pl.col("FlightDate")
return [date_col.dt.combine(time_col(col)).alias(col) for col in columns]
The function works as follows:
lf.fetch(5).select(["FlightDate", "DepTime"]).with_columns(
DepTimeOld=pl.col("DepTime"), *time_to_datetime(["DepTime"])
)
FlightDate | DepTime | DepTimeOld |
---|---|---|
date | datetime[μs] | str |
2022-01-14 | 2022-01-14 12:21:00 | "1221" |
2022-01-15 | 2022-01-15 12:14:00 | "1214" |
2022-01-16 | 2022-01-16 12:18:00 | "1218" |
2022-01-17 | 2022-01-17 12:17:00 | "1217" |
2022-01-18 | 2022-01-18 12:18:00 | "1218" |
Use Functions in the Query
dplyr and Polars allow column names to be predefined as vectors or lists, which can then be referenced in the query.
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
let extract_city_name = col -> s"regexp_replace({col}, ',.*', '')"
let time_to_datetime = string -> s"""
FlightDate::TIMESTAMP +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(1, 2).concat(' hours') AS INTERVAL) +
TRY_CAST(regexp_replace({string}, '^2400$', '0000').substr(3, 2).concat(' minutes') AS INTERVAL)
"""
from tab
select {
Dest,
Tail_Number,
IATA_CODE_Reporting_Airline,
CancellationCode,
DepTime = time_to_datetime DepTime,
ArrTime = time_to_datetime ArrTime,
CRSArrTime = time_to_datetime CRSArrTime,
CRSDepTime = time_to_datetime CRSDepTime,
FlightDate,
Flight_Number_Reporting_Airline,
OriginCityName = extract_city_name OriginCityName,
DestCityName = extract_city_name DestCityName,
Origin,
DepDelay
}
take 5
Dest | Tail_Number | IATA_CODE_Reporting_Airline | CancellationCode | DepTime | ArrTime | CRSArrTime | CRSDepTime | FlightDate | Flight_Number_Reporting_Airline | OriginCityName | DestCityName | Origin | DepDelay |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DCA | N119HQ | YX | 2022-01-14 12:21:00 | 2022-01-14 13:56:00 | 2022-01-14 13:52:00 | 2022-01-14 12:24:00 | 2022-01-14 | 4879 | Columbus | Washington | CMH | -3 | |
DCA | N122HQ | YX | 2022-01-15 12:14:00 | 2022-01-15 13:28:00 | 2022-01-15 13:52:00 | 2022-01-15 12:24:00 | 2022-01-15 | 4879 | Columbus | Washington | CMH | -10 | |
DCA | N412YX | YX | 2022-01-16 12:18:00 | 2022-01-16 13:39:00 | 2022-01-16 13:52:00 | 2022-01-16 12:24:00 | 2022-01-16 | 4879 | Columbus | Washington | CMH | -6 | |
DCA | N405YX | YX | 2022-01-17 12:17:00 | 2022-01-17 14:01:00 | 2022-01-17 13:52:00 | 2022-01-17 12:24:00 | 2022-01-17 | 4879 | Columbus | Washington | CMH | -7 | |
DCA | N420YX | YX | 2022-01-18 12:18:00 | 2022-01-18 13:23:00 | 2022-01-18 13:52:00 | 2022-01-18 12:24:00 | 2022-01-18 | 4879 | Columbus | Washington | CMH | -6 |
SELECT
Dest,
Tail_Number,
IATA_CODE_Reporting_Airline,
CancellationCode,
time_to_datetime(
FlightDate,
COLUMNS([
DepTime,
ArrTime,
CRSArrTime,
CRSDepTime
])
),
FlightDate,
Flight_Number_Reporting_Airline,
extract_city_name(COLUMNS([OriginCityName, DestCityName])),
Origin,
DepDelay
FROM
tab
LIMIT 5
Dest | Tail_Number | IATA_CODE_Reporting_Airline | CancellationCode | CRSDepTime | DepTime | CRSArrTime | ArrTime | FlightDate | Flight_Number_Reporting_Airline | OriginCityName | DestCityName | Origin | DepDelay |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DCA | N119HQ | YX | 2022-01-14 12:24:00 | 2022-01-14 12:21:00 | 2022-01-14 13:52:00 | 2022-01-14 13:56:00 | 2022-01-14 | 4879 | Columbus | Washington | CMH | -3 | |
DCA | N122HQ | YX | 2022-01-15 12:24:00 | 2022-01-15 12:14:00 | 2022-01-15 13:52:00 | 2022-01-15 13:28:00 | 2022-01-15 | 4879 | Columbus | Washington | CMH | -10 | |
DCA | N412YX | YX | 2022-01-16 12:24:00 | 2022-01-16 12:18:00 | 2022-01-16 13:52:00 | 2022-01-16 13:39:00 | 2022-01-16 | 4879 | Columbus | Washington | CMH | -6 | |
DCA | N405YX | YX | 2022-01-17 12:24:00 | 2022-01-17 12:17:00 | 2022-01-17 13:52:00 | 2022-01-17 14:01:00 | 2022-01-17 | 4879 | Columbus | Washington | CMH | -7 | |
DCA | N420YX | YX | 2022-01-18 12:24:00 | 2022-01-18 12:18:00 | 2022-01-18 13:52:00 | 2022-01-18 13:23:00 | 2022-01-18 | 4879 | Columbus | Washington | CMH | -6 |
category_cols <- c(
"Dest",
"Tail_Number",
"IATA_CODE_Reporting_Airline",
"CancellationCode"
)
time_cols <- c("DepTime", "ArrTime", "CRSArrTime", "CRSDepTime")
cols <- c(
category_cols,
time_cols,
c(
"FlightDate",
"Flight_Number_Reporting_Airline",
"OriginCityName",
"DestCityName",
"Origin",
"DepDelay"
)
)
df |>
select(all_of(cols)) |>
mutate(
across({{ category_cols }}, as.factor),
across(c(OriginCityName, DestCityName), extract_city_name)
) |>
time_to_datetime(time_cols) |>
head(5)
Dest | Tail_Number | IATA_CODE_Reporting_Airline | CancellationCode | DepTime | ArrTime | CRSArrTime | CRSDepTime | FlightDate | Flight_Number_Reporting_Airline | OriginCityName | DestCityName | Origin | DepDelay |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DCA | N119HQ | YX | 2022-01-14 12:21:00 | 2022-01-14 13:56:00 | 2022-01-14 13:52:00 | 2022-01-14 12:24:00 | 2022-01-14 | 4879 | Columbus | Washington | CMH | -3 | |
DCA | N122HQ | YX | 2022-01-15 12:14:00 | 2022-01-15 13:28:00 | 2022-01-15 13:52:00 | 2022-01-15 12:24:00 | 2022-01-15 | 4879 | Columbus | Washington | CMH | -10 | |
DCA | N412YX | YX | 2022-01-16 12:18:00 | 2022-01-16 13:39:00 | 2022-01-16 13:52:00 | 2022-01-16 12:24:00 | 2022-01-16 | 4879 | Columbus | Washington | CMH | -6 | |
DCA | N405YX | YX | 2022-01-17 12:17:00 | 2022-01-17 14:01:00 | 2022-01-17 13:52:00 | 2022-01-17 12:24:00 | 2022-01-17 | 4879 | Columbus | Washington | CMH | -7 | |
DCA | N420YX | YX | 2022-01-18 12:18:00 | 2022-01-18 13:23:00 | 2022-01-18 13:52:00 | 2022-01-18 12:24:00 | 2022-01-18 | 4879 | Columbus | Washington | CMH | -6 |
category_cols = [
"Dest",
"Tail_Number",
"IATA_CODE_Reporting_Airline",
"CancellationCode",
]
time_cols = ["DepTime", "ArrTime", "CRSArrTime", "CRSDepTime"]
cols = (
category_cols
+ time_cols
+ [
"FlightDate",
"Flight_Number_Reporting_Airline",
"OriginCityName",
"DestCityName",
"Origin",
"DepDelay",
]
)
lf.select(cols).with_columns(
[
pl.col(category_cols).cast(pl.Categorical),
extract_city_name(["OriginCityName", "DestCityName"]),
*time_to_datetime(time_cols),
]
).head(5).collect()
Dest | Tail_Number | IATA_CODE_Reporting_Airline | CancellationCode | DepTime | ArrTime | CRSArrTime | CRSDepTime | FlightDate | Flight_Number_Reporting_Airline | OriginCityName | DestCityName | Origin | DepDelay |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cat | cat | cat | cat | datetime[μs] | datetime[μs] | datetime[μs] | datetime[μs] | date | i64 | str | str | str | f64 |
"DCA" | "N119HQ" | "YX" | "" | 2022-01-14 12:21:00 | 2022-01-14 13:56:00 | 2022-01-14 13:52:00 | 2022-01-14 12:24:00 | 2022-01-14 | 4879 | "Columbus" | "Washington" | "CMH" | -3.0 |
"DCA" | "N122HQ" | "YX" | "" | 2022-01-15 12:14:00 | 2022-01-15 13:28:00 | 2022-01-15 13:52:00 | 2022-01-15 12:24:00 | 2022-01-15 | 4879 | "Columbus" | "Washington" | "CMH" | -10.0 |
"DCA" | "N412YX" | "YX" | "" | 2022-01-16 12:18:00 | 2022-01-16 13:39:00 | 2022-01-16 13:52:00 | 2022-01-16 12:24:00 | 2022-01-16 | 4879 | "Columbus" | "Washington" | "CMH" | -6.0 |
"DCA" | "N405YX" | "YX" | "" | 2022-01-17 12:17:00 | 2022-01-17 14:01:00 | 2022-01-17 13:52:00 | 2022-01-17 12:24:00 | 2022-01-17 | 4879 | "Columbus" | "Washington" | "CMH" | -7.0 |
"DCA" | "N420YX" | "YX" | "" | 2022-01-18 12:18:00 | 2022-01-18 13:23:00 | 2022-01-18 13:52:00 | 2022-01-18 12:24:00 | 2022-01-18 | 4879 | "Columbus" | "Washington" | "CMH" | -6.0 |