Skip to main content

Selecting and Filtering

note

This page is based on the chapter “Indexing (Or Lack Thereof)” of the Modern Polars book.

Preparing Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (zipped CSV file) and write the data to a Parquet file.

This document uses R here, but we can do it in another language, or, manually download and unzip and create the Parquet file (with DuckDB CLI).

Code
# Create "data" directory, download the zip file into the directory, and create a Parquet file.
data_dir <- "data"
dest <- file.path(data_dir, "flights.csv.zip")
csv_name <- "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
csv_path <- file.path(data_dir, csv_name)
parquet_path <- file.path(data_dir, "flights.parquet")

if (!fs::file_exists(parquet_path)) {
if (!fs::file_exists(dest)) {
fs::dir_create(data_dir)
curl::curl_download(
"https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
dest,
handle = curl::new_handle(ssl_verifypeer = FALSE)
)
}

unzip(dest, csv_name, exdir = data_dir)
duckdb:::sql(glue::glue("COPY (FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET)"))
}

Load the Data

After the Parquet file is ready, load it into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

CREATE TABLE tab AS SELECT * FROM 'data/flights.parquet'
FROM tab LIMIT 5
YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOutWheelsOffWheelsOnTaxiInCRSArrTimeArrTimeArrDelayArrDelayMinutesArrDel15ArrivalDelayGroupsArrTimeBlkCancelledCancellationCodeDivertedCRSElapsedTimeActualElapsedTimeAirTimeFlightsDistanceDistanceGroupCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayFirstDepTimeTotalAddGTimeLongestAddGTimeDivAirportLandingsDivReachedDestDivActualElapsedTimeDivArrDelayDivDistanceDiv1AirportDiv1AirportIDDiv1AirportSeqIDDiv1WheelsOnDiv1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumcolumn109
2022111452022-01-14YX20452YXN119HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241221-300-11200-1259281249135241352135644001300-13590088956313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111562022-01-15YX20452YXN122HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241214-1000-11200-12591912331323513521328-2400-21300-13590088745013232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111672022-01-16YX20452YXN412YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-125916123413271213521339-1300-11300-13590088815313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111712022-01-17YX20452YXN405YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241217-700-11200-12593212491345161352140199001300-135900881045613232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111822022-01-18YX20452YXN420YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-12591112291317613521323-2900-21300-13590088654813232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA

Read the Data with PRQL

Since PRQL cannot be used to create tables, data must be read from pre-created tables in a DB.

But when PRQL is used on DuckDB, it is possible to treat Parquet and other files as tables, so that PRQL queries can be executed without creating tables.

from `data/flights.parquet`
take 5
YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOutWheelsOffWheelsOnTaxiInCRSArrTimeArrTimeArrDelayArrDelayMinutesArrDel15ArrivalDelayGroupsArrTimeBlkCancelledCancellationCodeDivertedCRSElapsedTimeActualElapsedTimeAirTimeFlightsDistanceDistanceGroupCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayFirstDepTimeTotalAddGTimeLongestAddGTimeDivAirportLandingsDivReachedDestDivActualElapsedTimeDivArrDelayDivDistanceDiv1AirportDiv1AirportIDDiv1AirportSeqIDDiv1WheelsOnDiv1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumcolumn109
2022111452022-01-14YX20452YXN119HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241221-300-11200-1259281249135241352135644001300-13590088956313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111562022-01-15YX20452YXN122HQ487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241214-1000-11200-12591912331323513521328-2400-21300-13590088745013232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111672022-01-16YX20452YXN412YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-125916123413271213521339-1300-11300-13590088815313232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111712022-01-17YX20452YXN405YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241217-700-11200-12593212491345161352140199001300-135900881045613232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
2022111822022-01-18YX20452YXN420YX487911066110660631066CMHColumbus, OHOH39Ohio4411278112780530852DCAWashington, DCVA51Virginia3812241218-600-11200-12591112291317613521323-2900-21300-13590088654813232NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA

Selecting Columns and Slicing Rows

For column selection, the syntax is much the same, but the syntax for row selection using position looks different.

Note that PRQL, SQL, and R are 1-based, while Python is 0-based.

from tab
select {Dest, Tail_Number}
take 13..16
DestTail_Number
DCAN132HQ
DCAN109HQ
DCAN421YX
DCAN137HQ

Filtering Rows

from tab
filter (IATA_CODE_Reporting_Airline | in ["AA", "DL"])
take 5
YearQuarterMonthDayofMonthDayOfWeekFlightDateReporting_AirlineDOT_ID_Reporting_AirlineIATA_CODE_Reporting_AirlineTail_NumberFlight_Number_Reporting_AirlineOriginAirportIDOriginAirportSeqIDOriginCityMarketIDOriginOriginCityNameOriginStateOriginStateFipsOriginStateNameOriginWacDestAirportIDDestAirportSeqIDDestCityMarketIDDestDestCityNameDestStateDestStateFipsDestStateNameDestWacCRSDepTimeDepTimeDepDelayDepDelayMinutesDepDel15DepartureDelayGroupsDepTimeBlkTaxiOutWheelsOffWheelsOnTaxiInCRSArrTimeArrTimeArrDelayArrDelayMinutesArrDel15ArrivalDelayGroupsArrTimeBlkCancelledCancellationCodeDivertedCRSElapsedTimeActualElapsedTimeAirTimeFlightsDistanceDistanceGroupCarrierDelayWeatherDelayNASDelaySecurityDelayLateAircraftDelayFirstDepTimeTotalAddGTimeLongestAddGTimeDivAirportLandingsDivReachedDestDivActualElapsedTimeDivArrDelayDivDistanceDiv1AirportDiv1AirportIDDiv1AirportSeqIDDiv1WheelsOnDiv1TotalGTimeDiv1LongestGTimeDiv1WheelsOffDiv1TailNumDiv2AirportDiv2AirportIDDiv2AirportSeqIDDiv2WheelsOnDiv2TotalGTimeDiv2LongestGTimeDiv2WheelsOffDiv2TailNumDiv3AirportDiv3AirportIDDiv3AirportSeqIDDiv3WheelsOnDiv3TotalGTimeDiv3LongestGTimeDiv3WheelsOffDiv3TailNumDiv4AirportDiv4AirportIDDiv4AirportSeqIDDiv4WheelsOnDiv4TotalGTimeDiv4LongestGTimeDiv4WheelsOffDiv4TailNumDiv5AirportDiv5AirportIDDiv5AirportSeqIDDiv5WheelsOnDiv5TotalGTimeDiv5LongestGTimeDiv5WheelsOffDiv5TailNumcolumn109
202211642022-01-06DL19790DLN315DN158111697116970632467FLLFort Lauderdale, FLFL12Florida3312953129530431703LGANew York, NYNY36New York221126NANANANA1100-1159NANA1419NANANANA1400-14591A0173NANA110765NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
202211642022-01-06DL19790DLN545US158210397103970730397ATLAtlanta, GAGA13Georgia3411697116970632467FLLFort Lauderdale, FLFL12Florida3316311627-400-11600-16591516421815518211820-100-11800-1859001101139315813NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
202211642022-01-06DL19790DLN545US158211697116970632467FLLFort Lauderdale, FLFL12Florida3310397103970730397ATLAtlanta, GAGA13Georgia3419311929-200-11900-195913194221051021272115-1200-12100-2159001161068315813NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
202211642022-01-06DL19790DLN345NB158311697116970632467FLLFort Lauderdale, FLFL12Florida3314492144920234492RDURaleigh/Durham, NCNC37North Carolina3610241019-500-11000-10591710361209312271212-1500-11200-1259001231139316803NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA
202211642022-01-06DL19790DLN978AT158410397103970730397ATLAtlanta, GAGA13Georgia3412448124480732448JANJackson/Vicksburg, MSMS28Mississippi5311171113-400-11100-11591411271127411421131-1100-11100-11590085786013412NANANANANANANA0NANANANANANANANANANANANANANANANANANANANANANANANANA

Assigning

let f = from [
{a = 1, b = 10},
{a = 2, b = 20},
{a = 3, b = 30},
{a = 4, b = 40},
{a = 5, b = 50},
]

from f
select {
a, # Note: can't use `derive` here https://github.com/PRQL/prql/issues/3130
b = case [
a <= 3 => b // 10,
true => b
]
}
ab
11
22
33
440
550