Skip to main content

Selecting and Filtering

note

This page is based on the chapter “Indexing (Or Lack Thereof)” of the Modern Polars book.

Preparing Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (zipped CSV file) and write the data to a Parquet file.

This document uses R here, but we can do it in another language, or, manually download and unzip and create the Parquet file (with DuckDB CLI).

Code
R
# Create "data" directory, download the zip file into the directory, and create a Parquet file.
data_dir <- "data"
dest <- file.path(data_dir, "flights.csv.zip")
csv_name <- "On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2022_1.csv"
csv_path <- file.path(data_dir, csv_name)
parquet_path <- file.path(data_dir, "flights.parquet")

if (!fs::file_exists(parquet_path)) {
if (!fs::file_exists(dest)) {
fs::dir_create(data_dir)
curl::curl_download(
"https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_1.zip",
dest,
handle = curl::new_handle(ssl_verifypeer = FALSE)
)
}

unzip(dest, csv_name, exdir = data_dir)
duckdb:::sql(glue::glue("COPY (FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET)"))
}

Load the Data

After the Parquet file is ready, load it into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

SQL
CREATE TABLE tab AS SELECT * FROM 'data/flights.parquet'
SQL
FROM tab LIMIT 5
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number Flight_Number_Reporting_Airline OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCityName OriginState OriginStateFips OriginStateName OriginWac DestAirportID DestAirportSeqID DestCityMarketID Dest DestCityName DestState DestStateFips DestStateName DestWac CRSDepTime DepTime DepDelay DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk TaxiOut WheelsOff WheelsOn TaxiIn CRSArrTime ArrTime ArrDelay ArrDelayMinutes ArrDel15 ArrivalDelayGroups ArrTimeBlk Cancelled CancellationCode Diverted CRSElapsedTime ActualElapsedTime AirTime Flights Distance DistanceGroup CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay FirstDepTime TotalAddGTime LongestAddGTime DivAirportLandings DivReachedDest DivActualElapsedTime DivArrDelay DivDistance Div1Airport Div1AirportID Div1AirportSeqID Div1WheelsOn Div1TotalGTime Div1LongestGTime Div1WheelsOff Div1TailNum Div2Airport Div2AirportID Div2AirportSeqID Div2WheelsOn Div2TotalGTime Div2LongestGTime Div2WheelsOff Div2TailNum Div3Airport Div3AirportID Div3AirportSeqID Div3WheelsOn Div3TotalGTime Div3LongestGTime Div3WheelsOff Div3TailNum Div4Airport Div4AirportID Div4AirportSeqID Div4WheelsOn Div4TotalGTime Div4LongestGTime Div4WheelsOff Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum column109
2022 1 1 14 5 2022-01-14 YX 20452 YX N119HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1221 -3 0 0 -1 1200-1259 28 1249 1352 4 1352 1356 4 4 0 0 1300-1359 0 0 88 95 63 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 15 6 2022-01-15 YX 20452 YX N122HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1214 -10 0 0 -1 1200-1259 19 1233 1323 5 1352 1328 -24 0 0 -2 1300-1359 0 0 88 74 50 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 16 7 2022-01-16 YX 20452 YX N412YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 16 1234 1327 12 1352 1339 -13 0 0 -1 1300-1359 0 0 88 81 53 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 17 1 2022-01-17 YX 20452 YX N405YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1217 -7 0 0 -1 1200-1259 32 1249 1345 16 1352 1401 9 9 0 0 1300-1359 0 0 88 104 56 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 18 2 2022-01-18 YX 20452 YX N420YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 11 1229 1317 6 1352 1323 -29 0 0 -2 1300-1359 0 0 88 65 48 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Read the Data with PRQL

Since PRQL cannot be used to create tables, data must be read from pre-created tables in a DB.

But when PRQL is used on DuckDB, it is possible to treat Parquet and other files as tables, so that PRQL queries can be executed without creating tables.

PRQL
from `data/flights.parquet`
take 5
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number Flight_Number_Reporting_Airline OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCityName OriginState OriginStateFips OriginStateName OriginWac DestAirportID DestAirportSeqID DestCityMarketID Dest DestCityName DestState DestStateFips DestStateName DestWac CRSDepTime DepTime DepDelay DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk TaxiOut WheelsOff WheelsOn TaxiIn CRSArrTime ArrTime ArrDelay ArrDelayMinutes ArrDel15 ArrivalDelayGroups ArrTimeBlk Cancelled CancellationCode Diverted CRSElapsedTime ActualElapsedTime AirTime Flights Distance DistanceGroup CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay FirstDepTime TotalAddGTime LongestAddGTime DivAirportLandings DivReachedDest DivActualElapsedTime DivArrDelay DivDistance Div1Airport Div1AirportID Div1AirportSeqID Div1WheelsOn Div1TotalGTime Div1LongestGTime Div1WheelsOff Div1TailNum Div2Airport Div2AirportID Div2AirportSeqID Div2WheelsOn Div2TotalGTime Div2LongestGTime Div2WheelsOff Div2TailNum Div3Airport Div3AirportID Div3AirportSeqID Div3WheelsOn Div3TotalGTime Div3LongestGTime Div3WheelsOff Div3TailNum Div4Airport Div4AirportID Div4AirportSeqID Div4WheelsOn Div4TotalGTime Div4LongestGTime Div4WheelsOff Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum column109
2022 1 1 14 5 2022-01-14 YX 20452 YX N119HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1221 -3 0 0 -1 1200-1259 28 1249 1352 4 1352 1356 4 4 0 0 1300-1359 0 0 88 95 63 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 15 6 2022-01-15 YX 20452 YX N122HQ 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1214 -10 0 0 -1 1200-1259 19 1233 1323 5 1352 1328 -24 0 0 -2 1300-1359 0 0 88 74 50 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 16 7 2022-01-16 YX 20452 YX N412YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 16 1234 1327 12 1352 1339 -13 0 0 -1 1300-1359 0 0 88 81 53 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 17 1 2022-01-17 YX 20452 YX N405YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1217 -7 0 0 -1 1200-1259 32 1249 1345 16 1352 1401 9 9 0 0 1300-1359 0 0 88 104 56 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 18 2 2022-01-18 YX 20452 YX N420YX 4879 11066 1106606 31066 CMH Columbus, OH OH 39 Ohio 44 11278 1127805 30852 DCA Washington, DC VA 51 Virginia 38 1224 1218 -6 0 0 -1 1200-1259 11 1229 1317 6 1352 1323 -29 0 0 -2 1300-1359 0 0 88 65 48 1 323 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Selecting Columns and Slicing Rows

For column selection, the syntax is much the same, but the syntax for row selection using position looks different.

Note that PRQL, SQL, and R are 1-based, while Python is 0-based.

PRQL
from tab
select {Dest, Tail_Number}
take 13..16
Dest Tail_Number
DCA N132HQ
DCA N109HQ
DCA N421YX
DCA N137HQ

Filtering Rows

PRQL
from tab
filter (IATA_CODE_Reporting_Airline | in ["AA", "DL"])
take 5
Year Quarter Month DayofMonth DayOfWeek FlightDate Reporting_Airline DOT_ID_Reporting_Airline IATA_CODE_Reporting_Airline Tail_Number Flight_Number_Reporting_Airline OriginAirportID OriginAirportSeqID OriginCityMarketID Origin OriginCityName OriginState OriginStateFips OriginStateName OriginWac DestAirportID DestAirportSeqID DestCityMarketID Dest DestCityName DestState DestStateFips DestStateName DestWac CRSDepTime DepTime DepDelay DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk TaxiOut WheelsOff WheelsOn TaxiIn CRSArrTime ArrTime ArrDelay ArrDelayMinutes ArrDel15 ArrivalDelayGroups ArrTimeBlk Cancelled CancellationCode Diverted CRSElapsedTime ActualElapsedTime AirTime Flights Distance DistanceGroup CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay FirstDepTime TotalAddGTime LongestAddGTime DivAirportLandings DivReachedDest DivActualElapsedTime DivArrDelay DivDistance Div1Airport Div1AirportID Div1AirportSeqID Div1WheelsOn Div1TotalGTime Div1LongestGTime Div1WheelsOff Div1TailNum Div2Airport Div2AirportID Div2AirportSeqID Div2WheelsOn Div2TotalGTime Div2LongestGTime Div2WheelsOff Div2TailNum Div3Airport Div3AirportID Div3AirportSeqID Div3WheelsOn Div3TotalGTime Div3LongestGTime Div3WheelsOff Div3TailNum Div4Airport Div4AirportID Div4AirportSeqID Div4WheelsOn Div4TotalGTime Div4LongestGTime Div4WheelsOff Div4TailNum Div5Airport Div5AirportID Div5AirportSeqID Div5WheelsOn Div5TotalGTime Div5LongestGTime Div5WheelsOff Div5TailNum column109
2022 1 1 6 4 2022-01-06 DL 19790 DL N315DN 1581 11697 1169706 32467 FLL Fort Lauderdale, FL FL 12 Florida 33 12953 1295304 31703 LGA New York, NY NY 36 New York 22 1126 NA NA NA NA 1100-1159 NA NA 1419 NA NA NA NA 1400-1459 1 A 0 173 NA NA 1 1076 5 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 6 4 2022-01-06 DL 19790 DL N545US 1582 10397 1039707 30397 ATL Atlanta, GA GA 13 Georgia 34 11697 1169706 32467 FLL Fort Lauderdale, FL FL 12 Florida 33 1631 1627 -4 0 0 -1 1600-1659 15 1642 1815 5 1821 1820 -1 0 0 -1 1800-1859 0 0 110 113 93 1 581 3 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 6 4 2022-01-06 DL 19790 DL N545US 1582 11697 1169706 32467 FLL Fort Lauderdale, FL FL 12 Florida 33 10397 1039707 30397 ATL Atlanta, GA GA 13 Georgia 34 1931 1929 -2 0 0 -1 1900-1959 13 1942 2105 10 2127 2115 -12 0 0 -1 2100-2159 0 0 116 106 83 1 581 3 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 6 4 2022-01-06 DL 19790 DL N345NB 1583 11697 1169706 32467 FLL Fort Lauderdale, FL FL 12 Florida 33 14492 1449202 34492 RDU Raleigh/Durham, NC NC 37 North Carolina 36 1024 1019 -5 0 0 -1 1000-1059 17 1036 1209 3 1227 1212 -15 0 0 -1 1200-1259 0 0 123 113 93 1 680 3 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2022 1 1 6 4 2022-01-06 DL 19790 DL N978AT 1584 10397 1039707 30397 ATL Atlanta, GA GA 13 Georgia 34 12448 1244807 32448 JAN Jackson/Vicksburg, MS MS 28 Mississippi 53 1117 1113 -4 0 0 -1 1100-1159 14 1127 1127 4 1142 1131 -11 0 0 -1 1100-1159 0 0 85 78 60 1 341 2 NA NA NA NA NA NA NA 0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Assigning

PRQL
let f = from [
{a = 1, b = 10},
{a = 2, b = 20},
{a = 3, b = 30},
{a = 4, b = 40},
{a = 5, b = 50},
]

from f
select {
a, # Note: can't use `derive` here https://github.com/PRQL/prql/issues/3130
b = case [
a <= 3 => b // 10,
true => b
]
}
a b
1 1
2 2
3 3
4 40
5 50