Reshaping and Tidy Data
This page is based on the chapter “Reshaping and Tidy Data” of the Modern Polars book.
Read the Data
The PRQL and SQL results shown on this page are converted to R DataFrames via knitr, so DuckDB types are mapped to R types; for example, NULL in DuckDB is displayed as NA.
Download
Download the data to be analyzed (tables on the website) and write the data to CSV files.
This document uses R to download the data from the source here, but we can also download and use the CSV files included in the kevinheavey/modern-polars GitHub repository.
- R
- Shell
nba_dir <- file.path("data", "nba")

months <- c(
  "october",
  "november",
  "december",
  "january",
  "february",
  "march",
  "april",
  "may",
  "june"
)

column_names <- c(
  date = "date",
  away_team = "visitor_neutral",
  away_points = "pts",
  home_team = "home_neutral",
  home_points = "pts_2"
)

.write_data <- function(month) {
  base_url <- "http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"

  glue::glue(base_url, month = month) |>
    rvest::read_html() |>
    rvest::html_table() |>
    (\(x) x[[1]])() |> # TODO: Rewrite after R 4.3
    janitor::clean_names() |>
    dplyr::select(all_of(column_names)) |>
    dplyr::filter(date != "Playoffs") |>
    readr::write_csv(file.path(nba_dir, glue::glue("{month}.csv")))
}

if (!fs::dir_exists(nba_dir)) {
  fs::dir_create(nba_dir)

  months |>
    purrr::walk(.write_data)
}
These are sample commands to download the CSV files from the kevinheavey/modern-polars GitHub repository.
mkdir data/nba
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/october.csv -o data/nba/october.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/november.csv -o data/nba/november.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/december.csv -o data/nba/december.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/january.csv -o data/nba/january.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/february.csv -o data/nba/february.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/march.csv -o data/nba/march.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/april.csv -o data/nba/april.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/may.csv -o data/nba/may.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/june.csv -o data/nba/june.csv
Load the Data
After the CSV files are ready, load them into a DuckDB in-memory database table, an R DataFrame, and a Python polars.LazyFrame.
- DuckDB
- R DataFrame
- Python polars.LazyFrame
CREATE TABLE tab AS FROM read_csv_auto('data/nba/*.csv')
FROM tab
LIMIT 5
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
Fri, Apr 1, 2016 | Philadelphia 76ers | 91 | Charlotte Hornets | 100 |
Fri, Apr 1, 2016 | Dallas Mavericks | 98 | Detroit Pistons | 89 |
Fri, Apr 1, 2016 | Brooklyn Nets | 91 | New York Knicks | 105 |
Fri, Apr 1, 2016 | Cleveland Cavaliers | 110 | Atlanta Hawks | 108 |
Fri, Apr 1, 2016 | Toronto Raptors | 99 | Memphis Grizzlies | 95 |
library(dplyr, warn.conflicts = FALSE)
df <- readr::read_csv(
  fs::dir_ls("data/nba", glob = "*.csv")
)
df |> head(5)
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
Fri, Apr 1, 2016 | Philadelphia 76ers | 91 | Charlotte Hornets | 100 |
Fri, Apr 1, 2016 | Dallas Mavericks | 98 | Detroit Pistons | 89 |
Fri, Apr 1, 2016 | Brooklyn Nets | 91 | New York Knicks | 105 |
Fri, Apr 1, 2016 | Cleveland Cavaliers | 110 | Atlanta Hawks | 108 |
Fri, Apr 1, 2016 | Toronto Raptors | 99 | Memphis Grizzlies | 95 |
import polars as pl
lf = pl.scan_csv("data/nba/*.csv")
lf.head(5).collect()
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
str | str | i64 | str | i64 |
"Fri, Apr 1, 2016" | "Philadelphia 76ers" | 91 | "Charlotte Hornets" | 100 |
"Fri, Apr 1, 2016" | "Dallas Mavericks" | 98 | "Detroit Pistons" | 89 |
"Fri, Apr 1, 2016" | "Brooklyn Nets" | 91 | "New York Knicks" | 105 |
"Fri, Apr 1, 2016" | "Cleveland Cavaliers" | 110 | "Atlanta Hawks" | 108 |
"Fri, Apr 1, 2016" | "Toronto Raptors" | 99 | "Memphis Grizzlies" | 95 |
Cleaning
Convert the date column to date type and delete rows containing missing values (null).
PRQL does not have a “remove rows with missing values in any column” syntax (PRQL/prql#2386), but DuckDB SQL does (>= 0.8, duckdb/duckdb#6621), so it can be used.
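For reference, the Polars version below drops such rows by checking every column for null; here is a minimal sketch of that idea on made-up data (as the code comment there notes, polars drop_nulls is an equivalent shortcut):

import polars as pl

# Hypothetical frame with missing values in different columns.
df = pl.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# Keep only rows where no column is null.
kept = df.filter(~pl.any_horizontal(pl.all().is_null()))
# Equivalent here: df.drop_nulls()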
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
let games = (
  from tab
  filter s"COLUMNS(*) IS NOT NULL"
  derive date_new = (s"strptime(date, '%a, %b %d, %Y')" | as date)
  select !{this.date} # `this` refers to the current relation
  sort date_new
  derive game_id = (row_number this)
)

from games
take 5
away_team | away_points | home_team | home_points | date_new | game_id |
---|---|---|---|---|---|
Cleveland Cavaliers | 95 | Chicago Bulls | 97 | 2015-10-27 | 1 |
Detroit Pistons | 106 | Atlanta Hawks | 94 | 2015-10-27 | 2 |
New Orleans Pelicans | 95 | Golden State Warriors | 111 | 2015-10-27 | 3 |
Washington Wizards | 88 | Orlando Magic | 87 | 2015-10-28 | 4 |
Philadelphia 76ers | 95 | Boston Celtics | 112 | 2015-10-28 | 5 |
CREATE TABLE games AS (
  WITH _tab1 AS (
    SELECT
      * REPLACE (strptime(date, '%a, %b %d, %Y')::date AS date)
    FROM tab
    WHERE COLUMNS(*) IS NOT NULL
  )
  SELECT
    row_number() OVER(ORDER BY date) AS game_id,
    *
  FROM _tab1
  ORDER BY date
)

FROM games
LIMIT 5
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
1 | 2015-10-27 | Cleveland Cavaliers | 95 | Chicago Bulls | 97 |
2 | 2015-10-27 | Detroit Pistons | 106 | Atlanta Hawks | 94 |
3 | 2015-10-27 | New Orleans Pelicans | 95 | Golden State Warriors | 111 |
4 | 2015-10-28 | Washington Wizards | 88 | Orlando Magic | 87 |
5 | 2015-10-28 | Philadelphia 76ers | 95 | Boston Celtics | 112 |
games <- df |>
  filter(!if_any(everything(), is.na)) |> # Also can use `tidyr::drop_na`
  mutate(
    date = lubridate::parse_date_time(date, "%a, %b %d, %Y") |>
      lubridate::as_date()
  ) |>
  arrange(date) |>
  mutate(game_id = row_number(), .before = 1)

games |>
  head(5)
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
1 | 2015-10-27 | Cleveland Cavaliers | 95 | Chicago Bulls | 97 |
2 | 2015-10-27 | Detroit Pistons | 106 | Atlanta Hawks | 94 |
3 | 2015-10-27 | New Orleans Pelicans | 95 | Golden State Warriors | 111 |
4 | 2015-10-28 | Washington Wizards | 88 | Orlando Magic | 87 |
5 | 2015-10-28 | Philadelphia 76ers | 95 | Boston Celtics | 112 |
games = (
    lf.filter(~pl.any_horizontal(pl.all().is_null()))  # Also can use `polars.LazyFrame.drop_nulls`
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    .with_row_index("game_id")
    .collect()
)

games.head(5)
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
u32 | date | str | i64 | str | i64 |
0 | 2015-10-27 | "Cleveland Cavaliers" | 95 | "Chicago Bulls" | 97 |
1 | 2015-10-27 | "Detroit Pistons" | 106 | "Atlanta Hawks" | 94 |
2 | 2015-10-27 | "New Orleans Pelicans" | 95 | "Golden State Warriors" | 111 |
3 | 2015-10-28 | "Washington Wizards" | 88 | "Orlando Magic" | 87 |
4 | 2015-10-28 | "Philadelphia 76ers" | 95 | "Boston Celtics" | 112 |
Looking at the result tables, we notice that the PRQL result differs from the others: the column named date elsewhere is named date_new in PRQL. A different name is needed because reusing the column name date here would add a new column called date:1 instead of updating the original date column.
In DuckDB SQL, the REPLACE clause lets us overwrite the original column while keeping the same name.
The SQL generated by the PRQL compiler looks like this:
WITH games AS (
  SELECT
    * EXCLUDE (date),
    CAST(strptime(date, '%a, %b %d, %Y') AS date) AS date_new,
    ROW_NUMBER() OVER (
      ORDER BY
        CAST(strptime(date, '%a, %b %d, %Y') AS date)
    ) AS game_id
  FROM
    tab
  WHERE
    COLUMNS(*) IS NOT NULL
)
SELECT
  *
FROM
  games
ORDER BY
  date_new
LIMIT
  5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)
Tidy Data
- PRQL does not yet support PIVOT and UNPIVOT. (PRQL/prql#644)
- DuckDB SQL supports PIVOT and UNPIVOT >= 0.8. (duckdb/duckdb#6387)
Unpivot
Transforms the data from wide format to long format. This transformation is known by names such as unpivot, pivot longer, and melt.
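Before diving into the four implementations, here is a minimal, hypothetical Polars sketch of what wide-to-long reshaping does; the column names mirror the real query below, but the data is made up:

import polars as pl

# Hypothetical wide data: one row per game, two team columns.
wide = pl.DataFrame(
    {
        "game_id": [1, 2],
        "away_team": ["A", "B"],
        "home_team": ["C", "D"],
    }
)

long = wide.unpivot(
    index="game_id",                # identifier column(s) kept as-is
    on=["away_team", "home_team"],  # columns stacked into rows
    variable_name="variable",       # which original column the row came from
    value_name="team",              # the stacked values
)
# Each game now occupies two rows: one for the away team, one for the home team.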
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
The games table used in this query was defined in Section 2 with SQL, not with PRQL.
from s"""
SELECT *
FROM (
PIVOT_LONGER games
ON away_team, home_team
INTO
NAME variable
VALUE team
)
"""
group {team} (
sort this.date
derive rest = this.date - (this.date | lag 1) - 1
)
select !{away_points, home_points}
filter rest != null
sort game_id
take 5
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 |
18 | 2015-10-29 | home_team | Indiana Pacers | 0 |
CREATE TABLE tidy AS (
  WITH _tab1 AS (
    PIVOT_LONGER games
    ON away_team, home_team
    INTO
      NAME variable
      VALUE team
  ),
  _tab2 AS (
    SELECT
      COLUMNS(x -> NOT suffix(x, '_points'))
    FROM _tab1
  ),
  _tab3 AS (
    SELECT
      *,
      date - lag(date) OVER (PARTITION BY team ORDER BY date) - 1 AS rest
    FROM _tab2
  )
  SELECT *
  FROM _tab3
  WHERE rest IS NOT NULL
  ORDER BY game_id
)

FROM tidy
LIMIT 5
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 |
18 | 2015-10-29 | home_team | Indiana Pacers | 0 |
tidy <- games |>
  tidyr::pivot_longer(
    cols = c(away_team, home_team),
    names_to = "variable",
    values_to = "team"
  ) |>
  select(!ends_with("_points")) |>
  arrange(game_id) |>
  mutate(
    rest = date - lag(date) - 1,
    .by = team
  ) |>
  filter(!is.na(rest))

tidy |>
  head(5)
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 days |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 days |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 days |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 days |
18 | 2015-10-29 | away_team | Memphis Grizzlies | 0 days |
tidy = (
    games.unpivot(
        index=["game_id", "date"],
        on=["away_team", "home_team"],
        value_name="team",
    )
    .sort("game_id")
    .with_columns(
        rest=(pl.col("date").diff().over("team").dt.total_days() - 1).cast(pl.Int8)
    )
    .drop_nulls("rest")
)

tidy.head(5)
game_id | date | variable | team | rest |
---|---|---|---|---|
u32 | date | str | str | i8 |
5 | 2015-10-28 | "away_team" | "Chicago Bulls" | 0 |
6 | 2015-10-28 | "home_team" | "Detroit Pistons" | 0 |
11 | 2015-10-28 | "away_team" | "Cleveland Cavaliers" | 0 |
14 | 2015-10-28 | "away_team" | "New Orleans Pelicans" | 0 |
17 | 2015-10-29 | "away_team" | "Memphis Grizzlies" | 0 |
PRQL, SQL, and dplyr explicitly remove the unnecessary columns after unpivoting (columns that were dropped automatically in the original Polars and Pandas examples).
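In every implementation above, rest is the number of full days off between a team's consecutive games: the difference between game dates within each team, minus one. A minimal Polars sketch of that calculation on made-up data:

import polars as pl
from datetime import date

# Hypothetical single-team schedule.
schedule = pl.DataFrame(
    {
        "team": ["A", "A", "A"],
        "date": [date(2015, 10, 27), date(2015, 10, 28), date(2015, 10, 31)],
    }
)

# diff() gives the gap to the team's previous game; subtracting 1 leaves the days off.
rest = schedule.with_columns(
    rest=pl.col("date").diff().over("team").dt.total_days() - 1
)
# rest: null (first game), 0 (back-to-back), 2 (two days off)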
Pivot
Transforms the data from long format to wide format. This transformation is known by names such as pivot and pivot wider.
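As with the unpivot sketch above, here is a minimal, hypothetical Polars example of long-to-wide reshaping, mirroring the pivot call used below on made-up data:

import polars as pl

# Hypothetical long data: one row per (game, role).
long = pl.DataFrame(
    {
        "game_id": [1, 1, 2, 2],
        "variable": ["away_team", "home_team", "away_team", "home_team"],
        "rest": [0, 1, 2, None],
    }
)

wide = long.pivot(
    values="rest",    # cell values
    index="game_id",  # one output row per game
    on="variable",    # each distinct value becomes its own column
)
# Columns: game_id, away_team, home_team (now holding the rest values).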
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
from s"""
SELECT *
FROM (
PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
)
"""
derive {
away_rest = away_team,
home_rest = home_team
}
select !{
away_team,
home_team
}
join side:left games (==game_id && ==date)
derive {
home_win = games.home_points > games.away_points,
rest_spread = home_rest - away_rest
}
sort games.game_id
take 5
game_id | date | away_rest | home_rest | game_id | date | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 | NA | 6 | 2015-10-28 | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA |
7 | 2015-10-28 | NA | 0 | 7 | 2015-10-28 | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA |
12 | 2015-10-28 | 0 | NA | 12 | 2015-10-28 | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA |
15 | 2015-10-28 | 0 | NA | 15 | 2015-10-28 | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA |
18 | 2015-10-29 | 0 | 0 | 18 | 2015-10-29 | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 |
CREATE TABLE by_game AS (
  WITH _tab1 AS (
    PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
  )
  SELECT
    * EXCLUDE(away_team, home_team),
    away_team AS away_rest,
    home_team AS home_rest
  FROM _tab1
)

CREATE TABLE joined AS (
  SELECT
    *,
    home_points > away_points AS home_win,
    home_rest - away_rest AS rest_spread
  FROM by_game
  LEFT JOIN games USING (game_id, date)
  ORDER BY game_id
)

FROM joined
LIMIT 5
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 | NA | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA |
7 | 2015-10-28 | NA | 0 | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA |
12 | 2015-10-28 | 0 | NA | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA |
15 | 2015-10-28 | 0 | NA | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA |
18 | 2015-10-29 | 0 | 0 | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 |
by_game <- tidy |>
  tidyr::pivot_wider(
    id_cols = c("game_id", "date"),
    values_from = "rest",
    names_from = "variable"
  ) |>
  rename(
    away_rest = away_team,
    home_rest = home_team
  )

joined <- by_game |>
  left_join(games, by = c("game_id", "date")) |>
  mutate(
    home_win = home_points > away_points,
    rest_spread = home_rest - away_rest
  )

joined |>
  head(5)
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 days | NA days | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA days |
7 | 2015-10-28 | NA days | 0 days | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA days |
12 | 2015-10-28 | 0 days | NA days | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA days |
15 | 2015-10-28 | 0 days | NA days | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA days |
18 | 2015-10-29 | 0 days | 0 days | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 days |
by_game = tidy.pivot(
    values="rest", index=["game_id", "date"], on="variable"
).rename({"away_team": "away_rest", "home_team": "home_rest"})

joined = by_game.join(games, on=["game_id", "date"]).with_columns(
    home_win=pl.col("home_points") > pl.col("away_points"),
    rest_spread=pl.col("home_rest") - pl.col("away_rest"),
)

joined.head(5)
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
u32 | date | i8 | i8 | str | i64 | str | i64 | bool | i8 |
5 | 2015-10-28 | 0 | null | "Chicago Bulls" | 115 | "Brooklyn Nets" | 100 | false | null |
6 | 2015-10-28 | null | 0 | "Utah Jazz" | 87 | "Detroit Pistons" | 92 | true | null |
11 | 2015-10-28 | 0 | null | "Cleveland Cavaliers" | 106 | "Memphis Grizzlies" | 76 | false | null |
14 | 2015-10-28 | 0 | null | "New Orleans Pelicans" | 94 | "Portland Trail Blazers" | 112 | true | null |
17 | 2015-10-29 | 0 | 0 | "Memphis Grizzlies" | 112 | "Indiana Pacers" | 103 | false | 0 |
There are more columns in the PRQL result than in the other results because the generated SQL does not use USING for the join (PRQL/prql#1335).
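For contrast, the Polars version above passes the keys via on=..., so each join key appears only once in its output; a minimal sketch with made-up frames:

import polars as pl

# Hypothetical frames sharing a key column.
left = pl.DataFrame({"game_id": [1, 2], "home_rest": [0, 1]})
right = pl.DataFrame({"game_id": [1, 2], "home_points": [100, 95]})

# Keys listed in `on` are not duplicated in the result.
joined = left.join(right, on="game_id")
# Columns: game_id, home_rest, home_points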
The SQL generated by the PRQL compiler looks like this:
WITH table_0 AS (
  SELECT
    *
  FROM
    (
      PIVOT_WIDER tidy ON variable USING FIRST(rest)
      GROUP BY
        (game_id, date)
    )
),
table_1 AS (
  SELECT
    * EXCLUDE (away_team, home_team),
    away_team AS away_rest,
    home_team AS home_rest
  FROM
    table_0
)
SELECT
  table_1.*,
  games.*,
  games.home_points > games.away_points AS home_win,
  table_1.home_rest - table_1.away_rest AS rest_spread
FROM
  table_1
  LEFT JOIN games ON table_1.game_id = games.game_id
  AND table_1.date = games.date
ORDER BY
  games.game_id
LIMIT
  5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)