Reshaping and Tidy Data
This page is based on the chapter “Reshaping and Tidy Data” of the Modern Polars book.
Read the Data
The PRQL and SQL results shown on this page are converted to R DataFrames via knitr, so DuckDB types are mapped to R types; for example, NULL in DuckDB is displayed as NA.
Download
Download the data to be analyzed (tables on the website) and write the data to CSV files.
This document uses R to download the data from the source here, but we can also download and use the CSV files included in the kevinheavey/modern-polars GitHub repository.
- R
- Shell
nba_dir <- file.path("data", "nba")

months <- c(
  "october",
  "november",
  "december",
  "january",
  "february",
  "march",
  "april",
  "may",
  "june"
)

column_names <- c(
  date = "date",
  away_team = "visitor_neutral",
  away_points = "pts",
  home_team = "home_neutral",
  home_points = "pts_2"
)

.write_data <- function(month) {
  base_url <- "http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"

  glue::glue(base_url, month = month) |>
    rvest::read_html() |>
    rvest::html_table() |>
    (\(x) x[[1]])() |> # TODO: Rewrite after R 4.3
    janitor::clean_names() |>
    dplyr::select(all_of(column_names)) |>
    dplyr::filter(date != "Playoffs") |>
    readr::write_csv(file.path(nba_dir, glue::glue("{month}.csv")))
}

if (!fs::dir_exists(nba_dir)) {
  fs::dir_create(nba_dir)

  months |>
    purrr::walk(.write_data)
}
These are sample commands to download the CSV files from the kevinheavey/modern-polars GitHub repository.
mkdir data/nba
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/october.csv -o data/nba/october.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/november.csv -o data/nba/november.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/december.csv -o data/nba/december.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/january.csv -o data/nba/january.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/february.csv -o data/nba/february.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/march.csv -o data/nba/march.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/april.csv -o data/nba/april.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/may.csv -o data/nba/may.csv
curl -sL https://github.com/kevinheavey/modern-polars/raw/87539190dde3e99d5e4c4f9957c78932a33075a0/data/nba/june.csv -o data/nba/june.csv
Load the Data
After the CSV files are ready, load them into a DuckDB in-memory database table, an R DataFrame, and a Python polars.LazyFrame.
- DuckDB
- R DataFrame
- Python polars.LazyFrame
CREATE TABLE tab AS FROM read_csv_auto('data/nba/*.csv')
FROM tab
LIMIT 5
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
Fri, Apr 1, 2016 | Philadelphia 76ers | 91 | Charlotte Hornets | 100 |
Fri, Apr 1, 2016 | Dallas Mavericks | 98 | Detroit Pistons | 89 |
Fri, Apr 1, 2016 | Brooklyn Nets | 91 | New York Knicks | 105 |
Fri, Apr 1, 2016 | Cleveland Cavaliers | 110 | Atlanta Hawks | 108 |
Fri, Apr 1, 2016 | Toronto Raptors | 99 | Memphis Grizzlies | 95 |
library(dplyr, warn.conflicts = FALSE)
df <- readr::read_csv(
  fs::dir_ls("data/nba", glob = "*.csv")
)
df |> head(5)
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
Fri, Apr 1, 2016 | Philadelphia 76ers | 91 | Charlotte Hornets | 100 |
Fri, Apr 1, 2016 | Dallas Mavericks | 98 | Detroit Pistons | 89 |
Fri, Apr 1, 2016 | Brooklyn Nets | 91 | New York Knicks | 105 |
Fri, Apr 1, 2016 | Cleveland Cavaliers | 110 | Atlanta Hawks | 108 |
Fri, Apr 1, 2016 | Toronto Raptors | 99 | Memphis Grizzlies | 95 |
import polars as pl
lf = pl.scan_csv("data/nba/*.csv")
lf.head(5).collect()
date | away_team | away_points | home_team | home_points |
---|---|---|---|---|
str | str | i64 | str | i64 |
"Fri, Apr 1, 2016" | "Philadelphia 76ers" | 91 | "Charlotte Hornets" | 100 |
"Fri, Apr 1, 2016" | "Dallas Mavericks" | 98 | "Detroit Pistons" | 89 |
"Fri, Apr 1, 2016" | "Brooklyn Nets" | 91 | "New York Knicks" | 105 |
"Fri, Apr 1, 2016" | "Cleveland Cavaliers" | 110 | "Atlanta Hawks" | 108 |
"Fri, Apr 1, 2016" | "Toronto Raptors" | 99 | "Memphis Grizzlies" | 95 |
Cleaning
Convert the date column to date type and delete rows containing missing values (null).
PRQL does not have a “remove rows with missing values in any column” syntax (PRQL/prql#2386), but DuckDB SQL does (>= 0.8, duckdb/duckdb#6621), so it can be used.
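For reference, the Polars version below drops such rows by checking every column for null; here is a minimal sketch of that idea on made-up data (as the code comment there notes, polars drop_nulls is an equivalent shortcut):

import polars as pl

# Hypothetical frame with missing values in different columns.
df = pl.DataFrame({"a": [1, None, 3], "b": ["x", "y", None]})

# Keep only rows where no column is null.
kept = df.filter(~pl.any_horizontal(pl.all().is_null()))
# Equivalent here: df.drop_nulls()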
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
let games = (
  from tab
  filter s"COLUMNS(*) IS NOT NULL"
  derive date_new = (s"strptime(date, '%a, %b %d, %Y')" | as date)
  select !{this.date} # `this` refers to the current relation
  sort date_new
  derive game_id = (row_number this)
)

from games
take 5
away_team | away_points | home_team | home_points | date_new | game_id |
---|---|---|---|---|---|
Cleveland Cavaliers | 95 | Chicago Bulls | 97 | 2015-10-27 | 1 |
Detroit Pistons | 106 | Atlanta Hawks | 94 | 2015-10-27 | 2 |
New Orleans Pelicans | 95 | Golden State Warriors | 111 | 2015-10-27 | 3 |
Washington Wizards | 88 | Orlando Magic | 87 | 2015-10-28 | 4 |
Philadelphia 76ers | 95 | Boston Celtics | 112 | 2015-10-28 | 5 |
CREATE TABLE games AS (
  WITH _tab1 AS (
    SELECT
      * REPLACE (strptime(date, '%a, %b %d, %Y')::date AS date)
    FROM tab
    WHERE COLUMNS(*) IS NOT NULL
  )
  SELECT
    row_number() OVER(ORDER BY date) AS game_id,
    *
  FROM _tab1
  ORDER BY date
)

FROM games
LIMIT 5
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
1 | 2015-10-27 | Cleveland Cavaliers | 95 | Chicago Bulls | 97 |
2 | 2015-10-27 | Detroit Pistons | 106 | Atlanta Hawks | 94 |
3 | 2015-10-27 | New Orleans Pelicans | 95 | Golden State Warriors | 111 |
4 | 2015-10-28 | Washington Wizards | 88 | Orlando Magic | 87 |
5 | 2015-10-28 | Philadelphia 76ers | 95 | Boston Celtics | 112 |
games <- df |>
  filter(!if_any(everything(), is.na)) |> # Also can use `tidyr::drop_na`
  mutate(
    date = lubridate::parse_date_time(date, "%a, %b %d, %Y") |>
      lubridate::as_date()
  ) |>
  arrange(date) |>
  mutate(game_id = row_number(), .before = 1)

games |>
  head(5)
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
1 | 2015-10-27 | Cleveland Cavaliers | 95 | Chicago Bulls | 97 |
2 | 2015-10-27 | Detroit Pistons | 106 | Atlanta Hawks | 94 |
3 | 2015-10-27 | New Orleans Pelicans | 95 | Golden State Warriors | 111 |
4 | 2015-10-28 | Washington Wizards | 88 | Orlando Magic | 87 |
5 | 2015-10-28 | Philadelphia 76ers | 95 | Boston Celtics | 112 |
games = (
    lf.filter(~pl.any_horizontal(pl.all().is_null()))  # Also can use `polars.LazyFrame.drop_nulls`
    .with_columns(
        pl.col("date").str.strptime(pl.Date, "%a, %b %d, %Y"),
    )
    .sort("date")
    .with_row_index("game_id")
    .collect()
)

games.head(5)
game_id | date | away_team | away_points | home_team | home_points |
---|---|---|---|---|---|
u32 | date | str | i64 | str | i64 |
0 | 2015-10-27 | "Cleveland Cavaliers" | 95 | "Chicago Bulls" | 97 |
1 | 2015-10-27 | "Detroit Pistons" | 106 | "Atlanta Hawks" | 94 |
2 | 2015-10-27 | "New Orleans Pelicans" | 95 | "Golden State Warriors" | 111 |
3 | 2015-10-28 | "Washington Wizards" | 88 | "Orlando Magic" | 87 |
4 | 2015-10-28 | "Philadelphia 76ers" | 95 | "Boston Celtics" | 112 |
Looking at the result tables, we notice that the PRQL result differs from the others: the column named date elsewhere is named date_new in PRQL. A different name is needed because reusing the column name date here would add a new column called date:1 instead of updating the original date column.
In DuckDB SQL, the REPLACE clause lets us overwrite the original column while keeping the same name.
The SQL generated by the PRQL compiler looks like this:
WITH games AS (
  SELECT
    * EXCLUDE (date),
    CAST(strptime(date, '%a, %b %d, %Y') AS date) AS date_new,
    ROW_NUMBER() OVER (
      ORDER BY
        CAST(strptime(date, '%a, %b %d, %Y') AS date)
    ) AS game_id
  FROM
    tab
  WHERE
    COLUMNS(*) IS NOT NULL
)
SELECT
  *
FROM
  games
ORDER BY
  date_new
LIMIT
  5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)
Tidy Data
- PRQL does not yet support PIVOT and UNPIVOT. (PRQL/prql#644)
- DuckDB SQL supports PIVOT and UNPIVOT >= 0.8. (duckdb/duckdb#6387)
Unpivot
Transforms the data from wide format to long format. This transformation is known by names such as unpivot, pivot longer, and melt.
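Before diving into the four implementations, here is a minimal, hypothetical Polars sketch of what wide-to-long reshaping does; the column names mirror the real query below, but the data is made up:

import polars as pl

# Hypothetical wide data: one row per game, two team columns.
wide = pl.DataFrame(
    {
        "game_id": [1, 2],
        "away_team": ["A", "B"],
        "home_team": ["C", "D"],
    }
)

long = wide.unpivot(
    index="game_id",                # identifier column(s) kept as-is
    on=["away_team", "home_team"],  # columns stacked into rows
    variable_name="variable",       # which original column the row came from
    value_name="team",              # the stacked values
)
# Each game now occupies two rows: one for the away team, one for the home team.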
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
The games table used in this query was defined in Section 2 with SQL, not with PRQL.
from s"""
SELECT *
FROM (
PIVOT_LONGER games
ON away_team, home_team
INTO
NAME variable
VALUE team
)
"""
group {team} (
sort this.date
derive rest = this.date - (this.date | lag 1) - 1
)
select !{away_points, home_points}
filter rest != null
sort game_id
take 5
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 |
18 | 2015-10-29 | home_team | Indiana Pacers | 0 |
CREATE TABLE tidy AS (
  WITH _tab1 AS (
    PIVOT_LONGER games
    ON away_team, home_team
    INTO
      NAME variable
      VALUE team
  ),
  _tab2 AS (
    SELECT
      COLUMNS(x -> NOT suffix(x, '_points'))
    FROM _tab1
  ),
  _tab3 AS (
    SELECT
      *,
      date - lag(date) OVER (PARTITION BY team ORDER BY date) - 1 AS rest
    FROM _tab2
  )
  SELECT *
  FROM _tab3
  WHERE rest IS NOT NULL
  ORDER BY game_id
)

FROM tidy
LIMIT 5
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 |
18 | 2015-10-29 | home_team | Indiana Pacers | 0 |
tidy <- games |>
  tidyr::pivot_longer(
    cols = c(away_team, home_team),
    names_to = "variable",
    values_to = "team"
  ) |>
  select(!ends_with("_points")) |>
  arrange(game_id) |>
  mutate(
    rest = date - lag(date) - 1,
    .by = team
  ) |>
  filter(!is.na(rest))

tidy |>
  head(5)
game_id | date | variable | team | rest |
---|---|---|---|---|
6 | 2015-10-28 | away_team | Chicago Bulls | 0 days |
7 | 2015-10-28 | home_team | Detroit Pistons | 0 days |
12 | 2015-10-28 | away_team | Cleveland Cavaliers | 0 days |
15 | 2015-10-28 | away_team | New Orleans Pelicans | 0 days |
18 | 2015-10-29 | away_team | Memphis Grizzlies | 0 days |
tidy = (
    games.unpivot(
        index=["game_id", "date"],
        on=["away_team", "home_team"],
        value_name="team",
    )
    .sort("game_id")
    .with_columns(
        rest=(pl.col("date").diff().over("team").dt.total_days() - 1).cast(pl.Int8)
    )
    .drop_nulls("rest")
)

tidy.head(5)
game_id | date | variable | team | rest |
---|---|---|---|---|
u32 | date | str | str | i8 |
5 | 2015-10-28 | "away_team" | "Chicago Bulls" | 0 |
6 | 2015-10-28 | "home_team" | "Detroit Pistons" | 0 |
11 | 2015-10-28 | "away_team" | "Cleveland Cavaliers" | 0 |
14 | 2015-10-28 | "away_team" | "New Orleans Pelicans" | 0 |
17 | 2015-10-29 | "away_team" | "Memphis Grizzlies" | 0 |
PRQL, SQL, and dplyr explicitly remove the unnecessary columns after unpivoting (columns that were dropped automatically in the original Polars and Pandas examples).
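In every implementation above, rest is the number of full days off between a team's consecutive games: the difference between game dates within each team, minus one. A minimal Polars sketch of that calculation on made-up data:

import polars as pl
from datetime import date

# Hypothetical single-team schedule.
schedule = pl.DataFrame(
    {
        "team": ["A", "A", "A"],
        "date": [date(2015, 10, 27), date(2015, 10, 28), date(2015, 10, 31)],
    }
)

# diff() gives the gap to the team's previous game; subtracting 1 leaves the days off.
rest = schedule.with_columns(
    rest=pl.col("date").diff().over("team").dt.total_days() - 1
)
# rest: null (first game), 0 (back-to-back), 2 (two days off)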
Pivot
Transforms the data from long format to wide format. This transformation is known by names such as pivot and pivot wider.
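As with the unpivot sketch above, here is a minimal, hypothetical Polars example of long-to-wide reshaping, mirroring the pivot call used below on made-up data:

import polars as pl

# Hypothetical long data: one row per (game, role).
long = pl.DataFrame(
    {
        "game_id": [1, 1, 2, 2],
        "variable": ["away_team", "home_team", "away_team", "home_team"],
        "rest": [0, 1, 2, None],
    }
)

wide = long.pivot(
    values="rest",    # cell values
    index="game_id",  # one output row per game
    on="variable",    # each distinct value becomes its own column
)
# Columns: game_id, away_team, home_team (now holding the rest values).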
- PRQL DuckDB
- SQL DuckDB
- dplyr R
- Python Polars
from s"""
SELECT *
FROM (
PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
)
"""
derive {
away_rest = away_team,
home_rest = home_team
}
select !{
away_team,
home_team
}
join side:left games (==game_id && ==date)
derive {
home_win = games.home_points > games.away_points,
rest_spread = home_rest - away_rest
}
sort games.game_id
take 5
game_id | date | away_rest | home_rest | game_id | date | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 | NA | 6 | 2015-10-28 | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA |
7 | 2015-10-28 | NA | 0 | 7 | 2015-10-28 | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA |
12 | 2015-10-28 | 0 | NA | 12 | 2015-10-28 | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA |
15 | 2015-10-28 | 0 | NA | 15 | 2015-10-28 | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA |
18 | 2015-10-29 | 0 | 0 | 18 | 2015-10-29 | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 |
CREATE TABLE by_game AS (
  WITH _tab1 AS (
    PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
  )
  SELECT
    * EXCLUDE(away_team, home_team),
    away_team AS away_rest,
    home_team AS home_rest
  FROM _tab1
)

CREATE TABLE joined AS (
  SELECT
    *,
    home_points > away_points AS home_win,
    home_rest - away_rest AS rest_spread
  FROM by_game
  LEFT JOIN games USING (game_id, date)
  ORDER BY game_id
)

FROM joined
LIMIT 5
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 | NA | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA |
7 | 2015-10-28 | NA | 0 | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA |
12 | 2015-10-28 | 0 | NA | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA |
15 | 2015-10-28 | 0 | NA | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA |
18 | 2015-10-29 | 0 | 0 | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 |
by_game <- tidy |>
  tidyr::pivot_wider(
    id_cols = c("game_id", "date"),
    values_from = "rest",
    names_from = "variable"
  ) |>
  rename(
    away_rest = away_team,
    home_rest = home_team
  )

joined <- by_game |>
  left_join(games, by = c("game_id", "date")) |>
  mutate(
    home_win = home_points > away_points,
    rest_spread = home_rest - away_rest
  )

joined |>
  head(5)
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
6 | 2015-10-28 | 0 days | NA days | Chicago Bulls | 115 | Brooklyn Nets | 100 | FALSE | NA days |
7 | 2015-10-28 | NA days | 0 days | Utah Jazz | 87 | Detroit Pistons | 92 | TRUE | NA days |
12 | 2015-10-28 | 0 days | NA days | Cleveland Cavaliers | 106 | Memphis Grizzlies | 76 | FALSE | NA days |
15 | 2015-10-28 | 0 days | NA days | New Orleans Pelicans | 94 | Portland Trail Blazers | 112 | TRUE | NA days |
18 | 2015-10-29 | 0 days | 0 days | Memphis Grizzlies | 112 | Indiana Pacers | 103 | FALSE | 0 days |
by_game = tidy.pivot(
    values="rest", index=["game_id", "date"], on="variable"
).rename({"away_team": "away_rest", "home_team": "home_rest"})

joined = by_game.join(games, on=["game_id", "date"]).with_columns(
    home_win=pl.col("home_points") > pl.col("away_points"),
    rest_spread=pl.col("home_rest") - pl.col("away_rest"),
)

joined.head(5)
game_id | date | away_rest | home_rest | away_team | away_points | home_team | home_points | home_win | rest_spread |
---|---|---|---|---|---|---|---|---|---|
u32 | date | i8 | i8 | str | i64 | str | i64 | bool | i8 |
5 | 2015-10-28 | 0 | null | "Chicago Bulls" | 115 | "Brooklyn Nets" | 100 | false | null |
6 | 2015-10-28 | null | 0 | "Utah Jazz" | 87 | "Detroit Pistons" | 92 | true | null |
11 | 2015-10-28 | 0 | null | "Cleveland Cavaliers" | 106 | "Memphis Grizzlies" | 76 | false | null |
14 | 2015-10-28 | 0 | null | "New Orleans Pelicans" | 94 | "Portland Trail Blazers" | 112 | true | null |
17 | 2015-10-29 | 0 | 0 | "Memphis Grizzlies" | 112 | "Indiana Pacers" | 103 | false | 0 |
There are more columns in the PRQL result than in the other results because the generated SQL does not use USING for the join (PRQL/prql#1335).
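For contrast, the Polars version above passes the keys via on=..., so each join key appears only once in its output; a minimal sketch with made-up frames:

import polars as pl

# Hypothetical frames sharing a key column.
left = pl.DataFrame({"game_id": [1, 2], "home_rest": [0, 1]})
right = pl.DataFrame({"game_id": [1, 2], "home_points": [100, 95]})

# Keys listed in `on` are not duplicated in the result.
joined = left.join(right, on="game_id")
# Columns: game_id, home_rest, home_points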
The SQL generated by the PRQL compiler looks like this:
WITH table_0 AS (
  SELECT
    *
  FROM
    (
      PIVOT_WIDER tidy ON variable USING FIRST(rest)
      GROUP BY
        (game_id, date)
    )
),
table_1 AS (
  SELECT
    * EXCLUDE (away_team, home_team),
    away_team AS away_rest,
    home_team AS home_rest
  FROM
    table_0
)
SELECT
  table_1.*,
  games.*,
  games.home_points > games.away_points AS home_win,
  table_1.home_rest - table_1.away_rest AS rest_spread
FROM
  table_1
  LEFT JOIN games ON table_1.game_id = games.game_id
  AND table_1.date = games.date
ORDER BY
  games.game_id
LIMIT
  5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)