Skip to main content

Reshaping and Tidy Data

note

This page is based on the chapter “Reshaping and Tidy Data” of the Modern Polars book.

Read the Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (tables on the website) and write the data to CSV files.

This document uses R to download the data from the source here, but we can also download and use the CSV files included in the kevinheavey/modern-polars GitHub repository.

Code
nba_dir <- file.path("data", "nba")

months <- c(
"october",
"november",
"december",
"january",
"february",
"march",
"april",
"may",
"june"
)

column_names <- c(
date = "date",
away_team = "visitor_neutral",
away_points = "pts",
home_team = "home_neutral",
home_points = "pts_2"
)

.write_data <- function(month) {
base_url <- "http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"

glue::glue(base_url, month = month) |>
rvest::read_html() |>
rvest::html_table() |>
(\(x) x[[1]])() |> # TODO: Rewrite after R 4.3
janitor::clean_names() |>
dplyr::select(all_of(column_names)) |>
dplyr::filter(date != "Playoffs") |>
readr::write_csv(file.path(nba_dir, glue::glue("{month}.csv")))
}

if (!fs::dir_exists(nba_dir)) {
fs::dir_create(nba_dir)
months |>
purrr::walk(.write_data)
}

Load the Data

After the CSV files are ready, load these into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

CREATE TABLE tab AS FROM read_csv_auto('data/nba/*.csv')
FROM tab
LIMIT 5
dateaway_teamaway_pointshome_teamhome_points
Fri, Jan 1, 2016Orlando Magic91Washington Wizards103
Fri, Jan 1, 2016Dallas Mavericks82Miami Heat106
Fri, Jan 1, 2016Charlotte Hornets94Toronto Raptors104
Fri, Jan 1, 2016New York Knicks81Chicago Bulls108
Fri, Jan 1, 2016Philadelphia 76ers84Los Angeles Lakers93

Cleaning

Convert the date column to date type and delete rows containing missing values (null).

PRQL does not have a “remove rows with missing values in any column” syntax (PRQL/prql#2386), but DuckDB SQL does (>= 0.8, duckdb/duckdb#6621), so it can be used.

let games = (
from tab
filter s"COLUMNS(*) IS NOT NULL"
derive date_new = (s"strptime(date, '%a, %b %d, %Y')" | as date)
select !{this.date} # `this` points to refer to current relation
sort date_new
derive game_id = (row_number this)
)

from games
take 5
away_teamaway_pointshome_teamhome_pointsdate_newgame_id
Cleveland Cavaliers95Chicago Bulls972015-10-271
Detroit Pistons106Atlanta Hawks942015-10-272
New Orleans Pelicans95Golden State Warriors1112015-10-273
Washington Wizards88Orlando Magic872015-10-284
Philadelphia 76ers95Boston Celtics1122015-10-285

Looking at the result tables, we notice that the PRQL result is different from the other results; A column named date in other results is named date_new in PRQL. This is because another name is needed to avoid the behavior that using the column name date here would add a new column called date:1 instead of updating the original date column.

In DuckDB SQL, we can use Replace Clause to update the original column with the same column name.

The SQL generated by the PRQL compiler looks like this:

SQL
WITH games AS (
SELECT
* EXCLUDE (date),
CAST(strptime(date, '%a, %b %d, %Y') AS date) AS date_new,
ROW_NUMBER() OVER (
ORDER BY
CAST(strptime(date, '%a, %b %d, %Y') AS date)
) AS game_id
FROM
tab
WHERE
COLUMNS(*) IS NOT NULL
)
SELECT
*
FROM
games
ORDER BY
date_new
LIMIT
5

-- Generated by PRQL compiler version:0.11.1 target:sql.duckdb (https://prql-lang.org)

Tidy Data

important

Unpivot

Transforms the data from wide format to long format. This transformation is called by names such as unpivot, pivot longer, and melt.

important

games in this query is defiened in Section 2 with SQL, not with PRQL.

from s"""
SELECT *
FROM (
PIVOT_LONGER games
ON away_team, home_team
INTO
NAME variable
VALUE team
)
"""
group {team} (
sort this.date
derive rest = this.date - (this.date | lag 1) - 1
)
select !{away_points, home_points}
filter rest != null
sort game_id
take 5
game_iddatevariableteamrest
62015-10-28away_teamChicago Bulls0
72015-10-28home_teamDetroit Pistons0
122015-10-28away_teamCleveland Cavaliers0
152015-10-28away_teamNew Orleans Pelicans0
182015-10-29away_teamMemphis Grizzlies0

PRQL, SQL and dplyr remove unnecessary columns after UNPIVOT (columns that were automatically removed in the original Polars and Pandas example).

Pivot

Transforms the data from long format to wide format. This transformation is called by names such as pivot, pivot wider.

important

tidy in this query is defiened in Section 3 with SQL, and games is defiened in Section 2 with SQL.

from s"""
SELECT *
FROM (
PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
)
"""
derive {
away_rest = away_team,
home_rest = home_team
}
select !{
away_team,
home_team
}
join side:left games (==game_id && ==date)
derive {
home_win = games.home_points > games.away_points,
rest_spread = home_rest - away_rest
}
sort games.game_id
take 5
game_iddateaway_resthome_restgame_iddateaway_teamaway_pointshome_teamhome_pointshome_winrest_spread
62015-10-280NA62015-10-28Chicago Bulls115Brooklyn Nets100FALSENA
72015-10-28NA072015-10-28Utah Jazz87Detroit Pistons92TRUENA
122015-10-280NA122015-10-28Cleveland Cavaliers106Memphis Grizzlies76FALSENA
152015-10-280NA152015-10-28New Orleans Pelicans94Portland Trail Blazers112TRUENA
182015-10-2900182015-10-29Memphis Grizzlies112Indiana Pacers103FALSE0

There are more columns in the PRQL result than in the other results. Because the output SQL is not using USING for joins (PRQL/prql#1335).

The SQL generated by the PRQL compiler looks like this:

SQL
WITH table_0 AS (
SELECT
*
FROM
(
PIVOT_WIDER tidy ON variable USING FIRST(rest)
GROUP BY
(game_id, date)
)
),
table_1 AS (
SELECT
* EXCLUDE (away_team, home_team),
away_team AS away_rest,
home_team AS home_rest
FROM
table_0
)
SELECT
table_1.*,
games.*,
games.home_points > games.away_points AS home_win,
table_1.home_rest - table_1.away_rest AS rest_spread
FROM
table_1
LEFT JOIN games ON table_1.game_id = games.game_id
AND table_1.date = games.date
ORDER BY
games.game_id
LIMIT
5

-- Generated by PRQL compiler version:0.11.1 target:sql.duckdb (https://prql-lang.org)