Skip to main content

Reshaping and Tidy Data

note

This page is based on the chapter “Reshaping and Tidy Data” of the Modern Polars book.

Read the Data

important

Since the PRQL and SQL results shown on this page are after being converted to R DataFrame via knitr, they have been converted from DuckDB types to R types. So NULL in DuckDB is shown as NA.

Download

Download the data to be analysis (tables on the website) and write the data to CSV files.

This document uses R to download the data from the source here, but we can also download and use the CSV files included in the kevinheavey/modern-polars GitHub repository.

Code
R
nba_dir <- file.path("data", "nba")

months <- c(
"october",
"november",
"december",
"january",
"february",
"march",
"april",
"may",
"june"
)

column_names <- c(
date = "date",
away_team = "visitor_neutral",
away_points = "pts",
home_team = "home_neutral",
home_points = "pts_2"
)

.write_data <- function(month) {
base_url <- "http://www.basketball-reference.com/leagues/NBA_2016_games-{month}.html"

glue::glue(base_url, month = month) |>
rvest::read_html() |>
rvest::html_table() |>
(\(x) x[[1]])() |> # TODO: Rewrite after R 4.3
janitor::clean_names() |>
dplyr::select(all_of(column_names)) |>
dplyr::filter(date != "Playoffs") |>
readr::write_csv(file.path(nba_dir, glue::glue("{month}.csv")))
}

if (!fs::dir_exists(nba_dir)) {
fs::dir_create(nba_dir)
months |>
purrr::walk(.write_data)
}

Load the Data

After the CSV files are ready, load these into DuckDB (in-memory) database table, R DataFrame, and Python polars.LazyFrame.

SQL
CREATE TABLE tab AS FROM read_csv_auto('data/nba/*.csv')
SQL
FROM tab
LIMIT 5
date away_team away_points home_team home_points
Fri, Apr 1, 2016 Philadelphia 76ers 91 Charlotte Hornets 100
Fri, Apr 1, 2016 Dallas Mavericks 98 Detroit Pistons 89
Fri, Apr 1, 2016 Brooklyn Nets 91 New York Knicks 105
Fri, Apr 1, 2016 Cleveland Cavaliers 110 Atlanta Hawks 108
Fri, Apr 1, 2016 Toronto Raptors 99 Memphis Grizzlies 95

Cleaning

Convert the date column to date type and delete rows containing missing values (null).

PRQL does not have a “remove rows with missing values in any column” syntax (PRQL/prql#2386), but DuckDB SQL does (>= 0.8, duckdb/duckdb#6621), so it can be used.

PRQL
let games = (
from tab
filter s"COLUMNS(*) IS NOT NULL"
derive date_new = (s"strptime(date, '%a, %b %d, %Y')" | as date)
select !{this.date} # `this` points to refer to current relation
sort date_new
derive game_id = (row_number this)
)

from games
take 5
away_team away_points home_team home_points date_new game_id
Cleveland Cavaliers 95 Chicago Bulls 97 2015-10-27 1
Detroit Pistons 106 Atlanta Hawks 94 2015-10-27 2
New Orleans Pelicans 95 Golden State Warriors 111 2015-10-27 3
Washington Wizards 88 Orlando Magic 87 2015-10-28 4
Philadelphia 76ers 95 Boston Celtics 112 2015-10-28 5

Looking at the result tables, we notice that the PRQL result is different from the other results; A column named date in other results is named date_new in PRQL. This is because another name is needed to avoid the behavior that using the column name date here would add a new column called date:1 instead of updating the original date column.

In DuckDB SQL, we can use Replace Clause to update the original column with the same column name.

The SQL generated by the PRQL compiler looks like this:

SQL
WITH games AS (
SELECT
* EXCLUDE (date),
CAST(strptime(date, '%a, %b %d, %Y') AS date) AS date_new,
ROW_NUMBER() OVER (
ORDER BY
CAST(strptime(date, '%a, %b %d, %Y') AS date)
) AS game_id
FROM
tab
WHERE
COLUMNS(*) IS NOT NULL
)
SELECT
*
FROM
games
ORDER BY
date_new
LIMIT
5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)

Tidy Data

important

Unpivot

Transforms the data from wide format to long format. This transformation is called by names such as unpivot, pivot longer, and melt.

important

games in this query is defiened in Section 2 with SQL, not with PRQL.

PRQL
from s"""
SELECT *
FROM (
PIVOT_LONGER games
ON away_team, home_team
INTO
NAME variable
VALUE team
)
"""
group {team} (
sort this.date
derive rest = this.date - (this.date | lag 1) - 1
)
select !{away_points, home_points}
filter rest != null
sort game_id
take 5
game_id date variable team rest
6 2015-10-28 away_team Chicago Bulls 0
7 2015-10-28 home_team Detroit Pistons 0
12 2015-10-28 away_team Cleveland Cavaliers 0
15 2015-10-28 away_team New Orleans Pelicans 0
18 2015-10-29 home_team Indiana Pacers 0

PRQL, SQL and dplyr remove unnecessary columns after UNPIVOT (columns that were automatically removed in the original Polars and Pandas example).

Pivot

Transforms the data from long format to wide format. This transformation is called by names such as pivot, pivot wider.

important

tidy in this query is defiened in Section 3 with SQL, and games is defiened in Section 2 with SQL.

PRQL
from s"""
SELECT *
FROM (
PIVOT_WIDER tidy ON variable USING FIRST(rest) GROUP BY (game_id, date)
)
"""
derive {
away_rest = away_team,
home_rest = home_team
}
select !{
away_team,
home_team
}
join side:left games (==game_id && ==date)
derive {
home_win = games.home_points > games.away_points,
rest_spread = home_rest - away_rest
}
sort games.game_id
take 5
game_id date away_rest home_rest game_id date away_team away_points home_team home_points home_win rest_spread
6 2015-10-28 0 NA 6 2015-10-28 Chicago Bulls 115 Brooklyn Nets 100 FALSE NA
7 2015-10-28 NA 0 7 2015-10-28 Utah Jazz 87 Detroit Pistons 92 TRUE NA
12 2015-10-28 0 NA 12 2015-10-28 Cleveland Cavaliers 106 Memphis Grizzlies 76 FALSE NA
15 2015-10-28 0 NA 15 2015-10-28 New Orleans Pelicans 94 Portland Trail Blazers 112 TRUE NA
18 2015-10-29 0 0 18 2015-10-29 Memphis Grizzlies 112 Indiana Pacers 103 FALSE 0

There are more columns in the PRQL result than in the other results. Because the output SQL is not using USING for joins (PRQL/prql#1335).

The SQL generated by the PRQL compiler looks like this:

SQL
WITH table_0 AS (
SELECT
*
FROM
(
PIVOT_WIDER tidy ON variable USING FIRST(rest)
GROUP BY
(game_id, date)
)
),
table_1 AS (
SELECT
* EXCLUDE (away_team, home_team),
away_team AS away_rest,
home_team AS home_rest
FROM
table_0
)
SELECT
table_1.*,
games.*,
games.home_points > games.away_points AS home_win,
table_1.home_rest - table_1.away_rest AS rest_spread
FROM
table_1
LEFT JOIN games ON table_1.game_id = games.game_id
AND table_1.date = games.date
ORDER BY
games.game_id
LIMIT
5

-- Generated by PRQL compiler version:0.13.0 target:sql.duckdb (https://prql-lang.org)