
Timeseries

note

This page is based on the chapter “Timeseries” of the Modern Polars book.

Preparing Data

important

The PRQL and SQL results shown on this page are converted to R DataFrames via knitr before rendering, so DuckDB types are mapped to R types. For example, NULL in DuckDB is shown as NA.

Download

Download the data from the Binance REST API and write it to a Parquet file.

This document uses R to download the data from the source here, but we can also download and use the Parquet file included in the kevinheavey/modern-polars GitHub repository.

Code
R
data_path <- "data/ohlcv.parquet"

if (!fs::file_exists(data_path)) {
  fs::dir_create(fs::path_dir(data_path))

  .epoch_ms <- function(dt) {
    dt |>
      lubridate::as_datetime() |>
      (\(x) (as.integer(x) * 1000))()
  }

  .start <- lubridate::make_datetime(2021, 1, 1) |> .epoch_ms()
  .end <- lubridate::make_datetime(2022, 1, 1) |> .epoch_ms()

  .url <- glue::glue(
    "https://api.binance.com/api/v3/klines?symbol=BTCUSDT&",
    "interval=1d&startTime={.start}&endTime={.end}"
  )

  .res <- jsonlite::read_json(.url)

  time_col <- "time"
  ohlcv_cols <- c(
    "open",
    "high",
    "low",
    "close",
    "volume"
  )
  cols_to_use <- c(time_col, ohlcv_cols)
  cols <- c(cols_to_use, glue::glue("ignore_{i}", i = 1:6))

  df <- .res |>
    tibble::enframe(name = NULL) |>
    tidyr::unnest_wider(value, names_sep = "_") |>
    rlang::set_names({{ cols }}) |>
    dplyr::mutate(
      dplyr::across({{ time_col }}, \(x) lubridate::as_datetime(x / 1000) |> lubridate::as_date()),
      dplyr::across({{ ohlcv_cols }}, as.numeric),
      .keep = "none"
    )

  # Unlike the Python client, the duckdb R client does not (yet) have automatic DataFrame registration.
  # (duckdb/duckdb#6771)
  con_tmp <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
  duckdb::duckdb_register(con_tmp, "df", df)
  DBI::dbExecute(con_tmp, glue::glue("COPY df TO '{data_path}' (FORMAT PARQUET)"))
  DBI::dbDisconnect(con_tmp)
}

Load the Data

After the Parquet file is ready, load it into a DuckDB (in-memory) database table, an R DataFrame, and a Python polars.LazyFrame.

SQL
CREATE TABLE tab AS FROM 'data/ohlcv.parquet'
SQL
FROM tab
LIMIT 5
time open high low close volume
2021-01-01 28923.63 29600.00 28624.57 29331.69 54182.93
2021-01-02 29331.70 33300.00 28946.53 32178.33 129993.87
2021-01-03 32176.45 34778.11 31962.99 33000.05 120957.57
2021-01-04 33000.05 33600.00 28130.00 31988.71 140899.89
2021-01-05 31989.75 34360.00 29900.00 33949.53 116050.00
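
For the dplyr sketches that follow, the same Parquet file can also be read into an R data frame, for example through DuckDB. This is a minimal sketch; the object name df is an assumption of these sketches:

R
# Read the Parquet file into an R data frame via an in-memory DuckDB connection.
con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
df <- DBI::dbGetQuery(con, "FROM 'data/ohlcv.parquet'")
DBI::dbDisconnect(con)

head(df, 5)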

Filtering

PRQL
from tab
filter s"date_part(['year', 'month'], time) = {{year: 2021, month: 2}}"
take 5
time open high low close volume
2021-02-01 33092.97 34717.27 32296.16 33526.37 82718.28
2021-02-02 33517.09 35984.33 33418.00 35466.24 78056.66
2021-02-03 35472.71 37662.63 35362.38 37618.87 80784.33
2021-02-04 37620.26 38708.27 36161.95 36936.66 92080.74
2021-02-05 36936.65 38310.12 36570.00 38290.24 66681.33
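
For comparison, a dplyr equivalent might look like the following sketch (assuming the data is also available as the R data frame df loaded above):

R
# Keep only rows from February 2021.
df |>
  dplyr::filter(
    lubridate::year(time) == 2021,
    lubridate::month(time) == 2
  ) |>
  head(5)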

Downsampling

It is important to note carefully how units such as 5 days or 1 week actually work: where counting of 5 days or 1 week starts can be completely different in each system.

Here, we should note that time_bucket in DuckDB, lubridate::floor_date in R, and group_by_dynamic in Polars have completely different default origins.

  • The DuckDB function time_bucket’s origin defaults to 2000-01-03 00:00:00+00 for intervals of days or weeks.1

  • In the R lubridate::floor_date function, when "5 days" is passed to the unit argument, the timestamp is floored based on the number of days elapsed since the beginning of each month.

    R
    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
    lubridate::floor_date("5 days")
    [1] "2023-01-31" "2023-02-01"

    And when "1 week" to the unit argument, it is floored to the nearest week, Sunday through Saturday.

    R
    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
    lubridate::floor_date("1 week")
    [1] "2023-01-29" "2023-01-29"

    To start from an arbitrary origin, all breaks must be specified as a vector in the unit argument.2

    R
    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
    lubridate::floor_date(lubridate::make_date(2023, 1, 31))
    [1] "2023-01-31" "2023-01-31"
  • In Polars’ group_by_dynamic, the offset parameter specifies the origin point and defaults to the negative of every.3

PRQL
from tab
derive {
time_new = s"""
time_bucket(INTERVAL '5 days', time, (FROM tab SELECT min(time)))
"""
}
group {time_new} (
aggregate {
open = average open,
high = average high,
low = average low,
close = average close,
volume = average volume
}
)
sort time_new
take 5
time_new open high low close volume
2021-01-01 31084.32 33127.62 29512.82 32089.66 112416.85
2021-01-06 38165.31 40396.84 35983.82 39004.54 118750.08
2021-01-11 36825.23 38518.10 33288.05 36542.76 146166.70
2021-01-16 36216.36 37307.53 34650.37 35962.92 81236.80
2021-01-21 32721.53 34165.71 30624.23 32077.48 97809.66
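
A dplyr version of this 5-day downsampling might be sketched as follows, using a vector of breaks as the unit of lubridate::floor_date so that counting starts at the first timestamp (assuming the R data frame df loaded above):

R
df |>
  dplyr::mutate(
    # Breaks every 5 days, starting from the earliest timestamp in the data.
    time_new = lubridate::floor_date(
      time,
      unit = seq(min(time), max(time), by = "5 days")
    )
  ) |>
  dplyr::summarise(
    dplyr::across(c(open, high, low, close, volume), mean),
    .by = time_new
  ) |>
  dplyr::arrange(time_new) |>
  head(5)
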
PRQL
from tab
derive {
time_new = s"""
time_bucket(INTERVAL '7 days', time, (FROM tab SELECT min(time)))
"""
}
group {time_new} (
aggregate {
open_mean = average open,
high_mean = average high,
low_mean = average low,
close_mean = average close,
volume_mean = average volume,
open_sum = sum open,
high_sum = sum high,
low_sum = sum low,
close_sum = sum close,
volume_sum = sum volume
}
)
sort time_new
take 5
time_new open_mean high_mean low_mean close_mean volume_mean open_sum high_sum low_sum close_sum volume_sum
2021-01-01 32305.78 34706.05 31021.73 33807.14 117435.59 226140.5 242942.3 217152.1 236650.0 822049.1
2021-01-08 37869.80 39646.11 34623.33 37827.52 135188.30 265088.6 277522.7 242363.3 264792.6 946318.1
2021-01-15 36527.89 37412.20 33961.55 35343.85 94212.72 255695.2 261885.4 237730.9 247406.9 659489.0
2021-01-22 31888.55 33498.81 30424.48 32248.01 89649.94 223219.8 234491.7 212971.4 225736.0 627549.6
2021-01-29 34511.48 36411.44 33450.08 35022.31 102728.42 241580.4 254880.1 234150.6 245156.2 719099.0

Upsampling

Generating sequential values with a function like generate_series and then joining them to the original table is a general-purpose approach.

In R, we can also use dedicated functions like timetk::pad_by_time.

important

This example does not work with prql-compiler 0.11.1. (PRQL/prql#3129)

PRQL
let _tab1 = s"""
SELECT
generate_series(
(SELECT min(time)),
(SELECT max(time)),
INTERVAL '6 hours'
).unnest() AS time
FROM tab
"""

from _tab1
join side:left tab (==time)
sort tab.time
select !{tab.time}
take 5
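
The same idea can be sketched in R without a dedicated helper: generate a 6-hour sequence covering the data range and left-join the daily rows onto it. The names df_ts and grid are illustrative, and df is the R data frame loaded above:

R
# Convert the date column to a datetime so a 6-hour grid can be built.
df_ts <- df |>
  dplyr::mutate(time = lubridate::as_datetime(time))

grid <- tibble::tibble(
  time = seq(min(df_ts$time), max(df_ts$time), by = "6 hours")
)

# Rows without a matching daily observation are filled with NA.
grid |>
  dplyr::left_join(df_ts, by = "time") |>
  head(5)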

Window Functions

It is necessary to be careful about how window functions calculate results when the window contains fewer rows than the specified width.

Moving Average, Cumulative Average

PRQL has a dedicated way of applying a window to the entire table. For the others, use an individual function for each column.

In R, base R has some window functions like cumsum, but none for the cumulative average. dplyr complements this with several functions, including cummean.

Polars does not yet have a dedicated function to compute cumulative averages, so we have to derive them from cumulative sums.

note

The original Modern Pandas post and the Modern Polars book also have an exponentially weighted (EW) calculation example. But DuckDB does not have a dedicated function for this, so it is omitted here.

PRQL
from tab
sort this.time
window rolling:28 (
derive {`28D MA` = average close}
)
window rows:..0 (
derive {`Expanding Average` = average close}
)
select {
this.time,
Raw = close,
`28D MA`,
`Expanding Average`
}
take 26..30
time Raw 28D MA Expanding Average
2021-01-26 32467.77 35032.87 35032.87
2021-01-27 30366.15 34860.03 34860.03
2021-01-28 33364.86 34806.63 34806.63
2021-01-29 34252.20 34982.36 34787.51
2021-01-30 34262.88 35056.81 34770.02

Here, DuckDB calculates an average for the 28D MA column even where the window contains fewer than 28 rows, whereas R’s slider::slide_vec(.complete = TRUE) and Polars’ rolling_mean return missing values in those cases. If we are using DuckDB and need NULL there instead, we have to add further processing.
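
A dplyr version might be sketched as follows; the object name roll_and_expand matches the one plotted below, and the slider package is assumed:

R
roll_and_expand <- df |>
  dplyr::arrange(time) |>
  dplyr::transmute(
    time,
    Raw = close,
    # 28-row window (current row plus 27 before); incomplete windows become NA.
    `28D MA` = slider::slide_vec(close, mean, .before = 27, .complete = TRUE),
    `Expanding Average` = dplyr::cummean(close)
  )

roll_and_expand[26:30, ]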

Plotting the results of dplyr shows the following.

Code
R
library(ggplot2)

roll_and_expand |>
tidyr::pivot_longer(cols = !time) |>
ggplot(aes(time, value, colour = name)) +
geom_line() +
theme_linedraw() +
labs(y = "Close ($)") +
scale_x_date(
date_breaks = "month",
labels = scales::label_date_short()
)

Combining Rolling Aggregations

PRQL
from tab
sort this.time
window rows:-15..14 (
select {
this.time,
mean = average close,
std = stddev close
}
)
take 13..17
time mean std
2021-01-13 34860.03 3092.516
2021-01-14 34806.63 3047.833
2021-01-15 34787.51 2994.683
2021-01-16 34770.02 2944.156
2021-01-17 34895.40 2780.095

As in the Moving Average, Cumulative Average section above, the DuckDB results here differ from the others.
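
A dplyr version might be sketched as follows; the object name mean_std matches the one plotted below, and slider is assumed, with a centred window of 15 rows before and 14 rows after the current row:

R
mean_std <- df |>
  dplyr::arrange(time) |>
  dplyr::transmute(
    time,
    mean = slider::slide_vec(close, mean, .before = 15, .after = 14, .complete = TRUE),
    std = slider::slide_vec(close, sd, .before = 15, .after = 14, .complete = TRUE)
  )

mean_std[13:17, ]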

Plotting the results of dplyr shows the following.

Code
R
library(ggplot2)

mean_std |>
ggplot(aes(time)) +
geom_ribbon(
aes(ymin = mean - std, ymax = mean + std),
alpha = 0.3, fill = "blue"
) +
geom_line(aes(y = mean), color = "blue") +
theme_linedraw() +
labs(y = "Close ($)") +
scale_x_date(
date_breaks = "month",
labels = scales::label_date_short()
)

Timezones

important

In DuckDB, the icu extension is needed for time zone support. If the DuckDB client we are using does not bundle the extension, we need to install and load it.

SQL
INSTALL 'icu'
SQL
LOAD 'icu'
PRQL
let timezone = tz col -> s"timezone({tz}, {col})"

from tab
derive {
time_new = (this.time | timezone "UTC" | timezone "US/Eastern")
}
select !{this.time}
take 5
open high low close volume time_new
28923.63 29600.00 28624.57 29331.69 54182.93 2020-12-31 19:00:00
29331.70 33300.00 28946.53 32178.33 129993.87 2021-01-01 19:00:00
32176.45 34778.11 31962.99 33000.05 120957.57 2021-01-02 19:00:00
33000.05 33600.00 28130.00 31988.71 140899.89 2021-01-03 19:00:00
31989.75 34360.00 29900.00 33949.53 116050.00 2021-01-04 19:00:00

Note that each system may keep time zone information in a different way. Here, the time column (and the time_new column) in the DuckDB results is of the TIMESTAMP type, which has no time zone information.
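
For reference, the analogous conversion in R keeps the time zone as an attribute of the datetime vector. A sketch with lubridate, assuming the R data frame df loaded above:

R
df |>
  dplyr::mutate(
    # Interpret the timestamps as UTC, then display them in US/Eastern.
    time_new = time |>
      lubridate::as_datetime(tz = "UTC") |>
      lubridate::with_tz("US/Eastern")
  ) |>
  dplyr::select(!time) |>
  head(5)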

Footnotes

  1. https://duckdb.org/docs/sql/functions/timestamptz#icu-timestamp-with-time-zone-operators

  2. https://github.com/tidyverse/lubridate/issues/932

  3. https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.group_by_dynamic.html