
Timeseries

note

This page is based on the chapter “Timeseries” of the Modern Polars book.

Preparing Data

important

Since the PRQL and SQL results shown on this page are converted to R DataFrames via knitr, DuckDB types are mapped to R types; for example, NULL in DuckDB is displayed as NA.

Download

Download the data from the Binance REST API and write it to a Parquet file.

This document uses R to download the data from the source, but we could instead download and use the Parquet file included in the kevinheavey/modern-polars GitHub repository.

Code
data_path <- "data/ohlcv.parquet"

if (!fs::file_exists(data_path)) {
  fs::dir_create(fs::path_dir(data_path))

  .epoch_ms <- function(dt) {
    dt |>
      lubridate::as_datetime() |>
      (\(x) (as.integer(x) * 1000))()
  }

  .start <- lubridate::make_datetime(2021, 1, 1) |> .epoch_ms()
  .end <- lubridate::make_datetime(2022, 1, 1) |> .epoch_ms()

  .url <- glue::glue(
    "https://api.binance.com/api/v3/klines?symbol=BTCUSDT&",
    "interval=1d&startTime={.start}&endTime={.end}"
  )

  .res <- jsonlite::read_json(.url)

  time_col <- "time"
  ohlcv_cols <- c(
    "open",
    "high",
    "low",
    "close",
    "volume"
  )
  cols_to_use <- c(time_col, ohlcv_cols)
  cols <- c(cols_to_use, glue::glue("ignore_{i}", i = 1:6))

  df <- .res |>
    tibble::enframe(name = NULL) |>
    tidyr::unnest_wider(value, names_sep = "_") |>
    rlang::set_names({{ cols }}) |>
    dplyr::mutate(
      dplyr::across({{ time_col }}, \(x) lubridate::as_datetime(x / 1000) |> lubridate::as_date()),
      dplyr::across({{ ohlcv_cols }}, as.numeric),
      .keep = "none"
    )

  # Unlike the Python client, the duckdb R client does not (yet) have automatic DataFrame registration.
  # (duckdb/duckdb#6771)
  con_tmp <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
  duckdb::duckdb_register(con_tmp, "df", df)
  duckdb:::sql(glue::glue("COPY df TO '{data_path}' (FORMAT PARQUET)"), con_tmp)
  DBI::dbDisconnect(con_tmp)
}

Load the Data

After the Parquet file is ready, load it into a DuckDB (in-memory) database table, an R DataFrame, and a Python polars.LazyFrame.

CREATE TABLE tab AS FROM 'data/ohlcv.parquet'
FROM tab
LIMIT 5
time         open      high      low       close      volume
2021-01-01   28923.63  29600.00  28624.57  29331.69    54182.93
2021-01-02   29331.70  33300.00  28946.53  32178.33   129993.87
2021-01-03   32176.45  34778.11  31962.99  33000.05   120957.57
2021-01-04   33000.05  33600.00  28130.00  31988.71   140899.89
2021-01-05   31989.75  34360.00  29900.00  33949.53   116050.00
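
On the R side, the same Parquet file can be read into a DataFrame named df, which the dplyr sketches later on this page reuse. A minimal sketch; reading via the arrow package is an assumption here, and any Parquet reader would do.

# Read the Parquet file into an R DataFrame; `df` is reused by the dplyr
# sketches below.
df <- arrow::read_parquet("data/ohlcv.parquet")

head(df, 5)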

Filtering

from tab
filter s"date_part(['year', 'month'], time) = {{year: 2021, month: 2}}"
take 5
time         open      high      low       close     volume
2021-02-01   33092.97  34717.27  32296.16  33526.37  82718.28
2021-02-02   33517.09  35984.33  33418.00  35466.24  78056.66
2021-02-03   35472.71  37662.63  35362.38  37618.87  80784.33
2021-02-04   37620.26  38708.27  36161.95  36936.66  92080.74
2021-02-05   36936.65  38310.12  36570.00  38290.24  66681.33
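
For reference, the same filter as a dplyr sketch (assuming the df loaded above):

# Keep only the rows from February 2021.
df |>
  dplyr::filter(lubridate::year(time) == 2021, lubridate::month(time) == 2) |>
  head(5)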

Downsampling

It is important to pay attention to how units such as 5 days or 1 week actually work; where the counting of 5 days or 1 week starts can be completely different in each system.

Here, note that time_bucket in DuckDB, lubridate::floor_date in R, and groupby_dynamic in Polars all have completely different default origins.

  • The DuckDB function time_bucket’s origin defaults to 2000-01-03 00:00:00+00 for day or week intervals.1

  • In R’s lubridate::floor_date function, when "5 days" is passed to the unit argument, timestamps are floored using the number of days elapsed since the beginning of each month.

    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
      lubridate::floor_date("5 days")
    [1] "2023-01-31" "2023-02-01"

    And when "1 week" is passed to the unit argument, timestamps are floored to the start of the week, which runs Sunday through Saturday.

    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
      lubridate::floor_date("1 week")
    [1] "2023-01-29" "2023-01-29"

    To start from an arbitrary origin, all breaks must be specified as a vector in the unit argument (see the sketch after this list).2

    lubridate::as_date(c("2023-01-31", "2023-02-01")) |>
      lubridate::floor_date(lubridate::make_date(2023, 1, 31))
    [1] "2023-01-31" "2023-01-31"

  • In Polars’ groupby_dynamic, the offset parameter that specifies the origin point defaults to the negative of every.3
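
To mimic time_bucket with min(time) as the origin, as the PRQL queries below do, one option in R is to generate the breaks explicitly and pass them to floor_date. A minimal dplyr sketch, assuming the df loaded above and that floor_date accepts a vector of breaks as described in the footnoted issue:

# Downsample into 5-day buckets anchored at the first timestamp, then take
# the mean of each OHLCV column per bucket (sketch; not the page's own code).
downsampled <- df |>
  dplyr::mutate(
    time_new = lubridate::floor_date(
      time,
      unit = seq(min(time), max(time), by = "5 days")
    )
  ) |>
  dplyr::summarise(
    dplyr::across(c(open, high, low, close, volume), mean),
    .by = time_new
  ) |>
  dplyr::arrange(time_new)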

from tab
derive {
  time_new = s"""
    time_bucket(INTERVAL '5 days', time, (FROM tab SELECT min(time)))
  """
}
group {time_new} (
  aggregate {
    open = average open,
    high = average high,
    low = average low,
    close = average close,
    volume = average volume
  }
)
sort time_new
take 5
time_new     open      high      low       close     volume
2021-01-01   31084.32  33127.62  29512.82  32089.66  112416.85
2021-01-06   38165.31  40396.84  35983.82  39004.54  118750.08
2021-01-11   36825.23  38518.10  33288.05  36542.76  146166.70
2021-01-16   36216.36  37307.53  34650.37  35962.92   81236.80
2021-01-21   32721.53  34165.71  30624.23  32077.48   97809.66
from tab
derive {
  time_new = s"""
    time_bucket(INTERVAL '7 days', time, (FROM tab SELECT min(time)))
  """
}
group {time_new} (
  aggregate {
    open_mean = average open,
    high_mean = average high,
    low_mean = average low,
    close_mean = average close,
    volume_mean = average volume,
    open_sum = sum open,
    high_sum = sum high,
    low_sum = sum low,
    close_sum = sum close,
    volume_sum = sum volume
  }
)
sort time_new
take 5
time_new     open_mean  high_mean  low_mean  close_mean  volume_mean  open_sum  high_sum  low_sum   close_sum  volume_sum
2021-01-01   32305.78   34706.05   31021.73  33807.14    117435.59    226140.5  242942.3  217152.1  236650.0   822049.1
2021-01-08   37869.80   39646.11   34623.33  37827.52    135188.30    265088.6  277522.7  242363.3  264792.6   946318.1
2021-01-15   36527.89   37412.20   33961.55  35343.85     94212.72    255695.2  261885.4  237730.9  247406.9   659489.0
2021-01-22   31888.55   33498.81   30424.48  32248.01     89649.94    223219.8  234491.7  212971.4  225736.0   627549.6
2021-01-29   34511.48   36411.44   33450.08  35022.31    102728.42    241580.4  254880.1  234150.6  245156.2   719099.0

Upsampling

Generating sequential values with a function like generate_series and then joining them to the original table is a general-purpose approach.
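
As a sketch of that general-purpose approach in R (assuming the df loaded above), generate a 6-hour grid with seq() and left-join the daily rows onto it; rows without a match become NA:

# Build the complete 6-hour time grid, then left-join the original daily rows;
# missing combinations are filled with NA.
grid <- tibble::tibble(
  time = seq(
    from = lubridate::as_datetime(min(df$time)),
    to = lubridate::as_datetime(max(df$time)),
    by = "6 hours"
  )
)

upsampled <- grid |>
  dplyr::left_join(
    df |> dplyr::mutate(time = lubridate::as_datetime(time)),
    by = "time"
  )

head(upsampled, 5)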

In R, we can also use dedicated functions like timetk::pad_by_time.

important

This example does not work with prql-compiler 0.11.1. (PRQL/prql#3129)

let _tab1 = s"""
  SELECT
    generate_series(
      (SELECT min(time)),
      (SELECT max(time)),
      INTERVAL '6 hours'
    ).unnest() AS time
  FROM tab
"""

from _tab1
join side:left tab (==time)
sort tab.time
select !{tab.time}
take 5

Window Functions

Be careful about how each window function behaves when the window contains fewer rows than the specified width.

Moving Average, Cumulative Average

PRQL has a dedicated way of applying a window to the entire table. In the other systems, an individual function is applied to each column.

In R, base R has some window functions such as cumsum, but nothing for the cumulative average. dplyr complements this with several functions, including cummean.

Polars does not yet have a dedicated function to compute cumulative averages, so we must use cumulative sums to compute them.
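
That workaround is simply a running sum divided by a running count; in R terms, for illustration:

# Cumulative average as cumulative sum over cumulative count --
# the same identity the Polars workaround relies on.
x <- c(10, 20, 60)
cumsum(x) / seq_along(x)
#> [1] 10 15 30
dplyr::cummean(x)
#> [1] 10 15 30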

note

The original Modern Pandas post and the Modern Polars book additionally include an exponentially weighted (EW) calculation example. But DuckDB does not have a dedicated function for this, so it is omitted here.

from tab
sort this.time
window rolling:28 (
  derive {`28D MA` = average close}
)
window rows:..0 (
  derive {`Expanding Average` = average close}
)
select {
  this.time,
  Raw = close,
  `28D MA`,
  `Expanding Average`
}
take 26..30
time         Raw       28D MA    Expanding Average
2021-01-26   32467.77  35032.87  35032.87
2021-01-27   30366.15  34860.03  34860.03
2021-01-28   33364.86  34806.63  34806.63
2021-01-29   34252.20  34982.36  34787.51
2021-01-30   34262.88  35056.81  34770.02

Here, for the 28D MA column, DuckDB also calculates the average when the window contains fewer than 28 rows, whereas R's slider::slide_vec(.complete = TRUE) and Polars' rolling_mean return missing values in those cases. If we are using DuckDB and need those values to be NULL, we have to add further processing.
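
The page's own dplyr chunk is not shown here, but a dplyr/slider sketch along the following lines (assuming the df loaded above) should produce an equivalent roll_and_expand table: slider::slide_vec(.complete = TRUE) leaves the first 27 windows as NA, and dplyr::cummean() gives the expanding average.

# 28-day moving average (NA until the window is complete) plus expanding average.
roll_and_expand <- df |>
  dplyr::arrange(time) |>
  dplyr::transmute(
    time,
    Raw = close,
    `28D MA` = slider::slide_vec(close, mean, .before = 27, .complete = TRUE),
    `Expanding Average` = dplyr::cummean(close)
  )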

Plotting the results of dplyr shows the following.

Code
library(ggplot2)

roll_and_expand |>
  tidyr::pivot_longer(cols = !time) |>
  ggplot(aes(time, value, colour = name)) +
  geom_line() +
  theme_linedraw() +
  labs(y = "Close ($)") +
  scale_x_date(
    date_breaks = "month",
    labels = scales::label_date_short()
  )

Combining Rolling Aggregations

from tab
sort this.time
window rows:-15..14 (
  select {
    this.time,
    mean = average close,
    std = stddev close
  }
)
take 13..17
take 13..17
time         mean      std
2021-01-13   34860.03  3092.516
2021-01-14   34806.63  3047.833
2021-01-15   34787.51  2994.683
2021-01-16   34770.02  2944.156
2021-01-17   34895.40  2780.095

As in Section 5.1, here too the DuckDB results differ from the others.
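
A corresponding dplyr/slider sketch for the mean_std table plotted next (again assuming the df loaded above; this is not the page's own hidden chunk), using the same centered window of 15 rows before through 14 rows after each row:

# Centered 30-row window: mean and standard deviation of the close price.
mean_std <- df |>
  dplyr::arrange(time) |>
  dplyr::transmute(
    time,
    mean = slider::slide_vec(close, mean, .before = 15, .after = 14, .complete = TRUE),
    std = slider::slide_vec(close, sd, .before = 15, .after = 14, .complete = TRUE)
  )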

Plotting the results of dplyr shows the following.

Code
library(ggplot2)

mean_std |>
  ggplot(aes(time)) +
  geom_ribbon(
    aes(ymin = mean - std, ymax = mean + std),
    alpha = 0.3, fill = "blue"
  ) +
  geom_line(aes(y = mean), color = "blue") +
  theme_linedraw() +
  labs(y = "Close ($)") +
  scale_x_date(
    date_breaks = "month",
    labels = scales::label_date_short()
  )

Timezones

important

In DuckDB, the icu extension is needed for time zone support. If the DuckDB client we are using does not bundle the extension, we need to install and load it.

INSTALL 'icu'
LOAD 'icu'
let timezone = tz col -> s"timezone({tz}, {col})"

from tab
derive {
  time_new = (this.time | timezone "UTC" | timezone "US/Eastern")
}
select !{this.time}
take 5
open      high      low       close     volume     time_new
28923.63  29600.00  28624.57  29331.69   54182.93  2020-12-31 19:00:00
29331.70  33300.00  28946.53  32178.33  129993.87  2021-01-01 19:00:00
32176.45  34778.11  31962.99  33000.05  120957.57  2021-01-02 19:00:00
33000.05  33600.00  28130.00  31988.71  140899.89  2021-01-03 19:00:00
31989.75  34360.00  29900.00  33949.53  116050.00  2021-01-04 19:00:00

Note that each system may store time zone information differently. Here, the time column (and the time_new column) in the DuckDB results are of the TIMESTAMP type, which carries no time zone information.
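
For comparison, a lubridate sketch of the same conversion (assuming the df loaded above); unlike the DuckDB TIMESTAMP, the resulting POSIXct column keeps its time zone as an attribute:

# Interpret the naive timestamps as UTC, then express them in US/Eastern.
df |>
  dplyr::mutate(
    time_new = time |>
      lubridate::as_datetime(tz = "UTC") |>
      lubridate::with_tz("US/Eastern"),
    .keep = "unused"
  ) |>
  head(5)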

Footnotes

  1. https://duckdb.org/docs/sql/functions/timestamptz#icu-timestamp-with-time-zone-operators

  2. https://github.com/tidyverse/lubridate/issues/932

  3. https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.groupby_dynamic.html