Perform joins on nearest keys
Description
This is similar to a left-join except that we match on nearest key
rather than equal keys. Both frames must be sorted by the
asof_join
key.
Usage
<LazyFrame>$join_asof(
other,
...,
left_on = NULL,
right_on = NULL,
on = NULL,
by_left = NULL,
by_right = NULL,
by = NULL,
strategy = c("backward", "forward", "nearest"),
suffix = "_right",
tolerance = NULL,
allow_parallel = TRUE,
force_parallel = FALSE,
coalesce = TRUE,
allow_exact_matches = TRUE,
check_sortedness = TRUE
)
Arguments
other
|
LazyFrame to join with. |
…
|
These dots are for future extensions and must be empty. |
left_on , right_on
|
Same as on but only for the left or the right DataFrame.
They must have the same length.
|
on
|
Either a vector of column names or a list of expressions and/or strings.
Use left_on and right_on if the column names
to match on are different between the two LazyFrames.
|
by_left , by_right
|
Same as by but only for the left or the right table. They
must have the same length.
|
by
|
Join on these columns before performing asof join. Either a vector of
column names or a list of expressions and/or strings. Use
left_by and right_by if the column names to
match on are different between the two tables.
|
strategy
|
Strategy for where to find match:
|
suffix
|
Suffix to add to duplicated column names. |
tolerance
|
Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", use the Polars duration string language (see details). |
allow_parallel
|
Allow the physical plan to optionally evaluate the computation of both LazyFrames up to the join in parallel. |
force_parallel
|
Force the physical plan to evaluate the computation of both LazyFrames up to the join in parallel. |
coalesce
|
Coalescing behavior (merging of on / left_on /
right_on columns):
|
allow_exact_matches
|
Whether exact matches are valid join predicates. If TRUE
(default), allow matching with the same on value (i.e.
less-than-or-equal-to / greater-than-or-equal-to). Otherwise, don’t
match the same on value (i.e., strictly less-than / strictly
greater-than).
|
check_sortedness
|
Check the sortedness of the asof keys. If the keys are not sorted,
polars will error, or raise a warning if the by argument is
provided. This might become a hard error in the future.
|
Value
A polars LazyFrame
Polars duration string language
Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.
It has the following format:
- 1ns (1 nanosecond)
- 1us (1 microsecond)
- 1ms (1 millisecond)
- 1s (1 second)
- 1m (1 minute)
- 1h (1 hour)
- 1d (1 calendar day)
- 1w (1 calendar week)
- 1mo (1 calendar month)
- 1q (1 calendar quarter)
- 1y (1 calendar year)
Or combine them: “3d12h4m25s”
# 3 days, 12 hours, 4
minutes, and 25 seconds
By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".
Examples
library("polars")
gdp <- pl$LazyFrame(
date = as.Date(c("2016-1-1", "2017-5-1", "2018-1-1", "2019-1-1", "2020-1-1")),
gdp = c(4164, 4411, 4566, 4696, 4827)
)
pop <- pl$LazyFrame(
date = as.Date(c("2016-3-1", "2018-8-1", "2019-1-1")),
population = c(82.19, 82.66, 83.12)
)
# optional make sure tables are already sorted with "on" join-key
gdp <- gdp$sort("date")
pop <- pop$sort("date")
# Note how the dates don’t quite match. If we join them using join_asof and
# strategy = 'backward', then each date from population which doesn’t have
# an exact match is matched with the closest earlier date from gdp:
pop$join_asof(gdp, on = "date", strategy = "backward")$collect()
#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date ┆ population ┆ gdp │
#> │ --- ┆ --- ┆ --- │
#> │ date ┆ f64 ┆ f64 │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19 ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66 ┆ 4566.0 │
#> │ 2019-01-01 ┆ 83.12 ┆ 4696.0 │
#> └────────────┴────────────┴────────┘
# Note how:
# - date 2016-03-01 from population is matched with 2016-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2018-01-01 from gdp.
# You can verify this by passing coalesce = FALSE:
pop$join_asof(
gdp,
on = "date", strategy = "backward", coalesce = FALSE
)$collect()
#> shape: (3, 4)
#> ┌────────────┬────────────┬────────────┬────────┐
#> │ date ┆ population ┆ date_right ┆ gdp │
#> │ --- ┆ --- ┆ --- ┆ --- │
#> │ date ┆ f64 ┆ date ┆ f64 │
#> ╞════════════╪════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19 ┆ 2016-01-01 ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66 ┆ 2018-01-01 ┆ 4566.0 │
#> │ 2019-01-01 ┆ 83.12 ┆ 2019-01-01 ┆ 4696.0 │
#> └────────────┴────────────┴────────────┴────────┘
# If we instead use strategy = 'forward', then each date from population
# which doesn’t have an exact match is matched with the closest later date
# from gdp:
pop$join_asof(gdp, on = "date", strategy = "forward")$collect()
#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date ┆ population ┆ gdp │
#> │ --- ┆ --- ┆ --- │
#> │ date ┆ f64 ┆ f64 │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19 ┆ 4411.0 │
#> │ 2018-08-01 ┆ 82.66 ┆ 4696.0 │
#> │ 2019-01-01 ┆ 83.12 ┆ 4696.0 │
#> └────────────┴────────────┴────────┘
# Note how:
# - date 2016-03-01 from population is matched with 2017-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2019-01-01 from gdp.
# Finally, strategy = 'nearest' gives us a mix of the two results above, as
# each date from population which doesn’t have an exact match is matched
# with the closest date from gdp, regardless of whether it’s earlier or
# later:
pop$join_asof(gdp, on = "date", strategy = "nearest")$collect()
#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date ┆ population ┆ gdp │
#> │ --- ┆ --- ┆ --- │
#> │ date ┆ f64 ┆ f64 │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19 ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66 ┆ 4696.0 │
#> │ 2019-01-01 ┆ 83.12 ┆ 4696.0 │
#> └────────────┴────────────┴────────┘
# Note how:
# - date 2016-03-01 from population is matched with 2016-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2019-01-01 from gdp.
# The `by` argument allows joining on another column first, before the asof
# join. In this example we join by country first, then asof join by date, as
# above.
gdp2 <- pl$LazyFrame(
country = rep(c("Germany", "Netherlands"), each = 5),
date = rep(
as.Date(c("2016-1-1", "2017-1-1", "2018-1-1", "2019-1-1", "2020-1-1")),
2
),
gdp = c(4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909)
)$sort("country", "date")
gdp2$collect()
#> shape: (10, 3)
#> ┌─────────────┬────────────┬────────┐
#> │ country ┆ date ┆ gdp │
#> │ --- ┆ --- ┆ --- │
#> │ str ┆ date ┆ f64 │
#> ╞═════════════╪════════════╪════════╡
#> │ Germany ┆ 2016-01-01 ┆ 4164.0 │
#> │ Germany ┆ 2017-01-01 ┆ 4411.0 │
#> │ Germany ┆ 2018-01-01 ┆ 4566.0 │
#> │ Germany ┆ 2019-01-01 ┆ 4696.0 │
#> │ Germany ┆ 2020-01-01 ┆ 4827.0 │
#> │ Netherlands ┆ 2016-01-01 ┆ 784.0 │
#> │ Netherlands ┆ 2017-01-01 ┆ 833.0 │
#> │ Netherlands ┆ 2018-01-01 ┆ 914.0 │
#> │ Netherlands ┆ 2019-01-01 ┆ 910.0 │
#> │ Netherlands ┆ 2020-01-01 ┆ 909.0 │
#> └─────────────┴────────────┴────────┘
pop2 <- pl$LazyFrame(
country = rep(c("Germany", "Netherlands"), each = 3),
date = rep(as.Date(c("2016-3-1", "2018-8-1", "2019-1-1")), 2),
population = c(82.19, 82.66, 83.12, 17.11, 17.32, 17.40)
)$sort("country", "date")
pop2$collect()
#> shape: (6, 3)
#> ┌─────────────┬────────────┬────────────┐
#> │ country ┆ date ┆ population │
#> │ --- ┆ --- ┆ --- │
#> │ str ┆ date ┆ f64 │
#> ╞═════════════╪════════════╪════════════╡
#> │ Germany ┆ 2016-03-01 ┆ 82.19 │
#> │ Germany ┆ 2018-08-01 ┆ 82.66 │
#> │ Germany ┆ 2019-01-01 ┆ 83.12 │
#> │ Netherlands ┆ 2016-03-01 ┆ 17.11 │
#> │ Netherlands ┆ 2018-08-01 ┆ 17.32 │
#> │ Netherlands ┆ 2019-01-01 ┆ 17.4 │
#> └─────────────┴────────────┴────────────┘
#> shape: (6, 4)
#> ┌─────────────┬────────────┬────────────┬────────┐
#> │ country ┆ date ┆ population ┆ gdp │
#> │ --- ┆ --- ┆ --- ┆ --- │
#> │ str ┆ date ┆ f64 ┆ f64 │
#> ╞═════════════╪════════════╪════════════╪════════╡
#> │ Germany ┆ 2016-03-01 ┆ 82.19 ┆ 4164.0 │
#> │ Germany ┆ 2018-08-01 ┆ 82.66 ┆ 4696.0 │
#> │ Germany ┆ 2019-01-01 ┆ 83.12 ┆ 4696.0 │
#> │ Netherlands ┆ 2016-03-01 ┆ 17.11 ┆ 784.0 │
#> │ Netherlands ┆ 2018-08-01 ┆ 17.32 ┆ 910.0 │
#> │ Netherlands ┆ 2019-01-01 ┆ 17.4 ┆ 910.0 │
#> └─────────────┴────────────┴────────────┴────────┘