Perform joins on nearest keys

Description

This is similar to a left-join except that we match on nearest key rather than equal keys. Both frames must be sorted by the asof_join key.

Usage

<LazyFrame>$join_asof(
  other,
  ...,
  left_on = NULL,
  right_on = NULL,
  on = NULL,
  by_left = NULL,
  by_right = NULL,
  by = NULL,
  strategy = c("backward", "forward", "nearest"),
  suffix = "_right",
  tolerance = NULL,
  allow_parallel = TRUE,
  force_parallel = FALSE,
  coalesce = TRUE,
  allow_exact_matches = TRUE,
  check_sortedness = TRUE
)

Arguments

other LazyFrame to join with.

… These dots are for future extensions and must be empty.

left_on, right_on Same as on but only for the left or the right DataFrame. They must have the same length.

on Either a vector of column names or a list of expressions and/or strings. Use left_on and right_on if the column names to match on are different between the two LazyFrames.

by_left, by_right Same as by but only for the left or the right table. They must have the same length.

by Join on these columns before performing asof join. Either a vector of column names or a list of expressions and/or strings. Use left_by and right_by if the column names to match on are different between the two tables.

strategy

Strategy for where to find match:

“backward” (default): search for the last row in the right table whose on key is less than or equal to the left key.
“forward”: search for the first row in the right table whose on key is greater than or equal to the left key.
“nearest”: search for the last row in the right table whose value is nearest to the left key. String keys are not currently supported for a nearest search.

suffix Suffix to add to duplicated column names.

tolerance Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype "Date", "Datetime", "Duration" or "Time", use the Polars duration string language (see details).

allow_parallel Allow the physical plan to optionally evaluate the computation of both LazyFrames up to the join in parallel.

force_parallel Force the physical plan to evaluate the computation of both LazyFrames up to the join in parallel.

coalesce

Coalescing behavior (merging of on / left_on / right_on columns):

TRUE: Always coalesce join columns;
FALSE: Never coalesce join columns. Note that joining on any other expressions than col will turn off coalescing.

allow_exact_matches Whether exact matches are valid join predicates. If TRUE (default), allow matching with the same on value (i.e. less-than-or-equal-to / greater-than-or-equal-to). Otherwise, don’t match the same on value (i.e., strictly less-than / strictly greater-than).

check_sortedness Check the sortedness of the asof keys. If the keys are not sorted, polars will error, or raise a warning if the by argument is provided. This might become a hard error in the future.

Value

A polars LazyFrame

Polars duration string language

Polars duration string language is a simple representation of durations. It is used in many Polars functions that accept durations.

It has the following format:

1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

By "calendar day", we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for "calendar week", "calendar month", "calendar quarter", and "calendar year".

Examples

library("polars")

gdp <- pl$LazyFrame(
  date = as.Date(c("2016-1-1", "2017-5-1", "2018-1-1", "2019-1-1", "2020-1-1")),
  gdp = c(4164, 4411, 4566, 4696, 4827)
)

pop <- pl$LazyFrame(
  date = as.Date(c("2016-3-1", "2018-8-1", "2019-1-1")),
  population = c(82.19, 82.66, 83.12)
)

# optional make sure tables are already sorted with "on" join-key
gdp <- gdp$sort("date")
pop <- pop$sort("date")


# Note how the dates don’t quite match. If we join them using join_asof and
# strategy = 'backward', then each date from population which doesn’t have
# an exact match is matched with the closest earlier date from gdp:
pop$join_asof(gdp, on = "date", strategy = "backward")$collect()

#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date       ┆ population ┆ gdp    │
#> │ ---        ┆ ---        ┆ ---    │
#> │ date       ┆ f64        ┆ f64    │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19      ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66      ┆ 4566.0 │
#> │ 2019-01-01 ┆ 83.12      ┆ 4696.0 │
#> └────────────┴────────────┴────────┘

# Note how:
# - date 2016-03-01 from population is matched with 2016-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2018-01-01 from gdp.
# You can verify this by passing coalesce = FALSE:
pop$join_asof(
  gdp,
  on = "date", strategy = "backward", coalesce = FALSE
)$collect()

#> shape: (3, 4)
#> ┌────────────┬────────────┬────────────┬────────┐
#> │ date       ┆ population ┆ date_right ┆ gdp    │
#> │ ---        ┆ ---        ┆ ---        ┆ ---    │
#> │ date       ┆ f64        ┆ date       ┆ f64    │
#> ╞════════════╪════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566.0 │
#> │ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696.0 │
#> └────────────┴────────────┴────────────┴────────┘

# If we instead use strategy = 'forward', then each date from population
# which doesn’t have an exact match is matched with the closest later date
# from gdp:
pop$join_asof(gdp, on = "date", strategy = "forward")$collect()

#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date       ┆ population ┆ gdp    │
#> │ ---        ┆ ---        ┆ ---    │
#> │ date       ┆ f64        ┆ f64    │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19      ┆ 4411.0 │
#> │ 2018-08-01 ┆ 82.66      ┆ 4696.0 │
#> │ 2019-01-01 ┆ 83.12      ┆ 4696.0 │
#> └────────────┴────────────┴────────┘

# Note how:
# - date 2016-03-01 from population is matched with 2017-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2019-01-01 from gdp.

# Finally, strategy = 'nearest' gives us a mix of the two results above, as
# each date from population which doesn’t have an exact match is matched
# with the closest date from gdp, regardless of whether it’s earlier or
# later:
pop$join_asof(gdp, on = "date", strategy = "nearest")$collect()

#> shape: (3, 3)
#> ┌────────────┬────────────┬────────┐
#> │ date       ┆ population ┆ gdp    │
#> │ ---        ┆ ---        ┆ ---    │
#> │ date       ┆ f64        ┆ f64    │
#> ╞════════════╪════════════╪════════╡
#> │ 2016-03-01 ┆ 82.19      ┆ 4164.0 │
#> │ 2018-08-01 ┆ 82.66      ┆ 4696.0 │
#> │ 2019-01-01 ┆ 83.12      ┆ 4696.0 │
#> └────────────┴────────────┴────────┘

# Note how:
# - date 2016-03-01 from population is matched with 2016-01-01 from gdp;
# - date 2018-08-01 from population is matched with 2019-01-01 from gdp.

# The `by` argument allows joining on another column first, before the asof
# join. In this example we join by country first, then asof join by date, as
# above.
gdp2 <- pl$LazyFrame(
  country = rep(c("Germany", "Netherlands"), each = 5),
  date = rep(
    as.Date(c("2016-1-1", "2017-1-1", "2018-1-1", "2019-1-1", "2020-1-1")),
    2
  ),
  gdp = c(4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909)
)$sort("country", "date")
gdp2$collect()

#> shape: (10, 3)
#> ┌─────────────┬────────────┬────────┐
#> │ country     ┆ date       ┆ gdp    │
#> │ ---         ┆ ---        ┆ ---    │
#> │ str         ┆ date       ┆ f64    │
#> ╞═════════════╪════════════╪════════╡
#> │ Germany     ┆ 2016-01-01 ┆ 4164.0 │
#> │ Germany     ┆ 2017-01-01 ┆ 4411.0 │
#> │ Germany     ┆ 2018-01-01 ┆ 4566.0 │
#> │ Germany     ┆ 2019-01-01 ┆ 4696.0 │
#> │ Germany     ┆ 2020-01-01 ┆ 4827.0 │
#> │ Netherlands ┆ 2016-01-01 ┆ 784.0  │
#> │ Netherlands ┆ 2017-01-01 ┆ 833.0  │
#> │ Netherlands ┆ 2018-01-01 ┆ 914.0  │
#> │ Netherlands ┆ 2019-01-01 ┆ 910.0  │
#> │ Netherlands ┆ 2020-01-01 ┆ 909.0  │
#> └─────────────┴────────────┴────────┘

pop2 <- pl$LazyFrame(
  country = rep(c("Germany", "Netherlands"), each = 3),
  date = rep(as.Date(c("2016-3-1", "2018-8-1", "2019-1-1")), 2),
  population = c(82.19, 82.66, 83.12, 17.11, 17.32, 17.40)
)$sort("country", "date")
pop2$collect()

#> shape: (6, 3)
#> ┌─────────────┬────────────┬────────────┐
#> │ country     ┆ date       ┆ population │
#> │ ---         ┆ ---        ┆ ---        │
#> │ str         ┆ date       ┆ f64        │
#> ╞═════════════╪════════════╪════════════╡
#> │ Germany     ┆ 2016-03-01 ┆ 82.19      │
#> │ Germany     ┆ 2018-08-01 ┆ 82.66      │
#> │ Germany     ┆ 2019-01-01 ┆ 83.12      │
#> │ Netherlands ┆ 2016-03-01 ┆ 17.11      │
#> │ Netherlands ┆ 2018-08-01 ┆ 17.32      │
#> │ Netherlands ┆ 2019-01-01 ┆ 17.4       │
#> └─────────────┴────────────┴────────────┘

pop2$join_asof(
  gdp2,
  by = "country", on = "date", strategy = "nearest"
)$collect()

#> shape: (6, 4)
#> ┌─────────────┬────────────┬────────────┬────────┐
#> │ country     ┆ date       ┆ population ┆ gdp    │
#> │ ---         ┆ ---        ┆ ---        ┆ ---    │
#> │ str         ┆ date       ┆ f64        ┆ f64    │
#> ╞═════════════╪════════════╪════════════╪════════╡
#> │ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164.0 │
#> │ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696.0 │
#> │ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696.0 │
#> │ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784.0  │
#> │ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910.0  │
#> │ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910.0  │
#> └─────────────┴────────────┴────────────┴────────┘