Apply a custom R function to a whole Series or sequence of Series.
Description
The output of this custom function is presumed to be either a Series, or
a scalar that will be converted into a Series. If the result is a scalar
and you want it to stay as a scalar, pass in returns_scalar = TRUE.
If you want to apply a custom function elementwise over single values, see map_elements(). A reasonable use case for map functions is transforming the values represented by an expression using a third-party package.
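As a rough illustration of the Series-in, Series-out contract described above, the sketch below standardizes a whole column with plain R functions. The helper name `standardize` and the column names are hypothetical, and the sketch assumes `as_polars_series()` is available to turn an R vector back into a Series.

```r
library(polars)

df <- pl$DataFrame(x = c(1, 2, 3, 4))

# Hypothetical helper: receives the whole column as one Series.
standardize <- function(s) {
  v <- as.vector(s)                        # polars Series -> R numeric vector
  as_polars_series((v - mean(v)) / sd(v))  # R vector -> Series of the same length
}

out <- df$with_columns(
  x_std = pl$col("x")$map_batches(standardize)
)
res <- as.data.frame(out)
```

Because the function sees the entire column at once, whole-column statistics such as the mean and standard deviation are available inside it, which is not the case for an elementwise function.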
Usage
<Expr>$map_batches(lambda, return_dtype = NULL, ..., agg_list = FALSE)
Arguments
lambda | Function to apply.
return_dtype | Dtype of the output Series. If NULL (default), the dtype will be inferred based on the first non-null value that is returned by the function. This can lead to unexpected results, so it is recommended to provide the return dtype.
... | These dots are for future extensions and must be empty.
agg_list | Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context. The function will be invoked only once on a list of groups, rather than once per group.
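A minimal sketch of supplying return_dtype explicitly, rather than relying on inference from the first non-null value. The data and column name are made up for illustration, and `pl$Float64` is assumed to be the dtype matching the function's output.

```r
library(polars)

df <- pl$DataFrame(a = c(1, 2, 3))

# Declare the output dtype up front instead of letting it be inferred.
out <- df$select(
  pl$col("a")$map_batches(
    \(x) x * 2,
    return_dtype = pl$Float64
  )
)
res <- as.data.frame(out)
```

Stating the dtype is most useful when the function may return nulls first, or an empty result, where inference has little to go on.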
Value
A polars expression
Examples
library("polars")
df <- pl$DataFrame(
  sine = c(0.0, 1.0, 0.0, -1.0),
  cosine = c(1.0, 0.0, -1.0, 0.0)
)
df$select(pl$all()$map_batches(\(x) {
  elems <- as.vector(x)
  which.max(elems)
}))
#> shape: (1, 2)
#> ┌──────┬────────┐
#> │ sine ┆ cosine │
#> │ ---  ┆ ---    │
#> │ i32  ┆ i32    │
#> ╞══════╪════════╡
#> │ 2    ┆ 1      │
#> └──────┴────────┘
# In a group-by context, the `agg_list` parameter can improve performance if
# used correctly. The following example has `agg_list = FALSE`, which causes
# the function to be applied once per group. The input of the function is a
# Series of type Float64. This is less efficient.
df <- pl$DataFrame(
  a = c(0, 1, 0, 1),
  b = c(1, 2, 3, 4)
)
system.time({
  print(
    df$group_by("a")$agg(
      pl$col("b")$map_batches(\(x) x + 2, agg_list = FALSE)
    )
  )
})
#> shape: (2, 2)
#> ┌─────┬────────────┐
#> │ a   ┆ b          │
#> │ --- ┆ ---        │
#> │ f64 ┆ list[f64]  │
#> ╞═════╪════════════╡
#> │ 0.0 ┆ [3.0, 5.0] │
#> │ 1.0 ┆ [4.0, 6.0] │
#> └─────┴────────────┘
#>    user  system elapsed
#>   0.012   0.001   0.013
# With `agg_list = TRUE`, the function is invoked only once, which can be more
# efficient. In this example, the input of the function is a Series of type
# List(Float64).
system.time({
  print(
    df$group_by("a")$agg(
      pl$col("b")$map_batches(
        \(x) x$list$eval(pl$element() + 2),
        agg_list = TRUE
      )
    )
  )
})
#> shape: (2, 2)
#> ┌─────┬────────────┐
#> │ a   ┆ b          │
#> │ --- ┆ ---        │
#> │ f64 ┆ list[f64]  │
#> ╞═════╪════════════╡
#> │ 0.0 ┆ [3.0, 5.0] │
#> │ 1.0 ┆ [4.0, 6.0] │
#> └─────┴────────────┘
#>    user  system elapsed
#>   0.024   0.000   0.024
# Call a function that takes multiple arguments by creating a struct and
# referencing its fields inside the function call.
df <- pl$DataFrame(
  a = c(5, 1, 0, 3),
  b = c(4, 2, 3, 4)
)
df$with_columns(
  a_times_b = pl$struct("a", "b")$map_batches(
    \(x) x$struct$field("a") * x$struct$field("b")
  )
)
#> shape: (4, 3)
#> ┌─────┬─────┬───────────┐
#> │ a   ┆ b   ┆ a_times_b │
#> │ --- ┆ --- ┆ ---       │
#> │ f64 ┆ f64 ┆ f64       │
#> ╞═════╪═════╪═══════════╡
#> │ 5.0 ┆ 4.0 ┆ 20.0      │
#> │ 1.0 ┆ 2.0 ┆ 2.0       │
#> │ 0.0 ┆ 3.0 ┆ 0.0       │
#> │ 3.0 ┆ 4.0 ┆ 12.0      │
#> └─────┴─────┴───────────┘