Apply a custom R function to a whole Series or sequence of Series.

Description

[Experimental]

The output of this custom function is presumed to be either a Series, or a scalar that will be converted into a Series. If the result is a scalar and you want it to stay as a scalar, pass in returns_scalar = TRUE.

If you want to apply a custom function elementwise over single values, see map_elements(). A reasonable use case for map functions is transforming the values represented by an expression using a third-party package.
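As a minimal sketch of a whole-Series transformation, the snippet below standardizes a column with base R's `mean()` and `sd()`, which need all values at once and so cannot be applied one element at a time. The column name `x` and the data are illustrative, and `as_polars_series()` is used only to make the conversion back to a Series explicit.

```r
library("polars")

# Illustrative data; the column name `x` is an assumption for this sketch.
df <- pl$DataFrame(x = c(2, 4, 6))

out <- df$select(
  pl$col("x")$map_batches(\(s) {
    # `s` is the whole column as a single Series; convert it to an R vector.
    v <- as.vector(s)
    # Standardize using statistics computed over the entire column,
    # then hand the result back as a Series.
    as_polars_series((v - mean(v)) / sd(v))
  })
)
out
```

The function is called once with the full column, so the mean and standard deviation are computed over all rows.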

Usage

<Expr>$map_batches(lambda, return_dtype = NULL, ..., agg_list = FALSE)

Arguments

lambda Function to apply.
return_dtype Dtype of the output Series. If NULL (default), the dtype will be inferred based on the first non-null value that is returned by the function. This can lead to unexpected results, so it is recommended to provide the return dtype.
... These dots are for future extensions and must be empty.
agg_list Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context. The function will be invoked only once on a list of groups, rather than once per group.
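The following sketch shows one way to pass `return_dtype` explicitly instead of relying on inference. The column name `x` and the use of `sqrt()` are illustrative choices; the function returns doubles, so `pl$Float64` is declared to match.

```r
library("polars")

# Illustrative data; the column name `x` is an assumption for this sketch.
df <- pl$DataFrame(x = c(1, 4, 9))

out <- df$select(
  pl$col("x")$map_batches(
    # sqrt() is applied to the R vector form of the Series and the result
    # is converted back to a Series explicitly.
    \(s) as_polars_series(sqrt(as.vector(s))),
    # Declare the output dtype rather than letting it be inferred.
    return_dtype = pl$Float64
  )
)
out
```

Declaring the dtype keeps the result type from depending on the first non-null value the function happens to return.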

Value

A polars expression

Examples

library("polars")

df <- pl$DataFrame(
  sine = c(0.0, 1.0, 0.0, -1.0),
  cosine = c(1.0, 0.0, -1.0, 0.0)
)
df$select(pl$all()$map_batches(\(x) {
  elems <- as.vector(x)
  which.max(elems)
}))
#> shape: (1, 2)
#> ┌──────┬────────┐
#> │ sine ┆ cosine │
#> │ ---  ┆ ---    │
#> │ i32  ┆ i32    │
#> ╞══════╪════════╡
#> │ 2    ┆ 1      │
#> └──────┴────────┘
# In a group-by context, the `agg_list` parameter can improve performance if
# used correctly. The following example has `agg_list = FALSE`, which causes
# the function to be applied once per group. The input of the function is a
# Series of type Float64. This is less efficient.
df <- pl$DataFrame(
  a = c(0, 1, 0, 1),
  b = c(1, 2, 3, 4)
)
system.time({
  print(
    df$group_by("a")$agg(
      pl$col("b")$map_batches(\(x) x + 2, agg_list = FALSE)
    )
  )
})
#> shape: (2, 2)
#> ┌─────┬────────────┐
#> │ a   ┆ b          │
#> │ --- ┆ ---        │
#> │ f64 ┆ list[f64]  │
#> ╞═════╪════════════╡
#> │ 0.0 ┆ [3.0, 5.0] │
#> │ 1.0 ┆ [4.0, 6.0] │
#> └─────┴────────────┘

#>    user  system elapsed 
#>   0.012   0.001   0.013
# Using `agg_list = TRUE` can be more efficient, because the function is
# invoked only once rather than once per group. In this example, the input
# of the function is a Series of type List(Float64).
system.time({
  print(
    df$group_by("a")$agg(
      pl$col("b")$map_batches(
        \(x) x$list$eval(pl$element() + 2),
        agg_list = TRUE
      )
    )
  )
})
#> shape: (2, 2)
#> ┌─────┬────────────┐
#> │ a   ┆ b          │
#> │ --- ┆ ---        │
#> │ f64 ┆ list[f64]  │
#> ╞═════╪════════════╡
#> │ 0.0 ┆ [3.0, 5.0] │
#> │ 1.0 ┆ [4.0, 6.0] │
#> └─────┴────────────┘

#>    user  system elapsed 
#>   0.024   0.000   0.024
# Call a function that takes multiple arguments by creating a struct and
# referencing its fields inside the function call.
df <- pl$DataFrame(
  a = c(5, 1, 0, 3),
  b = c(4, 2, 3, 4)
)
df$with_columns(
  a_times_b = pl$struct("a", "b")$map_batches(
    \(x) x$struct$field("a") * x$struct$field("b")
  )
)
#> shape: (4, 3)
#> ┌─────┬─────┬───────────┐
#> │ a   ┆ b   ┆ a_times_b │
#> │ --- ┆ --- ┆ ---       │
#> │ f64 ┆ f64 ┆ f64       │
#> ╞═════╪═════╪═══════════╡
#> │ 5.0 ┆ 4.0 ┆ 20.0      │
#> │ 1.0 ┆ 2.0 ┆ 2.0       │
#> │ 0.0 ┆ 3.0 ┆ 0.0       │
#> │ 3.0 ┆ 4.0 ┆ 12.0      │
#> └─────┴─────┴───────────┘
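
The struct approach also works when the fields are handed to an ordinary R function that takes two vectors. The sketch below is an illustration only: the column names `y` and `x`, the output name `angle`, and the choice of base R's `atan2()` are assumptions made for the example.

```r
library("polars")

# Illustrative data; the column names `y` and `x` are assumptions.
df <- pl$DataFrame(
  y = c(0, 1, 0, -1),
  x = c(1, 0, -1, 0)
)

out <- df$with_columns(
  angle = pl$struct("y", "x")$map_batches(
    \(s) {
      # Pull each struct field out as a Series, then convert to R vectors.
      y <- as.vector(s$struct$field("y"))
      x <- as.vector(s$struct$field("x"))
      # atan2() takes two numeric vectors and returns the angle in radians.
      as_polars_series(atan2(y, x))
    },
    return_dtype = pl$Float64
  )
)
out
```

Converting the fields with `as.vector()` lets any vectorized two-argument R function be used, at the cost of copying the data out of polars and back.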