Using custom functions or other R packages

library(polars)

polars contains a large set of functions to manipulate variables, whether they are numerical, strings, dates, or other types. Still, it is sometimes necessary to apply custom functions, either written by you or available in other R packages. This vignette details several ways to do that as efficiently as possible.

Writing functions using polars expressions

One of polars’ biggest strengths is the composability of expressions. Chaining multiple functions to create custom functions is quite simple. For example, polars doesn’t provide a built-in function to standardize a variable, but we can create one using expressions:

pl_standardize <- function(x) {
  (x - x$mean()) / x$std()
}

We can then use it in $with_columns(), for instance:

dat <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])
dat$with_columns(
  carb_stand = pl_standardize(pl$col("carb"))
)
#> shape: (32, 4)
#> ┌──────┬──────┬──────┬────────────┐
#> │ carb ┆ mpg  ┆ drat ┆ carb_stand │
#> │ ---  ┆ ---  ┆ ---  ┆ ---        │
#> │ f64  ┆ f64  ┆ f64  ┆ f64        │
#> ╞══════╪══════╪══════╪════════════╡
#> │ 4.0  ┆ 21.0 ┆ 3.9  ┆ 0.735203   │
#> │ 4.0  ┆ 21.0 ┆ 3.9  ┆ 0.735203   │
#> │ 1.0  ┆ 22.8 ┆ 3.85 ┆ -1.122152  │
#> │ 1.0  ┆ 21.4 ┆ 3.08 ┆ -1.122152  │
#> │ 2.0  ┆ 18.7 ┆ 3.15 ┆ -0.503034  │
#> │ …    ┆ …    ┆ …    ┆ …          │
#> │ 2.0  ┆ 30.4 ┆ 3.77 ┆ -0.503034  │
#> │ 4.0  ┆ 15.8 ┆ 4.22 ┆ 0.735203   │
#> │ 6.0  ┆ 19.7 ┆ 3.62 ┆ 1.97344    │
#> │ 8.0  ┆ 15.0 ┆ 3.54 ┆ 3.211677   │
#> │ 2.0  ┆ 21.4 ┆ 4.11 ┆ -0.503034  │
#> └──────┴──────┴──────┴────────────┘
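Because pl_standardize() only composes expressions, it should also work on an expression that selects several columns. The sketch below assumes that pl$col() accepts several column names and that $name$suffix() is available to rename the outputs; the "_stand" suffix is an arbitrary choice for illustration. It then compares one result with base R's scale().

```r
library(polars)

pl_standardize <- function(x) {
  (x - x$mean()) / x$std()
}

dat <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])

# Standardize two columns at once; the suffix keeps the original columns intact.
out <- dat$with_columns(
  pl_standardize(pl$col("mpg", "drat"))$name$suffix("_stand")
)

# Compare with scale(), which also divides by the sample standard deviation.
res <- as.data.frame(out)
all.equal(res$mpg_stand, scale(mtcars$mpg)[, 1])
```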

Using purrr

polars expressions allow one to write their own functions, but it is sometimes necessary to use functions that already exist in other R packages. In that case, one should keep in mind that applying functions external to polars limits polars’ strengths. In particular, polars is fast in part because it can detect the transformations applied to the data and run them in parallel or cache them, so that only the minimum amount of computation is performed. Using polars expressions also allows the code to run in streaming mode, which makes it possible to process larger-than-memory data. When one uses functions that polars doesn’t natively know, those benefits are lost. This is why it is better to use polars expressions in your own functions whenever you can.
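To illustrate the point, a function built only from expressions can be used unchanged in a lazy query, so polars sees the whole computation before running anything. This is a minimal sketch; the filter on mpg is an arbitrary step added only so that the query has more than one operation.

```r
library(polars)

pl_standardize <- function(x) {
  (x - x$mean()) / x$std()
}

lf <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])$lazy()

# Build the query; nothing is computed at this point.
query <- lf$
  filter(pl$col("mpg") > 20)$
  with_columns(carb_stand = pl_standardize(pl$col("carb")))

# $collect() runs the whole query.
# The standardization is computed on the rows kept by the filter.
result <- query$collect()
result
```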

To apply an external function to a polars object, we must convert this object to R, apply the function, and then convert the result back to polars. If you come from Python Polars, this is a big difference. In Python Polars, several functions (for instance map_batches() and map_elements()) allow one to pass custom functions. This package does provide <expr>$map_batches(), but using it is not recommended because it is unstable and may not work as expected in streaming mode.

Instead, it is recommended to convert the polars object to an R data.frame or similar, and then use common higher-order functions like lapply() or purrr::map().
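With base R only, that round trip could look like the following sketch. Here scale() stands in for any external function, and the code relies on as_polars_df() accepting a named list of equal-length vectors.

```r
library(polars)

dat <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])

scaled <- dat |>
  as.data.frame() |>               # polars DataFrame -> R data.frame
  lapply(\(col) scale(col)[, 1]) |> # apply the external function to each column
  as_polars_df()                   # named list of vectors -> polars DataFrame

scaled
```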

Note that when you convert your polars object to R, you may run out of memory if your object is a LazyFrame that is too big to fit in memory once collected. Another crucial point is that we want the R function to run as fast as possible on the data. One way to do that is to take advantage of {purrr} to run the function in parallel.

library(purrr)

If we do not want to run the function in parallel, something like this will suffice:

dat |>
  as.data.frame() |>
  map(\(col) scale(col)[, 1]) |>
  as_polars_df()
#> shape: (32, 3)
#> ┌───────────┬───────────┬───────────┐
#> │ carb      ┆ mpg       ┆ drat      │
#> │ ---       ┆ ---       ┆ ---       │
#> │ f64       ┆ f64       ┆ f64       │
#> ╞═══════════╪═══════════╪═══════════╡
#> │ 0.735203  ┆ 0.150885  ┆ 0.567514  │
#> │ 0.735203  ┆ 0.150885  ┆ 0.567514  │
#> │ -1.122152 ┆ 0.449543  ┆ 0.474     │
#> │ -1.122152 ┆ 0.217253  ┆ -0.966118 │
#> │ -0.503034 ┆ -0.230735 ┆ -0.835198 │
#> │ …         ┆ …         ┆ …         │
#> │ -0.503034 ┆ 1.710547  ┆ 0.324377  │
#> │ 0.735203  ┆ -0.711907 ┆ 1.166004  │
#> │ 1.97344   ┆ -0.064813 ┆ 0.043835  │
#> │ 3.211677  ┆ -0.844644 ┆ -0.105788 │
#> │ -0.503034 ┆ 0.217253  ┆ 0.960273  │
#> └───────────┴───────────┴───────────┘

When running in parallel, we need to create background processes using mirai::daemons() (see the documentation of {purrr} for details).

# Set the number of background processes to use
mirai::daemons(3)
#> [1] 3

After starting the daemons, we can wrap the function passed to purrr::map() with purrr::in_parallel() to run it in parallel in the background processes.

dat |>
  as.data.frame() |>
  map(in_parallel(\(col) scale(col)[, 1])) |>
  as_polars_df()
#> shape: (32, 3)
#> ┌───────────┬───────────┬───────────┐
#> │ carb      ┆ mpg       ┆ drat      │
#> │ ---       ┆ ---       ┆ ---       │
#> │ f64       ┆ f64       ┆ f64       │
#> ╞═══════════╪═══════════╪═══════════╡
#> │ 0.735203  ┆ 0.150885  ┆ 0.567514  │
#> │ 0.735203  ┆ 0.150885  ┆ 0.567514  │
#> │ -1.122152 ┆ 0.449543  ┆ 0.474     │
#> │ -1.122152 ┆ 0.217253  ┆ -0.966118 │
#> │ -0.503034 ┆ -0.230735 ┆ -0.835198 │
#> │ …         ┆ …         ┆ …         │
#> │ -0.503034 ┆ 1.710547  ┆ 0.324377  │
#> │ 0.735203  ┆ -0.711907 ┆ 1.166004  │
#> │ 1.97344   ┆ -0.064813 ┆ 0.043835  │
#> │ 3.211677  ┆ -0.844644 ┆ -0.105788 │
#> │ -0.503034 ┆ 0.217253  ┆ 0.960273  │
#> └───────────┴───────────┴───────────┘

Note that this applies the function to the entire dataset in memory, with all data collected into R. To apply the function to a subset of columns only, we can use purrr::map_at().

In this case, instead of using as.data.frame() to convert all data to R, we can use as.list() to convert the polars DataFrame to a list of Series and only convert some Series to R, thus avoiding unnecessary conversions between polars and R[1].

dat |>
  as.list() |> # Convert to a list of Series, without converting each column to an R vector yet.
  map_at(
    c("carb", "mpg"),
    # The Series themselves are sent to each daemon,
    # so we first convert them to R vectors with `$to_r_vector()`.
    # Finally, we convert the result back to a Series with `as_polars_series()` (optional in this case).
    in_parallel(\(s) scale(s$to_r_vector())[, 1] |> polars::as_polars_series())
  ) |>
  as_polars_df()
#> shape: (32, 3)
#> ┌───────────┬───────────┬──────┐
#> │ carb      ┆ mpg       ┆ drat │
#> │ ---       ┆ ---       ┆ ---  │
#> │ f64       ┆ f64       ┆ f64  │
#> ╞═══════════╪═══════════╪══════╡
#> │ 0.735203  ┆ 0.150885  ┆ 3.9  │
#> │ 0.735203  ┆ 0.150885  ┆ 3.9  │
#> │ -1.122152 ┆ 0.449543  ┆ 3.85 │
#> │ -1.122152 ┆ 0.217253  ┆ 3.08 │
#> │ -0.503034 ┆ -0.230735 ┆ 3.15 │
#> │ …         ┆ …         ┆ …    │
#> │ -0.503034 ┆ 1.710547  ┆ 3.77 │
#> │ 0.735203  ┆ -0.711907 ┆ 4.22 │
#> │ 1.97344   ┆ -0.064813 ┆ 3.62 │
#> │ 3.211677  ┆ -0.844644 ┆ 3.54 │
#> │ -0.503034 ┆ 0.217253  ┆ 4.11 │
#> └───────────┴───────────┴──────┘
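The same column-wise pattern does not depend on the daemons. As a sketch, dropping in_parallel() runs the identical conversion sequentially in the current R session, which can be handy for checking the function before parallelizing it. The as_polars_series() call is kept here so that every element of the list is a Series.

```r
library(polars)
library(purrr)

dat <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])

seq_res <- dat |>
  as.list() |> # list of Series
  map_at(
    c("carb", "mpg"),
    # Convert the Series to an R vector, scale it, and wrap it back in a Series.
    \(s) scale(s$to_r_vector())[, 1] |> as_polars_series()
  ) |>
  as_polars_df()

seq_res
```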

After running the parallel computations, it is good practice to stop the background processes to free up resources.

mirai::daemons(0)
#> [1] 0
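One way to make sure the daemons are always stopped, even if the computation fails, is to wrap the start and stop calls in a function and use on.exit(). The helper below, scale_in_parallel(), is a hypothetical name used only for illustration, and the sketch assumes that no daemons are already running when it is called.

```r
library(polars)

# Hypothetical helper: starts daemons, scales every column in parallel,
# and stops the daemons when the function exits, whether or not it succeeded.
scale_in_parallel <- function(df, n_daemons = 2) {
  mirai::daemons(n_daemons)
  on.exit(mirai::daemons(0), add = TRUE)

  df |>
    as.data.frame() |>
    purrr::map(purrr::in_parallel(\(col) scale(col)[, 1])) |>
    as_polars_df()
}

dat <- as_polars_df(mtcars[, c("carb", "mpg", "drat")])
par_res <- scale_in_parallel(dat)
par_res
```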

Conclusion

Using custom functions is sometimes necessary when processing data. polars allows one to chain many expressions, thus making it possible to create custom functions that benefit from all its optimizations, such as parallelism and streaming mode.

While using polars expressions is the recommended way to write custom functions, one can also apply functions by first converting the DataFrame or LazyFrame to R and then using purrr to run them in parallel.

[1] In technical terms, polars Series and other objects are wrappers around external pointers, which usually cannot be sent to other processes. However, this package has built-in integration with {mirai}, which registers the serialization and deserialization functions for Series and other objects when the package is loaded. This allows users to treat Series and other polars objects like regular R objects.