Optimize polars performance

As highlighted by the DuckDB benchmarks, polars is very efficient at dealing with large datasets. Still, one can make polars even faster by following a few good practices.

Lazy vs eager execution

Order of operations

In the “Get Started” vignette, we mostly used eager execution. This is the classic type of execution in R: write some code, link functions with pipes, run the code, and it is executed line by line.

When dealing with datasets of a reasonable size (let’s say up to a few hundred thousand observations and a few dozen columns), this kind of execution is perfectly fine. We don’t really have to worry about whether our code is optimized to save memory and time.

However, when we start working with much larger datasets, the order in which functions are applied is extremely important. Indeed, some functions are much more memory intensive than others.

For instance, let’s say we have a dataset with 1M observations containing country names and a few numeric columns. We would like to keep just a few of these countries and sort them alphabetically. Here, we have two operations: filtering and sorting. Sorting is much more expensive than filtering. To filter data, we simply check whether each row meets some conditions, but to sort data we have to compare rows with each other and rearrange them as we go. If we don’t take this into account and sort the data before filtering it, we sort many rows that are then thrown away, which can slow down our pipeline significantly.

countries <- c(
  "France",
  "Germany",
  "United Kingdom",
  "Japan",
  "Colombia",
  "South Korea",
  "Vietnam",
  "South Africa",
  "Senegal",
  "Iran"
)

set.seed(123)
test <- data.frame(
  country = sample(countries, 1e6, TRUE),
  x = sample(1:100, 1e6, TRUE),
  y = sample(1:1000, 1e6, TRUE)
)

bench::mark(
  sort_filter = {
    tmp = test[order(test$country), ]
    subset(tmp, country %in% c("United Kingdom", "Japan", "Vietnam"))
  },
  filter_sort = {
    tmp = subset(test, country %in% c("United Kingdom", "Japan", "Vietnam"))
    tmp[order(tmp$country), ]
  }
)

#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 sort_filter    10.4s    10.4s    0.0964    87.2MB    0.193
#> 2 filter_sort     2.4s     2.4s    0.417     67.1MB    0.835

How does polars help?

We have seen that the order in which functions are applied matters a lot. But it already takes a long time to deal with large data, and we don’t want to spend even more time trying to optimize our pipeline.

This is where lazy execution comes into play. The idea is that we write our code as usual, but this time, we don’t apply it directly to a dataset but to a lazy dataset, i.e. a dataset that is not loaded in memory yet (in polars terms, a regular dataset is a DataFrame and a lazy dataset is a LazyFrame). Once our code is ready, we call collect() at the end of the pipeline. Before executing our code, polars will internally check whether it can be optimized, for example by reordering some operations.

Let’s re-use the example above but this time with polars syntax and 10M observations. For the purpose of this vignette, we can create a LazyFrame directly in our session, but if the data were stored in a CSV file for instance, we would have to scan it first with pl$scan_csv():

library(polars)

set.seed(123)
df_test <- pl$DataFrame(
  country = sample(countries, 1e7, TRUE),
  x = sample(1:100, 1e7, TRUE),
  y = sample(1:1000, 1e7, TRUE)
)

lf_test <- df_test$lazy()
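
For data that lives on disk, the same pattern could look like the sketch below. It is only an illustration: the temporary file, the tiny dataset, and the object names (csv_path, small, lf_csv, res) are made up for this example, and the filter reuses the is_in() form shown in this vignette.

```r
library(polars)

# Hypothetical example data written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
small <- data.frame(
  country = c("Vietnam", "France", "Japan", "Senegal", "Japan"),
  x = 1:5
)
write.csv(small, csv_path, row.names = FALSE)

# scan_csv() only registers the file: nothing is read at this point
lf_csv <- pl$scan_csv(csv_path)

# The query is built lazily and only executed by collect()
res <- lf_csv$
  filter(pl$col("country")$is_in(list(c("Japan", "Vietnam"))))$
  sort(pl$col("country"))$
  collect()

res
```

Because the file is scanned rather than read, polars is free to decide what to load once the full query is known.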

Now, we can convert the base R code above to a polars query:

df_test$sort(pl$col("country"))$filter(
  pl$col("country")$is_in(list(c("United Kingdom", "Japan", "Vietnam")))
)
#> shape: (3_000_835, 3)
#> ┌─────────┬─────┬─────┐
#> │ country ┆ x   ┆ y   │
#> │ ---     ┆ --- ┆ --- │
#> │ str     ┆ i32 ┆ i32 │
#> ╞═════════╪═════╪═════╡
#> │ Japan   ┆ 97  ┆ 7   │
#> │ Japan   ┆ 96  ┆ 672 │
#> │ Japan   ┆ 17  ┆ 710 │
#> │ Japan   ┆ 68  ┆ 41  │
#> │ Japan   ┆ 100 ┆ 109 │
#> │ …       ┆ …   ┆ …   │
#> │ Vietnam ┆ 89  ┆ 352 │
#> │ Vietnam ┆ 62  ┆ 8   │
#> │ Vietnam ┆ 52  ┆ 988 │
#> │ Vietnam ┆ 85  ┆ 982 │
#> │ Vietnam ┆ 74  ┆ 692 │
#> └─────────┴─────┴─────┘

This works for the DataFrame, which uses eager execution. For the LazyFrame, we can write the same query:

lazy_query <- lf_test$sort(pl$col("country"))$filter(
  pl$col("country")$is_in(list(c("United Kingdom", "Japan", "Vietnam")))
)

lazy_query
#> <polars_lazy_frame>

However, this doesn’t do anything to the data until we call collect() at the end. We can now compare the two approaches:

bench::mark(
  eager = df_test$
    sort(pl$col("country"))$
    filter(
      pl$col("country")$is_in(list(c("United Kingdom", "Japan", "Vietnam")))
    ),
  lazy = lazy_query$collect(),
  iterations = 10
)

#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 eager         1.18s    1.28s     0.754    9.95KB        0
#> 2 lazy       526.98ms 577.35ms     1.72       864B        0

The timings themselves are not very important. What is interesting here is that, even on a very simple query, using lazy execution instead of eager execution made the query roughly 2.2x faster.

This is because under the hood, polars reorganized the query so that it filters rows as early as possible, and then sorts the remaining data. This can be seen by comparing the original query with the optimized one:

lazy_query$explain(optimized = FALSE) |>
  cat()
#> FILTER col("country").is_in([["United Kingdom", "Japan", "Vietnam"]])
#> FROM
#>   SORT BY [col("country")]
#>     DF ["country", "x", "y"]; PROJECT */3 COLUMNS

lazy_query$explain() |>
  cat()
#> SORT BY [col("country")]
#>   FILTER col("country").is_in([["United Kingdom", "Japan", "Vietnam"]])
#>   FROM
#>     DF ["country", "x", "y"]; PROJECT */3 COLUMNS

Query plans are read from bottom to top. Therefore, the non-optimized query is “select the data, sort it by country, and then keep only the three countries we want”. Polars optimizes this query to apply the filter as early as possible, therefore leading to the query “select the dataset, keep only the three countries we want, and then sort the data by country”.
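
To make this comparison easier to repeat on other queries, one could wrap the two explain() calls in a small helper. The function compare_plans() below is hypothetical (it is not part of polars), and the small LazyFrame is made up for illustration; the helper only relies on explain() and explain(optimized = FALSE) as used above.

```r
library(polars)

# Hypothetical helper: print the non-optimized and the optimized plan
# of a LazyFrame one after the other, and return both invisibly.
compare_plans <- function(lf) {
  naive <- lf$explain(optimized = FALSE)
  optimized <- lf$explain()
  cat("Non-optimized plan:\n", naive, "\n\n", sep = "")
  cat("Optimized plan:\n", optimized, "\n", sep = "")
  invisible(list(naive = naive, optimized = optimized))
}

# Made-up example data
lf_small <- pl$DataFrame(
  country = c("Japan", "France", "Vietnam"),
  x = 1:3
)$lazy()

# Sort first, filter second: the optimizer is expected to move the filter down
plans <- compare_plans(
  lf_small$sort(pl$col("country"))$filter(pl$col("x") > 1)
)
```

Reading both plans from bottom to top shows where each operation ends up after optimization.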

Polars runs many optimizations in the background. The full list is given in the Polars User Guide.

Use the streaming engine

Quoting the Polars User Guide:

One additional benefit of the lazy API is that it allows queries to be executed in a streaming manner. Instead of processing the data all-at-once Polars can execute the query in batches allowing you to process datasets that are larger-than-memory.

Even in cases where the data would fit in memory, the streaming engine is often more efficient than the in-memory engine.

To use the streaming engine, we can add engine = "streaming" in collect():

bench::mark(
  lazy = lazy_query$collect(),
  lazy_streaming = lazy_query$collect(engine = "streaming"),
  iterations = 20
)

#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lazy              482ms    546ms      1.85      864B        0
#> 2 lazy_streaming    292ms    346ms      2.90      864B        0
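
The streaming engine is not limited to filters and sorts. As a minimal sketch, the example below runs a grouped aggregation with both engines and checks that they agree. The data and the object names (lf_small, query, in_memory, streaming) are made up for illustration, and the result is sorted explicitly because the order of groups is not guaranteed.

```r
library(polars)

# Made-up example data
lf_small <- pl$DataFrame(
  country = c("Japan", "France", "Japan", "Vietnam", "France", "Japan"),
  x = c(10, 20, 30, 40, 50, 60)
)$lazy()

# Mean of x per country, sorted so that both results are comparable
query <- lf_small$
  group_by("country")$
  agg(pl$col("x")$mean())$
  sort(pl$col("country"))

in_memory <- query$collect()
streaming <- query$collect(engine = "streaming")

# Both engines should return the same values
all.equal(as.data.frame(in_memory), as.data.frame(streaming))
```

Only the engine argument changes between the two calls; the query itself stays the same.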

Use polars functions

polars comes with a large number of built-in, optimized, basic functions that should cover most aspects of data wrangling. These functions are designed to be very memory efficient. Therefore, using R functions or converting data back and forth between polars and R is discouraged as it can lead to a large decrease in efficiency.

Let’s use the test data from the previous section and let’s say that we only want to check whether each country name contains “na”. This can be done in (at least) two ways: with the built-in function contains() and with the base R function grepl(). However, using the built-in function is much faster:

bench::mark(
  contains = df_test$with_columns(
    pl$col("country")$str$contains("na")
  ),
  grepl = df_test$with_columns(
    pl$col("country")$map_batches(\(s) { # map with an R function
      grepl("na", s)
    })
  ),
  iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 contains   387.02ms 432.12ms     2.27   401.86KB    0
#> 2 grepl         2.06s    2.11s     0.466  114.79MB    0.512

If you have to use external functions (for instance provided by another package), you can use the package purrr for that. See the vignette “Using custom functions”.
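
The same advice applies to other common R idioms. As a hedged sketch, a conditional that one might write with ifelse() in base R can usually be expressed with polars’ own when/then/otherwise expressions, so that the data never leaves polars. The data and the column names has_na and x_group are made up for this example.

```r
library(polars)

# Made-up example data
df_small <- pl$DataFrame(
  country = c("France", "Japan", "Senegal", "Vietnam"),
  x = c(10L, 75L, 50L, 98L)
)

res <- df_small$with_columns(
  # Built-in string function instead of grepl()
  has_na = pl$col("country")$str$contains("na"),
  # Built-in conditional instead of ifelse(x > 50, "high", "low")
  x_group = pl$when(pl$col("x") > 50)$
    then(pl$lit("high"))$
    otherwise(pl$lit("low"))
)

res
```

Both new columns are computed by polars itself, without passing any values to R functions.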