Collect and profile a lazy query

Description

Runs the query and returns a list containing the materialized DataFrame and a second DataFrame with profiling information for each node that is executed.

Usage

<LazyFrame>$profile(
  ...,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  collapse_joins = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  `_check_order` = TRUE,
  show_plot = FALSE,
  truncate_nodes = 0
)

Arguments

... These dots are for future extensions and must be empty.
type_coercion A logical indicating whether to apply the type coercion optimization.
predicate_pushdown A logical indicating whether to apply the predicate pushdown optimization.
projection_pushdown A logical indicating whether to apply the projection pushdown optimization.
simplify_expression A logical indicating whether to apply the simplify expression optimization.
slice_pushdown A logical indicating whether to apply the slice pushdown optimization.
comm_subplan_elim A logical indicating whether to try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim A logical indicating whether to try to cache common subexpressions.
cluster_with_columns A logical indicating whether to combine sequential independent calls to with_columns.
collapse_joins A logical indicating whether to collapse a join and filters into a faster join.
streaming [Deprecated] A logical. If TRUE, process the query in batches to handle larger-than-memory data. If FALSE (default), the entire query is processed in a single batch. Note that streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
_check_order, _type_check For internal use only.
show_plot Show a Gantt chart of the profiling result.
truncate_nodes Truncate the label lengths in the Gantt chart to this number of characters. If 0 (default), do not truncate.
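
As a rough sketch of how the optimization arguments might be used, the snippet below profiles the same query twice: once with the defaults and once with no_optimization = TRUE. The mtcars data and the filter/select query are illustrative choices and are not part of the original documentation. Timings vary between runs and machines.

```r
library(polars)

# Illustrative query (an assumption for this sketch): filter rows, then keep two columns
lf <- as_polars_lf(mtcars)$
  filter(pl$col("cyl") > 4)$
  select("mpg", "cyl")

# Profile with the default optimizations
optimized <- lf$profile()

# Profile again with (certain) optimizations turned off
unoptimized <- lf$profile(no_optimization = TRUE)

# The second element of each list holds the per-node timings
optimized[[2]]
unoptimized[[2]]
```

Both calls should return the same collected data in the first list element; only the executed plan and the timings are expected to differ.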

Details

The units of the timings are microseconds.

Value

List of two DataFrames: one with the collected result, the other with the timings of each step. If show_plot = TRUE, then the plot is also stored in the list.
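
A minimal sketch of working with the returned list, assuming the two DataFrames are accessed by position as in the printed examples. The column name duration_us is a hypothetical name chosen for this illustration. Since start and end are reported in microseconds, their difference is the time spent in each node.

```r
library(polars)

out <- pl$LazyFrame()$select(pl$lit(2) + 2)$profile()

result  <- out[[1]]  # the materialized DataFrame
timings <- out[[2]]  # one row per executed node, with start and end in microseconds

# Add the per-node duration in microseconds (duration_us is an illustrative name)
timings_with_duration <- timings$with_columns(
  duration_us = pl$col("end") - pl$col("start")
)
timings_with_duration
```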

See Also

  • $collect() - regular collect.
  • $sink_parquet() - streams the query to a Parquet file.
  • $sink_ipc() - streams the query to an Arrow IPC file.

Examples

library("polars")

# Simplest use case
pl$LazyFrame()$select(pl$lit(2) + 2)$profile()
#> [[1]]
#> shape: (1, 1)
#> ┌─────────┐
#> │ literal │
#> │ ---     │
#> │ f64     │
#> ╞═════════╡
#> │ 4.0     │
#> └─────────┘
#> 
#> [[2]]
#> shape: (2, 3)
#> ┌─────────────────┬───────┬──────┐
#> │ node            ┆ start ┆ end  │
#> │ ---             ┆ ---   ┆ ---  │
#> │ str             ┆ u64   ┆ u64  │
#> ╞═════════════════╪═══════╪══════╡
#> │ optimization    ┆ 0     ┆ 2688 │
#> │ select(literal) ┆ 2688  ┆ 2797 │
#> └─────────────────┴───────┴──────┘
# Use $profile() to compare two queries

# -1-  map each Species-group with native polars
as_polars_lf(iris)$
  sort("Sepal.Length")$
  group_by("Species", maintain_order = TRUE)$
  agg(pl$col(pl$Float64)$first() + 5)$
  profile()
#> [[1]]
#> shape: (3, 6)
#> ┌────────────┬────────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
#> │ Species    ┆ maintain_order ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
#> │ ---        ┆ ---            ┆ ---          ┆ ---         ┆ ---          ┆ ---         │
#> │ cat        ┆ bool           ┆ f64          ┆ f64         ┆ f64          ┆ f64         │
#> ╞════════════╪════════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
#> │ setosa     ┆ true           ┆ 9.3          ┆ 8.0         ┆ 6.1          ┆ 5.1         │
#> │ virginica  ┆ true           ┆ 9.9          ┆ 7.5         ┆ 9.5          ┆ 6.7         │
#> │ versicolor ┆ true           ┆ 9.9          ┆ 7.4         ┆ 8.3          ┆ 6.0         │
#> └────────────┴────────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
#> 
#> [[2]]
#> shape: (3, 3)
#> ┌─────────────────────────────────┬───────┬──────┐
#> │ node                            ┆ start ┆ end  │
#> │ ---                             ┆ ---   ┆ ---  │
#> │ str                             ┆ u64   ┆ u64  │
#> ╞═════════════════════════════════╪═══════╪══════╡
#> │ optimization                    ┆ 0     ┆ 1762 │
#> │ sort(Sepal.Length)              ┆ 1762  ┆ 2816 │
#> │ group_by(Species, maintain_ord… ┆ 2847  ┆ 6453 │
#> └─────────────────────────────────┴───────┴──────┘
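
# -2-  map each Species-group of each numeric column with an R function

# some R function, prints `.` for each time called by polars
# r_func <- \(s) {
#   cat(".")
#   s$to_r()[1] + 5
# }
#
# as_polars_lf(iris)$
#   sort("Sepal.Length")$
#   group_by("Species", maintain_order = TRUE)$
#   agg(pl$col(pl$Float64)$map_elements(r_func))$
#   profile()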