
Evaluate the query in streaming mode and write to a Parquet file

Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

Usage

parquet_statistics(
  ...,
  min = TRUE,
  max = TRUE,
  distinct_count = TRUE,
  null_count = TRUE
)

lazyframe__sink_parquet(
  path,
  ...,
  compression = c("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd"),
  compression_level = NULL,
  statistics = TRUE,
  row_group_size = NULL,
  data_page_size = NULL,
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  collapse_joins = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE
)

Arguments

... These dots are for future extensions and must be empty.
min Include stats on the minimum values in the column.
max Include stats on the maximum values in the column.
distinct_count Include stats on the number of distinct values in the column.
null_count Include stats on the number of null values in the column.
path A character. File path to which the file should be written.
compression The compression method. Must be one of:
  • “lz4”: fast compression/decompression.
  • “uncompressed”
  • “snappy”: this guarantees that the parquet file will be compatible with older parquet readers.
  • “gzip”
  • “lzo”
  • “brotli”
  • “zstd”: good compression performance.
compression_level NULL or integer. The level of compression to use. Only used if compression is one of “gzip”, “brotli”, or “zstd”. Higher compression means smaller files on disk:
  • “gzip”: min-level: 0, max-level: 10.
  • “brotli”: min-level: 0, max-level: 11.
  • “zstd”: min-level: 1, max-level: 22.
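For example, zstd can be pushed to a higher level to trade write speed for a smaller file (a minimal sketch writing to a throwaway tempfile(), using only arguments documented on this page):

library("polars")
as_polars_lf(mtcars)$sink_parquet(
  tempfile(),
  compression = "zstd",
  compression_level = 10
)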
statistics Whether statistics should be written to the Parquet headers. Possible values:
  • TRUE: enable the default set of statistics (default); not every available statistic is included in this set.
  • FALSE: disable all statistics
  • “full”: calculate and write all available statistics
  • A list created via parquet_statistics() to specify which statistics to include.
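For instance, statistics can be limited to minima and maxima by switching off the counts via parquet_statistics() (a minimal sketch; all arguments shown are documented above):

library("polars")
as_polars_lf(mtcars)$sink_parquet(
  tempfile(),
  statistics = parquet_statistics(distinct_count = FALSE, null_count = FALSE)
)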
row_group_size Size of the row groups in number of rows. If NULL (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
data_page_size Size of the data page in bytes. If NULL (default), it is set to 1024^2 bytes.
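For example, a fixed row group size can be requested instead of relying on the frame's chunks (a minimal sketch; the value 10000 is purely illustrative):

library("polars")
as_polars_lf(mtcars)$sink_parquet(tempfile(), row_group_size = 10000)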
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion A logical, indicating whether to apply type coercion optimization.
_type_check For internal use only.
predicate_pushdown A logical, indicating whether to apply predicate pushdown optimization.
projection_pushdown A logical, indicating whether to apply projection pushdown optimization.
simplify_expression A logical, indicating whether to apply expression simplification optimization.
slice_pushdown A logical, indicating whether to apply slice pushdown optimization.
collapse_joins Collapse a join and filters into a faster join.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter, e.g. c(token = "YOUR_TOKEN"), or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
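A hedged sketch of a cloud write follows; the bucket URL and credential values are placeholders, and the key names assume the AWS option names from the documentation linked above:

library("polars")
as_polars_lf(mtcars)$sink_parquet(
  "s3://my-bucket/mtcars.parquet",  # placeholder bucket
  storage_options = c(
    aws_access_key_id = "YOUR_KEY",         # placeholder credential
    aws_secret_access_key = "YOUR_SECRET",  # placeholder credential
    aws_region = "us-east-1"
  )
)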
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk before closing a file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
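For example, the sink can create a nested output directory on the fly and sync file contents before closing (a minimal sketch):

library("polars")
out <- file.path(tempdir(), "nested", "dir", "mtcars.parquet")
as_polars_lf(mtcars)$sink_parquet(out, mkdir = TRUE, sync_on_close = "data")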

Value

Invisibly returns the input LazyFrame.

Examples

library("polars")

# sink the table 'mtcars' from memory to a Parquet file
tmpf <- tempfile()
as_polars_lf(mtcars)$sink_parquet(tmpf)

# stream a query end-to-end
tmpf2 <- tempfile()
pl$scan_parquet(tmpf)$select(pl$col("cyl") * 2)$sink_parquet(tmpf2)

# read the Parquet file back into a DataFrame in memory
pl$scan_parquet(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘