Evaluate the query in streaming mode and write to a CSV file

Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

Usage

<LazyFrame>$sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  collapse_joins = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE
)

Arguments

path A character. File path to which the file should be written.
... These dots are for future extensions and must be empty.
include_bom Logical, whether to include UTF-8 BOM in the CSV output.
include_header Logical, whether to include header in the CSV output.
separator Separate CSV fields with this symbol.
line_terminator String used to end each row.
quote_char Byte to use as quoting character.
batch_size Number of rows that will be processed per thread.
datetime_format A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame’s Datetime columns (if any).
date_format A format string, with the specifiers defined by the chrono Rust crate.
time_format A format string, with the specifiers defined by the chrono Rust crate.
float_scientific Whether to always use scientific notation (TRUE), never use it (FALSE), or decide automatically (NULL) for Float32 and Float64 datatypes.
float_precision Number of decimal places to write, applied to both Float32 and Float64 datatypes.
null_value A string representing null values (defaulting to the empty string).
quote_style Determines the quoting strategy used. Must be one of:
  • “necessary” (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  • “always”: This puts quotes around every field. Always.
  • “never”: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
  • “non_numeric”: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion A logical, indicating whether to apply the type coercion optimization.
_type_check For internal use only.
predicate_pushdown A logical, indicating whether to apply the predicate pushdown optimization.
projection_pushdown A logical, indicating whether to apply the projection pushdown optimization.
simplify_expression A logical, indicating whether to apply the simplify expression optimization.
slice_pushdown A logical, indicating whether to apply the slice pushdown optimization.
collapse_joins Collapse a join and filters into a faster join.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk before closing a file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
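
The sketch below is illustrative and not part of the shipped examples; it shows how several of the formatting arguments above can be combined. The data, file name, and chosen values are arbitrary.

library("polars")

# hypothetical data: a numeric column with a missing value and a string
# column that contains the separator character
dat_fmt <- data.frame(x = c(1.2345, NA), label = c("a;b", "plain"))
tmp_fmt <- tempfile(fileext = ".csv")
as_polars_lf(dat_fmt)$sink_csv(
  tmp_fmt,
  separator = ";",
  null_value = "NA",
  quote_style = "non_numeric",
  float_precision = 2
)
readLines(tmp_fmt)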

Value

Invisibly returns the input LazyFrame.

Examples

library("polars")

# sink table 'mtcars' from memory to a CSV file
tmpf <- tempfile(fileext = ".csv")
as_polars_lf(mtcars)$sink_csv(tmpf)

# stream a query end-to-end
tmpf2 <- tempfile(fileext = ".csv")
pl$scan_csv(tmpf)$select(pl$col("cyl") * 2)$sink_csv(tmpf2)

# read the CSV written above back into a DataFrame in memory
pl$scan_csv(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘
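
# a further illustrative sketch (not from the original examples): write a
# Datetime column with a chrono-style format string; the data and the format
# string are assumptions, not package defaults
tmpf3 <- tempfile(fileext = ".csv")
dat <- data.frame(
  when = as.POSIXct(c("2020-01-01 12:00:00", "2020-06-15 08:30:00"), tz = "UTC"),
  value = c(1.5, 2.5)
)
as_polars_lf(dat)$sink_csv(tmpf3, datetime_format = "%Y-%m-%dT%H:%M:%S")

# another illustrative sketch: let sink_csv() create missing directories
nested <- file.path(tempdir(), "sink_csv_demo", "out.csv")
as_polars_lf(mtcars)$sink_csv(nested, mkdir = TRUE)
file.exists(nested)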