Evaluate the query in streaming mode and write to a CSV file

Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

Usage

<LazyFrame>$sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  collapse_joins = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE
)

Arguments

path A character. File path to which the file should be written.
... These dots are for future extensions and must be empty.
include_bom Logical, whether to include UTF-8 BOM in the CSV output.
include_header Logical, whether to include header in the CSV output.
separator Separate CSV fields with this symbol.
line_terminator String used to end each row.
quote_char Byte to use as quoting character.
batch_size Number of rows that will be processed per thread.
datetime_format A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame’s Datetime columns (if any).
date_format A format string, with the specifiers defined by the chrono Rust crate.
time_format A format string, with the specifiers defined by the chrono Rust crate.
float_scientific Whether to always use scientific notation (TRUE), never use it (FALSE), or decide automatically (NULL) for Float32 and Float64 datatypes.
float_precision Number of decimal places to write, applied to both Float32 and Float64 datatypes.
null_value A string representing null values (defaulting to the empty string).
quote_style Determines the quoting strategy used. Must be one of:
  • “necessary” (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  • “always”: This puts quotes around every field. Always.
  • “never”: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
  • “non_numeric”: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion A logical, indicating whether to apply the type coercion optimization.
_type_check For internal use only.
predicate_pushdown A logical, indicating whether to apply the predicate pushdown optimization.
projection_pushdown A logical, indicating whether to apply the projection pushdown optimization.
simplify_expression A logical, indicating whether to apply the simplify expression optimization.
slice_pushdown A logical, indicating whether to apply the slice pushdown optimization.
collapse_joins Collapse a join and filters into a faster join.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk before closing a file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
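
The sketch below is illustrative and not part of the shipped examples; it shows how several of the formatting arguments above can be combined. The data, file name, and chosen values are arbitrary.

library("polars")

# hypothetical data: a numeric column with a missing value and a string
# column that contains the separator character
dat_fmt <- data.frame(x = c(1.2345, NA), label = c("a;b", "plain"))
tmp_fmt <- tempfile(fileext = ".csv")
as_polars_lf(dat_fmt)$sink_csv(
  tmp_fmt,
  separator = ";",
  null_value = "NA",
  quote_style = "non_numeric",
  float_precision = 2
)
readLines(tmp_fmt)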

Value

Invisibly returns the input LazyFrame.

Examples

library("polars")

# sink table 'mtcars' from memory to a CSV file
tmpf <- tempfile(fileext = ".csv")
as_polars_lf(mtcars)$sink_csv(tmpf)

# stream a query end-to-end
tmpf2 <- tempfile(fileext = ".csv")
pl$scan_csv(tmpf)$select(pl$col("cyl") * 2)$sink_csv(tmpf2)

# read the CSV written above back into a DataFrame in memory
pl$scan_csv(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘
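
# a further illustrative sketch (not from the original examples): write a
# Datetime column with a chrono-style format string; the data and the format
# string are assumptions, not package defaults
tmpf3 <- tempfile(fileext = ".csv")
dat <- data.frame(
  when = as.POSIXct(c("2020-01-01 12:00:00", "2020-06-15 08:30:00"), tz = "UTC"),
  value = c(1.5, 2.5)
)
as_polars_lf(dat)$sink_csv(tmpf3, datetime_format = "%Y-%m-%dT%H:%M:%S")

# another illustrative sketch: let sink_csv() create missing directories
nested <- file.path(tempdir(), "sink_csv_demo", "out.csv")
as_polars_lf(mtcars)$sink_csv(nested, mkdir = TRUE)
file.exists(nested)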