Lazily read from a CSV file or multiple files via glob patterns

Description

Reading lazily allows the query optimizer to push predicates and projections down to the scan level, which can reduce memory overhead.
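
As a rough illustration of what this enables, the sketch below builds a lazy query whose filter and column selection are known to the optimizer before any data is read. The temporary file and the threshold of 7 are made up for the example, and whether a given predicate is actually pushed down depends on the query.

```r
library(polars)

# Write a small CSV to a temporary file (row names omitted for simplicity).
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Build the lazy query: nothing is read from disk at this point.
query <- pl$scan_csv(csv_path)$
  filter(pl$col("Sepal.Length") > 7)$
  select(pl$col("Species", "Sepal.Length"))

# The file is only scanned when the query is collected.
result <- query$collect()
result

unlink(csv_path)
```

Only the rows and columns that survive the filter and selection end up in `result`.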

Usage

pl$scan_csv(
  source,
  ...,
  has_header = TRUE,
  separator = ",",
  comment_prefix = NULL,
  quote_char = "\"",
  skip_rows = 0,
  schema = NULL,
  schema_overrides = NULL,
  null_values = NULL,
  missing_utf8_is_empty_string = FALSE,
  ignore_errors = FALSE,
  cache = FALSE,
  infer_schema = TRUE,
  infer_schema_length = 100,
  n_rows = NULL,
  encoding = c("utf8", "utf8-lossy"),
  low_memory = FALSE,
  rechunk = FALSE,
  skip_rows_after_header = 0,
  row_index_name = NULL,
  row_index_offset = 0,
  try_parse_dates = FALSE,
  eol_char = "\n",
  raise_if_empty = TRUE,
  truncate_ragged_lines = FALSE,
  decimal_comma = FALSE,
  glob = TRUE,
  storage_options = NULL,
  retries = 2,
  file_cache_ttl = NULL,
  include_file_paths = NULL
)

Arguments

source Path to a file or URL. Multiple paths can be provided as long as all CSV files have the same schema. Providing several URLs is not supported.

… These dots are for future extensions and must be empty.

has_header Indicate whether the first row of the dataset is a header. If FALSE, column names will be autogenerated in the following format: “column_x”, with x being an enumeration over every column in the dataset, starting at 1.

separator Single byte character to use as separator in the file.

comment_prefix A string, which can be up to 5 symbols in length, used to indicate the start of a comment line. For instance, it can be set to # or //.

quote_char Single byte character used for quoting. Set to NULL to turn off special handling and escaping of quotes.

skip_rows Start reading after a particular number of rows. The header will be parsed at this offset.

schema Provide the schema. This means that polars doesn’t do schema inference. This argument expects the complete schema, whereas schema_overrides can be used to partially overwrite a schema. This must be a list. Names of list elements are used to match to inferred columns.

schema_overrides Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns.
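
A minimal sketch of a partial override, assuming a made-up two-column file: only `id` is forced to a string type, while `score` is still inferred. The column names and values are hypothetical.

```r
library(polars)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "001,10", "002,20"), csv_path)

# Without the override, `id` would likely be inferred as an integer and
# lose its leading zeros. `score` is not listed, so it is still inferred.
df <- pl$scan_csv(
  csv_path,
  schema_overrides = list(id = pl$String)
)$collect()
df

unlink(csv_path)
```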

null_values Character vector specifying the values to interpret as NA values. It can be named, in which case names specify the columns in which this replacement must be made (e.g. c(col1 = "a")).
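
For illustration, the sketch below uses an unnamed vector so that the listed strings are treated as missing in every column. The file contents and the "missing" marker are invented for the example.

```r
library(polars)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,NA", "missing,x", "3,y"), csv_path)

# Both "NA" and "missing" are read as missing values, whatever the column.
df <- pl$scan_csv(csv_path, null_values = c("NA", "missing"))$collect()
df

unlink(csv_path)
```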

missing_utf8_is_empty_string By default, a missing value is considered to be NA. Setting this parameter to TRUE will consider missing UTF8 values as an empty character.

ignore_errors Keep reading the file even if some lines yield errors. You can also use infer_schema = FALSE to read all columns as UTF8 to check which values might cause an issue.

cache Cache the result after reading.

infer_schema If TRUE (default), the schema is inferred from the data using the first infer_schema_length rows. When FALSE, the schema is not inferred and will be pl$String if not specified in schema or schema_overrides.

infer_schema_length The maximum number of rows to scan for schema inference. If NULL, the full data may be scanned (this is slow). Set infer_schema = FALSE to read all columns as pl$String.

n_rows Stop reading from CSV file after reading n_rows.

encoding Either “utf8” or “utf8-lossy”. Lossy means that invalid UTF8 values are replaced with "?" characters.

low_memory Reduce memory pressure at the expense of performance.

rechunk Reallocate to contiguous memory when all chunks / files are parsed.

skip_rows_after_header Skip this number of rows after the header is parsed.

row_index_name If not NULL, this will insert a row index column with the given name into the DataFrame.

row_index_offset Offset to start the row index column (only used if the name is set).
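
The two row index arguments can be combined as in the sketch below, which asks for a 1-based index column. The column name `idx` and the file contents are arbitrary choices for the example.

```r
library(polars)

csv_path <- tempfile(fileext = ".csv")
writeLines(c("x", "a", "b", "c"), csv_path)

# Add a row index column named "idx" that starts at 1 instead of 0.
df <- pl$scan_csv(
  csv_path,
  row_index_name = "idx",
  row_index_offset = 1
)$collect()
df

unlink(csv_path)
```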

try_parse_dates Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl$String.

eol_char Single byte end of line character (default: \n). When encountering a file with Windows line endings (\r\n), one can go with the default \n. The extra \r will be removed when processed.

raise_if_empty If FALSE, parsing an empty file returns an empty DataFrame or LazyFrame instead of raising an error.

truncate_ragged_lines Truncate lines that are longer than the schema.

decimal_comma Parse floats using a comma as the decimal separator instead of a period.
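
A short sketch of reading a file written in the semicolon-separated, comma-decimal style that base R's `write.csv2()` produces. The data frame is made up, and the example assumes that schema inference recognises the comma-decimal values as floats.

```r
library(polars)

csv_path <- tempfile(fileext = ".csv")
# write.csv2() uses ";" as the separator and "," as the decimal mark.
write.csv2(data.frame(id = 1:2, x = c(1.5, 2.25)), csv_path, row.names = FALSE)

df <- pl$scan_csv(
  csv_path,
  separator = ";",
  decimal_comma = TRUE
)$collect()
df

unlink(csv_path)
```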

glob Expand path given via globbing rules.

storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:

aws
gcp
azure
Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.

If storage_options is not provided, Polars will try to infer the information from environment variables.

retries Number of retries if accessing a cloud instance fails.

file_cache_ttl Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.

include_file_paths Include the path of the source file(s) as a column with this name.
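
To show how several files and include_file_paths fit together, the sketch below scans two small files through a glob pattern and records where each row came from. The directory layout, file names, and the column name `source_file` are hypothetical.

```r
library(polars)

# Create a temporary directory holding two CSV files with the same schema.
csv_dir <- tempfile()
dir.create(csv_dir)
write.csv(data.frame(x = 1:2), file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(csv_dir, "part2.csv"), row.names = FALSE)

# The glob pattern matches both files; each row keeps its source path.
df <- pl$scan_csv(
  file.path(csv_dir, "*.csv"),
  include_file_paths = "source_file"
)$collect()
df

unlink(csv_dir, recursive = TRUE)
```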

Value

A polars LazyFrame

Examples

library("polars")

my_file <- tempfile()
write.csv(iris, my_file)
lazy_frame <- pl$scan_csv(my_file)
lazy_frame$collect()

#> shape: (150, 6)
#> ┌─────┬──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │     ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ --- ┆ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ i64 ┆ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
#> ╞═════╪══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 1   ┆ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 2   ┆ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 3   ┆ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4   ┆ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ 5   ┆ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ …   ┆ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 146 ┆ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
#> │ 147 ┆ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 148 ┆ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 149 ┆ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 150 ┆ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └─────┴──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

unlink(my_file)