Lazily read from a CSV file or multiple files via glob patterns
Description
Reading lazily allows the query optimizer to push predicates and projections down to the scan level, which can reduce memory overhead.
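As a minimal sketch of what pushdown means in practice, the query below filters rows and selects columns before anything is read. The file, the `mtcars` dataset, and the variable names `path`, `query` and `result` are illustrative assumptions, not part of this page.

```r
library(polars)

# Write a small CSV to a temporary file (mtcars ships with base R).
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

# Build a lazy query: no data is read at this point.
query <- pl$scan_csv(path)$
  filter(pl$col("cyl") == 6)$
  select(pl$col("mpg", "cyl"))

# Print the optimized plan; the filter and the column selection are
# expected to appear as part of the CSV scan rather than as later steps.
cat(query$explain(), "\n")

# Only now is the file scanned and the result materialized.
result <- query$collect()
```

Calling `$collect()` is the only step that touches the file, so the optimizer is free to skip unused columns and non-matching rows while scanning.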
Usage
pl$scan_csv(
source,
...,
has_header = TRUE,
separator = ",",
comment_prefix = NULL,
quote_char = "\"",
skip_rows = 0,
schema = NULL,
schema_overrides = NULL,
null_values = NULL,
missing_utf8_is_empty_string = FALSE,
ignore_errors = FALSE,
cache = FALSE,
infer_schema = TRUE,
infer_schema_length = 100,
n_rows = NULL,
encoding = c("utf8", "utf8-lossy"),
low_memory = FALSE,
rechunk = FALSE,
skip_rows_after_header = 0,
row_index_name = NULL,
row_index_offset = 0,
try_parse_dates = FALSE,
eol_char = "\n",
raise_if_empty = TRUE,
truncate_ragged_lines = FALSE,
decimal_comma = FALSE,
glob = TRUE,
storage_options = NULL,
retries = 2,
file_cache_ttl = NULL,
include_file_paths = NULL
)
Arguments
source
|
Path to a file or URL. Multiple paths can be given as long as all CSV files have the same schema. Providing several URLs is not supported. |
...
|
These dots are for future extensions and must be empty. |
has_header
|
Indicate whether the first row of the dataset is a header. If
FALSE, column names will be autogenerated in the following
format: "column_x", with x being an enumeration
over every column in the dataset, starting at 1.
|
separator
|
Single byte character to use as separator in the file. |
comment_prefix
|
A string, which can be up to 5 symbols in length, used to indicate the
start of a comment line. For instance, it can be set to
# or //.
|
quote_char
|
Single byte character used for quoting. Set to NULL to turn
off special handling and escaping of quotes.
|
skip_rows
|
Start reading after a particular number of rows. The header will be parsed at this offset. |
schema
|
Provide the schema. This means that polars doesn’t do schema inference.
This argument expects the complete schema, whereas
schema_overrides can be used to partially overwrite a
schema. This must be a list. Names of list elements are used to match to
inferred columns.
|
schema_overrides
|
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
null_values
|
Character vector specifying the values to interpret as NA
values. It can be named, in which case names specify the columns in
which this replacement must be made (e.g. c(col1 = "a")).
|
missing_utf8_is_empty_string
|
By default, a missing value is considered to be NA. Setting
this parameter to TRUE will treat missing UTF8 values as
empty strings.
|
ignore_errors
|
Keep reading the file even if some lines yield errors. You can also use
infer_schema = FALSE to read all columns as UTF8 to check
which values might cause an issue.
|
cache
|
Cache the result after reading. |
infer_schema
|
If TRUE (default), the schema is inferred from the data
using the first infer_schema_length rows. When
FALSE, the schema is not inferred and every column will be
pl$String if not specified in schema or
schema_overrides.
|
infer_schema_length
|
The maximum number of rows to scan for schema inference. If
NULL, the full data may be scanned (this is slow). Set
infer_schema = FALSE to read all columns as
pl$String.
|
n_rows
|
Stop reading from the CSV file after n_rows rows.
|
encoding
|
Either "utf8" or "utf8-lossy". Lossy means
that invalid UTF8 values are replaced with "?" characters.
|
low_memory
|
Reduce memory pressure at the expense of performance. |
rechunk
|
Reallocate to contiguous memory when all chunks / files are parsed. |
skip_rows_after_header
|
Skip this number of rows when the header is parsed. |
row_index_name
|
If not NULL, this will insert a row index column with the
given name into the DataFrame.
|
row_index_offset
|
Offset to start the row index column (only used if the name is set). |
try_parse_dates
|
Try to automatically parse dates. Most ISO8601-like formats can be
inferred, as well as a handful of others. If this does not succeed, the
column remains of data type pl$String.
|
eol_char
|
Single byte end of line character (default:
"\n"). When encountering a file with
Windows line endings ("\r\n"), one can
go with the default "\n". The extra
"\r" will be removed when processed.
|
raise_if_empty
|
If FALSE, parsing an empty file returns an empty DataFrame
or LazyFrame.
|
truncate_ragged_lines
|
Truncate lines that are longer than the schema. |
decimal_comma
|
Parse floats using a comma as the decimal separator instead of a period. |
glob
|
Expand path given via globbing rules. |
storage_options
|
Named vector containing options that indicate how to connect to a cloud
provider. The cloud providers currently supported are AWS, GCP, and
Azure. If storage_options is not provided, Polars will try to
infer the information from environment variables.
|
retries
|
Number of retries if accessing a cloud instance fails. |
file_cache_ttl
|
Amount of time to keep downloaded cloud files since their last access
time, in seconds. Uses the POLARS_FILE_CACHE_TTL
environment variable (which defaults to 1 hour) if not given.
|
include_file_paths
|
Include the path of the source file(s) as a column with this name. |
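To illustrate how some of these arguments combine, here is a short sketch that marks a placeholder string as missing and overrides the inferred type of one column. The file contents and the column names `id`, `score` and `city` are made up for the example.

```r
library(polars)

# A small CSV in which the string "missing" stands for absent values.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,score,city",
  "1,3.5,Paris",
  "2,missing,Lyon",
  "3,4.0,missing"
), path)

lf <- pl$scan_csv(
  path,
  # Treat "missing" as a null value in every column.
  null_values = "missing",
  # Read `id` as a string instead of the integer type inferred by default.
  schema_overrides = list(id = pl$String)
)

df <- lf$collect()
```

Because schema_overrides only names `id`, the remaining columns still go through schema inference.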
Value
A polars LazyFrame
Examples
library("polars")
my_file <- tempfile()
write.csv(iris, my_file)
lazy_frame <- pl$scan_csv(my_file)
lazy_frame$collect()
#> shape: (150, 6)
#> ┌─────┬──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
#> ╞═════╪══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 1 ┆ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 2 ┆ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 3 ┆ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
#> │ 4 ┆ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
#> │ 5 ┆ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 146 ┆ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
#> │ 147 ┆ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
#> │ 148 ┆ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
#> │ 149 ┆ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
#> │ 150 ┆ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
#> └─────┴──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
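The title mentions reading several files through a glob pattern. A minimal sketch of that usage follows, assuming two files with the same schema in a temporary directory; the directory name, the file names and the column name "file" are illustrative choices.

```r
library(polars)

# Create a temporary directory holding two CSV files with the same schema.
csv_dir <- tempfile("csv_parts_")
dir.create(csv_dir)
write.csv(iris[1:50, ], file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(iris[51:100, ], file.path(csv_dir, "part2.csv"), row.names = FALSE)

# Scan both files with one glob pattern (glob = TRUE is the default) and
# record the originating file of each row in a column named "file".
lf <- pl$scan_csv(
  file.path(csv_dir, "*.csv"),
  include_file_paths = "file"
)

df <- lf$collect()
```

The resulting DataFrame stacks the rows of both files, with the extra "file" column telling them apart.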