New DataFrame from CSV
Description
New DataFrame from CSV
Usage
pl$read_csv(
source,
...,
has_header = TRUE,
separator = ",",
comment_prefix = NULL,
quote_char = "\"",
skip_rows = 0,
schema = NULL,
schema_overrides = NULL,
null_values = NULL,
missing_utf8_is_empty_string = FALSE,
ignore_errors = FALSE,
cache = FALSE,
infer_schema = TRUE,
infer_schema_length = 100,
n_rows = NULL,
encoding = c("utf8", "utf8-lossy"),
low_memory = FALSE,
rechunk = FALSE,
skip_rows_after_header = 0,
row_index_name = NULL,
row_index_offset = 0,
try_parse_dates = FALSE,
eol_char = "\n",
raise_if_empty = TRUE,
truncate_ragged_lines = FALSE,
decimal_comma = FALSE,
glob = TRUE,
storage_options = NULL,
retries = 2,
file_cache_ttl = NULL,
include_file_paths = NULL
)
Arguments
source
|
Path to a file or URL. Multiple paths can be provided, as long as all CSV files share the same schema. Providing several URLs is not supported. |
…
|
These dots are for future extensions and must be empty. |
has_header
|
Indicate if the first row of the dataset is a header or not. If
FALSE, column names will be autogenerated in the following
format: “column_x”, with x being an enumeration
over every column in the dataset, starting at 1.
|
separator
|
Single byte character to use as separator in the file. |
comment_prefix
|
A string, which can be up to 5 symbols in length, used to indicate the
start of a comment line. For instance, it can be set to
# or
//.
|
quote_char
|
Single byte character used for quoting. Set to NULL to turn
off special handling and escaping of quotes.
|
skip_rows
|
Start reading after a particular number of rows. The header will be parsed at this offset. |
schema
|
Provide the schema. This means that polars doesn’t do schema inference.
This argument expects the complete schema, whereas
schema_overrides can be used to partially overwrite a
schema. This must be a list. Names of list elements are used to match to
inferred columns.
|
schema_overrides
|
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
null_values
|
Character vector specifying the values to interpret as NA
values. It can be named, in which case names specify the columns in
which this replacement must be made (e.g. c(col1 = "a")).
|
missing_utf8_is_empty_string
|
By default, a missing value is considered to be NA. Setting
this parameter to TRUE will instead read missing UTF8 values as
empty strings.
|
ignore_errors
|
Keep reading the file even if some lines yield errors. You can also use
infer_schema = FALSE to read all columns as UTF8 to check
which values might cause an issue.
|
cache
|
Cache the result after reading. |
infer_schema
|
If TRUE (default), the schema is inferred from the data
using the first infer_schema_length rows. When
FALSE , the schema is not inferred and will be
pl$String if not specified in schema or
schema_overrides .
|
infer_schema_length
|
The maximum number of rows to scan for schema inference. If
NULL , the full data may be scanned (this is slow). Set
infer_schema = FALSE to read all columns as
pl$String .
|
n_rows
|
Stop reading from the CSV file after reading n_rows.
|
encoding
|
Either “utf8” or “utf8-lossy”. Lossy means
that invalid UTF8 values are replaced with “�” characters.
|
low_memory
|
Reduce memory pressure at the expense of performance. |
rechunk
|
Reallocate to contiguous memory when all chunks / files are parsed. |
skip_rows_after_header
|
Skip this number of rows when the header is parsed. |
row_index_name
|
If not NULL , this will insert a row index column with the
given name into the DataFrame.
|
row_index_offset
|
Offset to start the row index column (only used if the name is set). |
try_parse_dates
|
Try to automatically parse dates. Most ISO8601-like formats can be
inferred, as well as a handful of others. If this does not succeed, the
column remains of data type pl$String .
|
eol_char
|
Single byte end of line character (default:
"\n"). When encountering a file with
Windows line endings ("\r\n"), one can
go with the default "\n". The extra
"\r" will be removed when processed.
|
raise_if_empty
|
If FALSE , parsing an empty file returns an empty DataFrame
or LazyFrame.
|
truncate_ragged_lines
|
Truncate lines that are longer than the schema. |
decimal_comma
|
Parse floats using a comma as the decimal separator instead of a period. |
glob
|
Expand path given via globbing rules. |
storage_options
|
Named vector containing options that indicate how to connect to a cloud
provider. The cloud providers currently supported are AWS, GCP, and
Azure. If storage_options is not provided, Polars will try to
infer the information from environment variables.
|
retries
|
Number of retries if accessing a cloud instance fails. |
file_cache_ttl
|
Amount of time to keep downloaded cloud files since their last access
time, in seconds. Uses the POLARS_FILE_CACHE_TTL
environment variable (which defaults to 1 hour) if not given.
|
include_file_paths
|
Include the path of the source file(s) as a column with this name. |
Value
A polars DataFrame
Examples
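The printed DataFrame below is consistent with writing iris to a temporary CSV and reading it back; a minimal sketch (write.csv() stores row names as an unnamed first column, which shows up as the blank i64 column in the output):

```r
# Write iris to a temporary CSV, then read it back with polars
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file)

pl$read_csv(csv_file)
```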
#> shape: (150, 6)
#> ┌─────┬──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
#> ╞═════╪══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 1 ┆ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 2 ┆ 4.9 ┆ 3.0 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ 3 ┆ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
#> │ 4 ┆ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
#> │ 5 ┆ 5.0 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 146 ┆ 6.7 ┆ 3.0 ┆ 5.2 ┆ 2.3 ┆ virginica │
#> │ 147 ┆ 6.3 ┆ 2.5 ┆ 5.0 ┆ 1.9 ┆ virginica │
#> │ 148 ┆ 6.5 ┆ 3.0 ┆ 5.2 ┆ 2.0 ┆ virginica │
#> │ 149 ┆ 6.2 ┆ 3.4 ┆ 5.4 ┆ 2.3 ┆ virginica │
#> │ 150 ┆ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
#> └─────┴──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
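Several of the arguments above can be sketched in combination. The file contents here are hypothetical, and the dtype constructor (pl$Int32) follows the pl$String convention used elsewhere on this page:

```r
# A small CSV with a comment line and a custom NA marker
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "# generated file",  # stripped because of comment_prefix below
  "id,value",
  "1,3.14",
  "2,n/a"              # read as null because of null_values below
), csv_file)

pl$read_csv(
  csv_file,
  comment_prefix = "#",
  null_values = "n/a",
  schema_overrides = list(id = pl$Int32)  # partially override the inferred dtypes
)
```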