Read into a DataFrame from NDJSON file
Description
Read a newline-delimited JSON (NDJSON) file into a DataFrame.
Usage
pl$read_ndjson(
source,
...,
schema = NULL,
schema_overrides = NULL,
infer_schema_length = 100,
batch_size = 1024,
n_rows = NULL,
low_memory = FALSE,
rechunk = FALSE,
row_index_name = NULL,
row_index_offset = 0L,
ignore_errors = FALSE,
storage_options = NULL,
retries = 2,
file_cache_ttl = NULL,
include_file_paths = NULL
)
Arguments
source
|
Path(s) to a file or directory. When needing to authenticate for
scanning cloud locations, see the storage_options
parameter.
|
…
|
These dots are for future extensions and must be empty. |
schema
|
Specify the datatypes of the columns. The datatypes must match the
datatypes in the file(s). If there are extra columns that are not in the
file(s), consider also enabling allow_missing_columns.
|
schema_overrides
|
Overwrite dtypes during inference. This must be a list. Names of list elements are used to match to inferred columns. |
infer_schema_length
|
The maximum number of rows to scan for schema inference. If
NULL, the full data may be scanned (this is slow). Set
infer_schema = FALSE to read all columns as
pl$String.
|
batch_size
|
Number of rows that will be processed per thread. |
n_rows
|
Stop reading from the NDJSON file after reading n_rows rows.
|
low_memory
|
Reduce memory pressure at the expense of performance. |
rechunk
|
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name
|
If not NULL, this will insert a row index column with the
given name into the DataFrame.
|
row_index_offset
|
Offset to start the row index column (only used if the name is set). |
ignore_errors
|
Keep reading the file even if some lines yield errors. You can also use
infer_schema = FALSE to read all columns as pl$String to check
which values might cause an issue.
|
storage_options
|
Named vector containing options that indicate how to connect to a cloud
provider. The cloud providers currently supported are AWS, GCP, and
Azure. The supported keys differ by provider. If
storage_options is not provided, Polars will try to
infer the information from environment variables.
|
retries
|
Number of retries if accessing a cloud instance fails. |
file_cache_ttl
|
Amount of time to keep downloaded cloud files since their last access
time, in seconds. Uses the POLARS_FILE_CACHE_TTL
environment variable (which defaults to 1 hour) if not given.
|
include_file_paths
|
Character value indicating the column name that will include the path of the source file(s). |
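As a rough illustration of how a few of these arguments combine, the sketch below writes three records by hand and reads them back with schema_overrides, n_rows, and the row index options. The file contents and the column names (id, score, label, row_nr) are made up for this example, and the sketch assumes the row index column is placed first, as in other Polars readers.

```r
library(polars)

# Hypothetical data: three NDJSON records, one JSON object per line.
path <- tempfile(fileext = ".ndjson")
writeLines(
  c(
    '{"id": 1, "score": 1.5, "label": "a"}',
    '{"id": 2, "score": 2.5, "label": "b"}',
    '{"id": 3, "score": 3.5, "label": "c"}'
  ),
  path
)

df <- pl$read_ndjson(
  path,
  # Named list: the name matches the inferred column to override.
  schema_overrides = list(score = pl$Float32),
  # Stop after the first two records.
  n_rows = 2,
  # "row_nr" is an arbitrary name chosen for this sketch.
  row_index_name = "row_nr",
  row_index_offset = 1L
)
df
```

Leaving row_index_name as NULL skips the index column entirely, in which case row_index_offset has no effect.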
Value
A polars DataFrame
Examples
library("polars")
ndjson_filename <- tempfile()
jsonlite::stream_out(iris, file(ndjson_filename), verbose = FALSE)
pl$read_ndjson(ndjson_filename)
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
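Since source accepts more than one path, a further sketch may help show how include_file_paths and rechunk fit in when several files are read at once. The directory layout, file names, and the source_file column name are hypothetical, and passing a character vector of paths is an assumption based on the "Path(s)" wording of the source argument.

```r
library(polars)

# Hypothetical layout: two small NDJSON files in a temporary directory.
dir <- tempfile()
dir.create(dir)
part1 <- file.path(dir, "part1.ndjson")
part2 <- file.path(dir, "part2.ndjson")
writeLines(c('{"x": 1}', '{"x": 2}'), part1)
writeLines('{"x": 3}', part2)

df_parts <- pl$read_ndjson(
  c(part1, part2),
  # Combine the per-file chunks into contiguous memory.
  rechunk = TRUE,
  # "source_file" is an arbitrary column name chosen for this sketch.
  include_file_paths = "source_file"
)
df_parts
```

The extra column makes it possible to trace each row back to the file it came from, which can be useful when one of several files contains unexpected values.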