scan_ipc
Lazily read from an Arrow IPC (Feather v2) file or multiple files via glob patterns
Description
This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
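As a minimal sketch of what pushdown means in practice, the query below filters and selects on a lazy scan before collecting. The temporary file path and the use of `$write_ipc()` to create it are assumptions made for illustration, as is the choice of `mtcars`; the exact text printed by `$explain()` depends on the installed Polars version.

```r
library(polars)

# Assumption for illustration: write mtcars to a temporary IPC file
ipc_path <- tempfile(fileext = ".arrow")
as_polars_df(mtcars)$write_ipc(ipc_path)

# Build a lazy query; no data is read at this point
lazy_query <- pl$scan_ipc(ipc_path)$
  filter(pl$col("cyl") == 4)$
  select(pl$col("mpg", "hp"))

# Inspect the optimized plan to see which filter and columns reach the scan
cat(lazy_query$explain(), "\n")

# The file is only read when the query is collected
result <- lazy_query$collect()
```

Because the filter and the column selection are part of the plan, the optimizer can apply them at the scan rather than after loading the whole file.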
Usage
pl$scan_ipc(
source,
...,
n_rows = NULL,
cache = TRUE,
rechunk = FALSE,
row_index_name = NULL,
row_index_offset = 0L,
storage_options = NULL,
retries = 2,
file_cache_ttl = NULL,
hive_partitioning = NULL,
hive_schema = NULL,
try_parse_hive_dates = TRUE,
include_file_paths = NULL
)
Arguments
source
|
Path(s) to a file or directory. To authenticate when scanning cloud
locations, see the storage_options parameter.
|
...
|
These dots are for future extensions and must be empty. |
n_rows
|
Stop reading from the IPC file after reading n_rows rows.
|
cache
|
Cache the result after reading. |
rechunk
|
When reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
row_index_name
|
If not NULL, this will insert a row index column with the
given name into the DataFrame.
|
row_index_offset
|
Offset to start the row index column (only used if the name is set). |
storage_options
|
Named vector containing options that indicate how to connect to a cloud
provider. The cloud providers currently supported are AWS, GCP, and
Azure. See supported keys here:
If storage_options is not provided, Polars will try to
infer the information from environment variables.
|
retries
|
Number of retries if accessing a cloud instance fails. |
file_cache_ttl
|
Amount of time to keep downloaded cloud files since their last access
time, in seconds. Uses the POLARS_FILE_CACHE_TTL
environment variable (which defaults to 1 hour) if not given.
|
hive_partitioning
|
Infer statistics and schema from Hive partitioned sources and use them
to prune reads. If NULL (default), it is automatically
enabled when a single directory is passed, and otherwise disabled.
|
hive_schema
|
A list containing the column names and data types of the columns by
which the data is partitioned, e.g. list(a = pl$String, b =
pl$Float32). If NULL (default), the schema of the
Hive partitions is inferred.
|
try_parse_hive_dates
|
Whether to try parsing hive values as date / datetime types. |
include_file_paths
|
Character value indicating the column name that will include the path of the source file(s). |
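The following sketch combines several of the arguments above on a single file: `n_rows` to limit the read, `row_index_name` and `row_index_offset` to add a row index, and `include_file_paths` to record the source path. The temporary file and the column names `row_id` and `source_file` are hypothetical names chosen for this illustration.

```r
library(polars)

# Assumption for illustration: write mtcars to a temporary IPC file
ipc_path <- tempfile(fileext = ".arrow")
as_polars_df(mtcars)$write_ipc(ipc_path)

preview <- pl$scan_ipc(
  ipc_path,
  n_rows = 5,                          # stop after the first 5 rows
  row_index_name = "row_id",           # hypothetical index column name
  row_index_offset = 1L,               # start the index at 1 instead of 0
  include_file_paths = "source_file"   # hypothetical path column name
)$collect()

preview
```

Limiting `n_rows` this way is a convenient way to preview the schema of a large file without reading all of it.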
Value
A polars LazyFrame
Examples
library("polars")
temp_dir <- tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
mtcars,
temp_dir,
partitioning = c("cyl", "gear"),
format = "arrow",
hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.arrow" "cyl=4/gear=4/part-0.arrow"
#> [3] "cyl=4/gear=5/part-0.arrow" "cyl=6/gear=3/part-0.arrow"
#> [5] "cyl=6/gear=4/part-0.arrow" "cyl=6/gear=5/part-0.arrow"
#> [7] "cyl=8/gear=3/part-0.arrow" "cyl=8/gear=5/part-0.arrow"
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_ipc(temp_dir)$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg ┆ disp ┆ hp ┆ drat ┆ … ┆ am ┆ carb ┆ cyl ┆ gear │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0 ┆ 3.7 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4 ┆ 3 │
#> │ 22.8 ┆ 108.0 ┆ 93.0 ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ 24.4 ┆ 146.7 ┆ 62.0 ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 22.8 ┆ 140.8 ┆ 95.0 ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 32.4 ┆ 78.7 ┆ 66.0 ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0 ┆ 8 ┆ 3 │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0 ┆ 8 ┆ 5 │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0 ┆ 8 ┆ 5 │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
# We can also impose a schema on the partition columns
pl$scan_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg ┆ disp ┆ hp ┆ drat ┆ … ┆ am ┆ carb ┆ cyl ┆ gear │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ str ┆ i32 │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0 ┆ 3.7 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4 ┆ 3 │
#> │ 22.8 ┆ 108.0 ┆ 93.0 ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ 24.4 ┆ 146.7 ┆ 62.0 ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 22.8 ┆ 140.8 ┆ 95.0 ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 32.4 ┆ 78.7 ┆ 66.0 ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0 ┆ 8 ┆ 3 │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0 ┆ 8 ┆ 5 │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0 ┆ 8 ┆ 5 │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
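As a further sketch, the partitioned dataset can be filtered on a partition column, or a subset of its files can be scanned through a glob pattern. The example rebuilds the same dataset so that it runs on its own, and it assumes the `arrow` package is installed. Per the `hive_partitioning` description above, automatic detection applies only when a single directory is passed, so the glob scan is not expected to add the `cyl` and `gear` columns by default.

```r
library(polars)

# Same hive-style dataset as above, rebuilt so the sketch is self-contained
temp_dir <- tempfile()
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)

# Filter on a partition column; the hive information can be used to prune reads
four_cyl <- pl$scan_ipc(temp_dir)$
  filter(pl$col("cyl") == 4)$
  collect()

# Scan only the files under cyl=4 with a glob pattern and rechunk the result
four_cyl_glob <- pl$scan_ipc(
  file.path(temp_dir, "cyl=4", "*", "*.arrow"),
  rechunk = TRUE
)$collect()
```

Both queries cover the same rows; they differ in whether the partition columns are part of the output.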