
Read into a DataFrame from an Arrow IPC (Feather v2) file

Description

Read into a DataFrame from an Arrow IPC (Feather v2) file

Usage

pl$read_ipc(
  source,
  ...,
  n_rows = NULL,
  cache = TRUE,
  rechunk = FALSE,
  row_index_name = NULL,
  row_index_offset = 0L,
  storage_options = NULL,
  retries = 2,
  file_cache_ttl = NULL,
  hive_partitioning = NULL,
  hive_schema = NULL,
  try_parse_hive_dates = TRUE,
  include_file_paths = NULL
)

Arguments

source Path(s) to a file or directory. If authentication is needed to scan cloud locations, see the storage_options parameter.
... These dots are for future extensions and must be empty.
n_rows Stop reading from the IPC file after reading n_rows.
cache Cache the result after reading.
rechunk In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
row_index_name If not NULL, this will insert a row index column with the given name into the DataFrame.
row_index_offset Offset to start the row index column (only used if the name is set).
storage_options Named vector containing options that indicate how to connect to a cloud provider. The currently supported cloud providers are AWS, GCP, and Azure. See the supported keys for each provider:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token key, e.g. c(token = YOUR_TOKEN), or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables. A sketch of passing these options is shown after this argument list.
retries Number of retries if accessing a cloud instance fails.
file_cache_ttl Amount of time to keep downloaded cloud files since their last access time, in seconds. Uses the POLARS_FILE_CACHE_TTL environment variable (which defaults to 1 hour) if not given.
hive_partitioning Infer statistics and schema from Hive partitioned sources and use them to prune reads. If NULL (default), it is automatically enabled when a single directory is passed, and otherwise disabled.
hive_schema A list containing the column names and data types of the columns by which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32). If NULL (default), the schema of the Hive partitions is inferred.
try_parse_hive_dates Whether to try parsing hive values as date / datetime types.
include_file_paths Character value indicating the name of a column that will contain the path of the source file(s).
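
As an illustrative sketch only (not runnable without valid credentials): the call below shows one way storage_options might be supplied for an S3 path. The bucket name, object path, and credential values are placeholders, and the key names shown (aws_access_key_id, aws_secret_access_key, aws_region) are assumed to match the AWS keys referenced in the provider key lists above; check those lists for the exact set accepted.

# Sketch only: requires valid AWS credentials; bucket, path, and values are placeholders.
pl$read_ipc(
  "s3://my-bucket/dataset/*.arrow",
  storage_options = c(
    aws_access_key_id = "<access-key-id>",
    aws_secret_access_key = "<secret-access-key>",
    aws_region = "us-east-1"
  )
)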

Value

A polars DataFrame

Examples

library("polars")


temp_dir <- tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.arrow" "cyl=4/gear=4/part-0.arrow"
#> [3] "cyl=4/gear=5/part-0.arrow" "cyl=6/gear=3/part-0.arrow"
#> [5] "cyl=6/gear=4/part-0.arrow" "cyl=6/gear=5/part-0.arrow"
#> [7] "cyl=8/gear=3/part-0.arrow" "cyl=8/gear=5/part-0.arrow"
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$read_ipc(temp_dir)
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg  ┆ disp  ┆ hp    ┆ drat ┆ … ┆ am  ┆ carb ┆ cyl ┆ gear │
#> │ ---  ┆ ---   ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---  ┆ --- ┆ ---  │
#> │ f64  ┆ f64   ┆ f64   ┆ f64  ┆   ┆ f64 ┆ f64  ┆ i64 ┆ i64  │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0  ┆ 3.7  ┆ … ┆ 0.0 ┆ 1.0  ┆ 4   ┆ 3    │
#> │ 22.8 ┆ 108.0 ┆ 93.0  ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ 24.4 ┆ 146.7 ┆ 62.0  ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 22.8 ┆ 140.8 ┆ 95.0  ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 32.4 ┆ 78.7  ┆ 66.0  ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ …    ┆ …     ┆ …     ┆ …    ┆ … ┆ …   ┆ …    ┆ …   ┆ …    │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0  ┆ 8   ┆ 3    │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0  ┆ 8   ┆ 5    │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0  ┆ 8   ┆ 5    │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
# We can also impose a schema on the partition columns
pl$read_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg  ┆ disp  ┆ hp    ┆ drat ┆ … ┆ am  ┆ carb ┆ cyl ┆ gear │
#> │ ---  ┆ ---   ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---  ┆ --- ┆ ---  │
#> │ f64  ┆ f64   ┆ f64   ┆ f64  ┆   ┆ f64 ┆ f64  ┆ str ┆ i32  │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0  ┆ 3.7  ┆ … ┆ 0.0 ┆ 1.0  ┆ 4   ┆ 3    │
#> │ 22.8 ┆ 108.0 ┆ 93.0  ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ 24.4 ┆ 146.7 ┆ 62.0  ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 22.8 ┆ 140.8 ┆ 95.0  ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 32.4 ┆ 78.7  ┆ 66.0  ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ …    ┆ …     ┆ …     ┆ …    ┆ … ┆ …   ┆ …    ┆ …   ┆ …    │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0  ┆ 8   ┆ 3    │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0  ┆ 8   ┆ 5    │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0  ┆ 8   ┆ 5    │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
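# As a further sketch (output omitted): combine a row limit, a row index
# column, and a column recording the source file path. The column names
# "row_nr" and "file_path" are arbitrary choices.
pl$read_ipc(
  temp_dir,
  n_rows = 5,
  row_index_name = "row_nr",
  include_file_paths = "file_path"
)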