`stac_geoparquet.arrow`

Arrow-based format conversions.

DEFAULT_JSON_CHUNK_SIZE `module-attribute`

DEFAULT_JSON_CHUNK_SIZE = 65536

The default chunk size to use for reading JSON into memory.

DEFAULT_PARQUET_SCHEMA_VERSION `module-attribute`

DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = '1.1.0'

The default GeoParquet schema version written to file.

SUPPORTED_PARQUET_SCHEMA_VERSIONS `module-attribute`

SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal['1.0.0', '1.1.0']

A Literal type with the supported GeoParquet schema versions.
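
Because the supported versions are expressed as a `Literal` type, the allowed values can also be read back at runtime. The sketch below redefines the two documented attributes locally so it runs on its own; `check_schema_version` is a hypothetical helper for illustration, not part of the library.

```python
from typing import Literal, get_args

# Local copies of the documented module attributes, so the sketch is standalone.
SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal["1.0.0", "1.1.0"]
DEFAULT_PARQUET_SCHEMA_VERSION = "1.1.0"


def check_schema_version(version: str = DEFAULT_PARQUET_SCHEMA_VERSION) -> str:
    """Hypothetical helper: reject versions outside the Literal's values."""
    supported = get_args(SUPPORTED_PARQUET_SCHEMA_VERSIONS)
    if version not in supported:
        raise ValueError(f"Unsupported version {version!r}; expected one of {supported}")
    return version
```

A type checker enforces the `Literal` statically; a check like this only matters when the version string arrives from user input or configuration.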

parse_stac_items_to_arrow

parse_stac_items_to_arrow(
    items: Iterable[Item | dict[str, Any]],
    *,
    chunk_size: int = 8192,
    schema: Schema | InferredSchema | None = None
) -> RecordBatchReader

Parse a collection of STAC Items to an iterable of pyarrow.RecordBatch.

The objects under properties are moved up to the top level of the Table, similar to geopandas.GeoDataFrame.from_features.
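
As a rough illustration of that flattening, here is a pure-Python sketch that works on plain dicts rather than Arrow data. `hoist_properties` is a hypothetical name, and letting a property overwrite a top-level key of the same name is an assumption of this sketch, not documented library behavior.

```python
from typing import Any


def hoist_properties(item: dict[str, Any]) -> dict[str, Any]:
    """Return a flat row: top-level keys plus everything under 'properties'."""
    row = {key: value for key, value in item.items() if key != "properties"}
    # Assumption: on a name clash, the value from 'properties' wins.
    row.update(item.get("properties", {}))
    return row


item = {
    "type": "Feature",
    "id": "item-1",
    "properties": {"datetime": "2024-01-01T00:00:00Z", "eo:cloud_cover": 12.5},
}
row = hoist_properties(item)
```

After the call, `row` has `datetime` and `eo:cloud_cover` as siblings of `id`, and no `properties` key.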

Parameters:

items (Iterable[Item | dict[str, Any]]) –

The STAC Items to convert.
chunk_size (int, default: 8192 ) –

The chunk size to use for Arrow record batches. This only takes effect if schema is not None. When schema is None, the input will be parsed into a single contiguous record batch. Defaults to 8192.
schema (Schema | InferredSchema | None, default: None ) –

The schema of the input data. Providing it can reduce memory use; otherwise all items need to be parsed into a single array for schema inference. Defaults to None.

Returns:

RecordBatchReader –

pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
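
The interaction between chunk_size and schema described above can be modeled with a toy generator. This is only a sketch of the documented rule using plain lists; `plan_batches` and its `schema_known` flag are hypothetical stand-ins, and no Arrow objects are involved.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def plan_batches(
    items: Iterable[Any], chunk_size: int, schema_known: bool
) -> Iterator[list[Any]]:
    """Toy model of the batching rule; yields plain lists, not record batches."""
    if not schema_known:
        # Without a schema, everything is gathered into one contiguous batch.
        yield list(items)
        return
    iterator = iter(items)
    # With a schema, items are consumed chunk_size at a time.
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


with_schema = [len(batch) for batch in plan_batches(range(10), 4, schema_known=True)]
without_schema = [len(batch) for batch in plan_batches(range(10), 4, schema_known=False)]
```

Here `with_schema` is `[4, 4, 2]` and `without_schema` is `[10]`, which is why passing a schema matters for memory use on large inputs.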

parse_stac_ndjson_to_arrow

parse_stac_ndjson_to_arrow(
    path: str | Path | Iterable[str | Path],
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None
) -> RecordBatchReader

Convert one or more newline-delimited JSON STAC files to a generator of Arrow RecordBatches.

Each RecordBatch in the returned iterator is guaranteed to have an identical schema, and can be used to write to one or more Parquet files.

Parameters:

path (str | Path | Iterable[str | Path]) –

One or more paths to files with STAC items.
chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

The chunk size to use for reading JSON into memory. Defaults to 65536.
schema (Schema | None, default: None ) –

The schema to represent the input STAC data. Defaults to None, in which case two full passes are made over the input data: one to infer a common schema across all data and another to read the data.

Other Parameters:

limit (int | None) –

The maximum number of JSON Items to use for schema inference.

Returns:

RecordBatchReader –

pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
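
To make the two-pass behavior concrete, here is a simplified, standard-library sketch of the idea: a first pass collects the union of keys, and a second pass re-reads the data against that fixed column list so every row has the same shape. `infer_columns` and `read_rows` are hypothetical names, and real schema inference also has to reconcile types and nested fields, which this sketch ignores.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def infer_columns(paths: list[Path]) -> list[str]:
    """Pass 1: collect the union of top-level keys across every record."""
    columns: dict[str, None] = {}  # a dict keeps first-seen order
    for path in paths:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    columns.update(dict.fromkeys(json.loads(line)))
    return list(columns)


def read_rows(paths: list[Path], columns: list[str]) -> Iterator[dict[str, Any]]:
    """Pass 2: re-read the data, filling keys a record lacks with None."""
    for path in paths:
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    yield {name: record.get(name) for name in columns}


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.ndjson"
    b = Path(tmp) / "b.ndjson"
    a.write_text('{"id": "a1", "datetime": "2024-01-01T00:00:00Z"}\n', encoding="utf-8")
    b.write_text('{"id": "b1", "platform": "sentinel-2"}\n', encoding="utf-8")
    columns = infer_columns([a, b])
    rows = list(read_rows([a, b], columns))
```

Supplying a schema up front removes the need for the first pass, at the cost of having to know the schema in advance.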

parse_stac_ndjson_to_delta_lake

parse_stac_ndjson_to_delta_lake(
    input_path: str | Path | Iterable[str | Path],
    table_or_uri: str | Path | DeltaTable,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to Delta Lake.

Parameters:

input_path (str | Path | Iterable[str | Path]) –

One or more paths to files with STAC items.
table_or_uri (str | Path | DeltaTable) –

A path to the output Delta Lake table.
chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

The chunk size to use for reading JSON into memory. Defaults to 65536.
schema (Schema | None, default: None ) –

The schema to represent the input STAC data. Defaults to None, in which case two full passes are made over the input data: one to infer a common schema across all data and another to read the data and iteratively convert it to GeoParquet.
limit (int | None, default: None ) –

The maximum number of JSON records to convert.
schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS, default: DEFAULT_PARQUET_SCHEMA_VERSION ) –

GeoParquet specification version; if not provided, defaults to the latest supported version.

parse_stac_ndjson_to_parquet

parse_stac_ndjson_to_parquet(
    input_path: str | Path | Iterable[str | Path],
    output_path: str | Path,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | InferredSchema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to GeoParquet.

Parameters:

input_path (str | Path | Iterable[str | Path]) –

One or more paths to files with STAC items.
output_path (str | Path) –

A path to the output Parquet file.

Other Parameters:

chunk_size (int) –

The chunk size to use for reading JSON into memory. Defaults to 65536.
schema (Schema | InferredSchema | None) –

The schema to represent the input STAC data. Defaults to None, in which case two full passes are made over the input data: one to infer a common schema across all data and another to read the data and iteratively convert it to GeoParquet.
limit (int | None) –

The maximum number of JSON records to convert.
schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

GeoParquet specification version; if not provided, defaults to the latest supported version.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
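
A call using the signature above might look like the following sketch. The wrapper `convert_ndjson` and the chunk size are illustrative choices, and `compression="zstd"` is just one example of a keyword that pyarrow.parquet.ParquetWriter accepts.

```python
from pathlib import Path


def convert_ndjson(input_paths: list[Path], output_path: Path) -> None:
    """Hypothetical wrapper: convert NDJSON STAC files to one GeoParquet file."""
    # Imported inside the function so the sketch loads without the package.
    import stac_geoparquet.arrow

    stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(
        input_paths,
        output_path,
        chunk_size=16384,  # smaller than the 65536 default, to lower peak memory
        schema_version="1.1.0",
        compression="zstd",  # forwarded to pyarrow.parquet.ParquetWriter
    )
```

Since no schema is passed here, the input files are read twice, as described for the schema parameter.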

stac_table_to_items

stac_table_to_items(
    table: Table | RecordBatchReader | ArrowStreamExportable,
) -> Iterable[dict]

Convert STAC Arrow to a generator of STAC Item dicts.

Parameters:

table (Table | RecordBatchReader | ArrowStreamExportable) –

STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

Yields:

Iterable[dict] –

A STAC Item dict for each input row.

stac_table_to_ndjson

stac_table_to_ndjson(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    dest: str | Path | PathLike[bytes],
) -> None

Write STAC Arrow to a newline-delimited JSON file.

Note

This function appends to the JSON file at dest; it does not overwrite any existing data.

Parameters:

table (Table | RecordBatchReader | ArrowStreamExportable) –

STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
dest (str | Path | PathLike[bytes]) –

The destination where newline-delimited JSON should be written.
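
The append behavior noted above has a practical consequence: calling the function twice with the same dest doubles up the output. The standard-library sketch below imitates that behavior with a hypothetical `append_ndjson` helper (it is not the library's implementation) and shows that deleting the file first gives overwrite semantics.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable


def append_ndjson(items: Iterable[dict[str, Any]], dest: Path) -> None:
    """Append one JSON document per line to dest."""
    with open(dest, "a", encoding="utf-8") as f:  # "a" appends, never truncates
        for item in items:
            f.write(json.dumps(item) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "items.ndjson"
    append_ndjson([{"id": "a"}], dest)
    append_ndjson([{"id": "b"}], dest)  # the second call adds to the first
    appended = dest.read_text(encoding="utf-8").splitlines()

    dest.unlink()  # remove the old file first to start from scratch
    append_ndjson([{"id": "c"}], dest)
    replaced = dest.read_text(encoding="utf-8").splitlines()
```

In this sketch `appended` holds two lines and `replaced` holds one, so a caller who wants a fresh file should remove any existing dest before writing.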

to_parquet

to_parquet(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    output_path: str | Path,
    *,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Write an Arrow table with STAC data to GeoParquet.

This writes metadata compliant with either GeoParquet 1.0 or 1.1.

Parameters:

table (Table | RecordBatchReader | ArrowStreamExportable) –

STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
output_path (str | Path) –

The path where the GeoParquet file is written.

Other Parameters:

schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

GeoParquet specification version; if not provided, defaults to the latest supported version.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
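
Putting two of the documented functions together, a minimal sketch for writing in-memory STAC Item dicts to GeoParquet could look like this. `write_items` is a hypothetical wrapper, and the version and compression values are illustrative.

```python
from pathlib import Path
from typing import Any, Iterable


def write_items(items: Iterable[dict[str, Any]], output_path: Path) -> None:
    """Hypothetical wrapper: convert STAC Item dicts to a GeoParquet 1.0.0 file."""
    # Imported inside the function so the sketch loads without the package.
    import stac_geoparquet.arrow

    # to_parquet accepts the RecordBatchReader directly; no Table is needed.
    reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)
    stac_geoparquet.arrow.to_parquet(
        reader,
        output_path,
        schema_version="1.0.0",  # override the "1.1.0" default
        compression="zstd",  # forwarded to pyarrow.parquet.ParquetWriter
    )
```

Pinning schema_version to "1.0.0" is only worthwhile when a downstream reader does not yet understand 1.1.0 metadata; otherwise the default can be left alone.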
