stac_geoparquet.arrow

Arrow-based format conversions.

DEFAULT_JSON_CHUNK_SIZE module-attribute

DEFAULT_JSON_CHUNK_SIZE = 65536

The default chunk size to use for reading JSON into memory.

DEFAULT_PARQUET_SCHEMA_VERSION module-attribute

DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = '1.1.0'

The default GeoParquet schema version written to file.

SUPPORTED_PARQUET_SCHEMA_VERSIONS module-attribute

SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal['1.0.0', '1.1.0']

A Literal type with the supported GeoParquet schema versions.

parse_stac_items_to_arrow

parse_stac_items_to_arrow(
    items: Iterable[Item | dict[str, Any]],
    *,
    chunk_size: int = 8192,
    schema: Schema | InferredSchema | None = None
) -> RecordBatchReader

Parse a collection of STAC Items to an iterable of pyarrow.RecordBatch.

The objects under properties are moved up to the top-level of the Table, similar to geopandas.GeoDataFrame.from_features.

Parameters:

  • items (Iterable[Item | dict[str, Any]]) –

    the STAC Items to convert

  • chunk_size (int, default: 8192 ) –

    The chunk size to use for Arrow record batches. This only takes effect if schema is not None. When schema is None, the input will be parsed into a single contiguous record batch. Defaults to 8192.

  • schema (Schema | InferredSchema | None, default: None ) –

    The schema of the input data. If provided, can improve memory use; otherwise all items need to be parsed into a single array for schema inference. Defaults to None.

Returns:

  • RecordBatchReader

    pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
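
A minimal usage sketch; the items.json file of Item dicts is assumed for illustration and is not part of the library:

import json

import stac_geoparquet.arrow

# Assumed input: a local JSON file containing an array of STAC Item dicts.
with open("items.json") as f:
    items = json.load(f)

# Parse the Items into an Arrow stream; each Item's properties are hoisted
# to the top level of the resulting table.
reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)

# Materialize the stream into a single pyarrow Table.
table = reader.read_all()
print(table.schema)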

parse_stac_ndjson_to_arrow

parse_stac_ndjson_to_arrow(
    path: str | Path | Iterable[str | Path],
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None
) -> RecordBatchReader

Convert one or more newline-delimited JSON STAC files to a generator of Arrow RecordBatches.

Each RecordBatch in the returned iterator is guaranteed to have an identical schema, and can be used to write to one or more Parquet files.

Parameters:

  • path (str | Path | Iterable[str | Path]) –

    One or more paths to files with STAC items.

  • chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

    The chunk size to use for reading JSON into memory. Defaults to 65536.

  • schema (Schema | None, default: None ) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data.

Other Parameters:

  • limit (int | None) –

    The maximum number of JSON Items to use for schema inference.

Returns:

  • RecordBatchReader

    pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
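
A sketch of streaming one or more newline-delimited JSON files into Arrow; the file paths are assumptions for illustration:

import stac_geoparquet.arrow

# Assumed paths to newline-delimited JSON files of STAC Items.
paths = ["items-part-0.ndjson", "items-part-1.ndjson"]

reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(paths)

# Every batch in the stream shares the same schema, so batches can be
# processed or written out incrementally.
for batch in reader:
    print(batch.num_rows)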

parse_stac_ndjson_to_delta_lake

parse_stac_ndjson_to_delta_lake(
    input_path: str | Path | Iterable[str | Path],
    table_or_uri: str | Path | DeltaTable,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to Delta Lake.

Parameters:

  • input_path (str | Path | Iterable[str | Path]) –

    One or more paths to files with STAC items.

  • table_or_uri (str | Path | DeltaTable) –

    The Delta Lake table or URI to write to.

  • chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

    The chunk size to use for reading JSON into memory. Defaults to 65536.

  • schema (Schema | None, default: None ) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.

  • limit (int | None, default: None ) –

    The maximum number of JSON records to convert.

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS, default: DEFAULT_PARQUET_SCHEMA_VERSION ) –

    GeoParquet specification version; if not provided, defaults to the latest supported version.
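
A sketch of writing newline-delimited JSON Items to a Delta Lake table; the input file and table location are assumptions:

import stac_geoparquet.arrow

# Assumed input file and output table location.
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake(
    "items.ndjson",
    "./stac-items-delta",  # directory used as the Delta Lake table URI
    schema_version="1.1.0",
)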

parse_stac_ndjson_to_parquet

parse_stac_ndjson_to_parquet(
    input_path: str | Path | Iterable[str | Path],
    output_path: str | Path,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | InferredSchema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to GeoParquet.

Parameters:

  • input_path (str | Path | Iterable[str | Path]) –

    One or more paths to files with STAC items.

  • output_path (str | Path) –

    A path to the output Parquet file.

Other Parameters:

  • chunk_size (int) –

    The chunk size to use for reading JSON into memory. Defaults to 65536.

  • schema (Schema | InferredSchema | None) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.

  • limit (int | None) –

    The maximum number of JSON records to convert.

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

    GeoParquet specification version; if not provided, defaults to the latest supported version.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
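
A sketch of converting newline-delimited JSON directly to a GeoParquet file; the paths are assumptions, and the compression keyword is shown only as an example of an argument forwarded to pyarrow.parquet.ParquetWriter:

import stac_geoparquet.arrow

# Assumed input and output paths.
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(
    "items.ndjson",
    "items.parquet",
    schema_version="1.1.0",
    # Extra keyword arguments are forwarded to pyarrow.parquet.ParquetWriter.
    compression="zstd",
)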

stac_table_to_items

stac_table_to_items(
    table: Table | RecordBatchReader | ArrowStreamExportable,
) -> Iterable[dict]

Convert STAC Arrow to a generator of STAC Item dicts.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

Yields:

  • dict –

    A STAC Item dict for each row of the input table.
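
A sketch of round-tripping a GeoParquet file back to Item dicts; the file path is an assumption:

import pyarrow.parquet as pq

import stac_geoparquet.arrow

# Assumed path to an existing STAC GeoParquet file.
table = pq.read_table("items.parquet")

# Lazily yields one STAC Item dict per row of the table.
for item in stac_geoparquet.arrow.stac_table_to_items(table):
    print(item["id"])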

stac_table_to_ndjson

stac_table_to_ndjson(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    dest: str | Path | PathLike[bytes],
) -> None

Write STAC Arrow to a newline-delimited JSON file.

Note

This function appends to the JSON file at dest; it does not overwrite any existing data.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

  • dest (str | Path | PathLike[bytes]) –

    The destination where newline-delimited JSON should be written.
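
A sketch of writing Arrow-backed STAC data out as newline-delimited JSON; the paths are assumptions, and note that the output file is appended to, not overwritten:

import pyarrow.parquet as pq

import stac_geoparquet.arrow

# Assumed input GeoParquet file and output path.
table = pq.read_table("items.parquet")

# Appends one JSON line per Item to the destination file.
stac_geoparquet.arrow.stac_table_to_ndjson(table, "items.ndjson")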

to_parquet

to_parquet(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    output_path: str | Path,
    *,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Write an Arrow table with STAC data to GeoParquet.

This writes metadata compliant with either GeoParquet 1.0 or 1.1.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

  • output_path (str | Path) –

    The destination for saving.

Other Parameters:

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

    GeoParquet specification version; if not provided, defaults to the latest supported version.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
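
A sketch combining parse_stac_items_to_arrow with to_parquet; the items.json input is an assumption:

import json

import stac_geoparquet.arrow

# Assumed local file containing a JSON array of STAC Item dicts.
with open("items.json") as f:
    items = json.load(f)

reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(items)

# Write the Arrow stream to GeoParquet with 1.1-compliant metadata.
stac_geoparquet.arrow.to_parquet(reader, "items.parquet", schema_version="1.1.0")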