stac_geoparquet.arrow

Arrow-based format conversions.

DEFAULT_JSON_CHUNK_SIZE module-attribute

DEFAULT_JSON_CHUNK_SIZE = 65536

The default chunk size to use for reading JSON into memory.

DEFAULT_PARQUET_SCHEMA_VERSION module-attribute

DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = '1.1.0'

The default GeoParquet schema version written to file.

SUPPORTED_PARQUET_SCHEMA_VERSIONS module-attribute

SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal['1.0.0', '1.1.0']

A Literal type with the supported GeoParquet schema versions.
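
Because the supported versions are expressed as a Literal, they can be enumerated at runtime with typing.get_args. The sketch below redefines the two attributes so it is self-contained; check_schema_version is a hypothetical helper written for illustration and is not part of stac_geoparquet.

```python
from typing import Literal, get_args

# Redefined here to mirror the documented attributes, so the snippet runs on its own.
SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal["1.0.0", "1.1.0"]
DEFAULT_PARQUET_SCHEMA_VERSION = "1.1.0"


def check_schema_version(version: str) -> str:
    """Hypothetical helper: reject versions that are not in the Literal."""
    allowed = get_args(SUPPORTED_PARQUET_SCHEMA_VERSIONS)
    if version not in allowed:
        raise ValueError(
            f"unsupported GeoParquet schema version {version!r}; expected one of {allowed}"
        )
    return version


# get_args returns the Literal's members as a tuple of strings.
supported = get_args(SUPPORTED_PARQUET_SCHEMA_VERSIONS)
print(supported)
print(check_schema_version(DEFAULT_PARQUET_SCHEMA_VERSION))
```

A static type checker already flags an unsupported version passed as a string literal; a runtime check like this is only useful when the version comes from user input or configuration.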

parse_stac_items_to_arrow

parse_stac_items_to_arrow(
    items: Iterable[Item | dict[str, Any]],
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: ACCEPTED_SCHEMA_OPTIONS = "FullFile",
    tmpdir: str | Path | None = None,
) -> RecordBatchReader

Parse a collection of STAC Items to an iterable of pyarrow.RecordBatch.

The objects under properties are moved up to the top-level of the Table, similar to geopandas.GeoDataFrame.from_features.
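
The flattening described above can be pictured with plain dictionaries. This is a minimal sketch of the idea only: flatten_item is a hypothetical helper, not the library's implementation, and it ignores name collisions between top-level keys and property keys.

```python
from typing import Any


def flatten_item(item: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: lift the keys under "properties" to the top level."""
    # Copy every top-level key except the nested "properties" object.
    row = {key: value for key, value in item.items() if key != "properties"}
    # Promote each property to a top-level key (a later key overwrites an earlier one).
    row.update(item.get("properties", {}))
    return row


item = {
    "type": "Feature",
    "id": "example-item",
    "properties": {"datetime": "2024-01-01T00:00:00Z", "eo:cloud_cover": 12.5},
}
row = flatten_item(item)
print(sorted(row))
```

After flattening, each property such as datetime becomes its own column, which is what allows it to be filtered or projected independently in Arrow or Parquet.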

Parameters:

  • items (Iterable[Item | dict[str, Any]]) –

    the STAC Items to convert

  • chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

    The chunk size to use for Arrow record batches. This only takes effect if schema is not None. When schema is None, the input will be parsed into a single contiguous record batch. Defaults to DEFAULT_JSON_CHUNK_SIZE (65536).

  • schema (ACCEPTED_SCHEMA_OPTIONS, default: 'FullFile' ) –

    The schema of the input data. If provided, can improve memory use; otherwise all items need to be parsed into a single array for schema inference. This can also be set to a string value of "FullFile" which will scan the entire input in memory to get the schema, "FirstBatch" which will use the first batch of items to infer the schema, or "ChunksToDisk" which will write each chunk of items to disk as a Parquet file and then read them back in to unify the schema. Defaults to "FullFile".

Returns:

  • RecordBatchReader

    pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
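
To illustrate how chunk_size and the schema inference strategy interact, the sketch below splits an iterable into chunks and compares the field names visible in the first chunk with those visible across the whole input. It uses plain dictionaries and hypothetical helpers (chunked, field_names); it does not call stac_geoparquet and makes no claim about how the library handles fields that appear only in later batches.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def chunked(items: Iterable[dict[str, Any]], chunk_size: int) -> Iterator[list[dict[str, Any]]]:
    """Hypothetical helper: yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def field_names(rows: Iterable[dict[str, Any]]) -> set[str]:
    """Hypothetical helper: collect every key that appears in the given rows."""
    names: set[str] = set()
    for row in rows:
        names.update(row)
    return names


rows = [
    {"id": "a", "datetime": "2024-01-01T00:00:00Z"},
    {"id": "b", "datetime": "2024-01-02T00:00:00Z"},
    {"id": "c", "datetime": "2024-01-03T00:00:00Z", "eo:cloud_cover": 3.0},
]

chunks = list(chunked(rows, chunk_size=2))
first_batch_fields = field_names(chunks[0])  # what a first-batch scan can see
full_file_fields = field_names(rows)         # what a full scan can see
print(full_file_fields - first_batch_fields)
```

The trade-off is memory against completeness: scanning only the first chunk is cheap, but a field that first appears in a later chunk is invisible to it.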

parse_stac_items_to_parquet

parse_stac_items_to_parquet(
    items: Iterable[Item | dict[str, Any]],
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: ACCEPTED_SCHEMA_OPTIONS = "FirstBatch",
    output_path: str | Path,
    tmpdir: str | Path | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    filesystem: FileSystem | None = None,
    **kwargs: Any,
) -> str

Parse an iterable of STAC Items into Parquet.

parse_stac_ndjson_to_arrow

parse_stac_ndjson_to_arrow(
    path: str | Path | Iterable[str | Path],
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None,
) -> RecordBatchReader

Convert one or more newline-delimited JSON STAC files to a generator of Arrow RecordBatches.

Each RecordBatch in the returned iterator is guaranteed to have an identical schema, and can be used to write to one or more Parquet files.

Parameters:

  • path (str | Path | Iterable[str | Path]) –

    One or more paths to files with STAC items.

  • chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

    The chunk size. Defaults to 65536.

  • schema (Schema | None, default: None ) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data.

Other Parameters:

  • limit (int | None) –

    The maximum number of JSON Items to use for schema inference.

Returns:

  • RecordBatchReader

    pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
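
The two-pass approach described for schema=None can be sketched with the standard library alone. The helpers infer_fields and read_batches below are hypothetical and work on plain dictionaries rather than Arrow data; they only show why a first pass makes it possible for every batch in the second pass to share the same set of fields.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def infer_fields(path: Path) -> list[str]:
    """Pass 1 (hypothetical helper): collect every key in the file, in first-seen order."""
    fields: dict[str, None] = {}
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                fields.update(dict.fromkeys(json.loads(line)))
    return list(fields)


def read_batches(path: Path, fields: list[str], chunk_size: int) -> Iterator[list[dict[str, Any]]]:
    """Pass 2 (hypothetical helper): yield batches whose rows all carry the same keys."""
    batch: list[dict[str, Any]] = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            # Missing fields are filled with None so every row has an identical layout.
            batch.append({name: record.get(name) for name in fields})
            if len(batch) == chunk_size:
                yield batch
                batch = []
    if batch:
        yield batch


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "items.ndjson"
    path.write_text(
        '{"id": "a"}\n{"id": "b", "datetime": "2024-01-01T00:00:00Z"}\n{"id": "c"}\n',
        encoding="utf-8",
    )
    fields = infer_fields(path)
    batches = list(read_batches(path, fields, chunk_size=2))

print(fields)
print(batches)
```

Passing a known schema up front removes the first pass, which matters when the input is large or slow to read.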

parse_stac_ndjson_to_delta_lake

parse_stac_ndjson_to_delta_lake(
    input_path: str | Path | Iterable[str | Path],
    table_or_uri: str | Path | DeltaTable,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any,
) -> None

Convert one or more newline-delimited JSON STAC files to Delta Lake.

Parameters:

  • chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE ) –

    The chunk size to use for reading JSON into memory. Defaults to 65536.

  • schema (Schema | None, default: None ) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.

  • limit (int | None, default: None ) –

    The maximum number of JSON records to convert.

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS, default: DEFAULT_PARQUET_SCHEMA_VERSION ) –

    GeoParquet specification version; if not provided will default to latest supported version.

parse_stac_ndjson_to_parquet

parse_stac_ndjson_to_parquet(
    input_path: str | Path | Iterable[str | Path],
    output_path: str | Path,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | InferredSchema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    collections: Mapping[str, Mapping[str, Any]] | None = None,
    collection_metadata: Mapping[str, Any] | None = None,
    filesystem: FileSystem | None = None,
    **kwargs: Any,
) -> None

Convert one or more newline-delimited JSON STAC files to GeoParquet.

Parameters:

  • input_path (str | Path | Iterable[str | Path]) –

    One or more paths to files with STAC items.

  • output_path (str | Path) –

    A path to the output Parquet file.

Other Parameters:

  • chunk_size (int) –

    The chunk size. Defaults to 65536.

  • schema (Schema | InferredSchema | None) –

    The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.

  • limit (int | None) –

    The maximum number of JSON records to convert.

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

    GeoParquet specification version; if not provided will default to latest supported version.

  • collections (Mapping[str, Mapping[str, Any]] | None) –

    A dictionary mapping collection IDs to dictionaries representing a Collection in a SpatioTemporal Asset Catalog. This is stored in the Parquet file metadata under the key stac-geoparquet, within that entry's collections key.

  • collection_metadata (Mapping[str, Any] | None) –

    A dictionary representing a Collection in a SpatioTemporal Asset Catalog. This is stored in the Parquet file metadata under the key stac-geoparquet, within that entry's collection key.

    Deprecated in favor of collections.

  • filesystem (FileSystem | None) –

    PyArrow FileSystem to use for writing. If not provided, will be inferred from output_path for local files.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
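
The nesting described for the collections parameter can be pictured as follows. This is a sketch under stated assumptions: it assumes the value stored under stac-geoparquet is JSON-encoded text (Parquet key-value metadata holds strings or bytes), it shows only the collections key named above, and the collection itself is a placeholder. The metadata actually written by the library may contain additional keys.

```python
import json

# Placeholder collection, keyed by its collection ID.
collections = {
    "example-collection": {
        "type": "Collection",
        "id": "example-collection",
        "description": "A placeholder collection used only for this sketch.",
    }
}

# Assumption: the nested object is serialized to JSON before being stored
# as the value of the "stac-geoparquet" metadata key.
file_metadata = {"stac-geoparquet": json.dumps({"collections": collections})}

# Reading the collection back out of the metadata mapping.
decoded = json.loads(file_metadata["stac-geoparquet"])
print(decoded["collections"]["example-collection"]["id"])
```

Keying by collection ID is what lets a single file carry metadata for Items from more than one Collection, which the deprecated collection_metadata parameter cannot express.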

stac_table_to_items

stac_table_to_items(
    table: Table | RecordBatchReader | ArrowStreamExportable,
) -> Iterable[dict]

Convert STAC Arrow to a generator of STAC Item dicts.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

Yields:

  • dict

    A STAC Item dict for each row of the input.
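
Conceptually, this conversion reverses the flattening applied when Items are parsed into Arrow: columns that are not top-level STAC Item fields are gathered back under properties. The sketch below shows that idea on plain dictionaries. rows_to_items and TOP_LEVEL_FIELDS are hypothetical names introduced here, and the sketch leaves out everything else a real conversion would need, such as decoding geometries or timestamps.

```python
from typing import Any, Iterable, Iterator

# Top-level fields of a STAC Item other than "properties" itself.
TOP_LEVEL_FIELDS = {
    "type", "stac_version", "stac_extensions", "id",
    "geometry", "bbox", "links", "assets", "collection",
}


def rows_to_items(rows: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Hypothetical helper: lazily rebuild nested Item dicts from flat rows."""
    for row in rows:
        item = {k: v for k, v in row.items() if k in TOP_LEVEL_FIELDS}
        # Every remaining column is treated as an Item property.
        item["properties"] = {k: v for k, v in row.items() if k not in TOP_LEVEL_FIELDS}
        yield item


rows = [
    {"type": "Feature", "id": "a", "datetime": "2024-01-01T00:00:00Z", "eo:cloud_cover": 12.5},
]
items = list(rows_to_items(rows))
print(items[0])
```

Because rows_to_items is a generator, Items are produced one at a time, in the same spirit as the streaming behaviour described for RecordBatchReader inputs.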

stac_table_to_ndjson

stac_table_to_ndjson(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    dest: str | Path | PathLike[bytes],
) -> None

Write STAC Arrow to a newline-delimited JSON file.

Note

This function appends to the JSON file at dest; it does not overwrite any existing data.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

    The 'type' field is not required in stac-geoparquet. If not present, a 'type' field will be added where each record is 'Feature'.

  • dest (str | Path | PathLike[bytes]) –

    The destination where newline-delimited JSON should be written.
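
The append behaviour and the default 'type' field described above can be sketched with the standard library. append_ndjson is a hypothetical helper operating on plain dictionaries; it illustrates the semantics only and is not how stac_table_to_ndjson is implemented.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterable


def append_ndjson(items: Iterable[dict[str, Any]], dest: Path) -> None:
    """Hypothetical helper: append one JSON document per line to dest."""
    # Mode "a" keeps whatever is already in the file and writes after it.
    with dest.open("a", encoding="utf-8") as f:
        for item in items:
            # Add "type": "Feature" only when the record does not already have a type.
            record = {"type": "Feature", **item}
            f.write(json.dumps(record) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp) / "items.ndjson"
    append_ndjson([{"id": "a"}], dest)                     # first call creates the file
    append_ndjson([{"id": "b", "type": "Feature"}], dest)  # second call appends to it
    lines = dest.read_text(encoding="utf-8").splitlines()

print(lines)
```

With append semantics, calling the writer twice on the same destination duplicates records, so delete or truncate the file first when a fresh output is wanted.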

to_parquet

to_parquet(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    output_path: str | Path,
    *,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    collections: Mapping[str, Mapping[str, Any]] | None = None,
    collection_metadata: Mapping[str, Any] | None = None,
    filesystem: FileSystem | None = None,
    **kwargs: Any,
) -> None

Write an Arrow table with STAC data to GeoParquet.

This writes metadata compliant with either GeoParquet 1.0 or 1.1.

Parameters:

  • table (Table | RecordBatchReader | ArrowStreamExportable) –

    STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

  • output_path (str | Path) –

    The destination for saving.

Other Parameters:

  • schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) –

    GeoParquet specification version; if not provided will default to latest supported version.

  • collections (Mapping[str, Mapping[str, Any]] | None) –

    A dictionary mapping collection IDs to dictionaries representing a Collection in a SpatioTemporal Asset Catalog. This is stored in the Parquet file metadata under the key stac-geoparquet, within that entry's collections key.

  • collection_metadata (Mapping[str, Any] | None) –

    A dictionary representing a Collection in a SpatioTemporal Asset Catalog. This is stored in the Parquet file metadata under the key stac-geoparquet, within that entry's collection key.

    Deprecated in favor of collections.

  • filesystem (FileSystem | None) –

    PyArrow FileSystem to use for writing. If not provided, will be inferred from output_path for local files.

All other keyword args are passed on to pyarrow.parquet.ParquetWriter.