stac_geoparquet.arrow

Arrow-based format conversions.
DEFAULT_JSON_CHUNK_SIZE (module-attribute)

DEFAULT_JSON_CHUNK_SIZE = 65536

The default chunk size to use for reading JSON into memory.
DEFAULT_PARQUET_SCHEMA_VERSION (module-attribute)

DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = '1.1.0'

The default GeoParquet schema version written to file.
SUPPORTED_PARQUET_SCHEMA_VERSIONS (module-attribute)

SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal['1.0.0', '1.1.0']

A Literal type with the supported GeoParquet schema versions.
parse_stac_items_to_arrow

parse_stac_items_to_arrow(
    items: Iterable[Item | dict[str, Any]],
    *,
    chunk_size: int = 8192,
    schema: Schema | InferredSchema | None = None
) -> RecordBatchReader

Parse a collection of STAC Items to an iterable of pyarrow.RecordBatch.

The objects under properties are moved up to the top level of the Table, similar to geopandas.GeoDataFrame.from_features.
Parameters:

- items (Iterable[Item | dict[str, Any]]) – The STAC Items to convert.
- chunk_size (int, default: 8192) – The chunk size to use for Arrow record batches. This only takes effect if schema is not None. When schema is None, the input will be parsed into a single contiguous record batch. Defaults to 8192.
- schema (Schema | InferredSchema | None, default: None) – The schema of the input data. If provided, can improve memory use; otherwise all items need to be parsed into a single array for schema inference. Defaults to None.
Returns:

- RecordBatchReader – A pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
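A minimal sketch of converting in-memory STAC Item dictionaries to an Arrow stream; the hand-written item below is only an illustrative placeholder, not a real dataset:

```python
from stac_geoparquet.arrow import parse_stac_items_to_arrow

# A minimal, illustrative STAC Item; in practice you would pass items
# fetched from a STAC API or loaded from disk.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2020-01-01T00:00:00Z"},
    "links": [],
    "assets": {},
}

# Parse the items into a stream of Arrow record batches.
reader = parse_stac_items_to_arrow([item])

# Materialize the stream as a single pyarrow.Table if it fits in memory.
table = reader.read_all()
print(table.schema)
```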
parse_stac_ndjson_to_arrow

parse_stac_ndjson_to_arrow(
    path: str | Path | Iterable[str | Path],
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None
) -> RecordBatchReader

Convert one or more newline-delimited JSON STAC files to a generator of Arrow RecordBatches.

Each RecordBatch in the returned iterator is guaranteed to have an identical schema, and can be used to write to one or more Parquet files.
Parameters:

- path (str | Path | Iterable[str | Path]) – One or more paths to files with STAC items.
- chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE) – The chunk size. Defaults to 65536.
- schema (Schema | None, default: None) – The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data.
Other Parameters:

- limit (int | None) – The maximum number of JSON Items to use for schema inference.
Returns:

- RecordBatchReader – A pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
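A sketch of streaming newline-delimited STAC JSON into Arrow record batches; the file name items.ndjson is a placeholder:

```python
from stac_geoparquet.arrow import parse_stac_ndjson_to_arrow

# Stream a newline-delimited JSON file of STAC Items into Arrow.
# "items.ndjson" is a placeholder; point this at your own file(s).
reader = parse_stac_ndjson_to_arrow("items.ndjson", chunk_size=65_536)

# Every batch in the stream shares the same schema, so the stream can be
# written to Parquet or processed batch by batch without loading everything.
for batch in reader:
    print(batch.num_rows)
```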
parse_stac_ndjson_to_delta_lake

parse_stac_ndjson_to_delta_lake(
    input_path: str | Path | Iterable[str | Path],
    table_or_uri: str | Path | DeltaTable,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to Delta Lake.
Parameters:

- input_path (str | Path | Iterable[str | Path]) – One or more paths to files with STAC items.
- table_or_uri (str | Path | DeltaTable) – A path to the output Delta Lake table.
Other Parameters:

- chunk_size (int, default: DEFAULT_JSON_CHUNK_SIZE) – The chunk size to use for reading JSON into memory. Defaults to 65536.
- schema (Schema | None, default: None) – The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.
- limit (int | None, default: None) – The maximum number of JSON records to convert.
- schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS, default: DEFAULT_PARQUET_SCHEMA_VERSION) – GeoParquet specification version; if not provided, will default to the latest supported version.
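A sketch of writing newline-delimited STAC JSON to a Delta Lake table; both paths below are placeholders:

```python
from stac_geoparquet.arrow import parse_stac_ndjson_to_delta_lake

# Convert a newline-delimited JSON file of STAC Items into a Delta Lake
# table. Both paths are placeholders for your own input and output.
parse_stac_ndjson_to_delta_lake(
    "items.ndjson",
    "stac_delta_table",
    schema_version="1.1.0",  # one of SUPPORTED_PARQUET_SCHEMA_VERSIONS
)
```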
parse_stac_ndjson_to_parquet

parse_stac_ndjson_to_parquet(
    input_path: str | Path | Iterable[str | Path],
    output_path: str | Path,
    *,
    chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
    schema: Schema | InferredSchema | None = None,
    limit: int | None = None,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Convert one or more newline-delimited JSON STAC files to GeoParquet.
Parameters:

- input_path (str | Path | Iterable[str | Path]) – One or more paths to files with STAC items.
- output_path (str | Path) – A path to the output Parquet file.
Other Parameters:

- chunk_size (int) – The chunk size. Defaults to 65536.
- schema (Schema | InferredSchema | None) – The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.
- limit (int | None) – The maximum number of JSON records to convert.
- schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) – GeoParquet specification version; if not provided, will default to the latest supported version.
All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
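A sketch of converting newline-delimited STAC JSON straight to a GeoParquet file; the paths are placeholders, and the compression keyword is simply forwarded to pyarrow.parquet.ParquetWriter:

```python
from stac_geoparquet.arrow import parse_stac_ndjson_to_parquet

# Convert newline-delimited STAC JSON to a single GeoParquet file.
# Paths are placeholders; `compression` is forwarded to
# pyarrow.parquet.ParquetWriter along with any other extra keyword args.
parse_stac_ndjson_to_parquet(
    "items.ndjson",
    "items.parquet",
    schema_version="1.1.0",
    compression="zstd",
)
```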
stac_table_to_items

stac_table_to_items(
    table: Table | RecordBatchReader | ArrowStreamExportable,
) -> Iterable[dict]

Convert STAC Arrow to a generator of STAC Item dicts.
Parameters:

- table (Table | RecordBatchReader | ArrowStreamExportable) – STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.

Yields:

- dict – STAC Item dictionaries.
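A sketch of converting STAC GeoParquet back into Item dictionaries, assuming items.parquet (a placeholder path) was written by this library:

```python
import pyarrow.parquet as pq

from stac_geoparquet.arrow import stac_table_to_items

# Load a STAC GeoParquet file (placeholder path) as a pyarrow Table and
# convert each row back into a STAC Item dictionary.
table = pq.read_table("items.parquet")

for item in stac_table_to_items(table):
    print(item["id"])
```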
stac_table_to_ndjson

stac_table_to_ndjson(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    dest: str | Path | PathLike[bytes],
) -> None

Write STAC Arrow to a newline-delimited JSON file.

Note: This function appends to the JSON file at dest; it does not overwrite any existing data.
Parameters:

- table (Table | RecordBatchReader | ArrowStreamExportable) – STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
- dest (str | Path | PathLike[bytes]) – The destination where newline-delimited JSON should be written.
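A sketch of exporting STAC GeoParquet back to newline-delimited JSON; both paths are placeholders, and the destination is appended to rather than overwritten:

```python
import pyarrow.parquet as pq

from stac_geoparquet.arrow import stac_table_to_ndjson

# Read STAC GeoParquet (placeholder path) and append it to a
# newline-delimited JSON file. The destination is not overwritten.
table = pq.read_table("items.parquet")
stac_table_to_ndjson(table, "items.ndjson")
```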
to_parquet

to_parquet(
    table: Table | RecordBatchReader | ArrowStreamExportable,
    output_path: str | Path,
    *,
    schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
    **kwargs: Any
) -> None

Write an Arrow table with STAC data to GeoParquet.

This writes metadata compliant with either GeoParquet 1.0 or 1.1.
Parameters:

- table (Table | RecordBatchReader | ArrowStreamExportable) – STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
- output_path (str | Path) – The destination for saving.
Other Parameters:

- schema_version (SUPPORTED_PARQUET_SCHEMA_VERSIONS) – GeoParquet specification version; if not provided, will default to the latest supported version.
All other keyword args are passed on to pyarrow.parquet.ParquetWriter.
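A sketch that chains an Arrow stream from newline-delimited JSON (placeholder path) into a GeoParquet 1.1 file; extra keyword arguments would be forwarded to pyarrow.parquet.ParquetWriter:

```python
from stac_geoparquet.arrow import parse_stac_ndjson_to_arrow, to_parquet

# Build an Arrow stream from newline-delimited STAC JSON (placeholder path)
# and write it out with GeoParquet 1.1 metadata, streaming batch by batch.
reader = parse_stac_ndjson_to_arrow("items.ndjson")
to_parquet(reader, "items.parquet", schema_version="1.1.0")
```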