stac_geoparquet.arrow
Arrow-based format conversions.
DEFAULT_JSON_CHUNK_SIZE
module-attribute
DEFAULT_JSON_CHUNK_SIZE = 65536
The default chunk size to use for reading JSON into memory.
DEFAULT_PARQUET_SCHEMA_VERSION
module-attribute
DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = '1.1.0'
The default GeoParquet schema version written to file.
SUPPORTED_PARQUET_SCHEMA_VERSIONS
module-attribute
SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal['1.0.0', '1.1.0']
A Literal type with the supported GeoParquet schema versions.
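Because SUPPORTED_PARQUET_SCHEMA_VERSIONS is a Literal type, its allowed values can be read at runtime with typing.get_args. The following is a minimal sketch that uses local copies of the two definitions above so that it runs without the library installed; check_schema_version is a hypothetical helper, not part of stac_geoparquet.

```python
from typing import Literal, get_args

# Local copies of the definitions shown above (assumption: kept in sync by hand).
SUPPORTED_PARQUET_SCHEMA_VERSIONS = Literal["1.0.0", "1.1.0"]
DEFAULT_PARQUET_SCHEMA_VERSION: SUPPORTED_PARQUET_SCHEMA_VERSIONS = "1.1.0"


def check_schema_version(version: str) -> str:
    """Return `version` if it is a supported GeoParquet version, else raise."""
    allowed = get_args(SUPPORTED_PARQUET_SCHEMA_VERSIONS)
    if version not in allowed:
        raise ValueError(
            f"unsupported schema version {version!r}; expected one of {allowed}"
        )
    return version
```

A check like this may be useful when the version string comes from user input or configuration rather than from code that a type checker can verify.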
parse_stac_items_to_arrow
parse_stac_items_to_arrow(
items: Iterable[Item | dict[str, Any]],
chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
schema: ACCEPTED_SCHEMA_OPTIONS = "FullFile",
tmpdir: str | Path | None = None,
) -> RecordBatchReader
Parse a collection of STAC Items to an iterable of
pyarrow.RecordBatch.
The objects under properties are moved up to the top-level of the
Table, similar to
geopandas.GeoDataFrame.from_features.
Parameters:
-
items(Iterable[Item | dict[str, Any]]) –The STAC Items to convert.
-
chunk_size(int, default:DEFAULT_JSON_CHUNK_SIZE) –The chunk size to use for Arrow record batches. This only takes effect if schema is not None. When schema is None, the input will be parsed into a single contiguous record batch. Defaults to DEFAULT_JSON_CHUNK_SIZE (65536).
-
schema(ACCEPTED_SCHEMA_OPTIONS, default:'FullFile') –The schema of the input data. If provided, can improve memory use; otherwise all items need to be parsed into a single array for schema inference. This can also be set to a string value of "FullFile" which will scan the entire input in memory to get the schema, "FirstBatch" which will use the first batch of items to infer the schema, or "ChunksToDisk" which will write each chunk of items to disk as a Parquet file and then read them back in to unify the schema. Defaults to "FullFile".
Returns:
-
RecordBatchReader–pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
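To illustrate the layout described above, where the keys under properties become top-level columns, the sketch below flattens a single item dict by hand. promote_properties is a hypothetical helper for illustration only; the library's own conversion operates on Arrow data and may handle more than this (for example geometry encoding). items_to_reader shows the documented call and assumes stac_geoparquet is installed.

```python
from typing import Any, Iterable


def promote_properties(item: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `item` with the keys of item["properties"] at the top level."""
    flat = {key: value for key, value in item.items() if key != "properties"}
    flat.update(item.get("properties", {}))
    return flat


def items_to_reader(items: Iterable[dict[str, Any]]):
    """Call the documented function, inferring the schema from the first batch."""
    # Imported here so promote_properties stays usable without the library.
    import stac_geoparquet.arrow  # third-party, assumed installed

    return stac_geoparquet.arrow.parse_stac_items_to_arrow(items, schema="FirstBatch")
```

The input dict is left unmodified; only the returned copy has the flattened shape.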
parse_stac_items_to_parquet
parse_stac_items_to_parquet(
items: Iterable[Item | dict[str, Any]],
*,
chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
schema: ACCEPTED_SCHEMA_OPTIONS = "FirstBatch",
output_path: str | Path,
tmpdir: str | Path | None = None,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
filesystem: FileSystem | None = None,
**kwargs: Any,
) -> str
Parse an iterable of STAC Items into Parquet.
parse_stac_ndjson_to_arrow
parse_stac_ndjson_to_arrow(
path: str | Path | Iterable[str | Path],
*,
chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
schema: Schema | None = None,
limit: int | None = None,
) -> RecordBatchReader
Convert one or more newline-delimited JSON STAC files to a generator of Arrow RecordBatches.
Each RecordBatch in the returned iterator is guaranteed to have an identical schema, and can be used to write to one or more Parquet files.
Parameters:
-
path(str | Path | Iterable[str | Path]) –One or more paths to files with STAC items.
-
chunk_size(int, default:DEFAULT_JSON_CHUNK_SIZE) –The chunk size. Defaults to 65536.
-
schema(Schema | None, default:None) –The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data.
Other Parameters:
-
limit(int | None) –The maximum number of JSON Items to use for schema inference.
Returns:
-
RecordBatchReader–pyarrow RecordBatchReader with a stream of STAC Arrow RecordBatches.
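The two-pass behaviour described for schema=None can be sketched with the standard library alone. In this simplified illustration the set of top-level keys stands in for an Arrow schema; it is not the library's implementation, and read_records, infer_keys and read_chunks are hypothetical names.

```python
import json
from itertools import islice
from typing import Any, Iterator, Optional


def read_records(path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-blank line of an NDJSON file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def infer_keys(path, limit: Optional[int] = None) -> list[str]:
    """First pass: union of top-level keys over at most `limit` records."""
    keys: dict[str, None] = {}  # a dict keeps first-seen order
    for record in islice(read_records(path), limit):
        keys.update(dict.fromkeys(record))
    return list(keys)


def read_chunks(path, keys: list[str], chunk_size: int) -> Iterator[list[dict[str, Any]]]:
    """Second pass: yield chunks whose records all share the same keys."""
    records = read_records(path)
    while chunk := list(islice(records, chunk_size)):
        # Missing keys are filled with None so every chunk has one layout.
        yield [{key: record.get(key) for key in keys} for record in chunk]
```

Passing an explicit schema avoids the first pass, which is the trade-off the schema parameter describes.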
parse_stac_ndjson_to_delta_lake
parse_stac_ndjson_to_delta_lake(
input_path: str | Path | Iterable[str | Path],
table_or_uri: str | Path | DeltaTable,
*,
chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
schema: Schema | None = None,
limit: int | None = None,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
**kwargs: Any,
) -> None
Convert one or more newline-delimited JSON STAC files to Delta Lake.
Parameters:
-
input_path(str | Path | Iterable[str | Path]) –One or more paths to files with STAC items.
-
table_or_uri(str | Path | DeltaTable) –A path to the output Delta Lake table
Other Parameters:
-
chunk_size(int, default:DEFAULT_JSON_CHUNK_SIZE) –The chunk size to use for reading JSON into memory. Defaults to 65536.
-
schema(Schema | None, default:None) –The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.
-
limit(int | None, default:None) –The maximum number of JSON records to convert.
-
schema_version(SUPPORTED_PARQUET_SCHEMA_VERSIONS, default:DEFAULT_PARQUET_SCHEMA_VERSION) –GeoParquet specification version; if not provided will default to latest supported version.
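Since input_path accepts an iterable of paths, a directory of NDJSON files can be converted in one call. A minimal sketch follows, assuming the files use a .ndjson suffix and that stac_geoparquet is installed together with its Delta Lake dependency; collect_ndjson_paths and directory_to_delta_lake are hypothetical helpers.

```python
from pathlib import Path


def collect_ndjson_paths(directory) -> list[Path]:
    """Return the *.ndjson files in `directory`, sorted for a stable input order."""
    return sorted(Path(directory).glob("*.ndjson"))


def directory_to_delta_lake(directory, table_uri: str) -> None:
    """Write every NDJSON file found in `directory` to one Delta Lake table."""
    # Imported here so the path helper stays usable without the library.
    import stac_geoparquet.arrow  # third-party, assumed installed

    paths = collect_ndjson_paths(directory)
    if not paths:
        raise FileNotFoundError(f"no .ndjson files found in {directory}")
    stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake(paths, table_uri)
```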
parse_stac_ndjson_to_parquet
parse_stac_ndjson_to_parquet(
input_path: str | Path | Iterable[str | Path],
output_path: str | Path,
*,
chunk_size: int = DEFAULT_JSON_CHUNK_SIZE,
schema: Schema | InferredSchema | None = None,
limit: int | None = None,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
collections: Mapping[str, Mapping[str, Any]] | None = None,
collection_metadata: Mapping[str, Any] | None = None,
filesystem: FileSystem | None = None,
**kwargs: Any,
) -> None
Convert one or more newline-delimited JSON STAC files to GeoParquet.
Parameters:
-
input_path(str | Path | Iterable[str | Path]) –One or more paths to files with STAC items.
-
output_path(str | Path) –A path to the output Parquet file.
Other Parameters:
-
chunk_size(int) –The chunk size. Defaults to 65536.
-
schema(Schema | InferredSchema | None) –The schema to represent the input STAC data. Defaults to None, in which case the schema will first be inferred via a full pass over the input data. In this case, there will be two full passes over the input data: one to infer a common schema across all data and another to read the data and iteratively convert to GeoParquet.
-
limit(int | None) –The maximum number of JSON records to convert.
-
schema_version(SUPPORTED_PARQUET_SCHEMA_VERSIONS) –GeoParquet specification version; if not provided will default to latest supported version.
-
collections(Mapping[str, Mapping[str, Any]] | None) –A dictionary mapping collection IDs to dictionaries representing a Collection in a SpatioTemporal Asset Catalog. This will be stored under the key stac-geoparquet in the parquet file metadata, under the key collections.
-
collection_metadata(Mapping[str, Any] | None) –A dictionary representing a Collection in a SpatioTemporal Asset Catalog. This will be stored under the key stac-geoparquet in the parquet file metadata, under the key collection. Deprecated in favor of collections.
-
filesystem(FileSystem | None) –PyArrow FileSystem to use for writing. If not provided, will be inferred from output_path for local files.
All other keyword args are passed on to
pyarrow.parquet.ParquetWriter.
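The collections argument expects a mapping keyed by collection ID. The sketch below builds that mapping from a list of Collection dicts and passes it to the documented call; index_collections and ndjson_to_parquet_with_collections are hypothetical helpers, and the second assumes stac_geoparquet is installed.

```python
from typing import Any, Iterable


def index_collections(collections: Iterable[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Build the {collection_id: collection_dict} mapping from Collection dicts."""
    return {collection["id"]: collection for collection in collections}


def ndjson_to_parquet_with_collections(
    input_path: str,
    output_path: str,
    collections: Iterable[dict[str, Any]],
) -> None:
    """Convert NDJSON to GeoParquet and embed the Collections in file metadata."""
    # Imported here so index_collections stays usable without the library.
    import stac_geoparquet.arrow  # third-party, assumed installed

    stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(
        input_path,
        output_path,
        collections=index_collections(collections),
    )
```

If two dicts share an ID, the later one replaces the earlier one in the mapping.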
stac_table_to_items
stac_table_to_items(
table: Table | RecordBatchReader | ArrowStreamExportable,
) -> Iterable[dict]
Convert STAC Arrow to a generator of STAC Item dicts.
Parameters:
-
table(Table | RecordBatchReader | ArrowStreamExportable) –STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
Yields:
-
Iterable[dict]–STAC Item dicts.
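A short sketch of consuming the generator: count_by_collection tallies items by their collection value and works on any iterable of item dicts, while summarize_parquet applies it to a file read with pyarrow.parquet.read_table. Both names are hypothetical, and the second assumes pyarrow and stac_geoparquet are installed and that the file holds STAC data.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_collection(items: Iterable[dict[str, Any]]) -> Counter:
    """Tally STAC Item dicts by their "collection" value (None when absent)."""
    return Counter(item.get("collection") for item in items)


def summarize_parquet(path: str) -> Counter:
    """Read a stac-geoparquet file and tally its items per collection."""
    # Imported here so count_by_collection stays usable without these packages.
    import pyarrow.parquet as pq  # third-party, assumed installed
    import stac_geoparquet.arrow  # third-party, assumed installed

    table = pq.read_table(path)
    return count_by_collection(stac_geoparquet.arrow.stac_table_to_items(table))
```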
stac_table_to_ndjson
stac_table_to_ndjson(
table: Table | RecordBatchReader | ArrowStreamExportable,
dest: str | Path | PathLike[bytes],
) -> None
Write STAC Arrow to a newline-delimited JSON file.
Note
This function appends to the JSON file at dest; it does not overwrite any
existing data.
Parameters:
-
table(Table | RecordBatchReader | ArrowStreamExportable) –STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
The 'type' field is not required in stac-geoparquet. If it is not present, a 'type' field with the value 'Feature' will be added to each record.
-
dest(str | Path | PathLike[bytes]) –The destination where newline-delimited JSON should be written.
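Because the function appends to dest, a caller that wants a fresh file has to remove any existing one first. A minimal sketch of that pattern follows; remove_if_exists and write_fresh_ndjson are hypothetical helpers, and the second assumes stac_geoparquet is installed.

```python
from pathlib import Path


def remove_if_exists(dest) -> bool:
    """Delete `dest` if it exists; return True when a file was removed."""
    path = Path(dest)
    if path.exists():
        path.unlink()
        return True
    return False


def write_fresh_ndjson(table, dest) -> None:
    """Write `table` to `dest`, discarding any earlier contents of the file."""
    # Imported here so remove_if_exists stays usable without the library.
    import stac_geoparquet.arrow  # third-party, assumed installed

    remove_if_exists(dest)
    stac_geoparquet.arrow.stac_table_to_ndjson(table, dest)
```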
to_parquet
to_parquet(
table: Table | RecordBatchReader | ArrowStreamExportable,
output_path: str | Path,
*,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
collections: Mapping[str, Mapping[str, Any]] | None = None,
collection_metadata: Mapping[str, Any] | None = None,
filesystem: FileSystem | None = None,
**kwargs: Any,
) -> None
Write an Arrow table with STAC data to GeoParquet.
This writes metadata compliant with either GeoParquet 1.0 or 1.1.
Parameters:
-
table(Table | RecordBatchReader | ArrowStreamExportable) –STAC in Arrow form. This can be a pyarrow Table, a pyarrow RecordBatchReader, or any other Arrow stream object exposed through the Arrow PyCapsule Interface. A RecordBatchReader or stream object will not be materialized in memory.
-
output_path(str | Path) –The destination for saving.
Other Parameters:
-
schema_version(SUPPORTED_PARQUET_SCHEMA_VERSIONS) –GeoParquet specification version; if not provided will default to latest supported version.
-
collections(Mapping[str, Mapping[str, Any]] | None) –A dictionary mapping collection IDs to dictionaries representing a Collection in a SpatioTemporal Asset Catalog. This will be stored under the key stac-geoparquet in the parquet file metadata, under the key collections.
-
collection_metadata(Mapping[str, Any] | None) –A dictionary representing a Collection in a SpatioTemporal Asset Catalog. This will be stored under the key stac-geoparquet in the parquet file metadata, under the key collection. Deprecated in favor of collections.
-
filesystem(FileSystem | None) –PyArrow FileSystem to use for writing. If not provided, will be inferred from output_path for local files.
All other keyword args are passed on to
pyarrow.parquet.ParquetWriter.
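An end-to-end sketch that combines parse_stac_items_to_arrow with to_parquet. make_item builds a small, made-up STAC Item (the example.com URLs are placeholders), and write_items is a hypothetical helper that assumes stac_geoparquet is installed. The compression keyword is not a to_parquet parameter; per the note above it should be forwarded to pyarrow.parquet.ParquetWriter.

```python
from typing import Any


def make_item(index: int) -> dict[str, Any]:
    """Build a small, made-up STAC Item dict for demonstration purposes."""
    lon, lat = float(index), 0.0
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": f"item-{index}",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "bbox": [lon, lat, lon, lat],
        "properties": {"datetime": "2024-01-01T00:00:00Z"},
        "links": [{"rel": "self", "href": f"https://example.com/items/item-{index}.json"}],
        "assets": {"data": {"href": f"https://example.com/data/item-{index}.tif"}},
    }


def write_items(output_path: str, count: int = 3) -> None:
    """Convert `count` generated items to Arrow and write them as GeoParquet."""
    # Imported here so make_item stays usable without the library.
    import stac_geoparquet.arrow  # third-party, assumed installed

    reader = stac_geoparquet.arrow.parse_stac_items_to_arrow(
        make_item(i) for i in range(count)
    )
    stac_geoparquet.arrow.to_parquet(
        reader,
        output_path,
        schema_version="1.1.0",
        compression="zstd",  # extra keyword, passed through to ParquetWriter
    )
```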