Usage
Apache Arrow is used as the in-memory interchange format between all supported formats. While some end-to-end helper functions are provided, users can work with Arrow objects directly for maximal flexibility in the conversion process.
All functionality that goes through Arrow is currently exported via the stac_geoparquet.arrow namespace.
dict/JSON - Arrow conversion
Convert dicts to Arrow
Use parse_stac_items_to_arrow to convert STAC items, whether in memory or loaded from disk, to a stream of Arrow record batches. It accepts either an iterable of Python dicts or an iterable of pystac.Item objects.
For example:
import pyarrow as pa
import pystac

import stac_geoparquet

# Load a single STAC Item from the Microsoft Planetary Computer STAC API.
item = pystac.read_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items/S2A_MSIL2A_20230112T104411_R008_T29NPE_20230113T053333"
)
assert isinstance(item, pystac.Item)

# Convert the Items to a stream of Arrow record batches, then materialize the
# stream into a single Arrow table.
record_batch_reader = stac_geoparquet.arrow.parse_stac_items_to_arrow([item])
table = record_batch_reader.read_all()
Convert JSON to Arrow
parse_stac_ndjson_to_arrow is a helper function to take one or more JSON or newline-delimited JSON files on disk, infer the schema from all of them, and convert the data to a stream of Arrow record batches.
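For example, a minimal sketch that assumes a hypothetical local file items.ndjson containing newline-delimited STAC Items:

import pyarrow as pa

import stac_geoparquet

# "items.ndjson" is a hypothetical newline-delimited JSON file of STAC Items.
batches = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow("items.ndjson")

# Materialize the stream of record batches into a single Arrow table
# (appropriate for inputs that fit in memory).
table = pa.Table.from_batches(batches)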
Convert Arrow to dicts
Use stac_table_to_items to convert a table or stream of Arrow record batches of STAC data to a generator of Python dicts. This accepts either a pyarrow.Table or a pyarrow.RecordBatchReader, which allows conversions of larger-than-memory files in a streaming manner.
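For example, a minimal sketch that reuses the table created above (any pyarrow.Table or pyarrow.RecordBatchReader of STAC data works):

import stac_geoparquet

# Iterate over the generator of STAC Item dicts and print each Item's id.
for item in stac_geoparquet.arrow.stac_table_to_items(table):
    print(item["id"])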
Convert Arrow to JSON
Use stac_table_to_ndjson to convert a table or stream of Arrow record batches of STAC data to a newline-delimited JSON file. This accepts either a pyarrow.Table or a pyarrow.RecordBatchReader, which allows conversions of larger-than-memory files in a streaming manner.
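For example, a minimal sketch, writing the table from above to a hypothetical output path:

import stac_geoparquet

# "items.ndjson" is a hypothetical output path for the newline-delimited JSON.
stac_geoparquet.arrow.stac_table_to_ndjson(table, "items.ndjson")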
Parquet
Use to_parquet to write STAC Arrow data from memory to a path or file-like object. This is a special function to ensure that GeoParquet 1.0 or 1.1 metadata is written to the Parquet file.
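For example, a minimal sketch, again using the table from above; the output path is hypothetical:

import stac_geoparquet

# "items.parquet" is a hypothetical output path; GeoParquet metadata is
# written alongside the data.
stac_geoparquet.arrow.to_parquet(table, "items.parquet")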
parse_stac_ndjson_to_parquet is a helper that connects reading (newline-delimited) JSON on disk to writing out to a Parquet file.
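For example, a minimal sketch with hypothetical input and output paths:

import stac_geoparquet

# Read newline-delimited STAC Items from "items.ndjson" and write a
# STAC GeoParquet file to "items.parquet" (both paths are hypothetical).
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet("items.ndjson", "items.parquet")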
No special API is required for reading a STAC GeoParquet file back into Arrow. You can use pyarrow.parquet.read_table or pyarrow.parquet.ParquetFile directly to read the STAC GeoParquet data back into Arrow.
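For example:

import pyarrow.parquet as pq

# Read the STAC GeoParquet file (hypothetical path from above) back into an
# Arrow table with plain pyarrow.
table = pq.read_table("items.parquet")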
Delta Lake
Use parse_stac_ndjson_to_delta_lake to read (newline-delimited) JSON on disk and write out to a Delta Lake table.
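For example, a minimal sketch, again with hypothetical paths:

import stac_geoparquet

# Read newline-delimited STAC Items from "items.ndjson" and write them to a
# Delta Lake table at "path/to/delta-table" (both paths are hypothetical).
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake("items.ndjson", "path/to/delta-table")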
No special API is required for reading a STAC Delta Lake table back into Arrow. You can use the DeltaTable class directly to read the data back into Arrow.
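For example:

from deltalake import DeltaTable

# Load the Delta Lake table (hypothetical path from above) and read it back
# into an Arrow table.
table = DeltaTable("path/to/delta-table").to_pyarrow_table()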
Important
Arrow has a null data type, in which every value in the column is null; Delta Lake does not. This means that writing any column inferred to have a null data type to Delta Lake will fail with
_internal.SchemaMismatchError: Invalid data type for Delta Lake: Null
This is a problem because if a JSON key is null for all items in a STAC Collection, it is inferred as an Arrow null type. For example, the 3dep-lidar-copc collection used in the tests has start_datetime and end_datetime fields, so, per the spec, its datetime is always null. This column would need to be cast to a timestamp type before being written to Delta Lake.
This means the collection cannot be written to Delta Lake solely with automatic schema inference. In such cases, you may need to manually update the inferred schema, casting any null-typed column to a Delta Lake-compatible type.
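One possible approach, sketched below, is to materialize the data into an Arrow table, cast any null-typed columns to a chosen type, and then write the table with the deltalake package directly, bypassing the parse_stac_ndjson_to_delta_lake helper. The file paths and the choice of timestamp type here are assumptions, not part of the library's API.

import pyarrow as pa
from deltalake import write_deltalake

import stac_geoparquet

# "items.ndjson" is a hypothetical newline-delimited JSON file of STAC Items.
batches = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow("items.ndjson")
table = pa.Table.from_batches(batches)

# Replace every null-typed field with a Delta Lake-compatible type. Casting
# from the Arrow null type is always valid and produces all-null values.
# A UTC timestamp is assumed here; pick a type that fits your data.
fields = [
    field.with_type(pa.timestamp("us", tz="UTC")) if pa.types.is_null(field.type) else field
    for field in table.schema
]
table = table.cast(pa.schema(fields))

# Write the cast table to a Delta Lake table at a hypothetical path.
write_deltalake("path/to/delta-table", table)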