NAIP example¶
We'll use STAC Items from the NAIP STAC Collection on Microsoft's Planetary Computer to illustrate how to use the stac-geoparquet library.
There are a few libraries we need to install to run this notebook:
pip install pystac-client stac-geoparquet pyarrow deltalake lonboard
import json
from pathlib import Path
import deltalake
import lonboard
import pyarrow as pa
import pyarrow.parquet as pq
import pystac_client
import stac_geoparquet
We can open the Planetary Computer STAC API with pystac_client.Client.open.
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1"
)
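We don't sign the asset URLs here, since this notebook only uses Item metadata rather than the assets themselves. If you later want to download assets, Planetary Computer requires signed URLs; as a sketch, pystac-client accepts a modifier callback for this (this assumes the separate planetary-computer package, which is not in the install list above):
import planetary_computer

# Assumption: pip install planetary-computer. The sign_inplace modifier
# rewrites asset hrefs with short-lived SAS tokens as Items are returned.
signed_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)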
Now we'll access the NAIP collection from the Planetary Computer catalog and download 1000 items from this collection, writing them to a newline-delimited JSON file in the current directory.
items_iter = catalog.get_collection("naip").get_items()
max_items = 1000
naip_json_path = Path("naip.jsonl")
if not naip_json_path.exists():
    with open(naip_json_path, "w") as f:
        count = 0
        for item in items_iter:
            json.dump(item.to_dict(), f, separators=(",", ":"))
            f.write("\n")
            count += 1
            if count >= max_items:
                break
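As a quick sanity check (a minimal sketch, not part of the conversion workflow itself), we can confirm the file holds one JSON object per line:
# Each line in naip.jsonl is one serialized STAC Item; this should
# print the number of items we downloaded (up to max_items).
with open(naip_json_path) as f:
    print(sum(1 for _ in f))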
We can now use stac-geoparquet APIs on this data.
Loading to Arrow¶
We can load to an Arrow RecordBatchReader by using the parse_stac_ndjson_to_arrow function.
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(naip_json_path)
The Arrow RecordBatchReader represents a stream of Arrow batches, which is useful when converting a very large STAC collection that you don't want to materialize in memory all at once. We can convert this stream to an Arrow table with read_all.
table = record_batch_reader.read_all()
table.schema
assets: struct<image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>, rendered_preview: struct<href: string, rel: string, roles: list<item: string>, title: string, type: string>, thumbnail: struct<href: string, roles: list<item: string>, title: string, type: string>, tilejson: struct<href: string, roles: list<item: string>, title: string, type: string>>
  child 0, image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>
      child 0, eo:bands: list<item: struct<common_name: string, description: string, name: string>>
          child 0, item: struct<common_name: string, description: string, name: string>
              child 0, common_name: string
              child 1, description: string
              child 2, name: string
      child 1, href: string
      child 2, roles: list<item: string>
          child 0, item: string
      child 3, title: string
      child 4, type: string
  child 1, rendered_preview: struct<href: string, rel: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, rel: string
      child 2, roles: list<item: string>
          child 0, item: string
      child 3, title: string
      child 4, type: string
  child 2, thumbnail: struct<href: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, title: string
      child 3, type: string
  child 3, tilejson: struct<href: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, title: string
      child 3, type: string
bbox: struct<xmin: double, ymin: double, xmax: double, ymax: double>
  child 0, xmin: double
  child 1, ymin: double
  child 2, xmax: double
  child 3, ymax: double
collection: string
geometry: binary
  -- field metadata --
  ARROW:extension:name: 'geoarrow.wkb'
  ARROW:extension:metadata: '{"crs":{"$schema":"https://proj.org/schemas/' + 1296
id: string
links: list<item: struct<href: string, rel: string, title: string, type: string>>
  child 0, item: struct<href: string, rel: string, title: string, type: string>
      child 0, href: string
      child 1, rel: string
      child 2, title: string
      child 3, type: string
stac_extensions: list<item: string>
  child 0, item: string
stac_version: string
type: string
datetime: timestamp[us, tz=UTC]
gsd: double
naip:state: string
naip:year: string
proj:bbox: list<item: double>
  child 0, item: double
proj:centroid: struct<lat: double, lon: double>
  child 0, lat: double
  child 1, lon: double
proj:epsg: int64
proj:shape: list<item: int64>
  child 0, item: int64
proj:transform: list<item: double>
  child 0, item: double
providers: list<item: struct<name: string, roles: list<item: string>, url: string>>
  child 0, item: struct<name: string, roles: list<item: string>, url: string>
      child 0, name: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, url: string
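The conversion is reversible, too. As a sketch, the stac_table_to_items helper in stac_geoparquet.arrow (assuming the current API, which yields plain Python dicts) takes us from the Arrow table back to STAC Items:
# Convert the Arrow table back into STAC Item dicts and inspect one id.
items = list(stac_geoparquet.arrow.stac_table_to_items(table))
items[0]["id"]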
We can also pass a small chunk size into parse_stac_ndjson_to_arrow to show how the streaming works.
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(
    naip_json_path, chunk_size=100
)
record_batch_reader is an iterator that yields Arrow RecordBatch objects. If we load just the first one, we'll see that it contains 100 rows.
first_batch = next(record_batch_reader)
first_batch.num_rows
100
Materializing the rest of the batches from the iterator into a table gives us the other 900 rows in the dataset.
other_batches = record_batch_reader.read_all()
other_batches.num_rows
900
All batches from the RecordBatchReader have the same schema, so we can concatenate them back into a single table:
combined_table = pa.concat_tables([pa.Table.from_batches([first_batch]), other_batches])
Both the original table object and this combined_table object have the exact same data:
table == combined_table
True
Converting to Parquet¶
We can use the utility function parse_stac_ndjson_to_parquet to convert the items directly to GeoParquet.
naip_parquet_path = "naip.parquet"
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(naip_json_path, naip_parquet_path)
Reading that Parquet data back into Arrow with pyarrow.parquet.read_table gives us the exact same Arrow data as before.
pq.read_table(naip_parquet_path) == table
True
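For a collection too large to materialize, you could also stream batches straight into a Parquet file yourself with pyarrow.parquet.ParquetWriter. A rough sketch (naip_streamed.parquet is a hypothetical output path; note that, unlike parse_stac_ndjson_to_parquet, a plain ParquetWriter won't add GeoParquet file metadata):
# Write each Arrow batch as it arrives instead of building one big table.
reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(
    naip_json_path, chunk_size=100
)
with pq.ParquetWriter("naip_streamed.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)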
Converting to Delta Lake¶
We can use the utility function parse_stac_ndjson_to_delta_lake to convert items directly to Delta Lake.
naip_delta_lake_path = "naip_table"
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake(
    naip_json_path, naip_delta_lake_path, mode="overwrite"
)
Reading the Delta Lake table back into Arrow with deltalake.DeltaTable gives us the exact same Arrow data as before.
deltalake.DeltaTable(naip_delta_lake_path).to_pyarrow_table() == table
True
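Because the data is now in a tabular format, we can also scan it selectively. A minimal sketch, assuming deltalake's to_pyarrow_dataset and pyarrow's dataset-level filtering (the naip:year value here is illustrative):
import pyarrow.compute as pc

# Expose the Delta table as a pyarrow dataset and push the filter down
# into the scan instead of reading everything into memory.
dataset = deltalake.DeltaTable(naip_delta_lake_path).to_pyarrow_dataset()
subset = dataset.to_table(filter=pc.field("naip:year") == "2013")
subset.num_rows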
Visualizing with Lonboard¶
We can also connect this to Lonboard to visualize the item geometries on a map.
m = lonboard.viz(table)
m