NAIP example¶
We'll use STAC Items from the NAIP STAC Collection on Microsoft's Planetary Computer to illustrate how to use the stac-geoparquet library.
There are a few libraries we need to install to run this notebook:
pip install pystac-client stac-geoparquet pyarrow deltalake lonboard
import json
from pathlib import Path
import deltalake
import lonboard
import pyarrow as pa
import pyarrow.parquet as pq
import pystac_client
import stac_geoparquet
We can open the Planetary Computer STAC API with pystac_client.Client.open.
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1"
)
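We don't sign the asset URLs here, since this notebook only uses Item metadata rather than the assets themselves. If you later want to download assets, Planetary Computer requires signed URLs; as a sketch, pystac-client accepts a modifier callback for this (this assumes the separate planetary-computer package, which is not in the install list above):
import planetary_computer

# Assumption: pip install planetary-computer. The sign_inplace modifier
# rewrites asset hrefs with short-lived SAS tokens as Items are returned.
signed_catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)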
Now we'll access the NAIP collection from the Planetary Computer catalog and download 1000 items from this collection, writing them to a newline-delimited JSON file in the current directory.
items_iter = catalog.get_collection("naip").get_items()
max_items = 1000
naip_json_path = Path("naip.jsonl")
if not naip_json_path.exists():
    with open(naip_json_path, "w") as f:
        count = 0
        for item in items_iter:
            json.dump(item.to_dict(), f, separators=(",", ":"))
            f.write("\n")
            count += 1
            if count >= max_items:
                break
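As a quick sanity check (a minimal sketch, not part of the conversion workflow itself), we can confirm the file holds one JSON object per line:
# Each line in naip.jsonl is one serialized STAC Item; this should
# print the number of items we downloaded (up to max_items).
with open(naip_json_path) as f:
    print(sum(1 for _ in f))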
We can now use stac-geoparquet APIs on this data.
Loading to Arrow¶
We can load to an Arrow RecordBatchReader by using the parse_stac_ndjson_to_arrow function.
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(naip_json_path)
The Arrow RecordBatchReader represents a stream of Arrow batches, which is useful when converting a very large STAC collection that you don't want to materialize in memory all at once. We can convert this stream to an Arrow table with read_all.
table = record_batch_reader.read_all()
table.schema
assets: struct<image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>, rendered_preview: struct<href: string, rel: string, roles: list<item: string>, title: string, type: string>, thumbnail: struct<href: string, roles: list<item: string>, title: string, type: string>, tilejson: struct<href: string, roles: list<item: string>, title: string, type: string>>
  child 0, image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>
      child 0, eo:bands: list<item: struct<common_name: string, description: string, name: string>>
          child 0, item: struct<common_name: string, description: string, name: string>
              child 0, common_name: string
              child 1, description: string
              child 2, name: string
      child 1, href: string
      child 2, roles: list<item: string>
          child 0, item: string
      child 3, title: string
      child 4, type: string
  child 1, rendered_preview: struct<href: string, rel: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, rel: string
      child 2, roles: list<item: string>
          child 0, item: string
      child 3, title: string
      child 4, type: string
  child 2, thumbnail: struct<href: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, title: string
      child 3, type: string
  child 3, tilejson: struct<href: string, roles: list<item: string>, title: string, type: string>
      child 0, href: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, title: string
      child 3, type: string
bbox: struct<xmin: double, ymin: double, xmax: double, ymax: double>
  child 0, xmin: double
  child 1, ymin: double
  child 2, xmax: double
  child 3, ymax: double
collection: string
geometry: binary
  -- field metadata --
  ARROW:extension:name: 'geoarrow.wkb'
  ARROW:extension:metadata: '{"crs":{"$schema":"https://proj.org/schemas/' + 1296
id: string
links: list<item: struct<href: string, rel: string, title: string, type: string>>
  child 0, item: struct<href: string, rel: string, title: string, type: string>
      child 0, href: string
      child 1, rel: string
      child 2, title: string
      child 3, type: string
stac_extensions: list<item: string>
  child 0, item: string
stac_version: string
type: string
datetime: timestamp[us, tz=UTC]
gsd: double
naip:state: string
naip:year: string
proj:bbox: list<item: double>
  child 0, item: double
proj:centroid: struct<lat: double, lon: double>
  child 0, lat: double
  child 1, lon: double
proj:epsg: int64
proj:shape: list<item: int64>
  child 0, item: int64
proj:transform: list<item: double>
  child 0, item: double
providers: list<item: struct<name: string, roles: list<item: string>, url: string>>
  child 0, item: struct<name: string, roles: list<item: string>, url: string>
      child 0, name: string
      child 1, roles: list<item: string>
          child 0, item: string
      child 2, url: string
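The conversion is reversible, too. As a sketch, the stac_table_to_items helper in stac_geoparquet.arrow (assuming the current API, which yields plain Python dicts) takes us from the Arrow table back to STAC Items:
# Convert the Arrow table back into STAC Item dicts and inspect one id.
items = list(stac_geoparquet.arrow.stac_table_to_items(table))
items[0]["id"]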
We can also pass a small chunk size into parse_stac_ndjson_to_arrow to show how the streaming works.
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(
    naip_json_path, chunk_size=100
)
record_batch_reader is an iterator that yields Arrow RecordBatch objects. If we load just the first one, we'll see that it contains 100 rows.
first_batch = next(record_batch_reader)
first_batch.num_rows
100
Materializing the rest of the batches from the iterator into a table gives us the other 900 rows in the dataset.
other_batches = record_batch_reader.read_all()
other_batches.num_rows
900
All batches from the RecordBatchReader have the same schema, so we can concatenate them back into a single table:
combined_table = pa.concat_tables([pa.Table.from_batches([first_batch]), other_batches])
Both the original table object and this combined_table object have the exact same data:
table == combined_table
True
Converting to Parquet¶
We can use the utility function parse_stac_ndjson_to_parquet to convert the items directly to GeoParquet.
naip_parquet_path = "naip.parquet"
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(naip_json_path, naip_parquet_path)
Reading that Parquet data back into Arrow with pyarrow.parquet.read_table gives us the exact same Arrow data as before.
pq.read_table(naip_parquet_path) == table
True
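For a collection too large to materialize, you could also stream batches straight into a Parquet file yourself with pyarrow.parquet.ParquetWriter. A rough sketch (naip_streamed.parquet is a hypothetical output path; note that, unlike parse_stac_ndjson_to_parquet, a plain ParquetWriter won't add GeoParquet file metadata):
# Write each Arrow batch as it arrives instead of building one big table.
reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(
    naip_json_path, chunk_size=100
)
with pq.ParquetWriter("naip_streamed.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)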
Converting to Delta Lake¶
We can use the utility function parse_stac_ndjson_to_delta_lake to convert items directly to Delta Lake.
naip_delta_lake_path = "naip_table"
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake(
    naip_json_path, naip_delta_lake_path, mode="overwrite"
)
Reading the Delta Lake table back into Arrow with deltalake.DeltaTable gives us the exact same Arrow data as before.
deltalake.DeltaTable(naip_delta_lake_path).to_pyarrow_table() == table
True
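Because the data is now in a tabular format, we can also scan it selectively. A minimal sketch, assuming deltalake's to_pyarrow_dataset and pyarrow's dataset-level filtering (the naip:year value here is illustrative):
import pyarrow.compute as pc

# Expose the Delta table as a pyarrow dataset and push the filter down
# into the scan instead of reading everything into memory.
dataset = deltalake.DeltaTable(naip_delta_lake_path).to_pyarrow_dataset()
subset = dataset.to_table(filter=pc.field("naip:year") == "2013")
subset.num_rows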
Visualizing with Lonboard¶
We can also connect this to Lonboard to visualize the item geometries on a map.
m = lonboard.viz(table)
m