# NAIP example

We'll use STAC Items from the [NAIP STAC Collection](https://planetarycomputer.microsoft.com/dataset/naip) on Microsoft's Planetary Computer to illustrate how to use the `stac-geoparquet` library.


There are a few libraries we need to install to run this notebook:

```
pip install pystac-client stac-geoparquet pyarrow deltalake lonboard
```


In [1]:
import json
from pathlib import Path

import deltalake
import lonboard
import pyarrow as pa
import pyarrow.parquet as pq
import pystac_client

import stac_geoparquet

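We can open the Planetary Computer STAC API with `pystac_client.Client.open`. No signing modifier is passed here, so the asset URLs in each STAC Item are returned unsigned; that is fine for this example because we only work with the item metadata.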


In [2]:
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

Now we'll access the NAIP collection from the Planetary Computer catalog and download the first 1000 items it returns, writing them to a newline-delimited JSON file (`naip.jsonl`) in the current directory. The download is skipped if that file already exists.


In [4]:
items_iter = catalog.get_collection("naip").get_items()

max_items = 1000
naip_json_path = Path("naip.jsonl")
if not naip_json_path.exists():
    with open(naip_json_path, "w") as f:
        count = 0

        for item in items_iter:
            json.dump(item.to_dict(), f, separators=(",", ":"))
            f.write("\n")

            count += 1
            if count >= max_items:
                break
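The manual counter in the loop above can also be expressed with `itertools.islice`. The following is a minimal sketch using only the standard library; `write_ndjson` and `fake_items` are hypothetical names for illustration, and the generated dicts are stand-ins for real STAC items. With pystac items, one would pass something like `(item.to_dict() for item in items_iter)` instead.

```python
import itertools
import json
import tempfile
from pathlib import Path


def write_ndjson(items, path, max_items):
    """Write at most `max_items` dicts from `items` to `path`, one JSON object per line."""
    written = 0
    with open(path, "w") as f:
        # islice stops consuming the iterator once the cap is reached.
        for item in itertools.islice(items, max_items):
            json.dump(item, f, separators=(",", ":"))
            f.write("\n")
            written += 1
    return written


# Hypothetical stand-in for STAC item dicts; real items are much larger.
fake_items = ({"type": "Feature", "id": f"item-{i}"} for i in range(5000))

out_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
n_written = write_ndjson(fake_items, out_path, max_items=1000)

# Each line of the file parses back to exactly one item.
with open(out_path) as f:
    ids = [json.loads(line)["id"] for line in f]
```

Because the source is a lazy iterator, only the first `max_items` elements are ever pulled from it.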

We can now use `stac-geoparquet` APIs on this data.


### Loading to Arrow

We can load the items into an Arrow `RecordBatchReader` with the `parse_stac_ndjson_to_arrow` function.


In [5]:
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(naip_json_path)

The Arrow `RecordBatchReader` represents a _stream_ of Arrow batches. This is useful when converting a very large STAC collection that you don't want to materialize in memory all at once.

We can convert this to an Arrow table with `read_all`.


In [6]:
table = record_batch_reader.read_all()
table.schema

assets: struct<image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>, rendered_preview: struct<href: string, rel: string, roles: list<item: string>, title: string, type: string>, thumbnail: struct<href: string, roles: list<item: string>, title: string, type: string>, tilejson: struct<href: string, roles: list<item: string>, title: string, type: string>>
  child 0, image: struct<eo:bands: list<item: struct<common_name: string, description: string, name: string>>, href: string, roles: list<item: string>, title: string, type: string>
      child 0, eo:bands: list<item: struct<common_name: string, description: string, name: string>>
          child 0, item: struct<common_name: string, description: string, name: string>
              child 0, common_name: string
              child 1, description: string
              child 2, name: string
      child 1, href: string
      chi

We can also pass a small chunk size into `parse_stac_ndjson_to_arrow` to show how the streaming works.


In [7]:
record_batch_reader = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow(
    naip_json_path, chunk_size=100
)

`record_batch_reader` is an iterator that yields Arrow `RecordBatch` objects. If we load just the first one, we'll see that it contains 100 rows.


In [8]:
first_batch = next(record_batch_reader)
first_batch.num_rows

100
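To build intuition for what a chunk size means when reading newline-delimited JSON, here is a rough standard-library sketch. It is not how `stac-geoparquet` implements chunking; `iter_ndjson_chunks` is a hypothetical helper, and the in-memory `ndjson` buffer stands in for a file on disk.

```python
import io
import itertools
import json


def iter_ndjson_chunks(lines, chunk_size):
    """Yield lists of parsed JSON objects, at most `chunk_size` per list."""
    lines = iter(lines)
    while True:
        # Pull up to `chunk_size` lines; an empty result means the input is exhausted.
        chunk = [json.loads(line) for line in itertools.islice(lines, chunk_size)]
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for an NDJSON file containing 250 items.
ndjson = io.StringIO("".join(json.dumps({"id": i}) + "\n" for i in range(250)))

chunk_sizes = [len(chunk) for chunk in iter_ndjson_chunks(ndjson, chunk_size=100)]
```

Only one chunk of parsed items needs to be held in memory at a time, and the final chunk may be smaller than `chunk_size`.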

Materializing the rest of the batches from the iterator into a table gives us the other 900 rows in the dataset.


In [9]:
other_batches = record_batch_reader.read_all()
other_batches.num_rows

900

All batches from the `RecordBatchReader` have the same schema, so we can concatenate them back into a single table:


In [10]:
combined_table = pa.concat_tables([pa.Table.from_batches([first_batch]), other_batches])

Both the original `table` object and this `combined_table` object have the exact same data:


In [11]:
table == combined_table

True

### Converting to Parquet

We can use the utility function `parse_stac_ndjson_to_parquet` to convert the items directly to GeoParquet.


In [12]:
naip_parquet_path = "naip.parquet"
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet(naip_json_path, naip_parquet_path)

Reading that Parquet data back into Arrow with `pyarrow.parquet.read_table` gives us the exact same Arrow data as before.


In [13]:
pq.read_table(naip_parquet_path) == table

True

### Converting to Delta Lake

We can use the utility function `parse_stac_ndjson_to_delta_lake` to convert items directly to Delta Lake.


In [14]:
naip_delta_lake_path = "naip_table"
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake(
    naip_json_path, naip_delta_lake_path, mode="overwrite"
)

Reading the Delta Lake table back into Arrow with `deltalake.DeltaTable` gives us the exact same Arrow data as before.


In [15]:
deltalake.DeltaTable(naip_delta_lake_path).to_pyarrow_table() == table

True

### Visualizing with Lonboard


We can also connect this to Lonboard to visualize the items on a map.


In [17]:
m = lonboard.viz(table)
m

Map(basemap_style=<CartoBasemap.DarkMatter: 'https://basemaps.cartocdn.com/gl/dark-matter-gl-style/style.json'…