stac-geoparquet¶
stac-geoparquet is a specification for storing STAC items in GeoParquet files. There are (at least) two Python libraries for reading and writing stac-geoparquet:
- stac-geoparquet lives in the same repository as the specification
- Our rustac implementation does more of the hard work in Rust
For more on the difference between the two implementations, see our README.
Creating stac-geoparquet¶
Create stac-geoparquet from an iterable of items.
from typing import Any
import os
import datetime
import humanize
import rustac
def create_item(
    id: str, dt: datetime.datetime, extra_properties: dict[str, Any] | None = None
) -> dict[str, Any]:
    properties = {
        "datetime": dt.isoformat(),
    }
    if extra_properties:
        properties.update(extra_properties)
    return {
        "type": "Feature",
        "stac_version": "1.1.0",
        "id": id,
        "geometry": {"type": "Point", "coordinates": [-105.1019, 40.1672]},
        "bbox": [-105.1019, 40.1672, -105.1019, 40.1672],
        "properties": properties,
        # Assets can't be empty at the moment: https://github.com/stac-utils/rustac/issues/766
        "assets": {
            "data": {
                "href": "https://storage.googleapis.com/open-cogs/stac-examples/20201211_223832_CS2.jpg"
            }
        },
        "links": [],
    }
items = [
    create_item(
        f"item-{i}",
        datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
        + datetime.timedelta(hours=i),
    )
    for i in range(10000)
]
await rustac.write("items.parquet", items)
print(humanize.naturalsize(os.path.getsize("items.parquet")))
150.2 kB
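For a sense of how compact the columnar encoding is, you can dump the same items to a plain JSON file and compare sizes. This is a quick check that isn't part of the original example, and the exact numbers will vary:
import json

with open("items.json", "w") as f:
    json.dump({"type": "FeatureCollection", "features": items}, f)

# The plain JSON file is typically many times larger than the Parquet file.
print(humanize.naturalsize(os.path.getsize("items.json")))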
Reading is just as simple.
import json
item_collection = await rustac.read("items.parquet")
print(json.dumps(item_collection["features"][0], indent=2))
{ "type": "Feature", "stac_version": "1.1.0", "id": "item-0", "geometry": { "type": "Point", "coordinates": [ -105.1019, 40.1672 ] }, "bbox": [ -105.1019, 40.1672, -105.1019, 40.1672 ], "properties": { "datetime": "2024-01-01T00:00:00Z" }, "links": [], "assets": { "data": { "href": "https://storage.googleapis.com/open-cogs/stac-examples/20201211_223832_CS2.jpg" } } }
Appending¶
One of STAC's key features is its flexibility. The core specification is minimal, so data producers are encouraged to use extensions and custom attributes to add expressiveness to their STAC items. This flexibility is an awkward fit with parquet (and arrow), which require fixed schemas, and many parquet implementations simply punt on appending to existing files.
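To make the fixed schema concrete, we can peek at the columns of the file we just wrote. This is a quick check with pyarrow, not part of the original example:
import pyarrow.parquet

# Every top-level item field and every property becomes a column in the Parquet schema.
schema = pyarrow.parquet.read_schema("items.parquet")
print(schema.names)
Any new item whose properties don't match these columns can't simply be appended to the file.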
To add new data to an existing stac-geoparquet data store, you can:
- Read, update, and write
- Create a new file and search over both, e.g. with DuckDB
Let's take a look at both options.
Read, update, and write¶
If all of your items fit in memory, you can read the existing items, add the new ones, and write everything back out. rustac takes care of updating the output schema to match the new items. It's not very elegant, but it works.
import time
new_items = [
    create_item(
        f"new-item-{i}",
        datetime.datetime(1986, 6, 14, tzinfo=datetime.timezone.utc)
        + datetime.timedelta(hours=i),
        {"foo": "bar"},  # add a new attribute that wasn't in the original schema
    )
    for i in range(9999)
]
start = time.time()
old_items = await rustac.read("items.parquet")
print(f"That took {time.time() - start:0.2f} seconds to read")
start = time.time()
await rustac.write("more-items.parquet", old_items["features"] + new_items)
print(f"That took {time.time() - start:0.2f} seconds to write")
all_the_items = await rustac.read("more-items.parquet")
print(
    len(
        list(item for item in all_the_items["features"] if "foo" in item["properties"])
    ),
    "items have a 'foo' property",
)
That took 0.37 seconds to read
That took 1.20 seconds to write
9999 items have a 'foo' property
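As a quick sanity check (again with pyarrow, not part of the original walkthrough), the rewritten file's schema now includes a column for the new property:
import pyarrow.parquet

# rustac widened the schema when it wrote the combined items.
schema = pyarrow.parquet.read_schema("more-items.parquet")
print("foo" in schema.names)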
Create a new file¶
Some tools, like DuckDB, can query across multiple parquet files. This lets you write your new items in a second file next to your old one, then query across both.
import duckdb
await rustac.write("new-items.parquet", new_items)
duckdb.sql(
    "select id, datetime, geometry from read_parquet(['items.parquet', 'new-items.parquet'])"
)
┌───────────┬──────────────────────────┬───────────────────────────┐
│    id     │         datetime         │          geometry         │
│  varchar  │ timestamp with time zone │          geometry         │
├───────────┼──────────────────────────┼───────────────────────────┤
│ item-0    │ 2023-12-31 17:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-1    │ 2023-12-31 18:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-2    │ 2023-12-31 19:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-3    │ 2023-12-31 20:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-4    │ 2023-12-31 21:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-5    │ 2023-12-31 22:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-6    │ 2023-12-31 23:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-7    │ 2024-01-01 00:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-8    │ 2024-01-01 01:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9    │ 2024-01-01 02:00:00-07   │ POINT (-105.1019 40.1672) │
│     ·     │            ·             │             ·             │
│     ·     │            ·             │             ·             │
│     ·     │            ·             │             ·             │
│ item-9990 │ 2025-02-19 23:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9991 │ 2025-02-20 00:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9992 │ 2025-02-20 01:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9993 │ 2025-02-20 02:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9994 │ 2025-02-20 03:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9995 │ 2025-02-20 04:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9996 │ 2025-02-20 05:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9997 │ 2025-02-20 06:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9998 │ 2025-02-20 07:00:00-07   │ POINT (-105.1019 40.1672) │
│ item-9999 │ 2025-02-20 08:00:00-07   │ POINT (-105.1019 40.1672) │
├───────────┴──────────────────────────┴───────────────────────────┤
│ ? rows (>9999 rows, 20 shown)                           3 columns │
└──────────────────────────────────────────────────────────────────┘
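If you'd rather work with the query result in Python than print it, a DuckDB relation can be materialized, for example as a pandas DataFrame. This is a small sketch, assuming pandas is installed:
import duckdb

# Materialize the relation as a pandas DataFrame for further processing.
df = duckdb.sql(
    "select id, datetime from read_parquet(['items.parquet', 'new-items.parquet'])"
).df()
print(len(df))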
Even though our old items don't have a foo property, we can still query on it with DuckDB by setting union_by_name = true.
duckdb.sql(
    "select count(*) from read_parquet(['items.parquet', 'new-items.parquet'], union_by_name = true) where foo = 'bar'"
)
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         9999 │
└──────────────┘
If we don't set union_by_name = true, we get an error because of the schema mismatch.
duckdb.sql("select id, foo from read_parquet(['items.parquet', 'new-items.parquet'])")
---------------------------------------------------------------------------
BinderException                           Traceback (most recent call last)
Cell In[73], line 1
----> 1 duckdb.sql("select id, foo from read_parquet(['items.parquet', 'new-items.parquet'])")

BinderException: Binder Error: Referenced column "foo" not found in FROM clause!
Candidate bindings: "bbox"
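Because union_by_name = true fills the missing column with NULL for the file that lacks it, you can also handle the mismatch directly in SQL, for example with coalesce. This is a small sketch, not from the original notebook:
# Old items have no "foo" column; union_by_name fills it with NULL,
# and coalesce substitutes a placeholder value.
duckdb.sql(
    "select id, coalesce(foo, 'n/a') as foo"
    " from read_parquet(['items.parquet', 'new-items.parquet'], union_by_name = true)"
    " limit 5"
)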