
Geospatial Data at Scale (Part 1): Why File Format Matters

The wrong format can 10× your storage costs before analysis even begins


By Jared Lander and Joe Marlo



This is the first post in a three-part series on handling large geospatial data. The throughline: geographic data keeps getting bigger, and each stage of the pipeline has breaking points and solutions.


Big is relative. What counts as big for a laptop is not big for a server. What was big ten years ago is routine today. But file format choice? That matters more than most people realize. The wrong format can 10x your storage costs and slow every downstream operation before you even start analyzing anything.


The File Format Problem


We loaded a dataset of 325,000 points (not huge, but big enough by geospatial standards to make you think about your computing environment) and saved it in a few common geospatial formats. The results were striking:

  • CSV: 35 MB

  • GeoJSON: 107 MB

  • KML: 144 MB

  • GeoParquet: 16 MB


CSV is the smallest of the plain-text formats, but it is really only practical for point data. You could jam a polygon in there, but we are not getting into that. GeoJSON is far more common in the geospatial world, but it is verbose: every column name is repeated for every record. It is what Edward Tufte would call chartjunk, except here it is data junk. KML is similar, just with XML syntax instead of JSON.
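To see that verbosity concretely, here is what two rows of hypothetical point data look like as GeoJSON (the property names are illustrative): every single feature repeats the full set of keys.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": { "type": "Point", "coordinates": [-73.99, 40.73] },
      "properties": { "station_id": 101, "status": "active", "last_inspected": "2024-05-01" }
    },
    {
      "type": "Feature",
      "geometry": { "type": "Point", "coordinates": [-73.97, 40.75] },
      "properties": { "station_id": 102, "status": "active", "last_inspected": "2024-05-02" }
    }
  ]
}
```

Multiply those repeated keys by 325,000 records and the overhead adds up fast; a columnar format stores each column name exactly once.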


And shapefiles? A shapefile is not really one file; it is a bundle of four or five files where you read in one and all the others have to come along with it. It is crazy. We are not going to talk about shapefiles.


GeoParquet Is the Answer


GeoParquet is Parquet, but for geospatial data. If you do not know Parquet, it is a columnar data format that is compressed, fast and closely related to Arrow: Arrow is the in-memory format, Parquet is the on-disk format, and they are designed to work together.


At 16 MB, GeoParquet is half the size of CSV and a tenth the size of KML for the same data. It is also much faster to read and write. The tradeoff is that it is binary, not plain text, so you cannot open it in a text editor. For anything beyond quick inspection, that is a worthwhile trade.


Creating these files is straightforward. With the sf package you can use write_sf() for most formats. For GeoParquet, use the geoarrow package with write_parquet(). If you are working from the command line, ogr2ogr handles conversions between formats.
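As a sketch, the round trip in R looks like this (file names and the `points` object are placeholders; exact geoarrow behavior can vary by version):

```r
library(sf)
library(arrow)
library(geoarrow)  # registers geometry support so arrow can handle sf columns

# points is an sf object with 325,000 point geometries
write_sf(points, "points.geojson")  # GeoJSON via GDAL
write_sf(points, "points.kml")      # KML via GDAL

# GeoParquet: with geoarrow loaded, arrow can serialize the geometry column
write_parquet(points, "points.parquet")

# reading it back; depending on your geoarrow version you may need
# st_as_sf() to convert the result back into an sf object
points2 <- read_parquet("points.parquet")
```

Compare the resulting file sizes on disk and you should see the same pattern as the numbers above: the plain-text formats balloon while the Parquet file stays small.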


The GeoParquet specification hit maturity with version 1.0 in 2023 and now has support in over 20 tools and libraries. Major platforms like Snowflake, BigQuery, Databricks and LanceDB can all work with it natively. Overture Maps uses it as their core format.


When Files Are Not Enough

Files work until you need concurrent writes, real-time ingestion, or multiple users hitting the same data. At that point, you need a database.


Everything we do ends up in Postgres anyway. PostGIS adds geographic operations on top of it, and if your data has timestamps — and everything has timestamps — TimescaleDB adds hypertables that automatically partition by time. You create a table with a geometry column, add your spatial index, and let TimescaleDB handle the rest. Queries with a time filter scan only the relevant partitions instead of the full table.
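A minimal sketch of that setup, with hypothetical table and column names:

```sql
-- enable the extensions (once per database)
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- a table with a timestamp and a geometry column
CREATE TABLE inspections (
    recorded_at  timestamptz NOT NULL,
    crew_id      integer,
    geom         geometry(Point, 4326)
);

-- let TimescaleDB partition the table by time
SELECT create_hypertable('inspections', 'recorded_at');

-- spatial index so geographic filters stay fast
CREATE INDEX inspections_geom_idx ON inspections USING GIST (geom);
```

From there, any query with a filter on recorded_at touches only the matching partitions, and the GIST index handles the spatial side.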


Real-World Patterns


We have seen this play out with clients. A utility company was tracking power line inspections and outage reports, with field crews submitting GPS-tagged inspection records daily. They started with GeoJSON exports from their mobile app vendor, but file transfers were slow and storage costs ballooned. They moved to GeoParquet for archival (10x smaller files) and PostGIS with TimescaleDB for live queries against inspection history.


A property insurer maintaining a database of insured locations for catastrophe modeling had a different problem. They needed to quickly identify which policies fall within a hurricane's projected path. Spatial indexes and time-partitioned tables let them run exposure queries in seconds rather than minutes. When a storm is approaching, minutes matter.
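An exposure query of that shape might look like the following (table names, columns, and the forecast window are hypothetical): the spatial index handles the geometry filter, and time partitioning prunes everything outside the forecast window.

```sql
-- policies at insured locations inside the projected storm path,
-- restricted to the current forecast window
SELECT p.policy_id, p.insured_value
FROM policies AS p
JOIN storm_forecast AS s
  ON ST_Intersects(p.geom, s.projected_path)
WHERE s.forecast_time >= now() - interval '6 hours';
```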


The Takeaway


For most geospatial work: use GeoParquet. It is smaller, faster and well-supported. When you need to handle multiple concurrent users, move to PostGIS. Add TimescaleDB if your data has a time dimension, which it probably does.


Next up in Part 2: what happens when you actually need to compute on this data, and why DuckDB has become our go-to for spatial queries.


Jared P. Lander

Founder and Chief Data Scientist, Lander Analytics

Joe Marlo

Director of Data Science

Lander Analytics



Subscribe to our Substack, and below to our monthly emails, for practical AI strategies for your organization: what to build, what to avoid, and how to make systems reliable in the real world.


Work with us: If you want help identifying the right first workflow, building a permissioned knowledge base, or training your team to ship responsibly, reach out at info@landeranalytics.com.


About the author: Jared P. Lander is Chief Data Scientist and founder of Lander Analytics, where he helps organizations build practical, measurable AI workflows grounded in strong data foundations.


About the author: Joe Marlo is Director of Data Science at Lander Analytics, where he designs agentic workflows, statistical models, and interactive frontends that put rigorous analysis into production.





© 2026 Lander Analytics
