Dimensional Modeling in dbt: A Practical Guide for Transit Data

A framework for organizing transit data warehouses using dimensional modeling principles with dbt, including practical design guidelines and examples.

January 8, 2025
Laurie Merrell

If you’re building a data warehouse for transit analytics, you’ve probably encountered a fundamental challenge: how do you organize hundreds of data tables in a way that’s both technically sound and actually usable by analysts? After years of working on transit data projects, we’ve found that dimensional modeling remains one of the most practical frameworks for solving this problem, and dbt is our preferred tool for organizing the transformation code. This post will walk through some of the key considerations for this approach.

Why Dimensional Modeling Still Matters

It’s true that modern cloud warehouses have largely eliminated the performance constraints that originally motivated dimensional modeling. But dimensional modeling isn’t primarily about performance anymore; it’s about creating a shared language for your organization. A clear, accessible data model that accurately reflects business needs allows colleagues from different areas to have productive conversations about data instead of talking past one another or quibbling about why two numbers don’t agree.

Dimensional modeling hits that sweet spot: it provides a framework that guides your modeling toward consistency and coherence while leaving enough flexibility to adapt it to your organization.

Facts and Dimensions: The Building Blocks

At its core, dimensional modeling is based on creating two types of tables:

Facts are events – things that happened. A bus arrived at a stop. A passenger tapped their card. A trip was scheduled. Facts are verbs, and they’re what you count. How many boardings? How many trips? How much delay?

Dimensions are the nouns – the entities involved in those events. The vehicle, the route, the stop, the operator. Dimensions provide context for facts. You don’t usually count dimensions directly; you count facts by dimensions. How many boardings per route? How much delay by operator?
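
In SQL terms, “counting facts by dimensions” is usually just a join from the fact table to a dimension plus a group-by. A minimal sketch, assuming hypothetical fct_passenger_boardings and dim_routes tables:

```sql
-- Boardings per route: count rows in the fact table, grouped by a dimension attribute.
-- Table and column names here are illustrative, not from any specific schema.
select
    routes.route_short_name,
    count(*) as total_boardings
from fct_passenger_boardings as boardings
inner join dim_routes as routes
    on boardings.route_key = routes.route_key
group by routes.route_short_name
order by total_boardings desc
```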

Practical Design Guidelines

Consider a common transit data scenario: recording when a bus arrives at a stop. In your source system, this might come from your AVL (Automatic Vehicle Location) system as a stream of GPS pings and geofence events. How do you model this dimensionally?

The core fact you’re targeting is straightforward: a vehicle arrived at a stop at a specific time. That’s your fct_stop_events table.

But what dimensions are involved? By default you might have:

  • The vehicle (dim_vehicles) with information like series, model, etc.
  • The stop (dim_stops) with information like latitude, longitude, facility attributes (e.g. bus shelter), etc.
  • The trip (dim_trips) with information like trip start and end times, trip headsign, etc.

But it’s not totally cut and dried. What about the route? Is that an attribute of the stop event, or do you look it up through the trip? Or what about the operator driving the vehicle—is that directly part of the event or looked up via a block assignment on the trip or vehicle?
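
To make this concrete, here is one possible shape for the fact as a dbt model. This is a sketch, not a prescription: the model and column names are assumptions, and resolving the route through the trip is just one of the design choices discussed above.

```sql
-- models/marts/fct_stop_events.sql (illustrative)
-- Grain: one row per vehicle arrival at a stop.
select
    events.stop_event_id,
    events.arrived_at,
    events.vehicle_key,            -- foreign key to dim_vehicles
    events.stop_key,               -- foreign key to dim_stops
    events.trip_key,               -- foreign key to dim_trips
    trips.route_key,               -- route resolved through the trip (one possible choice)
    events.schedule_deviation_sec
from {{ ref('stg_avl__stop_events') }} as events
left join {{ ref('dim_trips') }} as trips
    on events.trip_key = trips.trip_key
    -- if dim_trips is a Type 2 dimension, you would also constrain on its validity dates here
```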

There’s no perfect answer. The “right” design depends on your specific use cases and query patterns. That said, we have developed some rules of thumb.

For Dimensions

One canonical dimension per entity. Don’t create dim_buses and dim_articulated_buses. Make articulation a property of dim_buses. This might seem obvious, but you’d be surprised how often this rule gets broken when different business units have different definitions of the same entity.

Use Type 2 Slowly Changing Dimensions. Add valid_from and valid_to dates to track how dimensions change over time.
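
In dbt, snapshots are the built-in mechanism for capturing this kind of history. A minimal sketch, assuming a hypothetical vehicles source; dbt generates dbt_valid_from and dbt_valid_to columns, which you can rename to valid_from and valid_to in a downstream dimension model:

```sql
-- snapshots/vehicles_snapshot.sql (illustrative)
{% snapshot vehicles_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='vehicle_id',
        strategy='check',
        check_cols=['series', 'model', 'is_articulated']
    )
}}

-- Each time the snapshot runs, changed rows are closed out and new versions are inserted.
select * from {{ source('fleet', 'vehicles') }}

{% endsnapshot %}
```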

Keep foreign keys minimal. Dimensions should be relatively self-contained. If you find yourself with lots of foreign keys in a dimension table, you might not have a well-defined entity.

For Facts

Define your grain explicitly. What exactly is one row in your fact table? One passenger boarding? One stop arrival? One aggregated hour of ridership by route? Be precise and specific. If you can’t clearly explain what a row in the table represents, how will analysts know how to use the table?

Include relevant foreign keys. Facts can and should have multiple foreign keys to relevant dimensions. Your fct_passenger_boardings might reference vehicles, stops, routes, fare products, and time dimensions.

Denormalize thoughtfully. There is always a balancing act: if you denormalize too much, the table can become unwieldy and brittle since changes to just one attribute affect the whole table, but some denormalization helps performance and usability. For example, including route_name alongside route_id in your fact table can save countless repetitive joins in common queries.
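
For example, a fact model might carry both the surrogate key and a human-readable label. A sketch with hypothetical model and column names:

```sql
-- models/marts/fct_passenger_boardings.sql (illustrative)
-- Denormalizing route_name onto the fact saves a join in most day-to-day queries,
-- at the cost of rebuilding the fact table if route names change.
select
    boardings.boarding_id,
    boardings.boarded_at,
    boardings.route_key,
    routes.route_name,            -- denormalized label copied from dim_routes
    boardings.stop_key,
    boardings.vehicle_key,
    boardings.fare_product_key
from {{ ref('int_apc__boardings_enriched') }} as boardings
left join {{ ref('dim_routes') }} as routes
    on boardings.route_key = routes.route_key
```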

The GTFS Special Case

If you work with operational data, there’s a good chance you’ll end up working with GTFS (General Transit Feed Specification) data. GTFS is somewhat unusual from a dimensional modeling perspective because it is essentially already a mini relational database.

We recommend treating raw GTFS tables as dimensions that inherit their versioning (valid from and valid to dates) from centralized feed-level versioning: instead of versioning individual routes or stops, you version entire feeds and propagate that versioning to all contained entities.

While this approach duplicates some data storage for entities that are stable over time, it significantly simplifies intra-GTFS joins by allowing you to join on a feed identifier instead of having to navigate different date ranges for each record in a feed. This is important because GTFS data requires a number of joins to prepare it for analytical use.
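
Concretely, once every GTFS table carries a feed-level key, intra-feed joins reduce to equality joins on that key plus the usual GTFS identifiers. A sketch, assuming a feed_key column and staging model names like those used later in this post:

```sql
-- Join stop_times to trips and routes within a single feed version.
-- feed_key identifies one version of one feed, so no per-record date-range logic is needed.
select
    stop_times.feed_key,
    stop_times.trip_id,
    stop_times.stop_id,
    stop_times.arrival_time,
    trips.route_id,
    routes.route_short_name
from {{ ref('stg_gtfs__stop_times') }} as stop_times
inner join {{ ref('stg_gtfs__trips') }} as trips
    on stop_times.feed_key = trips.feed_key
    and stop_times.trip_id = trips.trip_id
inner join {{ ref('stg_gtfs__routes') }} as routes
    on trips.feed_key = routes.feed_key
    and trips.route_id = routes.route_id
```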

Bridging to the Medallion Architecture and Your dbt Project

Once you’ve mapped your business concepts into a coherent dimensional model, you will need to organize the tables in your warehouse according to your medallion architecture (or a similar framework) and structure the dbt project or other code that orchestrates the transformations.

One example approach would be:

  • The bronze medallion layer contains raw and partially-transformed data—the sources, staging models, and intermediate models of a dbt project. The data in this layer is not dimensionally modeled. This is where you land your raw GTFS feeds, AVL data streams, and fare system exports.
  • The silver medallion layer contains clean, dimensionally modeled data at the entity level. Your core dimension and fact tables live here in the mart layer of a dbt project. Analysts can work with this data directly.
  • The gold medallion layer contains aggregated metrics for business users. These might still be dimensionally modeled facts at a higher grain, like daily ridership by route rather than individual boardings. These would be the aggregations and metrics of a dbt project.

Only your silver and (potentially) gold layers need to be dimensionally modeled. Your bronze layer can be as messy as needed to get data into the warehouse efficiently.

From a dbt project perspective, it is fine if you have intermediate models in the bronze layer that depend on dimensionally-modeled mart models in your silver layer. The data dependencies for your transformations are a distinct consideration from the conceptual organization of your dimensional modeling.
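
One way to make this mapping explicit is folder-level configuration in dbt_project.yml. A sketch, assuming a project named transit_warehouse with staging, intermediate, marts, and metrics folders; the folder names, schema names, and materializations are illustrative choices, not a prescribed layout:

```yaml
# dbt_project.yml (illustrative): map model folders to medallion layers.
models:
  transit_warehouse:
    staging:            # bronze: lightly renamed and typed source data
      +schema: bronze
      +materialized: view
    intermediate:       # bronze: reusable transformations, not user-facing
      +schema: bronze
      +materialized: ephemeral
    marts:              # silver: dimensional facts and dimensions
      +schema: silver
      +materialized: table
    metrics:            # gold: aggregated, dashboard-ready facts
      +schema: gold
      +materialized: table
```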

A More Complete Example

We can tie this together with a simplified example. Say we want to track daily ridership. We might create the following models:

Bronze Layer:

  • stg_gtfs__trips, stg_gtfs__routes, stg_gtfs__stops, etc. (raw GTFS data)
  • stg_avl__stop_events (raw AVL events)
  • stg_apc__boardings (raw APC boarding counts)
  • int_gtfs__daily_service (intermediate transformation of GTFS calendar services)
  • int_match_avl_to_gtfs_stops (intermediate transformation joining AVL data to GTFS stops — contains cleaning logic or fuzzy location matching)
  • int_match_apc_to_gtfs_stops (intermediate transformation joining APC to GTFS stops — may require mapping stop identifiers)

Silver Layer:

  • dim_routes, dim_stops, etc. (dimensions derived from GTFS)
  • fct_stop_events (fact derived from AVL events, joined with GTFS data and stop-level ridership from APC boardings)

Gold Layer:

  • fct_daily_route_ridership (aggregated fact from stop-level ridership)

Notice how the bronze layer includes both staging (stg_) and intermediate (int_) models that don’t follow dimensional patterns. That’s fine – they’re building blocks, not user-facing tables.

The silver layer has our core dimensional model: clear dimensions and facts at the most granular level. Analysts can join these tables to answer detailed questions about transit operations.

The gold layer provides pre-aggregated metrics that business users can drop directly into dashboards without writing complex queries.
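
The gold-layer model itself can be little more than an aggregation of the silver-layer fact. A sketch, with column names that are assumptions rather than a fixed schema:

```sql
-- models/metrics/fct_daily_route_ridership.sql (illustrative)
-- Grain: one row per service date per route.
select
    stop_events.service_date,
    stop_events.route_key,
    routes.route_name,
    sum(stop_events.boardings) as total_boardings,
    count(*) as num_stop_events
from {{ ref('fct_stop_events') }} as stop_events
left join {{ ref('dim_routes') }} as routes
    on stop_events.route_key = routes.route_key
group by 1, 2, 3
```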

Start Simple, Iterate Often

You don’t need to create a perfect, complete dimensional model on the first try. Start with the most obvious facts and dimensions. Get feedback from users. Refine based on actual query patterns. Tools like dbt enable you to iterate and validate quickly, and a dimensional modeling framework gives you the vocabulary and structure to have those conversations effectively.

Final Thoughts

After years of working with transit data, we’ve learned that the technical challenges are usually solvable. The hard part is getting everyone to agree on what the data means. Dimensional modeling isn’t perfect, but it provides a framework for having those conversations.

When you sit down with operations folks and ask “What stop identifiers are used in this system?” or “How do you define a delay?”, you’re not just building a data model—you’re building a shared understanding. That’s the real value of this approach.

So start with your facts and dimensions. Use dbt’s conventions as defaults unless you have good reasons to deviate. Test everything, especially those slowly changing dimension date ranges. And remember: rules of thumb cover most cases nicely, but sometimes you need to make judgment calls.
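
For those date ranges, a singular dbt test is one lightweight option: any query saved under tests/ fails if it returns rows. A sketch, assuming a dim_vehicles model with valid_from and valid_to columns:

```sql
-- tests/assert_dim_vehicles_versions_do_not_overlap.sql (illustrative)
-- Fails if any vehicle has two validity windows that overlap.
select
    a.vehicle_id,
    a.valid_from,
    a.valid_to
from {{ ref('dim_vehicles') }} as a
inner join {{ ref('dim_vehicles') }} as b
    on a.vehicle_id = b.vehicle_id
    and a.valid_from <> b.valid_from
    and a.valid_from < coalesce(b.valid_to, cast('9999-12-31' as date))
    and b.valid_from < coalesce(a.valid_to, cast('9999-12-31' as date))
```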

Your transit system is unique, and your data model should accommodate that uniqueness while following comprehensible patterns to create user-friendly data.

Ready to discuss your data challenges?

Email us directly at transit@jarv.us