Using the Modern Data Stack to Improve Transit Outcomes in California
The California Integrated Travel Project (Cal-ITP) is a statewide effort to make transit easier to use and more cost-effective for riders in California.
Cal-ITP needed a technology partner to help them collect data about California’s 200+ transit agencies, and transform that data into usable models to support a wide range of data users and analysis products. Jarvus was selected as that partner and determined that Cal-ITP offered an exciting opportunity to bring modern data stack principles to a public sector context.
Cal-ITP’s data infrastructure serves a diverse ecosystem of users and project goals, including both internal and external data stakeholders, all of whom have different needs and levels of self-service. For example, project leadership may ask for a quick number that they can share with an outside stakeholder while data analysts are working on long-term research projects.
Cal-ITP’s data sources are heterogeneous. In many ways the project resembles a research initiative more than traditional business analytics, because it involves scraping data from open sources that offer fewer guarantees than the structured APIs common in business contexts. Because the project emphasizes assessing data quality, the ingest pipeline must have extremely reliable uptime and capture rates. The nature of the GTFS data specification means that ingested GTFS data varies in completeness, uses a variety of components and features, and can represent the same concept in multiple ways. The project’s data sources also vary widely in size and update frequency, from large real-time data captured every 20 seconds to a very small, manually maintained internal database that is updated perhaps weekly.
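To make the idea of a capture rate concrete, here is a minimal sketch of the arithmetic, assuming one capture is attempted per feed every 20 seconds as described above. The function names and the observed count in the usage note are hypothetical illustrations, not part of Cal-ITP's actual pipeline.

```python
from datetime import timedelta


def expected_captures(window: timedelta, interval_seconds: int = 20) -> int:
    """Captures expected in `window` if one is attempted every `interval_seconds`."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return int(window.total_seconds() // interval_seconds)


def capture_rate(observed: int, window: timedelta, interval_seconds: int = 20) -> float:
    """Fraction of expected captures that were actually stored (hypothetical metric)."""
    expected = expected_captures(window, interval_seconds)
    if expected == 0:
        raise ValueError("window is shorter than one capture interval")
    return observed / expected


# One day at a 20-second cadence is 86,400 / 20 = 4,320 expected captures per feed.
per_day = expected_captures(timedelta(days=1))
```

Under these assumptions, a feed with 4,277 stored captures in a day would have a capture rate of 4,277 / 4,320, or roughly 99%. Gaps of this kind matter because a missing capture cannot be distinguished from a data quality problem at the source unless the pipeline itself is known to be reliable.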
One primary goal of the project’s data stakeholders is to create data products that are built and maintained by the data users directly, rather than each being maintained by the Jarvus data services team. The list of data products was not fully defined at the project’s outset, but it was known to include at least:
- Agency-facing GTFS quality reporting at reports.calitp.org
- Agency-facing dashboards of payment transaction data
- Public-facing GTFS open data, hosted on the California Open Data Portal
Over time, the original products have matured and added new feature requirements, and new products have been developed, including a variety of published analyses.
The first step in developing a solution was to recognize that the data needs of the project would be better served by taking a platform approach than by tackling individual product, source, or user requirements one by one. The project needed an extensible data platform, rather than a set of bespoke pipelines tailored to each individual deliverable.
Assessing the requirements for such a platform made it clear that Cal-ITP’s needs align in many ways with the principles of the modern data stack. One clear formulation of those principles comes from Atlan, who define the modern data platform as characterized by:
- “Self-service for diverse users”
- “Agile data management”
- “Flexible, fast, pay as you go”
These features match well with Cal-ITP’s needs to facilitate collaboration, enable self-service, scale flexibly, and be cost-effective.
With these requirements in mind, Jarvus developed the following data platform:
Tools were selected with a preference for open-source tooling, and Google Cloud had already been selected as the cloud provider. However, this architecture can also be formulated in a tool-agnostic way:
The key benefits here (separating compute from storage, separating raw data from modeling, and using version control) are not tool-specific.
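As a tool-agnostic illustration of separating raw data from modeling, the sketch below keeps captured records read-only and derives a model from them with a pure function. The record fields (`feed`, `fetched_at`, `status`) and the model itself are hypothetical and chosen only for illustration; they do not describe Cal-ITP's actual schemas.

```python
from types import MappingProxyType

# Raw layer: records stored exactly as captured and never edited in place.
# All field names and values here are hypothetical.
RAW_RECORDS = (
    MappingProxyType({"feed": "agency_a", "fetched_at": "2024-01-01T00:00:00", "status": "200"}),
    MappingProxyType({"feed": "agency_a", "fetched_at": "2024-01-01T00:00:20", "status": "500"}),
    MappingProxyType({"feed": "agency_b", "fetched_at": "2024-01-01T00:00:00", "status": "200"}),
)


def model_fetch_success(raw_records):
    """Modeling layer: derive a per-feed success rate without touching the raw records."""
    totals = {}
    for record in raw_records:
        ok, seen = totals.get(record["feed"], (0, 0))
        totals[record["feed"]] = (ok + (record["status"] == "200"), seen + 1)
    return {feed: ok / seen for feed, (ok, seen) in totals.items()}


success_by_feed = model_fetch_success(RAW_RECORDS)
```

Because the model is only a function of the raw layer, a modeling mistake can be corrected and the model rebuilt from the same stored records, with no need to re-capture data that may no longer be available from the source. Keeping that function in version control gives a history of how each definition changed.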
With Jarvus’ help, Cal-ITP currently ingests and analyzes over 1 million files per day, and makes that data immediately accessible to data engineers, analysts, and the agencies Cal-ITP supports.
Through the shared BigQuery warehouse, Cal-ITP data users have access to a shared source of truth. Analysts can do complex research tasks using flexible Jupyter notebooks in an entirely browser-based workflow, and they can publish those directly to an analysis site to share their insights. Customer success managers can self-serve dashboards in Metabase to analyze their customers’ data quality and identify where support is needed.
Peer developers can leverage the robust documentation and developer experience of dbt to contribute their own models to the warehouse. And all of those users are looking at the same underlying data, ensuring that Cal-ITP can speak with a unified voice when answering data questions.