Creating an Open Source Data Science Platform to Improve Transit Outcomes in California

Background

The California Integrated Travel Project (Cal-ITP) is a statewide effort to make transit easier to use and more cost-effective for riders in California.

Cal-ITP’s goals include uniting the state’s disconnected transit systems by enabling contactless payment methods, expanding access for riders to real-time updates about every transit trip, ensuring transit is fully accessible, and easing the burden on lower-income residents by better integrating benefit and discount vouchers across the state.

Opportunity

Cal-ITP needed a technology partner to help it collect data about California’s 300+ transit agencies and transform that data into centralized insights and assessments. Those data sources include:

  • GTFS (Allows riders to see routes and stops on apps like Google Maps)
  • GTFS-RT (Allows riders to see when a bus or train is coming in real-time)
  • Payments information (Collected from vendors on a per-trip basis)
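
To make the first of these sources concrete, here is a minimal sketch of reading stop locations from a GTFS feed. The column names (stop_id, stop_name, stop_lat, stop_lon) come from the public GTFS specification; the sample rows and the parse_stops helper are hypothetical and are not taken from the Cal-ITP codebase.

```python
import csv
import io

# Hypothetical two-row excerpt of a GTFS stops.txt file. Real feeds ship this
# file inside a zip archive alongside routes.txt, trips.txt, and others.
SAMPLE_STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
1001,Main St & 1st Ave,34.0522,-118.2437
1002,Main St & 5th Ave,34.0560,-118.2401
"""


def parse_stops(text):
    """Parse the contents of a GTFS stops.txt file into a list of dicts."""
    stops = []
    for row in csv.DictReader(io.StringIO(text)):
        stops.append(
            {
                "stop_id": row["stop_id"],
                "stop_name": row["stop_name"],
                # Coordinates arrive as text, so convert them to floats.
                "lat": float(row["stop_lat"]),
                "lon": float(row["stop_lon"]),
            }
        )
    return stops


stops = parse_stops(SAMPLE_STOPS_TXT)
print(len(stops), stops[0]["stop_name"])
```

A statewide warehouse would repeat a step like this for every file in every agency's feed, which is part of why centralizing the data is valuable.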

Cal-ITP additionally needed help with strategy, research, design, and general support to scale their operations.

Approach

Cal-ITP brought in the Jarvus team early on in the project. With the help of Cal-ITP staff and other key partners like Compiler, Jarvus staff designed a scalable open source data warehouse, key processes for assessments and outcomes, and feedback loops with the agencies that Cal-ITP serves.

Jarvus evaluated the data that agencies currently publish, interviewed agencies and key stakeholders, and surveyed the existing tools and the open source transit and data ecosystems.

Solution

While Jarvus has contributed an array of solutions, including app development, product management, design, and other consultation, the foundation of the work focused on two key data sources: GTFS and open loop payment vendor data.

Open Source Data Warehouse and Data Science Platform

Jarvus has created an open source data warehouse and data science platform for Cal-ITP that ingests, transforms, warehouses, and publishes federated data from disparate sources. These rich data are representative of hundreds of agencies throughout the state, and include transit routes and schedules, real-time vehicle positions, and passenger transactions from onboard contactless payment devices.

This platform helps stakeholders make informed business and policy decisions that improve rider experiences. The technology for this stack includes:

Data Sources

  • Open loop payments vendors
  • GTFS static data
  • GTFS-RT (real-time) data

ELT Stack

  • Airflow (orchestration)
  • dbt (transformation)
  • BigQuery (storage)

Dashboard Solution

  • Metabase

Data Science Platform

  • JupyterHub hosted within Google Kubernetes Engine

This implementation uses Airflow to orchestrate data ingestion tasks and dbt as the data pipeline and transformation layer. The data itself is stored in BigQuery, with a focus on flexible data views for analysts, efficient data ingestion, and transparency for data engineers and consumers.
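
The ordering that an orchestrator enforces in an extract-load-transform pipeline can be sketched with the Python standard library alone. This is not Airflow or dbt code, and the task names are hypothetical; it only illustrates the idea that raw data is extracted and loaded before any transformation runs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
PIPELINE = {
    "extract_gtfs_schedule": set(),
    "extract_gtfs_rt": set(),
    "extract_payments": set(),
    "load_raw_to_warehouse": {
        "extract_gtfs_schedule",
        "extract_gtfs_rt",
        "extract_payments",
    },
    "transform_models": {"load_raw_to_warehouse"},
    "publish_views": {"transform_models"},
}


def run_order(graph):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(PIPELINE)
for step, task in enumerate(order, start=1):
    print(step, task)
```

In a real deployment the orchestrator also handles scheduling, retries, and logging; the dependency graph is the part this sketch isolates.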

The solution also includes a custom data ingestion system, built outside of Airflow to meet the platform’s high-performance requirements.

The codebase that orchestrates the data warehouse and data science platform is open source and publicly documented.

Outcomes

Data Warehouse

With Jarvus’ help, Cal-ITP currently ingests and analyzes over 1 million files per day, and makes that data immediately accessible to data engineers, analysts, and the agencies Cal-ITP supports. This includes capturing and archiving real-time vehicle positions across the entire state every 20 seconds.
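
As a rough illustration of what a 20-second capture cadence implies, the sketch below counts snapshots per feed per day and assigns each snapshot to a time-partitioned archive path. Only the 20-second interval comes from the text above; the feed name, the path layout, and the .pb extension are assumptions made for the example and do not describe Cal-ITP's actual storage scheme.

```python
from datetime import datetime, timezone

CAPTURE_INTERVAL_SECONDS = 20  # cadence stated in the text


def snapshots_per_day(interval_seconds=CAPTURE_INTERVAL_SECONDS):
    """Number of capture windows in one 24-hour day."""
    return 86_400 // interval_seconds


def floor_to_interval(ts, interval_seconds=CAPTURE_INTERVAL_SECONDS):
    """Round a timezone-aware timestamp down to the start of its window."""
    epoch = int(ts.timestamp())
    window_start = epoch - epoch % interval_seconds
    return datetime.fromtimestamp(window_start, tz=timezone.utc)


def archive_key(feed_name, ts):
    """Build a hypothetical time-partitioned object path for one snapshot."""
    window = floor_to_interval(ts)
    return f"{feed_name}/dt={window:%Y-%m-%d}/hour={window:%H}/{window:%H%M%S}.pb"


captured_at = datetime(2022, 1, 1, 12, 0, 47, tzinfo=timezone.utc)
print(snapshots_per_day())
print(archive_key("example_vehicle_positions", captured_at))
```

At this cadence a single real-time feed produces 4,320 snapshots per day, so file counts grow quickly as more feeds are archived.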

Data analysts can access data directly from the data warehouse in Jupyter notebooks that can be easily published and shared through a hosted JupyterHub implementation.

Data analysts, product managers, agencies themselves, and other stakeholders can also create and view dashboards from the warehouse data using Metabase.

Jarvus delivered an entirely self-hosted and open source infrastructure that connects to Cal-ITP’s Google Cloud Platform account, making it possible for them to focus on procuring commodity capacity in one place instead of licenses and hosting services across an array of vendors and pricing schemes.

The result is highly cost-effective for Cal-ITP: there are no vendor costs for the open source tooling, and the data is stored efficiently in BigQuery.

Documentation and Data Discoverability

The data warehouse and the details of the data pipeline are publicly documented as part of the process. This creates greater transparency and reduces friction as the data is used internally and throughout the network.

Data Science Analysis and Reports

Cal-ITP data scientists and engineers have begun to publish public reports and analyses about the transit network using the data from the warehouse.

Jarvus created a new toolkit to facilitate this, which allows analysts to automatically create publicly hosted reports built on top of parameterized Jupyter notebooks. Some examples include:

Transit Speed Maps

These maps, tables, and charts provide an overview of typical weekday transit vehicle speeds, identifying slower parts of bus routes that would be candidates for projects to speed up buses.
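
One way to derive speeds from archived vehicle positions is to divide the great-circle distance between consecutive position reports by the elapsed time. The sketch below takes that approach under simplifying assumptions: the (timestamp, latitude, longitude) tuples are hypothetical, and straight-line distance between reports understates distance on curved routes. It is not the method used in the published Cal-ITP analysis.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
MPS_TO_MPH = 2.23694  # meters per second to miles per hour


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def segment_speeds_mph(positions):
    """Speeds between consecutive (unix_seconds, lat, lon) reports.

    Pairs with zero or negative elapsed time (duplicate or out-of-order
    reports) are skipped rather than producing a division by zero.
    """
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(positions, positions[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue
        meters = haversine_m(lat0, lon0, lat1, lon1)
        speeds.append(meters / elapsed * MPS_TO_MPH)
    return speeds


# Hypothetical reports for one vehicle; the last one repeats a timestamp.
reports = [(0, 0.0, 0.0), (20, 0.001, 0.0), (20, 0.002, 0.0)]
speeds = segment_speeds_mph(reports)
print([round(s, 1) for s in speeds])
```

Averaging segment speeds like these over many weekdays is one way to surface the consistently slow stretches of a route.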

Parallel Corridors Analysis

This analysis helps agencies by identifying corridors in which they can improve service by increasing frequency on the routes most competitive with car travel.

High Quality Transit Areas

This assessment provides an examination of facilities in High Quality Transit Areas that are candidates for retrofits.
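
Reports like the ones above are built on parameterized notebooks, and the general pattern can be sketched without any notebook tooling. The example below treats a notebook as a plain dictionary in the nbformat 4 layout and adds a cell of overrides after the cell tagged "parameters", a convention used by tools such as papermill. The source does not say how the Jarvus toolkit implements this, so the function, the template, and the agency names are all hypothetical.

```python
import copy


def inject_parameters(notebook, params):
    """Return a copy of a notebook dict with an overrides cell added.

    The new cell goes right after the first cell tagged "parameters",
    or at the top of the notebook if no cell carries that tag.
    """
    nb = copy.deepcopy(notebook)
    source = "\n".join(f"{name} = {value!r}" for name, value in params.items())
    injected = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
        "outputs": [],
        "execution_count": None,
    }
    for index, cell in enumerate(nb["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            nb["cells"].insert(index + 1, injected)
            break
    else:
        nb["cells"].insert(0, injected)
    return nb


# Hypothetical two-cell template: defaults first, then the analysis.
TEMPLATE = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "agency = 'default'",
            "outputs": [],
            "execution_count": None,
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": "print(agency)",
            "outputs": [],
            "execution_count": None,
        },
    ],
}

# One notebook per agency, all derived from the same template.
reports = {
    name: inject_parameters(TEMPLATE, {"agency": name})
    for name in ["Agency A", "Agency B"]
}
print(reports["Agency B"]["cells"][1]["source"])
```

Because the override cell runs after the defaults, the same template can produce one report per agency without being edited by hand.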

Next Steps

Jarvus will continue to build on the capabilities of the existing data warehouse and help expand the tools that support agencies. Potential next steps include better validation tools for agencies, as well as better reports and metrics around payment data.