**DEPRECATED** in favor of the `download_parse_and_validate_gtfs.py` DAG task (this image is no longer run with a PodOperator).

# Wrapper for the MobilityData GTFS Schedule validator

This image exists to facilitate a pod operator executing a Java JAR that validates GTFS Schedule zipfiles and outputs the resulting notices (i.e., violations). The JARs in this folder are sourced from https://github.com/MobilityData/gtfs-validator/, and our Python code automatically selects which version is relevant for a given extract.

Note that unlike gtfs-rt-parser, this code does NOT include any parsing/unzipping of the underlying data. The RT code was written prior to substantial changes to calitp-py and therefore bundles together shared functionality that will eventually live in calitp-py.

See this PR for an example of how to integrate new versions of the underlying MobilityData-stewarded validator into our validation process. Note the addition of a new JAR file corresponding to the new validator version (taken from the Assets section at the bottom of that version's release page), along with a corresponding list of rules and their short descriptions, adapted from the canonical validator rules list.

In order to respect past validation outcomes, we don't re-validate old data using the latest available version of the validator. Instead, we use extract dates to determine which version of the validator was correct to use at the time the data was created. That way, we don't "punish" older data for not conforming to expectations that changed in the time since data creation.
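The date-based version selection described above can be sketched roughly as follows. This is an illustrative sketch only: the cutover dates, version strings, and function name here are hypothetical, not the actual mapping used by our Python code.

```python
# Hypothetical sketch of selecting a validator version by extract date.
# Cutover dates and versions below are example values, not the real mapping.
from datetime import datetime

VALIDATOR_VERSIONS = [
    # (first date this version applies, version string) -- illustrative only
    (datetime(2023, 1, 1), "v4.0.0"),
    (datetime(2023, 9, 1), "v5.0.0"),
]


def validator_version_for(extract_dt: datetime) -> str:
    """Return the newest version whose cutover date is on or before the extract date."""
    chosen = VALIDATOR_VERSIONS[0][1]
    for cutover, version in VALIDATOR_VERSIONS:
        if extract_dt >= cutover:
            chosen = version
    return chosen
```

Because the lookup is keyed on the extract date rather than the current date, re-running validation on old data always picks the same validator version it originally got.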

## Tips for upgrading the schedule validator version

If you run into trouble when adding the new validator JAR, it's because of the default size limit for check-added-large-files in our pre-commit config, which is a relatively low 500 KB. It's meant more as an alarm for local development than as an enforcement mechanism. You can make one commit that adds the JAR and temporarily raises the file size threshold in the pre-commit config, like this one, and then a second commit that removes the threshold modification, like this one. That'll get the file through.
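For reference, the temporary override might look like the fragment below. This is a sketch assuming the standard `pre-commit/pre-commit-hooks` repo; the `rev` and the 15000 KB limit are arbitrary example values, not our actual config.

```yaml
# Temporary override while committing the new JAR; remove in a follow-up commit.
# The rev and the 15000 KB threshold are illustrative values.
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.4.0
  hooks:
    - id: check-added-large-files
      args: ["--maxkb=15000"]
```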

Remember that you need to rebuild and push the latest Docker image to ghcr.io before changes will be reflected in Airflow runs.

You will need to parse the rules.json from the MobilityData validator release. Here is a code example from the upgrade to v5:

```python
# https://github.com/MobilityData/gtfs-validator/releases/tag/v5.0.0
import json

import pandas as pd

# Path to the rules.json shipped with the validator release
with open("rules.json") as f:
    data = json.load(f)

# Flatten each rule into a row for the CSV
result = []
for key, rule in data.items():
    result.append(
        {
            "code": rule["code"],
            "human_readable_description": rule["shortSummary"],
            "version": "v5.0.0",
            "severity": rule["severityLevel"],
        }
    )

# Create CSV
df = pd.DataFrame(result)
df.to_csv("gtfs_schedule_validator_rule_details_v5_0_0.csv", index=False)
```

Here is a command to test once you have appropriate GTFS zip files in the test bucket:

```bash
docker compose run airflow tasks test unzip_and_validate_gtfs_schedule_hourly validate_gtfs_schedule YYYY-MM-DDTHH:MM:SS
```