
Commit 33fc1bd

Merge pull request #1 from openstates/sqlmesh-audit-definition
Audit definitions
2 parents e351c03 + 07a262a commit 33fc1bd

20 files changed: +2146 −1 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
logs
_data
.cache
merged_*
db.db
__pycache__

README.md

Lines changed: 78 additions & 1 deletion
The previous one-line README (`# scraper-audit`) is replaced with the following:
# Scraper Audit

This is a utility for running audits on scraper output for legislative data entities (`bill` or `event`) using [SQLMesh](https://docs.sqlmesh.com/). The script loads JSON data into a DuckDB database and runs a SQLMesh plan, returning any audit errors.

## Features

- Merges entity-level JSON files into one dataset.
- Initializes a DuckDB database with the merged data.
- Runs `sqlmesh plan` on the staged models (`staged.bill` or `staged.event`).
- Extracts and prints any audit-related warnings or errors.

## Requirements

- Python 3.9+
- [`poetry`](https://python-poetry.org/docs/#installation)

## Installation

Clone the repository and install dependencies using Poetry:

```bash
git clone git@github.com:openstates/scraper-audit.git
cd scraper-audit
poetry install
```

## Usage

Ensure that the project root contains a data directory (for example `_data`) with the JSON output files to audit.
These files are typically generated by the OpenStates Scraper and should follow the naming pattern
`*/*/<entity>*.json`. For example:

```bash
_data/or/bill_0a3faf9c-1969-11f0-aaa5-4ef1b5972379.json
_data/or/bill_0a21cf0c-196b-11f0-aaa5-4ef1b5972379.json
```

Run `poetry run python main.py --entity <entity name>`, for example `poetry run python main.py --entity bill`. This should produce output similar to:

```bash
INFO:openstates:Initializing data with arguments: entity=bill, jurisdiction=None
INFO:openstates:Merging JSON files matching pattern: ./*/*/bill*.json
INFO:openstates:Merged 1179 records into merged_entities.json
INFO:openstates:Creating DuckDB schema and loading data...
INFO:openstates:bill: initialized successfully
INFO:openstates:Running SQLMesh plan via subprocess...
INFO:openstates:SQLMesh plan output:

`prod` environment will be initialized

Models:
└── Added:
    └── staged.bill
Models needing backfill:
└── staged.bill: [2025-04-24 - 2025-05-05]

staged.bill created
Updating physical layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00

✔ Physical layer updated

[WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log

[1/1] staged.bill [insert/update rows, audits ❌1] 0.03s
Executing model batches ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00
✔ Model batches executed

staged.bill created
Updating virtual layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 1/1 • 0:00:00

✔ Virtual layer updated

Audit failed:
[WARNING] staged.bill: 'assert_bills_have_sponsor' audit error: 1179 rows failed. Learn more in logs: ~/scraper-audit/logs/sqlmesh_2025_05_06_20_58_30.log
```
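The merge step reported in the log above ("Merging JSON files matching pattern", "Merged 1179 records into merged_entities.json") is handled by the repository's task code, which is not shown in this diff. A minimal sketch of what that step amounts to (the `merge_entity_files` name and the exact record layout are assumptions based on the log messages):

```python
import glob
import json
import logging

logger = logging.getLogger("openstates")


def merge_entity_files(entity: str, output_path: str = "merged_entities.json") -> int:
    """Merge per-entity scraper JSON files into a single JSON array file."""
    pattern = f"./*/*/{entity}*.json"
    logger.info("Merging JSON files matching pattern: %s", pattern)

    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            records.append(json.load(f))

    with open(output_path, "w") as f:
        json.dump(records, f)

    logger.info("Merged %d records into %s", len(records), output_path)
    return len(records)
```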

__init__.py

Whitespace-only changes.

audits/.gitkeep

Whitespace-only changes.

audits/bill.sql

Lines changed: 7 additions & 0 deletions
-- Does bill have sponsors?
AUDIT (
  name assert_bills_have_sponsor,
  blocking false
);
SELECT * from scraper.bill
WHERE sponsorships IS NULL;
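Since the audit is non-blocking (`blocking false`), failures surface as warnings rather than stopping the plan. To inspect the failing rows by hand, one option is to run the same predicate against the DuckDB file from `config.yaml` with the `duckdb` Python package; a sketch, assuming `db.db` has already been initialized with the `scraper.bill` source table:

```python
import duckdb

# Open the project database read-only and pull a few bills that the
# assert_bills_have_sponsor audit would flag (sponsorships IS NULL).
con = duckdb.connect("db.db", read_only=True)
rows = con.execute(
    "SELECT identifier, title FROM scraper.bill WHERE sponsorships IS NULL LIMIT 10"
).fetchall()
for identifier, title in rows:
    print(identifier, title)
con.close()
```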

audits/event.sql

Lines changed: 6 additions & 0 deletions
AUDIT (
  name assert_events_are_classified,
  blocking false
);
SELECT * from scraper.event
WHERE classification IS NULL;

config.yaml

Lines changed: 11 additions & 0 deletions
gateways:
  duckdb:
    connection:
      type: duckdb
      database: db.db

default_gateway: duckdb

model_defaults:
  dialect: duckdb
  start: 2025-04-28
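Because the gateway points at a local DuckDB file and `duckdb` is the default dialect, the project can also be loaded from Python through SQLMesh's `Context` API. A minimal sketch, assuming it is run from the project root:

```python
from sqlmesh import Context

# Load the SQLMesh project described by config.yaml in the current directory.
context = Context(paths=".")

# List the models SQLMesh discovered, e.g. staged.bill and staged.event.
for model_name in context.models:
    print(model_name)
```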

macros/.gitkeep

Whitespace-only changes.

macros/__init__.py

Whitespace-only changes.

main.py

Lines changed: 35 additions & 0 deletions
import argparse

from sqlmesh_tasks import sqlmesh_plan

if __name__ == "__main__":
    default_parser = argparse.ArgumentParser(add_help=False)

    parser = argparse.ArgumentParser(
        parents=[default_parser],
        description="Run audits on Scraper output and returns report as string",
    )
    parser.add_argument(
        "--jurisdiction",
        "-j",
        type=str,
        help="Specific jurisdiction to query from",
    )
    parser.add_argument(
        "--entity",
        "-e",
        required=True,
        choices=["bill", "event"],
        type=str,
        help="Entity type: bill or event",
    )

    args = parser.parse_args()
    entity = args.entity
    jurisdiction = args.jurisdiction
    report = sqlmesh_plan(entity, jurisdiction)
    if report:
        print("Audit failed:\n", report)
    else:
        print("Audit passed.")

models/.gitkeep

Whitespace-only changes.

models/bill.sql

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
MODEL (
2+
name staged.bill,
3+
kind FULL,
4+
start '2024-04-24',
5+
cron '0 5 * * *',
6+
interval_unit 'day',
7+
grain (id),
8+
audits (
9+
assert_bills_have_sponsor,
10+
),
11+
);
12+
13+
SELECT
14+
legislative_session::TEXT AS legislative_session,
15+
identifier::TEXT AS identifier,
16+
title::TEXT AS title,
17+
from_organization::TEXT AS from_organization,
18+
classification::JSON AS classification,
19+
subject::JSON AS subject,
20+
abstracts::JSON AS abstracts,
21+
other_titles::JSON AS other_titles,
22+
other_identifiers::JSON AS other_identifiers,
23+
actions::JSON AS actions,
24+
sponsorships::JSON AS sponsorships,
25+
related_bills::JSON AS related_bills,
26+
versions::JSON AS versions,
27+
documents::JSON AS documents,
28+
citations::JSON AS citations,
29+
sources::JSON AS sources,
30+
extras::JSON AS extras,
31+
jurisdiction::JSON AS jurisdiction,
32+
scraped_at::TIMESTAMP AS scraped_at,
33+
_id::TEXT AS _id
34+
FROM
35+
scraper.bill;
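The model selects from a `scraper.bill` source table. The loading code is not in this excerpt, but the README's "Creating DuckDB schema and loading data..." step could plausibly be done with DuckDB's `read_json_auto`; a sketch under that assumption:

```python
import duckdb

con = duckdb.connect("db.db")

# Create the source schema and table that staged.bill selects from,
# loading the merged scraper output produced earlier.
con.execute("CREATE SCHEMA IF NOT EXISTS scraper")
con.execute(
    """
    CREATE OR REPLACE TABLE scraper.bill AS
    SELECT * FROM read_json_auto('merged_entities.json')
    """
)
con.close()
```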

models/event.sql

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,32 @@
1+
MODEL (
2+
name staged.event,
3+
kind FULL,
4+
start '2024-04-24',
5+
cron '0 5 * * *',
6+
interval_unit 'day',
7+
grains (jurisdiction_id, start_date, 'name'),
8+
audits (assert_events_are_classified),
9+
);
10+
11+
SELECT
12+
name::TEXT AS name,
13+
all_day::BOOLEAN AS all_day,
14+
NULLIF(start_date, '')::TIMESTAMP AS start_date,
15+
NULLIF(end_date, '')::TIMESTAMP AS end_date,
16+
status::TEXT AS status,
17+
classification::TEXT AS classification,
18+
description::TEXT AS description,
19+
upstream_id::TEXT AS upstream_id,
20+
location::JSON AS location,
21+
media::JSON AS media,
22+
documents::JSON AS documents,
23+
links::JSON AS links,
24+
participants::JSON AS participants,
25+
agenda::JSON AS agenda,
26+
sources::JSON AS sources,
27+
extras::JSON AS extras,
28+
jurisdiction::JSON AS jurisdiction,
29+
scraped_at::TIMESTAMP AS scraped_at,
30+
_id::TEXT AS _id
31+
FROM
32+
scraper.event;
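Once a plan has been applied, SQLMesh's virtual layer exposes the model as a queryable view, so the casts above can be checked directly; for example (assuming `db.db` has been built and the `staged.event` view exists under that name):

```python
import duckdb

con = duckdb.connect("db.db", read_only=True)

# Count events per classification; NULLs here are what the
# assert_events_are_classified audit warns about.
rows = con.execute(
    "SELECT classification, COUNT(*) FROM staged.event GROUP BY 1 ORDER BY 2 DESC"
).fetchall()
for classification, count in rows:
    print(classification, count)
con.close()
```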
