12: Improved Entity Matching #42
Open: NewAgeAirbender wants to merge 7 commits into `main` from `rj_additions`
Commits (all by NewAgeAirbender):
- 36c5c98 12: improved entity matching draft
- 0643c9a 12: update based on discussions
- 4eb6573 12: update script names
- e2448d8 12: EP categories
- 7da97c8 12: add solutions to specifications
- bd68f60 12: update committee matching options
- ebed913 12: add bill sponsorship scrape&import example

# OSEP #12: Improved Entity Matching

|                    |                                                                |
|--------------------|----------------------------------------------------------------|
| **Author(s)**      | Rylie                                                          |
| **Implementer(s)** | Rylie                                                          |
| **Status**         | Draft                                                          |
| **Issue**          | https://github.com/openstates/enhancement-proposals/issues/TBD |
| **Draft PR(s)**    | https://github.com/openstates/enhancement-proposals/pull/TBD   |
| **Approval PR(s)** | https://github.com/openstates/enhancement-proposals/pull/TBD   |
| **Created**        | 2024-07-01                                                     |
| **Updated**        | TODO                                                           |

---

## Abstract

With the 2024 New Session, we had far more eyes on Events & Votes in addition to our usual Bill activity. Working
through bug tickets, it became evident that there was only so much we could do for some scrapers, but that some missing
data could be traced back to a lack of proper matching. This EP proposes to start improving matching by passing in data
that narrows the query results returned on import.

## Specification

To help resolve People mismatching, there is already an option to pass an `org_classification` to the
[resolve_person](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L526)
function on the `BaseImporter`, which is used to query & match People to Bills, Events, & Votes. If
`org_classification` isn't set, it defaults to a combination of `upper`, `lower`, & `legislature`. If we ensure
that an `org_classification` can be passed in from where `resolve_person` is called in the Bill, Event, & Vote
importers, we should be able to alleviate some of that mismatching. Some scraper updates may be needed to ensure that
the classification is correct (for example, a Bill can get sponsors added from the opposite chamber to the one it was
introduced in). For Votes, though, the voting body is either a Chamber or a Committee, so we can narrow down People by
classification based on that voting body with more accuracy. Because of this, we should start by adding
`org_classification` to Events & Votes before tackling Bills. When we get to Bills, `chamber` is already a passable
value on [add_sponsorship](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/scrape/bill.py#L105),
so it will mostly be scraper work to ensure that the correct chamber is passed in per sponsorship.

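To illustrate why passing a classification helps, here is a minimal sketch of narrowing candidates by chamber. It does not use the real `resolve_person` query; `Membership` and `candidate_people` are hypothetical stand-ins, and only the default of `upper`, `lower`, & `legislature` comes from the text above.

```python
from dataclasses import dataclass

# Default used when no classification is supplied, per the paragraph above.
DEFAULT_CLASSIFICATIONS = ("upper", "lower", "legislature")


@dataclass(frozen=True)
class Membership:
    """Hypothetical, simplified stand-in for a person's current role."""

    person_name: str
    org_classification: str


def candidate_people(last_name, memberships, org_classification=None):
    """Return names whose last name matches, limited to the allowed classifications."""
    allowed = (
        (org_classification,) if org_classification else DEFAULT_CLASSIFICATIONS
    )
    return [
        m.person_name
        for m in memberships
        if m.org_classification in allowed
        and m.person_name.split()[-1].lower() == last_name.lower()
    ]


memberships = [
    Membership("Alex Smith", "upper"),
    Membership("Jordan Smith", "lower"),
]

# Without a classification, the last name alone is ambiguous.
ambiguous = candidate_people("Smith", memberships)
# A vote taken in the lower chamber narrows the candidates to one person.
narrowed = candidate_people("Smith", memberships, org_classification="lower")
```

With the default classifications both legislators are returned, which is the mismatching case; supplying the voting body's chamber leaves a single candidate.
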
Similarly, to help resolve Committees, we can improve the matching query by cleaning or splitting the scraped name
into its Committee elements, such as Chamber & Type, and then incorporating those into the `OrganizationImporter`
[limit_spec](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/organizations.py#L11)
logic. This will be a bit messier, so we could also add `other_names` to Committee files to more easily match
what is commonly scraped, like we did [for MN](https://github.com/openstates/people/pull/1442/files) when Events were
"missing" because of name mismatching, & update the `limit_spec` logic to check more than the first `other_name`
string. This is the preferred route: we can update the Committee script to include the other formats
of the name without work from Engineering & Product to write to hundreds of files, & we can easily incorporate multiple
name formats to accommodate however the source posts its Committees (ex: 'Committee on Ending Homelessness'
as a Bill Sponsor vs 'House Ending Homelessness' on Events).

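The "check more than the first `other_name`" idea can be sketched as below. This is not the actual `limit_spec` implementation (which builds a database query); it is a plain-Python illustration, and the dictionary shape used for committees is an assumption loosely modeled on the yaml files.

```python
def normalize(name):
    """Lowercase and collapse whitespace so comparisons are forgiving."""
    return " ".join(name.lower().split())


def match_committee(scraped_name, committees, chamber=None):
    """Return committees whose name or any other_name equals scraped_name.

    `committees` is a list of dicts; this structure is assumed for
    illustration and is not taken from openstates-core.
    """
    target = normalize(scraped_name)
    matches = []
    for committee in committees:
        # Optionally narrow by chamber before comparing names.
        if chamber and committee.get("chamber") != chamber:
            continue
        names = [committee["name"]]
        # Check every alternate name, not only the first one.
        names += [other["name"] for other in committee.get("other_names", [])]
        if target in {normalize(n) for n in names}:
            matches.append(committee)
    return matches


committees = [
    {
        "name": "Ending Homelessness",
        "chamber": "lower",
        "other_names": [
            {"name": "Committee on Ending Homelessness"},
            {"name": "House Ending Homelessness"},
        ],
    },
    {"name": "Rules", "chamber": "upper", "other_names": []},
]

# The second other_name matches, which a first-entry-only check would miss.
found = match_committee("House Ending Homelessness", committees)
```
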
For resolving Committees as Bill Sponsors, there is already logic in the `BillImporter`'s
[prepare_for_db](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/bills.py#L147)
function that should be able to make the match, so we need to ensure that scrapers check whether the Sponsor is a
Person or an Organization & pass that in correctly as the `entity_type` in `add_sponsorship()`.

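As a rough sketch of the scraper-side check, the helper below guesses an `entity_type` from the sponsor name and builds arguments for `add_sponsorship()`. The keyword list and both helper functions are hypothetical; `chamber` and `entity_type` are named in this EP, while the other keyword names (`name`, `classification`, `primary`) are assumptions that should be checked against the linked source.

```python
import re

# Hypothetical keywords suggesting a sponsor is a committee, not a person.
COMMITTEE_HINTS = re.compile(
    r"\b(committee|subcommittee|commission|task force)\b", re.IGNORECASE
)


def guess_entity_type(sponsor_name):
    """Return "organization" for committee-like names, else "person"."""
    return "organization" if COMMITTEE_HINTS.search(sponsor_name) else "person"


def sponsorship_kwargs(sponsor_name, chamber, primary=True):
    """Build keyword arguments a scraper could pass to add_sponsorship()."""
    return {
        "name": sponsor_name,
        "classification": "primary" if primary else "cosponsor",
        "entity_type": guess_entity_type(sponsor_name),
        "primary": primary,
        "chamber": chamber,
    }


committee = sponsorship_kwargs("Committee on Ending Homelessness", "lower")
person = sponsorship_kwargs("Jordan Smith", "upper", primary=False)
```

In a scraper this would be used along the lines of `bill.add_sponsorship(**sponsorship_kwargs(name, chamber))`. A keyword heuristic will miss committees with unusual names, so per-jurisdiction overrides would likely still be needed.
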
When it comes to matching Bills to Agenda Items on Events, I'm less certain of the approach. Right now we have a
[resolve_bill](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L164)
function on the `BaseImporter` that attempts to match Bills via `bill_id`, `jurisdiction_id`, & `date` (if one is
passed). This could likely be improved by incorporating some of the logic in `resolve_related_bills` that Jesse worked
on this spring, where the match query is also narrowed down by `session_id`. We can certainly pass in more data to
better identify the Bill match, but we could also incorporate an LLM, so we will be testing out different approaches.

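One of the approaches to test could look like the sketch below: normalize the scraped identifier, then narrow by session. It is not the real `resolve_bill`; `normalize_bill_id`, `resolve_bill_sketch`, and the dictionary fields are hypothetical, and the target format `HB 12` is an assumed convention that varies by jurisdiction.

```python
import re


def normalize_bill_id(raw):
    """Normalize identifiers such as "H.B. 0012" or "hb12" to "HB 12"."""
    compact = re.sub(r"[.\s]", "", raw).upper()
    match = re.fullmatch(r"([A-Z]+)0*(\d+)", compact)
    if not match:
        # Leave unrecognized formats alone rather than guess.
        return raw.strip()
    prefix, number = match.groups()
    return f"{prefix} {number}"


def resolve_bill_sketch(raw_id, bills, session=None):
    """Return the single matching bill, or None if missing or ambiguous."""
    wanted = normalize_bill_id(raw_id)
    matches = [
        bill
        for bill in bills
        if normalize_bill_id(bill["identifier"]) == wanted
        and (session is None or bill["session"] == session)
    ]
    return matches[0] if len(matches) == 1 else None


bills = [
    {"identifier": "HB 12", "session": "2023"},
    {"identifier": "HB 12", "session": "2024"},
]

# Ambiguous across sessions, so nothing is returned.
unresolved = resolve_bill_sketch("H.B. 0012", bills)
# Narrowing by session leaves exactly one candidate.
resolved = resolve_bill_sketch("H.B. 0012", bills, session="2024")
```
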
## Rationale

We've known for a while that matching Bills or Votes to Sponsors is tricky, hence OSEP #3 to help alleviate some
of the issues with mismatching legislators. The People Matcher Tool can only get us so far: we run into a blocker
when there are legislators with the same last name in a jurisdiction, or when the sponsor is actually a committee, since
in those cases adding an `other_name` to a person's yaml file isn't a possible fix.

A similar issue has been happening with matching Events to their Participants (typically a Committee). The scraped name
of a participant can range from something vague, such as "Rules" with no chamber, to something more specific, like
"Assembly Privacy and Consumer Protection Committee", while the name of the Committee in the yaml file doesn't include
the chamber. Now that we've settled on a standard expectation for the OS People repo that a Committee's name omits the
chamber & committee type, since those can be derived from other data in the yaml file, matching should be easier
if we can narrow the match query based on those attributes.

Another area where we're struggling to match entities is Events to the Bills listed in their Agenda Items. Sometimes
it's clearly because the scraped bill id format differs from how the Bill gets saved, but sometimes it's less clear
why some Bills get matched & others don't. Occasionally, a Bill that doesn't exist in OS yet is mentioned as an
Event's Agenda Item, so it won't be attached to the Event until a later scrape, after the Bill is in the system.

## Drawbacks

We should absolutely add defaults wherever we're not certain what is going to be passed in on `core` updates.

Review comment: What does this mean?

## Implementation Plan

Setup:
- Pull numbers for the average percent matched per data type, also broken down per jurisdiction
- Create harnesses to limit testing scope per data type; these can include bug tickets for specific jurisdictions
- Create a shared database for running tests on improvements
- Have the Insights team test whether we can use AI to help match more entities

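The baseline numbers in the first Setup item could be computed along these lines. This is only a sketch: the record shape (`jurisdiction`, `data_type`, `matched`) and the sample values are made up for illustration, and real numbers would come from a query against the database.

```python
from collections import defaultdict


def match_rates(records):
    """Return {(jurisdiction, data_type): percent matched} for the records."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for record in records:
        key = (record["jurisdiction"], record["data_type"])
        totals[key] += 1
        if record["matched"]:
            matched[key] += 1
    return {key: 100.0 * matched[key] / totals[key] for key in totals}


# Illustrative records only; not real Open States data.
records = [
    {"jurisdiction": "mn", "data_type": "event_participant", "matched": True},
    {"jurisdiction": "mn", "data_type": "event_participant", "matched": False},
    {"jurisdiction": "ca", "data_type": "vote_voter", "matched": True},
]

rates = match_rates(records)
```

Running the same function before and after each improvement would give a per-jurisdiction comparison for the test harnesses.
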
Core Improvements:
- Add `org_classification` to Events & Votes wherever `resolve_person` is used on import; do the same for Bills,
though Bills may need to wait until after scraper improvements
- Fix `limit_spec` on the `OrganizationImporter` so that more than just the first string in `other_names` is checked for
Committees
- Improve Bill identifier matching by passing in more data; this could also incorporate AI assistance
- Potentially add a CLI command that tries to match Events with unmatched Bills in their agendas to Bills, like we have
for resolving Bill relationships

Scraper Improvements:
- Ensure the correct `chamber` is passed to `add_sponsorship` on Bill scrapes
- Ensure the correct `entity_type` is passed to `add_sponsorship` on Bill scrapes (we just need to check which states
have unmatched People that are actually Committees)
- Ensure `bill_identifier` matches the format of the expected Bill per jurisdiction

Elsewhere:
- Update the Committee script to include `other_names` for Committees that include Chamber, Type, & both
- Update the People script to include name values that may be overwritten as `other_name` options

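For the Committee script item above, generating the Chamber, Type, & both variants might look like this sketch. The `name_variants` helper and the chamber labels are assumptions; many jurisdictions use different chamber names (for example "Assembly"), so a real version would need per-jurisdiction labels.

```python
# Assumed display labels; actual chamber names differ by jurisdiction.
CHAMBER_LABELS = {"upper": "Senate", "lower": "House"}


def name_variants(name, chamber, committee_type="Committee"):
    """Return alternate names that add the chamber, the type, and both."""
    chamber_label = CHAMBER_LABELS[chamber]
    return [
        f"{chamber_label} {name}",  # chamber only
        f"{name} {committee_type}",  # type only
        f"{chamber_label} {name} {committee_type}",  # both
    ]


variants = name_variants("Ending Homelessness", "lower")
```

Each variant could then be written to the committee's `other_names` so that the importer has more formats to match against.
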
## Copyright

This document has been placed in the public domain per the [Creative Commons CC0 1.0 Universal license](https://creativecommons.org/publicdomain/zero/1.0/deed).