12: Improved Entity Matching #42
base: main
Changes from all commits
36c5c98
0643c9a
4eb6573
e2448d8
7da97c8
bd68f60
ebed913
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,166 @@ | ||
# OSEP #12: Improved Entity Matching | ||
|
||
| | | | ||
|--------------------|----------------------------------------------------------------| | ||
| **Author(s)** | @newageairbender | | ||
| **Implementer(s)** | @newageairbender, @jessemortenson, @alexobaseki | | ||
| **Status** | Draft | | ||
| **Issue** | https://github.com/openstates/enhancement-proposals/issues/TBD | | ||
| **Draft PR(s)** | https://github.com/openstates/enhancement-proposals/pull/TBD | | ||
| **Approval PR(s)** | https://github.com/openstates/enhancement-proposals/pull/TBD | | ||
| **Created** | 2024-07-01 | | ||
| **Updated** | 2024-07-31 | | ||
|
||
--- | ||
|
||
## Abstract | ||
|
||
With the 2024 New Session, we had far more eyes on Events & Votes in addition to our usual Bill activity. Working through | ||
bug tickets, it became evident that there was only so much we could do for some scrapers, but some missing data could be | ||
traced back to entities not being matched properly on import. This EP proposes to start improving matching by passing in data that narrows | ||
the query results returned on import. | ||
|
||
|
||
## Specification | ||
|
||
### People Matching on Sponsorship, Votes, & Events | ||
To help resolve People mismatching, there is already an option to pass in an `org_classification` to the | ||
[resolve_person](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L526) | ||
function on the `BaseImporter` that is used to query & match People to Bills, Events, & Votes. If the | ||
`org_classification` isn't set, it just defaults to any match of `upper`, `lower`, & `legislature`. If we ensure | ||
that an `org_classification` can be passed in from where it's used in the Bill, Event, & Vote importers, we should be | ||
able to alleviate some of that mismatching. There may need to be some scraper updates to ensure that the classification | ||
is correct, like a Bill getting sponsors added from the opposite chamber than it was introduced in, but for Votes where | ||
the voting body is either a Chamber or a Committee, we can narrow down People by classification based off of that voting | ||
body with more accuracy. Because of this, we should start with adding the `org_classification` to Events & Votes before | ||
tackling Bills. | ||
|
||
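The narrowing described above can be illustrated with a minimal, self-contained sketch. It does not use the real `resolve_person` or the Django models; `Member` and `match_people` are hypothetical stand-ins, and the only behavior taken from this proposal is the fallback to `upper`, `lower`, & `legislature` when no `org_classification` is passed.

```python
from dataclasses import dataclass

# Fallback classifications when no org_classification is supplied,
# mirroring the default described in this section.
DEFAULT_CLASSIFICATIONS = ("upper", "lower", "legislature")


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a Person with a current chamber membership."""

    name: str
    org_classification: str


def match_people(members, name, org_classification=None):
    """Return members whose name matches, optionally narrowed by chamber."""
    if org_classification:
        allowed = (org_classification,)
    else:
        allowed = DEFAULT_CLASSIFICATIONS
    return [
        m
        for m in members
        if m.name.lower() == name.lower() and m.org_classification in allowed
    ]


members = [
    Member("Johnson", "upper"),
    Member("Johnson", "lower"),
    Member("Nguyen", "lower"),
]

# Without a classification, both Johnsons come back (the ambiguous case).
ambiguous = match_people(members, "JOHNSON")
# With the voting body's chamber, only one candidate remains.
narrowed = match_people(members, "JOHNSON", org_classification="upper")
```

For a Vote taken in the upper chamber, passing that chamber through would turn the two-candidate result into a single match.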
When we get to Bills, `chamber` is already a passable value on [add_sponsorship](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/scrape/bill.py#L105), | ||
so it'll mostly be scraper work to ensure that the correct chamber is passed in per sponsorship. For example, | ||
scrapers should be updated to check whether Representative or Senator is listed with the Sponsor's name to | ||
designate the chamber. Where House & Senate sponsors are grouped separately, like in [IL](https://ilga.gov/legislation/BillStatus.asp?DocNum=4910&GAID=17&DocTypeID=HB&LegId=152782&SessionID=112&GA=103), | ||
we can be certain of the chamber to pass in for `org_classification`. | ||
|
||
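As a rough sketch of the scraper-side logic described above, the helper below strips a leading title from a sponsor name and derives a chamber from it. The function name `split_title` and the list of title prefixes are assumptions for illustration; real scrapers would need per-jurisdiction variants.

```python
import re

# Hypothetical title prefixes; real scrapers would tune these per jurisdiction.
TITLE_TO_CHAMBER = {
    "rep": "lower",
    "representative": "lower",
    "sen": "upper",
    "senator": "upper",
}

# A title must be followed by whitespace so surnames like "Sendak" are not split.
TITLE_RE = re.compile(
    r"^\s*(rep(?:resentative)?|sen(?:ator)?)\.?\s+", re.IGNORECASE
)


def split_title(raw_name):
    """Return (clean_name, chamber); chamber is None when no title is found."""
    match = TITLE_RE.match(raw_name)
    if not match:
        return raw_name.strip(), None
    chamber = TITLE_TO_CHAMBER[match.group(1).lower()]
    return raw_name[match.end():].strip(), chamber
```

A scraper could then pass the derived value as the existing `chamber` argument of `add_sponsorship`, leaving it unset when no title is present.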
We also should consider adding nicknames of People to `other_names` in the yaml files through the People script so we | ||
can catch matches when the name may not be exactly as scraped if the person goes by multiple first names or includes | ||
their middle name/initial in some places to differentiate from people with other names. | ||
|
||
#### Solutions: | ||
- Core: Adding `org_classification` to Events & Votes from where `resolve_person` is being used on Import based on data | ||
provided on the scrape | ||
- Core: Add `org_classification` to Bill Import for Sponsors, but this may need to come after scraper improvements if | ||
jurisdictions have sponsors from both chambers per Bill | ||
- Scrapers: Ensure correct `chamber` is passed in with `add_sponsorship` on Bill Scrapes | ||
- People Script: Update People Script to include name values that may be overwritten as `other_name` options | ||
- People Repo: Add `other_name` values that match scraped name formats for sponsorship or votes | ||
Review comment: How do we intend to do this? Maybe using the people matching tool? Explaining how we will arrive at this will be useful. |
||
|
||
### Committees as Bill Sponsors | ||
In resolving Committees as Bill Sponsors, there's logic that should be able to match in the `BillImporter`'s | ||
[prepare_for_db](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/bills.py#L147) | ||
function, so we need to ensure that scrapers check whether the Sponsor is a Person or Organization & that the result is | ||
correctly passed in as the `entity_type` in `add_sponsorship()`. The only fix needed is in the scrapers themselves. | ||
|
||
#### Solution: | ||
- Scrapers: Ensure correct `entity_type` is passed in with `add_sponsorship` on Bill Scrapes (just need to check which | ||
states have unmatched People that are actually Committees) | ||
|
||
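A minimal sketch of the Person vs Organization check a scraper might run before calling `add_sponsorship`. The keyword list and the helper name `guess_entity_type` are assumptions for illustration, not part of `openstates-core`; each jurisdiction would likely need its own rules.

```python
# Hypothetical keywords suggesting the sponsor is a committee, not a person.
ORG_KEYWORDS = ("committee", "subcommittee", "commission", "task force")


def guess_entity_type(sponsor_name):
    """Return "organization" for committee-like names, else "person"."""
    lowered = sponsor_name.lower()
    if any(keyword in lowered for keyword in ORG_KEYWORDS):
        return "organization"
    return "person"
```

The returned value would be passed as `entity_type`, so committee sponsors stop showing up as unmatched People.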
### Committees on Events | ||
Similarly, in helping resolve Committees, we can improve the matching query by cleaning or splitting up the scraped name | ||
into its different Committee elements, such as Chamber & Type, and then incorporating that into the `OrganizationImporter` | ||
[limit_spec](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/organizations.py#L11) | ||
logic. This will be a bit messier, so I propose that we instead add `other_names` to Committee files to more easily match | ||
against what is commonly scraped, like we did [for MN](https://github.com/openstates/people/pull/1442/files) when Events | ||
were "missing" because of name mismatching, & update the `limit_spec` logic to check more than the first `other_names` | ||
entry. This is the preferred route since we can update the Committee script to include the other formats | ||
of the name without work from Engineering & Product to write to hundreds of files, & we can easily incorporate multiple name | ||
formats to accommodate however the source may be posting the Committees (ex: 'Committee on Ending Homelessness' | ||
as a Bill Sponsor vs 'House Ending Homelessness' on Events, etc.). | ||
|
||
Currently, the `limit_spec` function is used to override the Django default and limit the query parameters. As of right | ||
now, the function works as follows: | ||
- If classification is NOT party, then add the jurisdiction_id to the query spec | ||
Review comment: This step is not terribly clear to me. Like what is "Django default", which classification is NOT |
||
- If name is set, match on (the rest of the spec) AND ((first `other_names` value matches name) OR (name is exact match)) | ||
- If name is NOT set, then just match on the rest of the spec | ||
|
||
IF we go the `other_name` route, the change we'd need to make is: | ||
Review comment: Let's keep other names consistent across the board. I see it is |
||
- If name is set, match on (the rest of the spec) AND ((~~first~~ ANY `other_names` value matches name) OR (name is exact match)) | ||
|
||
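The difference between the current and proposed name checks can be shown with plain dictionaries. This is a sketch only: it does not reproduce the Django query in `limit_spec`, and the shape of `other_names` (a list of `{"name": ...}` entries) is assumed here for illustration.

```python
def matches_first_other_name(org, name):
    """Behavior as described today: exact name OR the FIRST other_names entry."""
    others = org.get("other_names", [])
    first = others[0]["name"] if others else None
    return org["name"] == name or first == name


def matches_any_other_name(org, name):
    """Proposed behavior: exact name OR ANY other_names entry."""
    others = org.get("other_names", [])
    return org["name"] == name or any(o["name"] == name for o in others)


committee = {
    "name": "Ending Homelessness",
    "other_names": [
        {"name": "Committee on Ending Homelessness"},
        {"name": "House Ending Homelessness"},
    ],
}

# The second alias is only found once every other_names entry is checked.
first_only = matches_first_other_name(committee, "House Ending Homelessness")
any_entry = matches_any_other_name(committee, "House Ending Homelessness")
```

Here `first_only` is `False` and `any_entry` is `True`, which is the gap the `other_names` route is meant to close.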
IF we wanted to split up by chamber & type first in `core`, we'd have to add: | ||
- Update [add_participant](https://github.com/openstates/openstates-core/blob/7ac7b73bbb0956f7a539128f9186929509c19550/openstates/scrape/event.py#L140) | ||
and `add_committee` to accept a `chamber` value or `committee_type` of `committee` or `subcommittee` (if `subcommittee`, | ||
Review comment: Is there a link to |
||
add `parent_committee_id`) | ||
- Add that `chamber` value to the `self.org_importer.resolve_json_id` calls in the `EventImporter` on lines [92](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/events.py#L92) | ||
and 101 | ||
- In `limit_spec`, if classification is `committee`, then add the `chamber_id` to the query spec
Review comment: Forgive me for my ignorance, is this |
||
- In `limit_spec`, if classification is `committee`, then add the `committee_type` to the query spec | ||
- In `limit_spec`, if classification is `committee` AND `committee_type` = `subcommittee`, then add the | ||
`parent_committee_id` to the query spec | ||
|
||
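To make the split-by-chamber-and-type option more concrete, here is a rough sketch of breaking a scraped committee name into its elements. The prefix table, the mapping of "Joint" to `legislature`, and the function name `split_committee_name` are all assumptions for illustration; the resulting values are what would feed the `chamber` and `committee_type` additions listed above.

```python
import re

# Hypothetical chamber prefixes; real names vary by jurisdiction.
CHAMBER_PREFIXES = {
    "house": "lower",
    "assembly": "lower",
    "senate": "upper",
    "joint": "legislature",
}


def split_committee_name(scraped):
    """Return a dict with the bare name, chamber, and committee_type."""
    name = scraped.strip()
    chamber = None
    first, _, rest = name.partition(" ")
    # Only treat the first word as a chamber when something follows it.
    if first.lower() in CHAMBER_PREFIXES and rest:
        chamber = CHAMBER_PREFIXES[first.lower()]
        name = rest
    committee_type = "committee"
    if re.search(r"\bsubcommittee\b", name, re.IGNORECASE):
        committee_type = "subcommittee"
    # Drop a trailing "Committee"/"Subcommittee" or a leading "Committee on".
    name = re.sub(r"\s+(sub)?committee$", "", name, flags=re.IGNORECASE)
    name = re.sub(r"^(sub)?committee on\s+", "", name, flags=re.IGNORECASE)
    return {"name": name, "chamber": chamber, "committee_type": committee_type}
```

A vague name such as "Rules" comes back with no chamber, which is exactly the case where the Event's own chamber would have to supply the missing value.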
#### Solutions: | ||
- Core: Fix `limit_spec` on the `OrganizationImporter` so that more than just the first string in `other_names` is checked for | ||
Committees | ||
- People Script: Update Committee Script to include `other_names` for Committees that include Chamber, Type, & Both | ||
Review comment: I am assuming chamber is like House, Senate, Joint. What is Type and Both? |
||
|
||
### Bill Matching to Event Agenda Items | ||
When it comes to matching Bills to Agenda Items on Events, I'm a little more fuzzy. Right now we have a [resolve_bill](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L164) | ||
function on the `BaseImporter` that attempts to match Bills via `bill_id`, `jurisdiction_id`, & `date` (if it gets passed). | ||
This could be improved by incorporating some of the logic in `resolve_related_bills` that Jesse worked on | ||
this spring, where the match query is also narrowed down by `session_id`. We can certainly pass in more data to try to | ||
identify the Bill match better, but could also incorporate an LLM, so we will be testing out different approaches. | ||
|
||
#### Solutions: | ||
- Scrapers: Ensure `bill_identifier` matches the format of the expected Bill per jurisdiction | ||
- Core: Bill Identifier match improvements, passing in more data (at least `session`, maybe `chamber`) | ||
- Core: Add LLM to try better matching with above Core improvement | ||
- Core: Potentially add a CLI command to try matching Events with Unmatched Bills in their agendas to Bills post-import | ||
|
||
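One way to picture the identifier and session improvements listed above is the sketch below. The target format ("HB 42"), the helper names, and the dictionary shape of a bill are assumptions for illustration; the expected format differs per jurisdiction, and this is not the logic in `resolve_bill`.

```python
import re

# Letters (with optional dots/spaces), then digits with leading zeros dropped.
IDENTIFIER_RE = re.compile(r"^\s*([A-Za-z.\s]+?)\s*0*(\d+)\s*$")


def normalize_bill_identifier(raw):
    """Normalize e.g. "H.B. 0042" or "hb42" to "HB 42"; None if unparseable."""
    match = IDENTIFIER_RE.match(raw)
    if not match:
        return None
    prefix = re.sub(r"[.\s]", "", match.group(1)).upper()
    return f"{prefix} {int(match.group(2))}"


def find_bills(bills, raw_identifier, session):
    """Return bills matching the normalized identifier within one session."""
    wanted = normalize_bill_identifier(raw_identifier)
    return [
        b for b in bills if b["session"] == session and b["identifier"] == wanted
    ]


bills = [
    {"identifier": "HB 42", "session": "2023"},
    {"identifier": "HB 42", "session": "2024"},
    {"identifier": "SB 7", "session": "2024"},
]

# The same identifier exists in two sessions; the session narrows it to one.
matched = find_bills(bills, "H.B. 0042", "2024")
```

Narrowing by session matters because identifiers are commonly reused from one session to the next.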
## Rationale | ||
|
||
### Bills or Votes to People or Committees | ||
We've known that matching Bills or Votes to Sponsors has been tricky for a while, hence OSEP #3 to help alleviate some | ||
of the issues with mismatching legislators. The People Matcher Tool can only get us so far, since we run into a blocker | ||
when there are legislators with the same last name in a jurisdiction or the sponsor is actually a committee, where | ||
adding an `other_name` to a person's yaml file isn't a possible fix. | ||
|
||
Current example for matching a Person to a Bill Sponsor: | ||
- Bill scraper calls `add_sponsorship(name="JOHNSON", classification="primary", entity_type="person", primary=True)` | ||
- `add_sponsorship` creates a `pseudo_person_id` that is JOHNSON | ||
- BillImport calls `resolve_person` passing in that `pseudo_person_id` with start/end date values from the Bill's `session` | ||
- [resolve_person](https://github.com/openstates/openstates-core/blob/7ac7b73bbb0956f7a539128f9186929509c19550/openstates/importers/base.py#L526) | ||
constructs a spec that is used to compose filters to query data from the Person model to find a match. We could pass in | ||
`org_classification` to narrow down via chamber, but currently don't | ||
- If jurisdiction has more than one legislator with the last name "Johnson", Importer will give an error message that | ||
`multiple people returned for spec` but continue through Import task | ||
Review comment: I am imagining that "multiple people returned for spec" will be limited if you have organization classification in the resolve person query. Thinking of an idea here:
|
||
|
||
### Events to Committees | ||
A similar issue has been happening with matching Events to their Participants (typically a Committee). The scraped name | ||
of a participant can vary from something vague, such as "Rules" with no chamber, to something more specific, like "Assembly Privacy and | ||
Consumer Protection Committee", while the name of the Committee in the yaml file doesn't have the chamber listed. Now that | ||
we've come to a standard expectation for the OS People repo that Committee names will exclude chamber & | ||
committee type, since those can be derived from data in the yaml file, matching should become easier | ||
if we can narrow the match query based on those attributes. | ||
|
||
### Events to Bills | ||
Another area where we're struggling to match entities is Events to the Bills listed in their Agenda Items. Sometimes | ||
it's clearly because the scraped bill id format is different from how the Bill gets saved, but sometimes it's less clear | ||
as to why some Bills get matched but others don't. Occasionally, there may be a Bill that doesn't exist in OS yet but | ||
is mentioned as an Event's Agenda Item, so it won't be attached to the Event until after a future scrape after the Bill | ||
is in the system. | ||
|
||
## Drawbacks | ||
|
||
We should absolutely add defaults in `core` updates wherever we're not certain what's going to be passed in.
Review comment: What does this mean? |
||
|
||
## Implementation Plan | ||
Most steps are listed above with the entity types they fix, but other plans are included below. | ||
|
||
#### Setup | ||
- Pull numbers for average percent matched per data type, also broken down per jurisdiction | ||
- Create harnesses to try & limit testing scope per data type. Can include bug tickets for specific jurisdictions | ||
- Create shared database for running tests on improvements | ||
- Insights team tests to see if we can use AI to help match more entities | ||
|
||
## Copyright | ||
|
||
This document has been placed in the public domain per the [Creative Commons CC0 1.0 Universal license.](https://creativecommons.org/publicdomain/zero/1.0/deed) |