Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adopt external data parsing/normalisation lib #1368

Merged
merged 24 commits into from
Apr 2, 2024
Merged

Conversation

pudo
Copy link
Contributor

@pudo pudo commented Jan 14, 2024

We talked about this briefly during the Berlin meeting - the idea here it to make rigour into a more in-depth data normalisation library that can also hold a lot of the fundamental handling code for human names that's currently in nomenklatura.

Comment on lines 30 to 40
def clean_text(
self,
text: str,
fuzzy: bool = False,
format: Optional[str] = None,
proxy: Optional["EntityProxy"] = None,
) -> Optional[str]:
if format is not None and format in FORMATS:
type_ = FORMATS[format]
return type_.normalize(text)
return text
Copy link
Contributor

@tillprochaska tillprochaska Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Just to sanity check my understanding of FtM: Validating concrete identifier formats is a new feature, right? Actually, if I don’t miss anything, the only use for property type formats so far was the date type, right?

  • I vaguely remember that we talked about adding the identifier format to the schema, so e.g. Thing:wikidataId would be an identifier in the wikidata format (and when validating an entity, values of this property would need to be a valid Wikidata QID)? Is that still something you’d like to add? I think it’s a good idea, but in that case would like to check on our side if/where these stricter validations could cause problems (e.g. in scrapers).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! So this is new functionality, but for the moment it would only come into effect in a mapping (that's the only place the format is set from) which would specify format for an identifier field. So it's completely opt-in.

The other option - making properties themselves know format - makes a ton of sense semantically (they're literally called wikidataId, swiftBic, imoCode, innCode, etc.) but it would create a breaking change in the sense that old mappings might produce much less data because all of a sudden a whole ton of INNs are recognised as invalid. There's maybe an argument for that to be a good thing, but it'd definitely be a surprising thing.

Maybe there's even a way of half-doing it: some props, like wikidataId have strong expectations, and we could introduce it there. By doing that we could also make IBAN fields from type: iban into type: identifier, format: iban which makes a bit more sense.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it would create a breaking change in the sense that old mappings might produce much less data because all of a sudden a whole ton of INNs are recognised as invalid. There's maybe an argument for that to be a good thing, but it'd definitely be a surprising thing.

Personally, I’d say that’s a good thing as long as there’s maybe some logging in the mapping logic when values are removed because they are invalid, allowing the monitoring of e.g. Memorious scrapers for validation issues.

But I’ve also forwarded this thread to our data team because they’d be much more qualified to assess the potential impact on our scraper fleet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other use case where validation is relevant from Aleph’s perspective is OCR/pattern extraction: When extracting IBANs from scanned documents, there might be OCR issues that e.g. cause the extracted IBAN to not match validation. While such an invalid IBAN might not be as useful for cross referencing, there might still be value in storing and exposing the extracted IBAN to users.

This isn’t really relevant to this PR, as the PR doesn’t change how IBANs are validated (there’s just an additional layer of abstraction), but something to keep in mind for the future and possibly when adding other identifier formats. (For most of the other identifier formats implemented in rigour, it’s probably much more likely that they are extracted from company registries etc. rather than from a scanned document, but I might be incorrect here.)

setup.py Outdated
@@ -35,6 +35,7 @@
"types-PyYAML",
"sqlalchemy2-stubs",
"banal >= 1.0.6, < 1.1.0",
"rigour >= 0.3.0, < 1.0.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that’s a dumb question, but what’s the advantage of extracting this functionality into a separate package compared to a separate module in followthemoney? As it contains validation/normalization logic for FtM data, I guess most projects that would make use of this (inlcuding Aleph, nomenklatura, …) would require followthemoney anyways?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, it could live in followthemoney. This is more meant to address a social issue than a technical one: at present, I'm treating FtM releases as expensive because they involve "y'all" - so I try to only push out a release here if it is super relevant.

The problem I have is that this leads to a lot of "plaque" in our upstream libs - for example when it comes to extra name and identifier processing code like here: https://github.com/opensanctions/nomenklatura/blob/main/nomenklatura/util.py#L95-L206

The idea with rigour is to combine all the intel about the types and how to process them into one place, and to basically make a promise that the upstream compatibility with FtM will be preserved, while not putting a limitation on the amount of e.g. name processing stuff that we can add as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, that makes sense!

@@ -1,9 +1,14 @@
import re
from typing import Optional, TYPE_CHECKING
from rigour.ids import FORMATS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these could be exposed as a class attribute on the IdentifierType class? Then the available formats (and even their docstrings) could be included in the JSON dump of the model, making it possible to include them in the documentation and to pick them up in the TS lib (maybe not to do full validation on the client side, but at least to render some format hints or something like that.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love that idea, importing the raw dict felt wrong. Will mint a new rigour release later :)

@pudo
Copy link
Contributor Author

pudo commented Jan 17, 2024

@tillprochaska I've done another pass at this, implementing the helper methods for fetching a format, and also adding a property.format thing. Strictly speaking, this makes types.iban un-used in this.

One thing I've skipped so far is the annotations for innCode and ogrnCode - those would have pretty drastic consequences with OCCRP data. So I'd test without for a while, then turn that one.

@tillprochaska
Copy link
Contributor

I've done another pass at this, implementing the helper methods for fetching a format, and also adding a property.format thing.

@pudo Nice! I’m off until Tuesday next week, but will take a closer look once I’m back.

Strictly speaking, this makes types.iban un-used in this.

One thing I didn’t really think about so far with regards to Aleph is that we also use the types to merge unique values for all properties of the same type (all names, all IBANs, …) in the backend (indexing entities and computing statistics) and in the frontend (when displaying lists of entities). We could probably work around that by using the new format attribute, haven’t checked out the code yet though.

But maybe IBANs are (ignoring the implementation details) a sufficiently independent concept compared to other ID formats that this warrants IBANs being a separate type? Similar to how IP addresses or phone numbers could probably also be considered identifiers that follow a specific format, but they are still separate types. Haven’t made up my mind about that, and probably I’m overthinking it.

@pudo
Copy link
Contributor Author

pudo commented Feb 20, 2024

@tillprochaska - This should be in a place right now where it's safe to merge; and rigour has seen a bit of action on our end to make sure it's not totally messed up. Would appreciate your feedback!

@tillprochaska
Copy link
Contributor

@pudo Hey, sorry for the late reply. I got confirmation from @brrttwrks that he’s fine with the stricter validation for bic, lei, isin, figi, qid 👍

The only other request from our data team was some logging when validation fails (and values are cleaned from entities) before we implement stricter validation. And possibly the option to disable validation/cleaning on a per-property level in mappings. I’ll try to open a PR that does that today or tomorrow.

Thanks also for restoring IBANs as a separate type. We rely on that in quite a few places in Aleph. I’ll also take another look at the code today.

Copy link
Contributor

@tillprochaska tillprochaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I finally found the time to take a closer look this morning. I found a few places that I think you’ve overlooked when restoring the iban type (or we misunderstood each other?). But apart from that I’m happy for this to be merged!

followthemoney/schema/Analyzable.yaml Outdated Show resolved Hide resolved
followthemoney/schema/BankAccount.yaml Outdated Show resolved Hide resolved
followthemoney/compare.py Outdated Show resolved Hide resolved
followthemoney/types/language.py Show resolved Hide resolved
followthemoney/types/mimetype.py Show resolved Hide resolved
followthemoney/types/url.py Show resolved Hide resolved
Copy link
Contributor

@tillprochaska tillprochaska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for restoring the IBAN type in the schemata! This makes this a non-breaking update in Aleph. Sorry for the amount of back-and-forth from my side…

@pudo
Copy link
Contributor Author

pudo commented Apr 2, 2024

No this is definitely an open heart operation, thanks for taking the time to work through it!

@pudo pudo merged commit 6fb47dc into main Apr 2, 2024
9 checks passed
@pudo pudo deleted the pudo/rigour-parsers branch April 2, 2024 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants