Adopt external data parsing/normalisation lib #1368

pudo · 2024-01-14T12:45:29Z

We talked about this briefly during the Berlin meeting - the idea here is to make rigour into a more in-depth data normalisation library that can also hold a lot of the fundamental handling code for human names that's currently in nomenklatura.

tillprochaska · 2024-01-15T09:28:01Z

followthemoney/types/identifier.py

+    def clean_text(
+        self,
+        text: str,
+        fuzzy: bool = False,
+        format: Optional[str] = None,
+        proxy: Optional["EntityProxy"] = None,
+    ) -> Optional[str]:
+        if format is not None and format in FORMATS:
+            type_ = FORMATS[format]
+            return type_.normalize(text)
+        return text


Just to sanity check my understanding of FtM: Validating concrete identifier formats is a new feature, right? Actually, if I don’t miss anything, the only use for property type formats so far was the date type, right?

I vaguely remember that we talked about adding the identifier format to the schema, so e.g. Thing:wikidataId would be an identifier in the wikidata format (and when validating an entity, values of this property would need to be a valid Wikidata QID)? Is that still something you’d like to add? I think it’s a good idea, but in that case would like to check on our side if/where these stricter validations could cause problems (e.g. in scrapers).

Nice catch! So this is new functionality, but for the moment it would only come into effect in a mapping (that's the only place the format is set from) which would specify format for an identifier field. So it's completely opt-in.

The other option - making properties themselves know format - makes a ton of sense semantically (they're literally called wikidataId, swiftBic, imoCode, innCode, etc.) but it would create a breaking change in the sense that old mappings might produce much less data because all of a sudden a whole ton of INNs are recognised as invalid. There's maybe an argument for that to be a good thing, but it'd definitely be a surprising thing.

Maybe there's even a way of half-doing it: some props, like wikidataId have strong expectations, and we could introduce it there. By doing that we could also make IBAN fields from type: iban into type: identifier, format: iban which makes a bit more sense.

WDYT?
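A minimal sketch of the opt-in behaviour described above. It assumes a stand-in registry rather than the actual rigour API: `normalize_qid`, `HYPOTHETICAL_FORMATS` and the free-standing `clean_text` are illustrative names only. A value is validated only when a format name is passed in (as a mapping would do); otherwise it passes through unchanged.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a format registry; this is not the rigour API.
QID_RE = re.compile(r"^Q[1-9]\d*$")


def normalize_qid(text: str) -> Optional[str]:
    """Return an upper-cased Wikidata QID, or None if the value is invalid."""
    value = text.strip().upper()
    return value if QID_RE.match(value) else None


HYPOTHETICAL_FORMATS: Dict[str, Callable[[str], Optional[str]]] = {
    "qid": normalize_qid,
}


def clean_text(text: str, format: Optional[str] = None) -> Optional[str]:
    # Validation only applies when a known format is named (opt-in).
    if format is not None and format in HYPOTHETICAL_FORMATS:
        return HYPOTHETICAL_FORMATS[format](text)
    # Without a format, the value is returned untouched.
    return text
```

With this shape, existing mappings that never set a format keep producing the same values, which is the non-breaking property discussed here.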

> but it would create a breaking change in the sense that old mappings might produce much less data because all of a sudden a whole ton of INNs are recognised as invalid. There's maybe an argument for that to be a good thing, but it'd definitely be a surprising thing.

Personally, I’d say that’s a good thing as long as there’s maybe some logging in the mapping logic when values are removed because they are invalid, allowing the monitoring of e.g. Memorious scrapers for validation issues.
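The logging idea could look roughly like the sketch below. The logger name, `clean_values` and the toy `digits_only` cleaner are assumptions made for illustration and are not part of followthemoney or rigour.

```python
import logging
from typing import Callable, Iterable, List, Optional

# Hypothetical logger name, chosen for this example only.
log = logging.getLogger("mapping.validation")


def clean_values(
    prop_name: str,
    values: Iterable[str],
    cleaner: Callable[[str], Optional[str]],
) -> List[str]:
    """Keep values the cleaner accepts and log every value it rejects."""
    kept: List[str] = []
    for value in values:
        cleaned = cleaner(value)
        if cleaned is None:
            # Emit a warning so that dropped values can be monitored.
            log.warning("Dropped invalid value for %s: %r", prop_name, value)
            continue
        kept.append(cleaned)
    return kept


def digits_only(value: str) -> Optional[str]:
    # Toy cleaner for illustration: accept strings made of digits only.
    value = value.strip()
    return value if value.isdigit() else None
```

A scraper operator could then watch the warning count per property to see how much data a stricter format would remove.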

But I’ve also forwarded this thread to our data team because they’d be much more qualified to assess the potential impact on our scraper fleet.

The other use case where validation is relevant from Aleph’s perspective is OCR/pattern extraction: When extracting IBANs from scanned documents, there might be OCR issues that e.g. cause the extracted IBAN to not match validation. While such an invalid IBAN might not be as useful for cross referencing, there might still be value in storing and exposing the extracted IBAN to users.

This isn’t really relevant to this PR, as the PR doesn’t change how IBANs are validated (there’s just an additional layer of abstraction), but something to keep in mind for the future and possibly when adding other identifier formats. (For most of the other identifier formats implemented in rigour, it’s probably much more likely that they are extracted from company registries etc. rather than from a scanned document, but I might be incorrect here.)
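To illustrate why a single OCR misread is enough to fail validation, here is a sketch of the standard mod-97 IBAN checksum. It is written from the general algorithm, not taken from rigour or followthemoney, and it deliberately skips country-specific length and structure checks; `iban_checksum_ok` is a hypothetical name.

```python
def iban_checksum_ok(iban: str) -> bool:
    """Check the mod-97 checksum of an IBAN (no country-specific rules)."""
    value = iban.replace(" ", "").upper()
    if len(value) < 5 or not value.isascii() or not value.isalnum():
        return False
    # Move the country code and check digits to the end.
    rearranged = value[4:] + value[:4]
    # Map letters to numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # A valid IBAN leaves a remainder of 1 when divided by 97.
    return int(digits) % 97 == 1
```

Changing any single digit changes the remainder, so an IBAN with one misread digit is always rejected even though most of it may still be useful to a reader.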

tillprochaska · 2024-01-15T09:31:03Z

setup.py

@@ -35,6 +35,7 @@
        "types-PyYAML",
        "sqlalchemy2-stubs",
        "banal >= 1.0.6, < 1.1.0",
+        "rigour >= 0.3.0, < 1.0.0",


Maybe that’s a dumb question, but what’s the advantage of extracting this functionality into a separate package compared to a separate module in followthemoney? As it contains validation/normalization logic for FtM data, I guess most projects that would make use of this (including Aleph, nomenklatura, …) would require followthemoney anyway?

That's a fair point, it could live in followthemoney. This is more meant to address a social issue than a technical one: at present, I'm treating FtM releases as expensive because they involve "y'all" - so I try to only push out a release here if it is super relevant.

The problem I have is that this leads to a lot of "plaque" in our upstream libs - for example when it comes to extra name and identifier processing code like here: https://github.com/opensanctions/nomenklatura/blob/main/nomenklatura/util.py#L95-L206

The idea with rigour is to combine all the intel about the types and how to process them into one place, and to basically make a promise that the upstream compatibility with FtM will be preserved, while not putting a limitation on the amount of e.g. name processing stuff that we can add as needed.

Thanks for clarifying, that makes sense!

tillprochaska · 2024-01-15T12:49:50Z

followthemoney/types/identifier.py

@@ -1,9 +1,14 @@
 import re
+from typing import Optional, TYPE_CHECKING
+from rigour.ids import FORMATS


Maybe these could be exposed as a class attribute on the IdentifierType class? Then the available formats (and even their docstrings) could be included in the JSON dump of the model, making it possible to include them in the documentation and to pick them up in the TS lib (maybe not to do full validation on the client side, but at least to render some format hints or something like that.)

I love that idea, importing the raw dict felt wrong. Will mint a new rigour release later :)
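A rough sketch of the suggestion above, assuming simplified stand-in classes: `WikidataFormat`, `IBANFormat`, the `formats` attribute and `to_dict` are hypothetical and do not reflect the real `IdentifierType` or rigour classes. The point is only that a class attribute lets format names and docstrings travel with a JSON dump of the model.

```python
import inspect
import json
from typing import Any, Dict


class WikidataFormat:
    """Wikidata item identifier, for example Q42."""


class IBANFormat:
    """International Bank Account Number."""


class IdentifierType:
    name = "identifier"
    # Hypothetical class attribute listing the supported formats by name.
    formats: Dict[str, type] = {"qid": WikidataFormat, "iban": IBANFormat}

    @classmethod
    def to_dict(cls) -> Dict[str, Any]:
        """Describe the type, including each format's docstring."""
        return {
            "name": cls.name,
            "formats": {
                key: {"description": inspect.getdoc(fmt)}
                for key, fmt in cls.formats.items()
            },
        }


# The description is plain data, so it can be serialised for docs or clients.
model_json = json.dumps(IdentifierType.to_dict(), sort_keys=True)
```

A client library could read the `formats` section to render format hints without implementing validation itself.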

pudo · 2024-01-17T16:51:32Z

@tillprochaska I've done another pass at this, implementing the helper methods for fetching a format, and also adding a property.format thing. Strictly speaking, this makes types.iban un-used in this.

One thing I've skipped so far is the annotations for innCode and ogrnCode - those would have pretty drastic consequences with OCCRP data. So I'd test without for a while, then turn that on.

tillprochaska · 2024-01-18T08:17:59Z

> I've done another pass at this, implementing the helper methods for fetching a format, and also adding a property.format thing.

@pudo Nice! I’m off until Tuesday next week, but will take a closer look once I’m back.

> Strictly speaking, this makes types.iban un-used in this.

One thing I didn’t really think about so far with regards to Aleph is that we also use the types to merge unique values for all properties of the same type (all names, all IBANs, …) in the backend (indexing entities and computing statistics) and in the frontend (when displaying lists of entities). We could probably work around that by using the new format attribute, haven’t checked out the code yet though.

But maybe IBANs are (ignoring the implementation details) a sufficiently independent concept compared to other ID formats that this warrants IBANs being a separate type? Similar to how IP addresses or phone numbers could probably also be considered identifiers that follow a specific format, but they are still separate types. Haven’t made up my mind about that, and probably I’m overthinking it.
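The workaround mentioned above (grouping by the new format attribute instead of by type) might look like the following sketch. `Prop` and `group_values` are hypothetical simplifications and not Aleph or followthemoney code; the property names in the usage note are likewise only examples.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Prop:
    """Hypothetical, simplified property description."""

    name: str
    type: str
    format: Optional[str] = None


def group_values(pairs: Iterable[Tuple[Prop, str]]) -> Dict[str, Set[str]]:
    """Collect unique values per group, keyed by format if set, else by type."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for prop, value in pairs:
        # Prefer the more specific format name over the generic type name.
        key = prop.format or prop.type
        groups[key].add(value)
    return dict(groups)
```

For example, a property declared as `type: identifier, format: iban` and one declared as `type: iban` would both land in the `iban` group, which is the kind of merging that currently relies on the separate type.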

pudo · 2024-02-20T16:23:25Z

@tillprochaska - This should be in a place right now where it's safe to merge; and rigour has seen a bit of action on our end to make sure it's not totally messed up. Would appreciate your feedback!

tillprochaska · 2024-02-28T10:26:31Z

@pudo Hey, sorry for the late reply. I got confirmation from @brrttwrks that he’s fine with the stricter validation for bic, lei, isin, figi, qid 👍

The only other request from our data team was some logging when validation fails (and values are cleaned from entities) before we implement stricter validation. And possibly the option to disable validation/cleaning on a per-property level in mappings. I’ll try to open a PR that does that today or tomorrow.

Thanks also for restoring IBANs as a separate type. We rely on that in quite a few places in Aleph. I’ll also take another look at the code today.

tillprochaska

Okay, I finally found the time to take a closer look this morning. I found a few places that I think you’ve overlooked when restoring the iban type (or we misunderstood each other?). But apart from that I’m happy for this to be merged!

followthemoney/schema/Analyzable.yaml

followthemoney/schema/BankAccount.yaml

followthemoney/compare.py

followthemoney/types/language.py

followthemoney/types/mimetype.py

followthemoney/types/url.py

tillprochaska

Thanks a lot for restoring the IBAN type in the schemata! This makes this a non-breaking update in Aleph. Sorry for the amount of back-and-forth from my side…

pudo · 2024-04-02T09:02:37Z

No this is definitely an open heart operation, thanks for taking the time to work through it!

pudo added 3 commits January 14, 2024 13:43

begin to rely on rigour for data validation

250e4c4

clean up type junk

d314b48

add format validation for types

fa290dd

tillprochaska reviewed Jan 15, 2024

pudo added 8 commits January 17, 2024 16:04

use helper functions to get formats

11aa6b6

introduce a .format field on property metadata

fed86ee

add format metadata to a few properties in the schema

27c666b

up the dependency

b658d77

modify range of supported python versions

45cc3be

knocked over this test, lovely

ae2ba31

ok py 3.12

91beb12

test format-based validation for identifiers, copy most iban tests over

5a85093

pudo added 2 commits January 17, 2024 17:54

actually remove registry.iban

84cc118

remove iban typed field

f230126

pudo added 9 commits January 27, 2024 20:58

adopt rigour levenshtein and language codes

b3bd355

adopt pick_name from rigour

560967b

replace pantomime

0381aec

restore IBAN type

9e8a037

up rigour to include MIME support

0f8baf0

some lint

4e4e206

fix name-picking tests

5d42552

adapt iban type signature

9e38cd8

use openpyxl type annotations

3d2c3be

tillprochaska reviewed Feb 29, 2024

pudo added 2 commits March 8, 2024 11:43

re-instate registry.iban in compare

f2f008c

restore iban types in schema

11a7cad

tillprochaska approved these changes Apr 2, 2024

pudo merged commit 6fb47dc into main Apr 2, 2024
9 checks passed

pudo deleted the pudo/rigour-parsers branch April 2, 2024 09:02
