Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintain insert order of property values #1148

Merged
merged 5 commits into from
Jul 12, 2023
Merged

Conversation

tillprochaska
Copy link
Contributor

Fix #1139

@pudo
Copy link
Contributor

pudo commented Jun 28, 2023

Just so it's on record: this won't live through ftm-store properly, and I'm not sure that can be made to work.

@tillprochaska
Copy link
Contributor Author

tillprochaska commented Jun 28, 2023

Just so it's on record: this won't live through ftm-store properly, and I'm not sure that can be made to work.

@pudo Just to make sure I don’t misunderstand you: I understand that if property values are stored in separate fragments, this PR won’t solve that issue, because order isn’t guaranteed when selecting the individual fragments from the database. That means even after merging this PR, the order of extracted text in Pages entities won’t reflect the order in the source document. That’s ok for now and not my main concern.

I’m currently trying to solve a few issues related to previews of multipart emails in Aleph. The plaintext and HTML parts of the email are stored in a single fragment. For example, there would be a single ingest-origin fragment like this:

{
  "schema": "Email",
  "properties": {
    "bodyText": ["First plaintext part", "Second plaintext part"],
    "bodyHtml": ["First HTML part", "second HTML part"],
  }
}

As far as I can see, that means that the property values are retrieved in the original order during indexing, then indexed in the original order in ES. Once we retrieve the entity data from ES and instantiate an FtM proxy object in Aleph, the order gets mixed up. That’s what I’m trying to fix.

Does that make sense? (I might still overlook something here, but I think at least for this particular problem, ftm-store shouldn’t be the problem.)

values = set(value)
self._properties[key] = values
self._size += sum([len(v) for v in values])
self.add(key, values, quiet=True)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously setting _properties directly in case the data is already cleaned. For the sake of simplicity I’m calling the add here too. Not sure if this has any disadvantages? (I assume the previous implementation was to avoid unnecessary method calling/loop overhead?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is all super perf optimised. I'd sort of consider keeping it unrolled, this gets called millions and millions of times e.g. when a mapping is generated and saving three lines in exchange for a few hours on the other end may be bad math.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense, I’ll change that :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed another commit that inlines this again and ran a very basic benchmark based on the contrib/perf_create_proxy.py script, creating 2,000,000 proxies with cleaned values. Surprisingly, the inlined version is ~1,4 times slower than the "unoptimized" version (which is only slightly slower than the current FtM release).

Might need to run a slightly more advanced benchmark though to exclude other factors as the reason for this.

f61f739

Took 10.915786981582642
Took 10.810703992843628
Took 10.882683038711548
Took 10.802056074142456
Took 10.878925085067749

00bb791

Took 15.575160026550293
Took 15.75855827331543
Took 15.749438047409058
Took 15.530592918395996
Took 15.721524953842163

Current followthemoney release

Took 10.602813005447388
Took 10.623345851898193
Took 10.726281881332397
Took 10.77332091331482
Took 10.679194211959839

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wooo, weird. I guess we technically don't need to set but it feels a bit hygienic....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but thinking about it a bit more: you can't set there. It's called every time the thing is fished out of the database, which would make this whole PR not persistent....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, not sure I understand? The set is just used to check if the value already exists, but order is still preserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, ran a slightly more sophisticated benchmark. The results suggest that inlining initialization of property values into the constructor indeed has a significant impact. It’s still slower than the currently released implementation, but it’s less worse (only 25% slowdown vs >100% slowdown).

@pudo The numbers below are for 10x creating 1,000,000 proxies. The slowdown sounds tolerable to me, what do you think?

Screen Shot 2023-07-11 at 12 50 57

I’m not an expert in benchmarking/profiling, so this might still be a little flawed. Ran this in Docker containers for ease of testing different Python versions, but can see similar results when running it on the host machine directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah all the non-add ones look totally fine :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pudo Perfect, thanks!

And sorry for bothering you again, but would you mind clarifying what you meant here. I’m not 100% sure I understand what you mean. The set is only used to test whether a value already occurred before, but the values are added to the list of values in their original order.

Ah but thinking about it a bit more: you can't set there. It's called every time the thing is fished out of the database, which would make this whole PR not persistent....

@tillprochaska tillprochaska merged commit b4d51f4 into main Jul 12, 2023
8 checks passed
tillprochaska added a commit to alephdata/ingest-file that referenced this pull request Jul 12, 2023
We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
tillprochaska added a commit to alephdata/aleph that referenced this pull request Jul 12, 2023
We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
tillprochaska added a commit to alephdata/ingest-file that referenced this pull request Jul 13, 2023
We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
tillprochaska added a commit to alephdata/aleph that referenced this pull request Jul 13, 2023
* Derive "safeHtml" from all "bodyHtml" values

* Bump followthemoney to 3.4.4

We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maintain insert order for multi-valued properties
3 participants