fix: Remove expensive logs table migration #11920

Merged
merged 1 commit into from
Dec 4, 2020

Conversation

etr2460
Member

@etr2460 etr2460 commented Dec 4, 2020

SUMMARY

Addresses issues with #11714 that prevented the migration from being run on MySQL dbs with large logs tables.

Does the following:

  • Updates the original migration to remove the add column statements
  • Adds a new migration to remove the columns if they exist
  • Pushes the new columns into the json column in the logs table

Does not:

  • Address any naming concerns about the new columns

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screen Shot 2020-12-03 at 4 45 49 PM

TEST PLAN

  • run the original migration, then upgrade past my new migration without errors
  • run the updated original migration, then upgrade past my new migration without errors
  • run the original migration, then downgrade without errors
  • run the updated original migration, then downgrade without errors
  • run a query and see the new fields in the json column of the logs table

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

to: @mistercrunch @john-bodley @dpgaspar @ktmud @graceguo-supercat

@etr2460 etr2460 added the risk:db-migration PRs that require a DB migration label Dec 4, 2020
@codecov-io

codecov-io commented Dec 4, 2020

Codecov Report

Merging #11920 (14fd7fc) into master (e0288bf) will decrease coverage by 0.04%.
The diff coverage is 2.63%.


@@            Coverage Diff             @@
##           master   #11920      +/-   ##
==========================================
- Coverage   67.68%   67.64%   -0.05%     
==========================================
  Files         930      932       +2     
  Lines       45132    45161      +29     
  Branches     4331     4331              
==========================================
+ Hits        30549    30550       +1     
- Misses      14480    14508      +28     
  Partials      103      103              
Flag        Coverage Δ
cypress     54.94% <ø> (+0.05%) ⬆️
javascript  63.15% <ø> (ø)
python      64.16% <2.63%> (-0.08%) ⬇️


Impacted Files                                         Coverage Δ
superset/migrations/shared/utils.py                    0.00% <0.00%> (ø)
...ons/versions/811494c0cc23_remove_path_from_logs.py  0.00% <0.00%> (ø)
...grations/versions/a8173232b786_add_path_to_logs.py  0.00% <0.00%> (ø)
superset/models/core.py                                88.85% <ø> (-0.10%) ⬇️
superset/utils/log.py                                  93.20% <100.00%> (+0.06%) ⬆️
...ponents/AdhocFilterEditPopoverSimpleTabContent.jsx  81.81% <0.00%> (-1.82%) ⬇️
...perset-frontend/src/messageToasts/actions/index.ts  96.15% <0.00%> (+7.69%) ⬆️
...et-frontend/src/messageToasts/components/Toast.tsx  100.00% <0.00%> (+8.33%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@etr2460 etr2460 force-pushed the erik-ritter--remove-logs-columns branch from a5a8eb2 to f89b9da Compare December 4, 2020 00:59
@@ -224,6 +224,9 @@ def log(  # pylint: disable=too-many-arguments,too-many-locals
        logs = list()
        for record in records:
            json_string: Optional[str]
            record.update(
                {"path": path, "path_no_int": path_no_int, "ref": ref,}
            )
Member

This should probably happen in log_context so you don't have to pass these arguments around.

Member Author

done

@etr2460 etr2460 force-pushed the erik-ritter--remove-logs-columns branch 2 times, most recently from 1f221c4 to 6cf8428 Compare December 4, 2020 01:19
    insp = reflection.Inspector.from_engine(engine)
    has_column = False
    try:
        for col in insp.get_columns(table):
Member

@john-bodley john-bodley Dec 4, 2020

Maybe,

return any(col["name"] == column for col in insp.get_columns(table))

This would replace line 37 and lines 40 - 43.

Member Author

fyi: https://stackoverflow.com/a/52865284

but i'm realizing now, this is a pretty janky function. let me fix it
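
A minimal sketch of the one-line check suggested above, written against a duck-typed inspector so it runs without a database. The `ColumnInspector` protocol and the `FakeInspector` class are assumptions made for illustration; in the migration the object would presumably be the SQLAlchemy inspector from the snippet, whose `get_columns(table)` returns dicts with a `"name"` key. The helper in the PR is called with only the table and column names, so its real signature differs from this one.

```python
from typing import Any, Dict, List, Protocol


class ColumnInspector(Protocol):
    """Anything exposing get_columns(), e.g. a SQLAlchemy Inspector."""

    def get_columns(self, table_name: str) -> List[Dict[str, Any]]:
        ...


def table_has_column(insp: ColumnInspector, table: str, column: str) -> bool:
    """Return True if `table` has a column named `column`."""
    return any(col["name"] == column for col in insp.get_columns(table))


class FakeInspector:
    """Hypothetical stand-in used only to exercise the helper."""

    def __init__(self, schema: Dict[str, List[str]]) -> None:
        self._schema = schema

    def get_columns(self, table_name: str) -> List[Dict[str, Any]]:
        # Unknown tables yield no columns in this fake; a real inspector
        # may raise instead, which this sketch does not handle.
        return [{"name": name} for name in self._schema.get(table_name, [])]


insp = FakeInspector({"logs": ["id", "action", "json", "path"]})
has_path = table_has_column(insp, "logs", "path")
has_ref = table_has_column(insp, "logs", "ref")
```

Because `any()` stops at the first match, the flag variable and explicit loop from the original snippet are no longer needed.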

@etr2460 etr2460 force-pushed the erik-ritter--remove-logs-columns branch from 6cf8428 to 23cf625 Compare December 4, 2020 02:33
def upgrade():
    with op.batch_alter_table("logs") as batch_op:
        if utils.table_has_column("logs", "path"):
            batch_op.drop_column("path")
Member

@etr2460 this isn't quite right. If the column exists, one should really iterate over the non-NULL values and migrate those to the json column. That could be a somewhat expensive operation, but given the relevancy (or lack thereof) and recency of the original migration it may be ok to accept the data loss.

Member Author

yeah, i made the executive decision to drop any extra metadata that came in in the last week from the previous change. I think it's reasonable since i doubt many folks are relying on them thus far

Member

I think it's ok to drop the metadata
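
For reference only, since the thread settled on dropping the data: a rough sketch of what the alternative (copying non-NULL `path` values into the `json` column before dropping the column) could look like. It uses an in-memory SQLite table through the standard library rather than Alembic, and the `backfill_path_into_json` name, the table layout, and the sample rows are all assumptions, not code from this PR.

```python
import json
import sqlite3


def backfill_path_into_json(conn: sqlite3.Connection) -> int:
    """Copy non-NULL `path` values into each row's JSON payload.

    Returns the number of rows rewritten.
    """
    rows = conn.execute(
        "SELECT id, json, path FROM logs WHERE path IS NOT NULL"
    ).fetchall()
    updated = 0
    for row_id, raw_json, path in rows:
        try:
            payload = json.loads(raw_json) if raw_json else {}
        except ValueError:
            continue  # leave unparseable payloads untouched
        if not isinstance(payload, dict):
            continue  # only JSON objects can take an extra key
        payload.setdefault("path", path)
        conn.execute(
            "UPDATE logs SET json = ? WHERE id = ?",
            (json.dumps(payload), row_id),
        )
        updated += 1
    conn.commit()
    return updated


# Toy table standing in for the real logs table (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, json TEXT, path TEXT)")
conn.executemany(
    "INSERT INTO logs (json, path) VALUES (?, ?)",
    [
        ('{"a": 1}', "/superset/dashboard/1/"),
        (None, "/chart/list/"),
        ('{"b": 2}', None),
    ],
)
updated = backfill_path_into_json(conn)
first_payload = json.loads(
    conn.execute("SELECT json FROM logs WHERE id = 1").fetchone()[0]
)
```

On a large logs table this row-by-row rewrite is exactly the kind of expensive operation the comment above warns about, which is one reason accepting the data loss can be the more practical choice.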

Member

@mistercrunch mistercrunch left a comment

LGTM, thank you for picking up the pieces here.

For the record, listing out some reasons why we don't recommend using the DBEventLogger at scale:

  • analytics event logging happens in the main thread and adds a tax to every logged request (slows down the web server a bit)
  • puts significant extra load on the metadata database, which is already busy with the OLTP workload
  • the logs table quickly becomes the largest table in your database (by far!), making backup and restore operations much more expensive and time-consuming than they should be

For the record, at Preset we use Segment and configure it to send our logs to BigQuery.
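
To make the first two points concrete, here is a hedged sketch of one way to keep event logging off the request thread and out of the metadata database: requests only enqueue an event, and a background worker ships it to some external sink. `BackgroundEventLogger` and its `sink` callable are hypothetical names, not Superset's event logger interface and not the Segment setup mentioned above.

```python
import json
import queue
import threading
from typing import Any, Callable, Dict, List


class BackgroundEventLogger:
    """Hypothetical logger: callers enqueue, a worker thread ships events."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        self._queue: "queue.Queue[Dict[str, Any]]" = queue.Queue()
        self._sink = sink
        # Daemon thread so it never blocks interpreter shutdown.
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, action: str, **payload: Any) -> None:
        """Cheap for the caller: just put the event on the queue."""
        self._queue.put({"action": action, **payload})

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            try:
                self._sink(json.dumps(event, sort_keys=True))
            finally:
                self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued event has been handed to the sink."""
        self._queue.join()


# The sink here is a list; in practice it would forward to an external store.
shipped: List[str] = []
logger = BackgroundEventLogger(shipped.append)
logger.log("dashboard_view", path="/superset/dashboard/1/")
logger.flush()
```

The trade-off is that events still in the queue are lost if the process dies, which may be acceptable for UX analytics but less so for audit records.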

@etr2460 etr2460 force-pushed the erik-ritter--remove-logs-columns branch from 23cf625 to 89bd58e Compare December 4, 2020 04:48
@ktmud
Member

ktmud commented Dec 4, 2020

I think it's useful to keep logs for things like object creation/deletion for auditing purposes. But for UX analytics and simple access logs, we should probably target moving it out of the metadata database?

Should we create two separate loggers for this?

Member

@ktmud ktmud left a comment

LGTM with one naming suggestion.

        payload.update(
            {
                "path": request.path,
                "path_no_int": strip_int_from_path(request.path),
Member

Can we rename this to path_no_param, in case we want to update the logic in strip_int_from_path as discussed in #11714?

Member Author

as mentioned in the summary, this PR:

Does not:

  • Address any naming concerns about the new columns

I'd rather keep this PR just scoped to fixing the migration issue. We can fast follow with naming changes if we all agree they're preferred.

Member

I think path_no_param was agreed on. Just thought it's better to make that change sooner so we don't have any dirty data.

Member

@ktmud ktmud Dec 4, 2020

But if you just merge this after CI is green, I can make another followup PR soon.

Member

Feel free to completely drop path_no_int and the strip_int_from_path method

Member

@mistercrunch mistercrunch Dec 4, 2020

I'm also fine with ref -> object_ref. I'm happy to do that in a fast-follow PR too.

Member

Naming is hard. I'm a fan of measure twice cut once, as renaming things (or dealing with legacy names) in the future is always somewhat painful.

Member Author

ah, i see the path_no_param agreement from the previous pr. tbh, CI is sooo slow tonight i probably won't merge this until tomorrow. So i'll make the change now.
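
For readers following the naming discussion in this thread, a small sketch of the kind of payload enrichment being reviewed, using the agreed `path_no_param` key. The body of `strip_int_from_path` is a guess at what such a helper might do (the actual implementation is not shown in this conversation), and the `<int>` placeholder and `add_path_fields` wrapper are assumptions.

```python
from typing import Any, Dict, Optional


def strip_int_from_path(path: Optional[str]) -> str:
    """Replace purely numeric '/'-separated segments with a placeholder."""
    if not path:
        return ""
    return "/".join(
        "<int>" if part.isdigit() else part for part in path.split("/")
    )


def add_path_fields(payload: Dict[str, Any], path: str) -> Dict[str, Any]:
    """Store path metadata in the JSON payload instead of table columns."""
    payload.update({"path": path, "path_no_param": strip_int_from_path(path)})
    return payload


payload = add_path_fields({"action": "dashboard"}, "/superset/dashboard/12/")
```

Keeping these fields inside the JSON payload means a later rename only touches the logging code, not the logs table schema.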

@etr2460 etr2460 force-pushed the erik-ritter--remove-logs-columns branch from 89bd58e to 14fd7fc Compare December 4, 2020 05:33
@etr2460
Member Author

etr2460 commented Dec 4, 2020

Since CI is green, i'm going to merge this now. Feel free to follow up with any naming PRs today or next week

@etr2460 etr2460 merged commit 77d362d into master Dec 4, 2020
graceguo-supercat pushed a commit to airbnb/superset-fork that referenced this pull request Dec 4, 2020
@ktmud
Member

ktmud commented Dec 4, 2020

Since CI is green, i'm going to merge this now. Feel free to follow up with any naming PRs today or next week

@etr2460 @mistercrunch #11927 is ready for review

@amitmiran137 amitmiran137 deleted the erik-ritter--remove-logs-columns branch March 29, 2021 18:11
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:db-migration PRs that require a DB migration size/L 🚢 1.0.0
5 participants