Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improve analytics #11714

Merged
merged 9 commits into from
Nov 25, 2020
Merged

Conversation

mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Nov 16, 2020

SUMMARY

Adding a few fields to what's logged:

  • path: the path portion of the url ie /superset/explore_json/
  • path_no_int: the path portion of the url, replace int portions /api/v1/chart/<int>/
  • ref: a reference to the Python object ie: ChartRestApi.post

Also instrumenting more events in /superset/views/core.py, and the dashboard and chart APIs

Screen Shot 2020-11-16 at 8 53 25 AM

@codecov-io
Copy link

codecov-io commented Nov 16, 2020

Codecov Report

Merging #11714 (db3870d) into master (25345be) will decrease coverage by 2.11%.
The diff coverage is 90.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11714      +/-   ##
==========================================
- Coverage   65.67%   63.56%   -2.12%     
==========================================
  Files         905      449     -456     
  Lines       43862    27663   -16199     
  Branches     4201        0    -4201     
==========================================
- Hits        28806    17583   -11223     
+ Misses      14915    10080    -4835     
+ Partials      141        0     -141     
Flag Coverage Δ
cypress ?
javascript ?
python 63.56% <90.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...grations/versions/a8173232b786_add_path_to_logs.py 0.00% <0.00%> (ø)
superset/utils/log.py 93.13% <94.73%> (-0.42%) ⬇️
superset/annotation_layers/api.py 83.05% <100.00%> (+0.74%) ⬆️
superset/cachekeys/api.py 98.07% <100.00%> (ø)
superset/charts/api.py 78.78% <100.00%> (+0.66%) ⬆️
superset/css_templates/api.py 95.65% <100.00%> (+0.19%) ⬆️
superset/dashboards/api.py 84.31% <100.00%> (+0.64%) ⬆️
superset/databases/api.py 87.70% <100.00%> (+0.30%) ⬆️
superset/datasets/api.py 91.94% <100.00%> (+0.27%) ⬆️
superset/models/core.py 88.67% <100.00%> (-0.18%) ⬇️
... and 465 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25345be...db3870d. Read the comment docs.

@mistercrunch
Copy link
Member Author

@dpgaspar, looking for ways to cover things like get_list in base_api, but also want to prevent double-logging. Looking for input on this.

@mistercrunch
Copy link
Member Author

@dpgaspar this is ready for review!

Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left some comments and the duplicate needs to be removed

superset/utils/log.py Show resolved Hide resolved
superset/views/core.py Outdated Show resolved Hide resolved
superset/views/core.py Show resolved Hide resolved
superset/views/core.py Show resolved Hide resolved
Copy link
Member

@dpgaspar dpgaspar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left one last comment

superset/datasets/api.py Outdated Show resolved Hide resolved
@mistercrunch
Copy link
Member Author

@dpgaspar , I think I addressed all the comments

@mistercrunch mistercrunch merged commit 0504cf1 into apache:master Nov 25, 2020
@mistercrunch mistercrunch deleted the improve_analytics branch November 25, 2020 16:45
@ktmud
Copy link
Member

ktmud commented Nov 26, 2020

Hmm, sorry for being late to the party. But I think path_no_int should've been path_no_param because ideally you'd want to remove all parameters (e.g. slug) even if they are not int. Also, it would be more robust to get that path from URL pattern (either via the _url attribute added by @expose or manually passing it to the logger decorator) instead of manually cleaning up the string.

@john-bodley
Copy link
Member

john-bodley commented Dec 3, 2020

@mistercrunch, @dpgaspar, et al. would you be able to provide more as what the new fields added to the logs are (or will be) used for? The reason I ask is at Airbnb we use a MySQL database and the DDL operations locks the table which is problematic (especially for a large and frequently written to table like the logs table). Without planned downtime (possibly significant) the migration will not complete and the application is at risk of becoming inoperable (if/when a lock is obtained).

Also did you consider using the json column for storing the extra metadata? This could be useful especially if these fields are experimental and no DB migration would be required. Finally storing the path_no_int (either via a column or in a JSON encoded string) seems somewhat kludgy as this could be easily obtained from the path field using regular expressions etc. (which is available in most databases) and would reduce the payload size.

cc: @bkyryliuk (unsure if Dropbox uses MySQL).

@john-bodley
Copy link
Member

john-bodley commented Dec 3, 2020

@robdiciuccio or @willbarrett do either you (or anyone at Preset) know the motivations for these changes? Wrangling this migration has already cost Airbnb quite some time and our current work around (to prevent significant downtime) is a costly and timely operation.

Prior to rolling out this migration to production we just want to be sure that this change is really needed and permanent in nature, i.e., we would hate to have to perform a similar migration in the near future to rollback this logic.

@robdiciuccio
Copy link
Member

@john-bodley looking into it and will respond shortly

@mistercrunch
Copy link
Member Author

path seems very useful to log, and added more fields while at it, ref is also helpful for self-documenting the event and provide more context than the function name

Clearly this PR has 2 main parts:

  1. richer logging (more fields)
  2. logging more events

It sounds like db migration on 1. is the issue for you. How many events do you have on this table? Can we help with the migration?

@mistercrunch
Copy link
Member Author

@ktmud I agree about the path_no_param naming & approach comment, but I think identifying params may require a bit more instrumentation and work.

About reverting (if that's on the table), we kind of need to alter the upgrade to be upgrade script to be noop AND add a downgrade script that would remove the columns if they exist as people are already on this migration.

@mistercrunch mistercrunch added the risk:db-migration PRs that require a DB migration label Dec 3, 2020
@etr2460
Copy link
Member

etr2460 commented Dec 3, 2020

@mistercrunch we have ~190 million events in the table currently, and while we may have a solution to migrate it, I think we've historically tried to avoid major changes to the logs table since migrations can be quite costly. The last migration before yours that touched the table was in 2018, and it already includes the json column where we could easily add any new metadata to here.

If we do think the approach i mentioned here makes sense (putting your new fields in the json column). Then perhaps we can fix forward with a new migration, and include details on how to skip past the 2 migrations adding and removing the new columns in UPDATING.md

@john-bodley
Copy link
Member

john-bodley commented Dec 3, 2020

In general I sense we should not be storing columns (or fields in JSON objects, etc.) which can easily be derived from other columns, i.e., path_no_int from path. This adds unnecessary bloat especially when the use case is not well defined.

As I mentioned in this comment and as @etr2460 also stated using the json column seems like a better place to store experimental fields (or those subject to change) especially when they importance is secondary. For example the json column could include a request field with all the nested request properties (arguments, referrer, path, etc.). This provides a level of hierarchy/organization and flexibility.

@ktmud
Copy link
Member

ktmud commented Dec 3, 2020

@mistercrunch I think for things as heavy as that requiring a db migration, a little bit more instrumentation is warranted. I'd vote for revert before this goes out to more users.

My two cents on the value of the added columns:

  1. As John commented, path_no_int is not very useful because you can easily do the same RegExp matching and cleaning up in SQL (not the case for path_no_param that uses the actual route pattern, though).
  2. The other two columns are actually under the same category as they could both have a 1:1 mapping to the existing column action. Since it's rare to run analytic queries on the production database because most likely you would analyze the logs in some kind of data warehouse, a little bit extra ETL work is probably more favorable in the spirit of keeping the production database lean and performant.

I agree both path and ref (maybe call this object_ref to disambiguate with referrer?) are useful information. It's also valuable to record point-at-time information even if we could manually curate the mapping. So adding them to json sounds like a good idea.

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:db-migration PRs that require a DB migration size/L 🚢 1.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants