[dataviz] feat: add graph Active contributors grouped by age #1991

Merged on Jul 22, 2024 (9 commits)

@NatNgs (Collaborator) commented on Jun 24, 2024

related issues #1726


Add user-growth plot in Tournesol's streamlit app

Added a new line plot in Streamlit that displays the number of contributors, new users and active users.

A user is "active" in a given week if they made at least one comparison during or before that week, and at least one comparison during or after it.

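One way to read this definition: a contributor is active in every week between the week of their first comparison and the week of their last one, inclusive. The sketch below illustrates that reading in plain Python. It is not the code from this PR; the function names, the sample data, and the choice of Monday as the start of a week are assumptions made for illustration.

```python
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def active_counts(comparisons: list[tuple[str, date]]) -> dict[date, int]:
    """Count active users per week.

    A user is counted in week W when their first comparison happened during
    or before W and their last comparison happened during or after W.
    """
    first: dict[str, date] = {}
    last: dict[str, date] = {}
    for user, day in comparisons:
        week = week_start(day)
        first[user] = min(first.get(user, week), week)
        last[user] = max(last.get(user, week), week)

    if not first:
        return {}

    counts: dict[date, int] = {}
    week = min(first.values())
    end = max(last.values())
    while week <= end:
        counts[week] = sum(1 for user in first if first[user] <= week <= last[user])
        week += timedelta(weeks=1)
    return counts


# Hypothetical sample data: (user, date of a comparison).
comparisons = [
    ("alice", date(2024, 1, 1)),
    ("alice", date(2024, 1, 17)),
    ("bob", date(2024, 1, 9)),
]
counts = active_counts(comparisons)
```

With this reading, a user who made a single comparison is active only in the week of that comparison, and a user stays active during weeks without comparisons as long as they contribute again later.
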
Preview

Checklist

  • I added the related issue(s) id in the related issues section (if any)
  • I described my changes and my decisions in the PR description
  • I read the development guidelines of the CONTRIBUTING.md
  • The tests pass and have been updated if relevant
  • The code quality checks pass

❤️ Thank you for your contribution!

@NatNgs added the Data Visualization and python labels on Jun 24, 2024

@GresilleSiffle (Collaborator) left a comment

To define with the rest of the team:

What is an active contributor?

@NatNgs (Collaborator, Author) commented on Jul 4, 2024

Updated the graph after comments on Discord. It now looks like this:

image

  • Number of categories (=colors, see legend) can be modified easily as needed
  • Choice of colors (as the rainbow gradient here) to be confirmed
  • Also not sure about how I labelled it ("Age of community", "First comparison date = last comparison date")
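As an illustration of how such age categories could be kept easy to change, here is a minimal sketch that maps a contributor's age in weeks to a labelled group. The boundaries, the labels, and the `age_category` helper are hypothetical and are not taken from this PR; changing the number of groups would only mean editing the two lists.

```python
from bisect import bisect_right

# Hypothetical boundaries (in weeks) between age groups, in ascending order.
AGE_BOUNDS_WEEKS = [1, 4, 13, 52]

# One label per group: there is always one more label than boundaries.
AGE_LABELS = ["< 1 week", "1-3 weeks", "4-12 weeks", "13-51 weeks", "52+ weeks"]


def age_category(age_weeks: int) -> str:
    """Return the label of the group that contains `age_weeks`."""
    return AGE_LABELS[bisect_right(AGE_BOUNDS_WEEKS, age_weeks)]


# Example: a contributor whose first comparison was 6 weeks ago.
example_label = age_category(6)
```
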

@GresilleSiffle (Collaborator) commented on Jul 15, 2024

Hi @NatNgs

The new graph is really cool 👌

I ran black -l 99 on the file to make it more readable, and updated some comments and texts to make them more explicit:

  • the title is now "Active contributors grouped by age", to introduce the notion of active contributors;
  • there is an info box explaining what an active contributor is, to avoid any confusion;
  • the label of the Y axis is now "Active contributors" to match the title.

The colours and the groups look good to me for now. Let's check with the rest of the team.

Also not sure about how I labelled it ("Age of community", "First comparison date = last comparison date")

I haven't updated "First comparison date = last comparison date" yet, but I'll think about it.

capture

import streamlit as st
from dateutil.relativedelta import relativedelta
Collaborator

This is not a direct dependency of the project. Should we add it to requirements.txt?

@GresilleSiffle changed the title from "[data-viz] Streamlit: Added users growth graph (#1726)" to "[dataviz] feat: add graph Active contributors grouped by age" on Jul 15, 2024
drop=True
) # Keep only the required data, remove duplicates.

df.week_date = pd.to_datetime(df.week_date, infer_datetime_format=True, utc=True).astype(
Collaborator

Using astype raises a warning in the console, and will raise an exception in the future.

Using .astype to convert from timezone-aware dtype to timezone-naive dtype is deprecated and will raise in a future version. Use obj.tz_localize(None) or obj.tz_convert('UTC').tz_localize(None) instead

I don't know the best practice here (I have nearly zero experience with pandas), but it looks like the astype is used to make methods such as df.week_date.min() and .max() work. It should be possible to keep the timezone aware datetime, right? Maybe by explicitly sorting the column week_date?

If there are no side effects, we may want to explicitly convert the timezone-aware datetimes to naive ones.

In case it helps:

# works
df.week_date = pd.to_datetime(df.week_date, infer_datetime_format=True, utc=True)
df.week_date.cat.as_ordered().max()

# doesn't work
df.week_date = pd.to_datetime(df.week_date, infer_datetime_format=True, utc=True)
df.week_date.max()
# *** TypeError: Categorical is not ordered for operation max
# you can use .as_ordered() to change the Categorical to an ordered one
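
For the explicit conversion, the deprecation warning quoted above suggests `tz_localize(None)`. On a Series this is reached through the `.dt` accessor. The sketch below is a minimal example under stated assumptions: the sample data is made up, and the column holds plain strings, whereas the column in the PR appears to be categorical, so the behaviour there may differ.

```python
import pandas as pd

# Hypothetical sample data; the real column comes from the project's dataset.
df = pd.DataFrame({"week_date": ["2024-01-01", "2024-01-08", "2024-01-15"]})

# Parse the strings as timezone-aware UTC timestamps.
aware = pd.to_datetime(df["week_date"], utc=True)

# Drop the timezone explicitly instead of going through .astype(...).
df["week_date"] = aware.dt.tz_localize(None)

# min() and max() work on the resulting timezone-naive datetime column.
first_week = df["week_date"].min()
last_week = df["week_date"].max()
```

Because the timestamps are already in UTC, dropping the timezone keeps the same wall-clock values.
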

@lfaucon (Member) left a comment

Thanks @NatNgs

@GresilleSiffle merged commit 9f72b4e into main on Jul 22, 2024
6 checks passed
@GresilleSiffle deleted the 1726-streamlit-users-growth-plot branch on July 22, 2024 09:27