Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unit status not properly synced with long-running generators #1038

Open
PReithofer opened this issue Jul 31, 2023 · 1 comment
Open

Unit status not properly synced with long-running generators #1038

PReithofer opened this issue Jul 31, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@PReithofer
Copy link
Contributor

PReithofer commented Jul 31, 2023

As I understand it, the only place where the unit status can transition to AssignmentState.COMPLETED is in
mephisto/data_model/unit.py in get_status(). This method needs to be called to keep the agent status and unit status in sync. However, when checking if a worker can work on another assignment, this is probably not called. Normally this is not an issue because the operator regularly (indirectly) calls this here:

async def _track_and_kill_runs(self):
...
if not tracked_run.force_shutdown:
    if tracked_run.task_launcher.finished_generators is False:
        # If the run can still generate assignments, it's
        # definitely not done
        continue
    task_run = tracked_run.task_run
    task_run.update_completion_progress(
        task_launcher=tracked_run.task_launcher
    )
    if not task_run.get_is_completed(): # <- here
        continue
...

However, when the using a long-running generator, it can happen that a unit finishes before the generator does.
If you want to try this, on this branch I replaced the static_task_data with the following generator:

def slow_generator():
    import time
    i=0
    while(True):
        i+=1
        time.sleep(3)
        yield {"text": f"This is assignment {i}"}

One quick fix would be to simply move the emptiness check of the generator below get_is_completed see #1039. But a cleaner solution would probably be to either

  • always sync the states when accessing the assignment status (i.e. also when checking if workers can take on more units) or
  • directly set this when the agent status changes.

EDIT: this only happens for screening units!

@JackUrb
Copy link
Contributor

JackUrb commented Aug 1, 2023

Thanks for all the details here @PReithofer - this is an interesting bug you've stumbled on, as the limitations for workers' active tasks requires information in the MephistoDB to be up-to-date, but indeed only a get_status() call actually pushes that status update.

Your quick fix in #1039 seems OK to me, though it definitely brings up questions on how we manage state syncing correctly. Indeed I've left this up to the _track_and_kill_runs thread, but it may be time to rethink, we haven't used the generator pattern extensively.

@meta-paul meta-paul added bug Something isn't working question Further information is requested and removed question Further information is requested labels Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants