Opened 5 months ago

Closed 2 months ago

#2784 closed defect (done)

Pausing of Hive tasks does not work

Reported by: jkarder Owned by: jkarder
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Hive.General Version: 3.3.14
Keywords: Cc:

Description

Currently, when a Hive user pauses a task, a new state log entry is created. The slave ID for this entry is null. When the next heartbeat of the calculating slave is received, the service concludes that the assigned slave (none, according to the state log) does not match the slave which has sent the heartbeat and commands the slave to abort the task.
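The failure mode described above can be illustrated with a minimal Python sketch. This is not the actual HeuristicLab Hive code (which is written in C#); `StateLog`, `Task`, `process_heartbeat_buggy`, and the command string `"AbortTask"` are hypothetical names chosen only to mirror the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StateLog:
    state: str                # e.g. "Calculating" or "Paused"
    slave_id: Optional[str]   # None when the entry was created by a user action


@dataclass
class Task:
    task_id: str
    state_logs: List[StateLog]


def process_heartbeat_buggy(task: Task, heartbeat_slave_id: str) -> Optional[str]:
    """Return a command for the slave, or None if nothing needs to be done."""
    # The assigned slave is read from the most recent state log entry.
    assigned_slave_id = task.state_logs[-1].slave_id
    if assigned_slave_id != heartbeat_slave_id:
        # After a user pause the latest entry has no slave, so this branch is
        # taken even though the heartbeat comes from the calculating slave.
        return "AbortTask"
    return None


task = Task("t1", [StateLog("Calculating", "slave-A")])
before_pause = process_heartbeat_buggy(task, "slave-A")

# The user pauses the task: a new state log entry without a slave ID is added.
task.state_logs.append(StateLog("Paused", None))
after_pause = process_heartbeat_buggy(task, "slave-A")
```

In this sketch the heartbeat from `slave-A` is accepted before the pause and answered with an abort afterwards, which matches the behavior reported in the description.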

Change History (11)

comment:1 Changed 5 months ago by jkarder

  • Status changed from new to accepted

comment:2 Changed 5 months ago by jkarder

r14901: fixed pausing of hive tasks

comment:3 Changed 5 months ago by jkarder

Pausing now leads to a PauseTask message being sent in response to the slave's next heartbeat.
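A rough Python sketch of the behavior described in this comment follows. How r14901 implements it is not stated in the ticket, so the pending `command` field, the `pause_task` helper, and the string values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    state: str
    slave_id: Optional[str]          # slave currently calculating the task
    command: Optional[str] = None    # hypothetical: pending user request


def pause_task(task: Task) -> None:
    """User action: remember the pause request instead of dropping the slave."""
    if task.state == "Calculating":
        task.command = "Pause"


def process_heartbeat(task: Task, heartbeat_slave_id: str) -> Optional[str]:
    """Return the message sent in response to a slave's heartbeat."""
    if task.slave_id != heartbeat_slave_id:
        # A different slave claims the task: it should stop working on it.
        return "AbortTask"
    if task.command == "Pause":
        # The assigned slave is told to pause on its next heartbeat.
        return "PauseTask"
    return None


task = Task("t1", "Calculating", "slave-A")
pause_task(task)
response = process_heartbeat(task, "slave-A")
other_response = process_heartbeat(task, "slave-B")
```

The point of the sketch is that the slave assignment stays intact while a pause is pending, so the calculating slave receives a pause message rather than an abort.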

comment:4 Changed 5 months ago by jkarder

  • Owner changed from jkarder to pfleck
  • Status changed from accepted to reviewing

comment:5 Changed 5 months ago by jkarder

r14913: fixed hive engine

comment:6 Changed 3 months ago by pfleck

  • Owner changed from pfleck to jkarder
  • Status changed from reviewing to assigned

Reviewed r14901 and r14913 and also tested the changes.

Tested the following:

  • Uploaded experiment
  • Paused experiment (tasks are downloaded automatically after the pause command was executed on the slaves)
  • Opened the optimizer and changed some parameters (e.g. changed selector)
  • Continued the algorithm locally.
  • Continued the algorithm in hive.
  • Stopped task.
  • Restarted the task (on restarting, all previous results were lost).

I am not sure whether restarting a stopped task should be possible. If so, we should decide whether the previous run should appear in the runs collection of the algorithm, or whether the algorithm should be completely reset with the previous runs deleted (as is done currently).

Pausing a waiting task does not actually pause it. The server sets the state back to waiting because it assumes that the state was set to paused by a slave (due to checkpointing, for example). As discussed with jkarder, the slave itself should not change the state of a task in such a case, so the server does not need to reset it to waiting. That reset is what prevents users from pausing waiting tasks.
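The interaction can be sketched in Python as follows. This is only an illustration of the reasoning in this comment, not the Hive server code; `update_state_old` and `update_state_new` are hypothetical names, and the real change in the repository may look different.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    state: str


def update_state_old(task: Task, requested_state: str) -> str:
    """Old behavior: every pause is assumed to come from a slave."""
    if requested_state == "Paused":
        # Assumed cause: the slave paused the task (e.g. checkpointing),
        # so the server puts the task back into the queue.
        task.state = "Waiting"
    else:
        task.state = requested_state
    return task.state


def update_state_new(task: Task, requested_state: str) -> str:
    """Proposed behavior: slaves no longer set the state, so no reset is needed."""
    task.state = requested_state
    return task.state


old_result = update_state_old(Task("t1", "Waiting"), "Paused")
new_result = update_state_new(Task("t2", "Waiting"), "Paused")
```

With the old logic a user's pause request on a waiting task is silently undone; once the server stops resetting the state, the same request leaves the task paused.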

comment:7 Changed 3 months ago by jkarder

  • Status changed from assigned to accepted

comment:8 Changed 3 months ago by jkarder

r15121: fixed pausing of waiting hive tasks

comment:9 Changed 3 months ago by jkarder

  • Owner changed from jkarder to pfleck
  • Status changed from accepted to reviewing

comment:10 Changed 2 months ago by pfleck

  • Owner changed from pfleck to jkarder
  • Status changed from reviewing to readytorelease

Tested the following:

  • Uploaded Job with a "global" experiment, containing batchruns, containing GAs.
  • Paused some tasks -> optimizer is downloaded.
  • Modified optimizer and resumed it -> resumed settings are used.
  • Stopped, prepared and resumed a paused optimizer -> run is created and the newly prepared algorithm continues as expected.
  • Stopped and resumed a paused task -> run is created, however task fails since the optimizer was not prepared.
  • Stopped the job -> see issues below.

The following issues were encountered during some tests (some are probably caused by the "production" slaves not having the newest changes):

  • Pausing the whole experiment with some tasks already paused restarts the paused tasks.
  • Failed tasks cannot be resumed, although they can be modified (e.g. to fix the reason for the failure). (New ticket in the future)
  • When stopping a job:
    • Tasks that were calculating are now finished (and produce a run).
    • Tasks that were waiting are now aborted (and don't produce a run).
    • To produce a run of a paused optimizer, we would need to deserialize the optimizer and call stop manually. However, I don't think the Hive-Server should do this. (We should create a separate ticket for this).
  • When stopping a job (and waiting for the GUI to automatically refresh/download the tasks), the runs appear in their tasks. A batchrun also contains the runs of its optimizers, and an experiment containing an optimizer (e.g. a batchrun that was not distributed) also contains its run(s). However, an experiment containing a batchrun (which contains runs) does not contain the runs of that batchrun. Only after refreshing the whole job does the experiment also contain the runs of its (distributed) batchruns.

Nice to have features:

  • Pausing a task with child-tasks should pause all child tasks. Currently only the Stop-button is enabled.

The issues were discussed with jkarder and tested on a slave with the newest changes applied. There, most issues (paused tasks restarting, etc.) were already resolved.

comment:11 Changed 2 months ago by jkarder

  • Resolution set to done
  • Status changed from readytorelease to closed

r15262: merged r14901 and r15121 into stable
