Opened 4 years ago

Closed 3 years ago

#2153 closed defect (done)

Add a timeout for stopping a Hive task

Reported by: ascheibe
Owned by: ascheibe
Priority: medium
Milestone: HeuristicLab 3.3.10
Component: Hive.Slave
Version: 3.3.9
Keywords:
Cc:

Description (last modified by ascheibe)

Currently, Stop() is called on the task; if the task does not terminate properly, the caller waits indefinitely. There should be a timeout. See Executor.cs, line 136. A configurable timeout that is already used when starting tasks (ExecutorSemTimeouts) can be reused for this.
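The requested behavior can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Hive.Slave code: `SketchTask`, `stop_with_timeout`, and `EXECUTOR_SEM_TIMEOUT` are hypothetical names, and the timeout value is an arbitrary stand-in for the configurable ExecutorSemTimeouts setting. The idea is to wait on a semaphore with a timeout instead of blocking forever.

```python
import threading

# Hypothetical stand-in for the configurable timeout mentioned in the ticket.
EXECUTOR_SEM_TIMEOUT = 0.2  # seconds, arbitrary value for illustration


class SketchTask:
    """Toy task; `hangs=True` simulates a task that never reacts to stop()."""

    def __init__(self, hangs):
        self.hangs = hangs
        # Released once the task has actually stopped.
        self.stopped = threading.Semaphore(0)

    def stop(self):
        if not self.hangs:
            self.stopped.release()


def stop_with_timeout(task, timeout=EXECUTOR_SEM_TIMEOUT):
    """Request a stop and wait at most `timeout` seconds for confirmation.

    Returns True if the task confirmed the stop, False if the wait timed out.
    """
    task.stop()
    return task.stopped.acquire(timeout=timeout)
```

A caller can treat a `False` result as a failed stop and move on, instead of blocking the slave on a task that never terminates.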

Change History (11)

comment:1 Changed 3 years ago by ascheibe

  • Description modified (diff)

comment:2 Changed 3 years ago by ascheibe

r11082

  • added a new method HandleStartStopPauseError in Executor so that errors during start, stop, and pause are all handled the same way
  • added timeouts for semaphores so that failed tasks or tasks with endless loops don't block the slave
  • removed ExceptionOccured events from Executor/SlaveTask/TaskManager; TaskFailed is used instead
  • removed another ExceptionOccured event in HeartbeatManager that was never used
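The pattern described in this changeset, one shared error handler for start/stop/pause combined with semaphore timeouts and a single failure notification, might look roughly like the following. This is a Python sketch under stated assumptions, not the code from r11082: `SketchExecutor`, `run_guarded`, and the `on_task_failed` callback are hypothetical names chosen to mirror HandleStartStopPauseError and TaskFailed.

```python
import threading


class SketchExecutor:
    """Toy executor that funnels every start/stop/pause error into one handler."""

    def __init__(self, on_task_failed, timeout=0.1):
        self.on_task_failed = on_task_failed  # plays the role of TaskFailed
        self.timeout = timeout                # assumed semaphore timeout in seconds

    def run_guarded(self, action_name, action, done):
        """Run `action`, then wait on `done` for at most `self.timeout` seconds."""
        try:
            action()
            if not done.acquire(timeout=self.timeout):
                raise TimeoutError(f"{action_name} timed out after {self.timeout}s")
            return True
        except Exception as exc:
            # Exceptions and timeouts take the same path.
            self._handle_start_stop_pause_error(action_name, exc)
            return False

    def _handle_start_stop_pause_error(self, action_name, exc):
        # Single place that reports the failure; no separate exception event.
        self.on_task_failed(action_name, exc)
```

With one handler and one notification, a hanging task and a throwing task are reported identically, which is the duplication this changeset set out to remove.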

comment:3 Changed 3 years ago by ascheibe

  • Status changed from new to accepted

I removed ExceptionOccured because it has more or less the same semantics as TaskFailed. The plan was to raise ExceptionOccured when something went wrong while executing a task, and TaskFailed when something went wrong while starting, stopping, or pausing it. As a result, both were handled similarly and code was duplicated. The only difference was that ExceptionOccured rescheduled the task, while TaskFailed prevented the task from being run again. But as #2154 mentions, that is not desired anyway.

comment:4 Changed 3 years ago by ascheibe

  • Owner changed from ascheibe to mkommend
  • Status changed from accepted to reviewing

comment:5 Changed 3 years ago by mkommend

  • Owner changed from mkommend to ascheibe
  • Status changed from reviewing to assigned

r11082 looks OK.

However, I tried to test the changes by uploading a job that immediately throws an exception. The execution time no longer increases (it stays at 0.1s), but the job is still shown as calculating after about 10 minutes.

comment:6 Changed 3 years ago by mkommend

Btw, I have shared the hanging hive job with you.

comment:7 Changed 3 years ago by ascheibe

  • Owner changed from ascheibe to mkommend
  • Status changed from assigned to reviewing

r11113 fixed the assembly file version lookup so that it also works in sandboxes. FileVersionInfo.GetVersionInfo(..) needs LinkDemand, which we don't allow in a Hive sandbox, and therefore throws an exception. This leads to tasks that get rescheduled or just stay paused on the slave and never get sent back to the server.

comment:8 Changed 3 years ago by ascheibe

I have installed the new version of the slave on blade01, you can use it for testing.

comment:9 Changed 3 years ago by ascheibe

r11117 changed exception type

comment:10 Changed 3 years ago by mkommend

  • Owner changed from mkommend to ascheibe
  • Status changed from reviewing to readytorelease

comment:11 Changed 3 years ago by ascheibe

  • Resolution set to done
  • Status changed from readytorelease to closed

r11121: merged r11082, r11113, r11117 into stable
