Opened 5 years ago

Closed 4 years ago

#1950 closed enhancement (done)

Improve Hive UI stability

Reported by: abeham
Owned by: abeham
Priority: medium
Milestone: HeuristicLab 3.3.8
Component: Hive.Client
Version: 3.3.8
Keywords:
Cc:

Description (last modified by abeham)

The Job Manager in Hive should be made more stable.

Change History (35)

comment:1 Changed 5 years ago by abeham

I profiled the application overnight and tried to recreate the situations in which I had previously experienced errors. I could not reproduce the crash, but the profiling revealed that the culprit of the extreme CPU usage is UpdateStateLogList() in RefreshableHiveJobView.Content_StateLogListChanged(object, EventArgs). According to the profiler, a lot of time is spent in the setter of the Content property. Because event handlers are attached and removed every time the setter is called, System.ComponentModel.EventHandlerList.RemoveHandler(object, Delegate) is reported as the function doing the most individual work. Could it be that there's a handler that is not deregistered, so that this list grows bigger and bigger?

When closing the JobManager view, the CPU usage dropped to normal levels.

comment:2 Changed 5 years ago by ascheibe

  • Status changed from new to accepted

comment:3 Changed 5 years ago by ascheibe

r8656 removed redundant event handler additions and removals

comment:4 Changed 5 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from accepted to reviewing

Thanks a lot for profiling this! I found that the event handlers for the charts in StateLogGanttChartListView were set every time the content was changed. I think this could be the cause of the high CPU usage. Could you profile the Job Manager again and tell me whether this was the problem?
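
The registration pattern discussed here can be sketched as follows. This is a minimal illustration, not the actual HeuristicLab code: JobContent and JobView are hypothetical names, and only the event name StateLogListChanged is borrowed from the ticket. The setter removes the handler from the old content before adding it to the new one, and does nothing when the same object is assigned again.

```csharp
using System;

// Hypothetical content type; stands in for a job that exposes a change event.
public class JobContent {
  public event EventHandler StateLogListChanged;

  public void RaiseStateLogListChanged() {
    StateLogListChanged?.Invoke(this, EventArgs.Empty);
  }
}

// Hypothetical view; shows symmetric (de)registration in the Content setter.
public class JobView {
  private JobContent content;
  public int UpdateCount { get; private set; }

  public JobContent Content {
    get { return content; }
    set {
      if (ReferenceEquals(content, value)) return; // same object: no handler churn
      if (content != null) content.StateLogListChanged -= Content_StateLogListChanged;
      content = value;
      if (content != null) content.StateLogListChanged += Content_StateLogListChanged;
    }
  }

  private void Content_StateLogListChanged(object sender, EventArgs e) {
    UpdateCount++; // a real view would refresh its controls here
  }
}
```

With this shape, assigning the same content twice leaves exactly one handler registered, so the invocation list cannot grow over time.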

comment:5 follow-up: Changed 5 years ago by abeham

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

It seems the performance problem is gone, but when I upload a job and leave it (with "Refresh automatically" on), I get multiple exceptions like these over time:

InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at HeuristicLab.Optimization.RunCollection.OnItemsAdded(IEnumerable`1 items) in c:\HL3\trunk\sources\HeuristicLab.Optimization\3.3\RunCollection.cs:line 167
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.GetAllRunsFromJob(RefreshableJob job) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 536
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.Content_TaskReceived(Object sender, EventArgs e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 187
   at HeuristicLab.Clients.Hive.RefreshableJob.jobResultPoller_JobResultReceived(Object sender, EventArgs`1 e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\RefreshableJob.cs:line 312
   at HeuristicLab.Clients.Hive.JobResultPoller.<FetchJobResults>b__0(IHiveService service) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 111
   at HeuristicLab.Clients.Hive.HiveServiceLocator.CallHiveService[T](Func`2 call) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\HiveServiceLocator.cs:line 64
   at HeuristicLab.Clients.Hive.JobResultPoller.RunPolling() in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 91

comment:6 in reply to: ↑ 5 Changed 5 years ago by abeham

Replying to abeham:

I noticed that this resulted in several tasks that didn't have an execution history and which had their state set to "Offline", but which have calculated all runs.

comment:7 Changed 5 years ago by ascheibe

r8848 fixed an InvalidOperationException in the Hive Job Manager

comment:8 Changed 5 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

comment:9 Changed 5 years ago by abeham

  • Description modified (diff)
  • Summary changed from HeuristicLab crashes with stack overflow exception to Improve Hive UI stability

I changed the title and description of the ticket since I couldn't reproduce the original StackOverflowException.

But I noticed that I received multiple EndpointNotFoundExceptions when I had the JobManager open, put the computer to sleep, and woke it up with the network unplugged. A job was shown and set to be refreshed automatically. Admittedly, this is expected and a rare case. The problem was that this produced a new exception dialog every second. The exception should either be swallowed, with no action performed while not connected, or a state should be remembered that indicates that the computer is not connected and that the exception was already shown.

EndpointNotFoundException: There was no endpoint listening at http://services.heuristiclab.com/Hive-3.3/HiveService.svc that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.

Server stack trace: 
   at System.ServiceModel.Security.IssuanceTokenProviderBase`1.DoNegotiation(TimeSpan timeout)
   at System.ServiceModel.Security.SspiNegotiationTokenProvider.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Security.TlsnegoTokenProvider.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Security.SymmetricSecurityProtocol.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.SecurityChannelFactory`1.ClientSecurityChannel`1.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionSecurityTokenProvider.DoOperation(SecuritySessionOperation operation, EndpointAddress target, Uri via, SecurityToken currentToken, TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionSecurityTokenProvider.GetTokenCore(TimeSpan timeout)
   at System.IdentityModel.Selectors.SecurityTokenProvider.GetToken(TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionClientSettings`1.ClientSecuritySessionChannel.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)
   at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at HeuristicLab.Clients.Hive.IHiveService.GetLightweightJobTasks(Guid jobId)
   at HeuristicLab.Clients.Hive.HiveServiceClient.GetLightweightJobTasks(Guid jobId) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\ServiceClients\HiveServiceClient.cs:line 1972
   at HeuristicLab.Clients.Hive.JobResultPoller.<FetchJobResults>b__0(IHiveService service) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 109
   at HeuristicLab.Clients.Hive.HiveServiceLocator.CallHiveService[T](Func`2 call) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\HiveServiceLocator.cs:line 64
   at HeuristicLab.Clients.Hive.JobResultPoller.RunPolling() in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 91
-----
WebException: The remote name could not be resolved: 'services.heuristiclab.com'
   at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
   at System.Net.HttpWebRequest.GetRequestStream()
   at System.ServiceModel.Channels.HttpOutput.WebRequestHttpOutput.GetOutputStream()

comment:10 Changed 4 years ago by ascheibe

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

comment:11 Changed 4 years ago by ascheibe

r8869 fixed multiple EndpointNotFoundExceptions in the HiveJobManager

comment:12 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

If a job is refreshed automatically and the connection to the server is lost, the Hive Job Manager now displays the error message only once and then stops polling. The "Refresh automatically" checkbox also gets unchecked, so the user gets feedback. Once the user is connected to the internet again, the checkbox can be checked again, and polling and task downloading continue.
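
The behavior described above can be sketched roughly like this. It is an illustration under assumptions, not the code from r8869: OncePoller and its members are hypothetical names, and a timer or background thread is assumed to call Poll() repeatedly.

```csharp
using System;

// Hypothetical poller; the names are illustrative and not taken from JobResultPoller.
public class OncePoller {
  private readonly Action fetch;

  public bool IsPolling { get; private set; }
  public event Action<Exception> ExceptionOccurred;

  public OncePoller(Action fetch) {
    this.fetch = fetch;
  }

  // Called when the user (re)enables automatic refreshing.
  public void Start() {
    IsPolling = true;
  }

  // One polling step; assumed to be driven by a timer or a background loop.
  public void Poll() {
    if (!IsPolling) return;
    try {
      fetch();
    } catch (Exception ex) {
      IsPolling = false;             // stop first so that no further errors are raised
      ExceptionOccurred?.Invoke(ex); // report the failure exactly once
    }
  }
}
```

A view could subscribe to ExceptionOccurred, show a single dialog, and uncheck its "Refresh automatically" checkbox; checking it again would call Start().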

comment:13 Changed 4 years ago by ascheibe

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

comment:14 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

r8871

  • fixed a bug where runs were downloaded multiple times
  • fixed a bug where "Refresh automatically" wasn't disabled when the job was finished
  • added a lock around the code that integrates downloaded optimizers, as in rare cases collections were modified by multiple threads, which led to an exception
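
A minimal sketch of this kind of locking, assuming a hypothetical SharedRuns class rather than the real RunCollection: the point is that the thread that adds downloaded runs and the code that reads them must take the same lock, otherwise the reader can still hit "Collection was modified" while enumerating.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a run collection shared between a polling thread and a view.
public class SharedRuns {
  private readonly object locker = new object();
  private readonly List<string> runs = new List<string>();

  // Called by the polling thread when results have been downloaded.
  public void AddRange(IEnumerable<string> downloaded) {
    lock (locker) {
      runs.AddRange(downloaded);
    }
  }

  // Called by the reader: copy under the lock, then enumerate the copy without holding it.
  public List<string> Snapshot() {
    lock (locker) {
      return new List<string>(runs);
    }
  }
}
```

Handing out a copy keeps the lock short; locking only on the writer side would not protect a reader that enumerates the underlying list directly.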

comment:15 Changed 4 years ago by abeham

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

I reviewed the changes, thx for improving the stability, much appreciated.

Nevertheless, as discussed, change r8848 could probably be reverted. I think the exception has to occur within the view, and thus the lock isn't effective.

comment:16 Changed 4 years ago by ascheibe

r8884 reverted changes of r8848 as this doesn't make any sense

comment:17 Changed 4 years ago by ascheibe

r8914

  • fixed nasty permission exception: refresh automatically can only be activated after the job is actually uploaded to the server
  • fixed setting refresh automatically correctly when switching between jobs

comment:18 Changed 4 years ago by ascheibe

r8939

  • added more aggressive locking so that the views don't read run collections that get modified in the meantime
  • start downloading of tasks after the job has been uploaded completely
  • fixed exceptions that got thrown when waiting for the threads that upload the tasks

comment:19 Changed 4 years ago by ascheibe

r8993 fixed downloading of the wrong number of runs

comment:20 Changed 4 years ago by ascheibe

r8994 added locking for assigning runs to the run collection view

comment:21 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

comment:22 Changed 4 years ago by abeham

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

I found another bug while using it. Maybe you can have a look at it. I did not have a user set in my HL instance (-> anonymous). I opened the administrator and got notified that I have no user set. I then set username and password and hit refresh in the administrator view. I got an exception. I opened the job manager, but all my jobs were shown. I then opened user information from the Access menu, hit refresh and HL crashed. I hope you can reproduce this.

Starting the optimizer with a valid username doesn't show any problems. Something is probably not updated when the username changes. Maybe it's cached in the view, I don't know.

comment:23 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

r9063 fixed some errors that occurred when the user name was not set correctly

comment:24 follow-up: Changed 4 years ago by abeham

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned
  • Type changed from defect to enhancement

Reviewed r8656, r8848, r8869, and r8871 once more.

I have some questions regarding r8869. There you have the following code:

while (!stopRequested) { .. }
if (stopRequested) { .. }

Wouldn't the if condition always evaluate to true? If there's a possibility that it could go false (stopRequested is not a local variable after all) then waitHandle wouldn't be closed. Also, the waitHandle wouldn't close if AutoResumeOnException was set to true and an exception was fired.
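
One way to make sure the handle is closed on every exit path is a try/finally around the loop. The following is only a sketch under assumptions: PollingLoop, Run, and HandleClosed are hypothetical, and only the names stopRequested and waitHandle come from the code in question.

```csharp
using System;
using System.Threading;

// Hypothetical polling loop; field names follow the ticket, the rest is assumed.
public class PollingLoop {
  private volatile bool stopRequested;
  private readonly AutoResetEvent waitHandle = new AutoResetEvent(false);

  public bool HandleClosed { get; private set; }

  public void RequestStop() {
    stopRequested = true;
    waitHandle.Set(); // wake the loop so it can exit without waiting for the interval
  }

  public void Run(Action work, TimeSpan interval) {
    try {
      while (!stopRequested) {
        work();
        waitHandle.WaitOne(interval);
      }
    } finally {
      // Reached on a normal stop and when work() throws.
      waitHandle.Close();
      HandleClosed = true;
    }
  }
}
```

In this sketch RequestStop() must not be called after Run() has returned, because the handle is closed by then.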

In r8914:

  • SuppressEvents would also need to be set in e.g. Content_RefreshAutomaticallyChanged
    • SuppressEvents should also be used in other control event handlers that update the content. Otherwise e.g. nameTextBox.Text = Content.Name triggers nameTextBox_Validating which in turn calls Content.Name = nameTextBox.Text
  • There doesn't seem to be an event that handles changes to IsPrivileged

In r8939

  • OptimizerHiveTask:
    • I do not fully understand UpdateChildHiveTasks, UpdateOptimizerInBatchRun, and UpdateOptimizerInExperiment
      • What is the purpose of the code where the batchrun's repetitions and the child tasks are being compared?
      • In the comment "only set the first optimizer as Optimizer..." what is "the first optimizer"?

r8993:8994: ok

comment:25 Changed 4 years ago by ascheibe

r9097

  • added missing SuppressEvents conditions
  • removed unnecessary check for stopRequested
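
The SuppressEvents guard can be illustrated with a small sketch. It does not use Windows Forms: NamedContent and NameView are hypothetical, and a plain field plus a method stand in for nameTextBox and its event handler. The idea is that updates flowing from the content to the control set the flag, so the control's handler does not write the same value back to the content.

```csharp
using System;

// Hypothetical content object that counts how often its name is written.
public class NamedContent {
  private string name = "";
  public int WriteCount { get; private set; }

  public string Name {
    get { return name; }
    set { name = value; WriteCount++; }
  }
}

// Hypothetical view; a plain field and method stand in for nameTextBox and its handler.
public class NameView {
  private readonly NamedContent content;
  private string textBoxText = "";
  private bool suppressEvents;

  public NameView(NamedContent content) {
    this.content = content;
  }

  // Content -> control: guard the update so the control handler does not write back.
  public void OnContentChanged() {
    suppressEvents = true;
    try {
      SetText(content.Name);
    } finally {
      suppressEvents = false; // always reset, even if the update throws
    }
  }

  // User edit -> control -> content.
  public void UserTypes(string text) {
    SetText(text);
  }

  private void SetText(string text) {
    textBoxText = text;
    TextChangedHandler();
  }

  private void TextChangedHandler() {
    if (suppressEvents) return;
    content.Name = textBoxText;
  }
}
```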

comment:26 in reply to: ↑ 24 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

Replying to abeham:

Reviewed r8656, r8848, r8869, and r8871 once more.

I have some questions regarding r8869. There you have the following code:

while (!stopRequested) { .. }
if (stopRequested) { .. }

Wouldn't the if condition always evaluate to true? If there's a possibility that it could go false (stopRequested is not a local variable after all) then waitHandle wouldn't be closed. Also, the waitHandle wouldn't close if AutoResumeOnException was set to true and an exception was fired.

Thanks, you are right. I have fixed that as well as the missing SuppressEvents conditions.

In r8914:

  • SuppressEvents would also need to be set in e.g. Content_RefreshAutomaticallyChanged
    • SuppressEvents should also be used in other control event handlers that update the content. Otherwise e.g. nameTextBox.Text = Content.Name triggers nameTextBox_Validating which in turn calls Content.Name = nameTextBox.Text
  • There doesn't seem to be an event that handles changes to IsPrivileged

That's true because it can only be changed by the user and therefore a content event is not necessary.

In r8939

  • OptimizerHiveTask:
    • I do not fully understand UpdateChildHiveTasks, UpdateOptimizerInBatchRun, and UpdateOptimizerInExperiment
      • What is the purpose of the code where the batchrun's repetitions and the child tasks are being compared?
      • In the comment "only set the first optimizer as Optimizer..." what is "the first optimizer"?

Well, I'm also not quite sure, but I will try to explain:

  • UpdateChildHiveTasks: In Hive we have a hierarchical structure of tasks. This is the structure that you can see in the Hive Job Manager. It also contains the actual optimizers (the HL algs, experiments and batchruns). It uses the experiments and batchruns to build this hierarchical structure of tasks and child tasks.
  • UpdateOptimizerInBatchRun: Because batchruns can be executed in parallel, the result of every batchrun is called Run 1 and therefore we change the names of the batchruns. Because Hive does not know anything about batchruns, we have to assemble it ourselves so that the user can view it again.
  • "only set the first optimizer as Optimizer..." means that we have to set the optimizer of the batchrun so that the user can use it in HL (e.g. run the batchrun again). As all optimizers of a batchrun are the same (they get cloned and wrapped in child tasks if parallelization of batchruns is switched on) we simply set the first we get/download.

r8993:8994: ok

comment:27 Changed 4 years ago by ascheibe

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

comment:28 Changed 4 years ago by ascheibe

r9107 fixed handling of IsAllowedPrivileged in the Hive Job Manager

comment:29 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

comment:30 follow-up: Changed 4 years ago by abeham

Let me summarize how I understood the change:

The privilege checkbox enabled state is now determined by the HiveClient.IsAllowedPrivileged property in SetEnabledStateOfControls. Before, it was checked only on job creation if a user has privileged rights and it was not updated later on, e.g. when a privileged right was granted or revoked. Now at least when a refresh is taking place the privileged rights are updated. What I don't understand is the change to the IsAllowedPrivileged property in RefreshableJob. Basically you synchronize IsAllowedPrivileged with Job.IsPrivileged (in one direction). However, is it really correct to set Job.IsPrivileged to true when IsAllowedPrivileged is set to true? I assume I could execute a job as not privileged even though I am allowed to do so. Also I don't quite understand why a RefreshableJob can be allowed privileged execution. The allowance, as far as I understood it, is user-specific, not job-specific. But I assume, that's because the hive slave isn't able to query the permissions of the job's owner and thus this permission is serialized into the job itself.

comment:31 Changed 4 years ago by abeham

  • Owner changed from abeham to ascheibe
  • Status changed from reviewing to assigned

Please have a look at the property IsAllowedPrivileged in RefreshableJob. Also the getter is strange. If IsPrivileged is set to false, then IsAllowedPrivileged would also return false, even though it might actually be allowed.

comment:32 Changed 4 years ago by ascheibe

r9436 removed IsAllowedPrivileged property from RefreshableJob as it's not needed anymore

comment:33 in reply to: ↑ 30 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to abeham
  • Status changed from assigned to reviewing

Replying to abeham:

Let me summarize how I understood the change:

The privilege checkbox enabled state is now determined by the HiveClient.IsAllowedPrivileged property in SetEnabledStateOfControls. Before, it was checked only on job creation if a user has privileged rights and it was not updated later on, e.g. when a privileged right was granted or revoked. Now at least when a refresh is taking place the privileged rights are updated.

Yes, so if a user has the privileged role the checkbox should be enabled when creating jobs.

What I don't understand is the change to the IsAllowedPrivileged property in RefreshableJob. Basically you synchronize IsAllowedPrivileged with Job.IsPrivileged (in one direction). However, is it really correct to set Job.IsPrivileged to true when IsAllowedPrivileged is set to true? I assume I could execute a job as not privileged even though I am allowed to do so. Also I don't quite understand why a RefreshableJob can be allowed privileged execution. The allowance, as far as I understood it, is user-specific, not job-specific. But I assume, that's because the hive slave isn't able to query the permissions of the job's owner and thus this permission is serialized into the job itself.

That change didn't make sense, that's true. I have removed the IsAllowedPrivileged property as it's actually not needed anymore, because we have the Hive client, which holds this information. The information whether a job is privileged or not is now stored in the job. Sadly, this information is not stored on the server in the job table but for each task. Therefore we have to add this value for every task when uploading it, check when downloading whether a task is privileged, and update the job object accordingly. This should be cleaned up in a future version.

comment:34 Changed 4 years ago by abeham

  • Status changed from reviewing to readytorelease

You're right that this property is not needed anymore. Thanks for removing it.

comment:35 Changed 4 years ago by swagner

  • Resolution set to done
  • Status changed from readytorelease to closed
  • Version changed from 3.3.7 to 3.3.8