Opened 12 years ago
Closed 12 years ago
#1950 closed enhancement (done)
Improve Hive UI stability
Reported by: | abeham | Owned by: | abeham |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.8 |
Component: | Hive.Client | Version: | 3.3.8 |
Keywords: | | Cc: | |
Description (last modified by abeham)
The Job Manager in Hive should be made more stable.
Change History (35)
comment:1 Changed 12 years ago by abeham
comment:2 Changed 12 years ago by ascheibe
- Status changed from new to accepted
comment:3 Changed 12 years ago by ascheibe
r8656 removed redundant event handler additions and removals
comment:4 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from accepted to reviewing
Thanks a lot for profiling this! I found that the event handlers for the charts in StateLogGanttChartListView were set every time the content was changed. I think this could be the cause of the high CPU usage. Could you profile the Job Manager again and tell me whether this was the problem?
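For illustration only, here is a minimal sketch of the handler bookkeeping described above. The types `StateLog` and `StateLogView` are hypothetical stand-ins and not the actual HeuristicLab classes; the sketch only assumes that a view subscribes to an event of its content and that the content is assigned repeatedly.

```csharp
using System;

// Hypothetical stand-in for the content object; not a HeuristicLab type.
public class StateLog {
  public event EventHandler Changed;

  // Number of handlers currently attached, used to detect leaked subscriptions.
  public int SubscriberCount {
    get { return Changed == null ? 0 : Changed.GetInvocationList().Length; }
  }

  public void RaiseChanged() {
    var handler = Changed;
    if (handler != null) handler(this, EventArgs.Empty);
  }
}

// Hypothetical view that keeps exactly one subscription per content object.
public class StateLogView {
  private StateLog content;
  public int Updates { get; private set; }

  public StateLog Content {
    get { return content; }
    set {
      if (ReferenceEquals(content, value)) return;             // same content: nothing to re-wire
      if (content != null) content.Changed -= Content_Changed; // deregister from the old content
      content = value;
      if (content != null) content.Changed += Content_Changed; // register on the new content
    }
  }

  private void Content_Changed(object sender, EventArgs e) {
    Updates++;
  }
}
```

Assigning the same content many times leaves a single subscription, so the handler list cannot grow with every refresh.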
comment:5 follow-up: ↓ 6 Changed 12 years ago by abeham
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
It seems the performance problem is gone, but when I upload a job and leave it open (with "Refresh automatically" switched on), I get multiple exceptions like this one over time:
InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at HeuristicLab.Optimization.RunCollection.OnItemsAdded(IEnumerable`1 items) in c:\HL3\trunk\sources\HeuristicLab.Optimization\3.3\RunCollection.cs:line 167
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.GetAllRunsFromJob(RefreshableJob job) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 536
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.Content_TaskReceived(Object sender, EventArgs e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 187
   at HeuristicLab.Clients.Hive.RefreshableJob.jobResultPoller_JobResultReceived(Object sender, EventArgs`1 e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\RefreshableJob.cs:line 312
   at HeuristicLab.Clients.Hive.JobResultPoller.<FetchJobResults>b__0(IHiveService service) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 111
   at HeuristicLab.Clients.Hive.HiveServiceLocator.CallHiveService[T](Func`2 call) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\HiveServiceLocator.cs:line 64
   at HeuristicLab.Clients.Hive.JobResultPoller.RunPolling() in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 91
comment:6 in reply to: ↑ 5 Changed 12 years ago by abeham
Replying to abeham:
It seems the performance problem is gone, but when I upload a job and leave it open (with "Refresh automatically" switched on), I get multiple exceptions like this one over time:
InvalidOperationException: Collection was modified; enumeration operation may not execute.
   at System.Collections.Generic.List`1.Enumerator.MoveNextRare()
   at HeuristicLab.Optimization.RunCollection.OnItemsAdded(IEnumerable`1 items) in c:\HL3\trunk\sources\HeuristicLab.Optimization\3.3\RunCollection.cs:line 167
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.GetAllRunsFromJob(RefreshableJob job) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 536
   at HeuristicLab.Clients.Hive.JobManager.Views.RefreshableHiveJobView.Content_TaskReceived(Object sender, EventArgs e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive.JobManager\3.3\Views\RefreshableHiveJobView.cs:line 187
   at HeuristicLab.Clients.Hive.RefreshableJob.jobResultPoller_JobResultReceived(Object sender, EventArgs`1 e) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\RefreshableJob.cs:line 312
   at HeuristicLab.Clients.Hive.JobResultPoller.<FetchJobResults>b__0(IHiveService service) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 111
   at HeuristicLab.Clients.Hive.HiveServiceLocator.CallHiveService[T](Func`2 call) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\HiveServiceLocator.cs:line 64
   at HeuristicLab.Clients.Hive.JobResultPoller.RunPolling() in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 91
I noticed that this resulted in several tasks that didn't have an execution history and which had their state set to "Offline", but which have calculated all runs.
comment:7 Changed 12 years ago by ascheibe
r8848 fixed an InvalidOperationException in the Hive Job Manager
comment:8 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
comment:9 Changed 12 years ago by abeham
- Description modified (diff)
- Summary changed from HeuristicLab crashes with stack overflow exception to Improve Hive UI stability
I changed the title and description of the ticket since I couldn't reproduce the original StackOverflowException.
However, I noticed that I received multiple EndpointNotFoundExceptions when I had the JobManager open, put the computer to sleep, and woke it up with the network unplugged. A job was shown and set to be refreshed automatically. Admittedly, the exception itself is expected and this is a rare case. The problem is that it produces a new exception dialog every second. Either the exception should be swallowed and no action performed while not connected, or a state should be remembered which indicates that the computer is not connected and that the exception has already been shown.
EndpointNotFoundException: There was no endpoint listening at http://services.heuristiclab.com/Hive-3.3/HiveService.svc that could accept the message. This is often caused by an incorrect address or SOAP action. See InnerException, if present, for more details.
Server stack trace:
   at System.ServiceModel.Security.IssuanceTokenProviderBase`1.DoNegotiation(TimeSpan timeout)
   at System.ServiceModel.Security.SspiNegotiationTokenProvider.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Security.TlsnegoTokenProvider.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Security.SymmetricSecurityProtocol.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.SecurityChannelFactory`1.ClientSecurityChannel`1.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionSecurityTokenProvider.DoOperation(SecuritySessionOperation operation, EndpointAddress target, Uri via, SecurityToken currentToken, TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionSecurityTokenProvider.GetTokenCore(TimeSpan timeout)
   at System.IdentityModel.Selectors.SecurityTokenProvider.GetToken(TimeSpan timeout)
   at System.ServiceModel.Security.SecuritySessionClientSettings`1.ClientSecuritySessionChannel.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.OnOpen(TimeSpan timeout)
   at System.ServiceModel.Channels.CommunicationObject.Open(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.CallOnceManager.CallOnce(TimeSpan timeout, CallOnceManager cascade)
   at System.ServiceModel.Channels.ServiceChannel.EnsureOpened(TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
   at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
   at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
Exception rethrown at [0]:
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at HeuristicLab.Clients.Hive.IHiveService.GetLightweightJobTasks(Guid jobId)
   at HeuristicLab.Clients.Hive.HiveServiceClient.GetLightweightJobTasks(Guid jobId) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\ServiceClients\HiveServiceClient.cs:line 1972
   at HeuristicLab.Clients.Hive.JobResultPoller.<FetchJobResults>b__0(IHiveService service) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 109
   at HeuristicLab.Clients.Hive.HiveServiceLocator.CallHiveService[T](Func`2 call) in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\HiveServiceLocator.cs:line 64
   at HeuristicLab.Clients.Hive.JobResultPoller.RunPolling() in c:\HL3\trunk\sources\HeuristicLab.Clients.Hive\3.3\JobResultPoller.cs:line 91
-----
WebException: The remote name could not be resolved: 'services.heuristiclab.com'
   at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
   at System.Net.HttpWebRequest.GetRequestStream()
   at System.ServiceModel.Channels.HttpOutput.WebRequestHttpOutput.GetOutputStream()
comment:10 Changed 12 years ago by ascheibe
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
comment:11 Changed 12 years ago by ascheibe
r8869 fixed multiple EndpointNotFoundExceptions in the HiveJobManager
comment:12 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
If a job is refreshed automatically and the connection to the server is lost, the Hive Job Manager now displays the error message only once and then stops polling. The "Refresh automatically" checkbox also gets unchecked, so the user gets feedback. Once the user is online again, the checkbox can be checked and polling and task downloading continue.
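The behaviour described above could be sketched roughly as follows. This is not the code from r8869; `OneShotErrorPoller` and its members are hypothetical names, and the `fetch` delegate merely stands in for the call to the Hive service.

```csharp
using System;

// Hypothetical poller that reports a failure once and then stops polling.
public class OneShotErrorPoller {
  private readonly Func<int> fetch; // stands in for the service call

  public bool RefreshAutomatically { get; set; }
  public int LastResult { get; private set; }
  public event Action<Exception> ExceptionOccurred;

  public OneShotErrorPoller(Func<int> fetch) {
    this.fetch = fetch;
    RefreshAutomatically = true;
  }

  // One polling tick; a timer or polling thread would call this repeatedly.
  public void Poll() {
    if (!RefreshAutomatically) return; // polling is switched off: do nothing
    try {
      LastResult = fetch();
    } catch (Exception ex) {
      RefreshAutomatically = false;    // stop polling; a view would uncheck its checkbox here
      var handler = ExceptionOccurred;
      if (handler != null) handler(ex); // raised once, because later ticks return early
    }
  }
}
```

Setting `RefreshAutomatically` back to true, as the user would by re-checking the box, resumes polling on the next tick.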
comment:13 Changed 12 years ago by ascheibe
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
comment:14 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
- fixed a bug where runs were downloaded multiple times
- fixed a bug where "Refresh automatically" wasn't disabled when the job was finished
- added a lock around the code that integrates downloaded optimizers, because in rare cases collections were modified by multiple threads, which led to an exception
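As a rough illustration of the locking idea in the last item, the following sketch guards a shared list with a lock and lets readers enumerate a snapshot. `SharedRuns` and `LockDemo` are hypothetical names; this is not the actual RunCollection or the code from the changeset.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a run collection shared by a poller thread and a view.
public class SharedRuns {
  private readonly object locker = new object();
  private readonly List<string> runs = new List<string>();

  public void AddRange(IEnumerable<string> items) {
    lock (locker) { runs.AddRange(items); }
  }

  // Readers enumerate a copy, so a concurrent AddRange cannot invalidate their enumerator.
  public List<string> Snapshot() {
    lock (locker) { return new List<string>(runs); }
  }
}

public static class LockDemo {
  // Runs one writer and one reader concurrently and returns the final item count.
  public static int Run() {
    var shared = new SharedRuns();
    var writer = Task.Run(() => {
      for (int i = 0; i < 1000; i++) shared.AddRange(new[] { "Run " + i });
    });
    var reader = Task.Run(() => {
      int characters = 0;
      for (int i = 0; i < 1000; i++) {
        foreach (var run in shared.Snapshot()) characters += run.Length;
      }
      return characters;
    });
    Task.WaitAll(writer, reader);
    return shared.Snapshot().Count;
  }
}
```

A lock of this kind only helps if every reader and writer goes through it; code that enumerates the underlying list directly, for example inside a view, is not protected.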
comment:15 Changed 12 years ago by abeham
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
I reviewed the changes, thx for improving the stability, much appreciated.
Nevertheless, as discussed, change r8848 could probably be reverted. I think the exception occurs within the view, in which case the lock isn't effective.
comment:16 Changed 12 years ago by ascheibe
comment:17 Changed 12 years ago by ascheibe
- fixed nasty permission exception: refresh automatically can only be activated after the job is actually uploaded to the server
- fixed setting refresh automatically correctly when switching between jobs
comment:18 Changed 12 years ago by ascheibe
- added more aggressive locking so that the views don't read run collections that get modified in the meantime
- start downloading of tasks after the job has been uploaded completely
- fixed exceptions that got thrown when waiting for the threads that upload the tasks
comment:19 Changed 12 years ago by ascheibe
r8993 fixed downloading of the wrong number of runs
comment:20 Changed 12 years ago by ascheibe
r8994 added locking for assigning runs to the run collection view
comment:21 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
comment:22 Changed 12 years ago by abeham
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
I found another bug while using it. Maybe you can have a look at it. I did not have a user set in my HL instance (-> anonymous). I opened the administrator and got notified that I have no user set. I then set username and password and hit refresh in the administrator view. I got an exception. I opened the job manager, but all my jobs were shown. I then opened user information from the Access menu, hit refresh and HL crashed. I hope you can reproduce this.
Starting the optimizer with a valid username doesn't show any problems. Something is probably not updated when the username changes. Maybe it's cached in the view, I don't know.
comment:23 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
r9063 fixed some errors that occurred when the user name was not set correctly
comment:24 follow-up: ↓ 26 Changed 12 years ago by abeham
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
- Type changed from defect to enhancement
Reviewed r8656, r8848, r8869, and r8871 once more.
I have some questions regarding r8869. There you have the following code:
while (!stopRequested) { .. } if (stopRequested) { .. }
Wouldn't the if condition always evaluate to true? If there's a possibility that it could go false (stopRequested is not a local variable after all) then waitHandle wouldn't be closed. Also, the waitHandle wouldn't close if AutoResumeOnException was set to true and an exception was fired.
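One way to make the cleanup independent of how the loop ends is a try/finally block. The sketch below is an assumption about the general shape of such a loop, not the code in r8869; `PollingLoop`, `Run`, and `RequestStop` are hypothetical names, and AutoResumeOnException is not modelled.

```csharp
using System;
using System.Threading;

// Hypothetical polling loop whose wait handle is closed on every exit path.
public class PollingLoop {
  private volatile bool stopRequested;
  private readonly AutoResetEvent waitHandle = new AutoResetEvent(false);

  public bool HandleClosed { get; private set; }

  public void RequestStop() {
    stopRequested = true;
    try {
      waitHandle.Set(); // wake the loop so it notices the stop request immediately
    } catch (ObjectDisposedException) {
      // The loop has already finished and closed the handle; nothing to wake.
    }
  }

  public void Run(Action work, TimeSpan interval) {
    try {
      while (!stopRequested) {
        work();
        waitHandle.WaitOne(interval);
      }
    } finally {
      // Runs after a normal stop and after an exception alike,
      // so no separate check of stopRequested is needed.
      waitHandle.Close();
      HandleClosed = true;
    }
  }
}
```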
In r8914:
- SuppressEvents would also need to be set in e.g. Content_RefreshAutomaticallyChanged
- SuppressEvents should also be used in other control event handlers that update the content. Otherwise e.g. nameTextBox.Text = Content.Name triggers nameTextBox_Validating which in turn calls Content.Name = nameTextBox.Text
- There doesn't seem to be an event that handles changes to IsPriviledged
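The round trip described in the second item can be sketched without any UI framework. `JobModel` and `JobViewSketch` are hypothetical stand-ins; `SetText` plays the role of assigning nameTextBox.Text and `OnTextEdited` the role of nameTextBox_Validating.

```csharp
using System;

// Hypothetical content object that counts how often its Name setter is called.
public class JobModel {
  private string name = "";
  public int NameSetCount { get; private set; }
  public event EventHandler NameChanged;

  public string Name {
    get { return name; }
    set {
      NameSetCount++;
      if (name == value) return;
      name = value;
      var handler = NameChanged;
      if (handler != null) handler(this, EventArgs.Empty);
    }
  }
}

// Hypothetical view that guards its control handlers with a SuppressEvents flag.
public class JobViewSketch {
  private readonly JobModel content;
  private string textBoxText = "";

  public bool SuppressEvents { get; private set; }

  public JobViewSketch(JobModel content) {
    this.content = content;
    content.NameChanged += Content_NameChanged;
  }

  // Content -> control: update the control without echoing the value back.
  private void Content_NameChanged(object sender, EventArgs e) {
    SuppressEvents = true;
    try { SetText(content.Name); }
    finally { SuppressEvents = false; }
  }

  // Stands in for assigning the text box, which fires the control's own event.
  public void SetText(string text) {
    textBoxText = text;
    OnTextEdited();
  }

  // Control -> content: skipped while the view itself is updating the control.
  private void OnTextEdited() {
    if (SuppressEvents) return;
    content.Name = textBoxText;
  }
}
```

With the guard in place, a change that originates in the content does not write the same value back into the content.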
In r8939
- OptimizerHiveTask:
- I do not fully understand UpdateChildHiveTasks, UpdateOptimizerInBatchRun, and UpdateOptimizerInExperiment
- What is the purpose of the code where the batchrun's repetitions and the child tasks are being compared?
- In the comment "only set the first optimizer as Optimizer..." what is "the first optimizer"?
r8993:8994: ok
comment:25 Changed 12 years ago by ascheibe
- added missing SuppressEvents conditions
- removed unnecessary check for stopRequested
comment:26 in reply to: ↑ 24 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
Replying to abeham:
Reviewed r8656, r8848, r8869, and r8871 once more.
I have some questions regarding r8869. There you have the following code:
while (!stopRequested) { .. } if (stopRequested) { .. }
Wouldn't the if condition always evaluate to true? If there's a possibility that it could go false (stopRequested is not a local variable after all) then waitHandle wouldn't be closed. Also, the waitHandle wouldn't close if AutoResumeOnException was set to true and an exception was fired.
Thanks, you are right. I have fixed that as well as the missing SuppressEvents conditions.
In r8914:
- SuppressEvents would also need to be set in e.g. Content_RefreshAutomaticallyChanged
- SuppressEvents should also be used in other control event handlers that update the content. Otherwise e.g. nameTextBox.Text = Content.Name triggers nameTextBox_Validating which in turn calls Content.Name = nameTextBox.Text
- There doesn't seem to be an event that handles changes to IsPriviledged
That's true because it can only be changed by the user and therefore a content event is not necessary.
In r8939
- OptimizerHiveTask:
- I do not fully understand UpdateChildHiveTasks, UpdateOptimizerInBatchRun, and UpdateOptimizerInExperiment
- What is the purpose of the code where the batchrun's repetitions and the child tasks are being compared?
- In the comment "only set the first optimizer as Optimizer..." what is "the first optimizer"?
Well, I'm also not quite sure, but I will try to explain:
- UpdateChildHiveTasks: In Hive we have a hierarchical structure of tasks. This is the structure that you can see in the Hive Job Manager. It also contains the actual optimizers (the HL algs, experiments and batchruns). It uses the experiments and batchruns to build this hierarchical structure of tasks and child tasks.
- UpdateOptimizerInBatchRun: Because batchruns can be executed in parallel, the result of every batchrun is called Run 1 and therefore we change the names of the batchruns. Because Hive does not know anything about batchruns, we have to assemble it ourselves so that the user can view it again.
- "only set the first optimizer as Optimizer..." means that we have to set the optimizer of the batchrun so that the user can use it in HL (e.g. run the batchrun again). As all optimizers of a batchrun are the same (they get cloned and wrapped in child tasks if parallelization of batchruns is switched on) we simply set the first we get/download.
r8993:8994: ok
comment:27 Changed 12 years ago by ascheibe
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
comment:28 Changed 12 years ago by ascheibe
r9107 fixed handling of IsAllowedPrivileged in the Hive Job Manager
comment:29 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
comment:30 follow-up: ↓ 33 Changed 12 years ago by abeham
Let me summarize how I understood the change:
The privilege checkbox enabled state is now determined by the HiveClient.IsAllowedPrivileged property in SetEnabledStateOfControls. Before, it was checked only on job creation if a user has privileged rights and it was not updated later on, e.g. when a privileged right was granted or revoked. Now at least when a refresh is taking place the privileged rights are updated. What I don't understand is the change to the IsAllowedPrivileged property in RefreshableJob. Basically you synchronize IsAllowedPrivileged with Job.IsPrivileged (in one direction). However, is it really correct to set Job.IsPrivileged to true when IsAllowedPrivileged is set to true? I assume I could execute a job as not privileged even though I am allowed to do so. Also I don't quite understand why a RefreshableJob can be allowed privileged execution. The allowance, as far as I understood it, is user-specific, not job-specific. But I assume, that's because the hive slave isn't able to query the permissions of the job's owner and thus this permission is serialized into the job itself.
comment:31 Changed 12 years ago by abeham
- Owner changed from abeham to ascheibe
- Status changed from reviewing to assigned
Please have a look at the property IsAllowedPrivileged in RefreshableJob. Also the getter is strange. If IsPrivileged is set to false, then IsAllowedPrivileged would also return false, even though it might actually be allowed.
comment:32 Changed 12 years ago by ascheibe
r9436 removed IsAllowedPrivileged property from RefreshableJob as it's not needed anymore
comment:33 in reply to: ↑ 30 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to abeham
- Status changed from assigned to reviewing
Replying to abeham:
Let me summarize how I understood the change:
The privilege checkbox enabled state is now determined by the HiveClient.IsAllowedPrivileged property in SetEnabledStateOfControls. Before, it was checked only on job creation if a user has privileged rights and it was not updated later on, e.g. when a privileged right was granted or revoked. Now at least when a refresh is taking place the privileged rights are updated.
Yes, so if a user has the privileged role the checkbox should be enabled when creating jobs.
What I don't understand is the change to the IsAllowedPrivileged property in RefreshableJob. Basically you synchronize IsAllowedPrivileged with Job.IsPrivileged (in one direction). However, is it really correct to set Job.IsPrivileged to true when IsAllowedPrivileged is set to true? I assume I could execute a job as not privileged even though I am allowed to do so. Also I don't quite understand why a RefreshableJob can be allowed privileged execution. The allowance, as far as I understood it, is user-specific, not job-specific. But I assume, that's because the hive slave isn't able to query the permissions of the job's owner and thus this permission is serialized into the job itself.
That change didn't make sense, that's true. I have removed the IsAllowedPrivileged property as it's actually not needed anymore, because the Hive client already holds this information. The information whether a job is privileged or not is now stored in the job. Unfortunately, on the server this information is not stored in the job table but for each task. Therefore we have to add this value to every task when uploading it, and when downloading we have to check whether a task is privileged and update the job object accordingly. This should be cleaned up in a future version.
comment:34 Changed 12 years ago by abeham
- Status changed from reviewing to readytorelease
You're right that this property is not needed anymore. Thanks for removing it.
comment:35 Changed 12 years ago by swagner
- Resolution set to done
- Status changed from readytorelease to closed
- Version changed from 3.3.7 to 3.3.8
I profiled the application overnight and tried to recreate situations similar to those in which I had experienced the previous errors. I could not reproduce the crash, but the profiling revealed that the culprit for the extreme CPU usage is UpdateStateLogList() in RefreshableHiveJobView.Content_StateLogListChanged(object, EventArgs). The profiler says it spends a lot of time in the setter of the Content property. Because event handlers are removed and attached every time the setter is called, it reports System.ComponentModel.EventHandlerList.RemoveHandler(object, Delegate) as the function doing most of the individual work. Could it be that there's a handler that is not deregistered, so that this list grows bigger and bigger?
When I closed the JobManager view, the CPU usage dropped to normal levels.