Opened 12 years ago
Closed 11 years ago
#2030 closed defect (done)
Improve Hive Server performance
Reported by: | ascheibe | Owned by: | ascheibe |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.9 |
Component: | Hive.Server | Version: | 3.3.8 |
Keywords: | Cc: |
Description
The Hive Server has performance issues which should be analyzed and fixed.
Change History (57)
comment:1 Changed 12 years ago by ascheibe
comment:2 Changed 12 years ago by ascheibe
- Owner changed from ascheibe to pfleck
- Status changed from new to assigned
comment:3 Changed 12 years ago by pfleck
- Status changed from assigned to accepted
comment:4 Changed 12 years ago by pfleck
r9368 branched trunk
comment:5 Changed 12 years ago by pfleck
r9369 Aborted transactions are logged into file. Enabled PerformanceCounter for Service-Operations. New Slaves are assigned to HEAL ResourceGroup automatically.
comment:6 Changed 12 years ago by pfleck
r9372 Added Sleep-BenchmarkAlgorithm.
comment:7 Changed 12 years ago by pfleck
r9373 Added MultiSlaveRunner for starting multiple Slave Cores on a single Machine. Mocked GUID generation for slaves to random GUID. Deactivated PerformanceCounter measurements from ConfigManager to avoid shared resource locks.
comment:8 Changed 12 years ago by pfleck
r9374 Adjust namespace for project.
comment:9 Changed 12 years ago by pfleck
Activated Delayed Loading for binary data.
Added HiveOperationContext to store HiveDataContext for whole ServiceOperation duration.
Added HiveDao methods to query database objects, not DTOs.
Changed HartbeatManager to use only database objects from new queries.
comment:10 Changed 12 years ago by pfleck
Replaced lazy loading with specialized queries.
Compiled queries used for Heardbeat queries.
Changed result types to IQueryable<T> for later query modification.
comment:11 Changed 12 years ago by pfleck
Separated old DTO-Dao from new Dao. DTO-Dao should be replaced completely.
Heartbeat and UpdateTaskState uses new Dao.
DataContext is now closed on ServiceOperation end.
comment:12 Changed 12 years ago by pfleck
Changed Linq.Binary to byte array to avoid hash computation.
DataContext in HiveOperationContext is now lazy initialized.
Added missing HiveDao from last commit failure.
comment:13 Changed 12 years ago by pfleck
Fixed namespace bug in ConfigFile.
Disabled OutOfCoreExeption and OutOfMemoryException for MultiSlave-testing.
Added random waiting time for HartbeatManager to start to avoid simultaneously heartbeats for MultiSlave-testing.
comment:14 Changed 12 years ago by pfleck
Changed recursive Linq2Sql queries to native SQL queries.
comment:15 Changed 12 years ago by pfleck
Removed unnecessary UpdatePlugins in UpdateTask.
Optimized GetTask and GetPlugin with compiled queries.
comment:16 Changed 12 years ago by pfleck
Added SelfHost-Project.
Renamed HiveDtoDao back to HiveDao and renamed the optimized HiveDao into OptimizedDao instead.
Optimized AddTask by using compiled queries.
comment:17 Changed 12 years ago by pfleck
Switched Text encoding to Mtom encoding for better performance for binary data.
Merged trunk into branch.
comment:18 Changed 12 years ago by pfleck
Enabled Tcp Binding for Hive Service on Server.
Notes:
comment:19 Changed 12 years ago by pfleck
Configured Tcp binding for Hive service on Hive clients.
Added binding-configuration priority list in Hive settings for managing multiple binding-configurations.
Added retry mechanism for multiple Hive-service bindings.
comment:20 Changed 12 years ago by pfleck
Task- and Plugin-data is stored in the file system instead of the database. This is accomplished by the Filestream feature of the SQL-Server.
Fixed bug when Heal-Group does not exist.
Note: SqlServer configuration changes are necessary for Filestream feature http://www.codeproject.com/Articles/128657/How-Do-I-Use-SQL-File-Stream
comment:21 Changed 12 years ago by pfleck
Fixed Bugs in Shutdown Queries.
Added SQL script for migrating to filestream based storage of task and plugin data.
To remove files after data removal turned on Auto-Shrink and set recovery-model to Simple. Also set a checkpoint in the database.
comment:22 Changed 12 years ago by pfleck
Added filestream access level configuration in SQL scripts.
comment:23 Changed 12 years ago by ascheibe
- Owner changed from pfleck to ascheibe
- Status changed from accepted to reviewing
comment:24 Changed 12 years ago by ascheibe
r9539 merged changes of trunk into branch
comment:25 Changed 12 years ago by pfleck
Manual for WCF OperationContext.
http://hyperthink.net/blog/a-simple-ish-approach-to-custom-context-in-wcf/
comment:26 Changed 11 years ago by ascheibe
r9616 merged trunk into hive performance branch
comment:27 follow-up: ↓ 33 Changed 11 years ago by ascheibe
I've had a look at the OptimizedHiveDao and here are my comments (I will have a look at the rest later):
- Please update the year in the license headers and plugin files from 2012 to 2013.
- Please switch all app.configs back to the services.heuristiclab.com url and keys.
- Remove the unused #regions from the OptimizedHiveDao and IOptimizedHiveDao.
- Get rid of HiveOperationContext because I want to be able use the improved methods also by directly referencing the assembly and not using the webservice. This would not be possible when using the HiveOperationContext. So if there is not a major performance penalty (I read that datacontext is cheap?) could you please remove it.
- The method GetTaskByIdAndLastStateLogSlaveIdQuery is used by the scheduler for retrieving the task and the last statelog to check if the slave that is calculating the task is the correct one. What I don't understand is why GetTaskByIdAndLastStateLogSlaveIdQuery groups the statelogs by task as the query only retrieves one task? Also, if that is true, then the query can only return a single tuple, so please use Single() and not SingleOrDefault() to make it more clear.
- I assume the improvement of the new GetLightweightTasks is that it is a compiled query and you left some stuff out which is unnecessary. Am I right? The only thing that scares me a little bit is that you do not order the statelogs by date. Is this really not necessary?
comment:28 follow-up: ↓ 35 Changed 11 years ago by ascheibe
This is the rest of the review for the server:
- Revert the changes in TransactionManager as these were just for debugging purposes and should not get into the trunk.
- In HiveService.UpdateTaskState(..) please remove the source line that you commented, you're right, we don't need to do this.
- Remove the code in HiveService.Hello() that adds new slaves to the HEAL group.
- In HeartbeatManager line 77: Is this null check really necessary? If not please remove it.
- Please rename AssignJob to AssignTask in HeartbeatManager, I forgot to rename it back then...
I also reviewed the changes in the Hive Wiki and have one comment:
- "Go to Windows Services and add NetTcpActivator and NetTcpPortSharing services": I wouldn't say "add" but rather "start" or "activate". Also, as it is not possible to do port sharing between tcp and http and we only need one tcp port, I assume we don't have to activate port sharing?
Please replace the old Hive Server Setup Wiki page with the new one.
comment:29 Changed 11 years ago by ascheibe
This is the last bunch of review comments (this time for the changes in the client):
- HiveServiceLocator: Please add public getters for endpointRetries and workingEndpoint, maybe we want to display this information in the future somewhere.
- HiveServiceLocator: Please make the constructor private, this seems to have been overlooked in the past.
- In the app.configs/Settings for the clients please set the MaxParallelDownloads and the MaxParallelUploads to 2.
- I think we should set MaxEndpointRetries to 3 as this is should be enough to verify that tcp does not work.
- Please also reset the message encoding of the wshttp binding to text so that we have the old wshttp endpoint as a fallback for old HL clients.
comment:30 Changed 11 years ago by pfleck
- Owner changed from ascheibe to pfleck
- Status changed from reviewing to assigned
comment:31 Changed 11 years ago by pfleck
Updated license year in headers and AssemblyInfo.
Removed empty regions in OptimizedHiveDao and IOptimizedHiveDao.
The ServiceLocator handles creation of the OptimizedHiveDao different to make the OptimizedHiveDao available for non-service operation calls:
- if the ServiceLocator is called _within_ a service operation, the same DB-context is used for the whole service operation, and therefore is stored in the service operation context.
- if the ServiceLocator is called _outside_ a service operation, a new DB-context is created on each ServiceLocator call.
comment:32 Changed 11 years ago by pfleck
Restructured and simplified GetTaskByIdAndLastStateLogSlaveIdQuery.
Sorted StateLogs in GetLightweightTasksQuery.
comment:33 in reply to: ↑ 27 Changed 11 years ago by pfleck
Replying to ascheibe:
I've had a look at the OptimizedHiveDao and here are my comments (I will have a look at the rest later):
- Please update the year in the license headers and plugin files from 2012 to 2013.
- Please switch all app.configs back to the services.heuristiclab.com url and keys.
- Remove the unused #regions from the OptimizedHiveDao and IOptimizedHiveDao.
- Get rid of HiveOperationContext because I want to be able use the improved methods also by directly referencing the assembly and not using the webservice. This would not be possible when using the HiveOperationContext. So if there is not a major performance penalty (I read that datacontext is cheap?) could you please remove it.
- The method GetTaskByIdAndLastStateLogSlaveIdQuery is used by the scheduler for retrieving the task and the last statelog to check if the slave that is calculating the task is the correct one. What I don't understand is why GetTaskByIdAndLastStateLogSlaveIdQuery groups the statelogs by task as the query only retrieves one task? Also, if that is true, then the query can only return a single tuple, so please use Single() and not SingleOrDefault() to make it more clear.
- I assume the improvement of the new GetLightweightTasks is that it is a compiled query and you left some stuff out which is unnecessary. Am I right? The only thing that scares me a little bit is that you do not order the statelogs by date. Is this really not necessary?
Regarding GetTaskByIdAndLastStateLogSlaveIdQuery, you are right; the query is a bit weird. Changed it to a simpler query.
The StateLogs in GetLightweightTasks sould be sorted as SQL server returns them ordered by insert time. Anyway, added sorting explicitly, just in case.
The improvement is mainly, that the new query result in only one SQL-statement, compared to 2 * number of tasks in job statements at worst case. (the inner query is executed separately each iteration)
comment:34 Changed 11 years ago by pfleck
Added changes as suggested from server review by ascheibe.
comment:35 in reply to: ↑ 28 Changed 11 years ago by pfleck
Replying to ascheibe:
This is the rest of the review for the server:
- Revert the changes in TransactionManager as these were just for debugging purposes and should not get into the trunk.
- In HiveService.UpdateTaskState(..) please remove the source line that you commented, you're right, we don't need to do this.
- Remove the code in HiveService.Hello() that adds new slaves to the HEAL group.
- In HeartbeatManager line 77: Is this null check really necessary? If not please remove it.
- Please rename AssignJob to AssignTask in HeartbeatManager, I forgot to rename it back then...
I also reviewed the changes in the Hive Wiki and have one comment:
- "Go to Windows Services and add NetTcpActivator and NetTcpPortSharing services": I wouldn't say "add" but rather "start" or "activate". Also, as it is not possible to do port sharing between tcp and http and we only need one tcp port, I assume we don't have to activate port sharing?
Please replace the old Hive Server Setup Wiki page with the new one.
Net.Tcp Port sharing is required by Net.Tcp itself, therefore the service is required, although it is not used for port sharing directly. (http://galratner.com/blogs/net/archive/2010/10/08/setting-up-a-nettcpbinding-enabled-wcf-service-in-iis-7.aspx)
comment:36 Changed 11 years ago by pfleck
In HiveServiceLocator: added public properties for endpointRetries and workingEndpoint.
Changed Config default settings.
Changed wshttp encoding to text for backwards compatibility.
comment:37 Changed 11 years ago by pfleck
Changed endpoint addresses to services.heuristiclab.com and adjust certificate values.
comment:38 Changed 11 years ago by pfleck
Added publish script for Hive service.
comment:39 Changed 11 years ago by pfleck
- Owner changed from pfleck to ascheibe
- Status changed from assigned to reviewing
comment:40 follow-up: ↓ 41 Changed 11 years ago by ascheibe
- Owner changed from ascheibe to pfleck
- Status changed from reviewing to assigned
Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.
comment:41 in reply to: ↑ 40 ; follow-up: ↓ 42 Changed 11 years ago by pfleck
Replying to ascheibe:
Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.
I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)
pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())
The .LoadWith method should fix this bug. I will look into it in the next days.
But I wonder why this problem does not occur with TaskDatas.
comment:42 in reply to: ↑ 41 ; follow-up: ↓ 45 Changed 11 years ago by ascheibe
Replying to pfleck:
Replying to ascheibe:
Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.
I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)
pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())
The .LoadWith method should fix this bug. I will look into it in the next days.
But I wonder why this problem does not occur with TaskDatas.
Hmmm, actually I'm not sure about this. GetPluginDatas did not change in your branch, so this method is the same in trunk and in the HivePerformance branch. The activation of lazy loading is something that I already integrated into the trunk and works there. So I think the error is probably somewhere else?
comment:43 Changed 11 years ago by ascheibe
r9655 merged trunk into hive performance branch
comment:44 Changed 11 years ago by ascheibe
r9656 updated the web.config of services with the new tcp binding
comment:45 in reply to: ↑ 42 Changed 11 years ago by pfleck
Replying to ascheibe:
Replying to pfleck:
Replying to ascheibe:
Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.
I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)
pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())
The .LoadWith method should fix this bug. I will look into it in the next days.
But I wonder why this problem does not occur with TaskDatas.
Hmmm, actually I'm not sure about this. GetPluginDatas did not change in your branch, so this method is the same in trunk and in the HivePerformance branch. The activation of lazy loading is something that I already integrated into the trunk and works there. So I think the error is probably somewhere else?
The problem was that the Hive Server could not fetch the plugin data from the database, because there weren't any.
Actually this should never happen, but I think it happened somehow during server restart while I was migrating the database to filestream-enhanced tables.
Deleting and re-uploading the corrupted plugins solved the problem, however calculating tasks which depend on those plugins, had to be deleted too.
comment:46 Changed 11 years ago by pfleck
- Owner changed from pfleck to ascheibe
- Status changed from assigned to reviewing
comment:47 Changed 11 years ago by ascheibe
- Status changed from reviewing to assigned
comment:48 Changed 11 years ago by ascheibe
r9663 merged trunk into branch
comment:49 Changed 11 years ago by ascheibe
r9665 merged hive performance branch back into trunk
comment:50 Changed 11 years ago by ascheibe
- added missing tcp config to windows service
- fixed MaxParallelDownloads
- increased receive/send timeouts
comment:51 Changed 11 years ago by ascheibe
r9675 added description to db migration script
comment:52 Changed 11 years ago by ascheibe
- Status changed from assigned to reviewing
comment:53 Changed 11 years ago by ascheibe
- Status changed from reviewing to readytorelease
I have already reviewed this ticket.
comment:54 Changed 11 years ago by ascheibe
comment:55 Changed 11 years ago by ascheibe
r9701 deleted HivePerformance branch
comment:56 Changed 11 years ago by ascheibe
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.9
- Version changed from branch to 3.3.8
comment:57 Changed 11 years ago by ascheibe
- Resolution set to done
- Status changed from readytorelease to closed
r9349 added branch