Opened 4 years ago

Closed 4 years ago

#2030 closed defect (done)

Improve Hive Server performance

Reported by: ascheibe Owned by: ascheibe
Priority: medium Milestone: HeuristicLab 3.3.9
Component: Hive.Server Version: 3.3.8
Keywords: Cc:

Description

The Hive Server has performance issues which should be analyzed and fixed.

Change History (57)

comment:1 Changed 4 years ago by ascheibe

r9349 added branch

comment:2 Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to pfleck
  • Status changed from new to assigned

comment:3 Changed 4 years ago by pfleck

  • Status changed from assigned to accepted

comment:4 Changed 4 years ago by pfleck

r9368 branched trunk

comment:5 Changed 4 years ago by pfleck

r9369 Aborted transactions are logged into file. Enabled PerformanceCounter for Service-Operations. New Slaves are assigned to HEAL ResourceGroup automatically.

comment:6 Changed 4 years ago by pfleck

r9372 Added Sleep-BenchmarkAlgorithm.

comment:7 Changed 4 years ago by pfleck

r9373 Added MultiSlaveRunner for starting multiple Slave Cores on a single Machine. Mocked GUID generation for slaves to random GUID. Deactivated PerformanceCounter measurements from ConfigManager to avoid shared resource locks.

comment:8 Changed 4 years ago by pfleck

r9374 Adjust namespace for project.

comment:9 Changed 4 years ago by pfleck

r9381

Activated Delayed Loading for binary data.

Added HiveOperationContext to store HiveDataContext for whole ServiceOperation duration.

Added HiveDao methods to query database objects, not DTOs.

Changed HartbeatManager to use only database objects from new queries.

comment:10 Changed 4 years ago by pfleck

r9385

Replaced lazy loading with specialized queries.

Compiled queries used for Heardbeat queries.

Changed result types to IQueryable<T> for later query modification.

comment:11 Changed 4 years ago by pfleck

r9391

Separated old DTO-Dao from new Dao. DTO-Dao should be replaced completely.

Heartbeat and UpdateTaskState uses new Dao.

DataContext is now closed on ServiceOperation end.

comment:12 Changed 4 years ago by pfleck

r9393

Changed Linq.Binary to byte array to avoid hash computation.

DataContext in HiveOperationContext is now lazy initialized.

Added missing HiveDao from last commit failure.

comment:13 Changed 4 years ago by pfleck

r9394

Fixed namespace bug in ConfigFile.

Disabled OutOfCoreExeption and OutOfMemoryException for MultiSlave-testing.

Added random waiting time for HartbeatManager to start to avoid simultaneously heartbeats for MultiSlave-testing.

comment:14 Changed 4 years ago by pfleck

r9397

Changed recursive Linq2Sql queries to native SQL queries.

comment:15 Changed 4 years ago by pfleck

r9399

Removed unnecessary UpdatePlugins in UpdateTask.

Optimized GetTask and GetPlugin with compiled queries.

comment:16 Changed 4 years ago by pfleck

r9434

Added SelfHost-Project.

Renamed HiveDtoDao back to HiveDao and renamed the optimized HiveDao into OptimizedDao instead.

Optimized AddTask by using compiled queries.

comment:17 Changed 4 years ago by pfleck

r9444

Switched Text encoding to Mtom encoding for better performance for binary data.

Merged trunk into branch.

comment:18 Changed 4 years ago by pfleck

r9460

Enabled Tcp Binding for Hive Service on Server.

Notes:

comment:19 Changed 4 years ago by pfleck

r9461

Configured Tcp binding for Hive service on Hive clients.

Added binding-configuration priority list in Hive settings for managing multiple binding-configurations.

Added retry mechanism for multiple Hive-service bindings.

comment:20 Changed 4 years ago by pfleck

r9469

Task- and Plugin-data is stored in the file system instead of the database. This is accomplished by the Filestream feature of the SQL-Server.

Fixed bug when Heal-Group does not exist.

Note: SqlServer configuration changes are necessary for Filestream feature http://www.codeproject.com/Articles/128657/How-Do-I-Use-SQL-File-Stream

comment:21 Changed 4 years ago by pfleck

r9485

Fixed Bugs in Shutdown Queries.

Added SQL script for migrating to filestream based storage of task and plugin data.

To remove files after data removal turned on Auto-Shrink and set recovery-model to Simple. Also set a checkpoint in the database.

comment:22 Changed 4 years ago by pfleck

r9492

Added filestream access level configuration in SQL scripts.

comment:23 Changed 4 years ago by ascheibe

  • Owner changed from pfleck to ascheibe
  • Status changed from accepted to reviewing

comment:24 Changed 4 years ago by ascheibe

r9539 merged changes of trunk into branch

comment:26 Changed 4 years ago by ascheibe

r9616 merged trunk into hive performance branch

comment:27 follow-up: Changed 4 years ago by ascheibe

I've had a look at the OptimizedHiveDao and here are my comments (I will have a look at the rest later):

  • Please update the year in the license headers and plugin files from 2012 to 2013.
  • Please switch all app.configs back to the services.heuristiclab.com url and keys.
  • Remove the unused #regions from the OptimizedHiveDao and IOptimizedHiveDao.
  • Get rid of HiveOperationContext because I want to be able use the improved methods also by directly referencing the assembly and not using the webservice. This would not be possible when using the HiveOperationContext. So if there is not a major performance penalty (I read that datacontext is cheap?) could you please remove it.
  • The method GetTaskByIdAndLastStateLogSlaveIdQuery is used by the scheduler for retrieving the task and the last statelog to check if the slave that is calculating the task is the correct one. What I don't understand is why GetTaskByIdAndLastStateLogSlaveIdQuery groups the statelogs by task as the query only retrieves one task? Also, if that is true, then the query can only return a single tuple, so please use Single() and not SingleOrDefault() to make it more clear.
  • I assume the improvement of the new GetLightweightTasks is that it is a compiled query and you left some stuff out which is unnecessary. Am I right? The only thing that scares me a little bit is that you do not order the statelogs by date. Is this really not necessary?

comment:28 follow-up: Changed 4 years ago by ascheibe

This is the rest of the review for the server:

  • Revert the changes in TransactionManager as these were just for debugging purposes and should not get into the trunk.
  • In HiveService.UpdateTaskState(..) please remove the source line that you commented, you're right, we don't need to do this.
  • Remove the code in HiveService.Hello() that adds new slaves to the HEAL group.
  • In HeartbeatManager line 77: Is this null check really necessary? If not please remove it.
  • Please rename AssignJob to AssignTask in HeartbeatManager, I forgot to rename it back then...

I also reviewed the changes in the Hive Wiki and have one comment:

  • "Go to Windows Services and add NetTcpActivator and NetTcpPortSharing services": I wouldn't say "add" but rather "start" or "activate". Also, as it is not possible to do port sharing between tcp and http and we only need one tcp port, I assume we don't have to activate port sharing?

Please replace the old Hive Server Setup Wiki page with the new one.

comment:29 Changed 4 years ago by ascheibe

This is the last bunch of review comments (this time for the changes in the client):

  • HiveServiceLocator: Please add public getters for endpointRetries and workingEndpoint, maybe we want to display this information in the future somewhere.
  • HiveServiceLocator: Please make the constructor private, this seems to have been overlooked in the past.
  • In the app.configs/Settings for the clients please set the MaxParallelDownloads and the MaxParallelUploads to 2.
  • I think we should set MaxEndpointRetries to 3 as this is should be enough to verify that tcp does not work.
  • Please also reset the message encoding of the wshttp binding to text so that we have the old wshttp endpoint as a fallback for old HL clients.

comment:30 Changed 4 years ago by pfleck

  • Owner changed from ascheibe to pfleck
  • Status changed from reviewing to assigned

comment:31 Changed 4 years ago by pfleck

r9634

Updated license year in headers and AssemblyInfo.

Removed empty regions in OptimizedHiveDao and IOptimizedHiveDao.

The ServiceLocator handles creation of the OptimizedHiveDao different to make the OptimizedHiveDao available for non-service operation calls:

  • if the ServiceLocator is called _within_ a service operation, the same DB-context is used for the whole service operation, and therefore is stored in the service operation context.
  • if the ServiceLocator is called _outside_ a service operation, a new DB-context is created on each ServiceLocator call.

comment:32 Changed 4 years ago by pfleck

r9635

Restructured and simplified GetTaskByIdAndLastStateLogSlaveIdQuery.

Sorted StateLogs in GetLightweightTasksQuery.

comment:33 in reply to: ↑ 27 Changed 4 years ago by pfleck

Replying to ascheibe:

I've had a look at the OptimizedHiveDao and here are my comments (I will have a look at the rest later):

  • Please update the year in the license headers and plugin files from 2012 to 2013.
  • Please switch all app.configs back to the services.heuristiclab.com url and keys.
  • Remove the unused #regions from the OptimizedHiveDao and IOptimizedHiveDao.
  • Get rid of HiveOperationContext because I want to be able use the improved methods also by directly referencing the assembly and not using the webservice. This would not be possible when using the HiveOperationContext. So if there is not a major performance penalty (I read that datacontext is cheap?) could you please remove it.
  • The method GetTaskByIdAndLastStateLogSlaveIdQuery is used by the scheduler for retrieving the task and the last statelog to check if the slave that is calculating the task is the correct one. What I don't understand is why GetTaskByIdAndLastStateLogSlaveIdQuery groups the statelogs by task as the query only retrieves one task? Also, if that is true, then the query can only return a single tuple, so please use Single() and not SingleOrDefault() to make it more clear.
  • I assume the improvement of the new GetLightweightTasks is that it is a compiled query and you left some stuff out which is unnecessary. Am I right? The only thing that scares me a little bit is that you do not order the statelogs by date. Is this really not necessary?

Regarding GetTaskByIdAndLastStateLogSlaveIdQuery, you are right; the query is a bit weird. Changed it to a simpler query.

The StateLogs in GetLightweightTasks sould be sorted as SQL server returns them ordered by insert time. Anyway, added sorting explicitly, just in case.

The improvement is mainly, that the new query result in only one SQL-statement, compared to 2 * number of tasks in job statements at worst case. (the inner query is executed separately each iteration)

comment:34 Changed 4 years ago by pfleck

r9636

Added changes as suggested from server review by ascheibe.

comment:35 in reply to: ↑ 28 Changed 4 years ago by pfleck

Replying to ascheibe:

This is the rest of the review for the server:

  • Revert the changes in TransactionManager as these were just for debugging purposes and should not get into the trunk.
  • In HiveService.UpdateTaskState(..) please remove the source line that you commented, you're right, we don't need to do this.
  • Remove the code in HiveService.Hello() that adds new slaves to the HEAL group.
  • In HeartbeatManager line 77: Is this null check really necessary? If not please remove it.
  • Please rename AssignJob to AssignTask in HeartbeatManager, I forgot to rename it back then...

I also reviewed the changes in the Hive Wiki and have one comment:

  • "Go to Windows Services and add NetTcpActivator and NetTcpPortSharing services": I wouldn't say "add" but rather "start" or "activate". Also, as it is not possible to do port sharing between tcp and http and we only need one tcp port, I assume we don't have to activate port sharing?

Please replace the old Hive Server Setup Wiki page with the new one.

Net.Tcp Port sharing is required by Net.Tcp itself, therefore the service is required, although it is not used for port sharing directly. (http://galratner.com/blogs/net/archive/2010/10/08/setting-up-a-nettcpbinding-enabled-wcf-service-in-iis-7.aspx)

Last edited 4 years ago by pfleck (previous) (diff)

comment:36 Changed 4 years ago by pfleck

r9637

In HiveServiceLocator: added public properties for endpointRetries and workingEndpoint.

Changed Config default settings.

Changed wshttp encoding to text for backwards compatibility.

comment:37 Changed 4 years ago by pfleck

r9638

Changed endpoint addresses to services.heuristiclab.com and adjust certificate values.

comment:38 Changed 4 years ago by pfleck

r9639

Added publish script for Hive service.

comment:39 Changed 4 years ago by pfleck

  • Owner changed from pfleck to ascheibe
  • Status changed from assigned to reviewing

comment:40 follow-up: Changed 4 years ago by ascheibe

  • Owner changed from ascheibe to pfleck
  • Status changed from reviewing to assigned

Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.

comment:41 in reply to: ↑ 40 ; follow-up: Changed 4 years ago by pfleck

Replying to ascheibe:

Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.

I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)

pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())

The .LoadWith method should fix this bug. I will look into it in the next days.

But I wonder why this problem does not occur with TaskDatas.

comment:42 in reply to: ↑ 41 ; follow-up: Changed 4 years ago by ascheibe

Replying to pfleck:

Replying to ascheibe:

Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.

I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)

pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())

The .LoadWith method should fix this bug. I will look into it in the next days.

But I wonder why this problem does not occur with TaskDatas.

Hmmm, actually I'm not sure about this. GetPluginDatas did not change in your branch, so this method is the same in trunk and in the HivePerformance branch. The activation of lazy loading is something that I already integrated into the trunk and works there. So I think the error is probably somewhere else?

comment:43 Changed 4 years ago by ascheibe

r9655 merged trunk into hive performance branch

comment:44 Changed 4 years ago by ascheibe

r9656 updated the web.config of services with the new tcp binding

comment:45 in reply to: ↑ 42 Changed 4 years ago by pfleck

Replying to ascheibe:

Replying to pfleck:

Replying to ascheibe:

Thanks very much, this looks good! Concerning the possible plugin downloading problem: I have also tried to run a slave on my computer that connects to the test hive. It also doesn't fetch any plugins, so this is really a bug. I think if this is fixed we should be finished with this ticket.

I looked through the code and I think that the lazy loading of the PluginData is not triggered. (See Line 501 in HiveService.cs)

pluginDatas.AddRange(dao.GetPluginDatas(x => x.PluginId == guid).ToList())

The .LoadWith method should fix this bug. I will look into it in the next days.

But I wonder why this problem does not occur with TaskDatas.

Hmmm, actually I'm not sure about this. GetPluginDatas did not change in your branch, so this method is the same in trunk and in the HivePerformance branch. The activation of lazy loading is something that I already integrated into the trunk and works there. So I think the error is probably somewhere else?

The problem was that the Hive Server could not fetch the plugin data from the database, because there weren't any.

Actually this should never happen, but I think it happened somehow during server restart while I was migrating the database to filestream-enhanced tables.

Deleting and re-uploading the corrupted plugins solved the problem, however calculating tasks which depend on those plugins, had to be deleted too.

comment:46 Changed 4 years ago by pfleck

  • Owner changed from pfleck to ascheibe
  • Status changed from assigned to reviewing

comment:47 Changed 4 years ago by ascheibe

  • Status changed from reviewing to assigned

comment:48 Changed 4 years ago by ascheibe

r9663 merged trunk into branch

comment:49 Changed 4 years ago by ascheibe

r9665 merged hive performance branch back into trunk

comment:50 Changed 4 years ago by ascheibe

r9666

  • added missing tcp config to windows service
  • fixed MaxParallelDownloads
  • increased receive/send timeouts

comment:51 Changed 4 years ago by ascheibe

r9675 added description to db migration script

comment:52 Changed 4 years ago by ascheibe

  • Status changed from assigned to reviewing

comment:53 Changed 4 years ago by ascheibe

  • Status changed from reviewing to readytorelease

I have already reviewed this ticket.

comment:54 Changed 4 years ago by ascheibe

r9700 merged r9665,r9666,r9675 into the stable branch

comment:55 Changed 4 years ago by ascheibe

r9701 deleted HivePerformance branch

comment:56 Changed 4 years ago by ascheibe

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.9
  • Version changed from branch to 3.3.8

comment:57 Changed 4 years ago by ascheibe

  • Resolution set to done
  • Status changed from readytorelease to closed
Note: See TracTickets for help on using tickets.