HeuristicLab includes a distributed calculation infrastructure called HeuristicLab Hive. Hive follows the master-slave model where the master (server) manages the distribution of jobs to slaves which calculate them.
This is especially useful for large scale parameter tests where different algorithm configurations have to be run and analyzed.
In the last couple of month we have improved the performance of the Hive server as we have been getting to a point where the server was unable to service all requests at peak demand times. Here is an overview of the changes:
- The Hive service uses Linq-to-Sql to access the database. Some queries were written in a suboptimal way. These have been optimized to allow for faster queries.
- Lazy loading has been activated for binary data. This improves memory consumption of the server as binary data isn't loaded when querying e.g. tasks.
- Some Linq queries have been changed to compiled queries. Some queries have been rewritten in SQL (especially recursions which Linq-to-Sql does not support in an efficient way).
- Binary data is now stored using the Filestream property. This greatly reduces memory consumption of the SQL Server.
- Some other minor changes to reduce CPU load, e.g. minimized the amount of conversations between DTOs and DAOs, use byte instead of Binary, etc.
These changes greatly improve the performance of the service and reduce memory consumption as shown in the following chart:
The image shows web service operations that are used the most by the clients. The tests were done using 256 tasks with a size of 4 MB and a runtime of 2.5 minutes. 48 slaves were used with an heartbeat interval of 30 seconds. The displayed results are measured in seconds. This shows speed-ups up to a factor of 30 in some cases.
Another improvement is that the server now also offers a TCP endpoint besides the old WSHTTP endpoint. TCP is a lot faster (up to a factor 2, depending on the transfered data size) than web-services:
The chart compares WSHTTP with Text/MTOM encoding with TCP for different transfered chunk sizes, measured in seconds.
The HeuristicLab Hive Job Manager as well as the slaves can now use this endpoint if possible (e.g. no firewall restrictions) and only fall back to WSHTTP if TCP does not work.
The described changes in the Hive server speed up working with the Job Manager/Hive as well as preventing timeouts when transferring larger tasks as it was the case in the past. Additionally, the improvements also reduce hardware requirements for the server and improve scalability. We have used Hive in the past with around 350 cores calculating in parallel. This was the maximum that the server could process with our hardware configuration and this limit should now be much higher.
For more detailed information please have a look at ticket #2030 where development is tracked. The described improvements currently reside in the trunk and will be released with HeuristicLab 3.3.9.