Opened 14 years ago
Closed 14 years ago
#1424 closed defect (done)
Performance issues when executing more than 17 algorithms
Reported by: | cneumuel | Owned by: | swagner |
---|---|---|---|
Priority: | high | Milestone: | HeuristicLab 3.3.4 |
Component: | Optimization | Version: | 3.3.4 |
Keywords: | Cc: |
Description (last modified by cneumuel)
When more than 17 instances of an algorithm are executed, most of the execution time is spent in set_ExecutionContext of Operator and Parameter, which leads to a significant slowdown.
This behaviour can be reproduced by creating more than 17 instances of an algorithm and then execute them sequentially. After the 17. algorithm the execution time drops quickly, reaching 100X slowdown after ~36 algorithm executions. Further executions reduce performance even more.
The following log shows the execution times of a GeneticAlgorithm with PopulationSize=10 and MaximumGenerations=10, cloned and executed 50 times:
0: 00:00:00.3190994 1: 00:00:00.0640627 2: 00:00:00.0629711 3: 00:00:00.0789977 4: 00:00:00.0636207 5: 00:00:00.0648095 6: 00:00:00.0635349 7: 00:00:00.0638938 8: 00:00:00.0684564 9: 00:00:00.0635349 10: 00:00:00.0640277 11: 00:00:00.0666138 12: 00:00:00.0785264 13: 00:00:00.0636077 14: 00:00:00.0649117 15: 00:00:00.0650256 16: 00:00:00.0651619 17: 00:00:00.1157925 18: 00:00:00.2212117 19: 00:00:00.2721120 20: 00:00:00.4346059 21: 00:00:02.7537371 22: 00:00:03.6094154 23: 00:00:05.4627364 24: 00:00:03.6676727 25: 00:00:02.8755246 26: 00:00:03.1758587 27: 00:00:03.6028118 28: 00:00:04.0009675 29: 00:00:04.5277583 30: 00:00:04.4591044 31: 00:00:04.8525177 32: 00:00:04.8829030 33: 00:00:05.2711543 34: 00:00:05.2257916 35: 00:00:06.0486489 36: 00:00:06.0530127 37: 00:00:06.3948470 38: 00:00:06.7425004 39: 00:00:06.9358513 40: 00:00:07.0969071 41: 00:00:07.9781183 42: 00:00:07.6586048 43: 00:00:08.1097788 44: 00:00:08.0271943 45: 00:00:08.6293224 46: 00:00:09.1268924 47: 00:00:09.1074116 48: 00:00:09.5311929 49: 00:00:10.0175224 50: 00:00:09.8617284
My tests showed that this effect only occurs if all the executed algorithms stay in memory. If the algorithm clones have no more references in memory, the garbage collector takes care of them and there is no performance issue.
To clarify: this has nothing to do with cloning. It also happens when all instances are created independently.
Steps to reproduce:
- Create an HL-Experiment
- Add a genetic algorithm (with low runtime to emphasise the effect)
- Add 20 clones of the genetic algorithm in the experiment
- Start the experiment
Possible solution:
- Maybe it would help to get rid of references to TheadLocal<T> after an algorithm finished.
Attachments (2)
Change History (21)
comment:1 Changed 14 years ago by gkronber
comment:2 Changed 14 years ago by cneumuel
I do not have a good explaination for the number 17 - this is just a reproducable observation. However the implementation of ThreadLocal<T> reveals that there is some sort of cache for 16 elements a cache of 165 (=1048576) ThreadStatic types, which may be exhausted after around 17 GA runs (with TestFunctionProblem). This limit may vary with other algorithms / problems.
comment:3 Changed 14 years ago by cneumuel
- Description modified (diff)
- Summary changed from Performance issues when executing more than 17 algorithm clones to Performance issues when executing more than 17 algorithms
comment:4 Changed 14 years ago by swagner
- Milestone changed from HeuristicLab Backlog to HeuristicLab 3.3.4
- Status changed from new to accepted
comment:5 Changed 14 years ago by swagner
In a discussion with gkronber, abeham and mkommend it was decided that operators and parameters should get a ClearState method which is executed after an algorithm is stopped in order to free all ThreadLocal<T> instances.
comment:6 Changed 14 years ago by svonolfe
I also encountered that issue while executing tests for the EuroCAST paper.
comment:7 Changed 14 years ago by abeham
- Milestone changed from HeuristicLab 3.3.4 to HeuristicLab 3.3.5
comment:8 Changed 14 years ago by abeham
The experiment had problem instances of different size, so there was some variation between different instances. All instances were run 30 times with the same algorithm.
comment:9 follow-up: ↓ 10 Changed 14 years ago by cneumuel
So, you used 23 instances, right? In this case, the performance drawback would be only few seconds (~5 seconds) and only at the first run of the instance. This is the moment when all the execution-contexts are set in operators and parameters. All subsequent runs of one instance should have normal performance.
Maybe you can look into your experiment if the first of the 30 repetitions is slower.
Changed 14 years ago by abeham
I have an experiment with 690 runs, but I could not see this effect. I'm running x64 release builds. Is there a difference between Debug and Release builds? I color-coded the different instances.
comment:10 in reply to: ↑ 9 Changed 14 years ago by abeham
Replying to cneumuel:
So, you used 23 instances, right? In this case, the performance drawback would be only few seconds (~5 seconds) and only at the first run of the instance. This is the moment when all the execution-contexts are set in operators and parameters. All subsequent runs of one instance should have normal performance.
Maybe you can look into your experiment if the first of the 30 repetitions is slower.
Yes the first run often has worst performance, but the margin is sometimes higher. Still I'm not fully understanding the issue and why only the first run of a batch-run is affected and not the subsequent runs.
Changed 14 years ago by abeham
Here an experiment showing the effect of affecting the first run: The higher lone points on the right are first runs each.
comment:11 Changed 14 years ago by cneumuel
In the constructors of Operator and Parameter, the executioncontexts are initialized as follows:
executionContexts = new Lazy<ThreadLocal<IExecutionContext>>(() => { return new ThreadLocal<IExecutionContext>(); }, LazyThreadSafetyMode.ExecutionAndPublication);
Because of the use of Lazy<T>, the ThreadLocal<T> object is instantiated at the first call of the .Value-property. Exactly this call happens when the Execute method of the Operator is called.
The performance issue occurs exactly when the ThreadLocal<T> object is instantiated and there are already too many of those objects in memory.
comment:12 Changed 14 years ago by cneumuel
- Owner changed from swagner to cneumuel
- Status changed from accepted to assigned
comment:13 Changed 14 years ago by cneumuel
- Status changed from assigned to accepted
comment:14 Changed 14 years ago by cneumuel
- Owner changed from cneumuel to swagner
- Status changed from accepted to reviewing
- created extension method GetObjectGraphObjects for objects
- created IStatefulItem interface, which means an object has a state which can be initialized and cleared
- after an algorithm stopped, all objects of the algorithm-objectgraph with the type IStatefulItem are cleared
- on prepare, all IStatefulItem objects are initialized
The overall performance impact of traversing the full object graph of an algorithm and initializing/clearing states is ~3-8ms.
I have tested the solution with 2000 algorithm instances, each of them had approximatly the same executiontime.
comment:15 Changed 14 years ago by cneumuel
- Milestone changed from HeuristicLab 3.3.5 to HeuristicLab 3.3.4
comment:16 Changed 14 years ago by swagner
Very nice. Thanks for implementing this solution.
Added some additional minor changes in r6114.
comment:17 Changed 14 years ago by swagner
Added final minor changes in CollectObjectGraphObjects in r6119.
comment:18 Changed 14 years ago by swagner
- Status changed from reviewing to readytorelease
comment:19 Changed 14 years ago by swagner
- Resolution set to done
- Status changed from readytorelease to closed
- Version changed from 3.3.3 to 3.3.4
Why 17?