Context Navigation

#1424 closed defect (done)

Performance issues when executing more than 17 algorithms

Reported by:	cneumuel	Owned by:	swagner
Priority:	high	Milestone:	HeuristicLab 3.3.4
Component:	Optimization	Version:	3.3.4
Keywords:		Cc:

Description (last modified by cneumuel)

When more than 17 instances of an algorithm are executed, most of the execution time is spent in set_ExecutionContext of Operator and Parameter, which leads to a significant slowdown.

This behaviour can be reproduced by creating more than 17 instances of an algorithm and then execute them sequentially. After the 17. algorithm the execution time drops quickly, reaching 100X slowdown after ~36 algorithm executions. Further executions reduce performance even more.

The following log shows the execution times of a GeneticAlgorithm with PopulationSize=10 and MaximumGenerations=10, cloned and executed 50 times:

0: 00:00:00.3190994
1: 00:00:00.0640627
2: 00:00:00.0629711
3: 00:00:00.0789977
4: 00:00:00.0636207
5: 00:00:00.0648095
6: 00:00:00.0635349
7: 00:00:00.0638938
8: 00:00:00.0684564
9: 00:00:00.0635349
10: 00:00:00.0640277
11: 00:00:00.0666138
12: 00:00:00.0785264
13: 00:00:00.0636077
14: 00:00:00.0649117
15: 00:00:00.0650256
16: 00:00:00.0651619
17: 00:00:00.1157925
18: 00:00:00.2212117
19: 00:00:00.2721120
20: 00:00:00.4346059
21: 00:00:02.7537371
22: 00:00:03.6094154
23: 00:00:05.4627364
24: 00:00:03.6676727
25: 00:00:02.8755246
26: 00:00:03.1758587
27: 00:00:03.6028118
28: 00:00:04.0009675
29: 00:00:04.5277583
30: 00:00:04.4591044
31: 00:00:04.8525177
32: 00:00:04.8829030
33: 00:00:05.2711543
34: 00:00:05.2257916
35: 00:00:06.0486489
36: 00:00:06.0530127
37: 00:00:06.3948470
38: 00:00:06.7425004
39: 00:00:06.9358513
40: 00:00:07.0969071
41: 00:00:07.9781183
42: 00:00:07.6586048
43: 00:00:08.1097788
44: 00:00:08.0271943
45: 00:00:08.6293224
46: 00:00:09.1268924
47: 00:00:09.1074116
48: 00:00:09.5311929
49: 00:00:10.0175224
50: 00:00:09.8617284

My tests showed that this effect only occurs if all the executed algorithms stay in memory. If the algorithm clones have no more references in memory, the garbage collector takes care of them and there is no performance issue.

To clarify: this has nothing to do with cloning. It also happens when all instances are created independently.

Steps to reproduce:

Create an HL-Experiment
Add a genetic algorithm (with low runtime to emphasise the effect)
Add 20 clones of the genetic algorithm in the experiment
Start the experiment

Possible solution:

Maybe it would help to get rid of references to TheadLocal<T> after an algorithm finished.

Attachments (2)

execution times.png (97.3 KB) - added by abeham 14 years ago.: I have an experiment with 690 runs, but I could not see this effect. I'm running x64 release builds. Is there a difference between Debug and Release builds? I color-coded the different instances.
execution times2.png (60.7 KB) - added by abeham 14 years ago.: Here an experiment showing the effect of affecting the first run: The higher lone points on the right are first runs each.

Download all attachments as: .zip

Change History (21)

comment:1 Changed 14 years ago by gkronber

Why 17?

comment:2 Changed 14 years ago by cneumuel

I do not have a good explaination for the number 17 - this is just a reproducable observation. However the implementation of ThreadLocal<T> reveals that there is ~~some sort of cache for 16 elements~~ a cache of 16⁵ (=1048576) ThreadStatic types, which may be exhausted after around 17 GA runs (with TestFunctionProblem). This limit may vary with other algorithms / problems.

Last edited 14 years ago by cneumuel (previous) (diff)

comment:3 Changed 14 years ago by cneumuel

Description modified (diff)
Summary changed from Performance issues when executing more than 17 algorithm clones to Performance issues when executing more than 17 algorithms

comment:4 Changed 14 years ago by swagner

Milestone changed from HeuristicLab Backlog to HeuristicLab 3.3.4
Status changed from new to accepted

comment:5 Changed 14 years ago by swagner

In a discussion with gkronber, abeham and mkommend it was decided that operators and parameters should get a ClearState method which is executed after an algorithm is stopped in order to free all ThreadLocal<T> instances.

comment:6 Changed 14 years ago by svonolfe

I also encountered that issue while executing tests for the EuroCAST paper.

comment:7 Changed 14 years ago by abeham

Milestone changed from HeuristicLab 3.3.4 to HeuristicLab 3.3.5

comment:8 Changed 14 years ago by abeham

The experiment had problem instances of different size, so there was some variation between different instances. All instances were run 30 times with the same algorithm.

comment:9 follow-up: ↓ 10 Changed 14 years ago by cneumuel

So, you used 23 instances, right? In this case, the performance drawback would be only few seconds (~5 seconds) and only at the first run of the instance. This is the moment when all the execution-contexts are set in operators and parameters. All subsequent runs of one instance should have normal performance.

Maybe you can look into your experiment if the first of the 30 repetitions is slower.

Changed 14 years ago by abeham

Attachment execution times.png added

I have an experiment with 690 runs, but I could not see this effect. I'm running x64 release builds. Is there a difference between Debug and Release builds? I color-coded the different instances.

comment:10 in reply to: ↑ 9 Changed 14 years ago by abeham

Replying to cneumuel:

So, you used 23 instances, right? In this case, the performance drawback would be only few seconds (~5 seconds) and only at the first run of the instance. This is the moment when all the execution-contexts are set in operators and parameters. All subsequent runs of one instance should have normal performance.

Maybe you can look into your experiment if the first of the 30 repetitions is slower.

Yes the first run often has worst performance, but the margin is sometimes higher. Still I'm not fully understanding the issue and why only the first run of a batch-run is affected and not the subsequent runs.

Changed 14 years ago by abeham

Attachment execution times2.png added

Here an experiment showing the effect of affecting the first run: The higher lone points on the right are first runs each.

comment:11 Changed 14 years ago by cneumuel

In the constructors of Operator and Parameter, the executioncontexts are initialized as follows:

executionContexts = new Lazy<ThreadLocal<IExecutionContext>>(() => { return new ThreadLocal<IExecutionContext>(); }, LazyThreadSafetyMode.ExecutionAndPublication);

Because of the use of Lazy<T>, the ThreadLocal<T> object is instantiated at the first call of the .Value-property. Exactly this call happens when the Execute method of the Operator is called.

The performance issue occurs exactly when the ThreadLocal<T> object is instantiated and there are already too many of those objects in memory.

comment:12 Changed 14 years ago by cneumuel

Owner changed from swagner to cneumuel
Status changed from accepted to assigned

comment:13 Changed 14 years ago by cneumuel

Status changed from assigned to accepted

comment:14 Changed 14 years ago by cneumuel

Owner changed from cneumuel to swagner
Status changed from accepted to reviewing

r6103

created extension method GetObjectGraphObjects for objects
created IStatefulItem interface, which means an object has a state which can be initialized and cleared
after an algorithm stopped, all objects of the algorithm-objectgraph with the type IStatefulItem are cleared
on prepare, all IStatefulItem objects are initialized

The overall performance impact of traversing the full object graph of an algorithm and initializing/clearing states is ~3-8ms.

I have tested the solution with 2000 algorithm instances, each of them had approximatly the same executiontime.