ECJ supports several bloat control techniques. Many of these techniques are compared in detail in "A Comparison of bloat Control Methods for Genetic Programming" by Sean Luke and Liviu Panait. In this directory we have implementations of several of them. There are several methods in the article which aren't here. The two you should be aware of are BIASED MULTIOBJECTIVE, where we do multiobjective optimization with fitness as one objective and size as another; and plain-old LINEAR parsimony pressure, where the "fitness" F of an individual is actually his real fitness R and his size S, combined in a linear function, that is, F = A*R + S for some value of R. We mention these two because, like many of the techniques below, they perform well over many different problem domains. And importantly, LINEAR performed the best in our tests! Closely followed by RATIO TOURNAMENT SELECTION (see below). Double Tournament was pretty good too. The problem with linear parsimony pressure is that it could need to be tuned carefully -- though a setting of A = 32 seemed to work well in many problems. Thus you may wish to try out doing a simple linear parsimony pressure before going to a more exotic method. You can implement linear parsimony pressure in your evaluation function: compute the fitness, then call the size() method on the individual, then set the fitness of the individual to the linear combination of the two. Another technique commonly used to control bloat in GP is depth limiting. ECJ implements depth limiting Koza-style with a maximum depth set to 17. You can often get better results with a smaller depth limit. Interestingly, when the depth limit is set to 17, you can use depth limiting in COMBINATION with ANY of the parsimony pressure techniques discussed here, (including linear and biased multiobjective) and the result is typically better than using them separately. DOUBLE TOURNAMENT Double tournament is a two-layer hierarchy of tournament selection operators. Some N tournament selections are performed on some criterion; and then the winners of those tournaments become contestants in a final tournament. The winner of the final tournament becomes the selected individual. You can have fitness as the first criterion and size as the second criterion, or the other way around. Here are good settings we've found for typical GP experiments. [BASE] = ec.parsimony.DoubleTournamentSelection # Do length as the initial tournaments, and fitness as the final tournament [BASE].do-length-first = true # The initial tournaments are of size 1.4 [BASE].size = 1.4 # The final tournament is of size 7 [BASE].size2 = 7 The default base is select.double-tournament PROPORTIONAL TOURNAMENT Proportional tournament is a single tournament selection; but some percentage of the time the tournament is according to size rather than according to fitness. Here are good settings we've found for typical GP experiments. [BASE] = ec.parsimony.ProportionalTournamentSelection # The size of the tournament [BASE].size = 7 # The probability that the tournament is by fitness (1.0 is equivalent # to "regular" tournament selection). [BASE].fitness-prob = 0.8 The default base is select.proportional-tournament LEXICOGRAPHIC TOURNAMENT Lexicographic tournament selection is simple. We do a tournament selection by fitness, breaking ties by choosing the smaller individual. Thus size is a secondary consideration: for example, if all your fitness values are likely to be different, then size will never have an effect. Thus plain lexicographic tournament selection works best when there are a limited number of possible fitness values. Lexicographic tournament has no special settings -- it's basically the same as plain tournament selection. Here's how we'd set it up for GP problems: [BASE] = ec.parsimony.LexicographicTournament # The size of the tournament [BASE].size = 7 The default base is select.lexicographic-tournament [DIRECT] BUCKETED TOURNAMENT SELECTION Bucketed tournament selection is an improvement of sorts over plain Lexicographic tournament selection. The idea is to create an artificial equivalency of fitness values, even when none exists in reality, for purposes of lexicographic selection. This allows size to become a significant factor. to create a set of N buckets. The population, of size S, is then sorted and divided into these buckets. It's not divided quite equally. Instead, the bottom S/N individuals are placed in the first bucket. Then any individuals left in the population whose fitness equals the fittest individual in that bucket are *also* put in that bucket. Then the bottom S/N of the remaining individuals in the population are put in the second bucket, plus any individuals whose fitness equals the fittest individual in the second bucket. And so on. This continues until we've run out of individuals to put into buckets. The idea is to make sure that individuals with the same fitness are all placed into the same bucket. The "fitness" of an individual, for purposes of lexicographic selection, is now his bucket number. We did not find a direct bucketing number-of-buckets parameter which was good across several problem domains. We found 100 was good for artificial ant, 250 for 11-bit Multiplexer and 5-bit Parity, and 25 for Symbolic Regression. You'll need to experiment a bit. Here's the settings for Multiplexer: [BASE] = ec.parsimony.BucketTournamentSelection # The size of the tournament [BASE].size = 7 # The number of buckets [BASE].num-buckets = 250 The default base is select.bucket-tournament RATIO BUCKETED TOURNAMENT SELECTION Ratio Bucketing improves a bit over direct bucketing. Here, the idea is to push low-fitness individuals in to large buckets and place high-fitness individuals into smaller buckets, even as small as the individual itself. This allows more fitness-based distinction among the "important" individuals in the search (the fitter ones) and puts more parsimony pressure in the "less important" individuals. We do this by defining a ratio of remaining individuals in the population to put in the next bucket. Let's say this ratio is 1/R. We put the 1/R worst individuals of the population in lowest bucket, plus all remaining individuals in the population whose fitness is equal to the fittest individual in the bucket. We then put the 1/4 next worst remaining individuals in the next bucket, plus all remaining individuals in the population whose fitness is equal to the fittest individual in the second bucket. And so on, until all individuals have been placed into buckets. The "fitness" of an individual, for purposes of lexicographic selection, is now his bucket number. Like direct bucketing, we did not find a value of R which was good across several problem domains. 2 was good for artificial ant, 11-bit mutiplexer, and regression. but 6 ws good for 5-bit parity. So you'll need to experiment. Here's the settings for Multiplexer: [BASE] = ec.parsimony.RatioBucketTournamentSelection # The size of the tournament [BASE].size = 7 # The number of buckets [BASE].ratio = 2 The default base is select.ratio-bucket-tournament TARPEIAN SELECTION Tarpeian is fairly simple but clever. PRIOR to evaluation, we sort the population by size, then identify the M individuals which have above-average size. From those M individuals we "kill" some N percent. Notice that M may vary from population to population depending on the variance of size among the individuals. By "kill" we mean that we set the fitness of those individuals to a very bad value, and also mark them as evaluated so the Evaluator doesn't bother evaluating them. Not evaluating the individuals is really important: if every individual is evaluated, Tarpeian is actually pretty costly compared to other methods. But if we prematurely "kill" the individuals, then Tarpeian is pretty competitive if you count total number of evaluations. Because Tarpeian must do its work prior to evaluation, it can't operate as a selection operator in ECJ's framework. Instead, we've arranged for Tarpeian to be a Statistics subclass which plugs into the preEvaluationStatistics hook. To use it, you just hang it off of your statistics chain. Assuming you only have one existing Statistics object, here's how you'd add Tarpeian in a manner which has proven good across several problem domains: stat.num-children = 1 stat.child.0 = ec.parsimony.TarpeianStatistics stat.child.0.kill-proportion = 0.3 Note that our implementation of Tarpeian will operate over *all* of your subpopuations, even if you don't want that. You may need to hack it to operate differently if you have more than one subpopulation and don't want Tarpeian parsimony on one or more of them.