Opened 12 months ago

# Speed of symbolic regression evaluators

Reported by: Owned by: bburlacu bburlacu medium Problems.DataAnalysis.Symbolic

There is a big difference between the speed of the tree interpreter and the speed of the (R-Squared) evaluator:

• For an arithmetic grammar the new native interpreter achieves a speed of approx. 3 billion nodes/second.
• By contrast, the Pearson's R-Squared evaluator is more than 20 times slower. Using the values returned by the interpreter, it only manages a speed of 0.14 billion nodes/second when computing the R2 value. This is an obvious bottleneck.

Of course, the evaluator performs some extra work:

• Generating random subsets of rows from the training partition
• Scaling the values and bounding them to the estimation limits
• Checking for error conditions in the calculators (linear scaling calculator, pearson'r correlation calculator)

I think there are some inconsistencies and performance bottlenecks in the SymbolicDataAnalysisEvaluator design:

• Spurious usage of Linq and Enumerators
• Thread static cache for estimated values (not clear why this is needed since one tree should be evaluated just once)
• In 99% of use cases the GenerateRows method just returns an IEnumerable<int> between TrainingPartition.Start and TrainingPartition.End, thus only introducing overhead
• The underlying statistical calculators (mean-variance calculator, covariance, correlation, linear scaling) perform the same checks for the added values (IsNan, IsInfinity). this actually adds up to a noticeable amount of overhead since eg., the correlation calculator calls three other calculators in its Add method, performing the same checks

Suggested workarounds:

• Use arrays: this will slightly increase memory pressure but we are generating those values anyway (generate rows, get tree values) so it would speed things up overall
• As an example, adding a GetSymbolicExpressionTreeValues(ISymbolicExpressionTree, IDataset, int[]) to the ISymbolicDataAnalysisExpressionTreeInterpreter interface would simplify usage and not break existing contracts
• Only check values added to the calculator once: I would provide an AddUnchecked method to the online calculators for cases when the caller knows what they are doing

Naively implementing these suggestions results in 2-4x improvement in evaluation speed:

Linear scaling  Evaluator (old)         Evaluator (new) + Checked  Evaluator (new) + Unchecked  Improvement
TRUE            1.383E+008 ± 4.126E+00  2.776E+008 ± 2.538E+00     3.990E+008 ± 2.347E+00       189%
FALSE           2.479E+008 ± 1.095E+00  3.831E+008 ± 2.659E+00     4.881E+008 ± 3.237E+00       97%


On actual problems the improvement is still significant (depending on problem size):

Problem       NMSE (train)  NMSE (test)  Elapsed (old)  Elapsed (new)  Runtime improvement
Airfoil       0.2189        0.2564       24.9           20.5           21.5%
Breiman-I     0.1146        0.1219       78.0           54.2           43.7%
Chemical-I    0.2043        0.3243       22.2           18.4           20.8%
Concrete      0.1513        0.5952       19.6           16.9           16.1%
Friedman-I    0.1384        0.1385       90.9           65.6           38.7%
Friedman-II   0.0412        0.0424       98.9           66.8           48.1%
GP-Challenge  0.0793        0.0794       86.5           60.7           42.5%
Pagie-1       0.0027        0.0235       24.5           20.8           17.8%
Poly-10       0.1698        0.1929       14.7           13.8           6.8%


A proper redesign would likely yield even better results.

### comment:1 Changed 12 months ago by bburlacu

• Description modified (diff)
• Owner set to bburlacu
• Status changed from new to accepted

### Changed 12 months ago by bburlacu

Script for measuring interpreter and evaluator speed

Note: See TracTickets for help on using tickets.