Opened 5 years ago
Last modified 5 years ago
#3058 accepted defect
Speed of symbolic regression evaluators
Reported by: | bburlacu | Owned by: | bburlacu |
---|---|---|---|
Priority: | medium | Milestone: | |
Component: | Problems.DataAnalysis.Symbolic | Version: | |
Keywords: | Cc: |
Description (last modified by bburlacu)
There is a big difference between the speed of the tree interpreter and the speed of the (R-Squared) evaluator:
- For an arithmetic grammar the new native interpreter achieves a speed of approx. 3 billion nodes/second.
- By contrast, the Pearson's R-Squared evaluator is more than 20 times slower. Using the values returned by the interpreter, it only manages a speed of 0.14 billion nodes/second when computing the R2 value. This is an obvious bottleneck.
Of course, the evaluator performs some extra work:
- Generating random subsets of rows from the training partition
- Scaling the values and bounding them to the estimation limits
- Checking for error conditions in the calculators (linear scaling calculator, pearson'r correlation calculator)
I think there are some inconsistencies and performance bottlenecks in the SymbolicDataAnalysisEvaluator design:
- Spurious usage of Linq and Enumerators
- Thread static cache for estimated values (not clear why this is needed since one tree should be evaluated just once)
- In 99% of use cases the GenerateRows method just returns an IEnumerable<int> between TrainingPartition.Start and TrainingPartition.End, thus only introducing overhead
- The underlying statistical calculators (mean-variance calculator, covariance, correlation, linear scaling) perform the same checks for the added values (IsNan, IsInfinity). this actually adds up to a noticeable amount of overhead since eg., the correlation calculator calls three other calculators in its Add method, performing the same checks
Suggested workarounds:
- Use arrays: this will slightly increase memory pressure but we are generating those values anyway (generate rows, get tree values) so it would speed things up overall
- As an example, adding a GetSymbolicExpressionTreeValues(ISymbolicExpressionTree, IDataset, int[]) to the ISymbolicDataAnalysisExpressionTreeInterpreter interface would simplify usage and not break existing contracts
- Only check values added to the calculator once: I would provide an AddUnchecked method to the online calculators for cases when the caller knows what they are doing
Naively implementing these suggestions results in 2-4x improvement in evaluation speed:
Linear scaling Evaluator (old) Evaluator (new) + Checked Evaluator (new) + Unchecked Improvement TRUE 1.383E+008 ± 4.126E+00 2.776E+008 ± 2.538E+00 3.990E+008 ± 2.347E+00 189% FALSE 2.479E+008 ± 1.095E+00 3.831E+008 ± 2.659E+00 4.881E+008 ± 3.237E+00 97%
On actual problems the improvement is still significant (depending on problem size):
Problem NMSE (train) NMSE (test) Elapsed (old) Elapsed (new) Runtime improvement Airfoil 0.2189 0.2564 24.9 20.5 21.5% Breiman-I 0.1146 0.1219 78.0 54.2 43.7% Chemical-I 0.2043 0.3243 22.2 18.4 20.8% Concrete 0.1513 0.5952 19.6 16.9 16.1% Friedman-I 0.1384 0.1385 90.9 65.6 38.7% Friedman-II 0.0412 0.0424 98.9 66.8 48.1% GP-Challenge 0.0793 0.0794 86.5 60.7 42.5% Pagie-1 0.0027 0.0235 24.5 20.8 17.8% Poly-10 0.1698 0.1929 14.7 13.8 6.8%
A proper redesign would likely yield even better results.
Attachments (1)
Change History (2)
comment:1 Changed 5 years ago by bburlacu
- Description modified (diff)
- Owner set to bburlacu
- Status changed from new to accepted
Changed 5 years ago by bburlacu
Note: See
TracTickets for help on using
tickets.
Script for measuring interpreter and evaluator speed