Opened 5 years ago

Closed 4 years ago

IndexOutOfRangeException in SetClassDistributionCutPointThresholds

Reported by: Owned by: abeham abeham medium HeuristicLab 3.3.8 Problems.DataAnalysis.Symbolic.Classification 3.3.8

Description

Probably a corner case that doesn't occur to often, but can lead to an exception. The stack trace is given below.

ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index
at System.Collections.Generic.List1.get_Item(Int32 index)
at HeuristicLab.Problems.DataAnalysis.NormalDistributionCutPointsThresholdCalculator.CalculateThresholds(IClassificationProblemData problemData, IEnumerable1 estimatedValues, IEnumerable1 targetClassValues, Double[]& classValues, Double[]& thresholds) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis\3.4\Implementation\Classification\ThresholdCalculators\NormalDistributionCutPointsThresholdCalculator.cs:line 97
at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicDiscriminantFunctionClassificationModel.SetClassDistributionCutPointThresholds(IDiscriminantFunctionClassificationModel model, IClassificationProblemData problemData) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SymbolicDiscriminantFunctionClassificationModel.cs:line 139
at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.Evaluate(IExecutionContext context, ISymbolicExpressionTree tree, IClassificationProblemData problemData, IEnumerable1 rows) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SingleObjective\SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.cs:line 68
at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.Apply() in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SingleObjective\SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.cs:line 45
at HeuristicLab.Operators.Operator.Execute(IExecutionContext context, CancellationToken cancellationToken) in D:\HL3\trunk\sources\HeuristicLab.Operators\3.3\Operator.cs:line 123
at HeuristicLab.SequentialEngine.SequentialEngine.Run(CancellationToken cancellationToken) in D:\HL3\trunk\sources\HeuristicLab.SequentialEngine\3.3\SequentialEngine.cs:line 60


comment:1 follow-up: ↓ 2 Changed 5 years ago by gkronber

Probably caused by changes in r8531 (#1919).

Last edited 5 years ago by gkronber (previous) (diff)

comment:2 in reply to: ↑ 1 Changed 5 years ago by mkommend

Probably caused by changes in r8531 (#1919).

The changes in r8531 are IMHO not related to this ticket.

comment:3 Changed 5 years ago by mkommend

• Owner changed from mkommend to gkronber
• Status changed from new to assigned

comment:4 Changed 5 years ago by mkommend

• Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.8
• Priority changed from low to medium

comment:5 Changed 5 years ago by gkronber

• Status changed from assigned to accepted

comment:6 Changed 5 years ago by gkronber

r8638: implemented improvements for the NormalDistributionCutPointsThresholdCalculator

comment:7 Changed 5 years ago by gkronber

r8658: fixed some bugs in NormalDistributionCutPointsThresholdCalculator (probably more remaining)

comment:8 Changed 5 years ago by gkronber

r8680: minor code improvements

comment:9 Changed 5 years ago by gkronber

• Owner changed from gkronber to abeham
• Status changed from accepted to reviewing

comment:10 follow-up: ↓ 11 Changed 4 years ago by abeham

I reviewed the implementation of the NormalDistributedCutPointsThresholdCalculator:

• Line 106: Wouldn't you want to predict the class with most samples?
• Line 153: The comment seems to suggest Log(1) would be equal to PositiveInfinity
• Line 157-158: I assume this should be an approximation of the normal distribution's CDF. If I understand the purpose correctly you want to compute the area under the curve between upper and lower level (f(upper) - f(lower)) for each distribution to see which of them got the most mass. However, the function f(x) looks quite strange. It resembles the normal distribution's PDF in some parts, but not in others. Can you explain the origin of this function?
• Line 184 - 189: What does this do?
• Line 188: I would write it as (m2s1 - m1s2 - g) instead of -(-m2s1 + m1s2 +g)

comment:11 in reply to: ↑ 10 Changed 4 years ago by gkronber

I reviewed the implementation of the NormalDistributedCutPointsThresholdCalculator:

• Line 106: Wouldn't you want to predict the class with most samples?

Good suggestion!

• Line 153: The comment seems to suggest Log(1) would be equal to PositiveInfinity

Comment is indeed wrong.

• Line 157-158: I assume this should be an approximation of the normal distribution's CDF. If I understand the purpose correctly you want to compute the area under the curve between upper and lower level (f(upper) - f(lower)) for each distribution to see which of them got the most mass. However, the function f(x) looks quite strange. It resembles the normal distribution's PDF in some parts, but not in others. Can you explain the origin of this function?

f(x) is incorrect. Thanks for pointing this out! Additionally, the name of the method is incorrect as we do not return the logarithmic density mass. Probably we can use an alglib function for this.

• Line 184 - 189: What does this do?

Calculates the points x1 and x2 where the distributions N(m1, s1) == N(m2,s2). In the general case there should be two cut points. If either s1 or s2 is 0 then x1==x2. If both s1 and s2 are zero than there are no cut points but we should return something reasonable (e.g. (m1 + m2) / 2) then.

• Line 188: I would write it as (m2s1 - m1s2 - g) instead of -(-m2s1 + m1s2 +g)

Ok.

comment:12 Changed 4 years ago by gkronber

r8913: made corrections as suggested in above comment by abeham (10). In particular hopefully corrected the calculation of the CDF.

comment:13 Changed 4 years ago by abeham

Yes I think the formula is correct now. One last point that is remaining: I tried to verify CalculateCutPoints and think there's a mistake in calculating g: It would need to say (s2 * s2 - s1 * s1), a minus instead of a +. This is also indicated in this answer.

Is it really a requirement that s2 is the larger standard deviation as indicated by the comment in line 194? Is this due to numerical stability issues? Maybe the reason can be better stated.

comment:14 Changed 4 years ago by gkronber

MATLAB produces:

>> syms x sig1 sig2 mu1 mu2;
>> solve(1/sig1/sqrt(2*pi) * exp(-1/2*((x-mu1)/sig1)^2) - 1/sig2/sqrt(2*pi) * exp(-1/2*((x-mu2)/sig2)^2), x)

ans =

(mu2*sig1^2 - mu1*sig2^2 + sig1*sig2*(2*sig2^2*log(sig2/sig1) - 2*sig1^2*log(sig2/sig1) - 2*mu1*mu2 + mu1^2 + mu2^2)^(1/2))/(sig1^2 - sig2^2)
-(mu1*sig2^2 - mu2*sig1^2 + sig1*sig2*(2*sig2^2*log(sig2/sig1) - 2*sig1^2*log(sig2/sig1) - 2*mu1*mu2 + mu1^2 + mu2^2)^(1/2))/(sig1^2 - sig2^2)



comment:15 Changed 4 years ago by gkronber

• used symbolic solution from MATLAB for calculation of cut points.
• Fixed numerical problem with calculation of NormalCDF by calculating / approximated the logarithm of the normal CDF.

comment:16 Changed 4 years ago by gkronber

r8921: improved robustness of NormalDistributionCutPointsThresholdCalculator. Now it always returns at least negative infinity and the most frequent class.

comment:17 Changed 4 years ago by abeham

• Status changed from reviewing to readytorelease

I think it works now. Thanks for the improvements and fixes.

comment:18 Changed 4 years ago by swagner

• Resolution set to done
• Status changed from readytorelease to closed
• Version changed from 3.3.7 to 3.3.8
Note: See TracTickets for help on using tickets.