Opened 5 years ago
Closed 4 years ago
#1925 closed defect (done)
IndexOutOfRangeException in SetClassDistributionCutPointThresholds
Reported by: | abeham | Owned by: | abeham |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.8 |
Component: | Problems.DataAnalysis.Symbolic.Classification | Version: | 3.3.8 |
Keywords: | Cc: |
Description
Probably a corner case that doesn't occur to often, but can lead to an exception. The stack trace is given below.
ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection. Parameter name: index at System.Collections.Generic.List`1.get_Item(Int32 index) at HeuristicLab.Problems.DataAnalysis.NormalDistributionCutPointsThresholdCalculator.CalculateThresholds(IClassificationProblemData problemData, IEnumerable`1 estimatedValues, IEnumerable`1 targetClassValues, Double[]& classValues, Double[]& thresholds) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis\3.4\Implementation\Classification\ThresholdCalculators\NormalDistributionCutPointsThresholdCalculator.cs:line 97 at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicDiscriminantFunctionClassificationModel.SetClassDistributionCutPointThresholds(IDiscriminantFunctionClassificationModel model, IClassificationProblemData problemData) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SymbolicDiscriminantFunctionClassificationModel.cs:line 139 at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.Evaluate(IExecutionContext context, ISymbolicExpressionTree tree, IClassificationProblemData problemData, IEnumerable`1 rows) in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SingleObjective\SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.cs:line 68 at HeuristicLab.Problems.DataAnalysis.Symbolic.Classification.SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.Apply() in D:\HL3\trunk\sources\HeuristicLab.Problems.DataAnalysis.Symbolic.Classification\3.4\SingleObjective\SymbolicClassificationSingleObjectivePenaltyScoreEvaluator.cs:line 45 at HeuristicLab.Operators.Operator.Execute(IExecutionContext context, CancellationToken cancellationToken) in D:\HL3\trunk\sources\HeuristicLab.Operators\3.3\Operator.cs:line 123 at HeuristicLab.SequentialEngine.SequentialEngine.Run(CancellationToken cancellationToken) in D:\HL3\trunk\sources\HeuristicLab.SequentialEngine\3.3\SequentialEngine.cs:line 60
Change History (18)
comment:1 follow-up: ↓ 2 Changed 5 years ago by gkronber
comment:2 in reply to: ↑ 1 Changed 5 years ago by mkommend
comment:3 Changed 5 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from new to assigned
comment:4 Changed 5 years ago by mkommend
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.8
- Priority changed from low to medium
comment:5 Changed 5 years ago by gkronber
- Status changed from assigned to accepted
comment:6 Changed 5 years ago by gkronber
r8638: implemented improvements for the NormalDistributionCutPointsThresholdCalculator
comment:7 Changed 5 years ago by gkronber
r8658: fixed some bugs in NormalDistributionCutPointsThresholdCalculator (probably more remaining)
comment:8 Changed 5 years ago by gkronber
r8680: minor code improvements
comment:9 Changed 5 years ago by gkronber
- Owner changed from gkronber to abeham
- Status changed from accepted to reviewing
please review.
comment:10 follow-up: ↓ 11 Changed 5 years ago by abeham
I reviewed the implementation of the NormalDistributedCutPointsThresholdCalculator:
- Line 106: Wouldn't you want to predict the class with most samples?
- Line 153: The comment seems to suggest Log(1) would be equal to PositiveInfinity
- Line 157-158: I assume this should be an approximation of the normal distribution's CDF. If I understand the purpose correctly you want to compute the area under the curve between upper and lower level (f(upper) - f(lower)) for each distribution to see which of them got the most mass. However, the function f(x) looks quite strange. It resembles the normal distribution's PDF in some parts, but not in others. Can you explain the origin of this function?
- Line 184 - 189: What does this do?
- Line 188: I would write it as (m2s1 - m1s2 - g) instead of -(-m2s1 + m1s2 +g)
comment:11 in reply to: ↑ 10 Changed 5 years ago by gkronber
Replying to abeham:
I reviewed the implementation of the NormalDistributedCutPointsThresholdCalculator:
- Line 106: Wouldn't you want to predict the class with most samples?
Good suggestion!
- Line 153: The comment seems to suggest Log(1) would be equal to PositiveInfinity
Comment is indeed wrong.
- Line 157-158: I assume this should be an approximation of the normal distribution's CDF. If I understand the purpose correctly you want to compute the area under the curve between upper and lower level (f(upper) - f(lower)) for each distribution to see which of them got the most mass. However, the function f(x) looks quite strange. It resembles the normal distribution's PDF in some parts, but not in others. Can you explain the origin of this function?
f(x) is incorrect. Thanks for pointing this out! Additionally, the name of the method is incorrect as we do not return the logarithmic density mass. Probably we can use an alglib function for this.
- Line 184 - 189: What does this do?
Calculates the points x1 and x2 where the distributions N(m1, s1) == N(m2,s2). In the general case there should be two cut points. If either s1 or s2 is 0 then x1==x2. If both s1 and s2 are zero than there are no cut points but we should return something reasonable (e.g. (m1 + m2) / 2) then.
- Line 188: I would write it as (m2s1 - m1s2 - g) instead of -(-m2s1 + m1s2 +g)
Ok.
comment:12 Changed 5 years ago by gkronber
comment:13 Changed 5 years ago by abeham
Yes I think the formula is correct now. One last point that is remaining: I tried to verify CalculateCutPoints and think there's a mistake in calculating g: It would need to say (s2 * s2 - s1 * s1), a minus instead of a +. This is also indicated in this answer.
Is it really a requirement that s2 is the larger standard deviation as indicated by the comment in line 194? Is this due to numerical stability issues? Maybe the reason can be better stated.
comment:14 Changed 5 years ago by gkronber
MATLAB produces:
>> syms x sig1 sig2 mu1 mu2; >> solve(1/sig1/sqrt(2*pi) * exp(-1/2*((x-mu1)/sig1)^2) - 1/sig2/sqrt(2*pi) * exp(-1/2*((x-mu2)/sig2)^2), x) ans = (mu2*sig1^2 - mu1*sig2^2 + sig1*sig2*(2*sig2^2*log(sig2/sig1) - 2*sig1^2*log(sig2/sig1) - 2*mu1*mu2 + mu1^2 + mu2^2)^(1/2))/(sig1^2 - sig2^2) -(mu1*sig2^2 - mu2*sig1^2 + sig1*sig2*(2*sig2^2*log(sig2/sig1) - 2*sig1^2*log(sig2/sig1) - 2*mu1*mu2 + mu1^2 + mu2^2)^(1/2))/(sig1^2 - sig2^2)
comment:15 Changed 5 years ago by gkronber
- used symbolic solution from MATLAB for calculation of cut points.
- Fixed numerical problem with calculation of NormalCDF by calculating / approximated the logarithm of the normal CDF.
comment:16 Changed 5 years ago by gkronber
r8921: improved robustness of NormalDistributionCutPointsThresholdCalculator. Now it always returns at least negative infinity and the most frequent class.
comment:17 Changed 5 years ago by abeham
- Status changed from reviewing to readytorelease
I think it works now. Thanks for the improvements and fixes.
comment:18 Changed 4 years ago by swagner
- Resolution set to done
- Status changed from readytorelease to closed
- Version changed from 3.3.7 to 3.3.8
Probably caused by changes in r8531 (#1919).