Opened 7 years ago
Closed 7 years ago
#2902 closed enhancement (done)
Performance improvement of NaN/Inf-check on a double-matrix
Reported by: | fholzing | Owned by: | gkronber |
---|---|---|---|
Priority: | low | Milestone: | HeuristicLab 3.3.16 |
Component: | Algorithms.DataAnalysis | Version: | trunk |
Keywords: | Performance | Cc: |
Description (last modified by fholzing)
The current implementation of the check whether a double[,] contains any NaN/Inf values is rather time-consuming.
It is written as inputMatrix.Cast<double>().Any(...). At first glance it takes about 6 seconds for a 50000x500 matrix. After consultation with mkommend, a faster alternative is preferred.
Attachments (1)
Change History (22)
comment:1 Changed 7 years ago by fholzing
- Type changed from defect to enhancement
comment:2 Changed 7 years ago by fholzing
- Status changed from new to accepted
comment:3 Changed 7 years ago by fholzing
- Description modified (diff)
comment:4 Changed 7 years ago by fholzing
comment:5 Changed 7 years ago by fholzing
After a first benchmark run the difference between the current approach (mentioned in the description) and three alternatives seems to be approximately a factor of 50.
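A comparison of this kind could be timed roughly as sketched below. This is only an illustration, not the benchmark from the attachment: the class and method names are hypothetical, only one alternative (a plain foreach) is shown, and the matrix is smaller than the 50000x500 case to keep the run short.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class NanCheckBenchmark {
  // The approach from the description: LINQ Cast + Any on the 2D array.
  public static bool ViaCast(double[,] m) {
    return m.Cast<double>().Any(x => double.IsNaN(x) || double.IsInfinity(x));
  }

  // One possible alternative (an assumption): iterate the array directly.
  public static bool ViaForeach(double[,] m) {
    foreach (double x in m) {
      if (double.IsNaN(x) || double.IsInfinity(x)) return true;
    }
    return false;
  }

  // Runs a single check and returns the elapsed wall-clock time.
  private static TimeSpan Time(Func<double[,], bool> check, double[,] m) {
    var sw = Stopwatch.StartNew();
    check(m);
    sw.Stop();
    return sw.Elapsed;
  }

  public static void Main() {
    // All zeros, so neither check can exit early and both scan every element.
    var m = new double[5000, 500];
    Console.WriteLine("Cast/Any: " + Time(ViaCast, m));
    Console.WriteLine("foreach:  " + Time(ViaForeach, m));
  }
}
```

A single timed run like this includes JIT warm-up, so the numbers are only indicative; repeating each measurement would give a more reliable ratio.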
comment:6 Changed 7 years ago by fholzing
Important additional note: if you ever need a very large array, you have to enable gcAllowVeryLargeObjects (see https://msdn.microsoft.com/de-de/library/hh285054(v=vs.110).aspx)
comment:7 Changed 7 years ago by fholzing
My approach would be the following: extend the class ObjectExtensions (Common project) with the iterator method (as shown in the .xlsx; in my opinion the easiest and most readable one, with very little performance impact) and use the new extension method for all 11 occurrences.
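A minimal sketch of what such an extension method might look like, assuming "iterator" means a plain foreach over the array. The class name MatrixExtensions is hypothetical, and the method name is the one settled on later in this ticket; the code actually committed may differ.

```csharp
using System;

// Hypothetical container class; the ticket proposes ObjectExtensions instead.
public static class MatrixExtensions {
  // Returns true as soon as a NaN or +/- infinity is found.
  public static bool ContainsNanOrInfinity(this double[,] matrix) {
    // foreach over a rectangular array visits every element in row-major
    // order and yields plain doubles.
    foreach (double value in matrix) {
      if (double.IsNaN(value) || double.IsInfinity(value)) return true;
    }
    return false;
  }
}

public static class Program {
  public static void Main() {
    var matrix = new double[2, 3];               // all zeros
    Console.WriteLine(matrix.ContainsNanOrInfinity());
    matrix[1, 2] = double.NaN;
    Console.WriteLine(matrix.ContainsNanOrInfinity());
  }
}
```

Call sites would then change from inputMatrix.Cast<double>().Any(...) to inputMatrix.ContainsNanOrInfinity().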
comment:8 Changed 7 years ago by gkronber
Ok, go for it but please use double.IsNaN instead of Double.IsNaN.
It would be great to know why there is such a big difference between the current code (Cast<double>) and the iterator.
comment:9 Changed 7 years ago by mkommend
I suspect that it either has to do with the lambda call or more likely with boxing involved, because the Cast extension method is defined on IEnumerable instead of IEnumerable<double>.
comment:10 Changed 7 years ago by gkronber
Aha, boxing seems to be a good explanation.
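The boxing hypothesis can be illustrated with a small sketch (the class name is hypothetical, and this is not a measurement): a double[,] exposes only the non-generic IEnumerable, so each element handed to Cast<double>() arrives as an object and has to be unboxed again.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class BoxingDemo {
  public static void Main() {
    var matrix = new double[,] { { 1.0, 2.0 }, { 3.0, double.PositiveInfinity } };

    // Multi-dimensional arrays do not implement IEnumerable<double>.
    Console.WriteLine(((object)matrix) is IEnumerable<double>);

    // The non-generic enumerator returns each element as object, i.e. boxed.
    IEnumerator enumerator = ((IEnumerable)matrix).GetEnumerator();
    enumerator.MoveNext();
    object boxed = enumerator.Current;
    Console.WriteLine(boxed.GetType());

    // Cast<double>() sits on top of that enumerator and unboxes every element.
    bool viaCast = matrix.Cast<double>().Any(x => double.IsNaN(x) || double.IsInfinity(x));

    // A direct foreach reads the doubles without going through object.
    bool viaForeach = false;
    foreach (double x in matrix) {
      if (double.IsNaN(x) || double.IsInfinity(x)) { viaForeach = true; break; }
    }

    Console.WriteLine(viaCast == viaForeach);
  }
}
```

Both variants give the same answer; the difference is only in how the elements are delivered, which fits the explanation above but does not rule out a contribution from the lambda call.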
comment:11 Changed 7 years ago by fholzing
r15783: Changed from Cast to Iterator and adapted all occurrences.
comment:12 Changed 7 years ago by fholzing
Field test:
Random Forest Regression (generated testbed, 50000x500, test/train data partitions chosen as 0-20000 / 20000-40000)
M: 0.5
NoTree: 200
R: 0.2
Performance for Variable Impact View
Before optimization: ~1323 sec
After optimization: ~762 sec
Without check: ~749 sec
comment:13 Changed 7 years ago by fholzing
- Owner changed from fholzing to mkommend
- Status changed from accepted to reviewing
comment:14 Changed 7 years ago by gkronber
- Owner changed from mkommend to fholzing
- Status changed from reviewing to assigned
Reviewed r15783.
- I found no other references to .Cast<double>, so this is fine.
- I would prefer the name .ContainsNanOrInfinity()
comment:15 Changed 7 years ago by fholzing
r15786: Renamed ContainsNanInf to ContainsNanOrInfinity
comment:16 Changed 7 years ago by fholzing
- Owner changed from fholzing to mkommend
- Status changed from assigned to reviewing
comment:17 Changed 7 years ago by fholzing
- Owner changed from mkommend to gkronber
comment:18 Changed 7 years ago by gkronber
Reviewed r15786
comment:19 Changed 7 years ago by gkronber
- Status changed from reviewing to readytorelease
comment:20 Changed 7 years ago by gkronber
comment:21 Changed 7 years ago by gkronber
- Resolution set to done
- Status changed from readytorelease to closed
There are a total of 11 occurrences in the HeuristicLab.Algorithms.DataAnalysis project.