Opened 6 years ago
Closed 6 years ago
#2850 closed enhancement (done)
Extend tSNE with relevance weights
Reported by: | bwerth | Owned by: | mkommend |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Algorithms.DataAnalysis | Version: | trunk |
Keywords: | tSNE | Cc: |
Description
The original tSNE weights all input dimensions equally. This decreases image quality if a large number of irrelevant features is present. In the supervised case, where the embedding should produce a separation for a preselected feature (e.g. the class variable for classification), weighting the input dimensions according to the impact values determined by some form of supervised learning potentially produces better images.
Change History (28)
comment:1 Changed 6 years ago by bwerth
- Status changed from new to accepted
comment:2 Changed 6 years ago by bwerth
comment:3 Changed 6 years ago by bwerth
r15455 added WeightedEuclideanDistance && fixed minor bug in scatterPlot coloring
comment:4 Changed 6 years ago by bwerth
- Owner changed from bwerth to mkommend
- Status changed from accepted to reviewing
comment:5 Changed 6 years ago by mkommend
- Milestone changed from HeuristicLab 3.3.16 to HeuristicLab 3.3.15
- Version set to branch
comment:6 Changed 6 years ago by mkommend
Review comments
- It is rather unusual (read: ugly) to branch the internals of a plugin so that the branch directly contains the 3.4 folder. It is hard to grasp which plugin has been branched, especially if multiple plugins are affected.
- Why did the base class of DistanceBase change from Item to ParameterizedItem? Doesn't this change break the persistence of saved files?
- WeightedEuclideanDistance
  - Use a standard cast (TTT) instead of an as cast for the parameter property. The reasoning is to get an InvalidCastException directly instead of a NullReferenceException sometime later.
  - Is the weights parameter description accurate? Quote: "... If no weights are specified a Random Forrest Regression / Classification is used to automatically set the weigths."
  - Why is the weights parameter optional when GetDistance (called by Get) throws an exception if null is encountered?
  - Why is the DoubleArray called impacts in GetDistance?
  - Avoid Linq when retrieving the length of a DoubleArray.
  - What is the point of squaring the weights and taking the sqrt afterwards? The item description contradicts this: √Σ(w²) != Σ(w)
  - Shouldn't weights always be positive? If so, a check would be appropriate.
  - Efficiency should be improved by avoiding ToArray calls in the Get method!
- TSNEUtils
  - NthElement sorts a list between two indexes, which is not what the name suggests.
- TSNEAlgorithm
  - Remove the dependency on Encodings.RealVector.
  - The data extraction in the Run method kills performance by accessing and extracting every value individually. Please benchmark and improve the call (line 289).
  - Use Color.Gradient instead of the HsVtoRgb conversion.
- TSNEStatic
  - Lines 219-223: never use implicitly nested loops without opening a block!
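To illustrate the naming remark on NthElement, here is a minimal Python sketch of the two operations being contrasted. It is not the HeuristicLab code: the function names, the inclusive index convention and the quickselect implementation are assumptions made for illustration only.

```python
import random


def nth_element(items, n):
    """Return the n-th smallest value (0-based) using quickselect.

    Works on a copy; no part of the input ends up sorted as a side effect.
    """
    data = list(items)
    if not 0 <= n < len(data):
        raise IndexError("n out of range")
    lo, hi = 0, len(data) - 1
    while lo < hi:
        pivot = data[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition around the pivot value.
        while i <= j:
            while data[i] < pivot:
                i += 1
            while data[j] > pivot:
                j -= 1
            if i <= j:
                data[i], data[j] = data[j], data[i]
                i += 1
                j -= 1
        if n <= j:
            hi = j          # target lies in the left part
        elif n >= i:
            lo = i          # target lies in the right part
        else:
            break           # data[n] equals the pivot and is in place
    return data[n]


def partial_sort(items, first, last):
    """Sort only items[first..last] (inclusive, assumed convention) in place."""
    items[first:last + 1] = sorted(items[first:last + 1])


values = [5, 1, 4, 2, 3]
third_smallest = nth_element(values, 2)

window = [9, 7, 8, 3, 1, 2]
partial_sort(window, 1, 4)  # only positions 1..4 are reordered
```

A selection routine returns one order statistic, while a range sort reorders a slice, so a method that does the latter is better described by a name such as PartialSort.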
comment:7 Changed 6 years ago by mkommend
- Owner changed from mkommend to bwerth
- Status changed from reviewing to assigned
comment:8 Changed 6 years ago by bwerth
- changed DistanceBase back to Item; made WeightedEuclideanDistance a ParameterizedItem (no longer derives from DistanceBase)
- WeightedEuclideanDistance:
  - changed cast
  - fixed parameter description
  - weights are no longer optional
  - renamed weights in GetDistance
  - fixed incorrect item description: applying weights as √Σ(w[i]²*(p1[i]-p2[i])²) is equivalent to multiplying each dimension by w[i] before calculating the distance, up to a constant factor of √(d/Σ(w[i]²)) (constant factors do not change the tSNE projection)
  - weights theoretically need not be positive; since it might seem strange that a weight of -10 has the same effect as a weight of 10, I added a check anyway
  - removed ToArray() calls
- TSNEUtils:
  - renamed NthElement to PartialSort
- TSNEAlgorithm:
  - changed safe casts to direct casts in parameter properties
  - removed dependency on RealVectorEncoding
  - changed coloring scheme to ColorGradient
- TSNEStatic:
  - added block braces
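The equivalence claimed for the weighted distance can be checked numerically. The following Python sketch is an illustration under assumptions, not the actual WeightedEuclideanDistance implementation: the function names and the choice to reject negative weights with a ValueError are hypothetical.

```python
import math


def weighted_euclidean(p1, p2, w):
    """Compute sqrt(sum(w[i]^2 * (p1[i] - p2[i])^2))."""
    if not (len(p1) == len(p2) == len(w)):
        raise ValueError("points and weights must have the same length")
    if any(wi < 0 for wi in w):
        # Assumed check: a negative weight would act like its absolute value.
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum((wi * wi) * (a - b) ** 2 for a, b, wi in zip(p1, p2, w)))


def euclidean(p1, p2):
    """Plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


p1 = [1.0, 2.0, 3.0]
p2 = [4.0, 6.0, 3.0]
w = [2.0, 0.5, 10.0]

# Scaling every dimension by its weight first gives the same distance.
scaled_p1 = [wi * a for wi, a in zip(w, p1)]
scaled_p2 = [wi * b for wi, b in zip(w, p2)]

d_weighted = weighted_euclidean(p1, p2, w)
d_scaled = euclidean(scaled_p1, scaled_p2)
```

Both values agree up to floating-point rounding, which is why the squared weights inside the root correspond to linear scaling of the input dimensions.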
comment:9 Changed 6 years ago by bwerth
r15484 changed data extraction
comment:10 Changed 6 years ago by bwerth
r15485 fixed comment in TSNEAlgorithm; changed private methods in TSNEAlgorithm from T[] to IReadOnlyList<T>
Note regarding the need for jagged arrays in TSNEAlgorithm: TSNEStatic<T> takes T[] as data because this way the static interface can be used to embed arbitrary data types (double arrays, strings, custom types) as long as a corresponding IDistance<T> exists, a feature I would like to preserve.
comment:11 Changed 6 years ago by bwerth
- Owner changed from bwerth to mkommend
- Status changed from assigned to reviewing
comment:12 Changed 6 years ago by mkommend
- Owner changed from mkommend to bwerth
- Status changed from reviewing to assigned
Review Comments
- This ticket breaks already persisted tSNE files so that they can be opened, but fail during execution.
- r15479 introduces errors in most Distance functions!!! Be more careful while implementing & refactoring
- The while loop could be reformulated as while(p1Enum.MoveNext() & p2Enum.MoveNext()) instead of managing extra boolean variables.
- Please test your changes thoroughly!
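As a rough Python analogue of the suggested loop (the reviewed C# code is not shown in this ticket, so the names below are hypothetical), both iterators are advanced on every step, as with the non-short-circuit & operator, and the loop ends as soon as either one is exhausted:

```python
_END = object()  # sentinel marking an exhausted iterator


def lockstep_pairs(xs, ys):
    """Yield pairs from two iterables without extra boolean flags."""
    it_x, it_y = iter(xs), iter(ys)
    while True:
        # Both iterators are advanced unconditionally on every step.
        x = next(it_x, _END)
        y = next(it_y, _END)
        if x is _END or y is _END:
            return
        yield x, y


def squared_distance(p1, p2):
    """Sum of squared differences over the common prefix of p1 and p2."""
    return sum((a - b) ** 2 for a, b in lockstep_pairs(p1, p2))


pairs = list(lockstep_pairs([1, 2, 3], [4, 5]))
dist = squared_distance([1.0, 2.0], [4.0, 6.0])
```

Note that this form stops silently at the shorter sequence. If a length mismatch should be an error, that needs a separate check.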
comment:13 Changed 6 years ago by bwerth
r15487 reenabled backwards compatibility; fixed distances
comment:14 Changed 6 years ago by bwerth
- Owner changed from bwerth to mkommend
- Status changed from assigned to reviewing
comment:15 Changed 6 years ago by mkommend
Final review comments
- Weights should be adapted automatically if the number of checked inputs changes
- Weights should be matched to inputs by the corresponding row name; only if the row names of all rows are string.Empty (possibly including null) should the weights be associated by index
- Add parameters for backwards compatibility in AfterDeserializationHook
When those changes are implemented, please unit test the adapted plugin and merge the changes to the trunk.
comment:16 Changed 6 years ago by mkommend
- Owner changed from mkommend to bwerth
- Status changed from reviewing to assigned
comment:17 Changed 6 years ago by bwerth
- added automatic weight-length adaptation
- Weights are assigned to input dimensions based on name (e.g. one can use the variable impacts from a RegressionModel as weights)
- added Parameter in Hook
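The name-based assignment with an index fallback could look roughly like the Python sketch below. It is a sketch under assumptions: the function name, its signature and the default weight of 1.0 for inputs without a matching weight are hypothetical and not taken from the HeuristicLab code.

```python
def align_weights(input_names, weights, weight_names=None, default=1.0):
    """Return one weight per input dimension.

    If every weight name is empty or None, weights are matched by index;
    otherwise they are matched by name. Inputs without a matching weight
    receive `default` (an assumption made for this sketch).
    """
    names = list(weight_names) if weight_names is not None else [None] * len(weights)
    if len(names) != len(weights):
        raise ValueError("weight_names and weights must have the same length")

    if all(not name for name in names):
        # Index-based fallback: pad with the default or drop surplus weights.
        return [weights[i] if i < len(weights) else default
                for i in range(len(input_names))]

    by_name = {name: w for name, w in zip(names, weights) if name}
    return [by_name.get(name, default) for name in input_names]


by_name_result = align_weights(["x", "y", "z"], [0.5, 2.0], ["z", "x"])
by_index_result = align_weights(["x", "y", "z"], [0.5, 2.0])
```

Because the result always has one entry per input name, re-running the function after the set of checked inputs changes also covers the weight-length adaptation.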
comment:18 Changed 6 years ago by bwerth
r15532 merged Weighted TSNE to trunk
comment:19 Changed 6 years ago by bwerth
- Owner changed from bwerth to mkommend
- Status changed from assigned to reviewing
comment:20 Changed 6 years ago by mkommend
r15545: Corrected after deserialization hook of tSNE.
Sorry for changing the spacing in that file.
comment:21 Changed 6 years ago by mkommend
r15548: Changed name for tsne to be more descriptive. Adapted StringConvertibleArrayView to automatically resize row headers.
comment:22 Changed 6 years ago by mkommend
- Version changed from branch to trunk
comment:23 Changed 6 years ago by bwerth
r15551 fixed event registration
comment:24 Changed 6 years ago by bwerth
r15556 reduced state of TSNEAlgorithm.cs
comment:25 Changed 6 years ago by mkommend
- Status changed from reviewing to readytorelease
comment:26 Changed 6 years ago by mkommend
r15570: Deleted branch for weighted tSNE.
comment:27 Changed 6 years ago by mkommend
comment:28 Changed 6 years ago by jkarder
- Resolution set to done
- Status changed from readytorelease to closed
r15451 created branch & added WeightedEuclideanDistance