Opened 11 months ago

Closed 9 months ago

#2850 closed enhancement (done)

Extend tSNE with relevance weights

Reported by: bwerth Owned by: mkommend
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis Version: trunk
Keywords: tSNE Cc:

Description

The original tSNE uses all input dimensions equally. This decreases image quality if a large number of irrelevant features is present. In the supervised case, where the embedding should produce a separation for a preselected feature (e.g. the class variable for classification), weighting the input dimensions according to impact values determined by some form of supervised learning potentially produces better images.

Change History (28)

comment:1 Changed 11 months ago by bwerth

  • Status changed from new to accepted

comment:2 Changed 11 months ago by bwerth

r15451 created branch & added WeightedEuclideanDistance

comment:3 Changed 11 months ago by bwerth

r15455 added WeightedEuclideanDistance && fixed minor bug in scatterPlot coloring

comment:4 Changed 11 months ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from accepted to reviewing

comment:5 Changed 11 months ago by mkommend

  • Milestone changed from HeuristicLab 3.3.16 to HeuristicLab 3.3.15
  • Version set to branch

comment:6 Changed 10 months ago by mkommend

Review comments

  • It is rather unusual (read: ugly) to branch the internals of a plugin so that the branch directly contains the 3.4 folder. It is hard to grasp which plugin has been branched, and especially difficult if multiple plugins have been affected.
  • Why did the base class of DistanceBase change from Item to ParameterizedItem? Doesn't this change break the persistence of saved files?
  • WeightedEuclideanDistance
    • Use a standard cast (TTT) instead of an as cast for the parameter property. The reasoning is to get an InvalidCastException directly instead of a NullReferenceException some time later.
    • Is the weights parameter description accurate? Quote: "... If no weights are specified a Random Forrest Regression / Classification is used to automatically set the weigths."
    • Why is the weights parameter optional when GetDistance (called by Get) throws an Exception if null is encountered?
    • Why is the DoubleArray called impacts in GetDistance?
    • Avoid Linq if the length of a DoubleArray should be retrieved.
    • What's the point of squaring the weights and afterwards taking the sqrt? The Item description contradicts this information. √Σ(w²) != Σ(w)
    • Shouldn't weights always be positive? If so, a check would be appropriate.
    • Efficiency should be improved by avoiding ToArray calls in the Get method!
  • TSNEUtils
    • NthElement sorts a list between two indexes, which is not what the name suggests
  • TSNEAlgorithm
    • Remove Dependency to Encodings.RealVector.
    • Run method extraction of data kills the performance by accessing and extracting every value individually. Please benchmark and improve the call (line 289).
    • Use Color.Gradient instead of HsVtoRgb conversion
  • TSNEStatic
    • Lines 219-223: never use implicitly nested loops without opening a block!
Last edited 10 months ago by mkommend
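
The naming remark on NthElement can be illustrated with a minimal sketch. It is written in Python rather than the plugin's C#, and both function names and signatures below are hypothetical, not taken from TSNEUtils. It assumes that "sorts a list between two indexes" means sorting the half-open range [first, last) in place, whereas a selection routine only reports the element that would sit at position n after a full sort.

```python
import heapq


def partial_sort(values, first, last):
    """Sort values[first:last] in place; elements outside the range stay put."""
    values[first:last] = sorted(values[first:last])


def nth_element(values, n):
    """Return the element that would be at index n if values were fully sorted."""
    # nsmallest returns the n + 1 smallest items in ascending order,
    # so the last of them is the n-th order statistic (0-based).
    return heapq.nsmallest(n + 1, values)[-1]


data = [5, 1, 4, 2, 3]
partial_sort(data, 1, 4)          # only positions 1..3 are reordered
median = nth_element([5, 1, 4, 2, 3], 2)
```

The two operations answer different questions, which is why a routine that does the first should not carry the name of the second.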

comment:7 Changed 10 months ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

comment:8 Changed 10 months ago by bwerth

r15479

  • changed DistanceBase back to Item; made WeightedEuclideanDistance a ParameterizedItem (no longer derives from DistanceBase)
  • WeightedEuclideanDistance:
    • changed cast
    • Parameter Description fixed
    • Weights are no longer optional
    • renamed weights in GetDistance
    • Fixed incorrect Item description: applying weights as √Σ(w[i]²*(p1[i]-p2[i])²) is equivalent to multiplying each dimension by w[i] before calculating the distance, up to a constant factor of √(d/Σ(w[i]²)) (constant factors do not change the tSNE projection)
    • Weights theoretically need not be positive. As it might be strange that setting a weight to -10 has the same effect as setting it to 10, I added a check anyway
    • removed ToArray() calls
  • TSNEUtils:
    • Renamed NthElement to PartialSort
  • TSNEAlgorithm:
    • changed safe casts to direct casts in parameter properties
    • removed dependency to RealVectorEncoding;
    • changed coloring scheme to ColorGradient
  • TSNEStatic:
    • added block parenthesis
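
The weighted distance described in the list above can be sketched as follows. This is an illustration in Python, not the C# code from r15479; the function names are hypothetical, and the non-negativity check only mirrors the check mentioned above. The sketch shows that √Σ(w[i]²*(p1[i]-p2[i])²) gives the same value as the plain Euclidean distance between the two points after each dimension has been multiplied by w[i]. The constant normalization factor mentioned above is deliberately left out.

```python
import math


def euclidean_distance(p1, p2):
    """Plain Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def weighted_euclidean_distance(p1, p2, weights):
    """sqrt(sum(w[i]^2 * (p1[i] - p2[i])^2)) with non-negative weights."""
    if not (len(p1) == len(p2) == len(weights)):
        raise ValueError("points and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(p1, p2, weights)))


def scale(point, weights):
    """Multiply every dimension of a point by its weight."""
    return [w * x for w, x in zip(weights, point)]


p1, p2, w = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0], [2.0, 0.5, 10.0]
direct = weighted_euclidean_distance(p1, p2, w)
via_scaling = euclidean_distance(scale(p1, w), scale(p2, w))
```

Here `direct` and `via_scaling` agree up to floating-point rounding, and a weight of 0 removes a dimension from the distance entirely.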

comment:9 Changed 10 months ago by bwerth

r15484 changed data extraction

comment:10 Changed 10 months ago by bwerth

r15485 fixed comment in TSNEAlgorithm; changed private methods in TSNEAlgorithm from T[] to IReadOnlyList<T>

Note regarding the need for jagged arrays in TSNEAlgorithm: TSNEStatic<T> takes T[] as data because this way the static interface can be used to embed arbitrary data types (double arrays, strings, custom types) as long as a corresponding IDistance<T> exists, a feature I would like to preserve.

Last edited 10 months ago by mkommend

comment:11 Changed 10 months ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:12 Changed 10 months ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

Review Comments

  • This ticket breaks already persisted tSNE files so that they can be opened, but fail during execution.
  • r15479 introduces errors in most Distance functions!!! Be more careful while implementing & refactoring
    • while could be reformulated as while(p1Enum.MoveNext() & p2Enum.MoveNext()) instead of managing extra boolean variables.
    • Please test your changes thoroughly!
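
The loop suggested above advances both enumerators on every step because `&` does not short-circuit. A rough Python analogue of that lockstep pattern is sketched below. The function name and the sentinel are hypothetical, and raising an error on a length mismatch is an assumption about the desired behaviour, not something stated in the ticket.

```python
import math

_END = object()  # sentinel marking an exhausted iterator


def euclidean_distance_lockstep(p1, p2):
    """Euclidean distance over two iterables consumed in lockstep."""
    it1, it2 = iter(p1), iter(p2)
    total = 0.0
    while True:
        a = next(it1, _END)
        b = next(it2, _END)  # always advanced, like the right operand of `&`
        if a is _END or b is _END:
            break
        total += (a - b) ** 2
    # Assumption: inputs of different length are treated as an error.
    if a is not _END or b is not _END:
        raise ValueError("inputs have different lengths")
    return math.sqrt(total)


distance = euclidean_distance_lockstep(iter([0.0, 0.0]), iter([3.0, 4.0]))
```

Because both iterators are advanced before the exit test, no extra boolean bookkeeping is needed to find out which of the two ran out first.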

comment:13 Changed 10 months ago by bwerth

r15487 reenabled backwards compatibility; fixed distances

comment:14 Changed 10 months ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:15 Changed 10 months ago by mkommend

Final review comments

  • Weights should be automatically adapted if the number of checked inputs changes
  • Weights should be matched to inputs by the corresponding row name; only if the row names of all rows are string.Empty (maybe incl. null) should the weights be associated by index
  • Add parameters for backwards compatibility in AfterDeserializationHook

When those changes are implemented, please unit test the adapted plugin and merge the changes to the trunk.
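
The name-based weight assignment requested in the final review comments could look roughly like the Python sketch below. The function `assign_weights`, its parameters, and the default weight of 1.0 for inputs without a matching row name are all assumptions made for illustration; the ticket does not specify how unmatched inputs are handled.

```python
def assign_weights(input_names, weights, row_names, default=1.0):
    """Return one weight per input, aligned with the order of input_names.

    Weights are matched by row name. Only if every row name is empty or
    None are the weights taken by position instead.
    """
    if all(not name for name in row_names):
        # No usable names at all: fall back to index-based association.
        if len(weights) != len(input_names):
            raise ValueError("index-based weights need one weight per input")
        return list(weights)
    by_name = dict(zip(row_names, weights))
    # Assumption: inputs without a named weight receive the default.
    return [by_name.get(name, default) for name in input_names]


by_name = assign_weights(["x", "y"], [0.2, 0.8], ["y", "x"])
by_index = assign_weights(["x", "y"], [0.2, 0.8], ["", None])
```

With named rows the order of the weight vector does not matter, which is what makes it possible to reuse impact values computed elsewhere.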

comment:16 Changed 10 months ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

comment:17 Changed 9 months ago by bwerth

r15531

  • added automatic weight-length adaption
  • Weights are assigned to input dimensions based on name (e.g. one can use the variable impacts from a RegressionModel as weights)
  • added Parameter in Hook
Last edited 9 months ago by bwerth

comment:18 Changed 9 months ago by bwerth

r15532 merged Weighted TSNE to trunk

comment:19 Changed 9 months ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:20 Changed 9 months ago by mkommend

r15545: Corrected after deserialization hook of tSNE.

Sorry for changing the spacing in that file.

comment:21 Changed 9 months ago by mkommend

r15548: Changed name for tsne to be more descriptive. Adapted StringConvertibleArrayView to automatically resize row headers.

comment:22 Changed 9 months ago by mkommend

  • Version changed from branch to trunk

comment:23 Changed 9 months ago by bwerth

r15551 fixed event registration

comment:24 Changed 9 months ago by bwerth

r15556 reduced state of TSNEAlgorithm.cs

comment:25 Changed 9 months ago by mkommend

  • Status changed from reviewing to readytorelease

comment:26 Changed 9 months ago by mkommend

r15570: Deleted branch for weighted tSNE.

comment:27 Changed 9 months ago by mkommend

r15571: Merged r15532, r15545, r15548, r15551, r15560, r15570 into stable.

comment:28 Changed 9 months ago by jkarder

  • Resolution set to done
  • Status changed from readytorelease to closed