Opened 6 years ago

Closed 6 years ago

#2850 closed enhancement (done)

Extend tSNE with relevance weights

Reported by: bwerth
Owned by: mkommend
Priority: medium
Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis
Version: trunk
Keywords: tSNE
Cc:

Description

The original tSNE weights all input dimensions equally. This decreases image quality if a large number of irrelevant features is present. In the supervised case, where the embedding should produce a separation for a preselected feature (e.g. the class variable for classification), weighting the input dimensions according to impact values determined by some form of supervised learning potentially produces better images.

Change History (28)

comment:1 Changed 6 years ago by bwerth

  • Status changed from new to accepted

comment:2 Changed 6 years ago by bwerth

r15451 created branch & added WeightedEuclideanDistance

comment:3 Changed 6 years ago by bwerth

r15455 added WeightedEuclideanDistance && fixed minor bug in scatterPlot coloring

comment:4 Changed 6 years ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from accepted to reviewing

comment:5 Changed 6 years ago by mkommend

  • Milestone changed from HeuristicLab 3.3.16 to HeuristicLab 3.3.15
  • Version set to branch

comment:6 Changed 6 years ago by mkommend

Review comments

  • It is rather unusual (read: ugly) to branch the internals of a plugin so that the branch directly contains the 3.4 folder. It is hard to grasp which plugin has been branched, and especially difficult if multiple plugins have been affected.
  • Why did the base class of DistanceBase change from Item to ParameterizedItem? Doesn't this change break the persistence of saved files?
  • WeightedEuclideanDistance
    • Use a standard cast (TTT) instead of an as cast for the parameter property. The reasoning is to get an InvalidCastException directly instead of a NullReferenceException sometime later.
    • Is the weights parameter description accurate? Quote: "... If no weights are specified a Random Forrest Regression / Classification is used to automatically set the weigths."
    • Why is the weights parameter optional when GetDistance (called by Get) throws an Exception if null is encountered?
    • Why is the DoubleArray called impacts in GetDistance?
    • Avoid Linq if the length of a DoubleArray should be retrieved.
    • What's the point of squaring the weights and afterwards taking the sqrt? The item description contradicts this. √Σ(w²) != Σ(w)
    • Shouldn't weights always be positive? If so, a check would be appropriate.
    • Efficiency should be improved by avoiding ToArray calls in the Get method!
  • TSNEUtils
    • NthElement sorts a list between two indexes, which is not what the name suggests
  • TSNEAlgorithm
    • Remove Dependency to Encodings.RealVector.
    • The data extraction in the Run method kills performance by accessing and extracting every value individually. Please benchmark and improve the call (line 289).
    • Use Color.Gradient instead of HsVtoRgb conversion
  • TSNEStatic
    • Lines 219-223: never use implicitly nested loops without opening a block!
Last edited 6 years ago by mkommend
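
The naming concern raised for NthElement can be illustrated with a minimal Python sketch. This is not the HeuristicLab C# code; both function names below are hypothetical and chosen only for illustration. A selection routine in the usual sense returns the element a full sort would place at position n, whereas a routine that orders everything between two indexes is better described as a partial sort.

```python
import heapq

def nth_element(values, n):
    """Return the value a full sort would place at index n (selection only)."""
    if not 0 <= n < len(values):
        raise IndexError("n out of range")
    # heapq.nsmallest returns the n+1 smallest items in ascending order;
    # the last of them is the n-th order statistic.
    return heapq.nsmallest(n + 1, values)[-1]

def partial_sort(values, lo, hi):
    """Return a copy in which only positions lo..hi (inclusive) are sorted."""
    result = list(values)
    result[lo:hi + 1] = sorted(result[lo:hi + 1])
    return result

data = [9, 4, 7, 1, 8, 2]
third_smallest = nth_element(data, 2)      # selection: a single value
middle_sorted = partial_sort(data, 1, 4)   # ordering: a sub-range is rearranged
```

Here the first call answers "which value belongs at index 2", while the second reorders the entries at indexes 1 to 4 and leaves the rest untouched, which matches the behaviour described in the review comment.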

comment:7 Changed 6 years ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

comment:8 Changed 6 years ago by bwerth

r15479

  • changed DistanceBase back to Item; made WeightedEuclideanDistance a ParameterizedItem (no longer derives from DistanceBase)
  • WeightedEuclideanDistance:
    • changed cast
    • Parameter Description fixed
    • Weights are no longer optional
    • renamed weights in GetDistance
    • Fixed the incorrect Item description. Applying the weights as √Σ(w[i]²*(p1[i]-p2[i])²) should be equivalent to multiplying each dimension by w[i] before calculating the distance, up to a constant factor of √(d/Σ(w[i]²)) (constant factors do not change the tSNE projection)
    • Weights theoretically need not be positive. As it might be strange that setting a weight to -10 has the same effect as setting it to 10, I added a check anyway
    • removed ToArray() calls
  • TSNEUtils:
    • Renamed NthElement to PartialSort
  • TSNEAlgorithm:
    • changed safe casts to direct casts in parameter properties
    • removed dependency to RealVectorEncoding;
    • changed coloring scheme to ColorGradient
  • TSNEStatic:
    • added block parenthesis
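
The weighting described for WeightedEuclideanDistance in r15479 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual C# implementation: the function names are hypothetical, and rejecting negative weights with a ValueError is only one possible form of the check mentioned above.

```python
import math

def weighted_euclidean_distance(p1, p2, weights):
    """sqrt(sum_i w[i]^2 * (p1[i] - p2[i])^2), rejecting negative weights."""
    if not (len(p1) == len(p2) == len(weights)):
        raise ValueError("p1, p2 and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(p1, p2, weights)))

def euclidean_distance(p1, p2):
    """Plain (unweighted) Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

p1, p2, w = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0], [1.0, 0.5, 0.0]
direct = weighted_euclidean_distance(p1, p2, w)
# Scaling every dimension by its weight first gives the same distance.
scaled = euclidean_distance([wi * a for wi, a in zip(w, p1)],
                            [wi * b for wi, b in zip(w, p2)])
```

The sketch omits any normalization constant, since a constant factor applied to all distances is stated above not to change the projection.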

comment:9 Changed 6 years ago by bwerth

r15484 changed data extraction

comment:10 Changed 6 years ago by bwerth

r15485 fixed comment in TSNEAlgorithm; changed private methods in TSNEAlgorithm from T[] to IReadOnlyList<T>

Note regarding the need for jagged arrays in TSNEAlgorithm: TSNEStatic<T> takes T[] as data because this way the static interface can be used to embed arbitrary data types (double arrays, strings, custom types) as long as a corresponding IDistance<T> exists, a feature I would like to preserve.

Last edited 6 years ago by mkommend
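
The design point in the note above (keeping the static interface generic so that any data type with a matching distance can be embedded) can be illustrated with a short Python sketch. The names pairwise_distances and hamming are hypothetical stand-ins for the roles of TSNEStatic<T> and IDistance<T>, not their actual APIs.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def pairwise_distances(items: Sequence[T],
                       distance: Callable[[T, T], float]) -> List[List[float]]:
    """Symmetric distance matrix for arbitrary item types."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(items[i], items[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix

def hamming(a: str, b: str) -> float:
    """Number of differing positions; assumes equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return float(sum(x != y for x, y in zip(a, b)))

words = ["tsne", "tsna", "asna"]
word_distances = pairwise_distances(words, hamming)
```

Because the matrix routine only ever calls the supplied distance function, the same code would serve double arrays, strings, or custom types.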

comment:11 Changed 6 years ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:12 Changed 6 years ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

Review Comments

  • This ticket breaks already persisted tSNE files so that they can be opened, but fail during execution.
  • r15479 introduces errors in most Distance functions!!! Be more careful while implementing & refactoring
    • The while loop could be reformulated as while(p1Enum.MoveNext() & p2Enum.MoveNext()) instead of managing extra boolean variables.
    • Please test your changes thoroughly!

comment:13 Changed 6 years ago by bwerth

r15487 reenabled backwards compatibility; fixed distances

comment:14 Changed 6 years ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:15 Changed 6 years ago by mkommend

Final review comments

  • Weights should be automatically adapted if the number of checked inputs changes
  • Weights are matched based on the corresponding row name; only if the row name of every row is string.Empty (maybe incl. null) are the weights associated based on their index
  • Add parameters for backwards compatibility in AfterDeserializationHook

When those changes are implemented, please unit test the adapted plugin and merge the changes to the trunk.
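
The name-based matching rule from the final review comments could be sketched like this in Python. It is an illustration only: assign_weights is a hypothetical helper, and raising an error when an input has no matching row name is an assumption, since the ticket does not say how that case is handled.

```python
def assign_weights(input_names, weights, row_names):
    """Map each input dimension to a weight.

    Weights are matched by row name; only when every row name is empty
    (or None) are they matched by position instead.
    """
    if len(weights) != len(row_names):
        raise ValueError("weights and row_names must have the same length")
    if all(not name for name in row_names):
        if len(weights) != len(input_names):
            raise ValueError("index-based weights need one entry per input")
        return list(weights)
    by_name = dict(zip(row_names, weights))
    # Assumption: an input without a named weight is treated as an error.
    missing = [name for name in input_names if name not in by_name]
    if missing:
        raise KeyError(f"no weight for inputs: {missing}")
    return [by_name[name] for name in input_names]

inputs = ["x1", "x2", "x3"]
named = assign_weights(inputs, [0.2, 0.7, 0.1], ["x3", "x1", "x2"])
positional = assign_weights(inputs, [0.5, 0.3, 0.2], ["", "", None])
```

With named rows the weights follow the input order regardless of how the rows are arranged; with unnamed rows they are taken as given.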

comment:16 Changed 6 years ago by mkommend

  • Owner changed from mkommend to bwerth
  • Status changed from reviewing to assigned

comment:17 Changed 6 years ago by bwerth

r15531

  • added automatic weight-length adaption
  • Weights are assigned to input dimensions based on name (e.g. one can use the variable impacts from a RegressionModel as weights)
  • added Parameter in Hook
Last edited 6 years ago by bwerth

comment:18 Changed 6 years ago by bwerth

r15532 merged Weighted TSNE to trunk

comment:19 Changed 6 years ago by bwerth

  • Owner changed from bwerth to mkommend
  • Status changed from assigned to reviewing

comment:20 Changed 6 years ago by mkommend

r15545: Corrected after deserialization hook of tSNE.

Sorry for changing the spacing in that file.

comment:21 Changed 6 years ago by mkommend

r15548: Changed name for tsne to be more descriptive. Adapted StringConvertibleArrayView to automatically resize row headers.

comment:22 Changed 6 years ago by mkommend

  • Version changed from branch to trunk

comment:23 Changed 6 years ago by bwerth

r15551 fixed event registration

comment:24 Changed 6 years ago by bwerth

r15556 reduced state of TSNEAlgorithm.cs

comment:25 Changed 6 years ago by mkommend

  • Status changed from reviewing to readytorelease

comment:26 Changed 6 years ago by mkommend

r15570: Deleted branch for weighted tSNE.

comment:27 Changed 6 years ago by mkommend

r15571: Merged r15532, r15545, r15548, r15551, r15560, r15570 into stable.

comment:28 Changed 6 years ago by jkarder

  • Resolution set to done
  • Status changed from readytorelease to closed