Opened 12 months ago

Closed 3 months ago

#2700 closed feature request (done)

t-Distributed Stochastic Neighbor Embedding

Reported by: bwerth Owned by: gkronber
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis Version: 3.3.14
Keywords: Cc:

Description

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets and should be avialable in HeuristicLab

Change History (55)

comment:1 Changed 12 months ago by bwerth

  • Owner set to bwerth
  • Status changed from new to assigned

comment:2 Changed 12 months ago by bwerth

  • Type changed from defect to feature request

comment:3 Changed 12 months ago by bwerth

  • Status changed from assigned to accepted
  • Version changed from 3.3.14 to branch

comment:4 Changed 12 months ago by bwerth

r14387 created initial branch

comment:5 Changed 11 months ago by bwerth

r14413 refactored c++ style code to C# (use of [,] arrays, int vs uint,..) + corrected IterationsCounter

Last edited 11 months ago by bwerth (previous) (diff)

comment:6 Changed 11 months ago by bwerth

r14414 forgot to add files

comment:7 Changed 11 months ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from accepted to reviewing

comment:8 Changed 10 months ago by gkronber

Comments from an initial review:

  • It should be possible to stop the algorithm at any time
  • A quality line chart for the error would probably be interesting
  • It would be nice to be able to view the projection after each iteration
  • The descriptions for parameters should contain information on default settings or useful settings for the parameters.
  • Is it necessary to tune all the parameters for learning? Or would it also be ok to just use some robust default settings and hide most of the parameters (except for perplexity)
  • I think it is not strictly necessary that TSNE derives from Item (since it is probably never used directly in the GUI)
  • Error message: "Perplexity should be lower than K" what's K?

Let's discuss this in person...

Last edited 7 months ago by gkronber (previous) (diff)

comment:9 Changed 10 months ago by gkronber

  • Owner changed from gkronber to bwerth
  • Status changed from reviewing to assigned

comment:10 Changed 10 months ago by gkronber

r14503: minor change while reviewing

comment:11 Changed 10 months ago by bwerth

r14512 worked in several comments from mkommend, extended analysis during algorithm run, added more Distances, made algorithm stoppable

comment:12 Changed 10 months ago by gkronber

More observations:

  • TSNE should be a BasicAlgorithm
  • Exception when switching between views (projected data & quality line chart) while the algorithm is running
  • r14512 added references to files for a kernel PCA in the project file (please remove).
  • Why does the error change abruptly when the 'stop-lying-iteration' is reached? (--> OK)
  • Hide parameters: *Momentum, Eta, MomentumSwitch, StopLying. Set StopLying to zero per default.
Last edited 7 months ago by gkronber (previous) (diff)

comment:13 Changed 10 months ago by bwerth

r14518 TSNEAnalysis is now a BasicAlg, hid Parameters, added optional data normalization to make TSNE scaling-invariant

comment:14 Changed 10 months ago by bwerth

r14558 made TSNE compatible with the new pausible BasicAlgs, removed rescaling of scatterplots during alg to give it a more movie-esque feel

comment:15 Changed 8 months ago by abeham

r14682: fixed references for alglib and libsvm

comment:16 Changed 8 months ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from assigned to reviewing

r14742 fixed displaying of randomly generated seed and some minor code simplifications

comment:17 Changed 7 months ago by gkronber

The 'performance-improved' distance methods which also accept a threshold seem to be implemented incorrectly. However, they are not used by tSNE anyway so I'm removing them.

 sum = 0;
 ...
 while(sum > threshold ...) {
   sum += ...
 }
 return sum;

comment:18 Changed 7 months ago by gkronber

VPTree contains a TODO item TODO check if minheap or maxheap should be used here

comment:19 Changed 7 months ago by gkronber

r14767: made some changes while reviewing the code

comment:20 Changed 7 months ago by gkronber

ABE: fixed revision numbers

Last edited 7 months ago by abeham (previous) (diff)

comment:21 follow-up: Changed 7 months ago by abeham

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

comment:22 in reply to: ↑ 21 Changed 7 months ago by gkronber

Replying to abeham:

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset as for all other data-analysis algs).

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

I also already noticed that theta=0 produces very different results. We should leave this option for the case of N>5000 but use theta=0 as default. Additionally, we should look at the differences between theta=0 and theta=eps, maybe there is another issue hidden there.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

Full ack. I plan to refactor the code to provide a simple static method.

comment:23 Changed 7 months ago by gkronber

r14784: fixed build fail

comment:24 Changed 7 months ago by abeham

For comparison purposes, there's also a JavaScript implementation of TSNE (and probably some more).

comment:25 Changed 7 months ago by gkronber

TODO:

  • Add parameter to update results only X iterations (done)
Last edited 5 months ago by gkronber (previous) (diff)

comment:26 Changed 7 months ago by gkronber

r14785: changes and while reviewing

comment:27 Changed 7 months ago by gkronber

r14787: more changes while reviewing

comment:28 Changed 7 months ago by gkronber

r14788: refactoring

comment:29 Changed 7 months ago by gkronber

r14806: worked on tSNE, storable and cloning for tSNE state. Added some TODO comments while reviewing.

comment:30 Changed 7 months ago by gkronber

r14807: support clone and persistence and pause/resume

comment:31 Changed 7 months ago by gkronber

Review comments:

  • in fast_tsne.m an initial dimensionality reduction using PCA is performed before running bh_tsne features is not absolutely necessary
  • normalization of data should be moved to TSNEStatic (ZeroMean and diving by max) this is not possible as the tSNE implementation is not specific to real vectors. Therefore, scaling was left in the main algorithm
Last edited 6 months ago by gkronber (previous) (diff)

comment:32 Changed 7 months ago by gkronber

r14836: merged changesets from trunk to branch

comment:33 Changed 7 months ago by gkronber

r14837: made some changes while reviewing (comparsion with bh_tsne implementation by van der Maarten)

comment:34 Changed 6 months ago by gkronber

r14855: made some changes / bug-fixes while reviewing

comment:35 Changed 6 months ago by gkronber

Script for comparison with bh_tsne.exe

  public override void Main() {
    double theta =0;
    int no_dims = 2;
    int iters = 1000;
    double perplexity = 100;
    
    var ds = (Dataset)vars["tower"];
    var inputVars = (ReadOnlyCheckedItemList<StringValue>)vars["input_vars"];
    using(var writer = new BinaryWriter(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\data.dat", FileMode.Create))) {
      writer.Write((int)ds.Rows);
      writer.Write((int)inputVars.CheckedItems.Count());
      writer.Write((double)theta);
      writer.Write((double)perplexity);
      writer.Write((int)no_dims);
      writer.Write((int)iters);
      for(int row = 0;row<ds.Rows;row++) {        
        foreach(var @var in inputVars.CheckedItems) {
          writer.Write(ds.GetDoubleValue(@var.Value.Value, row));
        }
      }
    }
    
    using(var reader = new BinaryReader(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\result.dat", FileMode.Open))) {
      int h = reader.ReadInt32();
      int d = reader.ReadInt32();
      var scatterPlot = new ScatterPlot();
      vars["mapped"] = scatterPlot;
      
      Console.WriteLine("d={0} h={1}", d, h);
      var scatterRow = new ScatterPlotDataRow();
      
      for(int r = 0;r<h;r++) {
         
        var p = new Point2D<double>(reader.ReadDouble(), reader.ReadDouble());
        scatterRow.Points.Add(p);
      }
      
      
      scatterPlot.Rows.Add(scatterRow);
    }
  }

comment:36 Changed 6 months ago by gkronber

r14856: fixed bug in approximate gradient calculation

comment:37 Changed 6 months ago by gkronber

r14858: fixed another bug in the bh-tsne implementation

comment:38 Changed 6 months ago by gkronber

tSNE and approximate tSNE produce the same results now and results are comparable to the C++ implementation.

The C# version is much slower though.

comment:39 Changed 6 months ago by gkronber

r14859: update only every x iterations and fixed output paths

comment:40 Changed 6 months ago by gkronber

r14862: copied tSNE from branch to trunk

comment:41 Changed 6 months ago by gkronber

  • Owner changed from gkronber to abeham

comment:42 Changed 6 months ago by gkronber

r14863: fixed bug in cloning ctor

comment:43 Changed 6 months ago by gkronber

r14867: deleted branch as it has been integrated into the trunk

comment:44 Changed 5 months ago by gkronber

  • Version changed from branch to 3.3.14

comment:45 Changed 3 months ago by abeham

Code Review Comments

Distance related calculation

This should all be moved to HeuristicLab.Common.

The InnerProductDistance needs to be looked at. It does not satisfy point 3 of the conditions mentioned in IDistance. It is not zero-reflexive, because d(x,x) can only be 0 if x is equal to the null vector. Inner product is not a measure for distance, because it gets larger the more similar two vectors are (if normalized with the product of the vectors' lengths). Also, the description mentions angular distance and speaks of normalization (there is no normalization). It is not clear to me why it is only defined for vectors with all non-negative dimensions. A distance measure based on inner product would probably look something like 1 - cosine(alpha) where cosine(alpha) = <x,y> / (||x|| * ||y||) (<x,y> = inner product) to me. This results in a distance measure in the interval [0; 2] related to the angle of two vectors (0 for vectors of identical angle, 1 for vectors perpendicular to each other and 2 for vectors in the exact opposite direction).

Implementation of the distance measures with .Zip() and .Sum() is nice to read, but inefficient. I would recommend to implement efficiently and sacrifice a little bit on readability, especially if it's put in Common.

Collection and Trees

It is mentioned in the TODO of PriorityQueue that this should move to HeuristicLab.Collections. However, it's not a storable class and it's not observable. I would move it to HeuristicLab.Common. VantagePointTree and SpacePartitioningTree should also be moved to Common (maybe all together in a folder Collections).

Static TSNE

It uses different default values than the TSNE. I personally think that an eta of 200 is too much, I would go with the same values that are default in the tSNE for data analysis problems. van der Maarten mentions adaptive learning rate at some points.

TSNEAlgorithm

Also concerns static TSNE: if our implementation is based on some variant of t-sne (of which there are several) we should state that in the ItemName and/or the ItemDescription. Additionally, a reference to the publication or other source should be given in the description.

I have yet to test the t-sne with data.

Last edited 3 months ago by abeham (previous) (diff)

comment:46 Changed 3 months ago by bwerth

r15207, r15208, r15209

  • Replaced InnerProductDistance with CosineDistance;
  • Removed LINQ from Distances;
  • Removed unused Interfaces;
  • Removed unused methods from SpacePartitioningTree and made it internal
  • some code formatting
  • changed default values in TSNEStatic
  • added links to barnes hut paper + code in ItemDescription of TSNEAlgorithm
  • added some comments to the VantagePointTree
Last edited 3 months ago by bwerth (previous) (diff)

comment:47 Changed 3 months ago by abeham

Further Review Comments

ClassNames parameter should be an OptionalConstrainedValueParameter<StringValue> and should be filled with the input variable names when a problem is loaded.

DistanceFunction should be a ConstrainedValueParameter<IDistance<double[]>> and should be filled with all instances of type IDistance<double[]> through discovery using ApplicationManager.GetInstances().

Both should be exposed as IConstrainedValueParameter<T> in the parameter properties.

In implementing this you must then override OnProblemChanged. Add event handlers to ProblemDataChanged. In case problem or problemdata changes clear the constrained value parameter for class names and fill them with the ProblemData's InputVariables (use all of them, not just the allowed ones). If event handlers are added you must add an after deserialization hook and also register the event handlers there. Event handlers must also be registered in the default constructor and after the cloning constructor.

Ideally Theta parameter would be a PercentValue instead of a DoubleValue. Internally this is completely the same, but it's presented to the user as a percentage and it makes it more obvious that this is something that is in the range [0;1].

TSNE throws an exception if a variable is included that has the same value for each row. The ArithmeticException that is thrown (in Math.Sign) says: "Function does not accept floating point Not-a-Number values." Maybe this can be caught and rethrown with a better description.

comment:48 Changed 3 months ago by abeham

  • Owner changed from abeham to bwerth
  • Status changed from reviewing to assigned

This concludes the review. I have tested it with a few datasets. The results seem to be good.

comment:49 Changed 3 months ago by bwerth

r15225, r15227

  • made ClassesParameter an OptionalConstrainedValueParameter<StringValue>
  • made DistanceParameter a ConstrainedValueParameter<IDistance<double[]>>
  • overrode OnProblemChanged, RegisterProblemEvents and DeregisterProblemEvents to correctly update the ClassesParameter
  • Changed Theta to PercentValue
  • Changed Normalization to no longer perform division by zero (columns with 0 variance, stay 0 variance). This should get rid of the NaNs without throwing an Exception
Last edited 3 months ago by bwerth (previous) (diff)

comment:50 Changed 3 months ago by bwerth

  • Owner changed from bwerth to abeham
  • Status changed from assigned to reviewing

comment:51 Changed 3 months ago by gkronber

  • Owner changed from abeham to gkronber

comment:52 Changed 3 months ago by abeham

r15234: fixed some typos

comment:53 Changed 3 months ago by gkronber

  • Status changed from reviewing to readytorelease

Reviewed and tested again. Everything looks fine.

Last edited 3 months ago by gkronber (previous) (diff)

comment:54 Changed 3 months ago by gkronber

r15249: merged r14862, r14863, r14911, r14936, r15156, r15157, r15158, r15164, r15169, r15207:15209, r15225, r15227, r15234, r15248 from trunk to stable

(combined merge of #2699 and #2700)

Last edited 3 months ago by gkronber (previous) (diff)

comment:55 Changed 3 months ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed
Note: See TracTickets for help on using tickets.