Opened 8 months ago

Last modified 4 weeks ago

#2700 reviewing feature request

t-Distributed Stochastic Neighbor Embedding

Reported by: bwerth Owned by: abeham
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis Version: 3.3.14
Keywords: Cc:

Description

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets and should be avialable in HeuristicLab

Change History (44)

comment:1 Changed 8 months ago by bwerth

  • Owner set to bwerth
  • Status changed from new to assigned

comment:2 Changed 8 months ago by bwerth

  • Type changed from defect to feature request

comment:3 Changed 8 months ago by bwerth

  • Status changed from assigned to accepted
  • Version changed from 3.3.14 to branch

comment:4 Changed 8 months ago by bwerth

r14387 created initial branch

comment:5 Changed 7 months ago by bwerth

r14413 refactored c++ style code to C# (use of [,] arrays, int vs uint,..) + corrected IterationsCounter

Last edited 7 months ago by bwerth (previous) (diff)

comment:6 Changed 7 months ago by bwerth

r14414 forgot to add files

comment:7 Changed 7 months ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from accepted to reviewing

comment:8 Changed 6 months ago by gkronber

Comments from an initial review:

  • It should be possible to stop the algorithm at any time
  • A quality line chart for the error would probably be interesting
  • It would be nice to be able to view the projection after each iteration
  • The descriptions for parameters should contain information on default settings or useful settings for the parameters.
  • Is it necessary to tune all the parameters for learning? Or would it also be ok to just use some robust default settings and hide most of the parameters (except for perplexity)
  • I think it is not strictly necessary that TSNE derives from Item (since it is probably never used directly in the GUI)
  • Error message: "Perplexity should be lower than K" what's K?

Let's discuss this in person...

Last edited 3 months ago by gkronber (previous) (diff)

comment:9 Changed 6 months ago by gkronber

  • Owner changed from gkronber to bwerth
  • Status changed from reviewing to assigned

comment:10 Changed 6 months ago by gkronber

r14503: minor change while reviewing

comment:11 Changed 6 months ago by bwerth

r14512 worked in several comments from mkommend, extended analysis during algorithm run, added more Distances, made algorithm stoppable

comment:12 Changed 6 months ago by gkronber

More observations:

  • TSNE should be a BasicAlgorithm
  • Exception when switching between views (projected data & quality line chart) while the algorithm is running
  • r14512 added references to files for a kernel PCA in the project file (please remove).
  • Why does the error change abruptly when the 'stop-lying-iteration' is reached? (--> OK)
  • Hide parameters: *Momentum, Eta, MomentumSwitch, StopLying. Set StopLying to zero per default.
Last edited 3 months ago by gkronber (previous) (diff)

comment:13 Changed 6 months ago by bwerth

r14518 TSNEAnalysis is now a BasicAlg, hid Parameters, added optional data normalization to make TSNE scaling-invariant

comment:14 Changed 5 months ago by bwerth

r14558 made TSNE compatible with the new pausible BasicAlgs, removed rescaling of scatterplots during alg to give it a more movie-esque feel

comment:15 Changed 4 months ago by abeham

r14682: fixed references for alglib and libsvm

comment:16 Changed 4 months ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from assigned to reviewing

r14742 fixed displaying of randomly generated seed and some minor code simplifications

comment:17 Changed 3 months ago by gkronber

The 'performance-improved' distance methods which also accept a threshold seem to be implemented incorrectly. However, they are not used by tSNE anyway so I'm removing them.

 sum = 0;
 ...
 while(sum > threshold ...) {
   sum += ...
 }
 return sum;

comment:18 Changed 3 months ago by gkronber

VPTree contains a TODO item TODO check if minheap or maxheap should be used here

comment:19 Changed 3 months ago by gkronber

r14767: made some changes while reviewing the code

comment:20 Changed 3 months ago by gkronber

ABE: fixed revision numbers

Last edited 3 months ago by abeham (previous) (diff)

comment:21 follow-up: Changed 3 months ago by abeham

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

comment:22 in reply to: ↑ 21 Changed 3 months ago by gkronber

Replying to abeham:

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset as for all other data-analysis algs).

Also as we (bwerth + abeham) discussed: The initial parameters are probably not that well suited for large number of cases. A benchmark set of several different data should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to be default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.

I also already noticed that theta=0 produces very different results. We should leave this option for the case of N>5000 but use theta=0 as default. Additionally, we should look at the differences between theta=0 and theta=eps, maybe there is another issue hidden there.

Finally, I would recommend to have both TSNE in form of an easy-to-use API call and as a BasicAlgorithm.

Full ack. I plan to refactor the code to provide a simple static method.

comment:23 Changed 3 months ago by gkronber

r14784: fixed build fail

comment:24 Changed 3 months ago by abeham

For comparison purposes, there's also a JavaScript implementation of TSNE (and probably some more).

comment:25 Changed 3 months ago by gkronber

TODO:

  • Add parameter to update results only X iterations (done)
Last edited 4 weeks ago by gkronber (previous) (diff)

comment:26 Changed 3 months ago by gkronber

r14785: changes and while reviewing

comment:27 Changed 3 months ago by gkronber

r14787: more changes while reviewing

comment:28 Changed 3 months ago by gkronber

r14788: refactoring

comment:29 Changed 3 months ago by gkronber

r14806: worked on tSNE, storable and cloning for tSNE state. Added some TODO comments while reviewing.

comment:30 Changed 3 months ago by gkronber

r14807: support clone and persistence and pause/resume

comment:31 Changed 3 months ago by gkronber

Review comments:

  • in fast_tsne.m an initial dimensionality reduction using PCA is performed before running bh_tsne features is not absolutely necessary
  • normalization of data should be moved to TSNEStatic (ZeroMean and diving by max) this is not possible as the tSNE implementation is not specific to real vectors. Therefore, scaling was left in the main algorithm
Last edited 2 months ago by gkronber (previous) (diff)

comment:32 Changed 3 months ago by gkronber

r14836: merged changesets from trunk to branch

comment:33 Changed 3 months ago by gkronber

r14837: made some changes while reviewing (comparsion with bh_tsne implementation by van der Maarten)

comment:34 Changed 2 months ago by gkronber

r14855: made some changes / bug-fixes while reviewing

comment:35 Changed 2 months ago by gkronber

Script for comparison with bh_tsne.exe

  public override void Main() {
    double theta =0;
    int no_dims = 2;
    int iters = 1000;
    double perplexity = 100;
    
    var ds = (Dataset)vars["tower"];
    var inputVars = (ReadOnlyCheckedItemList<StringValue>)vars["input_vars"];
    using(var writer = new BinaryWriter(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\data.dat", FileMode.Create))) {
      writer.Write((int)ds.Rows);
      writer.Write((int)inputVars.CheckedItems.Count());
      writer.Write((double)theta);
      writer.Write((double)perplexity);
      writer.Write((int)no_dims);
      writer.Write((int)iters);
      for(int row = 0;row<ds.Rows;row++) {        
        foreach(var @var in inputVars.CheckedItems) {
          writer.Write(ds.GetDoubleValue(@var.Value.Value, row));
        }
      }
    }
    
    using(var reader = new BinaryReader(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\result.dat", FileMode.Open))) {
      int h = reader.ReadInt32();
      int d = reader.ReadInt32();
      var scatterPlot = new ScatterPlot();
      vars["mapped"] = scatterPlot;
      
      Console.WriteLine("d={0} h={1}", d, h);
      var scatterRow = new ScatterPlotDataRow();
      
      for(int r = 0;r<h;r++) {
         
        var p = new Point2D<double>(reader.ReadDouble(), reader.ReadDouble());
        scatterRow.Points.Add(p);
      }
      
      
      scatterPlot.Rows.Add(scatterRow);
    }
  }

comment:36 Changed 2 months ago by gkronber

r14856: fixed bug in approximate gradient calculation

comment:37 Changed 2 months ago by gkronber

r14858: fixed another bug in the bh-tsne implementation

comment:38 Changed 2 months ago by gkronber

tSNE and approximate tSNE produce the same results now and results are comparable to the C++ implementation.

The C# version is much slower though.

comment:39 Changed 2 months ago by gkronber

r14859: update only every x iterations and fixed output paths

comment:40 Changed 2 months ago by gkronber

r14862: copied tSNE from branch to trunk

comment:41 Changed 2 months ago by gkronber

  • Owner changed from gkronber to abeham

comment:42 Changed 2 months ago by gkronber

r14863: fixed bug in cloning ctor

comment:43 Changed 2 months ago by gkronber

r14867: deleted branch as it has been integrated into the trunk

comment:44 Changed 4 weeks ago by gkronber

  • Version changed from branch to 3.3.14
Note: See TracTickets for help on using tickets.