Opened 6 months ago

Last modified 2 weeks ago

#2700 reviewing feature request

t-Distributed Stochastic Neighbor Embedding

Reported by: bwerth Owned by: abeham
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis Version: branch
Keywords: Cc:

Description

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. It should be available in HeuristicLab.

Change History (43)

comment:1 Changed 6 months ago by bwerth

  • Owner set to bwerth
  • Status changed from new to assigned

comment:2 Changed 6 months ago by bwerth

  • Type changed from defect to feature request

comment:3 Changed 6 months ago by bwerth

  • Status changed from assigned to accepted
  • Version changed from 3.3.14 to branch

comment:4 Changed 6 months ago by bwerth

r14387 created initial branch

comment:5 Changed 5 months ago by bwerth

r14413 refactored C++-style code to C# (use of [,] arrays, int vs. uint, ...) and corrected IterationsCounter

Last edited 5 months ago by bwerth (previous) (diff)

comment:6 Changed 5 months ago by bwerth

r14414 forgot to add files

comment:7 Changed 5 months ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from accepted to reviewing

comment:8 Changed 4 months ago by gkronber

Comments from an initial review:

  • It should be possible to stop the algorithm at any time
  • A quality line chart for the error would probably be interesting
  • It would be nice to be able to view the projection after each iteration
  • The parameter descriptions should contain information on default or otherwise useful settings.
  • Is it necessary to tune all the learning parameters? Or would it also be OK to just use robust default settings and hide most of the parameters (except for perplexity)?
  • I think it is not strictly necessary that TSNE derives from Item (since it is probably never used directly in the GUI)
  • Error message: "Perplexity should be lower than K" what's K?

Let's discuss this in person...

Last edited 6 weeks ago by gkronber (previous) (diff)

comment:9 Changed 4 months ago by gkronber

  • Owner changed from gkronber to bwerth
  • Status changed from reviewing to assigned

comment:10 Changed 4 months ago by gkronber

r14503: minor change while reviewing

comment:11 Changed 4 months ago by bwerth

r14512 worked in several comments from mkommend, extended analysis during algorithm run, added more Distances, made algorithm stoppable

comment:12 Changed 4 months ago by gkronber

More observations:

  • TSNE should be a BasicAlgorithm
  • Exception when switching between views (projected data & quality line chart) while the algorithm is running
  • r14512 added references to files for a kernel PCA in the project file (please remove).
  • Why does the error change abruptly when the 'stop-lying-iteration' is reached? (--> OK)
  • Hide parameters: *Momentum, Eta, MomentumSwitch, StopLying. Set StopLying to zero per default.

Last edited 6 weeks ago by gkronber (previous) (diff)

comment:13 Changed 4 months ago by bwerth

r14518 TSNEAnalysis is now a BasicAlg, hid Parameters, added optional data normalization to make TSNE scaling-invariant

comment:14 Changed 4 months ago by bwerth

r14558 made TSNE compatible with the new pausible BasicAlgs, removed rescaling of scatterplots during alg to give it a more movie-esque feel

comment:15 Changed 2 months ago by abeham

r14682: fixed references for alglib and libsvm

comment:16 Changed 7 weeks ago by bwerth

  • Owner changed from bwerth to gkronber
  • Status changed from assigned to reviewing

r14742 fixed displaying of randomly generated seed and some minor code simplifications

comment:17 Changed 6 weeks ago by gkronber

The 'performance-improved' distance methods which also accept a threshold seem to be implemented incorrectly: sum is initialized to 0, so the loop condition sum > threshold is never true and the loop body never executes. However, they are not used by tSNE anyway, so I'm removing them.

 sum = 0;
 ...
 while(sum > threshold ...) { // never entered: sum is still 0 here
   sum += ...
 }
 return sum;
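
For reference, a minimal sketch of what the early-abort variant presumably intended (in Python for illustration; the function name and signature are hypothetical): the partial sum should be compared against the threshold inside the loop, aborting once it already exceeds it.

```python
def squared_euclidean(a, b, threshold=float("inf")):
    """Squared Euclidean distance with early abort.

    The buggy version used sum > threshold as the loop *entry*
    condition while sum was still 0, so the body never ran. The
    intended behavior accumulates terms and stops once the partial
    sum exceeds the threshold (squared terms are non-negative, so
    the sum can only grow from there).
    """
    s = 0.0
    for x, y in zip(a, b):
        s += (x - y) ** 2
        if s > threshold:
            break
    return s
```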

comment:18 Changed 6 weeks ago by gkronber

VPTree contains a TODO item TODO check if minheap or maxheap should be used here
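
For what it's worth, a max-heap is the usual choice for the bounded candidate set in k-nearest-neighbor search, because the current *farthest* retained neighbor sits at the root and can be evicted in O(log k) when a closer point is found. A Python sketch of the idea (hypothetical names, not the actual VPTree code):

```python
import heapq

def k_nearest(distances, k):
    # Bounded max-heap over a list of candidate distances: Python's
    # heapq is a min-heap, so distances are negated to keep the
    # farthest retained neighbor at the root, where it is cheap to
    # evict whenever a closer candidate appears.
    heap = []  # entries are (-distance, index)
    for i, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))
    return sorted(-nd for nd, _ in heap)
```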

comment:19 Changed 6 weeks ago by gkronber

r14767: made some changes while reviewing the code

comment:20 Changed 5 weeks ago by gkronber

ABE: fixed revision numbers

Last edited 5 weeks ago by abeham (previous) (diff)

comment:21 follow-up: Changed 5 weeks ago by abeham

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Also, as we (bwerth + abeham) discussed: the initial parameters are probably not that well suited for large numbers of cases. A benchmark set of several different datasets should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to default to 0, because the approximation seems to be suitable only for large datasets. Maybe we can also auto-set this parameter once the problem dimension is known.

Finally, I would recommend providing TSNE both as an easy-to-use API call and as a BasicAlgorithm.
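
The distance-matrix case could be covered by a "distance function" that closes over a precomputed matrix and takes item indices instead of feature vectors. A minimal Python sketch of the idea (names are hypothetical; HeuristicLab's actual DistanceFunction interface may look different):

```python
def matrix_distance(dist_matrix):
    # Hypothetical distance function over item *indices* rather than
    # feature vectors: the distance is just a lookup in a precomputed
    # (symmetric) matrix, so no dataset of features is needed at all.
    def distance(i, j):
        return dist_matrix[i][j]
    return distance
```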

comment:22 in reply to: ↑ 21 Changed 5 weeks ago by gkronber

Replying to abeham:

What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?

Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset as for all other data-analysis algs).

Also, as we (bwerth + abeham) discussed: the initial parameters are probably not that well suited for large numbers of cases. A benchmark set of several different datasets should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to default to 0, because the approximation seems to be suitable only for large datasets. Maybe we can also auto-set this parameter once the problem dimension is known.

I also already noticed that theta=0 produces very different results. We should leave this option for the case of N>5000 but use theta=0 as default. Additionally, we should look at the differences between theta=0 and theta=eps, maybe there is another issue hidden there.
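
For context on why theta=0 and theta=eps can behave differently: in Barnes-Hut tSNE a whole cell of the space-partitioning tree is replaced by its center of mass whenever it looks small enough from the current point, and theta=0 never triggers that, giving the exact gradient. A sketch of the acceptance criterion (an assumption based on the standard bh_tsne formulation, not code from this branch):

```python
def summarize_cell(cell_size, dist_to_center, theta):
    # Barnes-Hut criterion: approximate all points in a tree cell by
    # its center of mass when cell_size / distance < theta.
    # theta = 0 makes this always False (exact tSNE); any theta > 0,
    # even eps, enables the approximation for sufficiently far cells.
    return dist_to_center > 0 and cell_size / dist_to_center < theta
```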

Finally, I would recommend providing TSNE both as an easy-to-use API call and as a BasicAlgorithm.

Full ack. I plan to refactor the code to provide a simple static method.

comment:23 Changed 5 weeks ago by gkronber

r14784: fixed build failure

comment:24 Changed 5 weeks ago by abeham

For comparison purposes, there's also a JavaScript implementation of TSNE (and probably some more).

comment:25 Changed 5 weeks ago by gkronber

TODO:

  • Add a parameter to update results only every X iterations

comment:26 Changed 5 weeks ago by gkronber

r14785: changes made while reviewing

comment:27 Changed 5 weeks ago by gkronber

r14787: more changes while reviewing

comment:28 Changed 5 weeks ago by gkronber

r14788: refactoring

comment:29 Changed 4 weeks ago by gkronber

r14806: worked on tSNE, storable and cloning for tSNE state. Added some TODO comments while reviewing.

comment:30 Changed 4 weeks ago by gkronber

r14807: support clone and persistence and pause/resume

comment:31 Changed 3 weeks ago by gkronber

Review comments:

  • In fast_tsne.m an initial dimensionality reduction using PCA is performed before running bh_tsne on the features; this is not absolutely necessary.
  • Normalization of the data should be moved to TSNEStatic (zero mean and dividing by the maximum); this is not possible, as the tSNE implementation is not specific to real vectors. Therefore, scaling was left in the main algorithm.

Last edited 2 weeks ago by gkronber (previous) (diff)

comment:32 Changed 3 weeks ago by gkronber

r14836: merged changesets from trunk to branch

comment:33 Changed 3 weeks ago by gkronber

r14837: made some changes while reviewing (comparison with the bh_tsne implementation by van der Maaten)

comment:34 Changed 2 weeks ago by gkronber

r14855: made some changes / bug-fixes while reviewing

comment:35 Changed 2 weeks ago by gkronber

Script for comparison with bh_tsne.exe

  public override void Main() {
    double theta = 0;
    int no_dims = 2;
    int iters = 1000;
    double perplexity = 100;

    var ds = (Dataset)vars["tower"];
    var inputVars = (ReadOnlyCheckedItemList<StringValue>)vars["input_vars"];
    using (var writer = new BinaryWriter(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\data.dat", FileMode.Create))) {
      writer.Write((int)ds.Rows);
      writer.Write((int)inputVars.CheckedItems.Count());
      writer.Write((double)theta);
      writer.Write((double)perplexity);
      writer.Write((int)no_dims);
      writer.Write((int)iters);
      for (int row = 0; row < ds.Rows; row++) {
        foreach (var inputVar in inputVars.CheckedItems) {
          writer.Write(ds.GetDoubleValue(inputVar.Value.Value, row));
        }
      }
    }

    // result.dat is produced by running bh_tsne.exe on data.dat
    using (var reader = new BinaryReader(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\result.dat", FileMode.Open))) {
      int h = reader.ReadInt32();
      int d = reader.ReadInt32();
      var scatterPlot = new ScatterPlot();
      vars["mapped"] = scatterPlot;

      Console.WriteLine("d={0} h={1}", d, h);
      var scatterRow = new ScatterPlotDataRow();
      for (int r = 0; r < h; r++) {
        var p = new Point2D<double>(reader.ReadDouble(), reader.ReadDouble());
        scatterRow.Points.Add(p);
      }
      scatterPlot.Rows.Add(scatterRow);
    }
  }

comment:36 Changed 2 weeks ago by gkronber

r14856: fixed bug in approximate gradient calculation

comment:37 Changed 2 weeks ago by gkronber

r14858: fixed another bug in the bh-tsne implementation

comment:38 Changed 2 weeks ago by gkronber

tSNE and approximate tSNE produce the same results now and results are comparable to the C++ implementation.

The C# version is much slower though.

comment:39 Changed 2 weeks ago by gkronber

r14859: update only every x iterations and fixed output paths

comment:40 Changed 2 weeks ago by gkronber

r14862: copied tSNE from branch to trunk

comment:41 Changed 2 weeks ago by gkronber

  • Owner changed from gkronber to abeham

comment:42 Changed 2 weeks ago by gkronber

r14863: fixed bug in cloning ctor

comment:43 Changed 2 weeks ago by gkronber

r14867: deleted branch as it has been integrated into the trunk

Note: See TracTickets for help on using tickets.