Opened 8 years ago
Closed 7 years ago
#2700 closed feature request (done)
t-Distributed Stochastic Neighbor Embedding
Reported by: | bwerth | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Algorithms.DataAnalysis | Version: | 3.3.14 |
Keywords: | | Cc: | |
Description
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets and should be available in HeuristicLab.
Change History (55)
comment:1 Changed 8 years ago by bwerth
- Owner set to bwerth
- Status changed from new to assigned
comment:2 Changed 8 years ago by bwerth
- Type changed from defect to feature request
comment:3 Changed 8 years ago by bwerth
- Status changed from assigned to accepted
- Version changed from 3.3.14 to branch
comment:4 Changed 8 years ago by bwerth
comment:5 Changed 8 years ago by bwerth
r14413 refactored C++-style code to C# (use of [,] arrays, int vs. uint, ...) + corrected IterationsCounter
comment:6 Changed 8 years ago by bwerth
r14414 forgot to add files
comment:7 Changed 8 years ago by bwerth
- Owner changed from bwerth to gkronber
- Status changed from accepted to reviewing
comment:8 Changed 8 years ago by gkronber
Comments from an initial review:
- It should be possible to stop the algorithm at any time
- A quality line chart for the error would probably be interesting
- It would be nice to be able to view the projection after each iteration
- The descriptions for parameters should contain information on default settings or useful settings for the parameters.
- Is it necessary to tune all the parameters for learning? Or would it also be ok to just use some robust default settings and hide most of the parameters (except for perplexity)?
- I think it is not strictly necessary that TSNE derives from Item (since it is probably never used directly in the GUI)
Error message: "Perplexity should be lower than K" what's K?
Let's discuss this in person...
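For context on the "Perplexity should be lower than K" message: in t-SNE, perplexity is defined as 2 raised to the Shannon entropy of the conditional neighbor distribution, and a distribution over k outcomes can have a perplexity of at most k. A minimal illustrative sketch (plain Python, not HeuristicLab code):

```python
import math

def perplexity(p):
    """Perplexity of a discrete distribution: 2 ** Shannon entropy (in bits)."""
    h = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return 2 ** h

# A uniform distribution over 4 outcomes has perplexity exactly 4, so the
# perplexity of a neighbor distribution can never exceed the number of
# neighbors it is defined over.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

This is why the perplexity parameter must stay below the number of considered neighbors, whatever K denotes in the error message.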
comment:9 Changed 8 years ago by gkronber
- Owner changed from gkronber to bwerth
- Status changed from reviewing to assigned
comment:10 Changed 8 years ago by gkronber
r14503: minor change while reviewing
comment:11 Changed 8 years ago by bwerth
r14512 worked in several comments from mkommend, extended analysis during algorithm run, added more Distances, made algorithm stoppable
comment:12 Changed 8 years ago by gkronber
More observations:
- TSNE should be a BasicAlgorithm
- Exception when switching between views (projected data & quality line chart) while the algorithm is running
- r14512 added references to files for a kernel PCA in the project file (please remove).
- Why does the error change abruptly when the 'stop-lying-iteration' is reached? (--> OK)
- Hide parameters: Momentum, Eta, MomentumSwitch, StopLying. Set StopLying to zero per default.
comment:13 Changed 8 years ago by bwerth
r14518 TSNEAnalysis is now a BasicAlg, hid Parameters, added optional data normalization to make TSNE scaling-invariant
comment:14 Changed 8 years ago by bwerth
r14558 made TSNE compatible with the new pausible BasicAlgs, removed rescaling of scatterplots during alg to give it a more movie-esque feel
comment:15 Changed 8 years ago by abeham
r14682: fixed references for alglib and libsvm
comment:16 Changed 8 years ago by bwerth
- Owner changed from bwerth to gkronber
- Status changed from assigned to reviewing
r14742 fixed displaying of randomly generated seed and some minor code simplifications
comment:17 Changed 8 years ago by gkronber
The 'performance-improved' distance methods which also accept a threshold seem to be implemented incorrectly: sum is initialized to 0, so for a non-negative threshold the loop condition sum > threshold is false from the start and the loop body never executes. However, they are not used by tSNE anyway, so I'm removing them.
sum = 0; ... while(sum > threshold ...) { sum += ... } return sum;
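A correct early-exit distance would accumulate first and stop once the threshold is exceeded. A minimal sketch of that idea (illustrative Python, not the removed HeuristicLab code):

```python
def squared_euclidean(a, b, threshold=float("inf")):
    """Squared Euclidean distance with early exit once threshold is exceeded."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > threshold:
            break  # caller only needs to know the distance exceeds threshold
    return total

print(squared_euclidean([0, 0], [3, 4]))  # 25.0
```

With a finite threshold the returned value is only a lower bound once the loop breaks early, which is exactly what threshold-based neighbor pruning needs.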
comment:18 Changed 8 years ago by gkronber
VPTree contains a TODO item: "check if minheap or maxheap should be used here"
comment:19 Changed 8 years ago by gkronber
r14767: made some changes while reviewing the code
comment:20 Changed 8 years ago by gkronber
comment:21 follow-up: ↓ 22 Changed 8 years ago by abeham
What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?
Also, as we (bwerth + abeham) discussed: the initial parameters are probably not that well suited for a large number of cases. A benchmark set of several different datasets should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.
Finally, I would recommend having TSNE both in the form of an easy-to-use API call and as a BasicAlgorithm.
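The precomputed-distance variant asked about above could be as simple as a lookup keyed by item index. An illustrative sketch (the class name and interface are hypothetical, not HeuristicLab API):

```python
class MatrixDistance:
    """Distance 'function' backed by a precomputed symmetric matrix.

    Items are represented by their row/column index into the matrix,
    so no feature vectors are needed at all.
    """
    def __init__(self, matrix):
        self.matrix = matrix

    def get(self, i, j):
        return self.matrix[i][j]

d = MatrixDistance([[0.0, 1.5], [1.5, 0.0]])
print(d.get(0, 1))  # 1.5
```

The t-SNE core only ever consumes pairwise distances, so such a wrapper would let the algorithm run on distance matrices without a dataset.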
comment:22 in reply to: ↑ 21 Changed 8 years ago by gkronber
Replying to abeham:
What would be the call if I don't have any features, but only a matrix of distances? Is there some kind of DistanceFunction that is just a lookup in a matrix?
Good point. I propose that we provide a different version of the algorithm for this case (because we don't have a dataset as for all other data-analysis algs).
Also, as we (bwerth + abeham) discussed: the initial parameters are probably not that well suited for a large number of cases. A benchmark set of several different datasets should be generated and different parameters applied to identify one set that works best on average. Above all, I would strongly recommend theta to default to 0, because this seems to be only suitable for large datasets. Maybe we can also auto-set this parameter after the problem dimension is known.
I also already noticed that theta=0 produces very different results. We should leave this option for the case of N>5000 but use theta=0 as default. Additionally, we should look at the differences between theta=0 and theta=eps, maybe there is another issue hidden there.
Finally, I would recommend having TSNE both in the form of an easy-to-use API call and as a BasicAlgorithm.
Full ack. I plan to refactor the code to provide a simple static method.
comment:23 Changed 8 years ago by gkronber
r14784: fixed build fail
comment:24 Changed 8 years ago by abeham
For comparison purposes, there's also a JavaScript implementation of TSNE (and probably some more).
comment:25 Changed 8 years ago by gkronber
TODO:
Add parameter to update results only every X iterations (done)
comment:26 Changed 8 years ago by gkronber
r14785: changes made while reviewing
comment:27 Changed 8 years ago by gkronber
r14787: more changes while reviewing
comment:28 Changed 8 years ago by gkronber
r14788: refactoring
comment:29 Changed 8 years ago by gkronber
r14806: worked on tSNE, storable and cloning for tSNE state. Added some TODO comments while reviewing.
comment:30 Changed 8 years ago by gkronber
r14807: support clone and persistence and pause/resume
comment:31 Changed 8 years ago by gkronber
Review comments:
- in fast_tsne.m an initial dimensionality reduction using PCA is performed before running bh_tsne
  - features is not absolutely necessary
- normalization of data should be moved to TSNEStatic (ZeroMean and dividing by max)
  - this is not possible as the tSNE implementation is not specific to real vectors. Therefore, scaling was left in the main algorithm
comment:32 Changed 8 years ago by gkronber
r14836: merged changesets from trunk to branch
comment:33 Changed 8 years ago by gkronber
r14837: made some changes while reviewing (comparison with the bh_tsne implementation by van der Maaten)
comment:34 Changed 8 years ago by gkronber
r14855: made some changes / bug-fixes while reviewing
comment:35 Changed 8 years ago by gkronber
Script for comparison with bh_tsne.exe
public override void Main() {
  double theta = 0;
  int no_dims = 2;
  int iters = 1000;
  double perplexity = 100;
  var ds = (Dataset)vars["tower"];
  var inputVars = (ReadOnlyCheckedItemList<StringValue>)vars["input_vars"];
  using(var writer = new BinaryWriter(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\data.dat", FileMode.Create))) {
    writer.Write((int)ds.Rows);
    writer.Write((int)inputVars.CheckedItems.Count());
    writer.Write((double)theta);
    writer.Write((double)perplexity);
    writer.Write((int)no_dims);
    writer.Write((int)iters);
    for(int row = 0; row < ds.Rows; row++) {
      foreach(var @var in inputVars.CheckedItems) {
        writer.Write(ds.GetDoubleValue(@var.Value.Value, row));
      }
    }
  }
  using(var reader = new BinaryReader(new FileStream(@"C:\Users\P24581\Downloads\bhtsne-master\bhtsne-master\windows\result.dat", FileMode.Open))) {
    int h = reader.ReadInt32();
    int d = reader.ReadInt32();
    var scatterPlot = new ScatterPlot();
    vars["mapped"] = scatterPlot;
    Console.WriteLine("d={0} h={1}", d, h);
    var scatterRow = new ScatterPlotDataRow();
    for(int r = 0; r < h; r++) {
      var p = new Point2D<double>(reader.ReadDouble(), reader.ReadDouble());
      scatterRow.Points.Add(p);
    }
    scatterPlot.Rows.Add(scatterRow);
  }
}
comment:36 Changed 8 years ago by gkronber
r14856: fixed bug in approximate gradient calculation
comment:37 Changed 8 years ago by gkronber
r14858: fixed another bug in the bh-tsne implementation
comment:38 Changed 8 years ago by gkronber
tSNE and approximate tSNE produce the same results now and results are comparable to the C++ implementation.
The C# version is much slower though.
comment:39 Changed 8 years ago by gkronber
r14859: update only every x iterations and fixed output paths
comment:40 Changed 8 years ago by gkronber
r14862: copied tSNE from branch to trunk
comment:41 Changed 8 years ago by gkronber
- Owner changed from gkronber to abeham
comment:42 Changed 8 years ago by gkronber
r14863: fixed bug in cloning ctor
comment:43 Changed 8 years ago by gkronber
r14867: deleted branch as it has been integrated into the trunk
comment:44 Changed 8 years ago by gkronber
- Version changed from branch to 3.3.14
comment:45 Changed 7 years ago by abeham
Code Review Comments
Distance related calculation
This should all be moved to HeuristicLab.Common.
The InnerProductDistance needs to be looked at. It does not satisfy point 3 of the conditions mentioned in IDistance. It is not zero-reflexive, because d(x,x) can only be 0 if x is equal to the null vector. Inner product is not a measure for distance, because it gets larger the more similar two vectors are (if normalized with the product of the vectors' lengths). Also, the description mentions angular distance and speaks of normalization (there is no normalization). It is not clear to me why it is only defined for vectors with all non-negative dimensions. A distance measure based on inner product would probably look something like 1 - cosine(alpha) where cosine(alpha) = <x,y> / (||x|| * ||y||) (<x,y> = inner product) to me. This results in a distance measure in the interval [0; 2] related to the angle of two vectors (0 for vectors of identical angle, 1 for vectors perpendicular to each other and 2 for vectors in the exact opposite direction).
Implementation of the distance measures with .Zip() and .Sum() is nice to read, but inefficient. I would recommend implementing them efficiently and sacrificing a little readability, especially if they are put in Common.
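The 1 - cos(alpha) measure described above, computed in a single pass without intermediate sequences, might look like this (an illustrative Python sketch of the math, not the C# code):

```python
import math

def cosine_distance(x, y):
    """1 - cos(angle between x and y); values lie in [0, 2]."""
    dot = norm_x = norm_y = 0.0
    for a, b in zip(x, y):  # single pass, no .Zip()/.Sum()-style intermediates
        dot += a * b
        norm_x += a * a
        norm_y += b * b
    return 1.0 - dot / math.sqrt(norm_x * norm_y)

print(cosine_distance([1, 0], [0, 1]))   # 1.0 (perpendicular)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite direction)
```

Unlike the raw inner product, this is zero-reflexive up to scaling: any vector has distance 0 to itself and to positive multiples of itself.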
Collection and Trees
It is mentioned in the TODO of PriorityQueue that this should move to HeuristicLab.Collections. However, it's not a storable class and it's not observable. I would move it to HeuristicLab.Common. VantagePointTree and SpacePartitioningTree should also be moved to Common (maybe all together in a folder Collections).
Static TSNE
It uses different default values than the TSNEAlgorithm. I personally think that an eta of 200 is too much; I would go with the same values that are the defaults in the tSNE for data analysis problems. van der Maaten mentions an adaptive learning rate at some points.
TSNEAlgorithm
Also concerns static TSNE: if our implementation is based on some variant of t-SNE (of which there are several), we should state that in the ItemName and/or the ItemDescription. Additionally, a reference to the publication or other source should be given in the description.
I have yet to test the t-sne with data.
comment:46 Changed 7 years ago by bwerth
- Replaced InnerProductDistance with CosineDistance;
- Removed LINQ from Distances;
- Removed unused Interfaces;
- Removed unused methods from SpacePartitioningTree and made it internal
- some code formatting
- changed default values in TSNEStatic
- added links to barnes hut paper + code in ItemDescription of TSNEAlgorithm
- added some comments to the VantagePointTree
comment:47 Changed 7 years ago by abeham
Further Review Comments
ClassNames parameter should be an OptionalConstrainedValueParameter<StringValue> and should be filled with the input variable names when a problem is loaded.
DistanceFunction should be a ConstrainedValueParameter<IDistance<double[]>> and should be filled with all instances of type IDistance<double[]> through discovery using ApplicationManager.GetInstances().
Both should be exposed as IConstrainedValueParameter<T> in the parameter properties.
In implementing this you must then override OnProblemChanged and add event handlers to ProblemDataChanged. In case the problem or problem data changes, clear the constrained value parameter for class names and fill it with the ProblemData's InputVariables (use all of them, not just the allowed ones). If event handlers are added, you must add an after-deserialization hook and also register the event handlers there. Event handlers must also be registered in the default constructor and after the cloning constructor.
Ideally Theta parameter would be a PercentValue instead of a DoubleValue. Internally this is completely the same, but it's presented to the user as a percentage and it makes it more obvious that this is something that is in the range [0;1].
TSNE throws an exception if a variable is included that has the same value for each row. The ArithmeticException that is thrown (in Math.Sign) says: "Function does not accept floating point Not-a-Number values." Maybe this can be caught and rethrown with a better description.
comment:48 Changed 7 years ago by abeham
- Owner changed from abeham to bwerth
- Status changed from reviewing to assigned
This concludes the review. I have tested it with a few datasets. The results seem to be good.
comment:49 Changed 7 years ago by bwerth
- made ClassesParameter an OptionalConstrainedValueParameter<StringValue>
- made DistanceParameter a ConstrainedValueParameter<IDistance<double[]>>
- overrode OnProblemChanged, RegisterProblemEvents and DeregisterProblemEvents to correctly update the ClassesParameter
- Changed Theta to PercentValue
- Changed Normalization to no longer perform division by zero (columns with 0 variance stay 0 variance). This should get rid of the NaNs without throwing an exception.
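The zero-variance fix in the last point could look like this: zero-mean each column, divide by its maximum absolute value, and leave constant columns at zero instead of dividing by zero. A hedged sketch of the idea (illustrative Python, not the actual HeuristicLab code):

```python
def normalize_columns(data):
    """Zero-mean each column, then divide by its max absolute value.

    Columns with zero variance are left all-zero instead of producing
    NaN through a division by zero.
    """
    cols = list(zip(*data))
    result = []
    for col in cols:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        scale = max(abs(v) for v in centered)
        if scale == 0.0:  # constant column: keep it all-zero
            result.append([0.0] * len(col))
        else:
            result.append([v / scale for v in centered])
    return [list(row) for row in zip(*result)]

# Second column is constant and would previously have produced NaN.
print(normalize_columns([[1.0, 5.0], [3.0, 5.0]]))  # [[-1.0, 0.0], [1.0, 0.0]]
```

This also makes t-SNE invariant to per-column scaling, matching the "ZeroMean and dividing by max" normalization discussed in comment 31.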
comment:50 Changed 7 years ago by bwerth
- Owner changed from bwerth to abeham
- Status changed from assigned to reviewing
comment:51 Changed 7 years ago by gkronber
- Owner changed from abeham to gkronber
comment:52 Changed 7 years ago by abeham
r15234: fixed some typos
comment:53 Changed 7 years ago by gkronber
- Status changed from reviewing to readytorelease
Reviewed and tested again. Everything looks fine.
comment:54 Changed 7 years ago by gkronber
comment:55 Changed 7 years ago by gkronber
- Resolution set to done
- Status changed from readytorelease to closed
r14387 created initial branch