Opened 5 years ago
Last modified 3 years ago
#3040 accepted enhancement
Vector-based GP
| Reported by: | pfleck | Owned by: | pfleck |
|---|---|---|---|
| Priority: | medium | Milestone: | |
| Component: | Problems.DataAnalysis.Symbolic | Version: | branch |
| Keywords: | | Cc: | |
Description
This ticket will track the overall development of implementing Vector-based GP for Time-Series Regression and Classification.
The main idea is to support vectors as a new "datatype" in symbolic expression trees alongside regular numerical values. Additionally, new operators will be developed to work with those vectors and to combine them with the existing numerical values.
Because developing the required features will likely involve implementing several components simultaneously, along with some changes within the core DataAnalysis plugins, there will be a main branch in which development will take place, with some trunk-reintegration branches to get completed features back into the trunk.
Change History (93)
comment:1 Changed 5 years ago by pfleck
- Status changed from new to accepted
comment:2 Changed 5 years ago by pfleck
- Added double vectors for Dataset. Extended the type-checks for DataAnalysisProblemData.
- Added a small benchmark instance with data containing vectors. Adapted the ArtificialRegressionDataDescriptor to be able to specify non-double values.
Additional thoughts:
- Consider ModifiableDataset and DataPreprocessing.
- Consider adding generic vector capabilities to IDataset that only allows double, string, DateTime.
- Consider changing IList within the Dataset to a covariant alternative (non-generic IReadOnlyList does not exist, however). Currently the type must be exactly IReadOnlyList<double>, otherwise the invariant IList<T> is not a subtype of IList<IList<double>> for instance.
- Each DataAnalysis algorithm should check on its own whether the types of the allowed input variables are compatible. For instance, the LR would only allow double-values, whereas SymReg also supports string-variables (as factor variables) and double-vector-variables.
comment:3 Changed 5 years ago by pfleck
r17365 Added explicit vector types to avoid type mismatches when representing vectors as IList<T>, List<T> or IReadOnlyList<T>.
Additional thoughts:
- The IDataset interface (and its implementation) now contains a lot of methods due to all the different available types (double, string, DateTime and also vector-versions). In the future, this should be unified.
- Whether the types of the input variables are allowed should be decided by the algorithms, rather than the ProblemData.
comment:4 Changed 5 years ago by pfleck
r17369 Added Vector symbols to TypeCoherentExpressionGrammar & fixes.
comment:5 Changed 5 years ago by pfleck
r17400 Added Azzali benchmarks
comment:6 Changed 5 years ago by pfleck
r17401 Added parser for new benchmark data but did not commit the data yet (too large)
comment:7 Changed 5 years ago by pfleck
r17403 Added fix for non-numeric class labels
comment:8 Changed 5 years ago by pfleck
r17414 Started adding UCI time series regression benchmarks. Adapted parser (extracted format options & added parsing for double vectors).
comment:9 Changed 5 years ago by pfleck
r17415 Added additional UCI instances for time series regression
comment:10 Changed 5 years ago by pfleck
r17416 enabled variable impacts for vectorial data (if vectors have the same length)
comment:11 Changed 5 years ago by pfleck
- (partially) enabled data preprocessing for vectorial data
- use flat zip-files for large benchmarks instead of embedded resources (faster build times)
- added multiple variants of vector benchmark I (vector length constraints)
comment:12 Changed 5 years ago by pfleck
r17419 added missing source file
comment:13 Changed 5 years ago by pfleck
r17447 Added TransportPlugin for MathNet.Numerics.
comment:14 Changed 5 years ago by pfleck
r17448 Replaced own Vector with MathNet.Numerics Vector.
- Used types are not yet storable.
- I do not like the using DoubleVector = MathNet.Numerics.LinearAlgebra.Vector<double>; directive. Maybe I'll switch to using MathNet.Numerics.LinearAlgebra.Single; and only use Vector as type.
comment:15 Changed 5 years ago by pfleck
r17449 Added Transformers for Vectors. Added specialized Transformers for double Dense/SparseVectorStorage and a generic mapper for the remaining (serializable) types.
comment:16 Changed 5 years ago by pfleck
r17452 Improved Persistence for Vectors (removed the generic transformer and used the existing array transformer instead).
comment:17 Changed 5 years ago by pfleck
r17455 Added a separate interpreter for vectors that reuses the existing symbols instead of creating explicit vector symbols.
comment:18 Changed 5 years ago by pfleck
r17456 Merged trunk to branch
comment:19 Changed 5 years ago by pfleck
- Added full functional grammar for vectors.
- Added sum and mean aggregation for vectors.
comment:20 Changed 5 years ago by pfleck
r17463 Added type coherent vector grammar to enforce that the root symbol is a scalar.
comment:21 Changed 5 years ago by pfleck
r17465 Simplified default vector grammar.
comment:22 Changed 5 years ago by pfleck
r17466 Added separate mean symbol instead of reusing the average symbol.
comment:23 Changed 5 years ago by pfleck
r17467 Added a "final aggregation" option for the vector interpreter in case the result is a vector.
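To illustrate the idea of reusing the same symbols for scalars and vectors, with a final aggregation when the root still yields a vector, here is a minimal sketch. It is not the branch's interpreter: the tuple-based tree encoding, the function names (`lift`, `aggregate`, `evaluate`, `evaluate_scalar`) and the choice to let aggregations pass scalars through unchanged are all assumptions made for this example.

```python
import math

def lift(op, a, b):
    """Apply a binary op to two scalars, two vectors, or a scalar/vector mix."""
    a_is_vec, b_is_vec = isinstance(a, list), isinstance(b, list)
    if a_is_vec and b_is_vec:
        if len(a) != len(b):
            raise ValueError("vector lengths differ")
        return [op(x, y) for x, y in zip(a, b)]
    if a_is_vec:
        return [op(x, b) for x in a]   # scalar b is broadcast
    if b_is_vec:
        return [op(a, y) for y in b]   # scalar a is broadcast
    return op(a, b)

def aggregate(value, how):
    """Reduce a vector to a scalar; scalars pass through (an assumption)."""
    if not isinstance(value, list):
        return value
    if how == "sum":
        return math.fsum(value)
    if how == "mean":
        return math.fsum(value) / len(value) if value else math.nan
    raise ValueError(f"unknown aggregation: {how}")

def evaluate(node, row):
    """Evaluate a hypothetical tuple-encoded tree against one data row."""
    kind = node[0]
    if kind == "var":
        return row[node[1]]            # float or list of floats
    if kind == "const":
        return float(node[1])
    args = [evaluate(child, row) for child in node[1:]]
    if kind == "add":
        return lift(lambda x, y: x + y, args[0], args[1])
    if kind == "mul":
        return lift(lambda x, y: x * y, args[0], args[1])
    if kind in ("sum", "mean"):
        return aggregate(args[0], kind)
    raise ValueError(f"unknown symbol: {kind}")

def evaluate_scalar(node, row, final_aggregation="mean"):
    """Force a scalar result by aggregating a vector-valued root."""
    return aggregate(evaluate(node, row), final_aggregation)

row = {"x": [1.0, 2.0, 3.0], "a": 2.0}
tree = ("mul", ("var", "a"), ("var", "x"))   # vector-valued root
result = evaluate_scalar(tree, row, final_aggregation="mean")
```

The same `add`/`mul` symbols work for both datatypes, so no separate vector symbols are needed; only the aggregation symbols are vector-specific.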
comment:24 Changed 5 years ago by pfleck
r17469 Added TensorFlow.NET library for constant optimization with vectors (as alternative to AutoDiff+Alglib).
The build process for TensorFlow.NET is somewhat tedious for multiple reasons:
- First, the NumSharp dependency for TensorFlow.NET is not strongly named, thus cannot be loaded with HL.
- The native tensorflow.dll does not ship correctly with the Framework edition (on dotnet core it works).
- A newer version of Google.Protobuf is required.
Due to the reasons above, the following steps were taken to import TensorFlow.NET:
- All dependencies for Google.Protobuf are upgraded to 3.11.4. This includes HEAL.Attic, which is manually built and then replaces the binaries in the bin. A manual build of HEAL.Attic is currently required anyway, because the BoxTransformer is still internal in the latest Nuget release but already fixed in the Master branch.
- Since the OR-Tools (for exact optimization) includes already built assemblies referencing the old Google.Protobuf version, they are currently excluded. Also I removed the HeuristicLab.ProtobufCS-2.4.1 version to avoid any further conflicts. Therefore, external evaluation and some other plugins do not work on this branch.
- Although TensorFlow.NET is strongly named, its dependency NumSharp is not. Simply signing NumSharp did not work, because the reference from TensorFlow.NET expects an unsigned NumSharp assembly. As a solution, there is a standalone project within the Extlibs (TensorFlowNet) that references the Nuget package for TensorFlow.NET and uses ILMerge (also via Nuget package) to create a single assembly that contains both TensorFlow.NET.dll and NumSharp.dll and is signed with the HL key. The resulting TensorFlow.NET.signed.dll is (file-) referenced within the transport plugin HeuristicLab.TensorFlowNet.
- The native tensorflow.dll is located within a separate nuget redist package. However, this does not work for dotnet framework for some reason. I created a dotnet core project with the redist package referenced, and copied the native x64 dll from there into the HeuristicLab.TensorFlowNet transport plugin as native dll plugin dependency.
As a final note: The whole build process is unstable. Sometimes the resulting TensorFlow.NET.signed.dll contains some unloadable types. Clearing the bin folder, praying to ILMerge and the build gods usually helps.
comment:25 Changed 5 years ago by pfleck
r17472 Moved Alglib+AutoDiff constant optimizer in own class and created base class to provide multiple constant-opt implementations.
comment:26 Changed 5 years ago by pfleck
r17474 Started working on the TF constant opt evaluator.
comment:27 Changed 5 years ago by pfleck
r17475 Updated HeuristicLab.Algorithms.DataAnalysis plugin and its dependencies to Framework 4.7.2 to avoid conflicting System.ValueTuple locations (mscorlib or nuget).
comment:28 Changed 5 years ago by pfleck
r17476 Worked on TF-based constant optimization.
comment:29 Changed 5 years ago by pfleck
r17489 Added version with explicit array shapes for explicit broadcasting.
comment:30 Changed 5 years ago by pfleck
r17493 Write optimized constants back to tree.
comment:31 Changed 5 years ago by pfleck
- Switched whole TF-graph to float (Adam optimizer won't work with double).
- Added progress and cancellation support for TF-const opt.
- Added optional logging with console and/or file for later plotting.
comment:32 Changed 5 years ago by pfleck
r17541 Added a new benchmark instance.
comment:33 Changed 4 years ago by pfleck
r17554 Added some symbols for statistical aggregation.
comment:34 Changed 4 years ago by pfleck
r17556 Some corner cases for empty or length-one vectors now return NaN.
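The ticket does not say which symbols return NaN for which inputs. As a rough sketch of the kind of corner cases meant here, assume a mean that is undefined for an empty vector and a sample variance that is undefined for fewer than two elements. The function names are hypothetical and not taken from the branch.

```python
import math

def mean(v):
    """Arithmetic mean; NaN for an empty vector instead of raising."""
    return math.fsum(v) / len(v) if len(v) > 0 else math.nan

def sample_variance(v):
    """Sample variance (n - 1 denominator); NaN for fewer than two values."""
    n = len(v)
    if n < 2:
        return math.nan
    m = mean(v)
    return math.fsum((x - m) ** 2 for x in v) / (n - 1)
```

Returning NaN rather than throwing keeps tree evaluation going; a NaN prediction can then be handled by whatever fitness penalty the evaluator applies.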
comment:35 Changed 4 years ago by pfleck
r17573 Added first draft for WindowedSymbol.
ToDo:
- Make other aggregation symbols windowed
- Better encoding for Offset and Length parameter
- Continuous interpretation (e.g. weighted sum/mean)
- Mutation is currently not symmetric (due to cast/floor mechanic when calculating the actual indices)
- Create a test function specifically for benchmarking windowed symbols
- Evaluate alternative: explicit "SubVector" symbol?
- No continuous interpretation
- Potential issues with incompatible vector lengths
comment:36 Changed 4 years ago by pfleck
- Adapted existing benchmarks (no mean/sum of vectors with zero-mean).
- Added new benchmark for testing windowed aggregations.
comment:37 Changed 4 years ago by pfleck
r17593 Added a new simplifier that can also simplify vector-specific operators.
- Added simplification rules for sum-symbol and mean-symbol for addition and multiplication
comment:38 Changed 4 years ago by pfleck
r17596: added subtraction/division simplification for sum and mean symbols by converting them to sums/products.
comment:39 Changed 4 years ago by pfleck
r17597: Added simplification rules for length-aggregation.
comment:40 Changed 4 years ago by pfleck
- Changed stddev, variance, etc. to population variant
- Added multiplicative simplifications for stdev and variance symbols
comment:41 Changed 4 years ago by pfleck
- Added additive simplification rules for stdev and variance symbols.
- Extended simplifications of constants to simplification of all scalar-nodes for aggregation symbols.
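The exact rule set implemented in the simplifier is not listed in this ticket. The sketch below only checks numerically the standard identities that additive and multiplicative simplifications of sum, mean, (population) variance and standard deviation would rely on, for a vector `v` and a scalar `c`. The function name `check_identities` is made up for this example.

```python
import math
from statistics import fmean, pstdev, pvariance

def check_identities(v, c):
    """Return a dict mapping each identity to whether it holds numerically."""
    scaled = [c * x for x in v]    # c * v
    shifted = [x + c for x in v]   # v + c
    close = lambda a, b: math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    return {
        "sum(c*v) = c*sum(v)": close(math.fsum(scaled), c * math.fsum(v)),
        "sum(v+c) = sum(v) + c*length(v)":
            close(math.fsum(shifted), math.fsum(v) + c * len(v)),
        "mean(c*v) = c*mean(v)": close(fmean(scaled), c * fmean(v)),
        "mean(v+c) = mean(v) + c": close(fmean(shifted), fmean(v) + c),
        "var(c*v) = c^2*var(v)": close(pvariance(scaled), c * c * pvariance(v)),
        "stdev(c*v) = |c|*stdev(v)": close(pstdev(scaled), abs(c) * pstdev(v)),
        "var(v+c) = var(v)": close(pvariance(shifted), pvariance(v)),
        "stdev(v+c) = stdev(v)": close(pstdev(shifted), pstdev(v)),
    }

results = check_identities([1.0, 4.0, 2.0, 8.0], -3.0)
```

Note the absolute value in the stdev rule and the `length(v)` factor in the additive sum rule; both are easy to get wrong when pulling scalars out of an aggregation.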
comment:42 Changed 4 years ago by pfleck
r17604 Stores the datatype of a tree node (e.g. variable nodes) in the tree itself for the interpreter to derive the datatypes for subtrees. This way, the interpreter (and simplifier) do not need an actual dataset to figure out datatypes for subtrees.
comment:43 Changed 4 years ago by pfleck
r17605 Adapted unit test for trunk changes.
comment:44 Changed 4 years ago by pfleck
- Extended importer (vectorvariable, vec-aggregations, ...).
- Started adding unit test for vector simplifications.
comment:45 Changed 4 years ago by pfleck
r17622 Added vector variables to infix parser/formatter.
comment:46 Changed 4 years ago by pfleck
- Switched vector-simplification unit-test to infix notation to avoid ambiguities between the peek-string "VAR" for variables and the variance function.
- Added additional unit tests for mean, length, stdev and var simplifications.
comment:47 Changed 4 years ago by pfleck
r17626 Unified simplification rules for vector aggregation functions.
comment:48 Changed 4 years ago by pfleck
r17629 fixed bug when simplifying sum aggregation (node not cloned).
comment:49 Changed 4 years ago by pfleck
- Added aggregation symbols to latex formatter.
- Use boldsymbol for vector-variables.
comment:50 Changed 4 years ago by pfleck
r17633 Reenabled the old optimize button in the simplifier and added a new button for const opt with vectors.
comment:51 Changed 4 years ago by pfleck
r17721 First draft of different-vector-length strategies (cut, fill, resample, cycle, ...)
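As a rough illustration of what such strategies could look like when two operands have different lengths, consider the sketch below. The semantics are assumptions, not the branch's implementation: "cut" truncates to the shorter length, while "fill", "cycle" and "resample" extend to the longer length (padding with a constant, repeating the values, and linear interpolation respectively). All function names are hypothetical, and vectors are assumed to be non-empty.

```python
def cut(v, n):
    """Keep only the first n elements."""
    return v[:n]

def fill(v, n, value=0.0):
    """Pad with a constant up to length n."""
    return v[:n] + [value] * max(0, n - len(v))

def cycle(v, n):
    """Repeat the vector from its start until length n is reached."""
    return [v[i % len(v)] for i in range(n)]

def resample(v, n):
    """Linearly interpolate v onto n evenly spaced positions."""
    if n == 1 or len(v) == 1:
        return [v[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(v) - 1) / (n - 1)   # fractional index into v
        lo = int(pos)
        hi = min(lo + 1, len(v) - 1)
        frac = pos - lo
        out.append(v[lo] * (1 - frac) + v[hi] * frac)
    return out

def align(a, b, strategy):
    """Bring two vectors to a common length using the named strategy."""
    if strategy == "cut":
        n = min(len(a), len(b))
        return cut(a, n), cut(b, n)
    n = max(len(a), len(b))
    fn = {"fill": fill, "cycle": cycle, "resample": resample}[strategy]
    return fn(a, n), fn(b, n)
```

For example, `align([1.0, 2.0, 3.0], [10.0], "cycle")` leaves the first vector unchanged and repeats the single value of the second one three times.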
comment:52 Changed 4 years ago by pfleck
r17725 Adapted dependencies and versions for hive execution.
comment:53 Changed 4 years ago by pfleck
r17726 Added a constant opt evaluator for vectors that uses the existing AutoDiff library by unrolling all vector operations.
A non-unrolled version that defines Jacobians for each vector operation would be faster, but would require substantial extensions to AutoDiff so that it can handle higher-dimensional data.
comment:54 Changed 4 years ago by pfleck
r17741 Added new benchmark and some minor bugfixes.
comment:55 Changed 4 years ago by pfleck
r17752 Added subvector symbol to grammar
comment:56 Changed 4 years ago by pfleck
r17759 Added DiffSharp as alternative for AutoDiff and TensorFlowNet
comment:57 Changed 4 years ago by pfleck
r17785 Made vector separator symbol configurable in the CSV import dialog.
comment:58 Changed 4 years ago by pfleck
r17786 Worked in DiffSharp for constant-opt.
comment:59 Changed 4 years ago by pfleck
r17825 Merged trunk into branch.
comment:60 Changed 4 years ago by pfleck
r17830 First draft of additional vector aggregation symbols (distribution characteristics & time series dynamics)
comment:61 Changed 4 years ago by pfleck
r17915 Started adding some vector benchmarks for GPTP.
comment:62 Changed 4 years ago by pfleck
r17930 Reworked external dependencies and merged some libraries (ILMerge) to avoid version conflicts occurring on Hive.
comment:63 Changed 4 years ago by pfleck
r17935 Worked on library dependencies for hive.
comment:64 Changed 4 years ago by pfleck
- Added additional benchmark instances for vector GP.
- Removed old binding redirect.
comment:65 Changed 3 years ago by pfleck
r18058 Switched from offset-length sub-vector to "start-end with wrapping" subvector.
Maybe it would be better to offer multiple sub-vector symbols that have different mechanisms (wrapping, repetition, lower/higher index repair, ...).
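One possible reading of "start-end with wrapping" is sketched below. This is an assumption for illustration, not the symbol's actual code: indices are taken modulo the vector length, and when the start lies after the end the selection wraps around the end of the vector. Whether the end index itself belongs to the result is left as a flag (`include_end`, a hypothetical parameter), since r18230 later changes the SubVector symbol to include it.

```python
def subvector_wrapping(v, start, end, include_end=True):
    """Select v[start..end], wrapping around the vector end if start > end."""
    n = len(v)
    if n == 0:
        return []
    s, e = start % n, end % n          # repair out-of-range indices by wrapping
    stop = e + 1 if include_end else e
    if s <= e:
        indices = list(range(s, stop))
    else:
        # start lies after end: run to the last element, then continue from 0
        indices = list(range(s, n)) + list(range(0, stop))
    return [v[i] for i in indices]

v = [10.0, 11.0, 12.0, 13.0, 14.0]
forward = subvector_wrapping(v, 1, 3)   # elements at indices 1, 2, 3
wrapped = subvector_wrapping(v, 3, 1)   # elements at indices 3, 4, 0, 1
```

With this encoding every (start, end) pair is valid, so mutation never has to repair an inverted range; the price is that small index changes near start = end can flip between a very short and a nearly full selection.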
comment:66 Changed 3 years ago by pfleck
r18058 Added a subvector symbol with ranges as subtrees.
comment:67 Changed 3 years ago by pfleck
r18082: Added the ISymbolicDataAnalysisExpressionManipulator and necessary code analogously to the ISymbolicDataAnalysisExpressionCrossover.
comment:68 Changed 3 years ago by pfleck
r18083 Added first draft of the SubVectorImprovementManipulator.
comment:69 Changed 3 years ago by pfleck
r18092 Added first draft of simple SegmentOptimizationProblem.
comment:70 Changed 3 years ago by pfleck
r18094 Added missing StorableType and fixed cloning-ctor.
comment:71 Changed 3 years ago by pfleck
r18096 Added instance provider for segment opt problem with WIP instances.
comment:72 Changed 3 years ago by pfleck
r18097 Added SOP instances from csv file with vectors.
comment:73 Changed 3 years ago by pfleck
r18098 Added support for multi-row data for SOP instances where the segment aggregation results are averaged over multiple rows.
comment:74 Changed 3 years ago by pfleck
r18186 Added first version of SegmentOptimization Mutators.
comment:75 Changed 3 years ago by pfleck
r18193 Added additional parameters and fixed index generation for random search ranges.
comment:76 Changed 3 years ago by pfleck
r18201 Improved handling of "None" Enum flags.
comment:77 Changed 3 years ago by pfleck
r18202 Fixed best result analysis for non-plus selection.
comment:78 Changed 3 years ago by pfleck
r18204 Improved performance by avoiding allocating memory for vector segments & fixed some index bounds corner cases.
comment:79 Changed 3 years ago by pfleck
- Count batch evaluation as number of evaluations for SOP manipulator.
- Added functionality to remove duplicate data matrices (or similar duplicated datasets) to reduce the size of segment optimization experiment runs.
comment:80 Changed 3 years ago by pfleck
r18214 Added first version of sampling segment optimization manipulator for vectorial sym reg.
comment:81 Changed 3 years ago by pfleck
r18217 Added guided direction and range sampling using approximated gradients.
comment:82 Changed 3 years ago by pfleck
r18218 Added SOP benchmark with interacting vectors.
comment:83 Changed 3 years ago by pfleck
r18227 Added cached option version of sampling mutator.
comment:84 Changed 3 years ago by pfleck
r18228 Added no-noise versions of SOP benchmarks.
comment:85 Changed 3 years ago by pfleck
r18229 Added mutation for optimizing aggregation window that uses a nested optimizer.
comment:86 Changed 3 years ago by pfleck
r18230 Changed SubVector symbol to include end-index in result.
comment:87 Changed 3 years ago by pfleck
r18233 Added Guided Direction Mutation for nested Optimizers.
comment:88 Changed 3 years ago by pfleck
r18234 Fixed vector-unrolling AutoDiff conversion.
comment:89 Changed 3 years ago by pfleck
r18235 Added GuidedRangeManipulator for nested index optimization.
comment:90 Changed 3 years ago by pfleck
r18237 Added sub-vector, std dev and variance support for Tree to Tensor converter.
comment:91 Changed 3 years ago by pfleck
r18238 Print MSE progress for constant opt in simplifier.
comment:92 Changed 3 years ago by pfleck
r18239 Updated to newer TensorFlow.NET version.
- Removed IL Merge from TensorFlow.NET.
- Temporarily removed DiffSharp.
- Changed to a locally built Attic with a specific Protobuf version that is compatible with TensorFlow.NET. (Also adapted other versions of nuget dependencies.)
comment:93 Changed 3 years ago by pfleck
r18240 smaller fixes and some code cleanup
r17362 Branched trunk