Opened 7 months ago

Last modified 7 days ago

#3040 accepted enhancement

Vector-based GP

Reported by: pfleck Owned by: pfleck
Priority: medium Milestone:
Component: Problems.DataAnalysis.Symbolic Version: branch
Keywords: Cc:

Description

This ticket tracks the overall development of Vector-based GP for Time-Series Regression and Classification.

The main idea is to support vectors as a new "datatype" in symbolic expression trees alongside regular numerical values. Additionally, new operators will be developed to work with those vectors and to combine them with the existing numerical values.
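To make the idea concrete, here is a minimal Python sketch (not HeuristicLab code) of an expression tree whose values are either scalars or vectors. The tuple-based tree encoding, the node names ("const", "var", "add", "mul", "mean") and the helper functions are assumptions made purely for illustration.

```python
# Illustrative sketch only: node names and tree encoding are hypothetical,
# not the symbols or interpreter used in the branch.

def broadcast(op, a, b):
    """Apply a binary operator to scalars, vectors (lists), or a mix of both."""
    a_is_vec, b_is_vec = isinstance(a, list), isinstance(b, list)
    if a_is_vec and b_is_vec:
        if len(a) != len(b):
            raise ValueError("vector lengths differ")
        return [op(x, y) for x, y in zip(a, b)]
    if a_is_vec:
        return [op(x, b) for x in a]   # vector op scalar
    if b_is_vec:
        return [op(a, y) for y in b]   # scalar op vector
    return op(a, b)                    # scalar op scalar

def evaluate(node, row):
    """Evaluate a tree given as nested tuples against one dataset row."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "var":
        return row[node[1]]
    if kind == "add":
        return broadcast(lambda x, y: x + y, evaluate(node[1], row), evaluate(node[2], row))
    if kind == "mul":
        return broadcast(lambda x, y: x * y, evaluate(node[1], row), evaluate(node[2], row))
    if kind == "mean":
        value = evaluate(node[1], row)
        # Aggregation turns a vector into a scalar; scalars pass through unchanged.
        return sum(value) / len(value) if isinstance(value, list) else value
    raise ValueError(f"unknown node kind: {kind}")

# One row with a scalar variable x and a vector variable v.
row = {"x": 2.0, "v": [1.0, 2.0, 3.0]}
tree = ("add", ("var", "x"), ("mean", ("mul", ("var", "v"), ("const", 2.0))))
result = evaluate(tree, row)  # x + mean(v * 2)
```

Arithmetic nodes apply element-wise when a vector is involved, and the aggregation node brings the result back to a scalar.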

Because developing the required features will likely involve implementing several components simultaneously, along with some changes within the core DataAnalysis plugins, there will be a main branch in which development will take place, with some trunk-reintegration branches to get completed features back into the trunk.

Change History (35)

comment:1 Changed 7 months ago by pfleck

  • Status changed from new to accepted

r17362 Branched trunk

comment:2 Changed 7 months ago by pfleck

r17364

  • Added double vectors for Dataset. Extended the type-checks for DataAnalysisProblemData.
  • Added a small benchmark instance with data containing vectors. Adapted the ArtificialRegressionDataDescriptor to be able to specify non-double values.

Additional thoughts:

  • Consider ModifiableDataset and DataPreprocessing.
  • Consider adding generic vector capabilities to IDataset that only allows double, string, DateTime.
  • Consider changing IList within the Dataset to a covariant alternative (a non-generic IReadOnlyList does not exist, however). Currently, the type must be exactly IReadOnlyList<double>; otherwise, the invariant IList<T> is not a subtype of IList<IList<double>>, for instance.
  • Each DataAnalysis algorithm should check on its own whether the types of the allowed input variables are compatible. For instance, the LR would only allow double values, whereas SymReg also supports string variables (as factor variables) and double-vector variables.

comment:3 Changed 7 months ago by pfleck

r17365 Added explicit vector types to avoid type mismatches when representing vectors as IList<T>, List<T> or IReadOnlyList<T>.

Additional thoughts:

  • The IDataset interface (and its implementation) now contains a lot of methods due to all the different available types (double, string, DateTime and also vector-versions). In the future, this should be unified.
  • Whether the types of the input variables are allowed should be decided by the algorithms, rather than the ProblemData.

comment:4 Changed 6 months ago by pfleck

r17369 Added Vector symbols to TypeCoherentExpressionGrammar & fixes.

comment:5 Changed 5 months ago by pfleck

r17400 Added Azzali benchmarks

comment:6 Changed 5 months ago by pfleck

r17401 Added parser for new benchmark data but did not commit the data yet (too large)

comment:7 Changed 5 months ago by pfleck

r17403 Added fix for non-numeric class labels

comment:8 Changed 4 months ago by pfleck

r17414 Started adding UCI time series regression benchmarks. Adapted parser (extracted format options & added parsing for double vectors).

comment:9 Changed 4 months ago by pfleck

r17415 Added additional UCI instances for time series regression

comment:10 Changed 4 months ago by pfleck

r17416 enabled variable impacts for vectorial data (if vectors have the same length)

comment:11 Changed 4 months ago by pfleck

r17418

  • (partially) enabled data preprocessing for vectorial data
  • use flat zip-files for large benchmarks instead of embedded resources (faster build times)
  • added multiple variants of vector benchmark I (vector length constraints)

comment:12 Changed 4 months ago by pfleck

r17419 added missing source file

comment:13 Changed 3 months ago by pfleck

r17447 Added TransportPlugin for MathNet.Numerics.

comment:14 Changed 3 months ago by pfleck

r17448 Replaced own Vector with MathNet.Numerics Vector.

  • Used types are not yet storable.
  • I do not like the using DoubleVector = MathNet.Numerics.LinearAlgebra.Vector<double>; directive. Maybe I'll switch to using MathNet.Numerics.LinearAlgebra.Single; and only use Vector as type.

comment:15 Changed 3 months ago by pfleck

r17449 Added Transformers for Vectors. Added specialized Transformers for double Dense/SparseVectorStorage and a generic mapper for the remaining (serializable) types.

comment:16 Changed 3 months ago by pfleck

r17452 Improved Persistence for Vectors (removed the generic transformer and used the existing array transformer instead).

comment:17 Changed 3 months ago by pfleck

r17455 Added a separate interpreter for vectors that reuses the existing symbols instead of creating explicit vector symbols.

comment:18 Changed 3 months ago by pfleck

r17456 Merged trunk to branch

comment:19 Changed 3 months ago by pfleck

r17460

  • Added full functional grammar for vectors.
  • Added sum and mean aggregation for vectors.

comment:20 Changed 3 months ago by pfleck

r17463 Added type coherent vector grammar to enforce that the root symbol is a scalar.
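The constraint can be pictured with a small Python sketch (not the actual grammar implementation). It infers whether a subtree yields a scalar or a vector and checks the root afterwards; the real grammar presumably enforces this by construction through allowed child symbols. The symbol names and the tuple encoding below are hypothetical.

```python
# Illustrative sketch only: symbol sets and tree encoding are assumptions.
SCALAR, VECTOR = "scalar", "vector"
AGGREGATIONS = {"sum", "mean"}           # vector -> scalar
ARITHMETIC = {"add", "sub", "mul", "div"}

def result_type(node, variable_types):
    """Infer whether a subtree evaluates to a scalar or a vector."""
    kind = node[0]
    if kind == "const":
        return SCALAR
    if kind == "var":
        return variable_types[node[1]]
    if kind in AGGREGATIONS:
        return SCALAR
    if kind in ARITHMETIC:
        child_types = [result_type(child, variable_types) for child in node[1:]]
        # A single vector operand makes the whole result a vector.
        return VECTOR if VECTOR in child_types else SCALAR
    raise ValueError(f"unknown symbol: {kind}")

def is_valid_root(tree, variable_types):
    """A tree is acceptable only if its root evaluates to a scalar."""
    return result_type(tree, variable_types) == SCALAR

variable_types = {"x": SCALAR, "v": VECTOR}
valid_tree = ("add", ("var", "x"), ("mean", ("var", "v")))
invalid_tree = ("add", ("var", "x"), ("var", "v"))
```

Here valid_tree is accepted because the vector is aggregated before it reaches the root, while invalid_tree would produce a vector at the root and is rejected.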

comment:21 Changed 3 months ago by pfleck

r17465 Simplified default vector grammar.

comment:22 Changed 3 months ago by pfleck

r17466 Added separate mean symbol instead of reusing the average symbol.

comment:23 Changed 3 months ago by pfleck

r17467 Added a "final aggregation" option for the vector interpreter in case the result is a vector.

comment:24 Changed 3 months ago by pfleck

r17469 Added TensorFlow.NET library for constant optimization with vectors (as alternative to AutoDiff+Alglib).

The build process for TensorFlow.NET is somewhat tedious for multiple reasons:

  • First, the NumSharp dependency for TensorFlow.NET is not strongly named, thus cannot be loaded with HL.
  • The native tensorflow.dll does not ship correctly with the Framework edition (on dotnet core it works).
  • A newer version of Google.Protobuf is required.

Due to the reasons above, the following steps were taken to import TensorFlow.NET:

  1. All dependencies for Google.Protobuf are upgraded to 3.11.4. This includes HEAL.Attic, which is manually built and then replaces the binaries in the bin. A manual build of HEAL.Attic is currently required anyway, because the BoxTransformer is still internal in the latest Nuget release but already fixed in the Master branch.
  2. Since OR-Tools (for exact optimization) includes prebuilt assemblies that reference the old Google.Protobuf version, it is currently excluded. I also removed the HeuristicLab.ProtobufCS-2.4.1 version to avoid any further conflicts. Therefore, external evaluation and some other plugins do not work on this branch.
  3. Although TensorFlow.NET is strongly named, its dependency NumSharp is not. Simply signing NumSharp did not work, because the reference from TensorFlow.NET expects an unsigned NumSharp assembly. As a solution, there is a standalone project within the Extlibs (TensorFlowNet) that references the Nuget package for TensorFlow.NET and uses ILMerge (also via Nuget package) to create a single assembly, containing both TensorFlow.NET.dll and NumSharp.dll, signed with the HL key. The resulting TensorFlow.NET.signed.dll is (file-) referenced within the transport plugin HeuristicLab.TensorFlowNet.
  4. The native tensorflow.dll is located within a separate nuget redist package. However, this does not work for dotnet framework for some reason. I created a dotnet core project with the redist package referenced, and copied the native x64 dll from there into the HeuristicLab.TensorFlowNet transport plugin as native dll plugin dependency.

As a final note: the whole build process is unstable. Sometimes the resulting TensorFlow.NET.signed.dll contains some unloadable types. Clearing the bin folder and praying to ILMerge and the build gods usually helps.

comment:25 Changed 3 months ago by pfleck

r17472 Moved Alglib+AutoDiff constant optimizer in own class and created base class to provide multiple constant-opt implementations.

comment:26 Changed 3 months ago by pfleck

r17474 Started working on the TF constant opt evaluator.

comment:27 Changed 3 months ago by pfleck

r17475 Updated HeuristicLab.Algorithms.DataAnalysis plugin and its dependencies to Framework 4.7.2 to avoid conflicting System.ValueTuple locations (mscorlib or nuget).

comment:28 Changed 3 months ago by pfleck

r17476 Worked on TF-based constant optimization.

comment:29 Changed 2 months ago by pfleck

r17489 Added version with explicit array shapes for explicit broadcasting.

comment:30 Changed 2 months ago by pfleck

r17493 Write optimized constants back to tree.


comment:31 Changed 8 weeks ago by pfleck

r17502

  • Switched whole TF-graph to float (Adam optimizer won't work with double).
  • Added progress and cancellation support for TF-const opt.
  • Added optional logging with console and/or file for later plotting.

comment:32 Changed 3 weeks ago by pfleck

r17541 Added a new benchmark instance.

comment:33 Changed 11 days ago by pfleck

r17554 Added some symbols for statistical aggregation.

comment:34 Changed 11 days ago by pfleck

r17556 Some corner cases for empty or length-one vectors now return NaN.
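As an illustration of the kind of corner case meant here, the following Python sketch returns NaN instead of raising an error. Which aggregations are affected in the branch is not stated above, so the choice of mean and sample variance is an assumption.

```python
import math

# Illustrative sketch only: the affected aggregations are an assumption.

def mean(v):
    """Mean of a vector; undefined (NaN) for an empty vector."""
    return sum(v) / len(v) if v else math.nan

def sample_variance(v):
    """Sample variance with n - 1 divisor; undefined (NaN) for fewer than two values."""
    n = len(v)
    if n < 2:
        return math.nan
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (n - 1)
```

Returning NaN keeps tree evaluation running for degenerate inputs and leaves it to the caller to decide how such results are handled.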

comment:35 Changed 7 days ago by pfleck

r17573 Added first draft for WindowedSymbol.

ToDo:

  • Make other aggregation symbols windowed
  • Better encoding for Offset and Length parameter
    • Continuous interpretation (e.g. weighted sum/mean)
    • Mutation is currently not symmetric (due to cast/floor mechanic when calculating the actual indices)
  • Create a test function specifically for benchmarking windowed symbols
  • Evaluate alternative: explicit "SubVector" symbol?
    • No continuous interpretation
    • Potential issues with incompatible vector lengths
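
The asymmetry mentioned in the ToDo list can be reproduced with a small Python sketch. It assumes that Offset and Length are stored as fractions in [0, 1] of the vector length and converted to integer indices with floor; the actual encoding in the draft may differ, and the function names are hypothetical.

```python
import math

# Illustrative sketch only: the fractional encoding of offset/length is an assumption.

def window_bounds(offset, length, n):
    """Map fractional offset/length in [0, 1] to integer [start, end) indices."""
    start = min(int(math.floor(offset * n)), n)
    size = int(math.floor(length * n))
    end = min(start + size, n)
    return start, end

def windowed_mean(v, offset, length):
    """Mean over the selected window; NaN if the window is empty."""
    start, end = window_bounds(offset, length, len(v))
    window = v[start:end]
    return sum(window) / len(window) if window else math.nan

v = [float(i) for i in range(8)]
bounds = window_bounds(0.5, 0.25, len(v))   # window covers indices 4 and 5
value = windowed_mean(v, 0.5, 0.25)

# Equal-sized mutation steps in opposite directions behave differently:
start_down = window_bounds(0.5 - 0.05, 0.25, len(v))[0]  # start index drops by one
start_up = window_bounds(0.5 + 0.05, 0.25, len(v))[0]    # start index stays the same
```

Because floor always rounds down, a small decrease of the offset can shift the window while an equally small increase leaves it unchanged. A continuous interpretation (e.g. a weighted sum/mean) would avoid this step behavior.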