Opened 9 months ago

Last modified 7 months ago

#2906 accepted feature request

Variable-Transformations for Data Analysis

Reported by: pfleck
Owned by: pfleck
Priority: high
Milestone: HeuristicLab 3.3.16
Component: Problems.DataAnalysis
Version: branch
Keywords:
Cc:

Description

The current transformation feature (implemented during the first version of the data preprocessing) is neither practical nor working satisfactorily. Originally, the feature was intended to support data analysis by letting users specify transformations of the data that make the training process easier for the learning algorithms.

Possible usage scenarios are:

  • Scale variables to a given range (e.g. 0 - 1)
  • Z-Normalize a variable
  • log-Transform a variable
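
For reference, these scenarios correspond to the following standard formulas. The notation is an assumption made here for illustration (target range $[l, u]$, observed minimum and maximum $x_{\min}$ and $x_{\max}$, mean $\mu$, standard deviation $\sigma$, logarithm base $b$) and is not taken from the implementation:

```latex
\begin{align*}
\text{scaling to the range } [l, u]: \quad & x' = l + \frac{(x - x_{\min})\,(u - l)}{x_{\max} - x_{\min}} \\
\text{z-normalization}: \quad & x' = \frac{x - \mu}{\sigma} \\
\text{log transformation}: \quad & x' = \log_b(x), \qquad x > 0
\end{align*}
```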

After training with transformed variables, an intermediate step is required that applies the data transformation to the original values before feeding them to the actual model. This creates two options for calculating the model accuracy (R², MSE, ...), depending on whether the calculation is based on

  • the transformed variables or
  • the original variables.

While the first describes the model accuracy in terms of the training algorithm, the latter describes how the model actually performs in real use. Currently, we are not sure which option is better; therefore, we want to support both options.
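
To make the difference between the two options concrete, here is a minimal sketch. It is not HeuristicLab code: the class `AccuracyScalesSketch`, the helper `Mse`, and all numbers are made up for illustration, and it assumes a log10-transformed target and a hypothetical model whose predictions live on the log scale.

```csharp
using System;
using System.Linq;

public static class AccuracyScalesSketch {
  // Mean squared error between two equally long arrays.
  public static double Mse(double[] actual, double[] predicted) {
    return actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();
  }

  public static void Main() {
    // Illustrative original target values (made up).
    double[] originalTarget = { 1.0, 10.0, 100.0, 1000.0 };
    // Target as seen by the training algorithm: log10 of the original.
    double[] transformedTarget = originalTarget.Select(y => Math.Log10(y)).ToArray();

    // Hypothetical model output on the transformed (log10) scale.
    double[] transformedPrediction = { 0.1, 0.9, 2.1, 2.9 };
    // Back-transformed predictions on the original scale.
    double[] originalPrediction = transformedPrediction.Select(p => Math.Pow(10, p)).ToArray();

    // Option 1: accuracy in terms of the training algorithm.
    Console.WriteLine("MSE (transformed): " + Mse(transformedTarget, transformedPrediction));
    // Option 2: accuracy as experienced in real use.
    Console.WriteLine("MSE (original):    " + Mse(originalTarget, originalPrediction));
  }
}
```

In this example every residual has the same magnitude on the log scale, while on the original scale the errors grow with the magnitude of the target, so the two MSE values can rank models differently.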

Additional thoughts

  • Performing the intermediate step of transforming the original variables before feeding them to the actual model could be done with a "Transformation-Model" that wraps the original model.
  • From the users' perspective, the transformations could be done "explicitly" or "hidden": either the transformed variables are actually shown in the Dataset and offered as additional input or target variables, or the original Dataset is shown and the transformation is performed hidden from the user. Currently, we want to make transformations explicitly visible to the user.
  • Each transformation must also specify an inverse transformation that has to be applied in case a transformation is performed on the target variable. For instance, if the target variable is log-transformed, the intermediate model must use the exponential function to transform the target back to its original value range.
  • For symbolic regression, the intermediate model can also be applied by directly changing the model tree.
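
The wrapping idea and the inverse transformation from the points above could be sketched roughly as follows. This is only an illustration under assumptions: `ITransformationSketch`, `LogTransformationSketch`, `LinearScalingSketch` and `TransformedModelSketch` are hypothetical names and not the interfaces used in the branch, and the inner model is reduced to a single-input function.

```csharp
using System;

// Hypothetical interface for illustration; not the HeuristicLab API.
public interface ITransformationSketch {
  double Apply(double x);    // original -> transformed
  double Inverse(double y);  // transformed -> original
}

// Natural-log transformation; its inverse is the exponential function.
public sealed class LogTransformationSketch : ITransformationSketch {
  public double Apply(double x) { return Math.Log(x); }
  public double Inverse(double y) { return Math.Exp(y); }
}

// Linear rescaling x -> (x - offset) / scale; covers range scaling and z-normalization.
public sealed class LinearScalingSketch : ITransformationSketch {
  private readonly double offset;
  private readonly double scale;

  public LinearScalingSketch(double offset, double scale) {
    if (scale == 0.0) throw new ArgumentException("scale must be non-zero", "scale");
    this.offset = offset;
    this.scale = scale;
  }

  public double Apply(double x) { return (x - offset) / scale; }
  public double Inverse(double y) { return y * scale + offset; }
}

// Wraps a model trained on transformed data so that callers work with original-scale values.
public sealed class TransformedModelSketch {
  private readonly ITransformationSketch input;
  private readonly ITransformationSketch target;
  private readonly Func<double, double> inner;

  public TransformedModelSketch(ITransformationSketch input, ITransformationSketch target, Func<double, double> inner) {
    this.input = input;
    this.target = target;
    this.inner = inner;
  }

  public double Predict(double originalInput) {
    double transformedInput = input.Apply(originalInput);    // intermediate step on the input
    double transformedPrediction = inner(transformedInput);  // actual model, transformed scale
    return target.Inverse(transformedPrediction);            // back to the original target range
  }
}

public static class TransformedModelDemo {
  public static void Main() {
    var inputScaling = new LinearScalingSketch(0.0, 10.0);  // maps the range 0..10 to 0..1
    var targetLog = new LogTransformationSketch();
    Func<double, double> inner = z => 2.0 * z + 1.0;        // made-up inner model
    var model = new TransformedModelSketch(inputScaling, targetLog, inner);
    Console.WriteLine(model.Predict(5.0));                  // computes exp(2.0 * 0.5 + 1.0)
  }
}
```

Keeping `Apply` and `Inverse` on the same object means a target transformation cannot be registered without its back-transformation, which is the requirement stated above.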

Attachments (1)

MergeTransformationsToTrunk.patch (286.5 KB) - added by pfleck 8 months ago.

Change History (21)

comment:1 Changed 9 months ago by pfleck

  • Status changed from new to accepted
  • Version set to branch

r15826 created branch and branched plugins

comment:2 Changed 9 months ago by pfleck

r15837 Adapted paths and plugin references.

comment:3 Changed 9 months ago by pfleck

r15846 First concept of simple transformation (single target transformation)

comment:4 Changed 9 months ago by pfleck

r15847 Implemented chained transformations on target and input variables.

comment:5 Changed 9 months ago by pfleck

r15848 Added TransformedModelView, implemented "remove virtual columns"

After discussing with mkommenda, we decided that, ultimately, we want to implement a Dataset or ProblemData with “virtual variables”.

  • Virtual variables are created via transformations, thus their values cannot be changed manually.
  • Virtual variables can be the target variable of a transformation, i.e. multiple transformations can be chained on the same (virtual) variable. Original variables cannot be the target of a transformation.

comment:6 Changed 9 months ago by pfleck

r15856 Implemented transformation re-apply of an already reverse-transformed model.

comment:7 Changed 9 months ago by pfleck

r15858 Fixed inverse transform early removal of columns.

comment:8 Changed 9 months ago by pfleck

The current transformation-feature also allows adding transformation-records to data that was transformed elsewhere before importing it into HL, for example in KNIME. After modeling, the back-transformation can be applied in the same way as if the transformation were performed in HL-DataPreprocessing.

The only issue at the moment is that the Transformations of a ProblemData are read-only. Thus, the good old register-cloned-object trick must be applied:

// Solution that was trained on data transformed outside of HL.
IRegressionSolution solution = vars.Solution;
var problemData = (RegressionProblemData)solution.ProblemData;
var variableNames = problemData.Dataset.VariableNames.Select(x => new StringValue(x));

// First recorded step: linear transformation (Intercept = 1.0) from the original to the transformed target.
var shiftTransformation = new DataAnalysisTransformation(variableNames) {
  OriginalVariable = "MyOriginalTarget",
  TransformedVariable = "MyTransformedTarget",
  Transformation = new LinearTransformation() { Intercept = 1.0 }
};
// Second recorded step: base-10 logarithm, applied in place on the transformed target.
var logTransformation = new DataAnalysisTransformation(variableNames) {
  OriginalVariable = "MyTransformedTarget",
  TransformedVariable = "MyTransformedTarget",
  Transformation = new LogarithmTransformation() { Base = 10 }
};

var newTransformations = new ItemList<IDataAnalysisTransformation>() {
  shiftTransformation,
  logTransformation
};

// The transformations are read-only, so register the new list as the "clone" of the existing one;
// cloning the solution then yields a copy that carries the new transformation records.
var cloner = new Cloner();
cloner.RegisterClonedObject(problemData.TransformationsParameter.Value, newTransformations.AsReadOnly());

vars.TransformedSolution = cloner.Clone(solution);

comment:9 Changed 9 months ago by pfleck

r15862 Restore specialized regression solution type when re-applying transformations.

comment:10 Changed 9 months ago by pfleck

r15863 branched DataAnalysis.Symbolic(+ Regression) plugins.

comment:11 Changed 9 months ago by pfleck

r15864 Removed obsolete transformation related code for SymReg. Small UI changes for transformed solutions.

comment:12 Changed 9 months ago by pfleck

r15865 Added PreprocessingTransformation as a custom view-model for transformations in preprocessing.

comment:13 Changed 9 months ago by pfleck

r15870

  • Implemented for classification, clustering, etc.
  • Simplified Transformation interfaces (read-only, ...).
  • Started moving transformation logic out of ProblemData.

comment:14 Changed 9 months ago by pfleck

r15879 minor refactoring

comment:15 Changed 9 months ago by pfleck

r15880 forgot to commit file

comment:16 Changed 9 months ago by pfleck

r15884 Refactoring

  • Moved transformation-specific parts out of existing interfaces.
  • Moved all Transformation logic to DataAnalysisTransformation.
  • Simplified (Inverse)Transformation of Dataset/ProblemData/Model/Solution.

comment:17 Changed 8 months ago by pfleck

r15885 Updated project references + small refactoring

Last edited 8 months ago by pfleck

Changed 8 months ago by pfleck

comment:18 Changed 8 months ago by pfleck

  • Owner changed from pfleck to mkommend
  • Status changed from accepted to reviewing

Please review using the attached patch for merging into trunk.

The remaining transformations will be implemented after the first reviewing cycle is completed.

comment:19 Changed 7 months ago by pfleck

  • Owner changed from mkommend to pfleck
  • Status changed from reviewing to assigned

r15938

  • added Offset to log transformation
  • renamed parameter of linear transformation
  • added RescaleTransformation

Additionally, I am thinking of removing the static interfaces of the transformations. The static versions already cause a lot of (redundant) code, and the benefit of using transformations without creating instances is rather minor.

comment:20 Changed 7 months ago by pfleck

  • Status changed from assigned to accepted