Opened 16 months ago
Last modified 8 weeks ago
#2906 accepted feature request
Variable-Transformations for Data Analysis
Reported by: | pfleck | Owned by: | pfleck |
---|---|---|---|
Priority: | high | Milestone: | HeuristicLab 3.3.17 |
Component: | Problems.DataAnalysis | Version: | branch |
Keywords: | | Cc: | |
Description
The current transformation feature (implemented during the first version of the data preprocessing) is neither practical nor functioning satisfactorily. Originally, the transformation feature was intended to support data analysis by being able to specify transformations to the data to make the training process easier for the learning algorithms.
Possible usage scenarios are:
- Scale variables to a given range (e.g. 0 - 1)
- Z-Normalize a variable
- log-Transform a variable
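The three scenarios above can be sketched as plain functions over a column of values. This is a minimal, self-contained illustration and not the HeuristicLab implementation: the class `TransformationSketch` and its method names are hypothetical, and using the population standard deviation for the z-normalization is an assumption.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of HeuristicLab: illustrates the three usage scenarios.
public static class TransformationSketch {
  // Linearly rescales the values so that the minimum maps to lower and the maximum to upper.
  public static double[] ScaleToRange(double[] values, double lower = 0.0, double upper = 1.0) {
    double min = values.Min(), max = values.Max();
    double range = max - min;
    if (range == 0.0) throw new ArgumentException("Cannot scale a constant variable.");
    return values.Select(v => lower + (v - min) / range * (upper - lower)).ToArray();
  }

  // Subtracts the mean and divides by the population standard deviation (assumption).
  public static double[] ZNormalize(double[] values) {
    double mean = values.Average();
    double std = Math.Sqrt(values.Select(v => (v - mean) * (v - mean)).Average());
    if (std == 0.0) throw new ArgumentException("Cannot z-normalize a constant variable.");
    return values.Select(v => (v - mean) / std).ToArray();
  }

  // Applies a logarithm with the given base; all values must be positive.
  public static double[] LogTransform(double[] values, double logBase = Math.E) {
    if (values.Any(v => v <= 0.0)) throw new ArgumentException("Logarithm requires positive values.");
    return values.Select(v => Math.Log(v, logBase)).ToArray();
  }
}
```

Scaling and z-normalization depend on statistics of the data (min/max, mean/standard deviation), so a real implementation would presumably have to store those parameters in order to apply the same transformation to new data later.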
After training with transformed variables, an intermediate step is required that applies the data transformation to the original values before feeding them to the actual model. This creates two options for calculating the model accuracy (R², MSE, ...), depending on whether the calculation is based on
- the transformed variables or
- the original variables.
While the first describes the model accuracy in terms of the training algorithm, the latter describes how the model actually performs in real use. Currently, we are not sure which option is better; therefore, we want to support both options.
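To make the difference between the two options concrete, here is a minimal sketch that scores the same predictions both ways. It assumes a log10-transformed target; `AccuracySketch` and its methods are hypothetical names, not HeuristicLab API.

```csharp
using System;
using System.Linq;

// Hypothetical sketch, not HeuristicLab code: two ways to score a model
// that was trained on a log10-transformed target.
public static class AccuracySketch {
  public static double Mse(double[] actual, double[] predicted) {
    if (actual.Length != predicted.Length) throw new ArgumentException("Length mismatch.");
    return actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Average();
  }

  // Option 1: error in the transformed space, as seen by the training algorithm.
  public static double MseTransformed(double[] originalTarget, double[] transformedPredictions) {
    var transformedTarget = originalTarget.Select(y => Math.Log10(y)).ToArray();
    return Mse(transformedTarget, transformedPredictions);
  }

  // Option 2: error in the original space, after back-transforming the predictions.
  public static double MseOriginal(double[] originalTarget, double[] transformedPredictions) {
    var backTransformed = transformedPredictions.Select(p => Math.Pow(10.0, p)).ToArray();
    return Mse(originalTarget, backTransformed);
  }
}
```

With a logarithmic target, a small error in the transformed space can correspond to a much larger error in original units, which is one reason the two measures can rank models differently.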
Additional thoughts
- Performing the intermediate step of transforming the original variables before feeding them to the actual model could be done with a "Transformation-Model" that wraps the original model.
- From the users' perspective, the transformations could be done "explicitly" or "hidden": either the transformed variables are actually shown in the Dataset and offered as additional input or target variables, or the original Dataset is shown and the transformation is performed hidden from the user. Currently, we want to make transformations explicitly visible to the user.
- Each transformation must also specify an inverse transformation that has to be applied in case a transformation is performed on the target variable. For instance, if the target variable is log-transformed, the intermediate model must use the exponential function to transform the target back to its original value range.
- For symbolic regression, the intermediate model can also be applied by directly changing the model tree.
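The "Transformation-Model" idea and the required inverse transformation can be sketched as a wrapper around a model that was trained on transformed data. All types below are hypothetical and deliberately reduced to a single input variable; they do not mirror HeuristicLab's interfaces.

```csharp
using System;

// Hypothetical types for illustration only; they do not mirror HeuristicLab's interfaces.
public interface ITransformationSketch {
  double Apply(double value);    // original -> transformed
  double Inverse(double value);  // transformed -> original
}

public sealed class LogSketch : ITransformationSketch {
  private readonly double logBase;
  public LogSketch(double logBase) { this.logBase = logBase; }
  public double Apply(double value) => Math.Log(value, logBase);
  public double Inverse(double value) => Math.Pow(logBase, value);
}

public sealed class LinearSketch : ITransformationSketch {
  private readonly double slope, intercept;
  public LinearSketch(double slope, double intercept) {
    if (slope == 0.0) throw new ArgumentException("A zero slope is not invertible.");
    this.slope = slope;
    this.intercept = intercept;
  }
  public double Apply(double value) => slope * value + intercept;
  public double Inverse(double value) => (value - intercept) / slope;
}

// Wraps a model trained on transformed data so that callers can keep using original values.
public sealed class TransformationModelSketch {
  private readonly Func<double, double> innerModel; // operates on transformed values
  private readonly ITransformationSketch inputTransformation;
  private readonly ITransformationSketch targetTransformation;

  public TransformationModelSketch(Func<double, double> innerModel,
      ITransformationSketch inputTransformation, ITransformationSketch targetTransformation) {
    this.innerModel = innerModel;
    this.inputTransformation = inputTransformation;
    this.targetTransformation = targetTransformation;
  }

  public double Predict(double originalInput) {
    double transformedInput = inputTransformation.Apply(originalInput);
    double transformedPrediction = innerModel(transformedInput);
    // The target was transformed for training, so the prediction is mapped back.
    return targetTransformation.Inverse(transformedPrediction);
  }
}
```

For example, wrapping the inner model `t => 2 * t` with `new LinearSketch(0.5, 0.0)` on the input and `new LogSketch(10.0)` on the target yields a model that accepts original inputs and returns predictions in the original target range.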
Attachments (1)
Change History (22)
comment:1 Changed 16 months ago by pfleck
- Status changed from new to accepted
- Version set to branch
comment:2 Changed 16 months ago by pfleck
r15837 Adapted paths and plugin references.
comment:3 Changed 15 months ago by pfleck
r15846 First concept of simple transformation (single target transformation)
comment:4 Changed 15 months ago by pfleck
r15847 Implemented chained transformations on target and input variables.
comment:5 Changed 15 months ago by pfleck
r15848 Added TransformedModelView, implemented "remove virtual columns"
After discussing with mkommenda, we decided that, ultimately, we want to implement a Dataset or ProblemData with “virtual variables”.
- Virtual variables are created via transformations, thus their values cannot be changed manually.
- Virtual variables can be the target variable of a transformation, i.e. multiple transformations can be applied to the same (virtual) variable. Original variables cannot be the target of a transformation.
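The two rules for virtual variables can be illustrated with a small container class. This is only a sketch under assumptions: `VirtualDatasetSketch` and its members are hypothetical and unrelated to the actual Dataset or ProblemData classes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of a dataset with virtual variables; not HeuristicLab's Dataset class.
public sealed class VirtualDatasetSketch {
  private readonly Dictionary<string, double[]> original = new Dictionary<string, double[]>();
  private readonly Dictionary<string, double[]> virtualColumns = new Dictionary<string, double[]>();

  public void AddOriginal(string name, double[] values) {
    original.Add(name, (double[])values.Clone());
  }

  public IReadOnlyList<double> GetValues(string name) {
    if (original.TryGetValue(name, out var values)) return values;
    return virtualColumns[name];
  }

  // Manual edits are only allowed on original variables.
  public void SetValue(string name, int row, double value) {
    if (!original.ContainsKey(name))
      throw new InvalidOperationException("Only original variables can be edited manually.");
    original[name][row] = value;
  }

  // Writes the transformed source column into a virtual column. Source and target
  // may be the same virtual variable, which chains transformations on it.
  public void Transform(string source, string target, Func<double, double> transformation) {
    if (original.ContainsKey(target))
      throw new InvalidOperationException("Original variables cannot be the target of a transformation.");
    virtualColumns[target] = GetValues(source).Select(transformation).ToArray();
  }
}
```

Calling `Transform("y", "y_t", v => v + 1.0)` followed by `Transform("y_t", "y_t", v => Math.Log10(v))` chains two transformations on the same virtual variable while leaving the original column `y` untouched.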
comment:6 Changed 15 months ago by pfleck
r15856 Implemented transformation re-apply of an already reverse-transformed model.
comment:7 Changed 15 months ago by pfleck
r15858 Fixed inverse transform early removal of columns.
comment:8 Changed 15 months ago by pfleck
The current transformation-feature also allows adding transformation-records to data that was transformed elsewhere before importing it into HL, for example in KNIME. After modeling, the back-transformation can be applied in the same way as if the transformation were performed in HL-DataPreprocessing.
However, the only current issue is that the Transformations of a ProblemData are read-only. Thus, the good old register-cloned-object trick must be applied:
```csharp
IRegressionSolution solution = vars.Solution;
var problemData = (RegressionProblemData)solution.ProblemData;
var variableNames = problemData.Dataset.VariableNames.Select(x => new StringValue(x));

var shiftTransformation = new DataAnalysisTransformation(variableNames) {
  OriginalVariable = "MyOriginalTarget",
  TransformedVariable = "MyTransformedTarget",
  Transformation = new LinearTransformation() { Intercept = 1.0 }
};
var logTransformation = new DataAnalysisTransformation(variableNames) {
  OriginalVariable = "MyTransformedTarget",
  TransformedVariable = "MyTransformedTarget",
  Transformation = new LogarithmTransformation() { Base = 10 }
};
var newTransformations = new ItemList<IDataAnalysisTransformation>() {
  shiftTransformation,
  logTransformation
};

var cloner = new Cloner();
cloner.RegisterClonedObject(problemData.TransformationsParameter.Value, newTransformations.AsReadOnly());
vars.TransformedSolution = cloner.Clone(solution);
```
comment:9 Changed 15 months ago by pfleck
r15862 Restore specialized regression solution type when re-applying transformations.
comment:10 Changed 15 months ago by pfleck
r15863 branched DataAnalysis.Symbolic(+ Regression) plugins.
comment:11 Changed 15 months ago by pfleck
r15864 Removed obsolete transformation related code for SymReg. Small UI changes for transformed solutions.
comment:12 Changed 15 months ago by pfleck
r15865 Added PreprocessingTransformation as a custom view-model for transformations in preprocessing.
comment:13 Changed 15 months ago by pfleck
- Implemented for classification, clustering, etc.
- Simplified Transformation interfaces (read-only, ...).
- Started moving transformation logic out of ProblemData.
comment:14 Changed 15 months ago by pfleck
r15879 minor refactoring
comment:15 Changed 15 months ago by pfleck
r15880 forgot to commit file
comment:16 Changed 15 months ago by pfleck
r15884 Refactoring
- Moved transformation-specific parts out of existing interfaces.
- Moved all Transformation logic to DataAnalysisTransformation.
- Simplified (Inverse)Transformation of Dataset/ProblemData/Model/Solution.
comment:17 Changed 15 months ago by pfleck
r15885 Updated project references + small refactoring
Changed 15 months ago by pfleck
comment:18 Changed 15 months ago by pfleck
- Owner changed from pfleck to mkommend
- Status changed from accepted to reviewing
Please review using the attached patch for merging into trunk.
The remaining transformations will be implemented after the first reviewing cycle is completed.
comment:19 Changed 13 months ago by pfleck
- Owner changed from mkommend to pfleck
- Status changed from reviewing to assigned
- added Offset to log transformation
- renamed parameter of linear transformation
- added RescaleTransformation
Additionally, I am thinking of removing the static interfaces of the transformations. Having the static versions already causes a lot of (redundant) code, and the benefit of using transformations without creating instances is rather minor.
comment:20 Changed 13 months ago by pfleck
- Status changed from assigned to accepted
comment:21 Changed 8 weeks ago by abeham
- Milestone changed from HeuristicLab 3.3.16 to HeuristicLab 3.3.17
r15826 created branch and branched plugins