Opened 8 months ago
Last modified 10 days ago
#2650 reviewing feature request
Support for categorical variables (R factors) for symbolic regression with GP
Reported by: | gkronber | Owned by: | mkommend |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Problems.DataAnalysis.Symbolic | Version: | branch |
Keywords: | Cc: |
Description
We frequently encounter regression / classification problems where the dataset contains categorical variables. It would be great if such variables could be used directly within symbolic regression models.
Change History (63)
comment:1 Changed 8 months ago by gkronber
comment:3 Changed 8 months ago by gkronber
- added weight for FactorVariable (necessary for LR)
- introduced VariableBase and VariableTreeNodeBase and IVariableSymbol
- support for factors in LR
- extended variable impacts in solution view
- fixed ERC view for regression
- support for FactorVariable in simplifier
- improved support for FactorVariable in constants optimizer
- multiple related changes and small fixes
comment:4 Changed 8 months ago by gkronber
r14239: #2650: merged r14234:14236 from trunk to branch
comment:5 Changed 8 months ago by gkronber
Shouldn't the variable impacts view be added as a solution view instead of an extra button?
comment:6 Changed 8 months ago by gkronber
r14240: added support for categorical variables to LDA and MNL
comment:7 Changed 8 months ago by gkronber
r14241: added support for factor variables in specific solution comparison view for symbolic classification solutions
comment:8 Changed 8 months ago by gkronber
- Status changed from new to accepted
comment:9 Changed 8 months ago by gkronber
- Version changed from 3.3.14 to branch
comment:10 Changed 8 months ago by gkronber
r14242: added support for factor variables to OneR algorithm
comment:11 Changed 8 months ago by gkronber
r14243: renamed FactorVariable -> BinaryFactorVariable
comment:12 Changed 8 months ago by gkronber
r14248: added support for factor variables to target variation view together with Philipp
comment:13 Changed 8 months ago by gkronber
r14249: added new symbol FactorVariable (renamed previous symbol to BinaryFactorVariable) Work in progress.
comment:14 Changed 8 months ago by gkronber
- extended non-linear regression to work with factors
- fixed bugs in constants optimizer and tree interpreter
- improved simplification of factor variables
- added support for factors to ERC view
- added support for factors to solution comparison view
- activated view for all factors
comment:15 Changed 7 months ago by gkronber
r14259: added support for factor variables to Excel formatter and Excel exporter as well as to the LaTeX formatter and consequently the mathematical representation view.
comment:16 Changed 7 months ago by gkronber
r14266: improved handling of factors in ConstantOptimizationEvaluator (create binary indicators only once)
comment:17 Changed 7 months ago by gkronber
- r14276: merged r14244 from trunk to branch
- r14277: merged r14245:14273 from trunk to branch (fixing conflicts in RegressionSolutionTargetResponseGradientView)
comment:18 Changed 6 months ago by gkronber
Bugs:
comment:19 Changed 6 months ago by gkronber
r14330: merged r14282:14322 from trunk to branch (fixing conflicts)
comment:20 Changed 6 months ago by gkronber
r14331: fixed compilation errors after merge
comment:21 Changed 5 months ago by gkronber
r14339: fixed bug in simplification of factor symbols
comment:22 Changed 5 months ago by gkronber
r14351: merged r14332:14350 from trunk to branch
comment:23 Changed 4 months ago by gkronber
r14399: merged r14352:14376 from trunk to branch (resolving conflicts in SymbolicDataAnalysisExpressionLatexFormatter)
comment:24 Changed 4 months ago by gkronber
r14401: merged r14378:14400 from trunk to branch
comment:25 Changed 4 months ago by gkronber
r14402: fixed a bug in constant optimizer in relation to lagged variables
comment:26 Changed 4 months ago by gkronber
r14403: added support for factor variables to C# formatter
comment:27 Changed 4 months ago by gkronber
r14421: merged r14405:14418 from trunk to branch
comment:28 Changed 4 months ago by gkronber
Should be finished before #2697
comment:29 Changed 4 months ago by gkronber
r14449: merged r14422:14443 from trunk to branch (resolving conflicts)
comment:30 Changed 3 months ago by gkronber
r14497: updated mergeinfo to record the merged changesets r14422:14443 (happened in r14449)
comment:31 Changed 3 months ago by gkronber
r14498: merged r14457:14494 from trunk to branch (resolving conflicts)
comment:32 Changed 3 months ago by gkronber
r14499: updated mergeinfo to record the merged changesets r14244:14273 (happened in r14276 and r14277)
comment:33 Changed 3 months ago by gkronber
r14501: better handling of variable names (as identifiers) and fixed some bugs
comment:34 Changed 3 months ago by gkronber
r14502: another small fix in the C# formatter
comment:35 Changed 3 months ago by gkronber
r14534: added simplifier unit tests for factor symbols
comment:36 Changed 3 months ago by gkronber
r14535: worked on simplifier
comment:37 Changed 3 months ago by gkronber
r14539: extended simplifier to pass all new unit tests for factors and binary factors
comment:38 Changed 3 months ago by gkronber
r14540: fixed bugs in simplifier (causing multiple references to the same tree nodes within a tree)
comment:39 Changed 3 months ago by gkronber
Features to review / test:
- Symbolic Regression:
  - String variables are visible in the input variables list in the problem data view
  - Grammar for symbolic regression contains two new symbols (Factor and BinaryFactor). The symbols are activated if string inputs are selected in the problem data view.
  - For string variables the variable frequency analyzer can also analyze references to specific variable values
  - Simplification of factor and binary factor nodes
  - Constant opt of factor and binary factor nodes
  - LaTeX formatter and view for mathematical representation show all constant values for factor variables.
  - Excel exporter and formatter use nested IF to produce the correct constants for string variables
  - Infix formatter produces an expression referencing string variables that can be used directly for NLR
  - C# formatter uses helper methods to produce the correct constants for string variables; it uses encoded identifier names for all variables.
- NLR:
  - The function can also reference string variables in the same way as double variables.
  - Factor symbols are used for string variables. Factor weights are tuned with constant opt.
- LR:
  - If string variables are selected as input variables, the algorithm includes binary variables for each string variable value (using the BinaryFactor symbol)
  - If the string variable only contains two distinct values, only one binary variable is generated (variables are 100% correlated)
- General:
  - Variable impact calculation also includes string variables (replace by most common value)
  - For string variables the error characteristics curve also produces the LR base line model where binary variables are generated for each string variable value
  - Partial dependency plots show bar charts for models which reference string variables
  - Partial dependency plots show bar charts with confidence intervals for models which produce an estimated variance and which reference string variables
    - generate a symb. reg. model
    - show the visualization for a symbolic regression model with factors
    - generate a Gaussian process model (without factors)
    - drag the Gaussian process solution onto the visualization for the symbolic regression model
- Classification:
  - LDA: if string variables are allowed as inputs, binary factors are generated for each variable value (similar to LR)
  - OneR: allows splitting by string variable values
  - Multi-nomial Logit supports string variables
  - Solution comparison view for classification supports string variables because OneR, LDA, and Multi-nomial Logit support string variables
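The LR / LDA behaviour listed above (one binary variable per string value, but only a single one when the variable has exactly two distinct values) can be illustrated with a small sketch. It is written in Python, not taken from the HeuristicLab C# sources; the helper name `binary_indicators`, the column naming scheme and the choice of which of two values is kept are all assumptions made for illustration.

```python
def binary_indicators(name, values):
    """Return {column_name: [0.0 or 1.0, ...]} for one string variable.

    Hypothetical helper, not HeuristicLab API. One indicator column is
    produced per distinct value. If there are exactly two distinct values,
    only the first one (in sorted order) is kept, because the second
    column would be perfectly correlated with it.
    """
    levels = sorted(set(values))
    if len(levels) == 2:
        levels = levels[:1]
    return {
        f"{name}={level}": [1.0 if v == level else 0.0 for v in values]
        for level in levels
    }


cols = binary_indicators("color", ["red", "green", "blue", "red"])
two = binary_indicators("sex", ["m", "f", "f"])
```

Here `cols` holds three indicator columns, while `two` holds only the column for the value `f`.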
comment:40 Changed 3 months ago by gkronber
r14541: better wording in exception string
comment:41 Changed 3 months ago by gkronber
r14542: merged r14504:14533 from trunk to branch
comment:42 Changed 3 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
comment:43 Changed 3 months ago by mkommend
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.15
comment:44 Changed 3 months ago by gkronber
r14554: removed warnings
comment:45 Changed 3 months ago by gkronber
r14560: added a method for exporting models as Excel expressions with given variable mapping (for convenience)
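As a rough illustration of what exporting a factor term with a given variable mapping could look like, the following Python sketch builds a nested IF expression for one string variable. It is not the exporter's actual output format: the function name `excel_factor_term`, the cell reference passed in, and the fallback value 0 for unknown levels are assumptions.

```python
def excel_factor_term(cell, weights):
    """Build a nested IF expression selecting the weight for a string cell.

    cell:    Excel cell reference such as "$A1", taken from a
             variable-to-column mapping (hypothetical).
    weights: {level: weight}; levels not listed evaluate to 0 (assumption).
    """
    expr = "0"
    # Build from the innermost IF outwards so the first level is tested first.
    for level, w in reversed(list(weights.items())):
        escaped = level.replace('"', '""')  # Excel doubles quotes inside strings
        expr = f'IF({cell}="{escaped}", {w}, {expr})'
    return expr


term = excel_factor_term("$A1", {"red": 1.5, "blue": -2.0})
```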
comment:46 follow-up: ↓ 50 Changed 2 months ago by mkommend
Notes:
- Merge conflicts with current trunk (r14587).
- Will elastic net be adapted to work with factors (similar to LR)?
- Why are categorical (string) variables displayed last in the problem data (input variables)?
- Alglib util: use Any() to check for empty enumerables instead of .Count() == 0.
LR works as expected. TBC.
comment:47 Changed 2 months ago by gkronber
r14589: merged r14548:14582 from trunk to branch
comment:48 Changed 2 months ago by gkronber
r14590: created a branch for the reintegration into trunk
comment:49 Changed 2 months ago by gkronber
comment:50 in reply to: ↑ 46 Changed 2 months ago by gkronber
comment:51 Changed 2 months ago by gkronber
r14593: deleted branch again
comment:53 Changed 2 months ago by mkommend
r14615: Minor changes in factors branch (sealed one factor classes).
comment:54 Changed 5 weeks ago by mkommend
r14693: Fixed ordering of variables in problem data.
comment:55 Changed 5 weeks ago by mkommend
r14701: Switched definition of vertical and horizontal concatenation of matrices. (cf. https://de.mathworks.com/help/matlab/ref/vertcat.html#examples)
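For reference, the convention in the linked MATLAB documentation is that vertical concatenation stacks rows (column counts must match) and horizontal concatenation appends columns (row counts must match). A minimal Python sketch with matrices as lists of rows; `vert_cat` and `horz_cat` are illustrative names, not the method names used in the branch.

```python
def vert_cat(a, b):
    """Stack b below a; both must have the same number of columns."""
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("column counts differ")
    return [row[:] for row in a] + [row[:] for row in b]


def horz_cat(a, b):
    """Append the columns of b to the right of a; row counts must match."""
    if len(a) != len(b):
        raise ValueError("row counts differ")
    return [ra + rb for ra, rb in zip(a, b)]


stacked = vert_cat([[1, 2]], [[3, 4]])      # 2 rows, 2 columns
side_by_side = horz_cat([[1], [2]], [[3], [4]])  # 2 rows, 2 columns
```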
comment:56 Changed 4 weeks ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Review comments
- IDataAnalysisProblemData still contains a TODO comment, where all usages should be checked. (gkronber: checked and removed with r14763.)
- RegressionSolutionVariableImpactsCalculator: Calculation for factors
  - The impact of a factor is calculated based on evaluating all combinations and ignores the configured replacement method!
    - median and average should use the mode
    - shuffle would work the same
    - noise should calculate a distribution and sample from it (similar to shuffle?)
    - noise is more or less useless (only for normally distributed numbers)
  - (gkronber: r14762: added option to specify replacement method for factor variables. However, the solution is different from the solution proposed above. Please check.)
- ConstantOptimization should use Dictionaries instead of [] for variable names and values
  - r14756 improved code for handling variables in the constant optimizer by using a dictionary
- C# Formatter: the VariableName2Identifier method using the bytes from the encoding produces unreadable code. Just use the variable name and maybe replace whitespace with underscores. If it does not compile, the user should handle the remaining issues manually. (gkronber: solved with r14720)
- Infix Expression Formatter does not handle factor variable symbols correctly (the parser does not handle them either) (gkronber: solved with r14761)
- ComplexityCalculator does not handle BinaryFactorVariables (gkronber: fixed with r14760)
- BinaryFactorVariable and FactorVariableSymbol are identical (gkronber: discussed with mko and we decided that this should be ok.)
- FactorVariableTreeNode uses linear search in GetValue (gkronber: solved with r14717:14719)
- BinaryFactorVariableTreeNode ShakeLocalParameters: why is the variable name only changed with a 20% probability? (gkronber: r14758 unified mutation behaviour for all variable tree nodes. Introduced parameter for probability of changing a variable. r14759 added a way to set the probability of variable changes via the GUI)
- Shaking in the different types of VariableTreeNodes should work at least similarly (reuse of weights, variable name changes lead to completely new weights, ...) (gkronber: see r14758 and r14759)
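The remark about linear search in GetValue can be illustrated with a small sketch: the mapping from a category value to its weight index is built once, so each lookup is a dictionary access instead of a scan over all values. This is Python, not the actual FactorVariableTreeNode code; the class `FactorNode`, its fields and the NaN result for unknown values are assumptions.

```python
class FactorNode:
    """Hypothetical stand-in for a factor variable tree node."""

    def __init__(self, variable_name, levels, weights):
        if len(levels) != len(weights):
            raise ValueError("need exactly one weight per level")
        self.variable_name = variable_name
        self.weights = list(weights)
        # Built once; replaces a linear scan over `levels` on every call.
        self._index = {level: i for i, level in enumerate(levels)}

    def get_value(self, level):
        """Return the weight for `level`; unknown levels yield NaN (assumption)."""
        i = self._index.get(level)
        return self.weights[i] if i is not None else float("nan")


node = FactorNode("color", ["red", "green"], [1.5, -2.0])
green_weight = node.get_value("green")
```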
Optional remarks
- Will elastic net be adapted to work with factors (similar to LR)?
- Alglib algs (LR, LDA, multinomial logit, ...) create two matrices for double and binary variables and merge these two afterwards. Wouldn't it be more efficient to create the whole matrix in one pass (without the two intermediate matrices)?
- ILEmittingInterpreter changes have no effect and actually lead to less descriptive error messages (revert?) (gkronber: reverted with r14715)
- Are the data densities in the gradientchart correct?
  - just checked this; I believe they are.
- Simplifier can combine a constant with a FactorVariable if it is beneath an addition or subtraction (w_x + c) (gkronber: This is already supported and a unit test exists. Added another unit test for subtraction with r14716.)
- There are remaining TODO items in comment:1:ticket:2650 (gkronber: checked and made some more changes: r14764, r14766)
- Be careful when reintegrating the branch due to SVN issues (replaced and copied files).
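The simplification of (w_x + c) mentioned above amounts to folding the constant into every per-value weight of the factor, so that the sum becomes a single factor term. A minimal Python sketch under that reading; `fold_constant` and the dictionary representation of the weights are hypothetical and not the simplifier's actual data structures.

```python
def fold_constant(weights, c, op="+"):
    """Fold the constant c into the per-level weights of a factor term w_x.

    weights: {level: weight}
    op "+" yields the weights of (w_x + c), op "-" those of (w_x - c).
    """
    if op == "+":
        return {level: w + c for level, w in weights.items()}
    if op == "-":
        return {level: w - c for level, w in weights.items()}
    raise ValueError(f"unsupported operator: {op}")


added = fold_constant({"a": 1.0, "b": 2.0}, 0.5)
subtracted = fold_constant({"a": 1.0, "b": 2.0}, 0.5, op="-")
```

After folding, evaluating the new factor for any value gives the same result as evaluating the original factor and then adding (or subtracting) the constant.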
comment:57 Changed 2 weeks ago by gkronber
r14751: merged r14597:14737 from trunk to branch
comment:58 Changed 2 weeks ago by gkronber
comment:59 Changed 2 weeks ago by gkronber
comment:60 Changed 2 weeks ago by gkronber
comment:61 Changed 2 weeks ago by gkronber
r14755: merged r14749:14750 from trunk to branch (commit message of r14755 is incorrect)
comment:62 Changed 12 days ago by gkronber
r14764: adapted formatters to handle factor symbols
comment:63 Changed 12 days ago by gkronber
r14765: added support for negative weights for parsing expressions with factors
comment:64 Changed 10 days ago by gkronber
r14766: reviewed and tested all analyzers and made some smaller changes
comment:65 Changed 10 days ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
r14232:14233: created a feature branch for #2650 (support for categorical variables in symb reg) with a first set of changes. Work in progress...
TODO:
- handle correctly in all formatters (Smalltalk formatter and external evaluation formatter have not been adjusted)
- view for factor variables (configuration of actually allowed factors)
- create a set of unit tests for the simplifier (handle correctly in simplifier)
- extend simplifier to handle BinaryFactorVariable
- extend simplifier to combine FactorVariables with BinaryFactorVariable
- handle correctly in variable impacts view
- handle correctly in Non-linear regression (infix parser and infix formatter)
- support in all analyzers which handle variable symbols specifically
- support for pruning
- symbol for WeightedFactorVariable (instead of only 0/1)
- add an interface for variable symbols (with VariableName property)
- handle correctly in gradient views
- handle correctly in mathematical expression view
- handle correctly in ERC view (create linear regression model)
- handle correctly in symbolic classification - solution comparison
- handle correctly in OneR
Open issues which are not strictly necessary for a first merge of the functionality: