#2650 reviewing feature request
Support for categorical variables (R factors) for symbolic regression with GP
Description
We frequently encounter regression / classification problems where the dataset contains categorical variables. It would be great if such variables could be used directly within symbolic regression models.
- added weight for FactorVariable (necessary for LR)
- introduced VariableBase and VariableTreeNodeBase and IVariableSymbol
- support for factors in LR
- extended variable impacts in solution view
- fixed ERC view for regression
- support for FactorVariable in simplifier
- improved support for FactorVariable in constants optimizer
- multiple related changes and small fixes
merged r14234:14236 from trunk to branch
Shouldn't the variable impacts view be added as a solution view instead of an extra button?
added support for categorical variables to LDA and MNL
added support for factor variables in specific solution comparison view for symbolic classification solutions
Status changed from new to accepted
Version changed from 3.3.14 to branch
added support for factor variables to OneR algorithm
renamed FactorVariable -> BinaryFactorVariable
added support for factor variables to target variation view together with Philipp
added new symbol FactorVariable (renamed previous symbol to BinaryFactorVariable). Work in progress.
- extended non-linear regression to work with factors
- fixed bugs in constants optimizer and tree interpreter
- improved simplification of factor variables
- added support for factors to ERC view
- added support for factors to solution comparison view
- activated view for all factors
added support for factor variables to Excel formatter and Excel exporter as well as to the Latex formatter and consequently the mathematical representation view.
improved handling of factors in ConstantOptimizationEvaluator (create binary indicators only once)
- r14276: merged r14244 from trunk to branch
- r14277: merged r14245:14273 from trunk to branch (fixing conflicts in RegressionSolutionTargetResponseGradientView)
Bugs:
r14330: merged r14282:14322 from trunk to branch (fixing conflicts)
r14331: fixed compilation errors after merge
fixed bug in simplification of factor symbols
r14351: merged r14332:14350 from trunk to branch
r14399: merged r14352:14376 from trunk to branch (resolving conflicts in SymbolicDataAnalysisExpressionLatexFormatter)
r14401: merged r14378:14400 from trunk to branch
fixed a bug in constant optimizer in relation to lagged variables
added support for factor variables to C# formatter
r14421 merged r14405:14418 from trunk to branch
Should be finished before #2697
r14449: merged r14422:14443 from trunk to branch (resolving conflicts)
r14497: updated mergeinfo to record the merged changesets r14422:14443 (happened in r14449)
r14498: merged r14457:14494 from trunk to branch (resolving conflicts)
r14499: updated mergeinfo to record the merged changesets r14244:14273 (happened in r14276 and r14277)
better handling of variable names (as identifiers) and fixed some bugs
another small fix in the C# formatter
added simplifier unit tests for factor symbols
worked on simplifier
extended simplifier to pass all new unit tests for factors and binary factors
fixed bugs in simplifier (causing multiple references to the same tree nodes within a tree)
Features to review / test:
- Symbolic Regression:
  - String variables are visible in the input variables list in the problem data view
  - Grammar for symbolic regression contains two new symbols (Factor and BinaryFactor). The symbols are activated if string inputs are selected in the problem data view.
  - For string variables the variable frequency analyzer can also analyze references to specific variable values
  - Simplification of factor and binary factor nodes
  - Constant opt of factor and binary factor nodes
  - LaTeX formatter and view for mathematical representation show all constant values for factor variables.
  - Excel exporter and formatter use nested if to produce the correct constants for string variables
  - Infix formatter produces an expression referencing string variables that can be used directly for NLR
  - C# formatter uses helper methods to produce the correct constants for string variables. It also uses encoded identifier names for all variables.
- NLR:
  - The function can also reference string variables in the same way as double variables.
  - Factor symbols are used for string variables. Factor weights are tuned with constant opt.
- LR:
  - If string variables are selected as input variables the algorithm includes binary variables for each string variable value (using the BinaryFactor symbol)
  - If the string variable only contains two distinct values, only one binary variable is generated (variables are 100% correlated)
- General:
  - Variable impact calculation also includes string variables (replace by most common value)
  - For string variables the error characteristics curve also produces the LR base line model where binary variables are generated for each string variable value
  - Partial dependency plots show bar charts for models which reference string variables
  - Partial dependency plots show bar charts with confidence intervals for models which produce an estimated variance and which reference string variables
    - generate a symb. reg. model
    - show the visualization for a symbolic regression model with factors
    - generate a Gaussian process model (without factors)
    - drag the Gaussian process solution onto the visualization for the symbolic regression model
- Classification:
  - LDA: if string variables are allowed as inputs binary factors are generated for each variable value (similar to LR)
  - OneR: allows splitting by string variable values
  - Multinomial Logit supports string variables
  - Solution comparison view for classification supports string variables because OneR, LDA, and Multinomial Logit support string variables
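For reviewers who want to check the LR / LDA behaviour by hand, the binary-variable expansion can be sketched roughly as follows. This is a Python illustration only, not the HeuristicLab code: the function name `binary_indicators`, the `name=value` column labels, and the choice of which indicator is kept in the two-value case are all assumptions.

```python
from typing import Dict, List


def binary_indicators(name: str, values: List[str]) -> Dict[str, List[float]]:
    """Expand one string column into 0/1 indicator columns (hypothetical helper)."""
    distinct = sorted(set(values))
    # With exactly two distinct values the two indicators carry the same
    # information, so only one is generated. Keeping the first value in
    # sorted order is an arbitrary choice made for this sketch.
    if len(distinct) == 2:
        distinct = distinct[:1]
    return {
        f"{name}={v}": [1.0 if x == v else 0.0 for x in values]
        for v in distinct
    }


three = binary_indicators("color", ["red", "blue", "red", "green"])
two = binary_indicators("sex", ["m", "f", "m"])
print(sorted(three))  # one column per distinct value
print(two)            # a single column for a two-valued variable
```

A variable with three or more distinct values yields one column per value, while a two-valued variable yields a single column.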
better wording in exception string
r14542: merged r14504:14533 from trunk to branch
comment:43 Changed 3 months ago by mkommend
removed warnings
added a method for exporting models as Excel expressions with given variable mapping (for convenience)
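As a rough illustration of the nested-if encoding for string variables combined with a variable mapping, a Python sketch could look like this. The names `excel_factor_expr` and `mapping`, the cell reference, the weights, and the fallback value 0 for unknown strings are assumptions for this example and are not taken from the exporter.

```python
from typing import Dict


def excel_factor_expr(cell: str, weights: Dict[str, float]) -> str:
    """Build a nested IF that maps the string in `cell` to its weight."""
    expr = "0"  # assumed fallback for values that have no weight
    # Wrap from the inside out so the first key ends up outermost.
    for value, weight in reversed(list(weights.items())):
        expr = f'IF({cell}="{value}",{weight},{expr})'
    return expr


# Hypothetical variable mapping: variable name -> spreadsheet cell.
mapping = {"color": "$B1"}
formula = "=" + excel_factor_expr(mapping["color"], {"red": 1.5, "blue": -0.5})
print(formula)
```

The argument separator in Excel formulas depends on the locale, so the comma used here may need to be a semicolon in some installations.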
Notes:
- Merge conflicts with current trunk (r14587).
- Will elastic net be adapted to work with factors (similar to LR)?
- Why are categorical (string) variables displayed last in the problem data (input variables)?
- Alglib util: use Any() to check for empty enumerables instead of .Count() == 0.
LR works as expected. TBC.
r14589: merged r14548:14582 from trunk to branch
r14590: created a branch for the reintegration into trunk
comment:50 in reply to: ↑ 46 Changed 2 months ago by gkronber
Minor changes in factors branch (sealed one factor classes).
Fixed ordering of variables in problem data.
Switched definition of vertical and horizontal concatenation of matrices. (cf. https://de.mathworks.com/help/matlab/ref/vertcat.html#examples)
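For reference, the MATLAB convention from the linked page can be sketched as follows: vertical concatenation stacks rows, horizontal concatenation appends columns. This is a Python illustration on lists of rows; `vert_cat` and `horz_cat` are hypothetical names, not the helper methods in the branch.

```python
from typing import List

Matrix = List[List[float]]


def vert_cat(a: Matrix, b: Matrix) -> Matrix:
    """Stack b below a; both must have the same number of columns."""
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("column counts differ")
    return [row[:] for row in a] + [row[:] for row in b]


def horz_cat(a: Matrix, b: Matrix) -> Matrix:
    """Place b to the right of a; both must have the same number of rows."""
    if len(a) != len(b):
        raise ValueError("row counts differ")
    return [ra + rb for ra, rb in zip(a, b)]


stacked = vert_cat([[1.0, 2.0]], [[3.0, 4.0]])
side_by_side = horz_cat([[1.0], [2.0]], [[3.0], [4.0]])
print(stacked)
print(side_by_side)
```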
Review comments
- IDataAnalysisProblemData still contains a TODO comment, where all usages should be checked. (gkronber: checked and removed with r14763.)
- RegressionSolutionVariableImpactsCalculator: calculation for factors
  - The impact of a factor is calculated based on evaluating all combinations and ignores the configured replacement method!
    - median and average should use the mode
    - shuffle would work the same
    - noise should calculate a distribution and sample from it (similar to shuffle?)
    - noise is more or less useless (only for normally distributed numbers)
  - (gkronber: r14762: added option to specify replacement method for factor variables. However, the solution is different from the solution proposed above. Please check.)
- ConstantOptimization should use Dictionaries instead of [] for variable names and values
  - r14756 improved code for handling variables in the constant optimizer by using a dictionary
- C# Formatter: the VariableName2Identifier method using the bytes from the encoding produces unreadable code. Just use the variable name and maybe replace whitespace with underscores. If it does not compile the user should handle the remaining issues manually. (gkronber: solved with r14720)
- Infix Expression Formatter does not handle factor variable symbols correctly (the parser does not handle them either) (gkronber: solved with r14761)
- ComplexityCalculator does not handle BinaryFactorVariables (gkronber: fixed with r14760)
- BinaryFactorVariable and FactorVariableSymbol are identical (gkronber: discussed with mko and we decided that this should be ok.)
- FactorVariableTreeNode uses linear search in GetValue (gkronber: solved with r14717:14719)
- BinaryFactorVariableTreeNode ShakeLocalParameters: why is the variable name only changed with a 20% probability? (gkronber: r14758 unified mutation behaviour for all variable tree nodes. Introduced parameter for probability of changing a variable. r14759 added a way to set the probability of variable changes via the GUI)
- Shaking in the different types of VariableTreeNodes should work at least similarly (reuse of weights, variable name changes lead to completely new weights, ...) (gkronber: see r14758 and r14759)
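To make the GetValue remark concrete, here is a minimal Python sketch that contrasts a linear search with a dictionary-based lookup for a factor node's per-value weights. The class `FactorNode`, its fields, and the NaN result for unknown values are assumptions for illustration; how r14717:14719 solved it in the branch may differ.

```python
from typing import Dict, List


class FactorNode:
    """Hypothetical factor tree node: one weight per value of a string variable."""

    def __init__(self, variable_name: str, values: List[str], weights: List[float]):
        if len(values) != len(weights):
            raise ValueError("need exactly one weight per value")
        self.variable_name = variable_name
        self.values = list(values)
        self.weights = list(weights)
        # Built once so that every later lookup is a hash lookup.
        self._index: Dict[str, int] = {v: i for i, v in enumerate(self.values)}

    def get_value_linear(self, value: str) -> float:
        # O(k) per call for k distinct values.
        for i, v in enumerate(self.values):
            if v == value:
                return self.weights[i]
        return float("nan")  # assumed behaviour for unknown values

    def get_value(self, value: str) -> float:
        # O(1) on average per call.
        i = self._index.get(value)
        return self.weights[i] if i is not None else float("nan")


node = FactorNode("color", ["red", "green", "blue"], [1.5, -0.5, 2.0])
print(node.get_value("blue"))
```

Since the node is evaluated once per data row, the per-call cost matters more than the one-time cost of building the index.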
Optional remarks
- Will elastic net be adapted to work with factors (similar to LR)?
- Alglib algs (LR, LDA, multinomial logit, ...) create two matrices for double and binary variables and merge these two afterwards. Wouldn't it be more efficient to create the whole matrix in one pass (without the two intermediate matrices)?
- ILEmittingInterpreter changes have no effect and actually lead to less descriptive error messages (revert?) (gkronber: reverted with r14715)
- Are the data densities in the gradient chart correct?
  - just checked this; I believe they are.
- Simplifier can combine a constant with a FactorVariable if it is beneath an addition or subtraction (w_x + c) (gkronber: This is already supported and a unit test exists. Added another unit test for subtraction with r14716.)
- There are remaining TODO items in comment:1:ticket:2650 (gkronber: checked and made some more changes: r14764, r14766)
- Be careful when reintegrating the branch due to SVN issues (replaced and copied files).
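The simplification rule for (w_x + c) amounts to adding the constant to every weight of the factor, since exactly one weight is selected per row. A small Python sketch of that identity follows; `fold_constant` and the dictionary representation of the weights are hypothetical, not the simplifier's actual data structures.

```python
from typing import Dict


def fold_constant(weights: Dict[str, float], c: float) -> Dict[str, float]:
    """Rewrite (w_x + c) as a single factor with weights w_x + c."""
    return {value: w + c for value, w in weights.items()}


def evaluate(weights: Dict[str, float], value: str) -> float:
    # A factor node evaluates to the weight of the observed value.
    return weights[value]


w = {"red": 1.5, "blue": -0.5}
added = fold_constant(w, 2.0)        # w_x + 2
subtracted = fold_constant(w, -2.0)  # w_x - 2 is the same rule with -c
print(added, subtracted)
```

The case c - w_x would additionally need the weights to be negated before the constant is added.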
r14751: merged r14597:14737 from trunk to branch
adapted formatters to handle factor symbols
added support for negative weights for parsing expressions with factors
reviewed and tested all analyzers and made some smaller changes
created a feature branch for #2650 (support for categorical variables in symb reg) with a first set of changes. Work in progress...
TODO:
- handle correctly in all formatters (Smalltalk formatter and external evaluation formatter have not been adjusted)
- view for factor variables (configuration of actually allowed factors)
- create a set of unit tests for the simplifier (handle correctly in simplifier)
- extend simplifier to handle BinaryFactorVariable
- extend simplifier to combine FactorVariables with BinaryFactorVariable
- handle correctly in variable impacts view
- handle correctly in Non-linear regression (infix parser and infix formatter)
- support in all analyzers which handle variable symbols specifically
- support for pruning
- symbol for WeightedFactorVariable (instead of only 0/1)
- add an interface for variable symbols (with VariableName property)
- handle correctly in gradient views
- handle correctly in mathematical expression view
- handle correctly in ERC view (create linear regression model)
- handle correctly in symbolic classification - solution comparison
- handle correctly in OneR

Open issues which are not strictly necessary for a first merge of the functionality: