Opened 5 years ago
Closed 4 years ago
#2650 closed feature request (done)
Support for categorical variables (R factors) for symbolic regression with GP
Reported by: | gkronber | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Problems.DataAnalysis.Symbolic | Version: | 3.3.14 |
Keywords: | Cc: |
Description
We frequently encounter regression / classification problems where the dataset contains categorical variables. It would be great if such variables could be used directly within symbolic regression models.
Attachments (2)
Change History (104)
comment:1 Changed 5 years ago by gkronber
comment:3 Changed 5 years ago by gkronber
- added weight for FactorVariable (necessary for LR)
- introduced VariableBase and VariableTreeNodeBase and IVariableSymbol
- support for factors in LR
- extended variable impacts in solution view
- fixed ERC view for regression
- support for FactorVariable in simplifier
- improved support for FactorVariable in constants optimizer
- multiple related changes and small fixes
comment:4 Changed 5 years ago by gkronber
r14239: #2650: merged r14234:14236 from trunk to branch
comment:5 Changed 5 years ago by gkronber
Shouldn't the variable impacts view be added as a solution view instead of an extra button?
comment:6 Changed 5 years ago by gkronber
r14240: added support for categorical variables to LDA and MNL
comment:7 Changed 5 years ago by gkronber
r14241: added support for factor variables in specific solution comparison view for symbolic classification solutions
comment:8 Changed 5 years ago by gkronber
- Status changed from new to accepted
comment:9 Changed 5 years ago by gkronber
- Version changed from 3.3.14 to branch
comment:10 Changed 5 years ago by gkronber
r14242: added support for factor variables to OneR algorithm
comment:11 Changed 5 years ago by gkronber
r14243: renamed FactorVariable -> BinaryFactorVariable
comment:12 Changed 5 years ago by gkronber
r14248: added support for factor variables to target variation view together with Philipp
comment:13 Changed 5 years ago by gkronber
r14249: added new symbol FactorVariable (renamed previous symbol to BinaryFactorVariable) Work in progress.
comment:14 Changed 5 years ago by gkronber
- extended non-linear regression to work with factors
- fixed bugs in constants optimizer and tree interpreter
- improved simplification of factor variables
- added support for factors to ERC view
- added support for factors to solution comparison view
- activated view for all factors
comment:15 Changed 5 years ago by gkronber
r14259: added support for factor variables to Excel formatter and Excel exporter as well as to the Latex formatter and consequently the mathematical representation view.
comment:16 Changed 5 years ago by gkronber
r14266: improved handling of factors in ConstantOptimizationEvaluator (create binary indicators only once)
comment:17 Changed 5 years ago by gkronber
- r14276: merged r14244 from trunk to branch
- r14277: merged r14245:14273 from trunk to branch (fixing conflicts in RegressionSolutionTargetResponseGradientView)
comment:18 Changed 5 years ago by gkronber
Bugs:
comment:19 Changed 5 years ago by gkronber
r14330: merged r14282:14322 from trunk to branch (fixing conflicts)
comment:20 Changed 5 years ago by gkronber
r14331: fixed compilation errors after merge
comment:21 Changed 4 years ago by gkronber
r14339: fixed bug in simplification of factor symbols
comment:22 Changed 4 years ago by gkronber
r14351: merged r14332:14350 from trunk to branch
comment:23 Changed 4 years ago by gkronber
r14399: merged r14352:14376 from trunk to branch (resolving conflicts in SymbolicDataAnalysisExpressionLatexFormatter)
comment:24 Changed 4 years ago by gkronber
r14401: merged r14378:14400 from trunk to branch
comment:25 Changed 4 years ago by gkronber
r14402: fixed a bug in constant optimizer in relation to lagged variables
comment:26 Changed 4 years ago by gkronber
r14403: added support for factor variables to C# formatter
comment:27 Changed 4 years ago by gkronber
r14421: merged r14405:14418 from trunk to branch
comment:28 Changed 4 years ago by gkronber
Should be finished before #2697
comment:29 Changed 4 years ago by gkronber
r14449: merged r14422:14443 from trunk to branch (resolving conflicts)
comment:30 Changed 4 years ago by gkronber
r14497: updated mergeinfo to record the merged changesets r14422:14443 (happened in r14449)
comment:31 Changed 4 years ago by gkronber
r14498: merged r14457:14494 from trunk to branch (resolving conflicts)
comment:32 Changed 4 years ago by gkronber
r14499: updated mergeinfo to record the merged changesets r14244:14273 (happened in r14276 and r14277)
comment:33 Changed 4 years ago by gkronber
r14501: better handling of variable names (as identifiers) and fixed some bugs
comment:34 Changed 4 years ago by gkronber
r14502: another small fix in the C# formatter
comment:35 Changed 4 years ago by gkronber
r14534: added simplifier unit tests for factor symbols
comment:36 Changed 4 years ago by gkronber
r14535: worked on simplifier
comment:37 Changed 4 years ago by gkronber
r14539: extended simplifier to pass all new unit tests for factors and binary factors
comment:38 Changed 4 years ago by gkronber
r14540: fixed bugs in simplifier (causing multiple references to the same tree nodes within a tree)
comment:39 Changed 4 years ago by gkronber
Features to review / test:
- Symbolic Regression:
  - String variables are visible in the input variables list in the problem data view
  - Grammar for symbolic regression contains two new symbols (Factor and BinaryFactor). The symbols are activated if string inputs are selected in the problem data view.
  - For string variables the variable frequency analyzer can also analyze references to specific variable values
  - Simplification of factor and binary factor nodes
  - Constant opt of factor and binary factor nodes
  - LaTeX formatter and view for mathematical representation show all constant values for factor variables.
  - Excel exporter and formatter use nested if to produce the correct constants for string variables
  - Infix formatter produces an expression referencing string variables that can be used directly for NLR
  - C# formatter uses helper methods to produce the correct constants for string variables. C# formatter uses encoded identifier names for all variables.
- NLR:
  - The function can also reference string variables in the same way as double variables.
  - Factor symbols are used for string variables. Factor weights are tuned with constant opt.
- LR:
  - If string variables are selected as input variables the algorithm includes binary variables for each string variable value (using the BinaryFactor symbol)
  - If the string variable only contains two distinct values, only one binary variable is generated (variables are 100% correlated)
- General:
  - Variable impact calculation also includes string variables (replace by most common value)
  - For string variables the error characteristics curve also produces the LR base line model where binary variables are generated for each string variable value
  - Partial dependency plots show bar charts for models which reference string variables
  - Partial dependency plots show bar charts with confidence intervals for models which produce an estimated variance and which reference string variables
    - generate a symb. reg. model
    - show the visualization for a symbolic regression model with factors
    - generate a Gaussian process model (without factors)
    - drag the Gaussian process solution onto the visualization for the symbolic regression model
- Classification:
  - LDA: if string variables are allowed as inputs binary factors are generated for each variable value (similar to LR)
  - OneR: allows splitting by string variable values
  - Multi-nomial Logit supports string variables
  - Solution comparison view for classification supports string variables because OneR, LDA, and Multi-nomial Logit support string variables
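The LR and LDA items above describe turning each string variable into binary indicator variables. A minimal, language-neutral sketch of that encoding in Python follows; it is not HeuristicLab's C# implementation. The function name `binary_indicators` and the choice of which level to keep in the two-value case are assumptions made for illustration.

```python
def binary_indicators(values):
    """Return {level: [0.0 or 1.0 per row]} for one string column.

    Hypothetical helper for illustration; not part of HeuristicLab.
    """
    levels = sorted(set(values))
    # With exactly two distinct values the second indicator equals
    # 1 - first (perfectly correlated), so a single indicator suffices.
    # Keeping the first level in sort order is an arbitrary choice here.
    if len(levels) == 2:
        levels = levels[:1]
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in levels}


color = ["red", "blue", "red", "green"]   # three levels -> three indicators
sex = ["m", "f", "f", "m"]                # two levels -> one indicator

color_indicators = binary_indicators(color)
sex_indicators = binary_indicators(sex)
```

Each indicator column can then be passed to a linear model like any other numeric input.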
comment:40 Changed 4 years ago by gkronber
r14541: better wording in exception string
comment:41 Changed 4 years ago by gkronber
r14542: merged r14504:14533 from trunk to branch
comment:42 Changed 4 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
comment:43 Changed 4 years ago by mkommend
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.15
comment:44 Changed 4 years ago by gkronber
r14554: removed warnings
comment:45 Changed 4 years ago by gkronber
r14560: added a method for exporting models as Excel expressions with given variable mapping (for convenience)
comment:46 follow-up: ↓ 50 Changed 4 years ago by mkommend
Notes:
- Merge conflicts with current trunk (r14587).
- Will elastic net be adapted to work with factors (similar to LR)?
- Why are categorical (string) variables displayed last in the problem data (input variables)?
- Alglib util: use Any() to check for empty enumerables instead of .Count() == 0.
LR works as expected. TBC.
comment:47 Changed 4 years ago by gkronber
r14589: merged r14548:14582 from trunk to branch
comment:48 Changed 4 years ago by gkronber
r14590: created a branch for the reintegration into trunk
comment:49 Changed 4 years ago by gkronber
comment:50 in reply to: ↑ 46 Changed 4 years ago by gkronber
comment:51 Changed 4 years ago by gkronber
r14593: deleted branch again
comment:53 Changed 4 years ago by mkommend
r14615: Minor changes in factors branch (sealed one factor classes).
comment:54 Changed 4 years ago by mkommend
r14693: Fixed ordering of variables in problem data.
comment:55 Changed 4 years ago by mkommend
r14701: Switched definition of vertical and horizontal concatenation of matrices. (cf. https://de.mathworks.com/help/matlab/ref/vertcat.html#examples)
comment:56 Changed 4 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Review comments
- IDataAnalysisProblemData still contains a TODO comment, where all usages should be checked. (gkronber: checked and removed with r14763.)
- RegressionSolutionVariableImpactsCalculator: Calculation for factors
  - The impact of a factor is calculated based on evaluating all combinations and ignores the configured replacement method!
  - median and average should use the mode
  - shuffle would work the same
  - noise should calculate a distribution and sample from it (similar to shuffle?)
  - noise is more or less useless (only for normally distributed numbers)
  - (gkronber: r14762: added option to specify replacement method for factor variables. However, the solution is different from the solution proposed above. Please check.)
- ConstantOptimization should use Dictionaries instead of [] for variable names and values
  - r14756 improved code for handling variables in the constant optimizer by using a dictionary
- C# Formatter VariableName2Identifier method using the bytes from the encoding produces unreadable code. Just use the variable name and maybe replace whitespaces with underscores. If it does not compile the user should handle the remaining issues manually. (gkronber: solved with r14720)
- Infix Expression Formatter does not handle factor variable symbols correctly (the parser does not handle them either) (gkronber: solved with r14761)
- ComplexityCalculator does not handle BinaryFactorVariables (gkronber: fixed with r14760)
- BinaryFactorVariable and FactorVariableSymbol are identical (gkronber: discussed with mko and we decided that this should be ok.)
- FactorVariableTreeNode uses linear search in GetValue (gkronber: solved with r14717, r14719)
- BinaryFactorVariableTreeNode ShakeLocalParameters, why is the variable name only changed with a 20% probability? (gkronber: r14758 unified mutation behaviour for all variable tree nodes. Introduced parameter for probability of changing a variable. r14759 added a way to set the probability of variable changes via the GUI)
- Shaking in the different types of VariableTreeNodes should work at least similarly (reuse of weights, variable name changes lead to completely new weights, ...) (gkronber: see r14758 and r14759)
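The review item on variable impacts proposes that, for factors, the median and average replacement methods should use the mode. The Python sketch below illustrates that idea only. It is not the RegressionSolutionVariableImpactsCalculator code, and it assumes the impact is measured as the increase in mean squared error after replacement, which may differ from the quality measure HeuristicLab uses. All names (`mode`, `mse`, `factor_impact`, the toy model) are hypothetical.

```python
from collections import Counter


def mode(values):
    """Most common value of a string column."""
    return Counter(values).most_common(1)[0][0]


def mse(model, rows, targets):
    """Mean squared error of model predictions over all rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def factor_impact(model, rows, targets, name):
    """Increase in MSE when the factor `name` is replaced by its mode.

    Assumption: impact = error after replacement - original error.
    """
    most_common = mode([r[name] for r in rows])
    replaced = [{**r, name: most_common} for r in rows]
    return mse(model, replaced, targets) - mse(model, rows, targets)


# Toy model: one weight per factor level plus a linear numeric term.
site_weights = {"a": 1.0, "b": 3.0}


def model(row):
    return site_weights[row["site"]] + 2.0 * row["x"]


rows = [{"site": "a", "x": 0.0}, {"site": "a", "x": 1.0}, {"site": "b", "x": 0.0}]
targets = [model(r) for r in rows]

impact = factor_impact(model, rows, targets, "site")
```

In this toy data the mode of `site` is "a", so only the third row changes its prediction, and the impact is the resulting error averaged over the three rows.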
Optional remarks
- Will elastic net be adapted to work with factors (similar to LR)?
- Alglib algs (LR,LDA, multinomial logit, ...) create two matrices for double and binary variables and merge these two afterwards. Wouldn't it be more efficient to create the whole matrix in one pass (without the two intermediate matrices)?
- ILEmittingInterpreter changes have no effect and actually lead to less descriptive error messages (revert?) (gkronber: reverted with r14715)
- Are the data densities in the gradient chart correct? (just checked this; I believe they are.)
- Simplifier can combine a constant with a FactorVariable if it's beneath an addition or subtraction (w_x + c) (gkronber: This is already supported and a unit test exists. Added another unit test for subtraction with r14716.)
- There are remaining TODO items in comment:1:ticket:2650 (gkronber: checked and made some more changes: r14764, r14766)
- Be careful when reintegrating the branch due to SVN issues (replaced and copied files).
comment:57 Changed 4 years ago by gkronber
r14751: merged r14597:14737 from trunk to branch
comment:58 Changed 4 years ago by gkronber
comment:59 Changed 4 years ago by gkronber
comment:60 Changed 4 years ago by gkronber
comment:61 Changed 4 years ago by gkronber
r14755: merged r14749:14750 from trunk to branch (commit message of r14755 is incorrect)
comment:62 Changed 4 years ago by gkronber
r14764: adapted formatters to handle factor symbols
comment:63 Changed 4 years ago by gkronber
r14765: added support for negative weights for parsing expressions with factors
comment:64 Changed 4 years ago by gkronber
r14766: reviewed and tested all analyzers and made some smaller changes
comment:65 Changed 4 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:66 Changed 4 years ago by mkommend
r14812: Adapted ArithmeticExpressionGrammar to set subtree count of factor variables explicitly.
comment:67 Changed 4 years ago by mkommend
r14813: Removed commented code from constant optimization evaluator.
comment:68 Changed 4 years ago by mkommend
r14814: Removed misleading comment from C# formatter.
comment:69 Changed 4 years ago by mkommend
r14815: Corrected comment in VariableTreeNodeBase.
comment:70 Changed 4 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Reviewed all changes in this ticket! *phew*
Last review comments:
- Infix parser cannot handle binary factor variables. The round trip model -> text -> model does not work.
- Should the bug in the manipulation of variable weights be fixed?
After the mentioned defect is fixed, the branch is ready for reintegration into trunk. Please be careful so that all the white-space changes (e.g. "if (..." -> "if(...", or "foreach (..." -> "foreach(...") will not be introduced in the trunk. You somehow use different formatting settings than everyone else.
comment:71 Changed 4 years ago by gkronber
r14823: fixed round-trip for binary factor variables (and formatting changes)
comment:72 Changed 4 years ago by gkronber
r14824: added a TODO comment for the mutation bug
comment:73 Changed 4 years ago by gkronber
r14825: merged r14769:14820 from trunk to branch to prepare for branch reintegration
comment:74 Changed 4 years ago by gkronber
r14826: merged the factors branch into trunk
comment:75 Changed 4 years ago by gkronber
r14827: removed a plugin dependency and added a plugin dependency
comment:76 Changed 4 years ago by gkronber
r14829: reformatting of VariableBase
comment:77 Changed 4 years ago by gkronber
r14830: sealed variable symbol classes
comment:78 Changed 4 years ago by gkronber
r14831: made backwards compatible code for mutation of variables more obvious in VariableTreeNodeBase
comment:79 Changed 4 years ago by gkronber
r14832: adapted GP unit tests so that they produce the same outcomes as before reintegration of the factors branch
comment:80 Changed 4 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:81 Changed 4 years ago by mkommend
- Status changed from reviewing to readytorelease
comment:82 Changed 4 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from readytorelease to reviewing
comment:83 Changed 4 years ago by gkronber
r14866: deleted branch as it has been integrated into trunk
comment:84 Changed 4 years ago by gkronber
- Owner changed from gkronber to mkommend
comment:85 Changed 4 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to readytorelease
comment:86 Changed 4 years ago by gkronber
Depends on other tickets which must be merged first:
comment:87 Changed 4 years ago by gkronber
comment:88 Changed 4 years ago by gkronber
Unfortunately, r14826 cannot be easily merged from trunk to stable.
Probably it is best to merge all changesets made to the trunk before r14826 to stable before.
A comparison of stable with trunk shows that changesets (before r14826) which are associated with the following tickets have not yet been merged from trunk to stable :
- #2255 (corrected and merged),
- #2432 (corrected)
- #2433 (corrected)
- #2435 (corrected, merge & reverse merge, applied to stable)
- #2442 (corrected)
- #2445 (corrected)
- #2446 (corrected)
- #2451 (corrected)
- #2457 (mainly branch development, but includes trunk change r13593; corrected: r13593 should be released with #2560)
- #2470 (corrected)
- #2477 (corrected)
- #2480 (corrected)
- #2524 (PausableAlg, merged to stable)
- #2526 (just record the merge info and ignore the conflicting change?)
- #2547 (merged to stable)
- #2560 (merged to stable)
- #2581 (MCTS for symb reg; removed from trunk (changes so far have been merged to stable))
- #2588 (OKB solution download and upload, merged to stable)
- #2594 (corrected)
- #2651 (igraph, tree conflicts)
- #2660 (Variable network instances)
- #2690 (Views for random forests and gradient boosted trees)
Interestingly, this includes a number of already closed tickets (with the following unmerged changes):
- #2432: r12770 (has been merged but not recorded)
- #2433: r12772 (has been merged but not recorded)
- #2442:
- #2445: r12811 (merged but not recorded)
- #2446: r12836, r12812 (both merged but not recorded)
- #2451: r12837 (merged but not recorded)
- #2470: r12907 (merged but not recorded)
- #2477: r12971 (merged but not recorded)
- #2480: r12973, r12977 (merged and corresponding reverse merge but seemingly not marked as such) (done, see #2640)
- #2526 (Release):
  - r14208, (should be merged to stable, not only svn:ignore!) (not so easy because of a later changeset which conflicts)
  - r14187, (never merged to stable but same change to stable with r14188)
  - r14185, (never merged to stable but same change to stable with r14168)
  - r14171, (never merged to stable but same change to stable with r14183)
- #2594: r14160 (should be merged to stable!) (done, see #2640)
comment:89 Changed 4 years ago by abeham
comment:90 Changed 4 years ago by abeham
Do not merge this to stable again:
r14988: recorded merge of revisions 12770,12772,12811,12812,12836,12837,12907,12971 in stable
comment:91 Changed 4 years ago by gkronber
- Owner changed from gkronber to architects
- Status changed from readytorelease to assigned
comment:92 Changed 4 years ago by mkommend
The FactorPartialDependencePlot still throws a NotImplementedException in RemoveSolutionAsync, which results in a compilation warning in the trunk!
A possible implementation would be something like this:
    public async Task RemoveSolutionAsync(IRegressionSolution solution) {
      if (!solutions.Remove(solution)) return;
      seriesCache.Remove(solution);
      ciSeriesCache.Remove(solution);
      await RecalculateAsync();
      var args = new EventArgs<IRegressionSolution>(solution);
      OnSolutionRemoved(this, args);
    }
which is blatantly copied from the PartialDependencePlot and not tested.
Changed 4 years ago by mkommend
Changed 4 years ago by mkommend
comment:93 Changed 4 years ago by mkommend
The simplifier does not always handle factor variables correctly. The first attachment is a minimal example showing an evaluation difference after automatic simplification. The second file is the whole solution, of which the first is a subset.
comment:94 Changed 4 years ago by gkronber
Thanks for spotting this!
r15053: fixes the bug in the simplifier when simplifying a sum of factors with a constant (+ f1 f1 f2 c) -> (+ 2f1 f2 c)
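The rewrite (+ f1 f1 f2 c) -> (+ 2f1 f2 c) can be sketched as follows. This is an illustrative Python model, not the HeuristicLab simplifier. It assumes that a factor node carries one weight per level of its variable, so that two references to the same factor merge by adding their weights level by level. The tuple representation and the function names are hypothetical.

```python
def simplify_sum(terms):
    """Merge the terms of a sum: factors on the same variable add their
    per-level weights; numeric constants are summed into one constant.

    A factor term is (variable_name, {level: weight}); a constant is a float.
    """
    factors = {}      # variable name -> merged {level: weight}
    constant = 0.0
    for term in terms:
        if isinstance(term, (int, float)):
            constant += term
        else:
            name, weights = term
            merged = factors.setdefault(name, {})
            for level, w in weights.items():
                merged[level] = merged.get(level, 0.0) + w
    return factors, constant


def evaluate_terms(terms, row):
    """Evaluate the original, unsimplified sum for one data row."""
    total = 0.0
    for term in terms:
        if isinstance(term, (int, float)):
            total += term
        else:
            name, weights = term
            total += weights[row[name]]
    return total


def evaluate_simplified(factors, constant, row):
    """Evaluate the simplified sum for one data row."""
    return constant + sum(w[row[name]] for name, w in factors.items())


f1 = ("f1", {"a": 1.0, "b": 2.0})
f2 = ("f2", {"x": 0.5, "y": -0.5})
c = 3.0
terms = [f1, f1, f2, c]                  # (+ f1 f1 f2 c)

factors, constant = simplify_sum(terms)  # (+ 2f1 f2 c)
```

A simplification of this kind is only valid if both forms evaluate identically on every row, which is the property the attached minimal example showed to be violated before the fix.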
comment:95 Changed 4 years ago by gkronber
- Owner changed from architects to mkommend
- Status changed from assigned to reviewing
r15054: added an implementation for RemoveSolutionAsync in PDP for factors. Untested because this cannot be triggered from the UI.
comment:96 Changed 4 years ago by gkronber
- Owner changed from mkommend to gkronber
comment:97 Changed 4 years ago by gkronber
- Version changed from branch to 3.3.14
comment:98 Changed 4 years ago by mkommend
comment:99 Changed 4 years ago by gkronber
- Status changed from reviewing to readytorelease
comment:100 Changed 4 years ago by gkronber
Merging r14826 from trunk to stable leads to the following tree conflicts:
- DataTableControl and ScatterPlotControl: These files have been renamed in r14982 (#2713). The change has already been merged to stable (out of order).
- HeuristicLab.ExtLibs/HeuristicLab.Igraph: This folder has been created with r14234 (#2651), which has not yet been merged to stable.
- HeuristicLab.Tests/HeuristicLab.IGraph: This folder has been created with r14244 (#2651), which has not yet been merged to stable.
comment:101 Changed 4 years ago by gkronber
comment:102 Changed 4 years ago by gkronber
r15132: merged r14827, r14829:14832 from trunk to stable
comment:103 Changed 4 years ago by gkronber
comment:104 Changed 4 years ago by gkronber
- Resolution set to done
- Status changed from readytorelease to closed
r14232:14233 : created a feature branch for #2650 (support for categorical variables in symb reg) with a first set of changes work in progress...
TODO:
- handle correctly in all formatters (Smalltalk formatter and external evaluation formatter have not been adjusted)
- view for factor variables (configuration of actually allowed factors)
- create a set of unit tests for the simplifier (handle correctly in simplifier)
- extend simplifier to handle BinaryFactorVariable
- extend simplifier to combine FactorVariables with BinaryFactorVariable
- handle correctly in variable impacts view
- handle correctly in Non-linear regression (infix parser and infix formatter)
- support in all analyzers which handle variable symbols specifically
- support for pruning
- symbol for WeightedFactorVariable (instead of only 0/1)
- add an interface for variable symbols (with VariableName property)
- handle correctly in gradient views
- handle correctly in mathematical expression view
- handle correctly in ERC view (create linear regression model)
- handle correctly in symbolic classification - solution comparison
- handle correctly in OneR
Open issues which are not strictly necessary for a first merge of the functionality: