Opened 10 months ago
Last modified 8 days ago
#2650 readytorelease feature request
Support for categorical variables (R factors) for symbolic regression with GP
Reported by: | gkronber | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Problems.DataAnalysis.Symbolic | Version: | branch |
Keywords: | Cc: |
Description
We frequently encounter regression / classification problems where the dataset contains categorical variables. It would be great if such variables could be used directly within symbolic regression models.
Change History (88)
comment:1 Changed 10 months ago by gkronber
comment:3 Changed 10 months ago by gkronber
- added weight for FactorVariable (necessary for LR)
- introduced VariableBase and VariableTreeNodeBase and IVariableSymbol
- support for factors in LR
- extended variable impacts in solution view
- fixed ERC view for regression
- support for FactorVariable in simplifier
- improved support for FactorVariable in constants optimizer
- multiple related changes and small fixes
comment:4 Changed 10 months ago by gkronber
r14239: #2650: merged r14234:14236 from trunk to branch
comment:5 Changed 10 months ago by gkronber
Shouldn't the variable impacts view be added as a solution view instead of an extra button?
comment:6 Changed 10 months ago by gkronber
r14240: added support for categorical variables to LDA and MNL
comment:7 Changed 10 months ago by gkronber
r14241: added support for factor variables in specific solution comparison view for symbolic classification solutions
comment:8 Changed 10 months ago by gkronber
- Status changed from new to accepted
comment:9 Changed 10 months ago by gkronber
- Version changed from 3.3.14 to branch
comment:10 Changed 10 months ago by gkronber
r14242: added support for factor variables to OneR algorithm
comment:11 Changed 10 months ago by gkronber
r14243: renamed FactorVariable -> BinaryFactorVariable
comment:12 Changed 10 months ago by gkronber
r14248: added support for factor variables to target variation view together with Philipp
comment:13 Changed 10 months ago by gkronber
r14249: added new symbol FactorVariable (renamed previous symbol to BinaryFactorVariable) Work in progress.
comment:14 Changed 10 months ago by gkronber
- extended non-linear regression to work with factors
- fixed bugs in constants optimizer and tree interpreter
- improved simplification of factor variables
- added support for factors to ERC view
- added support for factors to solution comparison view
- activated view for all factors
comment:15 Changed 9 months ago by gkronber
r14259: added support for factor variables to Excel formatter and Excel exporter as well as to the Latex formatter and consequently the mathematical representation view.
comment:16 Changed 9 months ago by gkronber
r14266: improved handling of factors in ConstantOptimizationEvaluator (create binary indicators only once)
comment:17 Changed 9 months ago by gkronber
- r14276: merged r14244 from trunk to branch
- r14277: merged r14245:14273 from trunk to branch (fixing conflicts in RegressionSolutionTargetResponseGradientView)
comment:18 Changed 8 months ago by gkronber
Bugs:
comment:19 Changed 8 months ago by gkronber
r14330: merged r14282:14322 from trunk to branch (fixing conflicts)
comment:20 Changed 8 months ago by gkronber
r14331: fixed compilation errors after merge
comment:21 Changed 7 months ago by gkronber
r14339: fixed bug in simplification of factor symbols
comment:22 Changed 7 months ago by gkronber
r14351: merged r14332:14350 from trunk to branch
comment:23 Changed 6 months ago by gkronber
r14399: merged r14352:14376 from trunk to branch (resolving conflicts in SymbolicDataAnalysisExpressionLatexFormatter)
comment:24 Changed 6 months ago by gkronber
r14401: merged r14378:14400 from trunk to branch
comment:25 Changed 6 months ago by gkronber
r14402: fixed a bug in constant optimizer in relation to lagged variables
comment:26 Changed 6 months ago by gkronber
r14403: added support for factor variables to C# formatter
comment:27 Changed 6 months ago by gkronber
r14421: merged r14405:14418 from trunk to branch
comment:28 Changed 6 months ago by gkronber
Should be finished before #2697
comment:29 Changed 6 months ago by gkronber
r14449: merged r14422:14443 from trunk to branch (resolving conflicts)
comment:30 Changed 5 months ago by gkronber
r14497: updated mergeinfo to record the merged changesets r14422:14443 (happened in r14449)
comment:31 Changed 5 months ago by gkronber
r14498: merged r14457:14494 from trunk to branch (resolving conflicts)
comment:32 Changed 5 months ago by gkronber
r14499: updated mergeinfo to record the merged changesets r14244:14273 (happened in r14276 and r14277)
comment:33 Changed 5 months ago by gkronber
r14501: better handling of variable names (as identifiers) and fixed some bugs
comment:34 Changed 5 months ago by gkronber
r14502: another small fix in the C# formatter
comment:35 Changed 5 months ago by gkronber
r14534: added simplifier unit tests for factor symbols
comment:36 Changed 5 months ago by gkronber
r14535: worked on simplifier
comment:37 Changed 5 months ago by gkronber
r14539: extended simplifier to pass all new unit tests for factors and binary factors
comment:38 Changed 5 months ago by gkronber
r14540: fixed bugs in simplifier (causing multiple references to the same tree nodes within a tree)
comment:39 Changed 5 months ago by gkronber
Features to review / test:
- Symbolic Regression:
  - String variables are visible in the input variables list in the problem data view
  - Grammar for symbolic regression contains two new symbols (Factor and BinaryFactor). The symbols are activated if string inputs are selected in the problem data view.
  - For string variables the variable frequency analyzer can also analyze references to specific variable values
  - Simplification of factor and binary factor nodes
  - Constant opt of factor and binary factor nodes
  - LaTeX formatter and view for mathematical representation show all constant values for factor variables.
  - Excel exporter and formatter use nested IFs to produce the correct constants for string variables
  - Infix formatter produces an expression referencing string variables that can be used directly for NLR
  - C# formatter uses helper methods to produce the correct constants for string variables. C# formatter uses encoded identifier names for all variables.
- NLR:
  - The function can also reference string variables in the same way as double variables.
  - Factor symbols are used for string variables. Factor weights are tuned with constant opt.
- LR:
  - If string variables are selected as input variables, the algorithm includes binary variables for each string variable value (using the BinaryFactor symbol)
  - If the string variable only contains two distinct values, only one binary variable is generated (variables are 100% correlated)
- General:
  - Variable impact calculation also includes string variables (replace by most common value)
  - For string variables the error characteristics curve also produces the LR baseline model where binary variables are generated for each string variable value
  - Partial dependency plots show bar charts for models which reference string variables
  - Partial dependency plots show bar charts with confidence intervals for models which produce an estimated variance and which reference string variables
    - generate a symb. reg. model
    - show the visualization for a symbolic regression model with factors
    - generate a Gaussian process model (without factors)
    - drag the Gaussian process solution onto the visualization for the symbolic regression model
- Classification:
  - LDA: if string variables are allowed as inputs, binary factors are generated for each variable value (similar to LR)
  - OneR: allows splitting by string variable values
  - Multinomial Logit supports string variables
  - Solution comparison view for classification supports string variables because OneR, LDA, and Multinomial Logit support string variables
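The binary-variable encoding described above for LR and LDA can be sketched as follows. This is a minimal Python illustration, not HeuristicLab code: the helper name `binary_indicators` is hypothetical, and which of the two values is kept in the two-value case is an assumption.

```python
def binary_indicators(values):
    """Encode a string column as 0/1 indicator columns (hypothetical helper).

    Returns a dict mapping each encoded category to its indicator column.
    Assumption: with exactly two distinct values only the first one (in
    sorted order) gets a column, because the second column would carry
    the same information.
    """
    categories = sorted(set(values))
    if len(categories) == 2:
        categories = categories[:1]
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}


three = binary_indicators(["red", "green", "blue", "red"])
two = binary_indicators(["yes", "no", "yes"])
print(three)  # one column per value
print(two)    # a single column for a two-valued variable
```

Each indicator column can then be treated like an ordinary numeric input by a linear model.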
comment:40 Changed 5 months ago by gkronber
r14541: better wording in exception string
comment:41 Changed 5 months ago by gkronber
r14542: merged r14504:14533 from trunk to branch
comment:42 Changed 5 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
comment:43 Changed 5 months ago by mkommend
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.15
comment:44 Changed 5 months ago by gkronber
r14554: removed warnings
comment:45 Changed 5 months ago by gkronber
r14560: added a method for exporting models as Excel expressions with given variable mapping (for convenience)
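As noted in comment:39, the Excel formatter uses nested IFs to produce the constants for string variables. The sketch below shows what such an expression could look like. It is an illustration only: the function `factor_to_excel`, the cell reference, and the fallback value 0 for unknown categories are assumptions, and the actual output of the HeuristicLab formatter may differ.

```python
def factor_to_excel(cell, weights):
    """Build a nested Excel IF expression for one factor variable.

    cell    -- Excel cell reference holding the string value (e.g. "A1")
    weights -- dict mapping each category to its constant
    Assumption: categories not listed in weights evaluate to 0.
    """
    expr = "0"
    # Wrap from the last category outwards so the first one is tested first.
    for category, weight in reversed(list(weights.items())):
        expr = f'IF({cell}="{category}", {weight}, {expr})'
    return expr


formula = factor_to_excel("A1", {"red": 1.5, "green": -0.25})
print("=" + formula)
```

Here the `cell` argument stands in for a variable mapping that assigns the string variable to column A.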
comment:46 follow-up: ↓ 50 Changed 4 months ago by mkommend
Notes:
- Merge conflicts with current trunk (r14587).
- Will elastic net be adapted to work with factors (similar to LR)?
- Why are categorical (string) variables displayed last in the problem data (input variables)?
- Alglib util: use Any() to check for empty enumerables instead of .Count() == 0.
LR works as expected. TBC.
comment:47 Changed 4 months ago by gkronber
r14589: merged r14548:14582 from trunk to branch
comment:48 Changed 4 months ago by gkronber
r14590: created a branch for the reintegration into trunk
comment:49 Changed 4 months ago by gkronber
comment:50 in reply to: ↑ 46 Changed 4 months ago by gkronber
comment:51 Changed 4 months ago by gkronber
r14593: deleted branch again
comment:53 Changed 4 months ago by mkommend
r14615: Minor changes in factors branch (sealed one factor classes).
comment:54 Changed 3 months ago by mkommend
r14693: Fixed ordering of variables in problem data.
comment:55 Changed 3 months ago by mkommend
r14701: Switched definition of vertical and horizontal concatenation of matrices. (cf. https://de.mathworks.com/help/matlab/ref/vertcat.html#examples)
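For reference, the convention used by the linked MATLAB documentation is that vertical concatenation stacks rows and horizontal concatenation appends columns. A small Python sketch of both, using plain nested lists (the function names `vert_cat` and `horz_cat` are illustrative and are not the names used in the HeuristicLab code):

```python
def vert_cat(a, b):
    """Stack b below a; both must have the same number of columns."""
    if a and b and len(a[0]) != len(b[0]):
        raise ValueError("column counts differ")
    return [row[:] for row in a] + [row[:] for row in b]


def horz_cat(a, b):
    """Place b to the right of a; both must have the same number of rows."""
    if len(a) != len(b):
        raise ValueError("row counts differ")
    return [ra + rb for ra, rb in zip(a, b)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(vert_cat(a, b))  # 4 rows, 2 columns
print(horz_cat(a, b))  # 2 rows, 4 columns
```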
comment:56 Changed 3 months ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Review comments
- IDataAnalysisProblemData still contains a TODO comment, where all usages should be checked. (gkronber: checked and removed with r14763.)
- RegressionSolutionVariableImpactsCalculator: Calculation for factors
  - The impact of a factor is calculated based on evaluating all combinations and ignores the configured replacement method!
  - median and average should use the mode
  - shuffle would work the same
  - noise should calculate a distribution and sample from it (similar to shuffle?)
  - noise is more or less useless (only for normally distributed numbers)
  - (gkronber: r14762: added option to specify replacement method for factor variables. However, the solution is different from the solution proposed above. Please check.)
- ConstantOptimization should use Dictionaries instead of [] for variable names and values
  - r14756 improved code for handling variables in the constant optimizer by using a dictionary
- C# Formatter VariableName2Identifier method using the bytes from the encoding produces unreadable code. Just use the variable name and maybe replace whitespace with underscores. If it does not compile, the user should handle the remaining issues manually. (gkronber: solved with r14720)
- Infix Expression Formatter does not handle factor variable symbols correctly (the parser does not handle them either) (gkronber: solved with r14761)
- ComplexityCalculator does not handle BinaryFactorVariables (gkronber: fixed with r14760)
- BinaryFactorVariable and FactorVariableSymbol are identical (gkronber: discussed with mko and we decided that this should be ok.)
- FactorVariableTreeNode uses linear search in GetValue (gkronber: solved with r14717, r14719)
- BinaryFactorVariableTreeNode ShakeLocalParameters: why is the variable name only changed with a 20% probability? (gkronber: r14758 unified mutation behaviour for all variable tree nodes. Introduced parameter for probability of changing a variable. r14759 added a way to set the probability of variable changes via the GUI)
- Shaking in the different types of VariableTreeNodes should work at least similarly (reuse of weights, variable name changes lead to completely new weights, ...) (gkronber: see r14758 and r14759)
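The review remark about linear search in GetValue can be illustrated with a small sketch. It contrasts a scan over all categories with a precomputed index. The class `FactorNode` and its methods are hypothetical stand-ins written in Python, and this is not necessarily how r14717 and r14719 solved the issue.

```python
class FactorNode:
    """Hypothetical stand-in for a factor variable tree node."""

    def __init__(self, values, weights):
        self.values = list(values)
        self.weights = list(weights)
        # Built once so that lookups do not scan the value list.
        self._index = {v: i for i, v in enumerate(self.values)}

    def get_value_linear(self, value):
        # O(k) scan over all k categories on every call.
        for i, v in enumerate(self.values):
            if v == value:
                return self.weights[i]
        return 0.0  # assumption: unknown categories contribute 0

    def get_value(self, value):
        # O(1) average-case lookup via the precomputed index.
        i = self._index.get(value)
        return self.weights[i] if i is not None else 0.0


node = FactorNode(["a", "b", "c"], [0.5, -1.0, 2.0])
results = [node.get_value(v) for v in ["c", "a", "x"]]
print(results)
```

The difference matters mostly because a tree node is evaluated once per data row, so the lookup sits in the innermost loop.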
Optional remarks
- Will elastic net be adapted to work with factors (similar to LR)?
- Alglib algs (LR,LDA, multinomial logit, ...) create two matrices for double and binary variables and merge these two afterwards. Wouldn't it be more efficient to create the whole matrix in one pass (without the two intermediate matrices)?
- ILEmittingInterpreter changes have no effect and actually lead to less descriptive error messages (revert?) (gkronber: reverted with r14715)
- Are the data densities in the gradient chart correct?
  - just checked this; I believe they are.
- Simplifier can combine a constant with a FactorVariable if it is beneath an addition or subtraction (w_x + c) (gkronber: This is already supported and a unit test exists. Added another unit test for subtraction with r14716.)
- There are remaining TODO items in comment:1:ticket:2650 (gkronber: checked and made some more changes: r14764, r14766)
- Be careful when reintegrating the branch due to SVN issues (replaced and copied files).
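The one-pass matrix construction suggested in the optional remarks could look roughly like the sketch below. It is an illustration in Python under stated assumptions: the function `build_matrix` and its parameters are hypothetical, and whether this is actually faster than merging two intermediate matrices is the open question raised by the reviewer, not something the sketch demonstrates.

```python
def build_matrix(rows, double_vars, factor_levels):
    """Build the input matrix row by row in a single pass.

    rows          -- list of dicts, one per data row
    double_vars   -- names of the numeric variables
    factor_levels -- dict: string variable name -> list of encoded levels
    """
    matrix = []
    for row in rows:
        # Numeric columns first, then one 0/1 column per encoded level.
        out = [float(row[name]) for name in double_vars]
        for name, levels in factor_levels.items():
            out.extend(1.0 if row[name] == level else 0.0 for level in levels)
        matrix.append(out)
    return matrix


rows = [
    {"x": 1.0, "color": "red"},
    {"x": 2.5, "color": "blue"},
]
matrix = build_matrix(rows, ["x"], {"color": ["blue", "red"]})
print(matrix)
```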
comment:57 Changed 2 months ago by gkronber
r14751: merged r14597:14737 from trunk to branch
comment:58 Changed 2 months ago by gkronber
comment:59 Changed 2 months ago by gkronber
comment:60 Changed 2 months ago by gkronber
comment:61 Changed 2 months ago by gkronber
r14755: merged r14749:14750 from trunk to branch (commit message of r14755 is incorrect)
comment:62 Changed 2 months ago by gkronber
r14764: adapted formatters to handle factor symbols
comment:63 Changed 2 months ago by gkronber
r14765: added support for negative weights for parsing expressions with factors
comment:64 Changed 2 months ago by gkronber
r14766: reviewed and tested all analyzers and made some smaller changes
comment:65 Changed 2 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:66 Changed 2 months ago by mkommend
r14812: Adapted ArithmeticExpressionGrammar to set subtree count of factor variables explicitly.
comment:67 Changed 2 months ago by mkommend
r14813: Removed commented code from constant optimization evaluator.
comment:68 Changed 2 months ago by mkommend
r14814: Removed misleading comment from C# formatter.
comment:69 Changed 2 months ago by mkommend
r14815: Corrected comment in VariableTreeNodeBase.
comment:70 Changed 2 months ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Reviewed all changes in this ticket! *phew*
Last review comments:
- Infix parser cannot handle binary factor variables. The round trip model -> text -> model does not work.
- Should the bug in the manipulation of variable weights be fixed?
After the mentioned defect is fixed, the branch is ready for reintegration into trunk. Please be careful so that the white-space changes (e.g. "if (..." -> "if(...", or "foreach (..." -> "foreach(...") are not introduced into the trunk. You somehow use different formatting settings than everyone else.
comment:71 Changed 8 weeks ago by gkronber
r14823: fixed round-trip for binary factor variables (and formatting changes)
comment:72 Changed 8 weeks ago by gkronber
r14824: added a TODO comment for the mutation bug
comment:73 Changed 8 weeks ago by gkronber
r14825: merged r14769:14820 from trunk to branch to prepare for branch reintegration
comment:74 Changed 8 weeks ago by gkronber
r14826: merged the factors branch into trunk
comment:75 Changed 8 weeks ago by gkronber
r14827: removed a plugin dependency and added a plugin dependency
comment:76 Changed 7 weeks ago by gkronber
r14829: reformatting of VariableBase
comment:77 Changed 7 weeks ago by gkronber
r14830: sealed variable symbol classes
comment:78 Changed 7 weeks ago by gkronber
r14831: made backwards compatible code for mutation of variables more obvious in VariableTreeNodeBase
comment:79 Changed 7 weeks ago by gkronber
r14832: adapted GP unit tests so that they produce the same outcomes as before reintegration of the factors branch
comment:80 Changed 7 weeks ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:81 Changed 7 weeks ago by mkommend
- Status changed from reviewing to readytorelease
comment:82 Changed 7 weeks ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from readytorelease to reviewing
comment:83 Changed 6 weeks ago by gkronber
r14866: deleted branch as it has been integrated into trunk
comment:84 Changed 6 weeks ago by gkronber
- Owner changed from gkronber to mkommend
comment:85 Changed 6 weeks ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to readytorelease
comment:86 Changed 6 weeks ago by gkronber
Depends on other tickets which must be merged first:
comment:87 Changed 3 weeks ago by gkronber
comment:88 Changed 2 weeks ago by gkronber
Unfortunately, r14826 cannot be easily merged from trunk to stable.
It is probably best to first merge all changesets made to the trunk before r14826 to stable.
A comparison of stable with trunk shows that changesets (before r14826) which are associated with the following tickets have not yet been merged from trunk to stable:
- #2255 (corrected and merged),
- #2288
- #2432 (corrected)
- #2433 (corrected)
- #2435 (corrected, merge & reverse merge, applied to stable)
- #2442 (corrected)
- #2445 (corrected)
- #2446 (corrected)
- #2451 (corrected)
- #2457 (mainly branch development, but includes a trunk change (r13593))
- #2470 (corrected)
- #2477 (corrected)
- #2480 (corrected)
- #2524 (PausableAlg, almost done)
- #2526 (just record the merge info and ignore the conflicting change?)
- #2547 (only one change so far, @abeham can we release this with 3.3.15?)
- #2560 (@abeham will this be released with 3.3.15?)
- #2581 (MCTS for symb reg)
- #2588 (OKB solution download and upload, released with 3.3.15?)
- #2594 (corrected)
Interestingly, this includes a number of already closed tickets (with the following unmerged changes):
- #2432: r12770 (has been merged but not recorded)
- #2433: r12772 (has been merged but not recorded)
- #2442:
- #2445: r12811 (merged but not recorded)
- #2446: r12836, r12812 (both merged but not recorded)
- #2451: r12837 (merged but not recorded)
- #2470: r12907 (merged but not recorded)
- #2477: r12971 (merged but not recorded)
- #2480: r12973, r12977 (merged and corresponding reverse merge but seemingly not marked as such) (done, see #2640)
- #2526 (Release):
  - r14208, (should be merged to stable, not only svn:ignore!) (not so easy because of a later changeset which conflicts)
  - r14187, (never merged to stable but same change to stable with r14188)
  - r14185, (never merged to stable but same change to stable with r14168)
  - r14171, (never merged to stable but same change to stable with r14183)
- #2594: r14160 (should be merged to stable!) (done, see #2640)
comment:89 Changed 2 weeks ago by abeham
comment:90 Changed 12 days ago by abeham
Do not merge this to stable again:
r14988: recorded merge of revisions 12770,12772,12811,12812,12836,12837,12907,12971 in stable
r14232:14233: created a feature branch for #2650 (support for categorical variables in symb reg) with a first set of changes (work in progress)
TODO:
- handle correctly in all formatters (Smalltalk formatter and external evaluation formatter have not been adjusted)
- view for factor variables (configuration of actually allowed factors)
- create a set of unit tests for the simplifier (handle correctly in simplifier)
- extend simplifier to handle BinaryFactorVariable
- extend simplifier to combine FactorVariables with BinaryFactorVariable
- handle correctly in variable impacts view
- handle correctly in Non-linear regression (infix parser and infix formatter)
- support in all analyzers which handle variable symbols specifically
- support for pruning
- symbol for WeightedFactorVariable (instead of only 0/1)
- add an interface for variable symbols (with VariableName property)
- handle correctly in gradient views
- handle correctly in mathematical expression view
- handle correctly in ERC view (create linear regression model)
- handle correctly in symbolic classification - solution comparison
- handle correctly in OneR
Open issues which are not strictly necessary for a first merge of the functionality: