Opened 8 years ago

Closed 7 years ago

#2650 closed feature request (done)

Support for categorical variables (R factors) for symbolic regression with GP

Reported by: gkronber Owned by: gkronber
Priority: medium Milestone: HeuristicLab 3.3.15
Component: Problems.DataAnalysis.Symbolic Version: 3.3.14
Keywords: Cc:

Description

We frequently encounter regression / classification problems where the dataset contains categorical variables. It would be great if such variables could be used directly within symbolic regression models.

Attachments (2)

FactorsSimplificationError.hl (28.7 KB) - added by mkommend 7 years ago.
FactorsSimplificationError - whole solution.hl (31.6 KB) - added by mkommend 7 years ago.

Change History (104)

comment:1 Changed 8 years ago by gkronber

r14232:14233: created a feature branch for #2650 (support for categorical variables in symb reg) with a first set of changes (work in progress).

TODO:

  • handle correctly in all formatters (Smalltalk formatter and external evaluation formatter have not been adjusted)
  • view for factor variables (configuration of actually allowed factors)
  • create a set of unit tests for the simplifier (handle correctly in simplifier)
  • extend simplifier to handle BinaryFactorVariable
  • extend simplifier to combine FactorVariables with BinaryFactorVariable
  • handle correctly in variable impacts view
  • handle correctly in Non-linear regression (infix parser and infix formatter)
  • support in all analyzers which handle variable symbols specifically
  • support for pruning
  • symbol for WeightedFactorVariable (instead of only 0/1)
  • add an interface for variable symbols (with VariableName property)
  • handle correctly in gradient views
  • handle correctly in mathematical expression view
  • handle correctly in ERC view (create linear regression model)
  • handle correctly in symbolic classification - solution comparison
  • handle correctly in OneR

Open issues which are not strictly necessary for a first merge of the functionality:

  • support string variables in data preprocessing view
  • allow factor variables in decision trees (and therefore GBT)?
  • allow string variables as target variables in classification algorithms
  • Switch/Case symbol with one subtree for each possible factor value
  • handle correctly in SymbolicDataAnalysisExpressionTreeILEmittingInterpreter and SymbolicDataAnalysisExpressionCompiledTreeInterpreter (done: tree and linear interpreter)
  • support in more algs?

comment:3 Changed 8 years ago by gkronber

r14237:14238 :

  • added weight for FactorVariable (necessary for LR)
  • introduced VariableBase and VariableTreeNodeBase and IVariableSymbol
  • support for factors in LR
  • extended variable impacts in solution view
  • fixed ERC view for regression
  • support for FactorVariable in simplifier
  • improved support for FactorVariable in constants optimizer
  • multiple related changes and small fixes

comment:4 Changed 8 years ago by gkronber

r14239: #2650: merged r14234:14236 from trunk to branch

comment:5 Changed 8 years ago by gkronber

Shouldn't the variable impacts view be added as a solution view instead of an extra button?

comment:6 Changed 8 years ago by gkronber

r14240: added support for categorical variables to LDA and MNL


comment:7 Changed 8 years ago by gkronber

r14241: added support for factor variables in specific solution comparison view for symbolic classification solutions

comment:8 Changed 8 years ago by gkronber

  • Status changed from new to accepted

comment:9 Changed 8 years ago by gkronber

  • Version changed from 3.3.14 to branch

comment:10 Changed 8 years ago by gkronber

r14242: added support for factor variables to OneR algorithm

comment:11 Changed 8 years ago by gkronber

r14243: renamed FactorVariable -> BinaryFactorVariable

comment:12 Changed 8 years ago by gkronber

r14248: added support for factor variables to target variation view together with Philipp

comment:13 Changed 8 years ago by gkronber

r14249: added new symbol FactorVariable (renamed previous symbol to BinaryFactorVariable) Work in progress.

comment:14 Changed 8 years ago by gkronber

r14251:

  • extended non-linear regression to work with factors
  • fixed bugs in constants optimizer and tree interpreter
  • improved simplification of factor variables
  • added support for factors to ERC view
  • added support for factors to solution comparison view
  • activated view for all factors

comment:15 Changed 8 years ago by gkronber

r14259: added support for factor variables to the Excel formatter and Excel exporter as well as to the LaTeX formatter and consequently the mathematical representation view.
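
For illustration only, the nested-IF idea mentioned for the Excel export could look roughly like the following. This is a Python sketch, not the HeuristicLab formatter; the helper name `excel_nested_if`, the fallback value `0`, and the comma argument separator (which depends on the Excel locale) are assumptions.

```python
from typing import Dict


def excel_nested_if(cell: str, weights: Dict[str, float]) -> str:
    """Build a nested Excel IF expression mapping string values to constants.

    Hypothetical helper: values that match no level fall through to 0.
    """
    expr = "0"
    # Build from the innermost IF outwards so the first level is tested first.
    for level, weight in reversed(list(weights.items())):
        escaped = level.replace('"', '""')  # Excel escapes quotes by doubling them
        expr = f'IF({cell}="{escaped}",{weight!r},{expr})'
    return expr


formula = excel_nested_if("A1", {"red": 1.5, "blue": -2.0})
```

Each string value gets its own branch, so the exported cell formula reproduces the per-value constant of a factor node without any helper columns.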

comment:16 Changed 8 years ago by gkronber

r14266: improved handling of factors in ConstantOptimizationEvaluator (create binary indicators only once)

comment:17 Changed 8 years ago by gkronber

  • r14276: merged r14244 from trunk to branch
  • r14277: merged r14245:14273 from trunk to branch (fixing conflicts in RegressionSolutionTargetResponseGradientView)

comment:18 Changed 8 years ago by gkronber

Bugs:

  • Exception when showing the simplifier view after simplification of the tree (it seems some nodes are not cloned).
  • Exception when trying to open data preprocessing view for a ProblemData object stored in a solution

comment:19 Changed 8 years ago by gkronber

r14330: merged r14282:14322 from trunk to branch (fixing conflicts)

comment:20 Changed 8 years ago by gkronber

r14331: fixed compilation errors after merge

comment:21 Changed 8 years ago by gkronber

r14339: fixed bug in simplification of factor symbols

comment:22 Changed 8 years ago by gkronber

r14351: merged r14332:14350 from trunk to branch

comment:23 Changed 8 years ago by gkronber

r14399: merged r14352:14376 from trunk to branch (resolving conflicts in SymbolicDataAnalysisExpressionLatexFormatter)

comment:24 Changed 8 years ago by gkronber

r14401: merged r14378:14400 from trunk to branch

comment:25 Changed 8 years ago by gkronber

r14402: fixed a bug in constant optimizer in relation to lagged variables

comment:26 Changed 8 years ago by gkronber

r14403: added support for factor variables to C# formatter

comment:27 Changed 8 years ago by gkronber

r14421: merged r14405:14418 from trunk to branch

comment:28 Changed 8 years ago by gkronber

Should be finished before #2697

comment:29 Changed 8 years ago by gkronber

r14449: merged r14422:14443 from trunk to branch (resolving conflicts)

comment:30 Changed 8 years ago by gkronber

r14497: updated mergeinfo to record the merged changesets r14422:14443 (happened in r14449)

comment:31 Changed 8 years ago by gkronber

r14498: merged r14457:14494 from trunk to branch (resolving conflicts)

comment:32 Changed 8 years ago by gkronber

r14499: updated mergeinfo to record the merged changesets r14244:14273 (happened in r14276 and r14277)

comment:33 Changed 8 years ago by gkronber

r14501: better handling of variable names (as identifiers) and fixed some bugs

comment:34 Changed 8 years ago by gkronber

r14502: another small fix in the C# formatter

comment:35 Changed 8 years ago by gkronber

r14534: added simplifier unit tests for factor symbols

comment:36 Changed 8 years ago by gkronber

r14535: worked on simplifier

comment:37 Changed 8 years ago by gkronber

r14539: extended simplifier to pass all new unit tests for factors and binary factors

comment:38 Changed 8 years ago by gkronber

r14540: fixed bugs in simplifier (causing multiple references to the same tree nodes within a tree)

comment:39 Changed 8 years ago by gkronber

Features to review / test:

  • Symbolic Regression:
    • String variables are visible in the input variables list in the problem data view
    • Grammar for symbolic regression contains two new symbols (Factor and BinaryFactor). The symbols are activated if string inputs are selected in the problem data view.
    • For string variables the variable frequency analyzer can also analyze references to specific variable values
    • Simplification of factor and binary factor nodes
    • Constant opt of factor and binary factor nodes
    • LaTeX formatter and view for mathematical representation shows all constant values for factor variables.
    • Excel exporter and formatter use nested if to produce the correct constants for string variables
    • Infix formatter produces an expression referencing string variables that can be used directly for NLR
    • C# formatter uses helper methods to produce the correct constants for string variables, C# formatter uses encoded identifier names for all variables.
  • NLR:
    • The function can also reference string variables in the same way as double variables.
    • Factor symbols are used for string variables. Factor weights are tuned with constant opt.
  • LR:
    • If string variables are selected as input variables the algorithm includes binary variables for each string variable value (using the BinaryFactor symbol)
    • If the string variable only contains two distinct values, only one binary variable is generated (variables are 100% correlated)
  • General:
    • Variable impact calculation also includes string variables (replace by most common value)
    • For string variables the error characteristics curve also produces the LR base line model where binary variables are generated for each string variable value
    • Partial dependency plots show bar charts for models which reference string variables
    • Partial dependency plots show bar charts with confidence intervals for models which produce an estimated variance and which reference string variables
      1. generate a symb. reg. model
      2. show the visualization for a symbolic regression model with factors
      3. generate a Gaussian process model (without factors)
      4. drag the Gaussian process solution onto the visualization for the symbolic regression model
  • Classification:
    • LDA: if string variables are allowed as inputs binary factors are generated for each variable value (similar to LR)
    • OneR: allows splitting by string variable values
    • Multi-nomial Logit supports string variables
    • Solution comparison view for classification supports string variables because OneR, LDA, and Multi-nomial Logit support string variables

comment:40 Changed 8 years ago by gkronber

r14541: better wording in exception string

comment:41 Changed 8 years ago by gkronber

r14542: merged r14504:14533 from trunk to branch

comment:42 Changed 8 years ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

comment:43 Changed 8 years ago by mkommend

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.15

comment:44 Changed 8 years ago by gkronber

r14554: removed warnings

comment:45 Changed 8 years ago by gkronber

r14560: added a method for exporting models as Excel expressions with given variable mapping (for convenience)

comment:46 follow-up: Changed 8 years ago by mkommend

Notes:

  • Merge conflicts with current trunk (r14587).
  • Will elastic net be adapted to work with factors (similar to LR)?
  • Why are categorical (string) variables displayed last in the problem data (input variables)?
  • Alglib util: use Any() to check for empty enumerables instead of .Count() == 0.

LR works as expected. TBC.

comment:47 Changed 8 years ago by gkronber

r14589: merged r14548:14582 from trunk to branch

comment:48 Changed 8 years ago by gkronber

r14590: created a branch for the reintegration into trunk

comment:49 Changed 8 years ago by gkronber

r14591: record merge info for (changeset 14353 and reverse merge 14354)

r14592: updated merge info to record the changesets r14378,r14390,r14391,r14393,r14394,r14396, and the reverse merge r14400


comment:50 in reply to: ↑ 46 Changed 8 years ago by gkronber

Replying to mkommend:

Notes:

  • Merge conflicts with current trunk (r14587).

After r14591 and r14592, reintegration into the trunk works.

comment:51 Changed 8 years ago by gkronber

r14593: deleted branch again

comment:53 Changed 8 years ago by mkommend

r14615: Minor changes in factors branch (sealed one factor classes).

comment:54 Changed 8 years ago by mkommend

r14693: Fixed ordering of variables in problem data.

comment:55 Changed 8 years ago by mkommend

r14701: Switched definition of vertical and horizontal concatenation of matrices. (cf. https://de.mathworks.com/help/matlab/ref/vertcat.html#examples)


comment:56 Changed 8 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Review comments

  • IDataAnalysisProblemData still contains a TODO comment, where all usages should be checked. (gkronber: checked and removed with r14763.)
  • RegressionSolutionVariableImpactsCalculator: Calculation for factors
    • The impact of a factor is calculated based on evaluating all combinations and ignores the configured replacement method!
    • median and average should use the mode
    • shuffle would work the same
    • noise should calculate a distribution and sample from it (similar to shuffle?)
    • noise is more or less useless (only for normally distributed numbers)
    • (gkronber: r14762: added option to specify replacement method for factor variables. However, the solution is different from the solution proposed above. Please check.)
  • ConstantOptimization should use dictionaries instead of arrays ([]) for variable names and values (r14756: improved code for handling variables in the constant optimizer by using a dictionary)
  • C# Formatter VariableName2Identifier method using the bytes from the encoding produces unreadable code. Just use the variable name and maybe replace whitespaces with underscores. If it does not compile the user should handle the remaining issues manually. (gkronber: solved with r14720)
  • Infix Expression Formatter does not handle factor variable symbols correctly (the parser does not handle them either) (gkronber: solved with r14761)
  • ComplexityCalculator does not handle BinaryFactorVariables (gkronber: fixed with r14760)
  • BinaryFactorVariable and FactorVariableSymbol are identical (gkronber: discussed with mko and we decided that this should be ok.)
  • FactorVariableTreeNode uses linear search in GetValue (gkronber: solved with r14717,r14719)
  • BinaryFactorVariableTreeNode ShakeLocalParameters: why is the variable name only changed with a 20% probability? (gkronber: r14758 unified mutation behaviour for all variable tree nodes. Introduced parameter for probability of changing a variable. r14759 added a way to set the probability of variable changes via the GUI)
  • Shaking in the different types of VariableTreeNodes should work at least similarly (reuse of weights, variable name changes lead to completely new weights, ...) (gkronber: see r14758 and r14759)
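
To illustrate the GetValue remark above, a factor node can be thought of as one weight per value of a string variable, looked up by key rather than found by a linear scan. The following Python sketch is illustrative only; the class names, the row representation, and the choice to return 0.0 for unknown values are assumptions and do not reflect the actual FactorVariableTreeNode implementation.

```python
from typing import Dict, Sequence


class FactorNode:
    """Hypothetical factor node: one weight per value of a string variable."""

    def __init__(self, variable: str, levels: Sequence[str], weights: Sequence[float]):
        if len(levels) != len(weights):
            raise ValueError("need exactly one weight per level")
        self.variable = variable
        # A dictionary gives constant-time lookup instead of scanning the level list.
        self.weight_of: Dict[str, float] = dict(zip(levels, weights))

    def evaluate(self, row: Dict[str, str]) -> float:
        # Unknown levels contribute 0.0 in this sketch (an assumption).
        return self.weight_of.get(row[self.variable], 0.0)


class BinaryFactorNode:
    """Hypothetical binary factor node: weight if the variable equals one value, else 0."""

    def __init__(self, variable: str, level: str, weight: float):
        self.variable = variable
        self.level = level
        self.weight = weight

    def evaluate(self, row: Dict[str, str]) -> float:
        return self.weight if row[self.variable] == self.level else 0.0


factor = FactorNode("color", ["red", "blue"], [1.5, -2.0])
binary = BinaryFactorNode("color", "red", 0.75)
```

With a row such as `{"color": "blue"}`, `factor.evaluate` returns the weight stored for "blue" and `binary.evaluate` returns 0.0 because the row does not match "red".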

Optional remarks

  • Will elastic net be adapted to work with factors (similar to LR)?
  • Alglib algs (LR,LDA, multinomial logit, ...) create two matrices for double and binary variables and merge these two afterwards. Wouldn't it be more efficient to create the whole matrix in one pass (without the two intermediate matrices)?
  • ILEmittingInterpreter changes have no effect and actually lead to less descriptive error messages (revert?) (gkronber: reverted with r14715)
  • Are the data densities in the gradient chart correct? (Just checked this; I believe they are.)
  • Simplifier can combine a constant with a FactorVariable if it's beneath an addition or subtraction (w_x + c) (gkronber: This is already supported and a unit test exists. Added another unit test for subtraction with r14716.)
  • There are remaining TODO items in comment:1:ticket:2650 (gkronber: checked and made some more changes: r14764, r14766)
  • Be careful when reintegrating the branch due to SVN issues (replaced and copied files).

comment:57 Changed 8 years ago by gkronber

r14751: merged r14597:14737 from trunk to branch

comment:58 Changed 8 years ago by gkronber

r14752: merged r14738 from trunk to branch


comment:59 Changed 8 years ago by gkronber

r14753: merged r14740 from trunk to branch

comment:60 Changed 8 years ago by gkronber

r14754: merged r14748 from trunk to branch (change set contains file renames lower case to upper case)

comment:61 Changed 8 years ago by gkronber

r14755: merged r14749:14750 from trunk to branch (commit message of r14755 is incorrect)


comment:62 Changed 8 years ago by gkronber

r14764: adapted formatters to handle factor symbols

comment:63 Changed 8 years ago by gkronber

r14765: added support for negative weights for parsing expressions with factors

comment:64 Changed 8 years ago by gkronber

r14766: reviewed and tested all analyzers and made some smaller changes

comment:65 Changed 8 years ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:66 Changed 8 years ago by mkommend

r14812: Adapted ArithmeticExpressionGrammar to set subtree count of factor variables explicitly.

comment:67 Changed 8 years ago by mkommend

r14813: Removed commented code from constant optimization evaluator.

comment:68 Changed 8 years ago by mkommend

r14814: Removed misleading comment from C# formatter.

comment:69 Changed 8 years ago by mkommend

r14815: Corrected comment in VariableTreeNodeBase.

comment:70 Changed 8 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Reviewed all changes in this ticket! *phew*

Last review comments:

  • Infix parser cannot handle binary factor variables. The round trip model -> text -> model does not work.
  • Should the bug in the manipulation of variable weights be fixed?

After the mentioned defect is fixed, the branch is ready for reintegration into trunk. Please be careful that the white-space changes (e.g. "if (.." -> "if(..", or "foreach (..." -> "foreach(...") are not introduced into the trunk. You somehow use different formatting settings than everyone else.


comment:71 Changed 8 years ago by gkronber

r14823: fixed round-trip for binary factor variables (and formatting changes)

comment:72 Changed 8 years ago by gkronber

r14824: added a TODO comment for the mutation bug

comment:73 Changed 8 years ago by gkronber

r14825: merged r14769:14820 from trunk to branch to prepare for branch reintegration

comment:74 Changed 8 years ago by gkronber

r14826: merged the factors branch into trunk

comment:75 Changed 8 years ago by gkronber

r14827: removed a plugin dependency and added a plugin dependency

comment:76 Changed 8 years ago by gkronber

r14829: reformatting of VariableBase

comment:77 Changed 8 years ago by gkronber

r14830: sealed variable symbol classes

comment:78 Changed 8 years ago by gkronber

r14831: made backwards-compatible code for mutation of variables more obvious in VariableTreeNodeBase

comment:79 Changed 8 years ago by gkronber

r14832: adapted GP unit tests so that they produce the same outcomes as before reintegration of the factors branch

comment:80 Changed 8 years ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:81 Changed 8 years ago by mkommend

  • Status changed from reviewing to readytorelease

Reviewed r14823, r14827, r14829, r14830, r14831, and r14832.

comment:82 Changed 8 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from readytorelease to reviewing

comment:83 Changed 8 years ago by gkronber

r14866: deleted branch as it has been integrated into trunk

comment:84 Changed 8 years ago by gkronber

  • Owner changed from gkronber to mkommend

comment:85 Changed 8 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to readytorelease

comment:86 Changed 8 years ago by gkronber

Depends on other tickets which must be merged first:

  • Constant Opt: #2686
  • DataAnalysis.Views-3.4.csproj: #2529, #2718, #2759
  • DataPreprocessing.Views-3.4.csproj: #2698 (factors branch does not really depend on this but a changeset from the factors branch includes a change from #2698)

comment:87 Changed 8 years ago by gkronber

This is only blocked by #2698 now.

#2529 and #2718 have been merged. #2686 is almost ready for merging.


comment:88 Changed 8 years ago by gkronber

Unfortunately, r14826 cannot be easily merged from trunk to stable.

It is probably best to first merge to stable all changesets that were made to the trunk before r14826.

A comparison of stable with trunk shows that changesets (before r14826) which are associated with the following tickets have not yet been merged from trunk to stable:

  • #2255 (corrected and merged),
  • #2432 (corrected)
  • #2433 (corrected)
  • #2435 (corrected, merge & reverse merge, applied to stable)
  • #2442 (corrected)
  • #2445 (corrected)
  • #2446 (corrected)
  • #2451 (corrected)
  • #2457 (mainly branch development, but includes trunk change r13593 corrected: r13593 should be released with #2560)
  • #2470 (corrected)
  • #2477 (corrected)
  • #2480 (corrected)
  • #2524 (PausableAlg, merged to stable)
  • #2526 (just record the merge info and ignore the conflicting change?)
  • #2547 (merged to stable)
  • #2560 (merged to stable)
  • #2581 (MCTS for symb reg removed from trunk (changes so far have been merged to stable))
  • #2588 (OKB solution download and upload, merged to stable)
  • #2594 (corrected)
  • #2651 (igraph, tree conflicts)
  • #2660 (Variable network instances)
  • #2690 (Views for random forests and gradient boosted trees)

Interestingly, this includes a number of already closed tickets (with the following unmerged changes):


comment:89 Changed 8 years ago by abeham

r12811 has already been merged to stable (r13192), but probably the merge was not recorded.

comment:90 Changed 8 years ago by abeham

Do not merge this to stable again:

r14988: recorded merge of revisions 12770,12772,12811,12812,12836,12837,12907,12971 in stable

comment:91 Changed 7 years ago by gkronber

  • Owner changed from gkronber to architects
  • Status changed from readytorelease to assigned

comment:92 Changed 7 years ago by mkommend

The FactorPartialDependencePlot still throws a NotImplementedException in RemoveSolutionAsync, which results in a compilation warning in the trunk!

A possible implementation would be something like this:

    public async Task RemoveSolutionAsync(IRegressionSolution solution) {
      // nothing to do if the solution is not shown in this plot
      if (!solutions.Remove(solution))
        return;

      // drop the cached series and confidence-interval series of the solution
      seriesCache.Remove(solution);
      ciSeriesCache.Remove(solution);

      await RecalculateAsync();
      var args = new EventArgs<IRegressionSolution>(solution);
      OnSolutionRemoved(this, args);
    }

which is blatantly copied from the PartialDependencePlot and not tested.

Changed 7 years ago by mkommend

Changed 7 years ago by mkommend

comment:93 Changed 7 years ago by mkommend

The simplifier does not always handle factor variables correctly. The first attachment is a minimal example showing an evaluation difference after automatic simplification. The second file is the whole solution, of which the first is a subset.

comment:94 Changed 7 years ago by gkronber

Thanks for spotting this!

r15053: fixes the bug in the simplifier when simplifying a sum of factors with a constant (+ f1 f1 f2 c) -> (+ 2f1 f2 c)
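
As a rough illustration of the rewrite (+ f1 f1 f2 c) -> (+ 2f1 f2 c), the following Python sketch merges factor terms on the same variable by adding their per-value weights and folds constants into one term. It is not the HeuristicLab simplifier; the tuple representation of a factor and the function names are assumptions made for the example.

```python
from typing import Dict, List, Tuple, Union

Factor = Tuple[str, Dict[str, float]]  # (variable name, weight per value)
Term = Union[Factor, float]            # a factor node or a constant


def simplify_sum(terms: List[Term]) -> List[Term]:
    """Merge factors on the same variable and fold all constants into one."""
    merged: Dict[str, Dict[str, float]] = {}
    constant = 0.0
    for term in terms:
        if isinstance(term, tuple):
            name, weights = term
            acc = merged.setdefault(name, {})
            for level, w in weights.items():
                acc[level] = acc.get(level, 0.0) + w  # f1 + f1 -> 2 f1, per value
        else:
            constant += term
    result: List[Term] = [(name, w) for name, w in merged.items()]
    if constant != 0.0:
        result.append(constant)
    return result


def evaluate_sum(terms: List[Term], row: Dict[str, str]) -> float:
    """Evaluate a sum of factor terms and constants for one data row."""
    total = 0.0
    for term in terms:
        if isinstance(term, tuple):
            name, weights = term
            total += weights[row[name]]
        else:
            total += term
    return total


f1: Factor = ("x", {"a": 1.0, "b": 2.0})
f2: Factor = ("y", {"u": 0.5, "v": -0.5})
original: List[Term] = [f1, f1, f2, 3.0]
simplified = simplify_sum(original)
```

A simplification of this kind is only valid if the original and the simplified expression evaluate identically for every row, which is the property the attached minimal example showed to be violated before the fix.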

comment:95 Changed 7 years ago by gkronber

  • Owner changed from architects to mkommend
  • Status changed from assigned to reviewing

r15054: added an implementation for RemoveSolutionAsync in PDP for factors. Untested because this cannot be triggered from the UI.

comment:96 Changed 7 years ago by gkronber

  • Owner changed from mkommend to gkronber

comment:97 Changed 7 years ago by gkronber

  • Version changed from branch to 3.3.14

comment:98 Changed 7 years ago by mkommend

Reviewed r15053 and r15054. Everything works well and the described issues are resolved.

comment:99 Changed 7 years ago by gkronber

  • Status changed from reviewing to readytorelease

comment:100 Changed 7 years ago by gkronber

Merging r14826 from trunk to stable leads to the following tree conflicts:

  • DataTableControl and ScatterPlotControl: These files have been renamed in r14982 (#2713). The change has already been merged to stable (out of order).
  • HeuristicLab.ExtLibs/HeuristicLab.Igraph. This folder has been created with r14234 (#2651), which has not yet been merged to stable.
  • HeuristicLab.Tests/HeuristicLab.IGraph. This folder has been created with r14244 (#2651), which has not yet been merged to stable.

comment:101 Changed 7 years ago by gkronber

r15131: merged r14826 from trunk to stable. The only remaining conflict is DataTableControl and ScatterPlotControl which have been renamed within r14982 (-> tree conflict).

comment:102 Changed 7 years ago by gkronber

r15132: merged r14827, r14829:14832 from trunk to stable

comment:103 Changed 7 years ago by gkronber

r15148: merged r15053,r15054 from trunk to stable (all changesets merged)

comment:104 Changed 7 years ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed