Opened 8 years ago

Closed 2 months ago

#745 closed feature request (done)

More advanced linear regression methods with included feature selection

Reported by: gkronber
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis
Version: 3.3.14
Keywords:
Cc:

Description


Attachments (1)

Elastic-net Linear Regression (LR).hl (2.3 MB) - added by mkommend 14 months ago.
Elastic Net LR - R² Values Bug

Change History (66)

comment:1 Changed 7 years ago by gkronber

  • Component changed from LinearRegression to Algorithms.DataAnalysis
  • Priority changed from minor to trivial
  • Version changed from 3.2 to 3.3

comment:2 Changed 7 years ago by gkronber

  • Summary changed from Ridge regression for linear models to More advanced linear regression with included feature selection

comment:3 Changed 7 years ago by gkronber

  • Summary changed from More advanced linear regression with included feature selection to More advanced linear regression methods with included feature selection

comment:4 Changed 7 years ago by gkronber

  • Version changed from 3.3 to branch

comment:5 Changed 4 years ago by gkronber

  • Version changed from branch to 3.4

comment:6 Changed 4 years ago by gkronber

  • Priority changed from lowest to medium

Several ways of feature selection are possible:

  • Wrappers: Best-subset-, forward-, and backward-selection
  • Regularized models: L0, L1, L2 or combinations thereof.
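To make the "combinations thereof" concrete, here is a minimal sketch of the elastic-net penalty, which mixes the L1 and L2 penalties through a parameter alpha. The parameterization follows the glmnet documentation (alpha = 1 gives the lasso, alpha = 0 gives ridge). The function name and the example coefficients are made up for illustration.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Made-up coefficient vector, for illustration only.
coefficients = [0.5, 0.0, -2.0]

lasso_penalty = elastic_net_penalty(coefficients, lam=0.1, alpha=1.0)  # pure L1
ridge_penalty = elastic_net_penalty(coefficients, lam=0.1, alpha=0.0)  # pure L2
mixed_penalty = elastic_net_penalty(coefficients, lam=0.1, alpha=0.5)  # equal mix

print(lasso_penalty, ridge_penalty, mixed_penalty)
```

Only the L1 part can drive coefficients exactly to zero, which is why it is the part that performs feature selection.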

comment:7 follow-up: ↓ 14 Changed 4 years ago by gkronber

There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543

comment:8 Changed 15 months ago by gkronber

  • Status changed from new to accepted

comment:9 Changed 15 months ago by gkronber

It is possible to wrap the glmnet library (R implementation) to provide a large number of regularized generalized linear models. (https://cran.r-project.org/web/packages/glmnet/index.html)

comment:10 Changed 15 months ago by gkronber

r13926: created a folder for feature branch (glmnet)

comment:11 Changed 15 months ago by gkronber

r13927: import of first implementation of elastic-net LR

comment:12 Changed 15 months ago by gkronber

r13928: first version of elastic-net alg that runs in HL

comment:13 Changed 15 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing
  • Version changed from 3.4 to branch

comment:14 in reply to: ↑ 7 Changed 15 months ago by gkronber

Replying to gkronber:

There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543

There has been no activity regarding this functionality in alglib for 3 years.

comment:15 Changed 15 months ago by gkronber

r13929: fixed copying of license file and added evaluation of all models along the path on the test set

comment:16 Changed 15 months ago by gkronber

r13930: added parameter lambda to support calculation of a solution for a specific lambda value (instead of the full path)

comment:17 Changed 15 months ago by gkronber

r13931: copy local -> false

comment:18 Changed 15 months ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Should show R² values in scatter plot.

comment:19 Changed 15 months ago by gkronber

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14

comment:20 Changed 15 months ago by gkronber

  • Status changed from assigned to accepted

comment:21 Changed 15 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

r13940:

  • added scatterplot of R² values over lambda instead of line chart,
  • normalized coefficient values in coefficient path chart
  • changed parameter lambda to LogLambda
Last edited 14 months ago by mkommend
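As a rough illustration of the log parameterization: the sketch below converts a log10-scaled value to lambda and builds a log-spaced lambda path. Base 10 is taken from the review comment about log10 in this ticket. The helper names, the grid bounds and the number of steps are assumptions for illustration, not values taken from the implementation.

```python
import math


def lambda_from_log(log_lambda):
    """Convert a log10-scaled parameter value to the actual lambda."""
    return 10.0 ** log_lambda


def log_spaced_path(lambda_max, lambda_min, steps):
    """Return `steps` lambda values, evenly spaced on a log10 scale, from large to small."""
    log_max = math.log10(lambda_max)
    log_min = math.log10(lambda_min)
    step = (log_min - log_max) / (steps - 1)
    return [10.0 ** (log_max + i * step) for i in range(steps)]


# Hypothetical bounds: four values between 1 and 0.001.
path = log_spaced_path(lambda_max=1.0, lambda_min=0.001, steps=4)
print(path)
print(lambda_from_log(-2.0))
```

Specifying the parameter on a log scale makes it easy to move evenly through several orders of magnitude, which is how regularization paths are usually traversed.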

comment:22 Changed 15 months ago by gkronber

r13961: minor change to compile with current trunk

comment:23 Changed 14 months ago by mkommend

Reviewed changesets in the glmnet branch, which all look reasonable.

However, during testing I found some strange behavior with respect to the calculated R² values. The attached model yields the following R² values:

  • Results: R² train: 0.69791824732380425, R² test: 0.95628133008046667
  • Solution: R² train: 0.97482359155658100, R² test: 0.95628133008046667

The calculated R² values for the test partition are identical between the outcome of the algorithm and what the HL solution reports. However, the training R² values differ significantly. Furthermore, it looks suspicious to me that the training R² of glmnet is so much lower than the test R².

Changed 14 months ago by mkommend

Elastic Net LR - R² Values Bug

comment:24 Changed 14 months ago by mkommend

  • Milestone changed from HeuristicLab 3.3.14 to HeuristicLab 3.3.15
  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

comment:25 follow-up: ↓ 26 Changed 14 months ago by gkronber

glmnet documentation: "dev.ratio: The fraction of (null) deviance explained (for "elnet", this is the R-square) [...] Hence dev.ratio=1-dev/nulldev. [...] The NULL model refers to the intercept model"

It seems like they return R² = (1-NMSE).

The following can be observed in the attached example:

  • Pearson's R² (training) = 0.974823591556581
  • R² (train) = 0.69791824732380425 (this is the value returned by elnet)
  • NMSE (train) = 0.29704705679825893
  • 1 - NMSE (train) = 0.70295294320174107 (not equal to R² (train) !)
  • MSE (train) = 76.204598038339682
  • Constant model MSE (train) = 252.26481693524903 (via ERC)
  • MSE / MSE (constant model) = 76.2046 / 252.2648 = 0.30208
  • 1 - 0.30208 = 0.6979182

It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant

Last edited 14 months ago by gkronber

comment:26 in reply to: ↑ 25 Changed 14 months ago by gkronber

Replying to gkronber:

It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant

Yes, we calculate MSE / Variance(y), where Variance(y) is the sample variance. The attached file only has 60 training samples, so the factor (n - 1) / n between the two NMSE values is relevant here. The OnlineNMSECalculator should be adapted to use population variance (see #2628).
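The numbers from comment:25 can be checked with a few lines of arithmetic. The sketch below assumes that the constant-model MSE equals the population variance of the training targets (true when the constant is the target mean). It then derives the sample variance from that value for n = 60. The variable names are illustrative.

```python
# Values reported in comment:25 for the training partition (n = 60 samples).
n = 60
mse_model = 76.204598038339682
mse_constant = 252.26481693524903    # assumed equal to the population variance of y
nmse_reported = 0.29704705679825893  # NMSE (train) as reported in comment:25
r2_reported = 0.69791824732380425    # R² (train) as returned by elnet

# glmnet's dev.ratio for "elnet": 1 - MSE_model / MSE_constant.
r2_population = 1.0 - mse_model / mse_constant

# The sample variance divides by (n - 1) instead of n.
sample_variance = mse_constant * n / (n - 1)
nmse_sample = mse_model / sample_variance

print(r2_population)  # agrees with r2_reported
print(nmse_sample)    # agrees with nmse_reported
print(nmse_sample / (mse_model / mse_constant))  # (n - 1) / n = 59 / 60
```

With only 60 samples the two variance definitions differ by about 1.7 %, which accounts for the gap between 0.6979 and 0.7030.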

comment:27 Changed 14 months ago by gkronber

r14225: used NMSE instead of the squared Pearson correlation coefficient in the results.

comment:28 Changed 14 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:29 Changed 11 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Review comments

  • HL.Algorithms.DataAnalysis.Glmnet ships the external library glmnet (x86, x64). This is the first time that a standard HL plugin encapsulates an external library. Until now special transport plugins in the ExtLib solutions have been created for external libraries.
  • DLL handling (imports) should be extracted into a separate file.
  • License Header is missing.
  • PluginDependencies are incorrect.
  • Shouldn't the penalty parameter mention alpha somewhere in its name or description, to be consistent with the CRAN package?
  • Why is lambda specified as log10 instead of the actual value?
  • Coefficient paths => Coefficients paths or just Coefficients?
  • Line 165 trainRsq & testRsq should be renamed to NMSE (train & test).
  • The method name 'CreateElasticNetLinearRegressionSolution' is misleading, because actually no solution is created but rather the coefficients & intercept are calculated.

I have reviewed the implementation from line 217 until the end only briefly, but I tested it thoroughly using the feature selection benchmark problems and it works pretty well! During testing I got some further ideas for improvements, listed below, that are not necessary for trunk integration but would ease working with elastic net regression and improve the source code.

Suggestions and further ideas

  • A sample on the StartPage would be nice.
  • AFAIK the coefficient paths show the variable weights over different lambda values. Isn't it possible to determine the lambda value from the coefficient paths instead of from the NMSE chart? For that, the x-axis would have to show the lambda values instead of an index.
  • A pretty cool feature would be a data table that shows, as text, the lambda value, intercept, number of used variables (coefficient != 0), train & test NMSE, and all coefficients (similar to the coefficient paths when displayed as StringConvertibleMatrix). Additionally, a double-click on the row header could create a new symbolic regression solution on the fly that uses the coefficients from the selected row, which would make another elastic-net run with a fixed lambda value unnecessary.
  • PrepareData is used in lots of data analysis algorithms to interact with external libs that expect double arrays. Wouldn't it be helpful to provide methods for getting training (test) input and target (x,y) values directly inside the RegressionProblemData?
  • Lines 76-98 should be refactored and extracted into a utility class and reused from LR, glmnet, ERC. This is already pointed out by the comment above line 76.
Last edited 3 months ago by gkronber

comment:30 Changed 11 months ago by mkommend

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to accepted

comment:31 Changed 11 months ago by mkommend

r14370: Addressed some easy to implement review comments:

  • missing license header
  • renaming of variables
  • extracted DLLImports into a separate file
  • corrected plugin dependencies

Access modifiers of Glmnet can be changed to internal to hide the class completely.

comment:32 Changed 11 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from accepted to assigned

comment:33 Changed 11 months ago by gkronber

r14373:

  • using a line chart (IndexedDataTable) instead of a scatter plot
  • added number of variables (non-zero coefficients) to the line chart
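The "number of variables" series is simply the count of non-zero coefficients at each step of the path. A minimal sketch follows. The path data and the function name are made up, and the optional tolerance is an assumption rather than something the changeset is known to use.

```python
def count_nonzero(coefficients, tolerance=0.0):
    """Count the coefficients whose absolute value exceeds the tolerance."""
    return sum(1 for b in coefficients if abs(b) > tolerance)


# Made-up coefficient path: one row per lambda value (large to small),
# one column per input variable.
coefficient_path = [
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0],
    [1.1, 0.0, -0.3],
    [1.2, 0.1, -0.5],
]

variables_per_step = [count_nonzero(row) for row in coefficient_path]
print(variables_per_step)
```

As lambda decreases the penalty weakens, so more variables typically enter the model.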

comment:34 Changed 11 months ago by gkronber

  • Status changed from assigned to accepted

comment:35 Changed 11 months ago by gkronber

r14374: added number of variables as secondary axis to the chart for coefficient values and changed x-axis to log-scaled.

comment:36 Changed 11 months ago by gkronber

r14375: 'Coefficient values' -> 'Coefficients'

comment:37 Changed 11 months ago by gkronber

r14377: added alpha to description of penalty parameter

comment:38 Changed 10 months ago by gkronber

r14395: code simplification using functionality from refactored trunk and fixed lambda parameter

comment:39 Changed 10 months ago by gkronber

This branch depends on #2697.

comment:40 Changed 10 months ago by mkommend

r14461: Changed test NMSE to double.NaN (previously double.MaxValue) in the case of an OnlineCalculationError, because the NMSEs line chart cannot handle double.MaxValue and throws an exception.
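The idea of reporting "not a number" instead of a huge sentinel value can be sketched as follows. This is a hypothetical stand-alone helper, not the HL calculator. It uses the population variance and returns NaN whenever the NMSE is undefined.

```python
import math


def nmse(targets, predictions):
    """Normalized MSE (MSE / population variance of the targets); NaN if undefined."""
    n = len(targets)
    if n == 0 or n != len(predictions):
        return math.nan  # e.g. no test partition defined
    mean = sum(targets) / n
    variance = sum((y - mean) ** 2 for y in targets) / n
    if variance == 0.0:
        return math.nan  # constant target, normalization impossible
    mse = sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n
    return mse / variance


print(nmse([], []))                                # nan
print(nmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))      # 0.0, perfect prediction
print(nmse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))      # 1.0, mean model
```

A NaN can be skipped when plotting, whereas a very large finite value forces the chart to scale its axis to it.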

This happens for example when no test partition is defined.

comment:41 Changed 7 months ago by mkommend

r14674: Adapted elastic net linear regression to support basic algorithms.

This change was necessary because FixedDataAnalysisAlgorithms now derive from basic algorithms, which take a CancellationToken as parameter in the run method (r14523).

comment:42 Changed 5 months ago by gkronber

r14844: ordered rows for coefficients in data-table by total absolute value
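Ordering by total absolute value can be sketched like this. The variable names and path values are made up, and the descending order (most influential variable first) is an assumption.

```python
# Made-up coefficient paths: variable name -> coefficient at each lambda step.
coefficient_paths = {
    "x1": [0.0, 0.2, 0.5],
    "x2": [0.0, -1.0, -1.5],
    "x3": [0.0, 0.0, 0.1],
}


def total_absolute_value(path):
    """Sum of absolute coefficient values over the whole path."""
    return sum(abs(b) for b in path)


# Variables with the largest total absolute coefficient come first.
ordered_names = sorted(
    coefficient_paths,
    key=lambda name: total_absolute_value(coefficient_paths[name]),
    reverse=True,
)
print(ordered_names)
```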

comment:43 Changed 5 months ago by gkronber

r14846: copied elastic net from branch to trunk with minor changes

comment:44 Changed 5 months ago by gkronber

r14847: deleted unnecessary files

comment:45 Changed 5 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

comment:46 Changed 5 months ago by mkommend

  • Version changed from branch to 3.3.14

r14871: Corrected build path of HL.DataAnalysis.Glmnet.

comment:47 Changed 5 months ago by gkronber

I think it would be easy to add support for factors within this ticket before release.

comment:48 Changed 4 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

comment:49 Changed 4 months ago by gkronber

r15023: added support for factor variables to elastic net regression

comment:50 Changed 4 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:51 Changed 3 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

r15046: Extracted solution creation into a separate static method and renamed CreateElasticNetLinearRegressionSolution to CalculateModelCoefficients.

Reviewed all changesets.

comment:52 Changed 3 months ago by mkommend

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:53 Changed 3 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to readytorelease

Accidentally changed status to assigned instead of ready to release.

comment:54 Changed 3 months ago by mkommend

  • Owner changed from gkronber to mkommend
  • Status changed from readytorelease to assigned

Please delete the outdated branch for this feature.

comment:55 Changed 3 months ago by mkommend

Additional review comment: HL.Algorithms.DataAnalysis.Glmnet is missing the version information in the project file name.

comment:56 Changed 3 months ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from assigned to accepted

comment:57 Changed 3 months ago by gkronber

r15138: deleted branch

comment:58 Changed 3 months ago by gkronber

r15139: renamed project file for glmnet

comment:59 Changed 3 months ago by gkronber

r15146: merged r14846,r14847,r14871 from trunk to stable

Last edited 3 months ago by gkronber

comment:60 Changed 3 months ago by gkronber

r15147: merged r15023,r15046 from trunk to stable

Last edited 3 months ago by gkronber

comment:61 Changed 3 months ago by gkronber

  • Status changed from accepted to readytorelease

comment:62 Changed 3 months ago by gkronber

Depends on #2634

comment:63 Changed 3 months ago by gkronber

r15151: merged r15139 from trunk to stable

comment:64 Changed 3 months ago by gkronber

r15153: temporarily removed glmnet project because the stable branch failed to compile (#2634 needs to be merged to stable, then the glmnet project can be included again)

comment:65 Changed 2 months ago by abeham

  • Resolution set to done
  • Status changed from readytorelease to closed

r15220: merged revisions 14102, 14647, 14652, 14654, 14734, 14737, 14775, 15048, 15125, 15126 to stable, reverted changes from revision 15153
