Opened 8 years ago

Last modified 2 weeks ago

#745 assigned feature request

More advanced linear regression methods with included feature selection

Reported by: gkronber
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis
Version: 3.3.14
Keywords:
Cc:

Description


Attachments (1)

Elastic-net Linear Regression (LR).hl (2.3 MB) - added by mkommend 10 months ago.
Elastic Net LR - R² Values Bug

Change History (49)

comment:1 Changed 7 years ago by gkronber

  • Component changed from LinearRegression to Algorithms.DataAnalysis
  • Priority changed from minor to trivial
  • Version changed from 3.2 to 3.3

comment:2 Changed 7 years ago by gkronber

  • Summary changed from Ridge regression for linear models to More advanced linear regression with included feature selection

comment:3 Changed 7 years ago by gkronber

  • Summary changed from More advanced linear regression with included feature selection to More advanced linear regression methods with included feature selection

comment:4 Changed 7 years ago by gkronber

  • Version changed from 3.3 to branch

comment:5 Changed 4 years ago by gkronber

  • Version changed from branch to 3.4

comment:6 Changed 4 years ago by gkronber

  • Priority changed from lowest to medium

Several ways of feature selection are possible:

  • Wrappers: Best-subset-, forward-, and backward-selection
  • Regularized models: L0, L1, L2 or combinations thereof.
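
For the regularized option, a minimal sketch of the combined L1/L2 (elastic-net) penalty may help. It assumes the parameterization used by glmnet, where lambda scales the whole penalty and alpha mixes the L1 and L2 terms; the function name `elastic_net_penalty` is hypothetical and not part of HeuristicLab.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Penalty term lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2).

    alpha = 1 gives the pure L1 (lasso) penalty, alpha = 0 the pure L2
    (ridge) penalty. The parameterization is an assumption taken from glmnet.
    """
    l1 = sum(abs(b) for b in coefs)        # L1 norm of the coefficients
    l2_squared = sum(b * b for b in coefs)  # squared L2 norm
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


if __name__ == "__main__":
    b = [1.0, -2.0, 0.0]
    print(elastic_net_penalty(b, lam=1.0, alpha=1.0))  # lasso penalty
    print(elastic_net_penalty(b, lam=1.0, alpha=0.0))  # ridge penalty
```

Only the L1 part drives coefficients exactly to zero, which is what makes these models usable for feature selection.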

comment:7 follow-up: Changed 4 years ago by gkronber

There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543

comment:8 Changed 11 months ago by gkronber

  • Status changed from new to accepted

comment:9 Changed 11 months ago by gkronber

It is possible to wrap the glmnet library (R implementation) to provide a large number of regularized generalized linear models. (https://cran.r-project.org/web/packages/glmnet/index.html)

comment:10 Changed 11 months ago by gkronber

r13926: created a folder for feature branch (glmnet)

comment:11 Changed 11 months ago by gkronber

r13927: import of first implementation of elastic-net LR

comment:12 Changed 11 months ago by gkronber

r13928: first version of elastic-net alg that runs in HL

comment:13 Changed 11 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing
  • Version changed from 3.4 to branch

comment:14 in reply to: ↑ 7 Changed 11 months ago by gkronber

Replying to gkronber:

There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543

There have been no activities regarding this functionality in alglib for 3 years.

comment:15 Changed 11 months ago by gkronber

r13929: fixed copying of license file and added evaluation of all models along the path on the test set

comment:16 Changed 11 months ago by gkronber

r13930: added parameter lambda to support calculation of a solution for a specific lambda value (instead of the full path)

comment:17 Changed 11 months ago by gkronber

r13931: copy local -> false

comment:18 Changed 11 months ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Should show R² values in scatter plot.

comment:19 Changed 11 months ago by gkronber

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14

comment:20 Changed 11 months ago by gkronber

  • Status changed from assigned to accepted

comment:21 Changed 11 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

r13940:

  • added scatterplot of R² values over lambda instead of line chart,
  • normalized coefficient values in coefficient path chart
  • changed parameter lambda to LogLambda
Last edited 10 months ago by mkommend

comment:22 Changed 11 months ago by gkronber

r13961: minor change to compile with current trunk

comment:23 Changed 10 months ago by mkommend

Reviewed changesets in the glmnet branch, which all look reasonable.

However, during testing I found some strange behavior with respect to the calculated R² values. The attached model reports the following R² values:

Results:  R² train: 0.69791824732380425  R² test: 0.95628133008046667
Solution: R² train: 0.97482359155658100  R² test: 0.95628133008046667

The calculated R² values for the test partition are identical between the outcome of the algorithm and what the HL solution reports. However, the training R² values differ significantly. Furthermore, it looks suspicious to me that the training R² of the glmnet is so much lower compared to the test R².

Changed 10 months ago by mkommend

Elastic Net LR - R² Values Bug

comment:24 Changed 10 months ago by mkommend

  • Milestone changed from HeuristicLab 3.3.14 to HeuristicLab 3.3.15
  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

comment:25 follow-up: Changed 10 months ago by gkronber

glmnet documentation: "dev.ratio: The fraction of (null) deviance explained (for "elnet", this is the R-square) [...] Hence dev.ratio=1-dev/nulldev. [...] The NULL model refers to the intercept model"

It seems like they return R² = (1-NMSE).

The following can be observed in the attached example:

  • Pearson's R² (training) = 0.974823591556581
  • R² (train) = 0.69791824732380425 (this is the value returned by elnet)
  • NMSE (train) = 0.29704705679825893
  • 1 - NMSE (train) = 0.70295294320174107 (not equal to R² (train) !)
  • MSE (train) = 76.204598038339682
  • Constant model MSE (train) = 252.26481693524903 (via ERC)
  • MSE / MSE (constant model) = 76.2046 / 252.2648 = 0.30208
  • 1 - 0.30208 = 0.6979182

It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant

Last edited 10 months ago by gkronber
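
The arithmetic in the list above can be reproduced with a few lines of Python. This is only a sketch using the two MSE values quoted in this comment; the helper name `r2_from_mse` is hypothetical.

```python
# Values copied from the comment above (training partition).
MSE_TRAIN = 76.204598038339682
MSE_CONSTANT_MODEL = 252.26481693524903


def r2_from_mse(mse_model, mse_null):
    """R² as the fraction of null deviance explained: 1 - MSE_model / MSE_null."""
    return 1.0 - mse_model / mse_null


r2_train = r2_from_mse(MSE_TRAIN, MSE_CONSTANT_MODEL)

if __name__ == "__main__":
    # Should agree with the value returned by elnet (0.6979182...).
    print(r2_train)
```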

comment:26 in reply to: ↑ 25 Changed 10 months ago by gkronber

Replying to gkronber:

It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant

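Yes, we calculate MSE / Variance(y) with the sample variance, whereas the constant-model MSE corresponds to the population variance. The attached file has only 60 training samples, so the factor (n - 1) / n is relevant here: 0.30208 · 59 / 60 ≈ 0.29705, which matches the NMSE (train) above. The OnlineNMSECalculator should be adapted to use population variance (see #2628).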
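
A small sketch of why the two numbers differ, assuming the NMSE divides by the sample variance (denominator n - 1) while the constant-model MSE equals the population variance (denominator n). The constants are the values quoted in comment:25; the helper name is hypothetical.

```python
import statistics

N_TRAIN = 60
MSE_TRAIN = 76.204598038339682
MSE_CONSTANT_MODEL = 252.26481693524903  # population variance of y


def nmse_with_sample_variance(mse_model, population_variance, n):
    """MSE divided by the sample variance, expressed via the population variance."""
    sample_variance = population_variance * n / (n - 1)
    return mse_model / sample_variance


nmse_train = nmse_with_sample_variance(MSE_TRAIN, MSE_CONSTANT_MODEL, N_TRAIN)

if __name__ == "__main__":
    print(nmse_train)  # should agree with the reported NMSE (train), 0.29704...
    # The same relation on toy data, using the standard library:
    y = [1.0, 2.0, 4.0, 7.0]
    print(statistics.pvariance(y) / statistics.variance(y))  # (n - 1) / n
```

For large n the factor (n - 1) / n approaches 1, so the discrepancy only becomes visible on small training partitions such as this one.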

comment:27 Changed 10 months ago by gkronber

r14225: used NMSE instead of squared Pearson's correlation coeff as results.

comment:28 Changed 10 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to reviewing

comment:29 Changed 7 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned

Review comments

  • HL.Algorithms.DataAnalysis.Glmnet ships the external library glmnet (x86, x64). This is the first time that a standard HL plugin encapsulates an external library. Until now special transport plugins in the ExtLib solutions have been created for external libraries.
  • DLL handling (imports) should be extracted into a separate file.
  • License Header is missing.
  • PluginDependencies are incorrect.
  • Shouldn't the penalty parameter include alpha somewhere in its name or description to be consistent with the CRAN package?
  • Why is lambda specified as log10 instead of the actual value?
  • Coefficient paths => Coefficients paths or just Coefficients ?
  • Line 165 trainRsq & testRsq should be renamed to NMSE (train & test).
  • The method name 'CreateElasticNetLinearRegressionSolution' is misleading, because actually no solution is created but rather the coefficients & intercept are calculated.

I have reviewed the implementation from line 217 until the end only briefly, but I tested it thoroughly using the feature selection benchmark problems and it works pretty well! During testing I got some further ideas for improvements, listed below, that are not necessary for trunk integration but would ease working with elastic net regression and improve the source code.

Suggestions and further ideas

  • A sample on the StartPage would be nice.
  • AFAIK the coefficient paths show the variable weights over different lambda values. Isn't it possible to determine the lambda value from the coefficient paths instead of from the NMSE chart? For that, the x-axis would have to show the lambda values instead of an index.
  • A pretty cool feature would be a data table that lists, for each lambda value, the intercept, the number of used variables (coefficient != 0), train & test NMSE, and all coefficients (similar to the coefficient paths when displayed as StringConvertibleMatrix). Additionally, a double-click on a row header could create a new symbolic regression solution on the fly from the coefficients of the selected row, which would make another elastic-net run with a fixed lambda value unnecessary.
  • PrepareData is used in lots of data analysis algorithms to interact with external libs that expect double arrays. Wouldn't it be helpful to provide methods for getting training (test) input and target (x,y) values directly inside the RegressionProblemData?
  • Lines 76-98 should be refactored, extracted into a utility class, and reused from LR, glmnet, and ERC. This is already pointed out by the comment above line 76.
Last edited 7 months ago by gkronber
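
The suggested data table could be assembled roughly as follows. This is a language-neutral sketch in Python, not HeuristicLab code: the function name, the row layout, and the input format (one coefficient vector per lambda value) are assumptions for illustration, and the train/test NMSE columns are left out.

```python
def summarize_path(lambdas, intercepts, coef_path):
    """Build one table row per lambda value along the regularization path.

    coef_path[i] is assumed to hold the coefficient vector for lambdas[i].
    """
    rows = []
    for lam, intercept, coefs in zip(lambdas, intercepts, coef_path):
        rows.append({
            "lambda": lam,
            "intercept": intercept,
            # number of used variables = number of non-zero coefficients
            "n_vars": sum(1 for c in coefs if c != 0.0),
            "coefficients": list(coefs),
        })
    return rows


if __name__ == "__main__":
    table = summarize_path(
        lambdas=[1.0, 0.1, 0.01],
        intercepts=[5.0, 4.2, 4.0],
        coef_path=[[0.0, 0.0], [0.8, 0.0], [1.1, -0.3]],
    )
    for row in table:
        print(row)
```

Each row carries everything needed to build a linear model for that lambda value, which is the point of the double-click idea above.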

comment:30 Changed 7 months ago by mkommend

  • Owner changed from gkronber to mkommend
  • Status changed from assigned to accepted

comment:31 Changed 7 months ago by mkommend

r14370: Addressed some easy to implement review comments:

  • missing license header
  • renaming of variables
  • extracted DLLImports into a separate file
  • corrected plugin dependencies

Access modifiers of Glmnet can be changed to internal to hide the class completely.

comment:32 Changed 7 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from accepted to assigned

comment:33 Changed 7 months ago by gkronber

r14373:

  • using a line chart (IndexedDataTable) instead of a scatter plot
  • added number of variables (non-zero coefficients) to the line chart

comment:34 Changed 7 months ago by gkronber

  • Status changed from assigned to accepted

comment:35 Changed 7 months ago by gkronber

r14374: added number of variables as secondary axis to the chart for coefficient values and changed x-axis to log-scaled.

comment:36 Changed 7 months ago by gkronber

r14375: 'Coefficient values' -> 'Coefficients'

comment:37 Changed 7 months ago by gkronber

r14377: added alpha to description of penalty parameter

comment:38 Changed 7 months ago by gkronber

r14395: code simplification using functionality from refactored trunk and fixed lambda parameter

comment:39 Changed 6 months ago by gkronber

This branch depends on #2697.

comment:40 Changed 6 months ago by mkommend

r14461: Changed test NMSE to double.NaN (previously double.MaxValue) in the case of an OnlineCalculationError, because the NMSEs line chart cannot handle double.MaxValue and throws an exception.

This happens for example when no test partition is defined.

comment:41 Changed 3 months ago by mkommend

r14674: Adapted elastic net linear regression to support basic algorithms.

This change was necessary because FixedDataAnalysisAlgorithms now derive from basic algorithms, which take a CancellationToken as a parameter in the run method (r14523).

comment:42 Changed 7 weeks ago by gkronber

r14844: ordered rows for coefficients in data-table by total absolute value
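
A minimal sketch of such an ordering, assuming each row holds one variable's coefficients along the lambda path and that larger totals come first (the changeset message does not state the direction). The function name and data layout are hypothetical.

```python
def order_by_total_abs(coef_rows):
    """Return variable names sorted by the sum of absolute coefficient values.

    coef_rows maps a variable name to its coefficients along the path.
    Descending order is an assumption; sorted() is stable, so ties keep
    their original order.
    """
    return sorted(
        coef_rows,
        key=lambda name: sum(abs(c) for c in coef_rows[name]),
        reverse=True,
    )


if __name__ == "__main__":
    rows = {"x1": [0.0, 0.5, 1.0], "x2": [0.0, 0.0, -0.2], "x3": [0.0, -2.0, -3.0]}
    print(order_by_total_abs(rows))
```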

comment:43 Changed 7 weeks ago by gkronber

r14846: copied elastic net from branch to trunk with minor changes

comment:44 Changed 7 weeks ago by gkronber

r14847: deleted unnecessary files

comment:45 Changed 7 weeks ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

comment:46 Changed 7 weeks ago by mkommend

  • Version changed from branch to 3.3.14

r14871: Corrected build path of HL.DataAnalysis.Glmnet.

comment:47 Changed 3 weeks ago by gkronber

I think it would be easy to add support for factors within this ticket before release.

comment:48 Changed 2 weeks ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to assigned