Opened 8 years ago
Last modified 6 weeks ago
#745 accepted feature request
More advanced linear regression methods with included feature selection
Reported by: | gkronber | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.15 |
Component: | Algorithms.DataAnalysis | Version: | branch |
Keywords: | | Cc: | |
Description
Attachments (1)
Change History (42)
comment:1 Changed 7 years ago by gkronber
- Component changed from LinearRegression to Algorithms.DataAnalysis
- Priority changed from minor to trivial
- Version changed from 3.2 to 3.3
comment:2 Changed 7 years ago by gkronber
- Summary changed from Ridge regression for linear models to More advanced linear regression with included feature selection
comment:3 Changed 7 years ago by gkronber
- Summary changed from More advanced linear regression with included feature selection to More advanced linear regression methods with included feature selection
comment:4 Changed 6 years ago by gkronber
- Version changed from 3.3 to branch
comment:5 Changed 4 years ago by gkronber
- Version changed from branch to 3.4
comment:6 Changed 4 years ago by gkronber
- Priority changed from lowest to medium
comment:7 follow-up: ↓ 14 Changed 3 years ago by gkronber
There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543
comment:8 Changed 9 months ago by gkronber
- Status changed from new to accepted
comment:9 Changed 9 months ago by gkronber
It is possible to wrap the glmnet library (R implementation) to provide a large number of regularized generalized linear models. (https://cran.r-project.org/web/packages/glmnet/index.html)
comment:10 Changed 9 months ago by gkronber
r13926: created a folder for feature branch (glmnet)
comment:11 Changed 9 months ago by gkronber
r13927: import of first implementation of elastic-net LR
comment:12 Changed 9 months ago by gkronber
r13928: first version of elastic-net alg that runs in HL
comment:13 Changed 9 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
- Version changed from 3.4 to branch
comment:14 in reply to: ↑ 7 Changed 9 months ago by gkronber
Replying to gkronber:
There are plans to extend alglib into relevant directions: http://bugs.alglib.net/view.php?id=543
There has been no activity regarding this functionality in alglib for 3 years.
comment:15 Changed 9 months ago by gkronber
r13929: fixed copying of license file and added evaluation of all models along the path on the test set
comment:16 Changed 9 months ago by gkronber
r13930: added parameter lambda to support calculation of a solution for a specific lambda value (instead of the full path)
comment:17 Changed 9 months ago by gkronber
r13931: copy local -> false
comment:18 Changed 9 months ago by gkronber
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
The R² values should be shown in the scatter plot.
comment:19 Changed 9 months ago by gkronber
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14
comment:20 Changed 9 months ago by gkronber
- Status changed from assigned to accepted
comment:21 Changed 9 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
- added scatterplot of R² values over lambda instead of line chart,
- normalized coefficient values in coefficient path chart
- changed parameter lambda to LogLambda
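The coefficient-path chart discussed here plots each variable's coefficient over a grid of lambda values. A minimal sketch of how such a path behaves (plain Python with toy data and a hypothetical `lasso_cd` helper; no intercept or standardization, and not the actual glmnet implementation) is:

```python
import math

def soft_threshold(a, lam):
    # S(a, lam) = sign(a) * max(|a| - lam, 0)
    return math.copysign(max(abs(a) - lam, 0.0), a)

def lasso_cd(X, y, lam, iters=200):
    """Cyclic coordinate descent for min (1/2n)||y - Xb||^2 + lam*||b||_1
    (the lasso; elastic net adds a ridge term to the penalty)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual, excluding feature j's contribution
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / z
    return b

# toy data: y depends strongly on x0, only weakly on x1
X = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.4], [4.0, 0.0], [5.0, 0.3]]
y = [2.0, 4.1, 6.0, 7.9, 10.1]

# sweep a log-lambda grid, as the LogLambda parameter suggests:
# coefficients shrink toward zero (and drop out) as lambda grows
for loglam in [-3, -2, -1, 0, 1]:
    b = lasso_cd(X, y, 10.0 ** loglam)
    nonzero = sum(1 for c in b if abs(c) > 1e-9)
    print(loglam, [round(c, 3) for c in b], nonzero)
```

Plotting the coefficients from this sweep against log-lambda gives exactly the kind of path chart described above, with the number of non-zero coefficients decreasing as the penalty grows.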
comment:22 Changed 9 months ago by gkronber
r13961: minor change to compile with current trunk
comment:23 Changed 8 months ago by mkommend
Reviewed changesets in the glmnet branch, which all look reasonable.
However, during testing I found some strange behavior with respect to the calculated R² values. The attached model reports the following R² values:
- Results: R² train: 0.69791824732380425, R² test: 0.95628133008046667
- Solution: R² train: 0.97482359155658100, R² test: 0.95628133008046667
The calculated R² values for the test partition are identical between the outcome of the algorithm and what the HL solution reports. However, the training R² values differ significantly. Furthermore, it looks suspicious to me that the training R² of glmnet is so much lower than the test R².
comment:24 Changed 8 months ago by mkommend
- Milestone changed from HeuristicLab 3.3.14 to HeuristicLab 3.3.15
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
comment:25 follow-up: ↓ 26 Changed 8 months ago by gkronber
glmnet documentation: "dev.ratio: The fraction of (null) deviance explained (for "elnet", this is the R-square) [...] Hence dev.ratio=1-dev/nulldev. [...] The NULL model refers to the intercept model"
It seems like they return R² = (1-NMSE).
The following can be observed in the attached example:
- Pearsons R² (training) = 0.974823591556581
- R² (train) = 0.69791824732380425 (this is the value returned by elnet)
- NMSE (train) = 0.29704705679825893
- 1 - NMSE (train) = 0.70295294320174107 (not equal to R² (train) !)
- MSE (train) = 76.204598038339682
- Constant model MSE (train) = 252.26481693524903 (via ERC)
- MSE / MSE (constant model) = 76.2046 / 252.2648 = 0.30208
- 1 - 0.30208 = 0.6979182 ∎
It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant
comment:26 in reply to: ↑ 25 Changed 8 months ago by gkronber
Replying to gkronber:
It seems our calculation of NMSE is not equivalent to MSE_model / MSE_constant
Yes, we calculate MSE / Variance(y). The attached file has only 60 training samples, so the factor n / (n-1) between the sample variance and the population variance is relevant here. The OnlineNMSECalculator should be adapted to use the population variance (see #2628).
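The discrepancy can be reproduced with a small numeric sketch (plain Python, made-up toy data, not HL or glmnet code): with the population variance (n denominator) of y, 1 - NMSE equals glmnet's dev.ratio, because the MSE of the intercept-only model is exactly the population variance; with the sample variance (n - 1 denominator) the two quantities differ by a factor of n/(n-1).

```python
# Sketch with toy data: why 1 - NMSE != glmnet's dev.ratio when
# NMSE uses the sample variance of y instead of the population variance.

ys = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0]      # toy target values
preds = [2.2, 3.0, 1.5, 3.6, 2.4, 3.1]   # toy model predictions
n = len(ys)
mean_y = sum(ys) / n

mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
# MSE of the constant (intercept-only / "null") model == population variance
mse_const = sum((y - mean_y) ** 2 for y in ys) / n
sample_var = sum((y - mean_y) ** 2 for y in ys) / (n - 1)

dev_ratio = 1 - mse / mse_const  # what glmnet reports as "R-square"
nmse = mse / sample_var          # NMSE with the sample variance denominator

# the two denominators differ by n/(n-1), so the reported values differ too
print(dev_ratio, 1 - nmse)
assert abs(nmse * (n / (n - 1)) - mse / mse_const) < 1e-12
```

For large n the factor n/(n-1) is negligible, which is why the mismatch only becomes noticeable on small training partitions such as the 60-sample file attached here.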
comment:27 Changed 8 months ago by gkronber
r14225: used NMSE instead of squared Pearson's correlation coeff as results.
comment:28 Changed 8 months ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from assigned to reviewing
comment:29 Changed 5 months ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to assigned
Review comments
- HL.Algorithms.DataAnalysis.Glmnet ships the external library glmnet (x86, x64). This is the first time that a standard HL plugin encapsulates an external library. Until now special transport plugins in the ExtLib solutions have been created for external libraries.
- DLL handling (imports) should be extracted into a separate file.
- License header is missing.
- PluginDependencies are incorrect.
- Shouldn't the penalty parameter contain alpha somewhere in its name or description, for consistency with the CRAN package?
- Why is lambda specified as log10 instead of the actual value?
- Coefficient paths => Coefficients paths, or just Coefficients?
- Line 165: trainRsq & testRsq should be renamed to NMSE (train & test).
- The method name 'CreateElasticNetLinearRegressionSolution' is misleading, because no solution is actually created; rather, the coefficients & intercept are calculated.
I have only briefly reviewed the implementation from line 217 until the end, but I tested it thoroughly using the feature selection benchmark problems and it works pretty well! During testing I got some further ideas for improvements, listed below, that are not necessary for trunk integration but would ease working with elastic-net regression and improve the source code.
Suggestions and further ideas
- A sample on the StartPage would be nice.
- AFAIK the coefficient paths show the variable weights over different lambda values. Isn't it possible to determine the lambda value from the coefficient paths instead of from the NMSE chart? For that, the x-axis must be changed to show the lambda values instead of an index.
- A pretty cool feature would be a data table that textually shows the lambda value, intercept, number of used variables (coefficient != 0), train & test NMSE, and all coefficients (similar to the coefficient paths when displayed as a StringConvertibleMatrix). Additionally, double-clicking the row header could create a new symbolic regression solution on the fly that uses the coefficients from the selected row, which would make another elastic-net run with a fixed lambda value obsolete.
- PrepareData is used in lots of data analysis algorithms to interact with external libs that expect double arrays. Wouldn't it be helpful to provide methods for getting the training (test) input and target (x, y) values directly inside the RegressionProblemData?
- Lines 76-98 should be refactored, extracted into a utility class, and reused from LR, glmnet, and ERC. This was already pointed out by the comment above line 76.
comment:30 Changed 5 months ago by mkommend
- Owner changed from gkronber to mkommend
- Status changed from assigned to accepted
comment:31 Changed 5 months ago by mkommend
r14370: Addressed some easy to implement review comments:
- missing license header
- renaming of variables
- extracted DLLImports into a separate file
- corrected plugin dependencies
Access modifiers of Glmnet can be changed to internal to hide the class completely.
comment:32 Changed 5 months ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from accepted to assigned
comment:33 Changed 5 months ago by gkronber
- using a line chart (IndexedDataTable) instead of a scatter plot
- added number of variables (non-zero coefficients) to the line chart
comment:34 Changed 5 months ago by gkronber
- Status changed from assigned to accepted
comment:35 Changed 5 months ago by gkronber
r14374: added number of variables as secondary axis to the chart for coefficient values and changed x-axis to log-scaled.
comment:36 Changed 5 months ago by gkronber
r14375: 'Coefficient values' -> 'Coefficients'
comment:37 Changed 5 months ago by gkronber
r14377: added alpha to description of penalty parameter
comment:38 Changed 4 months ago by gkronber
r14395: code simplification using functionality from refactored trunk and fixed lambda parameter
comment:39 Changed 4 months ago by gkronber
This branch depends on #2697.
comment:40 Changed 4 months ago by mkommend
r14461: Changed test NMSE to double.NaN (previously double.MaxValue) in the case of an OnlineCalculationError, because the NMSE line chart cannot handle double.MaxValue and throws an exception.
This happens, for example, when no test partition is defined.
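The idea behind the r14461 change can be sketched as follows (plain Python with a hypothetical `safe_nmse` helper, not the actual HL code): on a calculation error the result becomes NaN, which a chart can simply skip, instead of a huge sentinel value that distorts or breaks the axes.

```python
import math

def safe_nmse(mse, variance):
    """Hypothetical sketch: return NaN instead of a sentinel like
    double.MaxValue when the denominator is undefined, e.g. because
    no test partition exists and the variance cannot be computed."""
    if variance is None or math.isnan(variance) or variance <= 0.0:
        return float("nan")  # charts can skip NaN points
    return mse / variance

print(safe_nmse(1.0, 2.0))   # normal case
print(safe_nmse(1.0, 0.0))   # error case -> nan, not a giant number
```

Returning NaN also makes the error state unambiguous to downstream consumers, whereas double.MaxValue looks like a (terrible but valid) NMSE value.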
Several ways of feature selection are possible: