Opened 9 years ago
Closed 5 years ago
#1795 closed feature request (done)
Gradient boosting meta-learner for regression and classification
Reported by: | gkronber | Owned by: | gkronber |
---|---|---|---|
Priority: | low | Milestone: | HeuristicLab 3.3.14 |
Component: | Algorithms.DataAnalysis | Version: | |
Keywords: | | Cc: | |
Description (last modified by gkronber)
It would be nice to support a kind of boosting where multiple models are learned step by step and the weight of observations is adapted based on the residuals of the models learned so far.
Friedman's "Stochastic Gradient Boosting" (1999) could be implemented for regression and classification problems.
Since version 3.3.12 there is a specific implementation of gradient boosted trees. It would be great if we could also implement gradient boosting as a meta learner that uses any regression algorithm as the base learner.
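As an illustration of the meta-learner idea, here is a minimal Python sketch (not HeuristicLab code) of gradient boosting with squared error loss, where each step fits a base learner to the residuals of the ensemble so far. The names `fit_stump`, `gradient_boost`, `iterations` and `nu` are hypothetical, and the stump is only a stand-in for "any regression algorithm".

```python
from statistics import mean


def fit_stump(xs, ys):
    """Hypothetical base learner: best single-split regression stump."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[j] <= t]
            right = [y for x, y in zip(xs, ys) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = mean(left), mean(right)
            sse = (sum((y - lm) ** 2 for y in left)
                   + sum((y - rm) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split: fall back to a constant model
        c = mean(ys)
        return lambda x: c
    _, j, t, lm, rm = best
    return lambda x: lm if x[j] <= t else rm


def gradient_boost(xs, ys, learner, iterations=50, nu=0.1):
    """Boost any learner(xs, targets) -> model under squared error loss."""
    f0 = mean(ys)                      # initial constant model
    models = []
    preds = [f0] * len(ys)
    for _ in range(iterations):
        # For squared error the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        m = learner(xs, residuals)
        models.append(m)
        preds = [p + nu * m(x) for p, x in zip(preds, xs)]
    # The ensemble sums (does not average) the shrunken base models.
    return lambda x: f0 + nu * sum(m(x) for m in models)
```

Any function with the same `learner(xs, targets)` shape could be passed in place of `fit_stump`; the shrinkage factor `nu` trades the number of iterations against the step size.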
Change History (27)
comment:1 Changed 8 years ago by gkronber
- Priority changed from medium to low
comment:2 Changed 7 years ago by gkronber
- Description modified (diff)
comment:3 Changed 6 years ago by gkronber
- Description modified (diff)
- Summary changed from Boosting support for classification and regression algorithms to Gradient boosting meta-learner for regression and classification
- Version changed from 3.3.6 to branch
comment:4 Changed 5 years ago by gkronber
- Description modified (diff)
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14
- Status changed from new to accepted
- Version branch deleted
comment:5 Changed 5 years ago by gkronber
comment:6 Changed 5 years ago by gkronber
r13653: added OSGP for gradient boosting meta-learner
comment:7 Changed 5 years ago by gkronber
r13655: updated plugin dependencies. Added HL.OSGA and HL.Selection to HL.Algorithms.DataAnalysis.
comment:8 Changed 5 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to assigned
Still missing: optimization of weights for terms in the final solution (e.g. using some form of regularized regression).
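One possible form of such a weight optimization is ridge regression over the outputs of the individual terms. The following Python sketch is an assumption about what "regularized regression" could mean here, not the approach taken in HeuristicLab; `solve`, `ridge_term_weights` and `lam` are hypothetical names.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ridge_term_weights(term_outputs, targets, lam=1e-3):
    """Return weights w minimizing ||P w - y||^2 + lam * ||w||^2.

    term_outputs[i][k] is the output of term k for observation i.
    """
    k = len(term_outputs[0])
    # Normal equations: (P^T P + lam * I) w = P^T y
    gram = [[sum(row[a] * row[b] for row in term_outputs)
             + (lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * y for row, y in zip(term_outputs, targets))
           for a in range(k)]
    return solve(gram, rhs)
```

A larger `lam` shrinks the weights toward zero, which limits the influence of terms that only fit noise in the training data.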
comment:9 Changed 5 years ago by mkommend
r13699: Adapted creation of regression ensemble to new ctors (see also #2590 r13698).
#2590 must be released before this ticket.
comment:10 Changed 5 years ago by mkommend
r13703: Disabled averaging of estimated values in ensemble solution created by GBM regression.
comment:11 follow-up: ↓ 19 Changed 5 years ago by mkommend
TODO: when the seed is fixed in GBM it is not respected by the inner regression algorithm => different solutions for each execution although the seed is fixed.
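A common way to make the inner algorithm respect a fixed outer seed is to draw each inner seed from the outer random number generator. The Python sketch below only illustrates that pattern; `run_inner_algorithm` and `run_gbm` are hypothetical stand-ins and do not reflect how HeuristicLab resolved the issue.

```python
import random


def run_inner_algorithm(seed):
    """Stand-in for a stochastic base learner; returns a pseudo-random result."""
    rng = random.Random(seed)
    return rng.random()


def run_gbm(seed, iterations=5):
    """Derive every inner seed from the outer generator, never from the clock."""
    outer = random.Random(seed)
    results = []
    for _ in range(iterations):
        inner_seed = outer.randrange(2**31)
        results.append(run_inner_algorithm(inner_seed))
    return results
```

With this arrangement two executions with the same outer seed produce the same sequence of inner seeds and therefore the same results.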
comment:12 Changed 5 years ago by gkronber
r13707: fixed a problem for datasets with variables that don't have double type
comment:13 Changed 5 years ago by gkronber
If the problem is a maximization problem (e.g. using R² as objective function) the data rows should not be named 'Loss'.
comment:14 follow-up: ↓ 15 Changed 5 years ago by gkronber
GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?
comment:15 in reply to: ↑ 14 ; follow-up: ↓ 16 Changed 5 years ago by mkommend
Replying to gkronber:
> GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?
I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?
comment:16 in reply to: ↑ 15 ; follow-up: ↓ 17 Changed 5 years ago by gkronber
Replying to mkommend:
> Replying to gkronber:
> > GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?
> I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?
I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.
comment:17 in reply to: ↑ 16 Changed 5 years ago by gkronber
Replying to gkronber:
> Replying to mkommend:
> > Replying to gkronber:
> > > GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?
> > I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?
> I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.
In each iteration a run object is produced that stores the solution and the problemData for this iteration. Both hold a reference to a ModifiableDataset. The individual run results are primarily interesting for detailed analysis / debugging. Proposed fix: add a parameter to turn storing of runs on/off.
comment:18 Changed 5 years ago by gkronber
- added parameter to turn on/off storing of individual runs
- changed "loss" -> "R²" in the qualities line chart
comment:19 in reply to: ↑ 11 Changed 5 years ago by gkronber
comment:20 Changed 5 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from assigned to reviewing
r13917: Added linear scaling of solutions while producing a model ensemble for GBM.
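Linear scaling usually means fitting an offset and a slope so that the scaled model output matches the target in the least-squares sense. The Python sketch below shows that standard calculation under this assumption; it is not taken from r13917, and `linear_scaling` is a hypothetical name.

```python
from statistics import mean


def linear_scaling(estimates, targets):
    """Return (alpha, beta) minimizing sum((t - (alpha + beta * e)) ** 2)."""
    e_mean, t_mean = mean(estimates), mean(targets)
    var = sum((e - e_mean) ** 2 for e in estimates)
    if var == 0.0:
        return t_mean, 0.0  # constant model: only the offset can be fitted
    cov = sum((e - e_mean) * (t - t_mean)
              for e, t in zip(estimates, targets))
    beta = cov / var
    alpha = t_mean - beta * e_mean
    return alpha, beta
```

The scaled prediction for a raw model output `e` is then `alpha + beta * e`.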
comment:21 Changed 5 years ago by gkronber
reviewed r13917
comment:22 Changed 5 years ago by gkronber
- Status changed from reviewing to readytorelease
comment:23 Changed 5 years ago by gkronber
comment:24 Changed 5 years ago by gkronber
r13978: removed MCTS symb reg as algorithm in GBM to cut the dependency between GBM and MCTS symb reg (MCTS symb reg not ready for release yet)
comment:25 Changed 5 years ago by gkronber
comment:26 Changed 5 years ago by gkronber
Everything merged, close ticket as done when integration tests succeed
comment:27 Changed 5 years ago by mkommend
- Resolution set to done
- Status changed from readytorelease to closed
r13646: added a data analysis algorithm for gradient boosting for regression which uses another regression algorithm as a base learner. Currently, only squared error loss is supported.