
Opened 12 years ago

Closed 8 years ago

#1795 closed feature request (done)

Gradient boosting meta-learner for regression and classification

Reported by: gkronber
Owned by: gkronber
Priority: low
Milestone: HeuristicLab 3.3.14
Component: Algorithms.DataAnalysis
Version:
Keywords:
Cc:

Description (last modified by gkronber)

It would be nice to support a kind of boosting where multiple models are learned step by step and the weight of observations is adapted based on the residuals of the models learned so far.

Friedman's "Stochastic Gradient Boosting" (1999) could be implemented for regression and classification problems.

Since version 3.3.12 there is a specific implementation of gradient boosted trees. It would be great if we could also implement gradient boosting as a meta learner that uses any regression algorithm as the base learner.
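As an illustration of the requested feature, here is a minimal, self-contained C# sketch of gradient boosting as a meta-learner with squared error loss. It is independent of the HeuristicLab API; the names (GradientBoost, BaseLearner, Stump) and the stump base learner are made up for this example, and any regression algorithm could take the base learner's place.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  public static class GbmSketch {
    // Any regression algorithm: given inputs and targets, return a predictor.
    public delegate Func<double[], double> BaseLearner(double[][] x, double[] y);

    // Gradient boosting with squared error loss: each iteration fits the base
    // learner to the residuals (the negative gradient of 1/2 (y - f)^2) of the
    // ensemble built so far; nu is the learning rate (shrinkage).
    public static Func<double[], double> GradientBoost(double[][] x, double[] y,
        BaseLearner learn, int iterations, double nu) {
      double f0 = y.Average();                                  // constant start model
      var terms = new List<Func<double[], double>>();
      var current = Enumerable.Repeat(f0, y.Length).ToArray();  // ensemble output per row
      for (int m = 0; m < iterations; m++) {
        var residuals = y.Zip(current, (yi, fi) => yi - fi).ToArray();
        var h = learn(x, residuals);                            // fit base learner to residuals
        terms.Add(h);
        for (int i = 0; i < y.Length; i++) current[i] += nu * h(x[i]);
      }
      return xi => f0 + terms.Sum(term => nu * term(xi));       // additive final model
    }

    // Example base learner: a regression stump that splits on the first input variable.
    public static Func<double[], double> Stump(double[][] x, double[] y) {
      double bestT = 0, bestErr = double.MaxValue, leftVal = 0, rightVal = 0;
      foreach (double t in x.Select(r => r[0]).Distinct()) {
        var left = Enumerable.Range(0, y.Length).Where(i => x[i][0] <= t).ToArray();
        var right = Enumerable.Range(0, y.Length).Where(i => x[i][0] > t).ToArray();
        if (left.Length == 0 || right.Length == 0) continue;
        double lv = left.Average(i => y[i]), rv = right.Average(i => y[i]);
        double err = left.Sum(i => (y[i] - lv) * (y[i] - lv)) + right.Sum(i => (y[i] - rv) * (y[i] - rv));
        if (err < bestErr) { bestErr = err; bestT = t; leftVal = lv; rightVal = rv; }
      }
      if (bestErr == double.MaxValue) { double mean = y.Average(); return _ => mean; }
      return xi => xi[0] <= bestT ? leftVal : rightVal;
    }
  }

A call such as GbmSketch.GradientBoost(x, y, GbmSketch.Stump, 100, 0.1) corresponds to 100 boosting iterations with shrinkage 0.1; in the meta-learner envisioned here, the Stump argument would be replaced by any HeuristicLab regression algorithm.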

Change History (27)

comment:1 Changed 11 years ago by gkronber

  • Priority changed from medium to low

comment:2 Changed 10 years ago by gkronber

  • Description modified (diff)

comment:3 Changed 9 years ago by gkronber

  • Description modified (diff)
  • Summary changed from Boosting support for classification and regression algorithms to Gradient boosting meta-learner for regression and classification
  • Version changed from 3.3.6 to branch

comment:4 Changed 8 years ago by gkronber

  • Description modified (diff)
  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14
  • Status changed from new to accepted
  • Version branch deleted

comment:5 Changed 8 years ago by gkronber

r13646: added a data analysis algorithm for gradient boosting regression that uses another regression algorithm as the base learner. Currently, only squared error loss is supported.
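For reference (standard gradient boosting theory, not specific to this revision): with the squared error loss the pseudo-targets that the base learner is trained on are simply the residuals of the current ensemble, since

  L(y, F) = \tfrac{1}{2}(y - F)^2, \qquad -\frac{\partial L(y, F)}{\partial F} = y - F,

so in iteration m the base learner is fit to y_i - F_{m-1}(x_i). Supporting other losses (absolute error, quantile loss, logistic loss for classification) would additionally require computing their negative gradients and suitable step sizes.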

comment:6 Changed 8 years ago by gkronber

r13653: added OSGP as a base learner for the gradient boosting meta-learner

comment:7 Changed 8 years ago by gkronber

r13655: updated plugin dependencies. Added HL.OSGA and HL.Selection to HL.Algorithms.DataAnalysis.

comment:8 Changed 8 years ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to assigned

Still missing: optimization of weights for terms in the final solution (e.g. using some form of regularized regression).
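One possible way to do this, sketched here under the assumption that ridge regression is the regularizer (the names and the choice of ridge are illustrative, not necessarily the scheme that was eventually used): collect the predictions of the ensemble terms on the training rows into a matrix H and re-fit the term weights by solving the regularized normal equations.

  using System;

  public static class TermWeightFitting {
    // Re-fit the ensemble term weights by ridge regression: solve
    // (H^T H + lambda * I) w = H^T y, where h[i][j] is the prediction of
    // term j on training row i and y holds the target values.
    public static double[] RidgeWeights(double[][] h, double[] y, double lambda) {
      int n = h.Length, m = h[0].Length;
      var a = new double[m][];               // H^T H + lambda * I
      var b = new double[m];                 // H^T y
      for (int j = 0; j < m; j++) {
        a[j] = new double[m];
        for (int k = 0; k < m; k++)
          for (int i = 0; i < n; i++) a[j][k] += h[i][j] * h[i][k];
        a[j][j] += lambda;
        for (int i = 0; i < n; i++) b[j] += h[i][j] * y[i];
      }
      // Gaussian elimination with partial pivoting on the small m x m system.
      for (int col = 0; col < m; col++) {
        int piv = col;
        for (int r = col + 1; r < m; r++)
          if (Math.Abs(a[r][col]) > Math.Abs(a[piv][col])) piv = r;
        (a[col], a[piv]) = (a[piv], a[col]);
        (b[col], b[piv]) = (b[piv], b[col]);
        for (int r = col + 1; r < m; r++) {
          double factor = a[r][col] / a[col][col];
          for (int c = col; c < m; c++) a[r][c] -= factor * a[col][c];
          b[r] -= factor * b[col];
        }
      }
      var w = new double[m];                 // back substitution
      for (int r = m - 1; r >= 0; r--) {
        double s = b[r];
        for (int c = r + 1; c < m; c++) s -= a[r][c] * w[c];
        w[r] = s / a[r][r];
      }
      return w;
    }
  }

The final prediction would then be the weighted sum of the term predictions instead of the uniformly shrunken sum.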

comment:9 Changed 8 years ago by mkommend

r13699: Adapted creation of regression ensemble to new ctors (see also #2590 r13698).

#2590 must be released before this ticket. --> DONE


comment:10 Changed 8 years ago by mkommend

r13703: Disabled averaging of estimated values in ensemble solution created by GBM regression.

comment:11 follow-up: Changed 8 years ago by mkommend

TODO: when the seed is fixed in GBM it is not respected by the inner regression algorithm => different solutions for each execution although the seed is fixed.


comment:12 Changed 8 years ago by gkronber

r13707: fixed a problem for datasets with variables that are not of type double

comment:13 Changed 8 years ago by gkronber

If the problem is a maximization problem (e.g. using R² as the objective function), the data rows should not be named 'Loss'.

comment:14 follow-up: Changed 8 years ago by gkronber

GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

comment:15 in reply to: ↑ 14 ; follow-up: Changed 8 years ago by mkommend

Replying to gkronber:

    GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

comment:16 in reply to: ↑ 15 ; follow-up: Changed 8 years ago by gkronber

Replying to mkommend:

    Replying to gkronber:

        GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

    I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.

comment:17 in reply to: ↑ 16 Changed 8 years ago by gkronber

Replying to gkronber:

    Replying to mkommend:

        Replying to gkronber:

            GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

        I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

    I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.

In each iteration a run object is produced which stores the solution and the problemData for this iteration. Both have a reference to a ModifiableDataset. The individual run results are primarily interesting for detailed analysis / debugging. Proposed fix: add a parameter to turn storing of runs on/off.

comment:18 Changed 8 years ago by gkronber

r13889:

  • added parameter to turn on/off storing of individual runs
  • changed "loss" -> "R²" in the qualities line chart

comment:19 in reply to: ↑ 11 Changed 8 years ago by gkronber

Replying to mkommend:

    TODO: when the seed is fixed in GBM it is not respected by the inner regression algorithm => different solutions for each execution although the seed is fixed.

r13898: GBM now sets the seed of the base learner in each iteration
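The underlying idea, sketched with hypothetical names (ISeedableRegression, Seed and Fit are stand-ins, not HeuristicLab types): the GBM loop draws one seed per iteration from its own seeded random number generator and assigns it to the base learner, so a fixed GBM seed makes the whole run reproducible.

  using System;

  // Stand-in for any base learner whose random seed can be set from outside.
  public interface ISeedableRegression {
    int Seed { get; set; }
    Func<double[], double> Fit(double[][] x, double[] y);
  }

  public static class GbmSeeding {
    // Derive the base learner's seed from the GBM RNG in every iteration
    // instead of letting the inner algorithm pick its own seed.
    public static void RunIterations(ISeedableRegression learner, int gbmSeed,
        int iterations, double[][] x, double[] residuals) {
      var rng = new Random(gbmSeed);
      for (int m = 0; m < iterations; m++) {
        learner.Seed = rng.Next();       // reproducible per-iteration seed
        var term = learner.Fit(x, residuals);
        // ... add term to the ensemble and update the residuals ...
      }
    }
  }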

comment:20 Changed 8 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from assigned to reviewing

r13917: Added linear scaling of solutions while producing a model ensemble for GBM.
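Linear scaling in the usual sense computes an offset and a slope that minimize the squared error of alpha + beta * f(x) against the targets; whether r13917 uses exactly this formulation is an assumption. A compact sketch:

  using System;
  using System.Linq;

  public static class LinearScalingSketch {
    // Find alpha and beta minimizing sum_i (y_i - (alpha + beta * f_i))^2:
    // beta = cov(f, y) / var(f), alpha = mean(y) - beta * mean(f).
    public static (double Alpha, double Beta) Compute(double[] estimated, double[] target) {
      double fMean = estimated.Average(), yMean = target.Average();
      double cov = 0.0, variance = 0.0;
      for (int i = 0; i < target.Length; i++) {
        cov += (estimated[i] - fMean) * (target[i] - yMean);
        variance += (estimated[i] - fMean) * (estimated[i] - fMean);
      }
      double beta = variance > 0 ? cov / variance : 1.0;
      double alpha = yMean - beta * fMean;
      return (alpha, beta);
    }
  }

Each model added to the ensemble can then be wrapped as alpha + beta * model(x), which improves the squared error of the individual terms without changing their structure.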

comment:21 Changed 8 years ago by gkronber

reviewed r13971


comment:22 Changed 8 years ago by gkronber

  • Status changed from reviewing to readytorelease

comment:23 Changed 8 years ago by gkronber

r13977: merged r13646,r13653,r13655,r13699,r13703,r13707,r13889,r13898,r13917 from trunk to stable

comment:24 Changed 8 years ago by gkronber

r13978: removed MCTS symbolic regression as an algorithm in GBM to cut the dependency between GBM and MCTS symbolic regression (MCTS symbolic regression is not yet ready for release)

comment:25 Changed 8 years ago by gkronber

r13980: merged r13978 from trunk to stable

comment:26 Changed 8 years ago by gkronber

Everything is merged; the ticket can be closed as done once the integration tests succeed.

comment:27 Changed 8 years ago by mkommend

  • Resolution set to done
  • Status changed from readytorelease to closed