Opened 5 years ago

Closed 11 months ago

#1795 closed feature request (done)

Gradient boosting meta-learner for regression and classification

Reported by: gkronber Owned by: gkronber
Priority: low Milestone: HeuristicLab 3.3.14
Component: Algorithms.DataAnalysis Version:
Keywords: Cc:

Description (last modified by gkronber)

It would be nice to support a kind of boosting where multiple models are learned step by step and the weight of observations is adapted based on the residuals of the models learned so far.

Friedman's "Stochastic Gradient Boosting" (1999) could be implemented for regression and classification problems.

Since version 3.3.12 there is a specific implementation of gradient boosted trees. It would be great if we could also implement gradient boosting as a meta learner that uses any regression algorithm as the base learner.

Change History (27)

comment:1 Changed 4 years ago by gkronber

  • Priority changed from medium to low

comment:2 Changed 3 years ago by gkronber

  • Description modified (diff)

comment:3 Changed 23 months ago by gkronber

  • Description modified (diff)
  • Summary changed from Boosting support for classification and regression algorithms to Gradient boosting meta-learner for regression and classification
  • Version changed from 3.3.6 to branch

comment:4 Changed 15 months ago by gkronber

  • Description modified (diff)
  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.14
  • Status changed from new to accepted
  • Version branch deleted

comment:5 Changed 15 months ago by gkronber

r13646: added a data analysis algorithm for gradient boosting for regression which uses another regression algorithm as a base learner. Currently, only squared error loss is supported.
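For readers unfamiliar with the idea, the following is a minimal Python sketch of gradient boosting with squared error loss and an exchangeable base learner. It is not the HeuristicLab code from r13646; the names `fit_stump`, `gradient_boost`, `nu` and the toy data are assumptions made for illustration. With squared error the negative gradient is simply the residual, so each iteration fits the base learner to the current residuals.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Hypothetical base learner: a one-split regression stump on one feature."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant model
        constant = mean(ys)
        return lambda x: constant
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, fit_base, iterations=50, nu=0.5):
    """Boost any base learner `fit_base(xs, targets) -> model` under squared error loss."""
    f0 = mean(ys)                      # initial constant model
    models = []
    preds = [f0] * len(ys)
    for _ in range(iterations):
        # For squared error the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        model = fit_base(xs, residuals)
        models.append(model)
        preds = [p + nu * model(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + nu * sum(m(x) for m in models)

def mse(model, xs, ys):
    return mean((y - model(x)) ** 2 for x, y in zip(xs, ys))

# Toy data (assumption): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
y_mean = mean(ys)
baseline = mse(lambda x: y_mean, xs, ys)
boosted = gradient_boost(xs, ys, fit_stump)
```

Any function with the same `fit_base(xs, targets)` shape could be passed instead of the stump, which is the point of the meta-learner design.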

comment:6 Changed 15 months ago by gkronber

r13653: added OSGP for gradient boosting meta-learner

comment:7 Changed 15 months ago by gkronber

r13655: updated plugin dependencies. Added HL.OSGA and HL.Selection to HL.Algorithms.DataAnalysis.

comment:8 Changed 15 months ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to assigned

Still missing: optimization of weights for terms in the final solution (e.g. using some form of regularized regression).
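One possible form of such a weight optimization is ridge regression over the outputs of the individual terms. The sketch below is only an illustration in plain Python, not something implemented in this ticket; the function names, the `lam` parameter and the omission of an intercept are assumptions.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w

def ridge_term_weights(term_outputs, targets, lam=1e-3):
    """Return one weight per term; term_outputs[k][i] is the output of term k on row i.

    Solves (A^T A + lam * I) w = A^T y, where column k of A holds the outputs of term k.
    No intercept is fitted in this sketch.
    """
    k = len(term_outputs)
    gram = [[sum(p * q for p, q in zip(term_outputs[a], term_outputs[b]))
             + (lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(p * y for p, y in zip(term_outputs[a], targets)) for a in range(k)]
    return solve(gram, rhs)

# Toy example (assumption): targets are exactly 2 * term1 + 3 * term2.
term1 = [1.0, 2.0, 3.0, 4.0]
term2 = [1.0, 0.0, 1.0, 0.0]
targets = [2 * a + 3 * b for a, b in zip(term1, term2)]
weights = ridge_term_weights([term1, term2], targets, lam=1e-6)
```

A larger `lam` shrinks the weights towards zero, which is the regularizing effect mentioned above.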

comment:9 Changed 15 months ago by mkommend

r13699: Adapted creation of regression ensemble to new ctors (see also #2590 r13698).

#2590 must be released before this ticket. --> DONE

Last edited 11 months ago by gkronber (previous) (diff)

comment:10 Changed 15 months ago by mkommend

r13703: Disabled averaging of estimated values in ensemble solution created by GBM regression.
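The reason this matters: a boosted model is additive, so the stage outputs have to be summed, and averaging them would shrink the prediction by the number of stages. A small Python sketch of the difference follows; `ensemble_predict`, the `average` flag and the three stage models are hypothetical and not taken from the HeuristicLab ensemble classes.

```python
def ensemble_predict(models, x, average=False):
    """Combine stage outputs by summing (boosting) or averaging (bagging-style ensemble)."""
    total = sum(m(x) for m in models)
    return total / len(models) if average else total

# Hypothetical boosting stages: a constant start model plus two corrections.
stages = [lambda x: 10.0, lambda x: 0.5 * x, lambda x: -0.1]

summed = ensemble_predict(stages, 4.0)                   # additive model, as boosting needs
averaged = ensemble_predict(stages, 4.0, average=True)   # a third of the summed value here
```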

comment:11 follow-up: Changed 15 months ago by mkommend

TODO: when the seed is fixed in GBM it is not respected by the inner regression algorithm => different solutions for each execution although the seed is fixed.

Last edited 11 months ago by gkronber (previous) (diff)

comment:12 Changed 15 months ago by gkronber

r13707: fixed a problem for datasets with variables that don't have double type

comment:13 Changed 12 months ago by gkronber

If the problem is a maximization problem (e.g. using R² as objective function) the data rows should not be named 'Loss'.

comment:14 follow-up: Changed 12 months ago by gkronber

GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

comment:15 in reply to: ↑ 14 ; follow-up: Changed 12 months ago by mkommend

Replying to gkronber:

GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

comment:16 in reply to: ↑ 15 ; follow-up: Changed 12 months ago by gkronber

Replying to mkommend:

Replying to gkronber:

GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.

comment:17 in reply to: ↑ 16 Changed 12 months ago by gkronber

Replying to gkronber:

Replying to mkommend:

Replying to gkronber:

GBM produces huge runs (problem for persistence and hive). Maybe related to ModifiableDataset?

I don't see a reason why the ModifiableDataset should increase the file size of GBM runs. Could you provide an example file for further investigation?

I believe it's not an issue with ModifiableDataset but rather the fact that full clones of the dataset might be created.

In each iteration a run object is produced which stores the solution and the problemData for this iteration. Both have a reference to a ModifiableDataset. The individual run results are primarily interesting for detailed analysis / debugging. Proposed fix: add parameter to turn storing runs on/off.

comment:18 Changed 12 months ago by gkronber

r13889:

  • added parameter to turn on/off storing of individual runs
  • changed "loss" -> "R²" in the qualities line chart
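The effect of such a switch can be sketched in a few lines of Python. This is not the r13889 code; `boost`, `store_runs` and `fit_iteration` are hypothetical names, and the per-iteration details stand in for the run objects that reference the dataset.

```python
def boost(iterations, fit_iteration, store_runs=False):
    """Run the boosting loop; keep per-iteration run details only when asked to."""
    models = []
    runs = []
    for i in range(iterations):
        model, run_details = fit_iteration(i)
        models.append(model)
        if store_runs:
            # Run details can be large (solution plus problem data), so they are optional.
            runs.append(run_details)
    return models, runs

def fake_iteration(i):
    """Hypothetical stand-in for one base learner run with bulky details."""
    return (lambda x: float(i)), {"iteration": i, "data": [0.0] * 1000}

models, runs = boost(5, fake_iteration)                     # default: nothing stored
models_dbg, runs_dbg = boost(5, fake_iteration, store_runs=True)
```

With the switch off, only the models survive the loop, so the result stays small for persistence and Hive.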

comment:19 in reply to: ↑ 11 Changed 11 months ago by gkronber

Replying to mkommend:

TODO: when the seed is fixed in GBM it is not respected by the inner regression algorithm => different solutions for each execution although the seed is fixed.

r13898: GBM now sets the seed of the base learner in each iteration
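A minimal Python sketch of the idea, assuming the per-iteration seeds are drawn from a generator that is itself seeded with the GBM seed (the ticket does not say how r13898 derives them). `run_gbm` and `noisy_base_learner` are hypothetical names.

```python
import random

def noisy_base_learner(seed):
    """Hypothetical stochastic base learner; its result depends only on its seed."""
    rng = random.Random(seed)
    return rng.random()

def run_gbm(seed, iterations, fit_base):
    """Derive one base learner seed per iteration from the outer seed."""
    master = random.Random(seed)
    results = []
    for _ in range(iterations):
        base_seed = master.randrange(2**31 - 1)  # explicit seed for this iteration
        results.append(fit_base(base_seed))
    return results

first = run_gbm(42, 5, noisy_base_learner)
second = run_gbm(42, 5, noisy_base_learner)
other = run_gbm(43, 5, noisy_base_learner)
```

Two executions with the same outer seed now give identical results, while each iteration still uses a different seed.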

comment:20 Changed 11 months ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from assigned to reviewing

r13917: Added linear scaling of solutions while producing a model ensemble for GBM.
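For reference, linear scaling in its usual least-squares form picks an offset and a slope so that `alpha + beta * output` fits the targets as closely as possible. The Python sketch below shows that form; it is an assumption that r13917 uses the same formula, and `linear_scaling` is a hypothetical name.

```python
from statistics import mean

def linear_scaling(outputs, targets):
    """Return (alpha, beta) minimizing the sum of (target - (alpha + beta * output))^2."""
    mo, mt = mean(outputs), mean(targets)
    var = sum((o - mo) ** 2 for o in outputs)
    if var == 0.0:
        # Constant model output: the best fit is the target mean.
        return mt, 0.0
    cov = sum((o - mo) * (t - mt) for o, t in zip(outputs, targets))
    beta = cov / var
    alpha = mt - beta * mo
    return alpha, beta

# Toy example (assumption): targets are exactly 3 + 2 * output.
alpha, beta = linear_scaling([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])
```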

comment:21 Changed 11 months ago by gkronber

reviewed r13917

Last edited 11 months ago by gkronber (previous) (diff)

comment:22 Changed 11 months ago by gkronber

  • Status changed from reviewing to readytorelease

comment:23 Changed 11 months ago by gkronber

r13977: merged r13646,r13653,r13655,r13699,r13703,r13707,r13889,r13898,r13917 from trunk to stable

comment:24 Changed 11 months ago by gkronber

r13978: removed MCTS symb reg as algorithm in GBM to cut the dependency between GBM and MCTS symb reg (MCTS symb reg not ready for release yet)

comment:25 Changed 11 months ago by gkronber

r13980: merged r13978 from trunk to stable

comment:26 Changed 11 months ago by gkronber

Everything merged, close ticket as done when integration tests succeed

comment:27 Changed 11 months ago by mkommend

  • Resolution set to done
  • Status changed from readytorelease to closed