
Opened 8 years ago

Closed 7 years ago

#2760 closed feature request (done)

Shuffle samples in the cross-validation wrapper for data analysis algorithms

Reported by: bburlacu
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis
Version: 3.3.14
Keywords:
Cc:

Description

The cross-validation wrapper should offer an option to shuffle the data samples.

Change History (22)

comment:1 Changed 8 years ago by bburlacu

  • Owner set to bburlacu
  • Status changed from new to accepted

r14864: Implement shuffling of crossvalidation samples.

comment:2 Changed 8 years ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from accepted to reviewing

comment:3 Changed 8 years ago by bburlacu

r14865: Fix issue with resources in CrossValidationView.Designer.cs

comment:4 Changed 8 years ago by gkronber

It seems that the ensemble does not correctly store whether a point was used for training or test. Steps to reproduce:

  1. Use cross-validation with shuffling and produce an overfit model on purpose.
  2. Check the line chart.
  3. Expected result: errors for training predictions (yellow) are very small, errors for test predictions (red) are significantly higher.
  4. Actual result: some errors for training predictions are also high, some errors for test points are suspiciously small.
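One way such a symptom can arise, shown here only as an illustrative sketch and not as the confirmed cause: if the test partition is defined on shuffled positions but the labels are applied to the unshuffled rows, some training rows get marked as test and vice versa. The function names and the `perm` convention below are hypothetical and do not come from the HeuristicLab code base.

```python
def label_rows(perm, test_start, test_end):
    """Label original rows, given that perm[i] is the original row at shuffled position i."""
    labels = ["training"] * len(perm)
    for pos in range(test_start, test_end):
        labels[perm[pos]] = "test"  # map the shuffled position back to the original row
    return labels


def label_rows_ignoring_shuffle(perm, test_start, test_end):
    """Hypothetical buggy variant: treats shuffled positions as original row indices."""
    labels = ["training"] * len(perm)
    for pos in range(test_start, test_end):
        labels[pos] = "test"  # wrong whenever perm is not the identity
    return labels


if __name__ == "__main__":
    perm = [3, 0, 4, 1, 2]  # hypothetical shuffle of five rows
    print(label_rows(perm, 0, 2))
    print(label_rows_ignoring_shuffle(perm, 0, 2))
```

With a mismatch of this kind, a chart drawn in original row order would show large errors on some points labeled as training and small errors on some points labeled as test, which matches the actual result described above.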

comment:5 Changed 8 years ago by bburlacu

r14904: Reuse the shuffled data when creating the solution ensemble.

comment:6 Changed 8 years ago by gkronber

Overlaps with changes in r14781 (#2756) must be merged together.

comment:7 Changed 8 years ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

Review comments:

  • Backwards compatibility is not ensured
  • Shuffling can be changed during execution yielding inconsistent results
  • Clone shows wrong value of shuffle samples in view
  • Shuffled problemData is neither cloned nor serialized

Why do we need the shuffledProblemData at all?

Last edited 8 years ago by mkommend

comment:8 Changed 7 years ago by bburlacu

r15002: Got rid of the shuffledProblemData by using a shared seed for all the folds (so that the dataset for each fold is shuffled in exactly the same way). Backwards compatibility should be restored. Shuffling cannot be changed during algorithm execution, cloning also clones the checked value for the shuffled checkbox.
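A minimal sketch of the shared-seed idea, assuming each fold recomputes the permutation on its own instead of reading a stored shuffled dataset. The function names, the contiguous fold layout, and the choice to give leftover rows to the last fold are assumptions for illustration, not the actual CrossValidation implementation.

```python
import random


def shuffled_indices(n_rows, seed):
    """Return a permutation of the row indices that depends only on the seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices


def fold_partition(n_rows, n_folds, fold, seed):
    """Compute the training and test rows of one fold from the shared seed."""
    permuted = shuffled_indices(n_rows, seed)  # identical for every fold
    fold_size = n_rows // n_folds
    start = fold * fold_size
    # The last fold takes the remaining rows when n_rows is not divisible by n_folds.
    end = n_rows if fold == n_folds - 1 else start + fold_size
    test = permuted[start:end]
    training = permuted[:start] + permuted[end:]
    return training, test


if __name__ == "__main__":
    for fold in range(3):
        training, test = fold_partition(10, 3, fold, seed=42)
        print(fold, sorted(test))
```

Because every fold derives the same permutation from the same seed, the test partitions stay disjoint and together cover all rows, and no shuffled copy of the data has to be cloned or serialized.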

comment:9 Changed 7 years ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

comment:10 Changed 7 years ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

This ticket broke the backwards compatibility for CrossValidation (probably due to the shuffle sample flag).

comment:11 Changed 7 years ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

r15026: Ensure that the shuffleSamples flag is initialized after deserialization.
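A rough Python analogue of this kind of fix, assuming that files saved before the change simply carry no value for the flag. The class name, its fields, and the default of False (no shuffling, matching the old behavior) are assumptions for illustration; the actual code is C# and may differ.

```python
class CrossValidationSettings:
    """Hypothetical stand-in for the persisted state of a cross-validation run."""

    def __init__(self, folds=10, shuffle_samples=False):
        self.folds = folds
        self.shuffle_samples = shuffle_samples

    def __setstate__(self, state):
        # Called by pickle when restoring an object; plays the role of an
        # after-deserialization hook in this sketch.
        self.__dict__.update(state)
        # State written before the flag existed has no entry for it, so fall
        # back to the assumed old behavior of not shuffling.
        self.__dict__.setdefault("shuffle_samples", False)


if __name__ == "__main__":
    old_state = {"folds": 5}  # state as an older version might have saved it
    restored = CrossValidationSettings.__new__(CrossValidationSettings)
    restored.__setstate__(old_state)
    print(restored.folds, restored.shuffle_samples)
```

State that already contains the flag keeps its stored value, so only files from older versions are affected by the default.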

comment:12 Changed 7 years ago by mkommend

r15077: Reordered backwards compatibility and event registration in the after-deserialization hook of CrossValidation.

comment:13 Changed 7 years ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

Review comments

CrossValidationView

  • The ShuffleSamples checkbox should be checked / unchecked in OnContentChanged and not SetEnabledStateOfControls
  • I would enable the ShuffleSamples checkbox only if the CrossValidation is prepared.

CrossValidation

  • Why does extracting and aggregating solutions work differently for regression and classification (clone of problemData)?

comment:14 Changed 7 years ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

r15111: Set the check state of the ShuffleSamples checkbox in the OnContentChanged method. Enable the checkbox only when the CrossValidation is prepared.

Regarding the different way of cloning the classification solution: this is done differently to account for a special use case when the GBT algorithm with the logistic regression loss function returns a regression solution (from which a new classification solution is built).

comment:15 Changed 7 years ago by mkommend

Reviewed r15111.

comment:16 Changed 7 years ago by mkommend

  • Status changed from reviewing to readytorelease

comment:17 Changed 7 years ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from readytorelease to assigned

comment:18 Changed 7 years ago by gkronber

  • Status changed from assigned to accepted

comment:19 Changed 7 years ago by gkronber

  • Status changed from accepted to readytorelease

comment:20 Changed 7 years ago by gkronber

r15150: merged r14864, r14865, r14904, r15002, r15026, r15077, r15111 from trunk to stable (all changesets merged)

comment:21 Changed 7 years ago by gkronber

Depends on #2723.

comment:22 Changed 7 years ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed