Opened 4 months ago

Closed 3 weeks ago

#2760 closed feature request (done)

Shuffle samples in the cross-validation wrapper for data analysis algorithms

Reported by: bburlacu
Owned by: gkronber
Priority: medium
Milestone: HeuristicLab 3.3.15
Component: Algorithms.DataAnalysis
Version: 3.3.14
Keywords:
Cc:

Description

The cross-validation wrapper should offer an option to shuffle the data samples.

Change History (22)

comment:1 Changed 3 months ago by bburlacu

  • Owner set to bburlacu
  • Status changed from new to accepted

r14864: Implement shuffling of crossvalidation samples.

comment:2 Changed 3 months ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from accepted to reviewing

comment:3 Changed 3 months ago by bburlacu

r14865: Fix issue with resources in CrossValidationView.Designer.cs

comment:4 Changed 3 months ago by gkronber

It seems that the ensemble does not correctly store whether a point was used for training or test. To reproduce:

  1. Use cross-validation with shuffling and produce an overfit model on purpose.
  2. Check the line chart.
  3. Expected result: errors for training predictions (yellow) are very small, errors for test predictions (red) are significantly higher.
  4. Actual result: some errors for training predictions are also high, some errors for test points are suspiciously small.

comment:5 Changed 3 months ago by bburlacu

r14904: Reuse the shuffled data when creating the solution ensemble.

comment:6 Changed 3 months ago by gkronber

These changes overlap with the changes in r14781 (#2756); the two must be merged together.

comment:7 Changed 2 months ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

Review comments:

  • Backwards compatibility is not ensured
  • Shuffling can be changed during execution yielding inconsistent results
  • Clone shows wrong value of shuffle samples in view
  • Shuffled problemData is neither cloned nor serialized

Why do we need the shuffledProblemData at all?

Last edited 2 months ago by mkommend

comment:8 Changed 2 months ago by bburlacu

r15002: Got rid of the shuffledProblemData by using a shared seed for all the folds (so that the dataset for each fold is shuffled in exactly the same way). Backwards compatibility should be restored. Shuffling can no longer be changed during algorithm execution, and cloning now also copies the checked state of the shuffle checkbox.
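
The shared-seed idea can be sketched as follows. This is a hedged illustration rather than the code from r15002: `fold_test_rows`, the sizes, and the seed value are assumptions made for the example. Each fold shuffles its own fresh copy of the row indices, but because every fold uses the same seed, all folds see the same permutation, so their test slices are disjoint and together cover the whole dataset without a separately stored shuffled copy.

```python
import random

def fold_test_rows(n_rows, n_folds, fold, seed):
    """Shuffle a fresh copy of the row indices with `seed`, then return
    the slice that fold number `fold` holds out for testing."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    start = fold * n_rows // n_folds
    end = (fold + 1) * n_rows // n_folds
    return rows[start:end]

n_rows, n_folds, shared_seed = 20, 5, 1234  # illustrative values

# Every fold shuffles independently but with the same seed.
partitions = [fold_test_rows(n_rows, n_folds, f, shared_seed)
              for f in range(n_folds)]

# Each row is held out by exactly one fold.
covered = sorted(r for part in partitions for r in part)
print(covered == list(range(n_rows)))
```

If each fold drew its own seed instead, the permutations would differ and nothing would guarantee that the test slices stay disjoint.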

comment:9 Changed 8 weeks ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

comment:10 Changed 7 weeks ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

This ticket broke the backwards compatibility for CrossValidation (probably due to the shuffle sample flag).

comment:11 Changed 7 weeks ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

r15026: Ensure that the shuffleSamples flag is initialized after deserialization.
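
The general pattern behind this kind of fix can be sketched as below. HeuristicLab uses its own persistence mechanism, so everything here is a hypothetical stand-in: the `CrossValidationSettings` class, the `load_settings` helper, and the dictionary format are invented for illustration. The sketch also assumes that files saved before the flag existed should load with shuffling switched off.

```python
from dataclasses import dataclass

@dataclass
class CrossValidationSettings:  # hypothetical stand-in, not the HeuristicLab class
    folds: int
    shuffle_samples: bool = False

def load_settings(stored: dict) -> CrossValidationSettings:
    """Rebuild settings from stored data, tolerating files that were
    written before the shuffle flag existed."""
    return CrossValidationSettings(
        folds=stored["folds"],
        # Older files have no such key; fall back to the pre-feature behavior.
        shuffle_samples=stored.get("shuffle_samples", False),
    )

old_file = {"folds": 10}                           # saved before the flag was added
new_file = {"folds": 10, "shuffle_samples": True}  # saved with the flag

print(load_settings(old_file))
print(load_settings(new_file))
```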

comment:12 Changed 4 weeks ago by mkommend

r15077: Reordered the backwards-compatibility code and the event registration in the after-deserialization hook of CrossValidation.

comment:13 Changed 4 weeks ago by mkommend

  • Owner changed from mkommend to bburlacu
  • Status changed from reviewing to assigned

Review comments

CrossValidationView

  • The ShuffleSamples checkbox should be checked / unchecked in OnContentChanged and not SetEnabledStateOfControls
  • I would enable the ShuffleSamples checkbox only if the CrossValidation is prepared.

CrossValidation

  • Why do extracting and aggregating regression / classification solutions work differently (clone of problemData)?

comment:14 Changed 3 weeks ago by bburlacu

  • Owner changed from bburlacu to mkommend
  • Status changed from assigned to reviewing

r15111: Set the check state of the ShuffleSamples checkbox in the OnContentChanged method. Enable the checkbox only when the CrossValidation is prepared.

Regarding the different way of cloning the classification solution: this is done differently to account for a special use case when the GBT algorithm with the logistic regression loss function returns a regression solution (from which a new classification solution is built).

comment:15 Changed 3 weeks ago by mkommend

Reviewed r15111.

comment:16 Changed 3 weeks ago by mkommend

  • Status changed from reviewing to readytorelease

comment:17 Changed 3 weeks ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from readytorelease to assigned

comment:18 Changed 3 weeks ago by gkronber

  • Status changed from assigned to accepted

comment:19 Changed 3 weeks ago by gkronber

  • Status changed from accepted to readytorelease

comment:20 Changed 3 weeks ago by gkronber

r15150: merged r14864, r14865, r14904, r15002, r15026, r15077, r15111 from trunk to stable (all changesets merged)

comment:21 Changed 3 weeks ago by gkronber

Depends on #2723.

comment:22 Changed 3 weeks ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed