For model validation and inspection an analysis of residuals over input variables could be insightful
Description
In model validation we should check whether the distribution of residuals is independent of the inputs and the target variable. Visible patterns in the distribution of residuals are a hint that the model does not fit the available data well.
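The check described above can be sketched in a few lines of Python. This is only an illustration, not HeuristicLab code: the function names (`pearson`, `residual_correlations`) and the toy data are made up for this example, and the Pearson correlation used here only detects linear dependence, so it is a crude numeric stand-in for looking at a scatter plot of residuals over each variable.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def residual_correlations(inputs, target, predictions):
    """Return the residuals and their correlation with each input and the target.

    `inputs` is a dict mapping a (hypothetical) variable name to its column.
    """
    residuals = [y - p for y, p in zip(target, predictions)]
    report = {name: pearson(column, residuals) for name, column in inputs.items()}
    report["target"] = pearson(target, residuals)
    return residuals, report


# Toy data: the true relationship is y = 3 * x1, but the model predicts 2 * x1,
# so the residuals grow with x1 and are unrelated to x2.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 0.0, 1.0, 1.0, 0.0, 2.0]
target = [3.0 * v for v in x1]
predictions = [2.0 * v for v in x1]

residuals, report = residual_correlations({"x1": x1, "x2": x2}, target, predictions)
for name, r in report.items():
    print(f"{name}: correlation with residuals = {r:.3f}")
```

A correlation far from zero for some variable (here `x1`) is the kind of systematic pattern the description refers to. A value near zero does not prove independence, which is why an interactive plot is still useful.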
We already have the nice bubble chart for the analysis of experiments, but it works only for runs and not for arbitrary tabular data. However, the bubble chart easily handles 10,000 runs, so it could potentially be used for this purpose as well.
comment:1 Changed 2 months ago by gkronber
comment:2 Changed 2 months ago by gkronber
r14890: removed reference to resx file
comment:3 Changed 2 months ago by gkronber
TODO:
- Clean up code
- Add absolute rel. error and abs. error
- Maybe: restrict number of entries in xAxis dropdown box
  - decided against this as I found that it is convenient to color rows by residuals and then show a scatter plot of two input variables.
- Move calculated entries (error, prediction, ...) to the top of the dropdown box
- Remove entries in the dropdown box which are constant
comment:4 follow-up: ↓ 6 Changed 2 months ago by abeham
Having such a view for any tabular data would be really awesome. Maybe something like a dataframe in R? The IDataset is already pretty close to a dataframe. And the Dataset could easily be moved to HeuristicLab.Data where it would be more generally usable as a data structure for tabular data. Row names would probably still be nice to have...
comment:5 Changed 7 weeks ago by gkronber
- cleaned code (use '>' as a marker for calculated variables)
- added absolute residual and error
comment:6 in reply to: ↑ 4 Changed 7 weeks ago by gkronber
Replying to abeham:
Having such a view for any tabular data would be really awesome. Maybe something like a dataframe in R? The IDataset is already pretty close to a dataframe. And the Dataset could easily be moved to HeuristicLab.Data where it would be more generally usable as a data structure for tabular data. Row names would probably still be nice to have...
Related to efforts in #2726?
comment:7 Changed 7 weeks ago by abeham
#2726 is based on IndexedDataTable showing algorithm performance over evaluations as a line chart. I don't think it is related.
comment:8 Changed 4 weeks ago by gkronber
- Status changed from new to accepted
comment:9 Changed 3 weeks ago by gkronber
r15024: hide 'constant variables' (only one distinct value)
comment:10 Changed 3 weeks ago by gkronber
I have used and tested this extensively over the last few weeks and found it really helpful for finding systematic errors in models.
comment:11 Changed 3 weeks ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
r14889: added a solution view which uses the bubble chart for interactive visualization of model residuals. (HACK)
Also made small modifications to the bubble chart.