Opened 5 years ago

Closed 4 years ago

#1941 closed feature request (done)

Additional classification problem instances

Reported by: sforsten
Owned by: abeham
Priority: medium
Milestone: HeuristicLab 3.3.8
Component: Problems.Instances
Version: 3.3.8
Keywords:
Cc:

Description

Add further classification problems from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) to the problem instances.

Change History (24)

comment:1 Changed 5 years ago by gkronber

  • Summary changed from Adding addtitional classification problems to Addtitional classification problems
  • Type changed from enhancement to feature request

comment:2 Changed 5 years ago by sforsten

  • Status changed from new to accepted

comment:3 Changed 5 years ago by sforsten

r8595:

  • renamed real world classification to UCI
  • added wine classification problem
  • deleted the Iris class, which had been forgotten in r8254

comment:4 Changed 5 years ago by sforsten

r8596: adjusted unit tests

comment:5 Changed 5 years ago by mkommend

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.8

comment:6 Changed 5 years ago by abeham

I just received this error when building HL as Release and x86 using Build.cmd:

"c:\Windows\Microsoft.NET\Framework64\v4.0.30319\Microsoft.Common.targets(483,9): error : The OutputPath property is not set for project 'HeuristicLab.Problems.Instances.DataAnalysis.Views-3.3.csproj'. Please check to make sure that you have specified a valid combination of Configuration and Platform for this project. Configuration='Release' Platform='x86'. You may be seeing this message because you are trying to build a project without a solution file, and have specified a non-default Configuration or Platform that doesn't exist for this project. [D:\HL3\trunk\sources\HeuristicLab.Problems.Instances.DataAnalysis.Views\3.3\HeuristicLab.Problems.Instances.DataAnalysis.Views-3.3.csproj]"

comment:7 Changed 5 years ago by abeham

r8781: added missing platforms in Problems.Instances.DataAnalysis.Views
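
The error above arises when a project file has no OutputPath for the requested Configuration/Platform pair. A minimal sketch of how such gaps could be detected before running Build.cmd is shown below. It assumes the classic MSBuild project layout, in which each PropertyGroup carries a Condition of the form '$(Configuration)|$(Platform)' == 'Release|x86'. The helper names and the sample project are hypothetical and are not part of HeuristicLab.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down project file that only defines AnyCPU builds.
SAMPLE_CSPROJ = """
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|AnyCPU' ">
    <OutputPath>bin/Debug/</OutputPath>
  </PropertyGroup>
  <PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Release|AnyCPU' ">
    <OutputPath>bin/Release/</OutputPath>
  </PropertyGroup>
</Project>
""".strip()


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}PropertyGroup'."""
    return tag.rsplit("}", 1)[-1]


def configured_pairs(csproj_xml):
    """Return the 'Configuration|Platform' pairs that define an OutputPath."""
    root = ET.fromstring(csproj_xml)
    pairs = set()
    for element in root.iter():
        if local_name(element.tag) != "PropertyGroup":
            continue
        condition = element.get("Condition", "")
        has_output_path = any(local_name(c.tag) == "OutputPath" for c in element)
        if has_output_path and "==" in condition:
            # Take the right-hand side of the comparison and drop the quotes.
            pairs.add(condition.split("==", 1)[1].strip().strip("'"))
    return pairs


def missing_pairs(csproj_xml, required):
    """Return the required pairs for which the project sets no OutputPath."""
    return sorted(set(required) - configured_pairs(csproj_xml))
```

Calling missing_pairs(SAMPLE_CSPROJ, ["Release|x86"]) on the sample reports the pair that would trigger the build error, since only the AnyCPU platforms are defined there.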

comment:8 Changed 5 years ago by sforsten

  • Owner changed from sforsten to abeham
  • Status changed from accepted to reviewing

r8841: added Iris and Thyroid problems from UCI

comment:9 Changed 5 years ago by abeham

  • Owner changed from abeham to sforsten
  • Status changed from reviewing to assigned

I see a problem with Thyroid in that there are many versions and different databases. To avoid confusion, I think we should distinguish them by name. I would suggest something like "Thyroid, S. Aeberhard, 1992" (name, donor, year received), as this seems to be the version that you have added. The same should be done for the other UCI datasets.
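
The suggested naming scheme (name, donor, year received) can be sketched as follows. The class and field names are hypothetical and only illustrate the convention; they are not taken from the HeuristicLab sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UciDescriptor:
    """Hypothetical descriptor holding the three parts of the display name."""
    name: str
    donor: str
    year: int

    @property
    def display_name(self):
        # Follows the "name, donor, year received" pattern from the comment.
        return f"{self.name}, {self.donor}, {self.year}"


thyroid = UciDescriptor(name="Thyroid", donor="S. Aeberhard", year=1992)
```

Because donor and year are part of the display name, two versions of a dataset with the same base name remain distinguishable in a list of problem instances.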

comment:10 Changed 5 years ago by sforsten

  • Owner changed from sforsten to abeham
  • Status changed from assigned to reviewing

r8889: added IUCIDataDescriptor, which attaches more information to the datasets so that different versions can be distinguished, as suggested by abeham

comment:11 follow-up: Changed 5 years ago by abeham

I looked through some other available datasets and found two that could be interesting to add as benchmarks:

  • Vertebral Column (the 3-class dataset) has 310 instances and 6 attributes
  • Parkinson (remove the first attribute (it's a string) and move the class attribute to the end) - seems to be suited for k-NN (after a quick analysis with Weka), has few instances and a skewed class distribution

Please shuffle both datasets randomly before adding them.
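
The preprocessing described for the Parkinson data (drop the first attribute, move the class attribute to the end, shuffle) could look roughly like the sketch below. The function name, the column names used in the usage note, and the fixed seed are assumptions made for illustration. Shuffling is optional so that the column reordering can be applied on its own.

```python
import random


def prepare_rows(header, rows, class_column, seed=42, shuffle=True):
    """Drop the first column, move the class column to the end, optionally shuffle.

    `header` is a list of column names and `rows` a list of equally long lists.
    A fixed seed keeps the shuffled order reproducible.
    """
    class_idx = header.index(class_column)
    if class_idx == 0:
        raise ValueError("the class column must not be the dropped first column")
    # Keep every column except the first and the class column, then append the class.
    keep = [i for i in range(1, len(header)) if i != class_idx] + [class_idx]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    if shuffle:
        random.Random(seed).shuffle(new_rows)
    return new_header, new_rows
```

For a header such as ["name", "f1", "status", "f2"] with class column "status", the resulting header is ["f1", "f2", "status"].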

Last edited 4 years ago by gkronber

comment:12 Changed 5 years ago by abeham

  • Owner changed from abeham to sforsten
  • Status changed from reviewing to assigned

comment:13 Changed 5 years ago by sforsten

r8902: fixed CreateGpSymbolicClassificationSampleTest and therefore also RunGpSymbolicClassificationSampleTest

comment:14 Changed 5 years ago by sforsten

  • Owner changed from sforsten to abeham
  • Status changed from assigned to reviewing

r8903:

  • added two additional classification instances as suggested by abeham
  • fixed a bug in UCIInstanceProvider

comment:15 Changed 5 years ago by abeham

  • Status changed from reviewing to readytorelease

r8908: Abbreviated first names as in the other datasets

comment:16 Changed 5 years ago by gkronber

  • Summary changed from Addtitional classification problems to Additional classification problem instances

comment:17 Changed 4 years ago by sforsten

  • Owner changed from abeham to sforsten
  • Status changed from readytorelease to assigned

comment:18 Changed 4 years ago by sforsten

  • Status changed from assigned to accepted

comment:19 Changed 4 years ago by sforsten

  • Owner changed from sforsten to mkommend
  • Status changed from accepted to reviewing

r9208:

  • added the Wisconsin breast cancer problem instance
  • corrected the Iris dataset
  • changed classification data descriptors to be able to set training and test partition as well as input and target variables (in the same way as it is done in regression)

comment:20 Changed 4 years ago by mkommend

  • Status changed from reviewing to readytorelease

comment:21 in reply to: ↑ 11 Changed 4 years ago by gkronber

  • Owner changed from mkommend to gkronber
  • Status changed from readytorelease to assigned

Replying to abeham:

> I looked through some other available datasets and found two that could be interesting to add as benchmarks:
>
>   • Vertebral Column (the 3-class dataset) has 310 instances and 6 attributes
>   • Parkinson (remove the first attribute (it's a string) and move the class attribute to the end) - seems to be suited for k-NN (after a quick analysis with Weka), has few instances and a skewed class distribution
>
> Please shuffle both datasets randomly before adding them.

I just checked the 'Parkinsons' data set and strongly believe that its instances should not be shuffled. The data set contains multiple measurements for each patient, so if it is shuffled, it must be guaranteed that all measurements from a patient end up either in the training partition or in the test partition, but never in both. I also verified this in HL using the original and the shuffled data set (the shuffled version leads to much better estimated values).
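
The requirement that all measurements of a patient stay in one partition can be illustrated with a group-aware split. This is a minimal sketch and not the HeuristicLab implementation. The function name, the use of a patient identifier per row, and the default test fraction are assumptions.

```python
import random
from collections import defaultdict


def grouped_split(patient_ids, test_fraction=0.3, seed=0):
    """Split row indices so that no patient appears in both partitions.

    `patient_ids` holds one patient identifier per row. The fraction is applied
    to the number of patients, not to the number of rows.
    """
    rows_by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        rows_by_patient[patient].append(index)

    # Shuffle whole patients instead of single measurements.
    patients = sorted(rows_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train, test
```

Shuffling at the patient level keeps repeated measurements of the same person from leaking between training and test data, which is the effect that makes the estimates on a row-wise shuffled data set look better than they should.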

Last edited 4 years ago by gkronber

comment:22 Changed 4 years ago by gkronber

  • Owner changed from gkronber to abeham
  • Status changed from assigned to reviewing

r9453: removed shuffling for the Parkinson's data set from UCI.

comment:23 Changed 4 years ago by abeham

  • Status changed from reviewing to readytorelease

You're right, thx!

comment:24 Changed 4 years ago by swagner

  • Resolution set to done
  • Status changed from readytorelease to closed
  • Version changed from 3.3.7 to 3.3.8