Opened 5 years ago

Closed 4 years ago

#1943 closed feature request (done)

Add nearest neighbour model for symbolic classification

Reported by: abeham Owned by: abeham
Priority: medium Milestone: HeuristicLab 3.3.8
Component: Problems.DataAnalysis.Symbolic.Classification Version: 3.3.8
Keywords: Cc:

Description

Instead of using thresholds to discretize the estimated values into classes one could also learn a nearest neighbour model on some of the estimated values and use it to classify the others.

Change History (13)

comment:1 Changed 5 years ago by abeham

  • Status changed from new to accepted

comment:2 Changed 5 years ago by abeham

r8606: Added nearest neighbor model (+creator) for symbolic classification

comment:3 Changed 5 years ago by abeham

  • Owner changed from abeham to gkronber
  • Status changed from accepted to reviewing

comment:4 Changed 4 years ago by gkronber

  • Owner changed from gkronber to abeham
  • Status changed from reviewing to assigned

review comments for r8606:

  • a few comments in SymbolicNearestNeighbourClassificationModel.GetEstimatedClassValues() would be nice. I'm not sure I understand everything correctly...
  • SymbolicNearestNeighbourClassificationModel ignores upper and lower estimation limits!
  • Estimated values can be NaN or infinity. These are not handled correctly. Maybe this is not necessary, but it is suspicious that there is absolutely no code for this.
  • SymbolicNearestNeighbourClassificationModel line 108: is it OK to simply use the first element of the enumerable if there are multiple entries with the same count? Should we use the most frequent class value in this case? The same thing happens in line 89.
  • line 66-73: seems like we could also use a binary search here as trainedTargetPair is sorted.
  • It seems some pairs of estimated & target values are lost in RecaluateValues is it not necessary to keep all pairs and later in GetEstimatedClassValues determine the most frequent target in the neighborhood?
  • In the ModelCreator would it be better to use a larger default value for K than three?

Sorry, I have not yet understood the code fully. Please review my questions and adapt the code where necessary and then assign the ticket back to me. Thanks!

comment:5 Changed 4 years ago by abeham

  • Status changed from assigned to accepted

comment:6 Changed 4 years ago by abeham

  • Owner changed from abeham to gkronber
  • Status changed from accepted to reviewing

r8978

  • a few comments in SymbolicNearestNeighbourClassificationModel.GetEstimatedClassValues() would be nice.
  • SymbolicNearestNeighbourClassificationModel line 108: is it OK to simply use the first element of the enumerable
  • line 66-73: seems like we could also use a binary search here as trainedTargetPair is sorted

Comments have been added, code has been rewritten to use BinarySearch and return class with most samples first

  • SymbolicNearestNeighbourClassificationModel ignores upper and lower estimation limits!

Added handling of estimation limits

  • It seems some pairs of estimated & target values are lost in RecaluateValues

I don't know what you mean, please elaborate

  • In the ModelCreator would it be better to use a larger default value for K than three?

I changed it to 11, but feel free to suggest a reasonable default value

comment:7 Changed 4 years ago by abeham

r8979: sealed class

comment:8 Changed 4 years ago by gkronber

  • Owner changed from gkronber to abeham
  • Status changed from reviewing to assigned

comment:9 Changed 4 years ago by abeham

  • Status changed from assigned to accepted

comment:10 Changed 4 years ago by abeham

  • Owner changed from abeham to gkronber
  • Status changed from accepted to reviewing

r9002: modified nearest neighbor model to include number of samples at each estimated value

The last problem in the review was that the estimated values were flattened with a majority voting at every estimated value (which might contain many samples of different targets). The neighbors were then selected out of the flattened points which loses the information of how many samples have been at a certain value. In this change I keep the number of samples for an estimated value.

comment:11 Changed 4 years ago by gkronber

r9003: I reviewed the changes of r9002 and believe that it should work correctly. However, I was not too happy with the code because of the "dictionary mangling". Therefore, I tried to come up with a simpler solution. Please review my changes and let me know what you think about it.

comment:12 Changed 4 years ago by abeham

  • Owner changed from gkronber to abeham
  • Status changed from reviewing to readytorelease

The code looks good and is readable, but I think it's a bit inefficient, memory-wise, to hold all estimated and training values. Nevertheless, I think it's not worthy to invest more time in this.

comment:13 Changed 4 years ago by swagner

  • Resolution set to done
  • Status changed from readytorelease to closed
  • Version changed from 3.3.7 to 3.3.8
Note: See TracTickets for help on using tickets.