Opened 6 years ago

Closed 3 years ago

Last modified 3 years ago

#1721 closed defect (done)

Persistence of random forest solutions takes a long time and creates really big files

Reported by: gkronber Owned by: gkronber
Priority: medium Milestone: HeuristicLab 3.3.10
Component: Algorithms.DataAnalysis Version: 3.3.5
Keywords: Cc:

Description


Change History (15)

comment:1 Changed 4 years ago by gkronber

  • Priority changed from medium to highest

comment:2 Changed 4 years ago by gkronber

  • Priority changed from highest to medium

comment:3 Changed 4 years ago by gkronber

ALGLIB also provides two procedures for serializing decision forests (dfserialize / dfunserialize), which produce output similar to that of our current serialization code (including the same size).

The size of the double array that is persisted is O(nTrees * nSamples). This bound seems overly pessimistic; only a small fraction of the buffer appears to be actually necessary.
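To get a feeling for the O(nTrees * nSamples) growth, here is a rough back-of-the-envelope sketch. It assumes one 8-byte double per (tree, sample) pair; the real constant factor of the buffer is not stated in this ticket, and the function name is made up for illustration.

```python
def forest_buffer_bytes(n_trees, n_samples, bytes_per_double=8):
    """Upper-bound estimate of the persisted double array.

    Assumption (not from the ticket): exactly one double per
    (tree, sample) pair, i.e. a constant factor of 1.
    """
    return n_trees * n_samples * bytes_per_double


# Example values: 500 trees trained on 10000 rows.
size = forest_buffer_bytes(500, 10000)
print(size / 1e6, "MB")  # 40.0 MB
```

Doubling either the number of trees or the number of rows doubles the estimate, which is why the files get large quickly for big forests.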

comment:4 Changed 4 years ago by gkronber

The space required for storing the trees of a random forest model is much larger than the space required for the original dataset. The trees can be recalculated on demand if the random seed and the original dataset are saved. Recalculation is only necessary for rarely needed functionality. Therefore, it should be possible to improve persistence and reduce memory requirements.
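The idea can be sketched roughly as follows. This is only an illustration in Python, not the HeuristicLab code: the class LazyForestModel and the function train_toy_forest are hypothetical, and the "training" is a toy stand-in (each "tree" is just the mean of a bootstrap sample). What matters is that only the data, the number of trees and the seed are persisted, and that the trees are rebuilt deterministically on first use.

```python
import pickle
import random


def train_toy_forest(targets, n_trees, seed):
    """Toy stand-in for real training: each 'tree' is the mean of a bootstrap sample."""
    rng = random.Random(seed)  # same seed + same data -> identical forest
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(targets) for _ in targets]
        trees.append(sum(sample) / len(sample))
    return trees


class LazyForestModel:
    """Persists data and parameters only; the trees are derived state."""

    def __init__(self, targets, n_trees, seed):
        self.targets = list(targets)
        self.n_trees = n_trees
        self.seed = seed
        self._trees = None  # built on demand, never written to disk

    def _ensure_trees(self):
        # Recalculate the forest the first time it is actually needed.
        if self._trees is None:
            self._trees = train_toy_forest(self.targets, self.n_trees, self.seed)
        return self._trees

    def predict(self):
        trees = self._ensure_trees()
        return sum(trees) / len(trees)

    def __getstate__(self):
        # Drop the cached trees so only data, n_trees and seed are pickled.
        state = self.__dict__.copy()
        state["_trees"] = None
        return state


model = LazyForestModel([1.0, 2.0, 3.0, 4.0], n_trees=50, seed=42)
before = model.predict()
restored = pickle.loads(pickle.dumps(model))  # loads without any trees
assert restored.predict() == before           # trees rebuilt from data + seed
```

The trade-off is that loading becomes cheap while the first operation that needs the trees pays the recalculation cost.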

comment:5 Changed 4 years ago by gkronber

r10306: created a branch for the ticket.

comment:6 Changed 4 years ago by gkronber

r10321: refactored RandomForestModel and changed persistence (store data and parameters instead of model)

Last edited 4 years ago by gkronber

comment:7 Changed 4 years ago by gkronber

r10322: changed the RF model to store the original problem data directly and fixed bugs in backwards compatibility loading and saving.

Persistence has been improved. For example, for the Friedman-II problem (10000 rows), an RF solution with 500 trees takes up 33 MB zipped in HL 3.3.9 and only 1 MB zipped after the changes in this ticket. Loading the old file takes about 5 seconds with no recalculation effort. Loading the new file is almost instantaneous (<1 s); however, when e.g. the line chart is opened for the first time, the recalculation takes around 7 seconds.

comment:8 Changed 4 years ago by gkronber

  • Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.10
  • Status changed from new to accepted

comment:9 Changed 4 years ago by gkronber

  • Owner changed from gkronber to mkommend
  • Status changed from accepted to reviewing

comment:10 Changed 4 years ago by mkommend

Extensively tested the implemented functionality and it works perfectly (file size reduction from ~350 MB to ~1.5 MB). The evaluation time was also acceptable (< 1 s) and the evaluation result stays constant. All in all, a really good solution.

Source code review pending.

comment:11 Changed 3 years ago by mkommend

  • Owner changed from mkommend to gkronber
  • Status changed from reviewing to readytorelease

Reviewed r10321:10322.

Last edited 3 years ago by gkronber

comment:12 Changed 3 years ago by gkronber

r10963: merged r10321:10322 from feature branch to the trunk

comment:13 Changed 3 years ago by gkronber

r10966: deleted feature branch

comment:14 Changed 3 years ago by gkronber

  • Resolution set to done
  • Status changed from readytorelease to closed

comment:15 Changed 3 years ago by gkronber

r11006: merged improved random forest persistence from trunk to stable branch
