#1721 closed defect (done)
Persistence of random forest solutions takes a long time and creates really big files
Reported by: | gkronber | Owned by: | gkronber |
---|---|---|---|
Priority: | medium | Milestone: | HeuristicLab 3.3.10 |
Component: | Algorithms.DataAnalysis | Version: | 3.3.5 |
Keywords: | | Cc: | |
Description
Change History (15)
comment:1 Changed 12 years ago by gkronber
- Priority changed from medium to highest
comment:2 Changed 12 years ago by gkronber
- Priority changed from highest to medium
comment:3 Changed 11 years ago by gkronber
comment:4 Changed 11 years ago by gkronber
The space required for storing the trees of a random forest model is much larger than the space required for the original dataset. If the random seed and the original dataset are saved instead, the trees can be recalculated on demand. Recalculation is only necessary for seldom-used functionality, so both persistence and memory requirements can be improved this way.
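A minimal sketch of this idea is shown below. All names (RebuildForest, ForestParameters, ...) are illustrative placeholders and not the actual HeuristicLab or ALGLIB API; only the seed, the training data, and the forest parameters are kept, and the forest is rebuilt the first time it is needed.

```csharp
// Minimal sketch of lazy recalculation; RebuildForest and ForestParameters are
// hypothetical placeholders, not the actual HeuristicLab or ALGLIB API.
using System;

public sealed class ForestParameters {
  public int NTrees { get; set; }
  public double R { get; set; }   // fraction of samples used per tree
  public double M { get; set; }   // fraction of features evaluated per split
}

public sealed class LazyRandomForestModel {
  private readonly int seed;                    // seed of the original training run
  private readonly double[,] trainingData;      // original dataset (small compared to the trees)
  private readonly ForestParameters parameters;

  private object forest;                        // rebuilt on demand, never persisted

  public LazyRandomForestModel(int seed, double[,] trainingData, ForestParameters parameters) {
    this.seed = seed;
    this.trainingData = trainingData;
    this.parameters = parameters;
  }

  private object Forest {
    get {
      // The trees are recalculated only when they are actually needed,
      // e.g. the first time estimated values are requested.
      if (forest == null) forest = RebuildForest(seed, trainingData, parameters);
      return forest;
    }
  }

  private static object RebuildForest(int seed, double[,] data, ForestParameters p) {
    // Placeholder: in the real model this is where the deterministic
    // retraining of the ALGLIB decision forest would happen.
    throw new NotImplementedException();
  }
}
```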
comment:5 Changed 11 years ago by gkronber
r10306: created a branch for the ticket.
comment:6 Changed 11 years ago by gkronber
r10321: refactored RandomForestModel and changed persistence (store data and parameters instead of model)
comment:7 Changed 11 years ago by gkronber
r10322: changed the RF model to store the original problem data directly and fixed bugs in backward-compatibility loading and saving.
Persistence has been improved. For example, for the Friedman-II problem (10000 rows) an RF solution with 500 trees takes up 33 MB zipped in HL 3.3.9 and only 1 MB zipped after the changes in this ticket. Loading the old file takes about 5 seconds with no recalculation effort. Loading the new file is almost instantaneous (<1 s); however, when e.g. the line chart is opened for the first time, the recalculation takes around 7 seconds.
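The core of the change is that only the inputs needed to retrain the forest are marked as storable, while the trained ALGLIB forest lives in a non-storable field and is recalculated lazily after loading. A rough sketch of this pattern using the HL 3.3 persistence attributes (member names are simplified and not the exact code of r10321/r10322):

```csharp
// Simplified sketch; the real RandomForestModel has additional members and
// backward-compatibility code for files written by HL <= 3.3.9.
using HeuristicLab.Persistence.Default.CompositeSerializers.Storable;
using HeuristicLab.Problems.DataAnalysis;

[StorableClass]
public class RandomForestModelSketch {
  // Persisted: everything needed to rebuild the forest deterministically.
  [Storable] private int seed;
  [Storable] private IRegressionProblemData originalTrainingData;
  [Storable] private int nTrees;
  [Storable] private double r;
  [Storable] private double m;

  // Not persisted: the trained forest is recalculated lazily after loading.
  private alglib.decisionforest forest;

  [StorableConstructor]
  protected RandomForestModelSketch(bool deserializing) { }

  public RandomForestModelSketch(int seed, IRegressionProblemData problemData,
                                 int nTrees, double r, double m) {
    this.seed = seed;
    this.originalTrainingData = problemData;
    this.nTrees = nTrees;
    this.r = r;
    this.m = m;
  }
}
```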
comment:8 Changed 11 years ago by gkronber
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.10
- Status changed from new to accepted
comment:9 Changed 11 years ago by gkronber
- Owner changed from gkronber to mkommend
- Status changed from accepted to reviewing
comment:10 Changed 11 years ago by mkommend
Extensively tested the implemented functionality and it works perfectly (file size reduction from ~350 MB to ~1.5 MB). The evaluation time was also acceptable (<1 s) and the evaluation results stay constant. All in all, a really good solution.
Source code review pending.
comment:11 Changed 11 years ago by mkommend
- Owner changed from mkommend to gkronber
- Status changed from reviewing to readytorelease
Reviewed r10321:10322.
comment:12 Changed 10 years ago by gkronber
r10963: merged r10321:10322 from feature branch to the trunk
comment:13 Changed 10 years ago by gkronber
r10966: deleted feature branch
comment:14 Changed 10 years ago by gkronber
- Resolution set to done
- Status changed from readytorelease to closed
comment:15 Changed 10 years ago by gkronber
r11006: merged improved random forest persistence from trunk to stable branch
ALGLIB also provides two procedures for serializing decision forests (dfserialize / dfunserialize), which produce output similar to our current serialization code (and of about the same size).
The size of the double array that is persisted is O(nTrees * nSamples). This seems overly pessimistic; only a small fraction of this buffer is actually necessary.
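For reference, the size difference can be checked with a small experiment like the following. This is a sketch only; the signatures of dfbuildrandomdecisionforest and dfserialize are assumed from the ALGLIB 3.x C# interface and should be verified against the version bundled with HeuristicLab.

```csharp
// Rough size check: compares the raw dataset size with the length of the
// string produced by ALGLIB's dfserialize for a 500-tree regression forest.
using System;

public static class ForestSizeCheck {
  public static void Main() {
    const int nSamples = 10000, nVars = 4, nTrees = 500;
    var rnd = new Random(1234);

    // Synthetic regression dataset: last column is the target.
    var xy = new double[nSamples, nVars + 1];
    for (int i = 0; i < nSamples; i++)
      for (int j = 0; j <= nVars; j++)
        xy[i, j] = rnd.NextDouble();

    int info;
    alglib.decisionforest df;
    alglib.dfreport rep;
    // nclasses = 1 -> regression forest, r = fraction of samples used per tree
    alglib.dfbuildrandomdecisionforest(xy, nSamples, nVars, 1, nTrees, 0.66,
                                       out info, out df, out rep);

    string serialized;
    alglib.dfserialize(df, out serialized);

    long rawDataBytes = (long)nSamples * (nVars + 1) * sizeof(double);
    Console.WriteLine("dataset:    {0:N0} bytes", rawDataBytes);
    Console.WriteLine("serialized: {0:N0} chars (grows roughly with nTrees * nSamples)",
                      serialized.Length);
  }
}
```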