#1721 closed defect (done)

Persistence of random forest solutions takes a long time and creates really big files

Reported by:	gkronber	Owned by:	gkronber
Priority:	medium	Milestone:	HeuristicLab 3.3.10
Component:	Algorithms.DataAnalysis	Version:	3.3.5
Keywords:		Cc:

Description

Change History (15)

comment:1 Changed 12 years ago by gkronber

Priority changed from medium to highest

comment:2 Changed 12 years ago by gkronber

Priority changed from highest to medium

comment:3 Changed 12 years ago by gkronber

ALGLIB also provides two procedures for serializing decision forests (dfserialize / dfunserialize), which produce output similar to that of our current serialization code (and of the same size).

The size of the persisted double array is O(nTrees * nSamples). This bound seems overly pessimistic; only a small fraction of the buffer appears to be actually needed.
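To give a feel for the O(nTrees * nSamples) growth, here is a rough back-of-the-envelope sketch in Python. It assumes one 8-byte double per (tree, sample) pair, which is an illustrative constant factor and not the actual ALGLIB buffer layout; the function name and parameters are hypothetical. The inputs are the Friedman-II figures mentioned elsewhere in this ticket (10000 rows, 500 trees).

```python
def buffer_size_bytes(n_trees, n_samples, doubles_per_sample=1, bytes_per_double=8):
    """Upper-bound estimate of a buffer that grows as O(n_trees * n_samples).

    doubles_per_sample is an assumed constant factor, not taken from ALGLIB.
    """
    return n_trees * n_samples * doubles_per_sample * bytes_per_double


# 500 trees, 10000 training rows, one double per (tree, sample) pair (assumption)
estimate = buffer_size_bytes(n_trees=500, n_samples=10000)
print(f"{estimate / 1e6:.0f} MB uncompressed")
```

The estimate only shows how the buffer scales with the number of trees and samples; the real size depends on how many doubles ALGLIB reserves per node.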

comment:4 Changed 11 years ago by gkronber

Storing the trees of a random forest model takes much more space than the original dataset. The trees can be recalculated on demand if the random seed and the original dataset are saved. Recalculation is only necessary for rarely used functionality, so it should be possible to reduce both the persisted size and the memory requirements.
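The idea can be sketched as follows, in Python rather than HeuristicLab's C#. This is a minimal illustration under stated assumptions, not the actual RandomForestModel code: the class name `LazyForestModel`, its methods, and the stand-in "training" (each tree is just the mean target of a bootstrap sample) are all hypothetical. What it shows is that only the data, seed and parameters are persisted, and the trees are rebuilt deterministically on first access.

```python
import json
import random


class LazyForestModel:
    """Hypothetical model that persists data + seed + parameters, never the trees."""

    def __init__(self, rows, targets, seed, n_trees):
        self.rows = rows
        self.targets = targets
        self.seed = seed
        self.n_trees = n_trees
        self._trees = None  # derived state, rebuilt on demand

    def _train(self):
        # Stand-in for real tree induction (assumption): each "tree" is the
        # mean target of a bootstrap sample drawn with a seeded generator.
        rng = random.Random(self.seed)
        n = len(self.rows)
        trees = []
        for _ in range(self.n_trees):
            sample = [rng.randrange(n) for _ in range(n)]
            trees.append(sum(self.targets[i] for i in sample) / n)
        return trees

    @property
    def trees(self):
        # Recalculate only when the trees are actually needed.
        if self._trees is None:
            self._trees = self._train()
        return self._trees

    def to_json(self):
        # Persist only what is needed to reproduce the trees.
        return json.dumps({
            "rows": self.rows,
            "targets": self.targets,
            "seed": self.seed,
            "n_trees": self.n_trees,
        })

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


model = LazyForestModel(
    rows=[[0.0], [1.0], [2.0], [3.0]],
    targets=[0.0, 1.0, 4.0, 9.0],
    seed=42,
    n_trees=5,
)
restored = LazyForestModel.from_json(model.to_json())
# Loading is cheap: nothing is trained until `restored.trees` is first read.
```

Because the generator is seeded, the restored model rebuilds exactly the same trees as the original. The cost moves from load time to the first use that needs the trees.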

comment:5 Changed 11 years ago by gkronber

r10306: created a branch for the ticket.

comment:6 Changed 11 years ago by gkronber

r10321: refactored RandomForestModel and changed persistence (store data and parameters instead of model)

Last edited 11 years ago by gkronber

comment:7 Changed 11 years ago by gkronber

r10322: changed the RF model to store the original problem data directly and fixed bugs in backwards compatibility loading and saving.

Persistence has been improved. For example, for the Friedman-II problem (10000 rows), an RF solution with 500 trees takes up 33 MB zipped in HL 3.3.9 and only 1 MB zipped after the changes in this ticket. Loading the old file takes about 5 seconds with no recalculation effort. Loading the new file is almost instantaneous (< 1 s); however, when e.g. the line chart is opened for the first time, the recalculation takes around 7 seconds.

comment:8 Changed 11 years ago by gkronber

Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.10
Status changed from new to accepted

comment:9 Changed 11 years ago by gkronber

Owner changed from gkronber to mkommend
Status changed from accepted to reviewing

comment:10 Changed 11 years ago by mkommend

Extensively tested the implemented functionality and it works perfectly (file size reduction from ~350 MB to ~1.5 MB). The evaluation time was also acceptable (< 1 s) and the evaluation result stays constant. All in all, a really good solution.

Source code review pending.