
source: branches/OKBJavaConnector/ECJClient/src/ec/eval/README @ 10785

Last change on this file since 10785 was 6152, checked in by bfarka, 14 years ago

added ecj and custom statistics to communicate with the okb services #1441

This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come on-line at any time, and you can add new processes at any time,
perhaps if new machines become available, or to replace slaves which have
died for some reason.  If a remote slave dies, ECJ gracefully handles
it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so have a single random number
generator each.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on
current wall clock time.  Each time a slave goes on-line, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of any seeds in their parameter files.

You can use the master/slave evaluator freely in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples for that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

java ec.Evolve -file foo.master.params

  (where foo.master.params might have parent.0 be normal ec
         parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

java ec.eval.Slave -file foo.slave.params

  (where foo.slave.params might have parent.0 be normal ec
         parameters, and have parent.1 = slave.params)
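
As a minimal sketch of the two wrapper files just described, assuming
your normal ec parameters live in a file called foo.params (a placeholder
name, as are foo.master.params and foo.slave.params), the files might
look something like this:

```
# foo.master.params (placeholder name)
parent.0 = foo.params
parent.1 = master.params

# foo.slave.params (placeholder name)
parent.0 = foo.params
parent.1 = slave.params
```

The two files are shown together only for brevity; each would be its own
file on disk.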

...and it should all nicely work!  The system works fine under
checkpointing as far as we know.





MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

eval.i-am-slave = true

The master sets up distributed evaluation by loading a special class,
called a MasterProblem, which replaces the Problem during evaluation.
The MasterProblem is defined by the following *master* parameter:

eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class.
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

The MasterProblem's job is to send individuals to remote slaves for
evaluation.  Thus a slave should not load it, but instead should load
the regular Problem instance.  The slave does this by checking the
eval.i-am-slave parameter.  If it's true, it simply ignores the
eval.masterproblem parameter entirely.
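
The prototype replacement described above can be pictured with a small
Java sketch.  This is only an illustration of the idea: the types below
are tiny stand-ins invented for this example (they are not ECJ's
Problem, MasterProblem, or Evaluator classes), and choosePrototype is a
hypothetical helper name.

```java
// Hypothetical sketch: these tiny types only mimic the roles described
// above and are not ECJ's Problem, MasterProblem, or Evaluator classes.
class ProblemSetupSketch {
    interface Problem { String describe(); }

    static class LocalProblem implements Problem {
        public String describe() { return "local"; }
    }

    // Stands in for the real Problem but keeps it in a 'problem' field.
    static class MasterProblem implements Problem {
        final Problem problem;
        MasterProblem(Problem problem) { this.problem = problem; }
        public String describe() { return "master(" + problem.describe() + ")"; }
    }

    // Mirrors the rule in the text: wrap only if eval.masterproblem is
    // set and eval.i-am-slave is not true.
    static Problem choosePrototype(Problem original, boolean masterProblemSet, boolean iAmSlave) {
        if (masterProblemSet && !iAmSlave) return new MasterProblem(original);
        return original;
    }

    public static void main(String[] args) {
        System.out.println(choosePrototype(new LocalProblem(), true, false).describe());
        System.out.println(choosePrototype(new LocalProblem(), true, true).describe());
    }
}
```

On the master the original problem is wrapped and kept around; on a
slave the same parameters leave the original problem in place.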

More information about the architecture may be found near the end of
this file.

The master listens on a socket for new slaves to arrive and register
themselves.  When slaves are fired up, they attempt to attach to this
socket and negotiate a connection.  This means that both the master and
the slave need to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream, and
communication is compressed by default.  Note that Java's compression
routines are broken (they don't support PARTIAL_FLUSH), so we have to
rely on a third-party library called JZLIB.  You can get the jar file
for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
Compression is specified in the *master and slave* parameter:

eval.compression = true

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

eval.slave-name = put-my-name-here

The master can print out various debug information if you turn on the
following parameter (shown here in its off state; set it to true to
enable the output):

eval.masterproblem.debug-info = false
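
Putting the parameters above together, one possible layout is a shared
file read by everyone plus a small slave-only file on top of it.  This is
just a sketch: the file names shared.params and slave-extra.params are
made up for this example, and the host and slave name are the same
placeholders used above.

```
# shared.params (made-up name): read by both master and slaves
eval.masterproblem = ec.eval.MasterProblem
eval.master.port = 15000
eval.compression = true

# slave-extra.params (made-up name): read by slaves only
parent.0 = shared.params
eval.i-am-slave = true
eval.master.host = put.the.master.ip.address.here
eval.slave-name = put-my-name-here
```

Because eval.i-am-slave is true in the slave-only file, the slave ignores
the eval.masterproblem line it inherits from the shared file.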




JOBS

A slave receives chunks of individuals to evaluate and return as a
group.  This chunk is called a "job".  If you're doing non-coevolutionary
evolution, you can specify how many individuals the slave should receive
at one time with the *master* parameter (here set to 100):

eval.masterproblem.job-size = 100

If you have very small individuals, or fast evaluation, a larger job
size makes better use of network bandwidth, as the master sends the
individuals as a group, the slave evaluates them all, and then returns
them all.  This is because more individuals can get packed into a TCP/IP
packet before it's sent out, minimizing overhead.  However, the primary
effect of changing the job-size is to modify the "slave evolution"
population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave maintains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each of the slave
connections keeps its TCP/IP buffers stuffed with as many of these
queued jobs as it can, so jobs are available at the Slaves before
they even realize it.  To set the queue size, you use the *master*
parameter (here it's being set to 100):

eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep the max-jobs-per-slave
a bit low.
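
To make the round-robin behavior a little more concrete, here is a rough
Java sketch.  It is not ECJ's SlaveMonitor code: the class and method
names are hypothetical, jobs are just integers, and a job that finds
every queue full is simply left unassigned (the real master would wait
for a queue to drain instead).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT ECJ's SlaveMonitor.  It only illustrates
// round-robin assignment with a per-slave queue limit.
class RoundRobinSketch {
    // Returns, for each slave, the list of job ids placed in its queue.
    static List<List<Integer>> assign(int numJobs, int numSlaves, int maxJobsPerSlave) {
        List<List<Integer>> queues = new ArrayList<>();
        for (int s = 0; s < numSlaves; s++) queues.add(new ArrayList<>());
        int next = 0;  // slave to try first for the next job
        for (int job = 0; job < numJobs; job++) {
            for (int tries = 0; tries < numSlaves; tries++) {
                List<Integer> queue = queues.get((next + tries) % numSlaves);
                if (queue.size() < maxJobsPerSlave) {
                    queue.add(job);
                    next = (next + tries + 1) % numSlaves;
                    break;
                }
            }
        }
        return queues;
    }

    public static void main(String[] args) {
        // 5 jobs, 3 slaves, at most 2 queued jobs per slave
        System.out.println(assign(5, 3, 2));
    }
}
```

With a low queue limit the jobs spread across all the slaves; with a
high one, a slave that arrives late finds the work already handed out.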

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.



SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates the individuals
in it and returns them or their resulting fitnesses (see 'FITNESS VERSUS
INDIVIDUAL' below).  In "evolve" mode, the slave evaluates its
individuals; but if it has some extra time on its hands, it then treats
the individuals as a little population and does some evolution on it.
When time is up, it returns the most recent individuals in its little
population in lieu of the original individuals.  This is particularly
useful when your evaluation procedure is very fast compared to the
amount of time spent reading and writing individuals over the network.
Note that this only works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

eval.runtime = 6000

Last, you'll need to turn this *slave* parameter on to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

eval.return-inds = true

The size of this mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

eval.masterproblem.job-size = 100







FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However it is possible that some evaluations
of individuals literally change them during evaluation.  If you have
done such an evil thing, you'll need to have the modified individual
shipped back to the master for reinsertion.  Be sure to change the
appropriate *slave* parameter:

eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.



SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and particularly
its Population structure, look and feel as much like the Master's as
possible.  Notably, Subpopulation classes, Species, Individual
representations, and Fitnesses should be the same.  This is particularly
important if you want to do evolution on the Slave.  The easiest way
to do this is simply to derive the Slave's parameter file from the Master's
parameter file.

However, if you're doing evolution on the Slave, this doesn't mean you have
to have the Slave set up the same way as the Master: just the Population
and objects hanging off of it.  It might be preferable to do (for example)
generational evolution on the Slave while asynchronous steady state
evolution is happening on the Master.  You're free to change the high-level
evolution procedures; but you should maintain the same representations and
breeding mechanisms so that Individuals on the Slave are valid on the Master
when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the i-am-slave
parameter.  The Evaluator class will not set up a MasterProblem, even if one
is specified, if i-am-slave is true.




ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance.

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearing house for remotely evaluating individuals handed it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs a worker
thread which reads and writes jobs to/from the Slave.  MasterProblems
submit jobs to the SlaveMonitor, which in turn distributes them to an
available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

  problem.prepareToEvaluate(...)
  for each individual
    problem.evaluate(individual,...)
  problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to bulk
up all the individuals as it likes.  The MasterProblem will collect a
full job's worth of individuals, then send them out to a slave, then
collect another job's worth of individuals, then send THEM out to
another slave, and so on.  When SimpleEvaluator calls finishEvaluating,
the remaining individuals are sent out as a (possibly short) job, and
then the MasterProblem blocks waiting until all the individuals have
come back.
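
The batching just described can be sketched in a few lines of Java.
This is only an illustration under simplifying assumptions, not ECJ's
MasterProblem: the class name is hypothetical, individuals are plain
strings, and "sending" a job just records it in a list rather than
writing it to a slave.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the batching idea only; it is not ECJ's
// MasterProblem and it "sends" jobs by recording them in a list.
class BatchingSketch {
    private final int jobSize;
    private List<String> current = new ArrayList<>();
    final List<List<String>> sentJobs = new ArrayList<>();

    BatchingSketch(int jobSize) { this.jobSize = jobSize; }

    void prepareToEvaluate() {
        current = new ArrayList<>();
        sentJobs.clear();
    }

    // Evaluation is deferred: the individual is only added to the job,
    // and the job goes out once it holds jobSize individuals.
    void evaluate(String individual) {
        current.add(individual);
        if (current.size() == jobSize) flush();
    }

    // Sends whatever is left as a (possibly short) final job.  A real
    // implementation would then block until all results have returned.
    void finishEvaluating() {
        if (!current.isEmpty()) flush();
    }

    private void flush() {
        sentJobs.add(current);
        current = new ArrayList<>();
    }

    public static void main(String[] args) {
        BatchingSketch problem = new BatchingSketch(3);
        problem.prepareToEvaluate();
        for (int i = 0; i < 7; i++) problem.evaluate("ind" + i);
        problem.finishEvaluating();
        System.out.println(problem.sentJobs);  // jobs of size 3, 3, and 1
    }
}
```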

The SteadyStateEvaluator performs evaluation in a different way:

  problem.prepareToEvaluate(...) // at the very beginning of evolution!
  loop forever
    if problem.canEvaluate(),
      create/breed individual
      problem.evaluate(individual, ...)
    individual = problem.getNextEvaluatedIndividual()
    if individual != null
      introduce individual to population
    sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, if
problem.evaluatedIndividualAvailable() returns true from the
MasterProblem -- indicating that a job has come back with individuals
for the SteadyStateEvaluator -- then getNextEvaluatedIndividual()
returns the next individual in that job.  This is sort of like the
SteadyStateEvaluator loading individuals onto a bus, and sending it out,
and then when buses come back to the station, the SteadyStateEvaluator
gets the individuals one by one as they disembark from the bus.
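
Here is a small Java sketch of that bus-station loop, under heavy
simplifying assumptions: the class names are hypothetical, individuals
are integers, the loop runs a fixed number of iterations instead of
forever, and the "slave" is faked so that a job comes back the moment it
is filled.  It is not ECJ's SteadyStateEvaluator.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the steady-state loop above.  The "slave" is
// faked: a job comes back, fully evaluated, the moment it is filled.
class SteadyStateSketch {
    static class FakeProblem {
        private final int jobSize;
        private final List<Integer> job = new ArrayList<>();         // bus being loaded
        private final Deque<Integer> returned = new ArrayDeque<>();  // riders getting off

        FakeProblem(int jobSize) { this.jobSize = jobSize; }

        // In ECJ this asks whether some slave queue has room; here
        // there is always room.
        boolean canEvaluate() { return true; }

        void evaluate(int individual) {
            job.add(individual);
            if (job.size() == jobSize) {  // job is full: send it off...
                returned.addAll(job);     // ...and pretend it came straight back
                job.clear();
            }
        }

        // Returns null when no evaluated individual is waiting.
        Integer getNextEvaluatedIndividual() { return returned.poll(); }
    }

    // Runs a bounded number of loop iterations and returns the
    // individuals introduced to the population, in order.
    static List<Integer> run(int jobSize, int iterations) {
        FakeProblem problem = new FakeProblem(jobSize);
        List<Integer> population = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            if (problem.canEvaluate()) problem.evaluate(i);  // "breed" individual i
            Integer done = problem.getNextEvaluatedIndividual();
            if (done != null) population.add(done);
        }
        return population;
    }

    public static void main(String[] args) {
        System.out.println(run(3, 6));
    }
}
```

Nothing enters the population until the first job of three has been
filled; after that, individuals come off one per iteration.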

A note on how individuals come back from the slaves.  You'd think that
the individuals are sent out, and either revised fitnesses are read back
and replace their old fitnesses; or new individuals replace the old
individuals.  But that's not the case.  This is because the
MasterProblem actually has no idea where the individuals are stored that
it's receiving.  Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals into a second array of
'scratch individuals'.  If fitnesses come back, for each individual i in
the scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals would
be corrupted and our whole evolutionary process would be messed up.
That's why we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did!  Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it, using
Individual.writeIndividual(pipeOutputStream).  We then read from that
same pipe into the original individual, using
Individual.readIndividual(pipeInputStream).  It's a hack but a pretty
one.
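
The write-then-read trick can be sketched with plain java.io streams.
This is an illustration only: Ind is a made-up stand-in for an ECJ
Individual, and an in-memory byte buffer plays the part of the pipe, so
it does not show how ec/util/DataPipe.java itself is written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an ECJ Individual; the real class and
// ec/util/DataPipe.java differ.  Only the write-then-read copy is shown.
class PipeCopySketch {
    static class Ind {
        double[] genome;
        double fitness;

        Ind(double[] genome, double fitness) {
            this.genome = genome;
            this.fitness = fitness;
        }

        void writeIndividual(DataOutput out) throws IOException {
            out.writeDouble(fitness);
            out.writeInt(genome.length);
            for (double gene : genome) out.writeDouble(gene);
        }

        void readIndividual(DataInput in) throws IOException {
            fitness = in.readDouble();
            genome = new double[in.readInt()];
            for (int i = 0; i < genome.length; i++) genome[i] = in.readDouble();
        }
    }

    // Copies scratch into original in place, so the caller never needs
    // to know where 'original' is stored.
    static void copyInto(Ind scratch, Ind original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            scratch.writeIndividual(new DataOutputStream(buffer));
            original.readIndividual(
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Ind original = new Ind(new double[] {0, 0}, 0.0);
        Ind scratch = new Ind(new double[] {1.5, 2.5}, 9.0);
        copyInto(scratch, original);
        System.out.println(original.fitness + " " + original.genome[1]);
    }
}
```

The original object is updated in place, and because the scratch copy is
only written out after it has been read completely from the slave, a
dropped connection cannot leave the original half-filled.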



MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave does come in, the master creates a
SlaveConnection to handle the slave communication.  The SlaveConnection
creates a read thread and a write thread to keep the pipeline to the
slave filled with Individuals.

You have the option of using compressed streams.  Unfortunately,
Java has broken compression -- it doesn't support "partial flush", which
is critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/ and ECJ will detect and use it automatically.


<On Connection>
<- slave name                           readUTF
-> random number generator seed         writeInt
  (note: the seed then increments by SEED_INCREMENT (7919)
   and the Slave is free to use any integer values between
   the seed and seed + SEED_INCREMENT - 1 to seed its
   random number generators.  We hope 7919 is enough :-) )
-> FLUSH
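
As a rough picture of the master's side of this handshake, here is a
Java sketch using java.io data streams.  The class and method names are
hypothetical (ECJ's own code is not reproduced here), in-memory buffers
stand in for the socket, and only the value 7919 is taken from the
listing above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the master's side of the handshake; ECJ's own
// connection code is not reproduced here.
class HandshakeSketch {
    static final int SEED_INCREMENT = 7919;  // value from the listing above
    private int nextSeed;

    HandshakeSketch(int initialSeed) { nextSeed = initialSeed; }

    // Reads the slave's name, replies with a seed, and advances the
    // seed for the next slave.  Returns the name that was read.
    String register(DataInputStream fromSlave, DataOutputStream toSlave) {
        try {
            String name = fromSlave.readUTF();  // <- slave name
            toSlave.writeInt(nextSeed);         // -> random number generator seed
            nextSeed += SEED_INCREMENT;
            toSlave.flush();                    // -> FLUSH
            return name;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: plays the slave's part over in-memory buffers and
    // returns the seed that the slave would read back.
    int handshakeInMemory(String slaveName) {
        try {
            ByteArrayOutputStream slaveBytes = new ByteArrayOutputStream();
            new DataOutputStream(slaveBytes).writeUTF(slaveName);
            ByteArrayOutputStream masterBytes = new ByteArrayOutputStream();
            register(
                new DataInputStream(new ByteArrayInputStream(slaveBytes.toByteArray())),
                new DataOutputStream(masterBytes));
            return new DataInputStream(
                new ByteArrayInputStream(masterBytes.toByteArray())).readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HandshakeSketch master = new HandshakeSketch(1000);
        System.out.println(master.handshakeInMemory("slave-a"));
        System.out.println(master.handshakeInMemory("slave-b"));
    }
}
```

Each slave gets a seed SEED_INCREMENT above the previous slave's, which
leaves it a private range of seeds to draw from.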


<SlaveConnection's writer protocol loops the following:>
-> job type                             writeByte
  (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
   Slave.V_SHUTDOWN)
If Slave.V_SHUTDOWN,
  break connection
If Slave.V_EVALUATEGROUPED,
  -> countVictoriesOnly                 writeBoolean
-> number of individuals in job         writeInt
Loop for number of individuals in job,
  -> individual                         writeIndividual
  -> update ind's fitness               writeBoolean
-> FLUSH


<SlaveConnection's reader protocol loops the following:>
<- arbitrary byte (to block on)         readByte
Loop for number of individuals in job,
  <- returning individual type          readByte
          (Slave.V_INDIVIDUAL, Slave.V_FITNESS, or Slave.V_NOTHING)
  If Slave.V_INDIVIDUAL,
    <- individual                       readIndividual
  Else if Slave.V_FITNESS,
    <- evaluated?                       readBoolean
    <- fitness                          readFitness
  Else if Slave.V_NOTHING,
    (nothing is sent)
<- FLUSH

Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.
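
The framing in the writer and reader listings above can be sketched in
Java with java.io data streams.  Treat this as an illustration under
loud assumptions: the constant values are placeholders (the real ones
are defined in ec.eval.Slave), an int stands in for a whole individual,
a double stands in for a fitness, only the simple (non-grouped) job type
and the fitness-only reply are shown, and fakeReply is a made-up helper
that plays the slave.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the job framing only; not ECJ's SlaveConnection.
class JobFramingSketch {
    // Placeholder values: the real constants are defined in ec.eval.Slave.
    static final byte V_EVALUATESIMPLE = 1;
    static final byte V_FITNESS = 2;

    // Writer side: frames one simple (non-grouped) job.
    static byte[] writeJob(int[] individuals) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(V_EVALUATESIMPLE);   // -> job type
            out.writeInt(individuals.length);  // -> number of individuals in job
            for (int individual : individuals) {
                out.writeInt(individual);      // -> individual (int stand-in)
                out.writeBoolean(true);        // -> update ind's fitness
            }
            out.flush();                       // -> FLUSH
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader side: the count is NOT on the wire.  The caller remembers
    // it from the job it sent, because jobs come back in order.
    static double[] readFitnesses(byte[] reply, int count) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(reply));
            in.readByte();                     // <- arbitrary byte (to block on)
            double[] fitnesses = new double[count];
            for (int i = 0; i < count; i++) {
                byte type = in.readByte();     // <- returning individual type
                if (type != V_FITNESS)
                    throw new IllegalStateException("sketch handles V_FITNESS only");
                in.readBoolean();              // <- evaluated?
                fitnesses[i] = in.readDouble();  // <- fitness (double stand-in)
            }
            return fitnesses;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Made-up helper that plays the slave: builds a fitness-only reply.
    static byte[] fakeReply(double[] fitnesses) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(0);                  // the arbitrary byte
            for (double fitness : fitnesses) {
                out.writeByte(V_FITNESS);
                out.writeBoolean(true);
                out.writeDouble(fitness);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] job = writeJob(new int[] {7, 8});
        double[] back = readFitnesses(fakeReply(new double[] {0.5, 2.0}), 2);
        System.out.println(job.length + " bytes out, " + back.length + " fitnesses back");
    }
}
```

Note how readFitnesses has to be told the count by its caller, which is
exactly the point made in the note above.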

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal is received.