This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come on-line at any time, and you can add new processes at any time,
perhaps as new machines become available, or to replace slaves which
have died for some reason.  If a remote slave dies, ECJ gracefully
handles it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so each has a single random
number generator.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on the
current wall clock time.  Each time a slave comes on-line, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of the seeds in their parameter files.

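The seed hand-out described above can be sketched in plain Java as
follows (a minimal illustration, not ECJ's actual code; the increment of
7919 is the SEED_INCREMENT described in the MASTER/SLAVE NETWORK
PROTOCOL section near the end of this file):

```java
// Sketch of the master's seed hand-out (illustrative, not ECJ's code).
// Each connecting slave gets the current seed; the master then advances
// the seed so that no two slaves share generator seeds.
class SeedHandout {
    static final int SEED_INCREMENT = 7919;

    int currentSeed;

    SeedHandout() {
        // initial seed based on the current wall clock time,
        // truncated to 32 bits
        currentSeed = (int) System.currentTimeMillis();
    }

    /** Returns the seed for the next slave and advances the counter. */
    synchronized int nextSeed() {
        int seed = currentSeed;
        currentSeed += SEED_INCREMENT;
        return seed;
    }
}
```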
You can use the master/slave evaluator freely in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples for that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

    java ec.Evolve -file foo.master.params

(where foo.master.params might have parent.0 be normal ec
parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

    java ec.eval.Slave -file foo.slave.params

(where foo.slave.params might have parent.0 be normal ec
parameters, and have parent.1 = slave.params)

...and it should all nicely work!  The system works fine under
checkpointing as far as we know.
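For concreteness, the two files just described might look like this (a
hypothetical sketch -- the base parameter file and its path are
examples, not part of the distribution):

    # foo.master.params (hypothetical)
    parent.0 = my.normal.ec.params
    parent.1 = master.params

    # foo.slave.params (hypothetical)
    parent.0 = my.normal.ec.params
    parent.1 = slave.params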


MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

    eval.i-am-slave = true

The master sets up distributed evaluation by loading a special class,
called a MasterProblem, which replaces the Problem during evaluation.
The MasterProblem is defined by the following *master* parameter:

    eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class.
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

The MasterProblem's job is to send stuff to remote slaves.  Thus a slave
should not load it, but instead should load the regular Problem
instance.  The slave does this by checking the eval.i-am-slave
parameter.  If it's true, the slave simply ignores the
eval.masterproblem parameter entirely.

More information about the architecture may be found near the end of
this file.

The master listens on a socket for new slaves to arrive and register
themselves.  When slaves are fired up, they attempt to attach to this
socket and negotiate a connection.  This means that both the master and
the slave need to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

    eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

    eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream, and
communication is compressed by default.  Note that Java's compression
routines are broken (they don't support PARTIAL_FLUSH), so we have to
rely on a third-party library called JZLIB.  You can get the jar file
for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
Compression is specified in the *master and slave* parameter:

    eval.compression = true

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

    eval.slave-name = put-my-name-here

The master can print out various debug information if you turn on the
following parameter (it is false by default):

    eval.masterproblem.debug-info = true
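Pulling the parameters in this section together, a master/slave pair
might share a setup along these lines (a hypothetical sketch; the host,
port, and slave name are example values):

    # shared by master and slave
    eval.master.port = 15000
    eval.compression = true

    # master only
    eval.masterproblem = ec.eval.MasterProblem

    # slave only
    eval.i-am-slave = true
    eval.master.host = 192.168.0.10
    eval.slave-name = slave-1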


JOBS

A slave receives chunks of individuals to evaluate and return as a
group.  This chunk is called a "job".  If you're doing non-coevolutionary
evolution, you can specify how many individuals the slave should receive
at one time with the *master* parameter (here set to 100):

    eval.masterproblem.job-size = 100

If you have very small individuals, or fast evaluation, this makes
better use of network bandwidth as it sends them as a group, evaluates
them all, and then returns them all.  This is because more individuals
can get packed into a TCP/IP packet before it's sent out, minimizing
overhead.  However, the primary effect of changing the job-size is
to modify the "slave evolution" population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave maintains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each slave connection
keeps its TCP/IP buffer stuffed with as many of these queued jobs as it
can, so jobs are available at the Slaves before they even realize it.
To set the queue size, you use the *master* parameter (here it's being
set to 100):

    eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep max-jobs-per-slave
a bit low.

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.
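The bounded-queue, round-robin hand-out described above can be sketched
in plain Java like this (an illustration of the scheduling policy only,
not ECJ's actual SlaveMonitor/SlaveConnection code; jobs are just
strings here):

```java
import java.util.*;

// Sketch of the master's job scheduling: each slave connection keeps a
// bounded queue of jobs, and the master hands a new job, round-robin,
// to the next slave whose queue is not yet full.
class JobScheduler {
    final int maxJobsPerSlave;        // eval.masterproblem.max-jobs-per-slave
    final List<Deque<String>> queues = new ArrayList<>();
    int next = 0;                     // round-robin pointer

    JobScheduler(int numSlaves, int maxJobsPerSlave) {
        this.maxJobsPerSlave = maxJobsPerSlave;
        for (int i = 0; i < numSlaves; i++)
            queues.add(new ArrayDeque<String>());
    }

    /** Places the job on the next non-full queue, round-robin; returns
        the chosen slave's index, or -1 if every queue is full. */
    int submit(String job) {
        for (int tried = 0; tried < queues.size(); tried++) {
            int slave = next;
            next = (next + 1) % queues.size();
            if (queues.get(slave).size() < maxJobsPerSlave) {
                queues.get(slave).addLast(job);
                return slave;
            }
        }
        return -1;      // caller would block until a slave frees up
    }
}
```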


SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates the individuals
and returns them or their resulting fitnesses (see 'FITNESS VERSUS
INDIVIDUAL' below).  In 'evolve' mode, the slave evaluates its
individuals; but if it has some extra time on its hands, it then treats
the individuals as a little population and does some evolution on it.
When time is up, it returns the most recent individuals in its little
population in lieu of the original individuals.  This is particularly
useful when your evaluation procedure is very fast compared to the
amount of time spent reading and writing individuals over the network.
Note that this only works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

    eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

    eval.runtime = 6000

Last, you'll need to turn this *slave* parameter on to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

    eval.return-inds = true

The size of the mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

    eval.masterproblem.job-size = 100
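In sum, evolve mode involves these parameters (example values; the
first three go in the slave's file, the last in the master's):

    eval.run-evolve = true
    eval.runtime = 6000
    eval.return-inds = true
    eval.masterproblem.job-size = 100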


FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However it is possible that some evaluations
of individuals literally change them during evaluation.  If you have
done such an evil thing, you'll need to have the modified individual
shipped back to the master for reinsertion.  Be sure to change the
appropriate *slave* parameter:

    eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.


SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and particularly
its Population structure, look and feel as much like the Master's as
possible.  Notably, Subpopulation classes, Species, Individual
representations, and Fitnesses should be the same.  This is particularly
important when you want to do evolution on the Slave.  The easiest way
to do this is simply to derive the Slave's parameter file from the
Master's parameter file.

However, if you're doing evolution on the Slave, this doesn't mean you
have to have the Slave set up the same way as the Master: just the
Population and the objects hanging off of it.  It might be preferable to
do (for example) generational evolution on the Slave while asynchronous
steady-state evolution is happening on the Master.  You're free to
change the high-level evolution procedures; but you should maintain the
same representations and breeding mechanisms so that Individuals on the
Slave are valid on the Master when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the
i-am-slave parameter.  The Evaluator class will not set up a
MasterProblem, even if one is specified, if i-am-slave is true.


ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance.

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearinghouse for remotely evaluating individuals handed to it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs worker
threads which read and write jobs to/from the Slave.  MasterProblems
submit jobs to the SlaveMonitor, which in turn distributes them to an
available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

    problem.prepareToEvaluate(...)
    for each individual
        problem.evaluate(individual,...)
    problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to bulk
up all the individuals as it likes.  The MasterProblem will then read in
a full job's worth of individuals, then send them out to a slave, then
read another job's worth of individuals, then send THEM out to another
slave, and so on.  When SimpleEvaluator calls finishEvaluating, the
remaining individuals are sent out as a (possibly short) job, and then
the MasterProblem blocks waiting until all the individuals have come
back.
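The deferred-evaluation behavior just described can be sketched like
this (a simplified illustration of the batching pattern: the method
names follow the Problem protocol above, but here "individuals" are
just strings, and "sending a job to a slave" is collecting it into a
list rather than writing it over a socket):

```java
import java.util.*;

// Sketch of how a MasterProblem can defer evaluation until
// finishEvaluating() time, accumulating individuals into jobs.
class BatchingProblem {
    final int jobSize;                  // eval.masterproblem.job-size
    final List<String> pending = new ArrayList<>();
    final List<List<String>> dispatched = new ArrayList<>();  // "sent" jobs

    BatchingProblem(int jobSize) { this.jobSize = jobSize; }

    void prepareToEvaluate() { pending.clear(); }

    /** Queues the individual; ships a full job once jobSize are queued. */
    void evaluate(String individual) {
        pending.add(individual);
        if (pending.size() >= jobSize) flush();
    }

    /** Ships the remaining individuals as a (possibly short) job; the
        real system would then block until all results had come back. */
    void finishEvaluating() { if (!pending.isEmpty()) flush(); }

    private void flush() {
        dispatched.add(new ArrayList<>(pending));   // "send to a slave"
        pending.clear();
    }
}
```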

The SteadyStateEvaluator performs evaluation in a different way:

    problem.prepareToEvaluate(...)   // at the very beginning of evolution!
    loop forever
        if problem.canEvaluate(),
            create/breed individual
            problem.evaluate(individual, ...)
        individual = problem.getNextEvaluatedIndividual()
        if individual != null
            introduce individual to population
        sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, when
problem.evaluatedIndividualAvailable() returns true from the
MasterProblem -- indicating that a job has come back with individuals
for the SteadyStateEvaluator -- getNextEvaluatedIndividual() returns the
next individual in that job.  This is sort of like the
SteadyStateEvaluator loading individuals onto a bus and sending it out;
when busses come back to the station, the SteadyStateEvaluator gets the
individuals one by one as they disembark.

A note on how individuals come back from the slaves.  You'd think that
the individuals are sent out, and then either revised fitnesses are read
back and replace their old fitnesses, or new individuals replace the old
individuals.  But that's not the case.  This is because the
MasterProblem actually has no idea where the individuals it's receiving
are stored.  Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals into a second array of
'scratch individuals'.  If fitnesses come back, for each individual i in
the scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals would
be corrupted and our whole evolutionary process messed up.  That's why
we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did!  Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it using
Individual.writeIndividual(pipeOutputStream).  We then read from that
same pipe into the original individual using
Individual.readIndividual(pipeInputStream).  It's a hack, but a pretty
one.
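The pipe trick can be illustrated in miniature as follows (a
self-contained sketch of the idea, not ECJ's actual Individual or
ec/util/DataPipe code; here an in-memory byte buffer stands in for the
pipe, and the "individual" is just a fitness plus an integer genome):

```java
import java.io.*;

// Sketch of copying one individual's state into another, without a
// copy-into-me method, by writing it out and reading it back in.
class Individual {
    double fitness;
    int[] genome;

    void writeIndividual(DataOutputStream out) throws IOException {
        out.writeDouble(fitness);
        out.writeInt(genome.length);
        for (int gene : genome) out.writeInt(gene);
    }

    void readIndividual(DataInputStream in) throws IOException {
        fitness = in.readDouble();
        genome = new int[in.readInt()];
        for (int i = 0; i < genome.length; i++) genome[i] = in.readInt();
    }

    /** Copies 'scratch' into 'original' by way of an in-memory pipe. */
    static void copy(Individual scratch, Individual original) {
        try {
            ByteArrayOutputStream pipe = new ByteArrayOutputStream();
            scratch.writeIndividual(new DataOutputStream(pipe));
            original.readIndividual(new DataInputStream(
                new ByteArrayInputStream(pipe.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);   // can't happen in-memory
        }
    }
}
```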


MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave comes in, the master creates a SlaveConnection
to handle the slave communication.  The SlaveConnection creates a read
thread and a write thread to keep the pipeline to the slave filled with
Individuals.

You have the option of using compressed streams.  Unfortunately,
Java has broken compression -- it doesn't support "partial flush", which
is critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/ and ECJ will detect and use it
automatically.

<On Connection>
    <-  slave name                              readUTF
    ->  random number generator seed            writeInt
        (note: the seed then increments by SEED_INCREMENT (7919),
        and the Slave is free to use any integer value between
        the seed and seed + SEED_INCREMENT - 1 to seed its
        random number generators.  We hope 7919 is enough :-)
    ->  FLUSH


<SlaveConnection's writer protocol loops the following:>
    ->  job type                                writeByte
        (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
        Slave.V_SHUTDOWN)
    If Slave.V_SHUTDOWN,
        break connection
    If Slave.V_EVALUATEGROUPED,
        ->  countVictoriesOnly                  writeBoolean
    ->  number of individuals in job            writeInt
    Loop for number of individuals in job,
        ->  individual                          writeIndividual
        ->  update ind's fitness                writeBoolean
    ->  FLUSH


<SlaveConnection's reader protocol loops the following:>
    <-  arbitrary byte (to block on)            readByte
    Loop for number of individuals in job,
        <-  returning individual type           readByte
            (Slave.V_INDIVIDUAL, Slave.V_FITNESS, or Slave.V_NOTHING)
        If Slave.V_INDIVIDUAL,
            <-  individual                      readIndividual
        Else if Slave.V_FITNESS,
            <-  evaluated?                      readBoolean
            <-  fitness                         readFitness
        Else if Slave.V_NOTHING,
            (nothing is sent)
    <-  FLUSH


Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal is received.
---|