This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come online at any time, and you can add new processes at any time,
perhaps because new machines have become available, or to replace slaves
which have died for some reason.  If a remote slave dies, ECJ gracefully
handles it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so each has a single random
number generator.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on the
current wall-clock time.  Each time a slave comes online, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of the seeds in their parameter files.

You can freely use the master/slave evaluator in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples of that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

    java ec.Evolve -file foo.master.params

(where foo.master.params might have parent.0 be normal ec
parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

    java ec.eval.Slave -file foo.slave.params

(where foo.slave.params might have parent.0 be normal ec
parameters, and have parent.1 = slave.params)

...and it should all work nicely!  The system works fine under
checkpointing as far as we know.
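
As a sketch, the two files might look something like this (foo.* and
myapp.params are hypothetical names; parent.N is ECJ's standard
parameter-file inheritance mechanism):

    # foo.master.params
    parent.0 = myapp.params      # your normal ec parameters
    parent.1 = master.params     # adds the master-side settings

    # foo.slave.params
    parent.0 = myapp.params
    parent.1 = slave.params      # adds the slave-side settings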




MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

    eval.i-am-slave = true

The master sets up distributed evaluation by loading a special class,
called a MasterProblem, which replaces the Problem during evaluation.
The MasterProblem is defined by the following *master* parameter:

    eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class.
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

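The replacement described above is essentially a decorator: the wrapper
keeps the original Problem in its 'problem' instance variable while
standing in for it.  Here is a minimal, hypothetical sketch of that
arrangement -- the class and method names below are illustrative, not
ECJ's actual classes or signatures:

```java
// Hypothetical sketch (not ECJ's API) of the Problem-prototype swap:
// the evaluator replaces its Problem with a wrapper that keeps the
// original in a 'problem' instance variable, so callers still see
// something that behaves like a Problem.
interface Problem { double evaluate(String individual); }

class RealProblem implements Problem {
    public double evaluate(String individual) { return individual.length(); }
}

// Stands in for the Problem; the real MasterProblem would ship the
// individual off to a remote slave instead of evaluating it here.
class MasterProblemSketch implements Problem {
    final Problem problem;    // the original Problem, kept around
    MasterProblemSketch(Problem problem) { this.problem = problem; }
    public double evaluate(String individual) {
        // sketch: delegate locally rather than over the network
        return problem.evaluate(individual);
    }
}

public class EvaluatorSketch {
    public static void main(String[] args) {
        Problem prototype = new RealProblem();
        // if eval.masterproblem were set, replace the prototype:
        prototype = new MasterProblemSketch(prototype);
        System.out.println(prototype.evaluate("abcd"));   // prints 4.0
    }
}
```
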
The MasterProblem's job is to send stuff to remote slaves.  Thus a slave
should not load it, but instead should load the regular Problem
instance.  The slave does this by checking the eval.i-am-slave
parameter.  If it's true, the slave simply ignores the
eval.masterproblem parameter entirely.

More information about the architecture may be found near the end of
this file.

The master listens on a socket for new slaves to arrive and register
themselves.  When slaves are fired up, they attempt to attach to this
socket and negotiate a connection.  This means that both the master and
the slave need to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

    eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

    eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream.
Communication is compressed by default.  Note that Java's compression
routines are broken (they don't support PARTIAL_FLUSH), so we have to
rely on a third-party library called JZLIB.  You can get the jar file
for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
This is specified in the *master and slave* parameter:

    eval.compression = true

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

    eval.slave-name = put-my-name-here

The master can print out various debug information if you set the
following parameter to true (here it's turned off):

    eval.masterproblem.debug-info = false
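
Gathering the parameters above into one place, a sketch of a setup
might look like this (the host value is a placeholder, and slave-1 is
an arbitrary name):

    # read by both master and slave
    eval.master.port = 15000
    eval.compression = true

    # master only
    eval.masterproblem = ec.eval.MasterProblem

    # slave only
    eval.i-am-slave = true
    eval.master.host = put.the.master.ip.address.here
    eval.slave-name = slave-1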



JOBS

A slave receives chunks of individuals to evaluate and return as a
group.  Such a chunk is called a "job".  If you're doing
non-coevolutionary evolution, you can specify how many individuals the
slave should receive at one time with the *master* parameter (here set
to 100):

    eval.masterproblem.job-size = 100

If you have very small individuals, or fast evaluation, this makes
better use of network bandwidth, as individuals are sent as a group,
evaluated together, and returned together.  More individuals can be
packed into a TCP/IP packet before it's sent out, minimizing overhead.
However, the primary effect of changing the job size is to modify the
"slave evolution" population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave maintains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each of the slave
connections keeps its TCP/IP buffer stuffed with as many of these
queued jobs as it can, so jobs are available at the Slaves before
they even realize it.  To set the queue size, you use the *master*
parameter (here it's being set to 100):

    eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep max-jobs-per-slave
a bit low.

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.


SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates the individuals
and returns them or their resulting fitnesses (see 'FITNESS VERSUS
INDIVIDUAL' below).  In 'evolve' mode, the slave evaluates its
individuals; but if it has some extra time on its hands, it then treats
the individuals as a little population and does some evolution on it.
When time is up, it returns the most recent individuals in its little
population in lieu of the original individuals.  This is particularly
useful when your evaluation procedure is very fast compared to the
amount of time spent reading and writing individuals over the network.
Note that this only works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

    eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall-clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

    eval.runtime = 6000

Last, you'll need to turn on this *slave* parameter to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

    eval.return-inds = true

The size of the mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

    eval.masterproblem.job-size = 100
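
Putting the pieces together, a slave set up for "evolve" mode might
contain something like the following sketch (values illustrative):

    # slave side
    eval.run-evolve = true
    eval.runtime = 6000          # evolve for 6 seconds per job
    eval.return-inds = true      # required in evolve mode

    # master side: the job size doubles as the slave's population size
    eval.masterproblem.job-size = 100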




FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However, it is possible that some
evaluations of individuals literally change them during evaluation.  If
you have done such an evil thing, you'll need to have the modified
individual shipped back to the master for reinsertion.  Be sure to
change the appropriate *slave* parameter:

    eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.


SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and
particularly its Population structure, look and feel as much like the
Master's as possible.  Notably, Subpopulation classes, Species,
Individual representations, and Fitnesses should be the same.  This is
particularly important when you want to do evolution on the Slave.  The
easiest way to do this is simply to derive the Slave's parameter file
from the Master's parameter file.

However, if you're doing evolution on the Slave, this doesn't mean you
have to have the Slave set up the same way as the Master: just the
Population and objects hanging off of it.  It might be preferable to do
(for example) generational evolution on the Slave while asynchronous
steady-state evolution is happening on the Master.  You're free to
change the high-level evolution procedures; but you should maintain the
same representations and breeding mechanisms so that Individuals on the
Slave are valid on the Master when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the
i-am-slave parameter.  The Evaluator class will not set up a
MasterProblem, even if one is specified, if i-am-slave is true.
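
For example, a slave parameter file derived from the master's might
look like this sketch (file names hypothetical).  Even though
eval.masterproblem is inherited from the master's file, the i-am-slave
setting causes the slave to ignore it:

    # foo.slave.params
    # inherit everything: same Population, Species, Individuals, Fitnesses
    parent.0 = foo.master.params
    eval.i-am-slave = true
    eval.master.host = put.the.master.ip.address.here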




ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance.

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearinghouse for remotely evaluating individuals handed to it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs worker
threads which read and write jobs to/from the Slave.  MasterProblems
submit jobs to the SlaveMonitor, which in turn distributes them to an
available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

    problem.prepareToEvaluate(...)
    for each individual
        problem.evaluate(individual,...)
    problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to
batch up individuals as it likes.  The MasterProblem will read in
a full job's worth of individuals, then send them out to a slave, then
read another job's worth of individuals, then send THEM out to another
slave, and so on.  When SimpleEvaluator calls finishEvaluating, the
remaining individuals are sent out as a (possibly short) job, and then
the MasterProblem blocks, waiting until all the individuals have come
back.
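
The batching behavior can be sketched as follows.  This is a
hypothetical stand-in, not ECJ's MasterProblem: individuals are
buffered in evaluate(), dispatched in JOB_SIZE chunks, and the leftover
short job is flushed by finishEvaluating().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not ECJ's API) of the deferred-evaluation
// pattern that prepareToEvaluate/evaluate/finishEvaluating enables:
// the "problem" buffers individuals into fixed-size jobs, dispatching
// a job whenever the buffer fills, and flushing the remainder at the end.
public class BufferingProblemSketch {
    static final int JOB_SIZE = 3;   // analogous to eval.masterproblem.job-size
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> dispatchedJobs = new ArrayList<>();

    void prepareToEvaluate() { buffer.clear(); }

    // Instead of evaluating immediately, queue the individual; send a
    // full job as soon as the buffer reaches JOB_SIZE.
    void evaluate(String individual) {
        buffer.add(individual);
        if (buffer.size() == JOB_SIZE) flush();
    }

    // Send out any remaining individuals as a (possibly short) job.
    void finishEvaluating() { if (!buffer.isEmpty()) flush(); }

    private void flush() {
        dispatchedJobs.add(new ArrayList<>(buffer));  // "send to a slave"
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferingProblemSketch p = new BufferingProblemSketch();
        p.prepareToEvaluate();
        for (int i = 0; i < 7; i++) p.evaluate("ind" + i);
        p.finishEvaluating();
        // 7 individuals with job size 3 -> jobs of 3, 3, and 1
        System.out.println(p.dispatchedJobs.size());         // prints 3
        System.out.println(p.dispatchedJobs.get(2).size());  // prints 1
    }
}
```
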

The SteadyStateEvaluator performs evaluation in a different way:

    problem.prepareToEvaluate(...)   // at the very beginning of evolution!
    loop forever
        if problem.canEvaluate(),
            create/breed individual
            problem.evaluate(individual, ...)
        individual = problem.getNextEvaluatedIndividual()
        if individual != null
            introduce individual to population
        sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, if
problem.evaluatedIndividualAvailable() returns true from the
MasterProblem -- indicating that a job has come back with individuals
for the SteadyStateEvaluator -- then getNextEvaluatedIndividual()
returns the next individual in that job.  This is sort of like the
SteadyStateEvaluator loading individuals onto a bus and sending it out;
when busses come back to the station, the SteadyStateEvaluator gets the
individuals one by one as they disembark.
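
The interaction above can be sketched as follows.  Again this is a
hypothetical stand-in, not ECJ's SteadyStateEvaluator or MasterProblem:
evaluate() fills a job; a filled job is "sent to a slave" (here, simply
evaluated at once); and finished individuals are handed back one at a
time by getNextEvaluatedIndividual().

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch (not ECJ's API) of the steady-state protocol:
// individuals accumulate in a job; when the job fills it is dispatched,
// and evaluated individuals are returned one by one, or null if none
// have come back yet.
public class SteadyStateSketch {
    static final int JOB_SIZE = 2;
    final Deque<String> currentJob = new ArrayDeque<>();
    final Deque<String> evaluated = new ArrayDeque<>();

    boolean canEvaluate() { return true; }  // real version: a slave queue has room

    void evaluate(String individual) {
        currentJob.add(individual);
        if (currentJob.size() == JOB_SIZE) {     // job filled: dispatch it
            while (!currentJob.isEmpty())
                evaluated.add(currentJob.poll() + ":evaluated");
        }
    }

    // Returns null when no evaluated individual has come back yet.
    String getNextEvaluatedIndividual() { return evaluated.poll(); }

    public static void main(String[] args) {
        SteadyStateSketch p = new SteadyStateSketch();
        p.evaluate("a");
        System.out.println(p.getNextEvaluatedIndividual());  // prints null
        p.evaluate("b");                                     // fills the job
        System.out.println(p.getNextEvaluatedIndividual());  // prints a:evaluated
    }
}
```
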

A note on how individuals come back from the slaves.  You'd think that
the individuals are sent out, and either revised fitnesses are read back
and replace their old fitnesses, or new individuals replace the old
individuals.  But that's not the case, because the MasterProblem
actually has no idea where the individuals it's receiving are stored.
Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals into a second array of
'scratch individuals'.  If fitnesses come back, for each individual i in
the scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals would
be corrupted and our whole evolutionary process messed up.  That's why
we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did!  Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it, using
Individual.writeIndividual(pipeOutputStream).  We then read from that
same pipe into the original individual, using
Individual.readIndividual(pipeInputStream).  It's a hack, but a pretty
one.
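
The pipe trick can be illustrated with plain java.io streams.  This
sketch is not ECJ's DataPipe or Individual API; it just shows how
writing one object's state into an in-memory buffer and reading it back
into another object effects a copy without any copyIntoMe()-style
method:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch (not ECJ's DataPipe): copy one object into
// another by serializing through an in-memory "pipe".
public class PipeCopySketch {
    static class Thing {
        int genomeLength;
        double fitness;

        void write(DataOutputStream out) throws IOException {
            out.writeInt(genomeLength);
            out.writeDouble(fitness);
        }
        void read(DataInputStream in) throws IOException {
            genomeLength = in.readInt();
            fitness = in.readDouble();
        }
    }

    public static void main(String[] args) throws IOException {
        Thing scratch = new Thing();       // freshly read back from a slave
        scratch.genomeLength = 32;
        scratch.fitness = 0.75;

        Thing original = new Thing();      // the one we want overwritten

        // "Pipe": write scratch into a buffer, read the buffer into original.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        scratch.write(new DataOutputStream(buffer));
        original.read(new DataInputStream(
            new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(original.genomeLength + " " + original.fitness);
        // prints 32 0.75
    }
}
```
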



MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave does come in, the master creates a
SlaveConnection to handle the slave communication.  The SlaveConnection
creates a read thread and a write thread to keep the pipeline to the
slave filled with Individuals.

You have the option of using compressed streams.  Unfortunately, Java
has broken compression -- it doesn't support "partial flush", which is
critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/ and ECJ will detect and use it
automatically.

<On Connection>
<- slave name                                readUTF
-> random number generator seed              writeInt
       (note: the seed then increments by SEED_INCREMENT (7919),
       and the Slave is free to use any integer value between
       the seed and seed + SEED_INCREMENT - 1 to seed its
       random number generators.  We hope 7919 is enough :-))
-> FLUSH

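The seed hand-out can be sketched as follows (the class and method
names are hypothetical; only SEED_INCREMENT = 7919 comes from the
protocol above):

```java
// Hypothetical sketch of the master's seed hand-out policy: a base
// seed taken from wall-clock time, advanced by SEED_INCREMENT for each
// connecting slave so that no two slaves' seed ranges overlap.
public class SeedHandoutSketch {
    static final int SEED_INCREMENT = 7919;
    int currentSeed;

    SeedHandoutSketch(int initialSeed) { currentSeed = initialSeed; }

    // Called once per connecting slave: the slave may use any value in
    // [seed, seed + SEED_INCREMENT - 1] for its own generators.
    int nextSlaveSeed() {
        int seed = currentSeed;
        currentSeed += SEED_INCREMENT;
        return seed;
    }

    public static void main(String[] args) {
        SeedHandoutSketch master =
            new SeedHandoutSketch((int) System.currentTimeMillis());
        int a = master.nextSlaveSeed();
        int b = master.nextSlaveSeed();
        System.out.println(b - a);  // prints 7919: the ranges don't overlap
    }
}
```
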

<SlaveConnection's writer protocol loops the following:>
-> job type                                  writeByte
       (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
       Slave.V_SHUTDOWN)
If Slave.V_SHUTDOWN,
       break connection
If Slave.V_EVALUATEGROUPED,
       -> countVictoriesOnly                 writeBoolean
-> number of individuals in job              writeInt
Loop for number of individuals in job,
       -> individual                         writeIndividual
       -> update ind's fitness               writeBoolean
-> FLUSH


<SlaveConnection's reader protocol loops the following:>
<- arbitrary byte (to block on)              readByte
Loop for number of individuals in job,
       <- returning individual type          readByte
              (Slave.V_INDIVIDUAL, Slave.V_FITNESS, or Slave.V_NOTHING)
       If Slave.V_INDIVIDUAL,
              <- individual                  readIndividual
       Else if Slave.V_FITNESS,
              <- evaluated?                  readBoolean
              <- fitness                     readFitness
       Else if Slave.V_NOTHING,
              (nothing is sent)
<- FLUSH

Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal is received.