This directory contains the code for running the ECJ master/slave evaluator. The master/slave evaluator allows you to connect one ECJ evolution process with some N remote slave processes. These processes can come on-line at any time, and you can add new processes at any time, perhaps if new machines come available, or to replace slaves which have died for some reason. If a remote slave dies, ECJ gracefully handles it, rescheduling its jobs to the next available slave. Slaves run in single-threaded mode, and so have a single random number generator each. A slave receives a 32-bit random number seed from the master. Initially the master selects a random number seed based on current wall clock time. Each time a slave goes on-line, the master increments this current seed and gives the slave the revised seed. Slaves use this seed regardless of their seeds in their parameter files. You can use the master/slave evaluator freely in conjunction with island models to connect each island to its own private group of N slaves, though we have no examples for that in the distribution. Typical params files for the master and for the slaves are illustrated in the ec/app/star directory. You fire up the master something like this: java ec.Evolve -file foo.master.params (where foo.master.params might have parent.0 be normal ec parameters, and have parent.1 = master.params) You fire up each of the N slaves something like this: java ec.eval.Slave -file foo.slave.params (where foo.slave.params might have parent.0 be normal ec parameters, and have parent.1 = slave.params) ...and it should all nicely work! The system works fine under checkpointing as far as we know. MASTERS AND SLAVES The master and slave processes can (and generally ought to) share parameter files. The way a slave knows it's a slave is through the addition of the following *slave-only* parameter: eval.i-am-slave = true The master sets up distributed evaluation by loading special class, called a MasterProblem, which replaces the Problem during evaluation. The MasterProblem is defined by the following *master* parameter: eval.masterproblem = ec.eval.MasterProblem When the Evaluator is started up, it normally sets up the Problem class. If eval.masterproblem is set, the Evaluator also loads the master problem and then *replaces* the Problem class prototype with a prototype of the MasterProblem class. The Problem prototype is then set to be the 'problem' instance variable in the MasterProblem prototype. This essentially allows the Problem to stick around even though it's never called by the Evaluator any more -- instead, the MasterProblem is called. The MasterProblem's job is to send stuff to remote slaves. Thus a slave should not load it, but instead should load the regular Problem instance. The slave does this by checking the eval.i-am-slave parameter. If it's true, it simply ignores the eval.masterproblem parameter entirely. More information about the architecture may be found near the end of this file. The master listens on socket for new slaves to arrive and register themselves. When slaves are fired up, they attempt to attach to this socket and negotiate connection. This means that both the master and the slave needs to know the master's port number, and furthermore the slave needs to know the master's IP address. The socket port number is specified in the *master and slave* parameter (here set to 15000): eval.master.port = 15000 The slave is told where the master is with the following *slave* parameter: eval.master.host = put.the.master.ip.address.here The master and slave can communicate over a compressed stream. Communication is by default compressed. Note that Java's compression routines are broken (they don't support PARTIAL_FLUSH), so we have to rely on a third-party library called JZLIB. You can get the jar file for this library on the ECJ main web page or at http://www.jcraft.com/jzlib/ This is specified in the *master and slave* parameter: eval.compression = true Last, the slave can be given a name. This is solely for debugging purposes. If you don't provide this parameter, the slave will give itself an arbitrary name, and that's fine. The *slave* parameter is: eval.slave-name = put-my-name-here The master can print out various debug information if you turn on the following parameter: eval.masterproblem.debug-info = false JOBS A slave receives chunks of individuals to evaluate and return as a group. This chunk is called a "job". If you're doing non-coevolutionary evolution, you can specify how many individuals the slave should receive at one time with the *master* parameter (here set to 100): eval.master-problem.job-size = 100 If you have very small individuals, or fast evaluation, this makes better use of network bandwidth as it sends them as a group, evaluates them all, and then returns them all. This is because more individuals can get packed into a TCP/IP packet before it's sent out, minimizing overhead. However, the primary effect of changing the job-size is to modify the "slave evolution" population size (see the next section). Another way to improve network efficiency, particularly with very fast jobs, is to fill the network buffer with as many jobs as you can fit. Each slave mantains a queue of jobs that it's working on. When the master needs to hand a job to a slave, it looks for one whose queue is not filled, and then puts the job in the queue. Each of the slave connections keep their TCP/IP buffers stuffed with as many of these queued jobs as they can, so jobs are available at the Slaves before they even realize it. To set the queue size, you use the *master* parameter (here it's being set to 100): eval.masterproblem.max-jobs-per-slave = 100 If you only have 100 individuals in your population (say), this won't fill all the jobs on one slave connection. The system goes round-robin through all the slaves and distributes jobs appropriately. Even so, if you have new slaves coming online all the time, they'll have to wait if jobs have already been measured out to the other slaves, so in that case it might be wise to keep the max-jobs-per-slave a bit low. If you are doing coevolutionary evolution, a job will consist of the individuals necessary to perform one joint coevolutionary evaluation. SLAVE EVOLUTION Slaves can operate in one of two modes: "regular" and "evolve". In regular mode, when a slave receives a job, it evaluates them and returns them or their resulting fitnesses (see 'FITNESS VERSUS INDIVIDUAL' below). In 'evolve' mode, the slave evaluates its individuals; but if it has some extra time on its hands, it then treats the individuals as a little population and does some evolution on it. When time is up, it returns the most recent individuals in its little individuals in lieu of the original individuals. This is particularly useful when your evaluation procedure is very fast compared to the amount of time spent reading and writing individuals over the network. Note that this only works in NON-coevolutionary evolution. To turn on this feature in a slave, you set the following parameter in the *slave*: eval.run-evolve = true You will also need to specify how long the slave should do its evolution. This is specified in wall clock time (in milliseconds) with the following *slave* parameter, here specifying 6 seconds: eval.runtime = 6000 Last, you'll need to turn this *slave* parameter on to get "evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for more information): eval.return-inds = true The size of this mini-"population" the slave is evolving is determined by the job-size parameter discussed earlier, so you'll want to set it to something appropriate. Here again it's being set to 100: eval.master-problem.job-size = 100 FITNESS VERSUS INDIVIDUAL By default the master/slave system sends individuals from master to slave, but only returns FITNESS values from slave to master, in order to save on network bandwidth. However it is possible that some evaluations of individuals literally change them during evaluation. If you have done such an evil thing, you'll need to have the modified individual shipped back to the master for reinsertion. Be sure to change the appropriate *slave* parameter: eval.return-inds = true If you're running in "evolve" mode, you will *always* need to set this parameter to true. SLAVE CONFIGURATION Because individuals don't know that they're being evaluated remotely, it's best to make your Slave's EvolutionState structure, and particularly its Population structure, look and feel as much like the Master as possible. Notably, Subpopulation classes, Species, Individual representations, and Fitnesses should be the same. This is particularly important if when you want to do evolution on the Slave. The easiest way to do this is simply to derive the Slave's parameter file from the Master's paramter file. However, if you're doing evolution on the Slave, this doesn't mean you have to have the Slave set up the same way as the Master: just the Population and objects hanging off of it. It might be preferable to do (for example) generational evolution on the Slave while asynchronous steady state evolution is happening on the Master. You're free to change the high-level evolution procedures; but you should maintain the same representations and breeding mechanisms so that Individuals on the Slave are valid on the Master when they come back. If the Master and Slave share parameters, what prevents the Slave from trying to set up its own MasterProblem as well? The answer: the i-am-slave parameter. The Evaluator class will not set up a MasterProblem, even if one is specified, if i-am-slave is true. ARCHITECTURE The system's classes naturally break into two groups: the Slave class and the various master-side classes (all others). The Slave class is essentially a replacement for Evolve.java which sets up a dummy ECJ process in which to evaluate individuals. This means, of course, that the Slave must have the same basic evolutionary parameters -- particularly representation parameters -- as the master evolutionary process. The master system is set up by MasterProblem, a special version of the Problem class which, instead of evaluating individuals, sends them to remote Slaves to be evaluated. As mentioned before, MasterProblem is loaded in the master process and then replaces the original Problem instance, in essence "pretending" to be a Problem instance. Like any Problem, multiple MasterProblems are created during the course of evaluation, one per thread and per generation. During setup, the MasterProblem prototype constructs one shared class which acts as the clearing house for remotely evaluating individuals handed it by the various MasterProblems. This class is called the SlaveMonitor. The SlaveMonitor is responsible for keeping track of the remote slave connections. For each Slave which has connected, the SlaveMonitor manages reading and writing to that slave via a SlaveConnection. The SlaveConnection holds the job queue for that remote Slave, holds the socket connection and streams to the remote Slave, and runs, in its own thread, a worker thread which reads and writes jobs to/from the Slave. MasterProblems submit jobs to the SlaveMonitor, which in turn distributes them to an available slave. Most evaluation procedures can take advantage of this to provide a degree of semi-asynchrony. For example, SimpleEvaluator performs per-thread evaluation in the following way: problem.prepareToEvaluate(...) for each individual problem.evaluate(individual,...) problem.finishEvaluating(...) This protocol informs the underlying Problem that it is free to delay actual evaluation of the requested individuals until finishEvaluating(...) time. This in turn allows a MasterProblem to bulk up all the individuals as it likes. The MasterProblem will then read in a full job's worth of individuals, then send them out to a slave, then read another job's worth of individuals, then send THEM out to another slave, and so on. When SimpleEvaluator calls finishEvaluating, the remaining individuals are sent out as a (possibly short) job, and then the MasterProblem blocks waiting until all the individuals have come back. The SteadyStateEvaluator performs evaluation in a different way: problem.prepareToEvaluate(...) // at the very beginning of evolution! loop forever if problem.canEvaluate(), create/breed individual problem.evaluate(individual, ...) individual = problem.getNextEvaluatedIndividual() if individual != null introduce individual to population sleep a tiny bit to avoid spin-waiting Note that problem.finishEvaluating(...) is NEVER CALLED. Here, if a Slave's queue can accept another job, the MasterProblem returns true from canEvaluate(). Each time problem.evaluate() is called, the MasterProblem adds the individual to the job until the job is filled, at which time it sends the job off to the Slave. Likewise, if problem.evaluatedIndividualAvailable() returns true from the Master Problem -- indicating that a job has come back with individuals for the SteadyStateEvaluator, then getNextEvaluatedIndividual() returns the next individual in that job. This is sort of like the SteadyStateEvaluator loading individuals onto a bus, and sending it out, and then when busses come back to the station, the SteadyStateEvaluator gets the individuals one by one as they disembark from the bus. A note on how individuals come back from the slaves. You'd think that the indivdiuals are sent out, and either revised fitnesses are read back and replace their old fitnesses; or new individuals replace the old individuals. But that's not the case. This is because the MasterProblem actually has no idea where the individuals are stored that it's receiving. Instead we have a bit of an odd way of doing it. When a job is submitted to a Slave, we send the individuals off to the Slave, and then clone all the individuals to 'scratch individuals' a second array. If fitnesses come back, for each individual i in the scratch array, we call scratchindividual[i].fitness.readFitness(inputStream). If whole individuals come back, for each individual i in the scratch array, we call scratchindividual[i].readIndividual(inputStream). We don't want to call these functions on the original individuals because if the Slave dies in the middle of transmission, the population's individuals are corrupted and our whole evolutionary process is messed up. That's why we read into scratch individuals. So how do we then get the scratch individuals copied into the originals, if we don't know where the originals are stored? ECJ does not have an Individual.copyIndividualIntoMe(individual) function -- it'd be nice if it did! Instead, we create a special buffered pipe, stored in ec/util/DataPipe.java, and write a scratch individual into it, using Individual.writeIndividual(pipeOutputStream). We then read from that same pipe into the original individual, using Individual.readIndividual(pipeInputStream). It's a hack but a pretty one. MASTER/SLAVE NETWORK PROTOCOL The master maintains a thread which continually waits for new slaves to come in. When a slave does in, the master creates a SlaveConnection to handle the slave communication. The SlaveConnection creates a read thread and a write thread to keep the pipeline to the slave filled with Individuals. You have the option of using compressed streams. Unfortunately, Java has broken compression -- it doesn't support "partial flush", which is critical for doing compressed network streams. To do compression you will need to download the JZLIB library from the ECJ website or from http://www.jcraft.com/jzlib/ and ECJ will detect and use it automatically. <- slave name readUTF -> random number generator seed writeInt (note: the seed then increments by SEED_INCREMENT (7919) and the Slave is free to use any integer values between the seed and seed + SEED_INCREMENT - 1 to seed its random number generators. We hopve 7919 is enough :-) -> FLUSH -> job type writeByte (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or Slave.V_SHUTDOWN) If Slave.V_SHUTDOWN, break connection If Slave.V_EVALUATEGROUPED, -> countVictoriesOnly writeBoolean -> number of individuals in job writeInt Loop for number of individuals in job, -> individual writeIndividual -> update ind's fitness writeBoolean -> FLUSH <- arbitrary byte (to block on) readByte Loop for number of individuals in job, <- returning individual type readByte (Slave.V_INDIVIDUAL or Slave.V_FITNESS) If Slave.V_INDIVIDUAL, <- individual readIndividual Else if Slave.V_FITNESS, <- evaluated? readBoolean <- fitness readFitness Else if Slave.V_NOTHING, (nothing is sent) <- FLUSH Note that the reader protocol does not tell the SlaveConnection how many individuals there are in the job. This is because the SlaveConnection already knows: jobs are received in the order they were sent. The slaves and slave connections shut down when the socket breaks or when the Slave.V_SHUTDOWN signal was received.