This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come online at any time, and you can add new processes at any time,
perhaps because new machines have become available, or to replace slaves
which have died for some reason.  If a remote slave dies, ECJ gracefully
handles it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so each has a single random
number generator.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on the
current wall-clock time.  Each time a slave comes online, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of the seeds in their parameter files.

You can freely use the master/slave evaluator in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples of that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

    java ec.Evolve -file foo.master.params

(where foo.master.params might have parent.0 be normal ec
parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

    java ec.eval.Slave -file foo.slave.params

(where foo.slave.params might have parent.0 be normal ec
parameters, and have parent.1 = slave.params)

...and it should all work nicely!  The system works fine under
checkpointing as far as we know.
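
As a sketch, the two files might look something like this (foo.* and
myapp.params are hypothetical names; parent.N is ECJ's standard
parameter-file inheritance mechanism):

    # foo.master.params
    parent.0 = myapp.params      # your normal ec parameters
    parent.1 = master.params     # adds the master-side settings

    # foo.slave.params
    parent.0 = myapp.params
    parent.1 = slave.params      # adds the slave-side settings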




MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

    eval.i-am-slave = true

The master sets up distributed evaluation by loading a special class,
called a MasterProblem, which replaces the Problem during evaluation.
The MasterProblem is defined by the following *master* parameter:

    eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class.
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

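The replacement described above is essentially a decorator: the wrapper
keeps the original Problem in its 'problem' instance variable while
standing in for it.  Here is a minimal, hypothetical sketch of that
arrangement -- the class and method names below are illustrative, not
ECJ's actual classes or signatures:

```java
// Hypothetical sketch (not ECJ's API) of the Problem-prototype swap:
// the evaluator replaces its Problem with a wrapper that keeps the
// original in a 'problem' instance variable, so callers still see
// something that behaves like a Problem.
interface Problem { double evaluate(String individual); }

class RealProblem implements Problem {
    public double evaluate(String individual) { return individual.length(); }
}

// Stands in for the Problem; the real MasterProblem would ship the
// individual off to a remote slave instead of evaluating it here.
class MasterProblemSketch implements Problem {
    final Problem problem;    // the original Problem, kept around
    MasterProblemSketch(Problem problem) { this.problem = problem; }
    public double evaluate(String individual) {
        // sketch: delegate locally rather than over the network
        return problem.evaluate(individual);
    }
}

public class EvaluatorSketch {
    public static void main(String[] args) {
        Problem prototype = new RealProblem();
        // if eval.masterproblem were set, replace the prototype:
        prototype = new MasterProblemSketch(prototype);
        System.out.println(prototype.evaluate("abcd"));   // prints 4.0
    }
}
```
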
The MasterProblem's job is to send stuff to remote slaves.  Thus a slave
should not load it, but instead should load the regular Problem
instance.  The slave does this by checking the eval.i-am-slave
parameter.  If it's true, the slave simply ignores the
eval.masterproblem parameter entirely.

More information about the architecture may be found near the end of
this file.

The master listens on a socket for new slaves to arrive and register
themselves.  When slaves are fired up, they attempt to attach to this
socket and negotiate a connection.  This means that both the master and
the slave need to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

    eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

    eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream.
Communication is compressed by default.  Note that Java's compression
routines are broken (they don't support PARTIAL_FLUSH), so we have to
rely on a third-party library called JZLIB.  You can get the jar file
for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
This is specified in the *master and slave* parameter:

    eval.compression = true

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

    eval.slave-name = put-my-name-here

The master can print out various debug information if you set the
following parameter to true (here it's turned off):

    eval.masterproblem.debug-info = false
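
Gathering the parameters above into one place, a sketch of a setup
might look like this (the host value is a placeholder, and slave-1 is
an arbitrary name):

    # read by both master and slave
    eval.master.port = 15000
    eval.compression = true

    # master only
    eval.masterproblem = ec.eval.MasterProblem

    # slave only
    eval.i-am-slave = true
    eval.master.host = put.the.master.ip.address.here
    eval.slave-name = slave-1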



JOBS

A slave receives chunks of individuals to evaluate and return as a
group.  Such a chunk is called a "job".  If you're doing
non-coevolutionary evolution, you can specify how many individuals the
slave should receive at one time with the *master* parameter (here set
to 100):

    eval.masterproblem.job-size = 100

If you have very small individuals, or fast evaluation, this makes
better use of network bandwidth, as individuals are sent as a group,
evaluated together, and returned together.  More individuals can be
packed into a TCP/IP packet before it's sent out, minimizing overhead.
However, the primary effect of changing the job size is to modify the
"slave evolution" population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave maintains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each of the slave
connections keeps its TCP/IP buffer stuffed with as many of these
queued jobs as it can, so jobs are available at the Slaves before
they even realize it.  To set the queue size, you use the *master*
parameter (here it's being set to 100):

    eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep max-jobs-per-slave
a bit low.

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.


SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates the individuals
and returns them or their resulting fitnesses (see 'FITNESS VERSUS
INDIVIDUAL' below).  In 'evolve' mode, the slave evaluates its
individuals; but if it has some extra time on its hands, it then treats
the individuals as a little population and does some evolution on it.
When time is up, it returns the most recent individuals in its little
population in lieu of the original individuals.  This is particularly
useful when your evaluation procedure is very fast compared to the
amount of time spent reading and writing individuals over the network.
Note that this only works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

    eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall-clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

    eval.runtime = 6000

Last, you'll need to turn on this *slave* parameter to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

    eval.return-inds = true

The size of the mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

    eval.masterproblem.job-size = 100
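
Putting the pieces together, a slave set up for "evolve" mode might
contain something like the following sketch (values illustrative):

    # slave side
    eval.run-evolve = true
    eval.runtime = 6000          # evolve for 6 seconds per job
    eval.return-inds = true      # required in evolve mode

    # master side: the job size doubles as the slave's population size
    eval.masterproblem.job-size = 100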




FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However, it is possible that some
evaluations of individuals literally change them during evaluation.  If
you have done such an evil thing, you'll need to have the modified
individual shipped back to the master for reinsertion.  Be sure to
change the appropriate *slave* parameter:

    eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.


SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and
particularly its Population structure, look and feel as much like the
Master's as possible.  Notably, Subpopulation classes, Species,
Individual representations, and Fitnesses should be the same.  This is
particularly important when you want to do evolution on the Slave.  The
easiest way to do this is simply to derive the Slave's parameter file
from the Master's parameter file.

However, if you're doing evolution on the Slave, this doesn't mean you
have to have the Slave set up the same way as the Master: just the
Population and objects hanging off of it.  It might be preferable to do
(for example) generational evolution on the Slave while asynchronous
steady-state evolution is happening on the Master.  You're free to
change the high-level evolution procedures; but you should maintain the
same representations and breeding mechanisms so that Individuals on the
Slave are valid on the Master when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the
i-am-slave parameter.  The Evaluator class will not set up a
MasterProblem, even if one is specified, if i-am-slave is true.
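
For example, a slave parameter file derived from the master's might
look like this sketch (file names hypothetical).  Even though
eval.masterproblem is inherited from the master's file, the i-am-slave
setting causes the slave to ignore it:

    # foo.slave.params
    # inherit everything: same Population, Species, Individuals, Fitnesses
    parent.0 = foo.master.params
    eval.i-am-slave = true
    eval.master.host = put.the.master.ip.address.here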




ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance.

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearinghouse for remotely evaluating individuals handed to it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs worker
threads which read and write jobs to/from the Slave.  MasterProblems
submit jobs to the SlaveMonitor, which in turn distributes them to an
available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

    problem.prepareToEvaluate(...)
    for each individual
        problem.evaluate(individual,...)
    problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to
batch up individuals as it likes.  The MasterProblem will read in
a full job's worth of individuals, then send them out to a slave, then
read another job's worth of individuals, then send THEM out to another
slave, and so on.  When SimpleEvaluator calls finishEvaluating, the
remaining individuals are sent out as a (possibly short) job, and then
the MasterProblem blocks, waiting until all the individuals have come
back.
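
The batching behavior can be sketched as follows.  This is a
hypothetical stand-in, not ECJ's MasterProblem: individuals are
buffered in evaluate(), dispatched in JOB_SIZE chunks, and the leftover
short job is flushed by finishEvaluating().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not ECJ's API) of the deferred-evaluation
// pattern that prepareToEvaluate/evaluate/finishEvaluating enables:
// the "problem" buffers individuals into fixed-size jobs, dispatching
// a job whenever the buffer fills, and flushing the remainder at the end.
public class BufferingProblemSketch {
    static final int JOB_SIZE = 3;   // analogous to eval.masterproblem.job-size
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> dispatchedJobs = new ArrayList<>();

    void prepareToEvaluate() { buffer.clear(); }

    // Instead of evaluating immediately, queue the individual; send a
    // full job as soon as the buffer reaches JOB_SIZE.
    void evaluate(String individual) {
        buffer.add(individual);
        if (buffer.size() == JOB_SIZE) flush();
    }

    // Send out any remaining individuals as a (possibly short) job.
    void finishEvaluating() { if (!buffer.isEmpty()) flush(); }

    private void flush() {
        dispatchedJobs.add(new ArrayList<>(buffer));  // "send to a slave"
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferingProblemSketch p = new BufferingProblemSketch();
        p.prepareToEvaluate();
        for (int i = 0; i < 7; i++) p.evaluate("ind" + i);
        p.finishEvaluating();
        // 7 individuals with job size 3 -> jobs of 3, 3, and 1
        System.out.println(p.dispatchedJobs.size());         // prints 3
        System.out.println(p.dispatchedJobs.get(2).size());  // prints 1
    }
}
```
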

The SteadyStateEvaluator performs evaluation in a different way:

    problem.prepareToEvaluate(...)   // at the very beginning of evolution!
    loop forever
        if problem.canEvaluate(),
            create/breed individual
            problem.evaluate(individual, ...)
        individual = problem.getNextEvaluatedIndividual()
        if individual != null
            introduce individual to population
        sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, if
problem.evaluatedIndividualAvailable() returns true from the
MasterProblem -- indicating that a job has come back with individuals
for the SteadyStateEvaluator -- then getNextEvaluatedIndividual()
returns the next individual in that job.  This is sort of like the
SteadyStateEvaluator loading individuals onto a bus and sending it out;
when busses come back to the station, the SteadyStateEvaluator gets the
individuals one by one as they disembark.
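
The interaction above can be sketched as follows.  Again this is a
hypothetical stand-in, not ECJ's SteadyStateEvaluator or MasterProblem:
evaluate() fills a job; a filled job is "sent to a slave" (here, simply
evaluated at once); and finished individuals are handed back one at a
time by getNextEvaluatedIndividual().

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch (not ECJ's API) of the steady-state protocol:
// individuals accumulate in a job; when the job fills it is dispatched,
// and evaluated individuals are returned one by one, or null if none
// have come back yet.
public class SteadyStateSketch {
    static final int JOB_SIZE = 2;
    final Deque<String> currentJob = new ArrayDeque<>();
    final Deque<String> evaluated = new ArrayDeque<>();

    boolean canEvaluate() { return true; }  // real version: a slave queue has room

    void evaluate(String individual) {
        currentJob.add(individual);
        if (currentJob.size() == JOB_SIZE) {     // job filled: dispatch it
            while (!currentJob.isEmpty())
                evaluated.add(currentJob.poll() + ":evaluated");
        }
    }

    // Returns null when no evaluated individual has come back yet.
    String getNextEvaluatedIndividual() { return evaluated.poll(); }

    public static void main(String[] args) {
        SteadyStateSketch p = new SteadyStateSketch();
        p.evaluate("a");
        System.out.println(p.getNextEvaluatedIndividual());  // prints null
        p.evaluate("b");                                     // fills the job
        System.out.println(p.getNextEvaluatedIndividual());  // prints a:evaluated
    }
}
```
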

A note on how individuals come back from the slaves.  You'd think that
the individuals are sent out, and either revised fitnesses are read back
and replace their old fitnesses, or new individuals replace the old
individuals.  But that's not the case, because the MasterProblem
actually has no idea where the individuals it's receiving are stored.
Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals into a second array of
'scratch individuals'.  If fitnesses come back, for each individual i in
the scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals would
be corrupted and our whole evolutionary process messed up.  That's why
we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did!  Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it, using
Individual.writeIndividual(pipeOutputStream).  We then read from that
same pipe into the original individual, using
Individual.readIndividual(pipeInputStream).  It's a hack, but a pretty
one.
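
The pipe trick can be illustrated with plain java.io streams.  This
sketch is not ECJ's DataPipe or Individual API; it just shows how
writing one object's state into an in-memory buffer and reading it back
into another object effects a copy without any copyIntoMe()-style
method:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch (not ECJ's DataPipe): copy one object into
// another by serializing through an in-memory "pipe".
public class PipeCopySketch {
    static class Thing {
        int genomeLength;
        double fitness;

        void write(DataOutputStream out) throws IOException {
            out.writeInt(genomeLength);
            out.writeDouble(fitness);
        }
        void read(DataInputStream in) throws IOException {
            genomeLength = in.readInt();
            fitness = in.readDouble();
        }
    }

    public static void main(String[] args) throws IOException {
        Thing scratch = new Thing();       // freshly read back from a slave
        scratch.genomeLength = 32;
        scratch.fitness = 0.75;

        Thing original = new Thing();      // the one we want overwritten

        // "Pipe": write scratch into a buffer, read the buffer into original.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        scratch.write(new DataOutputStream(buffer));
        original.read(new DataInputStream(
            new ByteArrayInputStream(buffer.toByteArray())));

        System.out.println(original.genomeLength + " " + original.fitness);
        // prints 32 0.75
    }
}
```
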



MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave does come in, the master creates a
SlaveConnection to handle the slave communication.  The SlaveConnection
creates a read thread and a write thread to keep the pipeline to the
slave filled with Individuals.

You have the option of using compressed streams.  Unfortunately, Java
has broken compression -- it doesn't support "partial flush", which is
critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/ and ECJ will detect and use it
automatically.

<On Connection>
<- slave name                                readUTF
-> random number generator seed              writeInt
       (note: the seed then increments by SEED_INCREMENT (7919),
       and the Slave is free to use any integer value between
       the seed and seed + SEED_INCREMENT - 1 to seed its
       random number generators.  We hope 7919 is enough :-))
-> FLUSH

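The seed hand-out can be sketched as follows (the class and method
names are hypothetical; only SEED_INCREMENT = 7919 comes from the
protocol above):

```java
// Hypothetical sketch of the master's seed hand-out policy: a base
// seed taken from wall-clock time, advanced by SEED_INCREMENT for each
// connecting slave so that no two slaves' seed ranges overlap.
public class SeedHandoutSketch {
    static final int SEED_INCREMENT = 7919;
    int currentSeed;

    SeedHandoutSketch(int initialSeed) { currentSeed = initialSeed; }

    // Called once per connecting slave: the slave may use any value in
    // [seed, seed + SEED_INCREMENT - 1] for its own generators.
    int nextSlaveSeed() {
        int seed = currentSeed;
        currentSeed += SEED_INCREMENT;
        return seed;
    }

    public static void main(String[] args) {
        SeedHandoutSketch master =
            new SeedHandoutSketch((int) System.currentTimeMillis());
        int a = master.nextSlaveSeed();
        int b = master.nextSlaveSeed();
        System.out.println(b - a);  // prints 7919: the ranges don't overlap
    }
}
```
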

<SlaveConnection's writer protocol loops the following:>
-> job type                                  writeByte
       (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
       Slave.V_SHUTDOWN)
If Slave.V_SHUTDOWN,
       break connection
If Slave.V_EVALUATEGROUPED,
       -> countVictoriesOnly                 writeBoolean
-> number of individuals in job              writeInt
Loop for number of individuals in job,
       -> individual                         writeIndividual
       -> update ind's fitness               writeBoolean
-> FLUSH


<SlaveConnection's reader protocol loops the following:>
<- arbitrary byte (to block on)              readByte
Loop for number of individuals in job,
       <- returning individual type          readByte
              (Slave.V_INDIVIDUAL, Slave.V_FITNESS, or Slave.V_NOTHING)
       If Slave.V_INDIVIDUAL,
              <- individual                  readIndividual
       Else if Slave.V_FITNESS,
              <- evaluated?                  readBoolean
              <- fitness                     readFitness
       Else if Slave.V_NOTHING,
              (nothing is sent)
<- FLUSH

Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal is received.