This directory contains the code for running the ECJ master/slave
evaluator.  The master/slave evaluator allows you to connect one ECJ
evolution process with some N remote slave processes.  These processes
can come on-line at any time, and you can add new processes at any time,
perhaps as new machines become available, or to replace slaves which
have died for some reason.  If a remote slave dies, ECJ gracefully
handles it, rescheduling its jobs to the next available slave.

Slaves run in single-threaded mode, and so each has a single random
number generator.  A slave receives a 32-bit random number seed from the
master.  Initially the master selects a random number seed based on the
current wall clock time.  Each time a slave comes on-line, the master
increments this current seed and gives the slave the revised seed.
Slaves use this seed regardless of the seeds in their parameter files.

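The seed hand-out described above can be sketched in plain Java as
follows (a minimal illustration, not ECJ's actual code; the increment of
7919 is the SEED_INCREMENT described in the MASTER/SLAVE NETWORK
PROTOCOL section near the end of this file):

```java
// Sketch of the master's seed hand-out (illustrative, not ECJ's code).
// Each connecting slave gets the current seed; the master then advances
// the seed so that no two slaves share generator seeds.
class SeedHandout {
    static final int SEED_INCREMENT = 7919;

    int currentSeed;

    SeedHandout() {
        // initial seed based on the current wall clock time,
        // truncated to 32 bits
        currentSeed = (int) System.currentTimeMillis();
    }

    /** Returns the seed for the next slave and advances the counter. */
    synchronized int nextSeed() {
        int seed = currentSeed;
        currentSeed += SEED_INCREMENT;
        return seed;
    }
}
```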
You can use the master/slave evaluator freely in conjunction with island
models to connect each island to its own private group of N slaves,
though we have no examples for that in the distribution.

Typical params files for the master and for the slaves are illustrated
in the ec/app/star directory.

You fire up the master something like this:

    java ec.Evolve -file foo.master.params

(where foo.master.params might have parent.0 be normal ec
parameters, and have parent.1 = master.params)

You fire up each of the N slaves something like this:

    java ec.eval.Slave -file foo.slave.params

(where foo.slave.params might have parent.0 be normal ec
parameters, and have parent.1 = slave.params)

...and it should all nicely work!  The system works fine under
checkpointing as far as we know.
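For concreteness, the two files just described might look like this (a
hypothetical sketch -- the base parameter file and its path are
examples, not part of the distribution):

    # foo.master.params (hypothetical)
    parent.0 = my.normal.ec.params
    parent.1 = master.params

    # foo.slave.params (hypothetical)
    parent.0 = my.normal.ec.params
    parent.1 = slave.params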


MASTERS AND SLAVES

The master and slave processes can (and generally ought to) share
parameter files.  The way a slave knows it's a slave is through the
addition of the following *slave-only* parameter:

    eval.i-am-slave = true

The master sets up distributed evaluation by loading a special class,
called a MasterProblem, which replaces the Problem during evaluation.
The MasterProblem is defined by the following *master* parameter:

    eval.masterproblem = ec.eval.MasterProblem

When the Evaluator is started up, it normally sets up the Problem class.
If eval.masterproblem is set, the Evaluator also loads the master
problem and then *replaces* the Problem class prototype with a prototype
of the MasterProblem class.  The Problem prototype is then set to be the
'problem' instance variable in the MasterProblem prototype.  This
essentially allows the Problem to stick around even though it's never
called by the Evaluator any more -- instead, the MasterProblem is
called.

The MasterProblem's job is to send stuff to remote slaves.  Thus a slave
should not load it, but instead should load the regular Problem
instance.  The slave does this by checking the eval.i-am-slave
parameter.  If it's true, the slave simply ignores the
eval.masterproblem parameter entirely.

More information about the architecture may be found near the end of
this file.

The master listens on a socket for new slaves to arrive and register
themselves.  When slaves are fired up, they attempt to attach to this
socket and negotiate a connection.  This means that both the master and
the slave need to know the master's port number, and furthermore the
slave needs to know the master's IP address.  The socket port number is
specified in the *master and slave* parameter (here set to 15000):

    eval.master.port = 15000

The slave is told where the master is with the following *slave*
parameter:

    eval.master.host = put.the.master.ip.address.here

The master and slave can communicate over a compressed stream, and
communication is compressed by default.  Note that Java's compression
routines are broken (they don't support PARTIAL_FLUSH), so we have to
rely on a third-party library called JZLIB.  You can get the jar file
for this library on the ECJ main web page or at
http://www.jcraft.com/jzlib/
Compression is specified in the *master and slave* parameter:

    eval.compression = true

Last, the slave can be given a name.  This is solely for debugging
purposes.  If you don't provide this parameter, the slave will give
itself an arbitrary name, and that's fine.  The *slave* parameter is:

    eval.slave-name = put-my-name-here

The master can print out various debug information if you turn on the
following parameter (it is false by default):

    eval.masterproblem.debug-info = true
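Pulling the parameters in this section together, a master/slave pair
might share a setup along these lines (a hypothetical sketch; the host,
port, and slave name are example values):

    # shared by master and slave
    eval.master.port = 15000
    eval.compression = true

    # master only
    eval.masterproblem = ec.eval.MasterProblem

    # slave only
    eval.i-am-slave = true
    eval.master.host = 192.168.0.10
    eval.slave-name = slave-1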


JOBS

A slave receives chunks of individuals to evaluate and return as a
group.  This chunk is called a "job".  If you're doing non-coevolutionary
evolution, you can specify how many individuals the slave should receive
at one time with the *master* parameter (here set to 100):

    eval.masterproblem.job-size = 100

If you have very small individuals, or fast evaluation, this makes
better use of network bandwidth as it sends them as a group, evaluates
them all, and then returns them all.  This is because more individuals
can get packed into a TCP/IP packet before it's sent out, minimizing
overhead.  However, the primary effect of changing the job-size is
to modify the "slave evolution" population size (see the next section).

Another way to improve network efficiency, particularly with very fast
jobs, is to fill the network buffer with as many jobs as you can fit.
Each slave maintains a queue of jobs that it's working on.  When the
master needs to hand a job to a slave, it looks for one whose queue is
not filled, and then puts the job in the queue.  Each slave connection
keeps its TCP/IP buffer stuffed with as many of these queued jobs as it
can, so jobs are available at the Slaves before they even realize it.
To set the queue size, you use the *master* parameter (here it's being
set to 100):

    eval.masterproblem.max-jobs-per-slave = 100

If you only have 100 individuals in your population (say), this
won't fill all the jobs on one slave connection.  The system goes
round-robin through all the slaves and distributes jobs appropriately.
Even so, if you have new slaves coming online all the time, they'll
have to wait if jobs have already been measured out to the other
slaves, so in that case it might be wise to keep max-jobs-per-slave
a bit low.

If you are doing coevolutionary evolution, a job will consist of the
individuals necessary to perform one joint coevolutionary evaluation.
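The bounded-queue, round-robin hand-out described above can be sketched
in plain Java like this (an illustration of the scheduling policy only,
not ECJ's actual SlaveMonitor/SlaveConnection code; jobs are just
strings here):

```java
import java.util.*;

// Sketch of the master's job scheduling: each slave connection keeps a
// bounded queue of jobs, and the master hands a new job, round-robin,
// to the next slave whose queue is not yet full.
class JobScheduler {
    final int maxJobsPerSlave;        // eval.masterproblem.max-jobs-per-slave
    final List<Deque<String>> queues = new ArrayList<>();
    int next = 0;                     // round-robin pointer

    JobScheduler(int numSlaves, int maxJobsPerSlave) {
        this.maxJobsPerSlave = maxJobsPerSlave;
        for (int i = 0; i < numSlaves; i++)
            queues.add(new ArrayDeque<String>());
    }

    /** Places the job on the next non-full queue, round-robin; returns
        the chosen slave's index, or -1 if every queue is full. */
    int submit(String job) {
        for (int tried = 0; tried < queues.size(); tried++) {
            int slave = next;
            next = (next + 1) % queues.size();
            if (queues.get(slave).size() < maxJobsPerSlave) {
                queues.get(slave).addLast(job);
                return slave;
            }
        }
        return -1;      // caller would block until a slave frees up
    }
}
```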


SLAVE EVOLUTION

Slaves can operate in one of two modes: "regular" and "evolve".  In
regular mode, when a slave receives a job, it evaluates the individuals
and returns them or their resulting fitnesses (see 'FITNESS VERSUS
INDIVIDUAL' below).  In 'evolve' mode, the slave evaluates its
individuals; but if it has some extra time on its hands, it then treats
the individuals as a little population and does some evolution on it.
When time is up, it returns the most recent individuals in its little
population in lieu of the original individuals.  This is particularly
useful when your evaluation procedure is very fast compared to the
amount of time spent reading and writing individuals over the network.
Note that this only works in NON-coevolutionary evolution.

To turn on this feature in a slave, you set the following parameter in
the *slave*:

    eval.run-evolve = true

You will also need to specify how long the slave should do its
evolution.  This is specified in wall clock time (in milliseconds) with
the following *slave* parameter, here specifying 6 seconds:

    eval.runtime = 6000

Last, you'll need to turn this *slave* parameter on to get
"evolve" mode working right (see 'FITNESS VERSUS INDIVIDUAL' below for
more information):

    eval.return-inds = true

The size of the mini-"population" the slave is evolving is determined
by the job-size parameter discussed earlier, so you'll want to set it to
something appropriate.  Here again it's being set to 100:

    eval.masterproblem.job-size = 100
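In sum, evolve mode involves these parameters (example values; the
first three go in the slave's file, the last in the master's):

    eval.run-evolve = true
    eval.runtime = 6000
    eval.return-inds = true
    eval.masterproblem.job-size = 100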


FITNESS VERSUS INDIVIDUAL

By default the master/slave system sends individuals from master to
slave, but only returns FITNESS values from slave to master, in order to
save on network bandwidth.  However it is possible that some evaluations
of individuals literally change them during evaluation.  If you have
done such an evil thing, you'll need to have the modified individual
shipped back to the master for reinsertion.  Be sure to change the
appropriate *slave* parameter:

    eval.return-inds = true

If you're running in "evolve" mode, you will *always* need to set this
parameter to true.


SLAVE CONFIGURATION

Because individuals don't know that they're being evaluated remotely,
it's best to make your Slave's EvolutionState structure, and particularly
its Population structure, look and feel as much like the Master's as
possible.  Notably, Subpopulation classes, Species, Individual
representations, and Fitnesses should be the same.  This is particularly
important when you want to do evolution on the Slave.  The easiest way
to do this is simply to derive the Slave's parameter file from the
Master's parameter file.

However, if you're doing evolution on the Slave, this doesn't mean you
have to have the Slave set up the same way as the Master: just the
Population and the objects hanging off of it.  It might be preferable to
do (for example) generational evolution on the Slave while asynchronous
steady-state evolution is happening on the Master.  You're free to
change the high-level evolution procedures; but you should maintain the
same representations and breeding mechanisms so that Individuals on the
Slave are valid on the Master when they come back.

If the Master and Slave share parameters, what prevents the Slave from
trying to set up its own MasterProblem as well?  The answer: the
i-am-slave parameter.  The Evaluator class will not set up a
MasterProblem, even if one is specified, if i-am-slave is true.


ARCHITECTURE

The system's classes naturally break into two groups: the Slave class
and the various master-side classes (all others).  The Slave class is
essentially a replacement for Evolve.java which sets up a dummy ECJ
process in which to evaluate individuals.  This means, of course, that
the Slave must have the same basic evolutionary parameters --
particularly representation parameters -- as the master evolutionary
process.

The master system is set up by MasterProblem, a special version of the
Problem class which, instead of evaluating individuals, sends them to
remote Slaves to be evaluated.  As mentioned before, MasterProblem is
loaded in the master process and then replaces the original Problem
instance, in essence "pretending" to be a Problem instance.

Like any Problem, multiple MasterProblems are created during the course
of evaluation, one per thread and per generation.  During setup, the
MasterProblem prototype constructs one shared class which acts as the
clearinghouse for remotely evaluating individuals handed to it by the
various MasterProblems.  This class is called the SlaveMonitor.

The SlaveMonitor is responsible for keeping track of the remote
slave connections.  For each Slave which has connected, the SlaveMonitor
manages reading and writing to that slave via a SlaveConnection.  The
SlaveConnection holds the job queue for that remote Slave, holds the
socket connection and streams to the remote Slave, and runs worker
threads which read and write jobs to/from the Slave.  MasterProblems
submit jobs to the SlaveMonitor, which in turn distributes them to an
available slave.

Most evaluation procedures can take advantage of this to provide a
degree of semi-asynchrony.  For example, SimpleEvaluator performs
per-thread evaluation in the following way:

    problem.prepareToEvaluate(...)
    for each individual
        problem.evaluate(individual,...)
    problem.finishEvaluating(...)

This protocol informs the underlying Problem that it is free to delay
actual evaluation of the requested individuals until
finishEvaluating(...) time.  This in turn allows a MasterProblem to bulk
up all the individuals as it likes.  The MasterProblem will then read in
a full job's worth of individuals, then send them out to a slave, then
read another job's worth of individuals, then send THEM out to another
slave, and so on.  When SimpleEvaluator calls finishEvaluating, the
remaining individuals are sent out as a (possibly short) job, and then
the MasterProblem blocks waiting until all the individuals have come
back.
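The deferred-evaluation behavior just described can be sketched like
this (a simplified illustration of the batching pattern: the method
names follow the Problem protocol above, but here "individuals" are
just strings, and "sending a job to a slave" is collecting it into a
list rather than writing it over a socket):

```java
import java.util.*;

// Sketch of how a MasterProblem can defer evaluation until
// finishEvaluating() time, accumulating individuals into jobs.
class BatchingProblem {
    final int jobSize;                  // eval.masterproblem.job-size
    final List<String> pending = new ArrayList<>();
    final List<List<String>> dispatched = new ArrayList<>();  // "sent" jobs

    BatchingProblem(int jobSize) { this.jobSize = jobSize; }

    void prepareToEvaluate() { pending.clear(); }

    /** Queues the individual; ships a full job once jobSize are queued. */
    void evaluate(String individual) {
        pending.add(individual);
        if (pending.size() >= jobSize) flush();
    }

    /** Ships the remaining individuals as a (possibly short) job; the
        real system would then block until all results had come back. */
    void finishEvaluating() { if (!pending.isEmpty()) flush(); }

    private void flush() {
        dispatched.add(new ArrayList<>(pending));   // "send to a slave"
        pending.clear();
    }
}
```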

The SteadyStateEvaluator performs evaluation in a different way:

    problem.prepareToEvaluate(...)   // at the very beginning of evolution!
    loop forever
        if problem.canEvaluate(),
            create/breed individual
            problem.evaluate(individual, ...)
        individual = problem.getNextEvaluatedIndividual()
        if individual != null
            introduce individual to population
        sleep a tiny bit to avoid spin-waiting

Note that problem.finishEvaluating(...) is NEVER CALLED.  Here, if a
Slave's queue can accept another job, the MasterProblem returns true
from canEvaluate().  Each time problem.evaluate() is called, the
MasterProblem adds the individual to the job until the job is filled, at
which time it sends the job off to the Slave.  Likewise, when
problem.evaluatedIndividualAvailable() returns true from the
MasterProblem -- indicating that a job has come back with individuals
for the SteadyStateEvaluator -- getNextEvaluatedIndividual() returns the
next individual in that job.  This is sort of like the
SteadyStateEvaluator loading individuals onto a bus and sending it out;
when busses come back to the station, the SteadyStateEvaluator gets the
individuals one by one as they disembark.

A note on how individuals come back from the slaves.  You'd think that
the individuals are sent out, and then either revised fitnesses are read
back and replace their old fitnesses, or new individuals replace the old
individuals.  But that's not the case.  This is because the
MasterProblem actually has no idea where the individuals it's receiving
are stored.  Instead we have a bit of an odd way of doing it.

When a job is submitted to a Slave, we send the individuals off to the
Slave, and then clone all the individuals into a second array of
'scratch individuals'.  If fitnesses come back, for each individual i in
the scratch array, we call
scratchindividual[i].fitness.readFitness(inputStream).  If whole
individuals come back, for each individual i in the scratch array, we
call scratchindividual[i].readIndividual(inputStream).  We don't want to
call these functions on the original individuals because if the Slave
dies in the middle of transmission, the population's individuals would
be corrupted and our whole evolutionary process messed up.  That's why
we read into scratch individuals.

So how do we then get the scratch individuals copied into the originals,
if we don't know where the originals are stored?  ECJ does not have an
Individual.copyIndividualIntoMe(individual) function -- it'd be nice if
it did!  Instead, we create a special buffered pipe, stored in
ec/util/DataPipe.java, and write a scratch individual into it using
Individual.writeIndividual(pipeOutputStream).  We then read from that
same pipe into the original individual using
Individual.readIndividual(pipeInputStream).  It's a hack, but a pretty
one.
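The pipe trick can be illustrated in miniature as follows (a
self-contained sketch of the idea, not ECJ's actual Individual or
ec/util/DataPipe code; here an in-memory byte buffer stands in for the
pipe, and the "individual" is just a fitness plus an integer genome):

```java
import java.io.*;

// Sketch of copying one individual's state into another, without a
// copy-into-me method, by writing it out and reading it back in.
class Individual {
    double fitness;
    int[] genome;

    void writeIndividual(DataOutputStream out) throws IOException {
        out.writeDouble(fitness);
        out.writeInt(genome.length);
        for (int gene : genome) out.writeInt(gene);
    }

    void readIndividual(DataInputStream in) throws IOException {
        fitness = in.readDouble();
        genome = new int[in.readInt()];
        for (int i = 0; i < genome.length; i++) genome[i] = in.readInt();
    }

    /** Copies 'scratch' into 'original' by way of an in-memory pipe. */
    static void copy(Individual scratch, Individual original) {
        try {
            ByteArrayOutputStream pipe = new ByteArrayOutputStream();
            scratch.writeIndividual(new DataOutputStream(pipe));
            original.readIndividual(new DataInputStream(
                new ByteArrayInputStream(pipe.toByteArray())));
        } catch (IOException e) {
            throw new RuntimeException(e);   // can't happen in-memory
        }
    }
}
```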


MASTER/SLAVE NETWORK PROTOCOL

The master maintains a thread which continually waits for new slaves
to come in.  When a slave comes in, the master creates a SlaveConnection
to handle the slave communication.  The SlaveConnection creates a read
thread and a write thread to keep the pipeline to the slave filled with
Individuals.

You have the option of using compressed streams.  Unfortunately,
Java has broken compression -- it doesn't support "partial flush", which
is critical for doing compressed network streams.  To do compression you
will need to download the JZLIB library from the ECJ website or from
http://www.jcraft.com/jzlib/ and ECJ will detect and use it
automatically.

<On Connection>
    <-  slave name                              readUTF
    ->  random number generator seed            writeInt
        (note: the seed then increments by SEED_INCREMENT (7919),
        and the Slave is free to use any integer value between
        the seed and seed + SEED_INCREMENT - 1 to seed its
        random number generators.  We hope 7919 is enough :-)
    ->  FLUSH


<SlaveConnection's writer protocol loops the following:>
    ->  job type                                writeByte
        (Slave.V_EVALUATESIMPLE or Slave.V_EVALUATEGROUPED or
        Slave.V_SHUTDOWN)
    If Slave.V_SHUTDOWN,
        break connection
    If Slave.V_EVALUATEGROUPED,
        ->  countVictoriesOnly                  writeBoolean
    ->  number of individuals in job            writeInt
    Loop for number of individuals in job,
        ->  individual                          writeIndividual
        ->  update ind's fitness                writeBoolean
    ->  FLUSH


<SlaveConnection's reader protocol loops the following:>
    <-  arbitrary byte (to block on)            readByte
    Loop for number of individuals in job,
        <-  returning individual type           readByte
            (Slave.V_INDIVIDUAL, Slave.V_FITNESS, or Slave.V_NOTHING)
        If Slave.V_INDIVIDUAL,
            <-  individual                      readIndividual
        Else if Slave.V_FITNESS,
            <-  evaluated?                      readBoolean
            <-  fitness                         readFitness
        Else if Slave.V_NOTHING,
            (nothing is sent)
    <-  FLUSH


Note that the reader protocol does not tell the SlaveConnection how many
individuals there are in the job.  This is because the SlaveConnection
already knows: jobs are received in the order they were sent.

The slaves and slave connections shut down when the socket breaks
or when the Slave.V_SHUTDOWN signal is received.
---|