Ticket #1233 (closed enhancement: done)
Reported by: cneumuel
Owned by: ascheibe
Description (last modified by ascheibe)
- Refactor domain objects and db-schema
- Split info-objects and data-objects (like Job and JobData)
- Data Access Layer (more consistent method names, more compact code, inspired by OKB)
- Split transaction and db-context handling
- Allow uploading of plugins for a job (or hiveexperiment)
- Make WCF service completely stateless. Put all remaining state-information into the database (latestHeartbeats, latestConsistencyCheck, newlyAssignedJobs (remove completely and solve by adding a heartbeat))
- StateLog: Log state transitions of jobs.
- Statistics:
- Measure core capacity and utilization every minute
- Measure CPU and memory capacity and utilization every minute
- Reliably measure the execution time spent on hive per user / in total. Also measure speedup values (maybe also per minute).
- Keep deleted jobs in database (flag them) - only delete JobData, plugins etc.
- Number of experiments / jobs (per user).
- Job per slave
- Calculate overall productivity per job (waiting time vs. computation time)
- Consider waiting time to avoid starvation
- Users should have priorities
- A user should be able to manage priorities only in the scope of his own experiments
- Childjobs should automatically have the priorities of their parent jobs
- Precomputed job-queue
Fix wrong timestamps in statelog on services.heuristiclab.com
- Adapt Slave for new Server
- Refactor Slave (easier communication between core and executor)
- Tests
- Console Client
- Windows Service Client
- Installer for Slave
- Windows Tray Icon for Slave
- HL App Client
- Sort out problem with uploaded, modified assemblies which aren't downloaded to the slave; Add GUIDs to PluginCache
- Heartbeat interval should be controllable by the server
- Creation of a unique Id for a machine which does not change if the config is deleted
- Correct total physical memory available for a slave (ConfigManager)
- Test sandboxing and security of appdomains. If any assemblies can be uploaded by users, this becomes very important.
- React on SayHello action (call Hello service method)
- Send cpu utilisation with every heartbeat
- Log exceptions to Windows Event Log
- FreeCores needs to be decremented right after a CalculateJob message has been received. Otherwise a slave reports free cores which are already reserved for new jobs.
- PluginTemp directory should be cleaned up from time to time (or on startup)
- SlaveCommListener in Slave.Tests should not be used in ConsoleClient
- Heartbeats are massively delayed, because the heartbeat method locks on engines (in GetExecutionTimeOfAllJobs) and the same lock is taken in StartJobInAppDomain. This causes a slave heartbeat timeout (1 minute), and thus a reset and reassignment of all jobs.
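The FreeCores item above amounts to reserving cores at the moment the message arrives, not when the job actually starts. A minimal Python sketch of that idea follows (Hive itself is .NET; `CoreAccounting` and its methods are hypothetical names, only FreeCores and CalculateJob come from the ticket). It also reads the heartbeat value under a short, dedicated lock, which is one possible way to avoid sharing a long-held lock with job startup.

```python
import threading


class CoreAccounting:
    """Tracks free cores on a slave (hypothetical helper, not Hive code)."""

    def __init__(self, total_cores: int) -> None:
        self._lock = threading.Lock()
        self._free_cores = total_cores

    def reserve(self, cores_needed: int) -> bool:
        """Call as soon as a CalculateJob message is received."""
        with self._lock:
            if cores_needed > self._free_cores:
                return False  # not enough capacity, reject the job
            self._free_cores -= cores_needed
            return True

    def release(self, cores: int) -> None:
        """Call when a job finishes, fails, or is aborted."""
        with self._lock:
            self._free_cores += cores

    def snapshot(self) -> int:
        """Value to report in the next heartbeat; the lock is held only briefly."""
        with self._lock:
            return self._free_cores


accounting = CoreAccounting(total_cores=4)
accounting.reserve(2)         # CalculateJob received, job not started yet
print(accounting.snapshot())  # the heartbeat already reports 2 free cores
```

Because the reservation happens before any job startup work, a heartbeat sent in between cannot advertise cores that are already spoken for.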
- Show jobs in treeview. Would greatly save screen space and navigation-clicks to be enhanced (event wiring)
- Sort HiveExperiments alphabetically
- Plugin-Upload (optional)
- Experiment Sharing
- Appropriate numbering of Runs
- Use Service-Call pattern from OKB (or PPOV-Cockpit)
- Show StateLog - use Gantt Chart like view
- Pause and stop single jobs
- Paused jobs should not be integrated into experiment, so results are not lost.
- Parameters of paused jobs should be changeable (and used when resumed).
- Deleting jobs after adding them (neither the remove button, nor the del key, nor the context menu entry succeeds in deleting a job (experiment) that has just been dragged in)
- HiveEngine jobs should have a HiveExperiment, which is marked, so a user cannot see it in HiveExperimentManager. However it should be visible in Administration GUI.
- If a Hive Engine crashes and cannot delete the experiment, this should be detected by the server and it should be automatically deleted.
- Improve HiveEngine View (list of jobs, with status etc.)
- Stabilize
Missing WebService Methods:
- GetJobsBySlave -> GetJobsByResourceId
- GetGlobalStatistics (for Statistics TabPage)
- GetScheduleForResource (+ Add/Update/Delete)
- convert HeuristicLab.Calender to a plugin
- use svcutil
- write partial classes for dtos and implement IContent
- build Observable Collections for Users/Slaves/Groups
- add ContentViews for Users and SlaveGroups
- show some fancy statistics
- add Save Button
- integrate HeuristicLab.Services.Hive.Common-3.4 in Server
- get rid of HiveItem etc. on Server
Architects meeting (16.06.2011)
- TransactionManager with interface again
- remove AssignedResourcesId in AssignedResources, use JobId+ResourceId as primary keys
- remove CreateHiveDatabaseApplication. the db schema should not be developed dbml first, since dbml does not support most sql-server features. instead the sql-server schema should be designed first and the dbml should be generated.
- UptimeCalendar should be named DowntimeCalendar
- DataAccess layer and Dao classes should be removed, access to linq to sql should happen directly in server-implementation.
- Lifecycle should be named differently. maybe EventHandler, EventManager.
- put magic numbers into config:
- timeout in Lifecycle
- ApplicationConstants
- look for magic numbers in hive client
- GetWaitingJobs should be implemented as a stored procedure and should also assign a job to a slave. it should make sure no race conditions occur if it is called concurrently.
- ascheibe: moved back to next HL release
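The GetWaitingJobs requirement above (select a waiting job and assign it in one step, safe under concurrent calls) can be illustrated with a small sketch. This is not the planned SQL Server stored procedure; it uses Python's `sqlite3` module to show the general pattern of doing the select and the update inside one write-locked transaction. The table name `job`, its columns, and the state names `Waiting` and `Calculating` are assumptions made for the example.

```python
import sqlite3


def claim_waiting_job(conn: sqlite3.Connection, slave_id: str):
    """Atomically assign the oldest waiting job to slave_id; return its id or None."""
    # BEGIN IMMEDIATE takes the write lock up front, so two concurrent callers
    # cannot both read the same waiting job before one of them updates it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM job WHERE state = 'Waiting' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE job SET state = 'Calculating', slave_id = ? WHERE id = ?",
            (slave_id, row[0]),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode so that the
# explicit BEGIN/COMMIT statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, slave_id TEXT)")
conn.executemany("INSERT INTO job (state) VALUES (?)", [("Waiting",), ("Waiting",)])

print(claim_waiting_job(conn, "slave-a"))  # 1
print(claim_waiting_job(conn, "slave-b"))  # 2
print(claim_waiting_job(conn, "slave-c"))  # None, nothing left to assign
```

In a stored procedure the same effect would come from the database's own locking or from a single update statement that both picks and marks the row; the sketch only shows why selection and assignment need to be one atomic unit.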
- rename: HiveExperiment -> Job, Job -> Task
- HiveExperimentPermissions: the GrantedUserId could be removed
- ascheibe: GrantedUserId is part of the PK and can't be removed. GrantedByUserId is not necessary and could be removed, but it still could be interesting information?
- ascheibe: When talking to swagner it was decided that we leave it because it could be interesting in the future.
- only Full and Read permissions are necessary (Read: just read!, Full: control, delete, grant permissions)
- remove LastAccessed and IsHiveEngine. there should be a category field instead.
Remarks for the future (cneumuel) (28.06.2011)
- GetPlugins currently returns all plugins from the server. This exposes all uploaded assemblies. When confidentiality for plugins is relevant this method should be removed and only GetPlugin(s)ById and GetPlugin(s)ByHash should be available.
- Slave-user: Each hive slave uses the same username and password. A slave is allowed to download jobs. When a slave downloads a job it should be checked if the job is assigned to this slave or a parent-slave-group (not implemented yet). However it is still possible for an attacker to fake the ID of another slave (if it is known) and get access to jobs.
- Further measures to include (as total sums, also keep deleted jobs in DeletedJobStatistics):
- Globally: FinishedJobs, WaitingJobs, FailedJobs, AbortedJobs, TransferringJobs, PausedJobs
- Per user: total jobdata-size (MB)
- Increasing number of slaves puts pressure on the server with increasing response times and some deadlock-situations. Ideas to resolve:
- Increase heartbeat-interval (maybe dynamic when the number of slaves gets higher). Remember to increase the SlaveHeartbeatTimeout in the web.config too.
- Make GetWaitingJobs faster by using stored procedure or use a job-queue instead of querying the whole job-table.
- Large jobs (>15MB) sometimes result in database-timeouts, especially if several of them are uploaded concurrently. Ideas to resolve:
- Use Filestream as db-type instead of Varbinary as it is supposed to be faster for large data-blobs.
- As streaming is not an option (no security, encryption), using a chunking channel could work (http://msdn.microsoft.com/en-us/library/aa717050.aspx).
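To make the chunking idea concrete, here is a minimal application-level sketch in Python. It is not the WCF chunking channel from the linked sample; it only shows the principle of splitting a large job payload into bounded messages and reassembling it on the other side. The 1 MiB chunk size and the function names are assumptions.

```python
import hashlib
from typing import Iterator, List, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB per message; an assumed value, tune as needed


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, part) pairs so the receiver can restore the order."""
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        yield index, data[offset:offset + chunk_size]


def reassemble(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Join the parts by index; chunks may arrive in any order."""
    return b"".join(part for _, part in sorted(chunks, key=lambda c: c[0]))


payload = bytes(range(256)) * 10_000  # 2,560,000 bytes of example job data
chunks = list(split_into_chunks(payload))
print(len(chunks))  # 3

# Compare checksums to confirm the payload survived the round trip,
# even when the chunks are received in reverse order.
restored = reassemble(list(reversed(chunks)))
print(hashlib.sha256(restored).hexdigest() == hashlib.sha256(payload).hexdigest())  # True
```

Each chunk stays well below typical message-size limits, so every individual call can still go through the secured, buffered binding.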
Some ideas for a scheduler:
- 3 levels of priorities:
- Job priority (fixed at upload)
- User priority (fixed)
- Time (dynamic: f(Now-Uploaded))
- Those 3 priority values are aggregated (average or (weighted) sum) to form the final priority by which the jobs are ordered.
- Fast-slaves-first: Faster slaves get the jobs first, slow slaves later. This would require:
- Performance-index: Let each slave calculate a benchmark-job before it is used.
- Job-queues per slave: Right now every slave that sends a heartbeat gets a job (if one is available). One queue per slave would allow the server to actively assign jobs to slaves. Such a queue could also ease performance issues and race conditions.
- Re-scheduling: Sometimes fast slaves finish their jobs and slow slaves are still calculating. In those cases it might be reasonable to pause the jobs and have them calculated on the faster slaves.
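The scheduler ideas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a proposed implementation: the weights, the scaling of the time component (one priority point per waiting hour), the convention that a higher benchmark value means a faster slave, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedJob:
    name: str
    job_priority: float   # fixed at upload
    user_priority: float  # fixed per user
    uploaded: datetime


def final_priority(job, now, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of job priority, user priority and waiting time in hours."""
    time_priority = (now - job.uploaded).total_seconds() / 3600.0
    w_job, w_user, w_time = weights
    return w_job * job.job_priority + w_user * job.user_priority + w_time * time_priority


def order_queue(jobs, now, weights=(1.0, 1.0, 1.0)):
    """Highest final priority first."""
    return sorted(jobs, key=lambda j: final_priority(j, now, weights), reverse=True)


def assign_fast_slaves_first(ordered_jobs, slave_benchmarks):
    """Pair the highest-priority jobs with the fastest slaves."""
    fastest_first = sorted(slave_benchmarks, key=slave_benchmarks.get, reverse=True)
    return dict(zip(fastest_first, (job.name for job in ordered_jobs)))


now = datetime(2011, 6, 28, 12, 0)
jobs = [
    QueuedJob("new-high", job_priority=5, user_priority=2, uploaded=now - timedelta(hours=1)),
    QueuedJob("old-low", job_priority=1, user_priority=1, uploaded=now - timedelta(hours=10)),
]
ordered = order_queue(jobs, now)
print([job.name for job in ordered])  # ['old-low', 'new-high']
print(assign_fast_slaves_first(ordered, {"slow": 0.5, "fast": 2.0}))
```

In the example the long-waiting job scores 1 + 1 + 10 = 12 and overtakes the newer, higher-priority job at 5 + 2 + 1 = 8, which is the starvation-avoidance effect of the dynamic time component.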
- Summary changed from Refactore Hive Project Structure to Hive-3.4 development
comment:37 Changed 2 years ago by cneumuel
- Description modified
comment:131 Changed 23 months ago by cneumuel
comment:164 Changed 21 months ago by cneumuel
- Owner changed from cneumuel to ascheibe
- Status changed from accepted to assigned
comment:201 Changed 19 months ago by ascheibe
- Status changed from assigned to reviewing
- Milestone changed from HeuristicLab 3.3.x Backlog to HeuristicLab 3.3.6