id summary reporter owner description type status priority milestone component version resolution keywords cc
1233 Hive-3.4 development cneumuel ascheibe "= General notes =
=== Server ===
* ~~Refactor domain objects and db-schema~~
* ~~Split info-objects and data-objects (like `Job` and `JobData`)~~
* ~~Data Access Layer (more consistent method names, more compact code, inspired by OKB)~~
* ~~Split transaction and db-context handling~~
* ~~Allow uploading of plugins for a job (or hiveexperiment)~~
* ~~Make WCF service completely stateless. Put all remaining state-information into the database (latestHeartbeats, latestConsistencyCheck, newlyAssignedJobs (remove completely and solve by adding a heartbeat))~~
* ~~`StateLog`: Log state transitions of jobs.~~
* ~~Statistics~~
  * ~~Measure core capacity and utilization every minute~~
  * ~~Measure CPU and memory capacity and utilization every minute~~
  * ~~Reliably measure the execution time spent on hive per user / in total. Also measure speedup values (maybe also per minute). Keep deleted jobs in the database (flag them); only delete `JobData`, plugins etc.~~
  * ~~Number of experiments / jobs (per user). Jobs per slave~~
  * ~~Calculate overall productivity per job (waiting time vs. 
computation time)~~
* Scheduler
  * Consider waiting time to avoid starvation
  * Users should have priorities
  * A user should be able to manage priorities only in the scope of his own experiments
  * Child jobs should automatically have the priorities of their parent jobs
  * Precomputed job-queue
* ~~Fix wrong timestamps in statelog on services.heuristiclab.com~~
=== Slave ===
* ~~Adapt Slave for new Server~~
* ~~Refactor Slave (easier communication between core and executor)~~
* ~~Tests~~
* ~~Console Client~~
* ~~Windows Service Client~~
* ~~Installer for Slave~~
* ~~Windows Tray Icon for Slave~~
* ~~HL App Client~~
* ~~Sort out problem with uploaded, modified assemblies which aren't downloaded to the slave; Add GUIDs to `PluginCache` ~~
* ~~Heartbeat interval should be controllable by the server~~
* ~~Creation of a unique Id for a machine which does not change if the config is deleted~~
* ~~Correct total physical memory available for a slave (`ConfigManager`)~~
* ~~Test sandboxing and security of appdomains. If any assemblies can be uploaded by users, this becomes very important.~~
* ~~React to `SayHello` action (call `Hello` service method)~~
* ~~Send CPU utilisation with every heartbeat~~
* ~~Log exceptions to Windows Event Log~~
* ~~`FreeCores` needs to be decremented right after a `CalculateJob` message has been received. Otherwise a slave reports free cores which are already reserved for new jobs.~~
* ~~`PluginTemp` directory should be cleaned up from time to time (or on startup)~~
* ~~`SlaveCommListener` in Slave.Tests should not be used in `ConsoleClient`~~
* ~~Heartbeats are massively delayed, because the heartbeat-method locks on `engines` (in `GetExecutionTimeOfAllJobs`) and the same lock is taken in `StartJobInAppDomain`. This causes a slave-heartbeat-timeout (1 minute) and thus a reset and reassignment of all jobs.~~
=== Experiment Manager ===
* ~~Show jobs in treeview. 
Would greatly save screen space and navigation-clicks~~
  * ~~to be enhanced (event wiring)~~
* ~~Sort `HiveExperiments` alphabetically~~
* ~~Plugin-Upload (optional)~~
* ~~Experiment Sharing~~
* ~~Appropriate numbering of Runs~~
* ~~Use Service-Call pattern from OKB (or PPOV-Cockpit)~~
* ~~Show `StateLog` - use Gantt Chart like view~~
* ~~Pause and stop single jobs~~
* ~~Paused jobs should not be integrated into experiment, so results are not lost. Parameters of paused jobs should be changeable (and used when resumed).~~
* Deleting jobs after adding them (neither the remove button, nor the del key, nor the context menu entry succeeds in deleting a job (experiment) that has just been dragged in)
=== Hive Engine ===
* ~~`HiveEngine` jobs should have a `HiveExperiment`, which is marked, so a user cannot see it in `HiveExperimentManager`. However it should be visible in the Administration GUI. If a Hive Engine crashes and cannot delete the experiment, this should be detected by the server and it should be automatically deleted.~~
* ~~Improve `HiveEngine` View (list of jobs, with status etc.)~~
* ~~Stabilize~~
=== Administration ===
Missing `WebService` Methods:
* ~~`GetAllHiveExperiments`~~
* `GetUsers`
* `GetUserStatistics`
* ~~`GetJobsBySlave` -> `GetJobsByResourceId` ~~
* `GetGlobalStatistics` (for Statistics `TabPage`)
* ~~`GetScheduleForResource` (+ `Add/Update/Delete`)~~

TODOS:
* ~~convert `HeuristicLab.Calender` to a plugin~~
* ~~use svcutil~~
* ~~write partial classes for dtos and implement `IContent`~~
* ~~build Observable Collections for `Users/Slaves/Groups`~~
* ~~add `ContentViews` for Users and `SlaveGroups`~~
* ~~show some fancy statistics~~
* ~~add Save Button~~
* ~~integrate `HeuristicLab.Services.Hive.Common-3.4` in Server~~
* ~~get rid of `HiveItem` etc. 
on Server~~
= Meeting protocols =
=== Architects meeting ^(16.06.2011)^ ===
'''DataAccess:'''
* ~~`TransactionManager` with interface again~~
* ~~remove `AssignedResourcesId` in `AssignedResources`, use JobId+ResourceId as primary keys~~
* ~~remove `CreateHiveDatabaseApplication`. The db schema should not be developed `dbml first`, since dbml does not support most SQL Server features. Instead the SQL Server schema should be designed first and the dbml should be generated.~~
* ~~`UptimeCalendar` should be named `DowntimeCalendar`~~
* ~~`DataAccess` layer and `Dao` classes should be removed, access to LINQ to SQL should happen directly in server-implementation.~~

'''Server'''
* ~~`Lifecycle` should be named differently. Maybe `EventHandler`, `EventManager`.~~
* ~~put magic numbers into config~~
* ~~timeout in `Lifecycle`~~
* ~~`ApplicationConstants`~~
* ~~look for magic numbers in hive client~~
* ~~`GetWaitingJobs` should be implemented as a stored procedure and should also assign a job to a slave. It should make sure no race conditions occur if it is called concurrently.~~
  * ascheibe: moved back to next HL release

'''!HiveExperiment'''
* ~~ rename: `HiveExperiment` -> `Job`, `Job` -> `Task` ~~
* ~~`HiveExperimentPermissions` ~~
* ~~the `GrantedUserId` could be removed ~~
  * ascheibe: `GrantedUserId` is part of the PK and can't be removed. `GrantedByUserId` is not necessary and could be removed, but it still could be interesting information?
  * ascheibe: When talking to swagner it was decided that we leave it because it could be interesting in the future.
* ~~only `Full` and `Read` permissions are necessary (`Read`: just read!, `Full`: control, delete, grant permissions)~~
* ~~ remove `LastAccessed` and `IsHiveEngine`. There should be a category field instead. ~~
=== Remarks for the future (cneumuel) ^(28.06.2011)^ ===
'''Security'''
* `GetPlugins` currently returns all plugins from the server. This exposes all uploaded assemblies. 
When confidentiality for plugins is relevant this method should be removed and only `GetPlugin(s)ById` and `GetPlugin(s)ByHash` should be available.
* Slave-user: Each hive slave uses the same username and password. A slave is allowed to download jobs. When a slave downloads a job it should be checked if the job is assigned to this slave or a parent-slave-group (not implemented yet). However it is still possible for an attacker to fake the ID of another slave (if it is known) and get access to jobs.

'''Statistics'''
* Further measures to include (as total sums, also keep deleted jobs in `DeletedJobStatistics`):
  * Globally: !FinishedJobs, !WaitingJobs, !FailedJobs, !AbortedJobs, !TransferringJobs, !PausedJobs
  * Per user: total jobdata-size (MB)

'''Server performance'''
* An increasing number of slaves puts pressure on the server, with increasing response times and some deadlock situations. Ideas to resolve:
  * Increase heartbeat-interval (maybe dynamic when the number of slaves gets higher). Remember to increase the `SlaveHeartbeatTimeout` in the web.config too.
  * Make `GetWaitingJobs` faster by using a stored procedure, or use a job-queue instead of querying the whole job-table.
* Large jobs (>15MB) sometimes result in database timeouts, especially if several of them are uploaded concurrently. Ideas to resolve:
  * Use `Filestream` as db-type instead of `Varbinary` as it is supposed to be faster for large data-blobs.
  * As streaming is not an option (no security, encryption), using a `chunking channel` could work (http://msdn.microsoft.com/en-us/library/aa717050.aspx).

'''Scheduling'''\\
Some ideas for a scheduler:
* 3 levels of priorities:
  * Job priority (fixed at upload)
  * User priority (fixed)
  * Time (dynamic: `f(Now-Uploaded)`)
* Those 3 priority values are aggregated (average, (weighted-)sum) to form the final priority by which the jobs are ordered.
* Fast-slaves-first: Faster slaves get the jobs first, slow slaves later. 
This would require:
  * Performance-index: Let each slave calculate a benchmark-job before it is used.
* Job-queues per slave: Right now every slave that sends a heartbeat gets a job (if one is available). One queue per slave would allow the server to actively assign jobs to slaves. Such a queue could also ease performance issues and race conditions.
* Re-scheduling: Sometimes fast slaves finish their jobs and slow slaves are still calculating. In those cases it might be reasonable to pause the jobs and have them calculated on the faster slaves.
" enhancement closed medium HeuristicLab 3.3.6 Hive.General 3.3.6 done ascheibe
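
The priority aggregation described under '''Scheduling''' in the ticket above could be sketched roughly as follows. This is only an illustration under stated assumptions: it picks the weighted-sum variant, uses a linear waiting-time term for `f(Now-Uploaded)`, and all names (`WaitingTask`, `waiting_score`, `final_priority`, `order_queue`) and the default weights are hypothetical, not part of Hive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WaitingTask:
    """Hypothetical record; field names are illustrative, not Hive's schema."""
    name: str
    job_priority: float   # fixed at upload
    user_priority: float  # fixed per user
    uploaded: datetime    # upload timestamp


def waiting_score(now, uploaded, per_hour=1.0):
    """Assumed form of f(Now - Uploaded): grows linearly with hours waited."""
    return per_hour * (now - uploaded).total_seconds() / 3600.0


def final_priority(task, now, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of job priority, user priority and waiting time."""
    w_job, w_user, w_time = weights
    return (w_job * task.job_priority
            + w_user * task.user_priority
            + w_time * waiting_score(now, task.uploaded))


def order_queue(tasks, now, weights=(1.0, 1.0, 1.0)):
    """Return tasks ordered by descending final priority."""
    return sorted(tasks, key=lambda t: final_priority(t, now, weights), reverse=True)


now = datetime(2011, 6, 28, 12, 0)
tasks = [
    WaitingTask("fresh-high", 3.0, 1.0, now - timedelta(minutes=30)),
    WaitingTask("old-low", 1.0, 1.0, now - timedelta(hours=5)),
]
for task in order_queue(tasks, now):
    print(task.name, final_priority(task, now))
```

Because the time term keeps growing while a task waits, a low-priority task eventually overtakes newer high-priority ones, which is one way to address the starvation concern listed under Scheduler. Setting the time weight to zero falls back to ordering purely by the fixed priorities.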