WorkflowSim is a open source simulator that extends CloudSim by providing a workflow level support of simulation. It models workflows with a DAG model with support an elaborate model of node failures, a model of delays occurring in the various levels of the WMS stack, and the implementations of several most popular dynamic and static workflow schedulers (e.g., HEFT, Min-Min) and task clustering algorithms (e.g., runtime-based algorithms, data-oriented algorithms and fault tolerant clustering algorithms). Parameters are directly learned from traces of real executions.
Workflow Mapper is used to import DAG files formatted in XML (called DAX in WorkflowSim) and other metadata information such as file size from Workflow Generator. Workflow Mapper creates a list of tasks and assigns these tasks to an execution site. A task is a program/activity that a user would like to execute.
Workflow Engine manages tasks based on their dependencies between tasks to assure that a task may only be released when all of its parent tasks have completed successfully. The Workflow Engine will only release free tasks to Clustering Engine.
Clustering engine merges tasks into jobs such that the scheduling overhead is reduced. A job is an atomic unit seen by the execution system, which contains multiple tasks to be executed in sequence or in parallel if applicable. Different to the clustering engine in Pegasus WMS, Clustering Engine in WorkflowSim also perform task reclustering in a faulty environment with transient failures. If there are failed tasks returned from Workflow Scheduler, they are merged again into a new job.
Workflow Scheduler is used to match jobs to a worker node based on the criteria selected by users. WorkflowSim has introduced different layers of overheads and failures based on our prior work, which improves the accuracy of simulation.
Failure Generator is introduced to inject task/job failures at each execution site during the simulation. After the execution of each job, Failure Generator randomly generates task/job failures based on the distribution and average failure rate that a user has specified.
Failure Monitor collects failure records (e.g., resource id, job id, task id) to return these records to Clustering Engine to adjust the scheduling strategies dynamically.
Overhead Modeling Based on our prior studies on workflow overheads, we add layered overhead to the workflow simulation. We have classified workflow overheads into five categories as follows. Workflow Engine Delay measures the time between when the last parent job of a job completes and the time when the job gets submitted to the local queue. Queue Delay is defined as the time between the submission of a job by the workflow engine to the local queue and the time the local scheduler sees the job running. Data Transfer Delay happens when data is transferred between nodes. Clustering Delay measures the difference between the sum of the actual task runtime and the job runtime seen by the Workflow Scheduler.
Failure Modeling WorkflowSim supports two types failures. A task failure which means a task fails while other tasks in the same job may not fail. A job failure means a job fails while all of its tasks fail. The reason why we have this classification of transient failures is that they usually have different causes. Task failure is usually generated by an error during the execution while a job failure happens during the preparation of a job. Users can specify the distribution (Weibull, Uniform, Normal and Gamma) of the failure rates.
Shared and Distributed File Systems Modern file systems are usually classified into two types. The first one is a central file system in which the data storage is shared among all the worker nodes in a data-center. The communication cost is thereby included in the execution time of a task and scheduling algorithms do not necessary consider the data transfer delay between tasks. The second is a distributed file system in which the data storage is constructed with multiple separate local data storages. The communication cost should not be ignored and data-aware scheduling algorithms can apply to this environment to optimize the data locality and the overall makespan. In either way, WorkflowSim uses a Replica Catalog to keep track of the files including their replications.
Dynamic Scheduling Algorithm While CloudSim has already supported static scheduling algorithms, we added the support of dynamic workflow algorithms in WorkflowSim. For static algorithms, jobs are assigned to a worker node at the workflow planning stage. When a job reaches the remote scheduler, it will just wait until the assigned worker node is free. For dynamic algorithms, jobs are matched to a worker node in the remote scheduler whenever a worker node becomes idle. This classification proposes new options for researchers to evaluate the dynamic performance of their algorithms.
Task Clustering Compared to CloudSim and other workflow simulators, WorkflowSim provides support of task clustering that merges tasks into a clustered job. Users can specify different criteria to optimize the overall performance. For example, in a fault environment, a long job would end up with running forever even though the overheads are reduced.
WorkflowSim has attracted a wide attention in the Grid and Cloud communities and have been widely used in multiple workflow studies in literature. Below we introduce several typical use cases of WorkflowSim.
Cloud Broker Jrad et.al. have developed a broker-based framework for running workflows in a multi-Cloud environment. They extended WorkflowSim to use the Cloud Service Broker as the scheduler instead of using an external one. In addition, a Replica Catalog keeps a list of data replicas by mapping input/output filenames to their current site locations. The data transfer is initiated by workflow tasks during their execution on the respective data-centers, whereas the Replica Catalog is managed by the Data Manager.
Balanced Task Clustering Recently we have used WorkflowSim to evaluate the dependency imbalance and runtime imbalance while performing task clustering in scientific workflows. With the support of task clustering and the modeling of data dependencies in WorkflowSim, we are able to evaluate the cause of dependency imbalance and runtime imbalance respectively and propose balanced task clustering algorithms to improve the overall performance of scientific workflows. We used traces of five widely used scientific workflows that were run on FutureGrid and Amazon EC2.
Fault Tolerant Clustering Many existing clustering strategies ignore or underestimate the impact of the occurrence of failures on system behavior, despite the increasing impact of failures in large-scale distributed systems. We have proposed fault tolerant clustering algorithms that dynamically adjusts the clustering strategy based on the current trend of failures. During the runtime, this approach estimates the failure distribution among all the resources and dynamically merges tasks into jobs of moderate size and recluster failed jobs to avoid failures. To complete this work, we relied on the generation of transient failures and reclustering techniques in WorkflowSim.
WorkflowSim commits to the open source community and welcomes your contribution
Scientific Applications widely use workflows to organize their works
Community Involvement and Research Opportunities