The Workflow Trace Archive: Open-Access Data from Public and Private Computing Infrastructures -- Technical Report

Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. We focus in this work on traces of workflows, which are common in datacenters, clouds, and HPC infrastructures. We show that the state of the art in using workflow traces raises important issues: (1) the use of realistic traces is infrequent, and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures, and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.


Introduction
Workflows, that is, applications that organize tasks with data and computational inter-dependencies, are already a significant part of private datacenter and public cloud infrastructures [21,50]. This trend is likely to intensify [25,11], as organizations and companies transition from basic to increasingly sophisticated cloud-based services. For example, 96% of companies responding to RightScale's 2018 survey are using the cloud [48], up from 86% in 2012 [28]; the average organization combines services across five public and private clouds.

Figure 1: (left) Open-source (≈15%) workflow traces in representative articles, which can affect the relevance and reproducibility of experiments for the entire community. (right) Toward an answer: the Workflow Trace Archive (WTA) stakeholders, process, and tools provide the community with open-source traces of relevant workflows running in public and private computing infrastructures.
To maintain, tune, and develop the computing infrastructures for running workflows at the massive scale and with the diversity suggested by these trends, the systems community requires adequate capabilities for testing and experimentation. Although the community is aware that workload traces enable a broad class of realistic, relevant, and reproducible experiments, currently such traces are infrequently used, as we summarize in Figure 1 (left) and quantify in Section 2. Toward addressing this problem, we focus on improving trace availability and understanding by proposing a new, free and open-access WTA, as detailed in Figure 1 (right) and in the remainder of this work.
The need for workflow traces is stringent [24,11]. Not only has the sheer volume of workloads increased significantly over time [50], but the users of datacenter and cloud operations also expect increasingly better Quality of Service (QoS) from workflow-management systems, including elasticity, reliability, and low cost, under strong assumptions of validation [24,11] and reproducibility [25,18]. Developing workflow management systems to meet these requirements requires considerable scientific and technical advances and, correspondingly, comprehensive trace-based experimentation and testing, both in vivo and in silico [12]. Testing such systems, especially at cluster and datacenter scale, often cannot be done in vivo, due to the downtime or operational costs involved. Instead, workflow traces can be replayed in silico, allowing multiple setups to run in parallel, testing individual components, etc., without the downtime or costs. Although realistic workflow traces are key for testing, tuning, validating, and inspiring system designs, they are currently still scarce [10]. Prior work, such as WorkflowHub [42], has introduced numerous workflow traces, yet only from the science domain. As Figure 1 (left) indicates, and Section 2 quantifies and explains, less than 40% of relevant articles focusing on workflow systems conduct experiments with realistic traces, and less than 15% conduct experiments with realistic and open-source traces.
The current scarcity of traces forces researchers to either use synthetically generated workloads or to use one of the few available traces. Synthetic traces may reduce the representativeness and quality of experiments if they do not match relevant real-world settings. Using realistic traces that correspond to a narrow application domain may result in overfitting; Amvrosiadis et al. [4] demonstrate this for cluster-based infrastructures. Additionally, a lack of realistic traces may lead to limited or even wrong understanding of workflow characteristics, their performance, and their usage, which hampers the reuse of the systems tested with such (workloads of) workflows [35]. This gives rise to the research question RQ-1: How diverse are the workflow traces currently used by the systems community?
We identify the need to share workflow traces collected from relevant environments running relevant workloads under relevant constraints. Effective sharing requires unified trace formats, and also support for emerging and new features. For example, since the introduction of commercial clouds, clients have increasingly started to ask for better QoS, and in particular have started to increasingly express non-functional requirements (NFRs) such as availability, privacy, and security demands in traces [11,8]. This leads us to research question RQ-2: How to support sharing workflow traces in a common, unified format? How to support arbitrary NFRs in it?
Persuading both academia and industry to release data is vital to address the problems stated above. We tackle this issue with two main approaches: first, by offering tools to obscure sensitive information while still retaining significant detail in shared traces; second, by encouraging the same organization to share data across its possibly multiple workflow management systems (sources), and by explicitly aiming to collect data across diverse application domains and fields. The availability of diverse data and tools amplifies the benefits of making such traces available, while simultaneously reducing the concerns of competitive disadvantage or of an (accidental) disclosure of sensitive information. The community is already helping with both approaches, by increasingly focusing on the problem of reproducibility. For example, ACM introduced artifact review and badges to stimulate open-access software and data artifacts for reproducibility and verification purposes [38]. We add to this community effort our own, which is scientific in nature: RQ-3: What is the impact of the source and domain of a trace on the characteristics of workflows?
Addressing research questions 1-3, our contribution is four-fold: 1. To answer RQ-1, we conduct the first comprehensive survey of how the systems community uses workflow traces (Section 2). We collect, select, and label articles from top conferences and journals covering workflow management. We analyze the types of traces used in the community, and the domains and fields covered in published studies.
2. To answer RQ-2, we design the WTA for open-access to traces of workloads of workflows (Section 3). We identify a comprehensive set of requirements for a workflow trace archive. A key conceptual contribution of the WTA is the design of a unified trace format for sharing workflows, the first to generalize NFR support at both workflow- and task-levels. The WTA currently archives a diverse set of (1) real workflow traces collected from real-world environments, (2) realistic workflow traces used in peer-reviewed publications, and (3) workflow traces collected from simulated and emulated environments commonly used by the systems community. The WTA also introduces tools to detail and compare its traces.
3. To address RQ-3, we compare key workload characteristics across traces, domains, and sources (Section 4). Our effort is the first to characterize the new trace from Alibaba, and the first to investigate the critical path task length, level of parallelism, and burstiness using the Hurst exponent on workloads of workflows. Overall, the archive comprises 95 traces, featuring more than 48 million workflows consuming over 2 billion CPU core hours.
4. To validate our answers to RQs 1-3, we analyze various threats (Section 5). We conduct a trace-based simulation study and a qualitative analysis.
Our results for the former indicate that systems should be tested with different traces to validate claims about the generality of the proposed approach.

A Survey of Workflow Trace Usage
To assess the current usage of workflow traces in the systems community and the need for a workflow archive, we systematically survey a large body of work published in top conferences and journals, and investigate articles that perform experiments using workflow traces, either through simulation or using a real-world setup. The process and outcome of this survey answer RQ-1.

Article Selection and Labeling
Selection: Figure 2 displays our systematic approach to select articles relevant to this survey, based on [29]. First, we collect data from DBLP [33] and Semantic Scholar [2]. We filter them by venue, retaining only articles from the 10 key conferences and journals in distributed systems listed in the caption of Figure 2. The resulting 104 papers have been cited 3,965 times.
Labeling: We label for each of the 104 articles the type of trace usage. For articles explicitly describing their use, we use the label realistic for traces derived from real-world workflow executions. We label the others as synthetic. We further label traces as open-access (or open-source) if they are available online and to a broad audience, and closed-access (or closed-source) otherwise. In our analysis, we include among the open-access traces only those that are also realistic.
We also label articles by domain and field. We identify in articles explicit use of traces from the domains "scientific", "engineering", "multimedia", "governmental", and "industry", and from fields such as "bioinformatics", "astronomy", "physics", etc. We further label a trace as uncategorized when its origin remains unexplained.

Types of Traces Used in the Community
We analyze here the types of traces used by the community, with the following Observations (Os): O-1: Less than 40% of articles use realistic traces.

O-2: Only one-seventh of all articles use open-access traces.
Table 1 presents the types of traces used in the community, focusing on realistic (R) and open-access (R+O) traces. The community uses traces for experiments across both conference and journal articles, across various levels of (high) quality. In contrast to this positive finding, the results indicate that realistic and especially open-access traces remain infrequently used (O-1 and O-2). These findings match the perceived difficulty in reproducing studies in the field [18,12], and may hint at why so few of these seemingly successful designs are adopted for use in practice [40].

Workflow Domains and Fields
We analyze the domains and fields from which the community sources workflows, with as main observations: O-3: The community sources workflows from 5+ domains and 25+ fields.
O-4: Traces containing scientific workflows are used significantly more (20x) than workflows from other domains, e.g., industry and engineering, in the surveyed articles.
O-5: Bioinformatics workflows are the most commonly used, but three other fields exhibit usage within a factor of 3.
O-6: Many traces have uncategorized domain and/or field.
Overall, we find that the community uses diverse workflows, sourced from 6 domains and 28 fields. We further investigate the distribution of use, per domain and per field. Figure 3 (top) shows that the scientific domain is over-represented among the top-five trace domains encountered, due to the large number of available open-access traces and their conventional use in prior work. In particular, a large portion of the articles use workflow traces from the Pegasus project, which covers the scientific domain. The number of traces in this domain exceeds 200, which is larger than the number of articles in the study, as each article uses multiple traces. In contrast, the next-largest domains are industry and engineering, each with fewer than 10 traces, representing less than one-twentieth of the scientific domain.
We remark that the positive diversity of workflow domains used across the entire community is tempered by the extreme focus on scientific workflows. This confirms the bias demonstrated by Amvrosiadis et al. [3] with the popular Google cluster traces. A similar situation appears for fields, but more tempered, as Figure 3 (bottom) indicates.
Overall, the results reveal that the community has a strong bias for one domain (scientific) and favors scientific fields (especially bioinformatics). We conjecture the large amount of open-access data from the scientific domain is a main cause of this bias.

Workflow Size in Open-and Closed-access Traces
We focus on open- and closed-access traces: is there structural or size-related inequality? Figure 4a shows that most articles use up to 10 workflows for experiments, with an average of 11 workflows per article and some extreme outliers. The distributions for open- and closed-access traces are similarly shaped, yet closed-access traces can include over an order of magnitude more workflows.
For the smallest workflow size, Figure 4b indicates that about half of the articles use workflows with 30 tasks or fewer. The range of workflow sizes in open-access traces is much smaller than for closed-access traces; in particular, articles using closed-access traces also report many workflows with 10 or fewer tasks.
For the largest workflow size, half of the experiments do not exceed 1,000 tasks, matching the prior characterization of Juve et al. [27]. Although closer than for the smallest workflow size, the open- and closed-access traces again exhibit different ranges for their largest workflow sizes.

The Workflow Trace Archive
In this section we outline the design of the WTA and its unified trace format, describe tools that support consumers in selecting traces according to their use case, and give a summarized overview of the current contents of the archive. Furthermore, to facilitate the continuous growth of the archive, we provide tools for trace anonymization and a collection of trace-parsing scripts for different trace sources.

Requirements
We identify five key requirements for the structure, content, and operation of a useful archive for workflow traces. R-1: Diverse Traces for Academia, Industry, and Education. The main observations in Section 2.3 give strong evidence that the archive must include a diverse set of traces, corresponding to the many existing workflow domains and fields. Moreover, the archive must include traces that cover a broad spectrum of workflow sizes, structures, and other characteristics, including both characteristics general to many domains and fields, and idiosyncratic characteristics corresponding to only one domain or field.
Addressing this requirement is important. For academia, experimenting with representative traces demonstrates the applicability of an approach and increases credibility. For industry, representative traces are essential to test production-ready systems, and can act as validation for techniques proposed by academia [30]. Furthermore, as systems become more complex, education and training on these topics become ever more important. R-2: A Unified Format for Traces of Workloads of Workflows.
To simplify trace exchange, reduce trace-integration effort among different systems, prevent duplicated work by other users, and provide dataset-independent tools (expressed as R-3), the archive must define a unified trace format. The format must cover a broad set of data about the workloads and about the workflow management systems, including: workload metadata; workflow-level data including NFRs and common metadata; task-level data including per-task NFRs and operational metadata such as task-state changes; inter-dependencies between tasks and other operational elements such as data transfers; system-level information including resource provisioning, allocation, and consumption; etc.
Addressing this requirement is the basis of any data archive. A unified trace format for workflow workloads helps provide long-term provenance information [6] to improve the reproducibility of experimental results, which is an ongoing grand challenge in fields such as bioinformatics.

R-3: Detailed Insights into Trace Properties.
Just archiving data is not enough; every archive must provide insight into its traces at the level of detail required by the broad audience implied by R-1, from beginner to expert. Broad insights, easily provided in standardized reports, include extrinsic properties such as the number of workflows, tasks, and users in the trace, while intrinsic properties include the workflow arrival patterns and the resource consumption per task. Detailed insights include expert-level analysis of single traces at workload-, workflow-, and system-level, and collective analysis across all traces and across traces with a shared feature (e.g., all traces of a domain or field). These properties must be accessible through readily available tools (see R-4) and, possibly, through interactive online reports. Additionally, practitioners can correlate information across multiple traces, resulting in better quantitative evidence, intuition about otherwise black-box applications, and understanding that helps avoid common pitfalls [17]. R-4: Tools for Trace Access, Parsing, Analysis, and Validation.
The most important tool is the online presence of the archive itself. The archive must further provide tools to parse traces from different sources into the unified format (see also R-2), to provide insight into traces (see also R-3), and to validate common properties (e.g., the presence and correctness of properties). Without such tools, users would be unable to select appropriate traces, validate their properties, and compare them.
The archive should further aid users in building more sophisticated tools. Newly built tools can then be added to the selection of tools so that more parties can make use of them (contributing to R-5). R-5: Methods for Contribution.
The archive must reflect the continuous evolution of workflow use in practice, by increasing the coverage of different scenarios. We make a distinction between two types of contribution: (1) additional traces, possibly originating from a new domain or application field, and (2) additional traces introducing new properties. To facilitate the former contribution, the archive must provide a method for the upload and (basic) automated verification of traces. To facilitate the latter, the format must integrate specific provisions that enable upgrades and long-term maintainability, such as adding a version to each component of the format.
Having methods to contribute new workloads is key to encouraging new and existing contributors to submit traces. Tools to add new domains are of particular importance, to support emerging paradigms with realistic data.

Overview of the WTA
We design the WTA as a process and set of tools helping a diverse set of stakeholders. We consider three roles for the WTA community members, outlined in Figure 1. The contributor supplies, as the legal owner or representative, one or more traces to the WTA. A workflow trace contains historical task execution data, resource usage, NFRs, resource descriptions, inputs and outputs, etc. To fulfill R-5, the WTA team assists the contributor in parsing, anonymizing, and converting the traces into the unified format (Section 3.4), minimizing the risk of competitive disadvantage, and verifying their integrity. The WTA fulfills R-1 as it incrementally expands with contributions of traces from different domains with different properties.
The user represents non-expert or expert trace consumers. Non-expert users often need to rely on generic domain or trace properties, whereas expert users have detailed knowledge of their system and require fine-grained details for selecting the correct trace. In addition, expert users may comment on (missing) properties and may develop new tools, models, or other techniques to further compare and rank the traces. Both user types require assistance in selecting the most suitable trace given a set of criteria (Section 3.5), as well as analysis and validation (Section 3.6) of the available set of traces (Section 3.7). To support both user types, the WTA discloses both high-level and low-level details.

Workflow Model
There are numerous types of workflow models used across different communities. A 2018 study by Versluis et al. finds that directed acyclic graphs (DAGs) are the most commonly used formalism in computer systems conferences [47]. Therefore, for the first design of this archive, we adopt DAGs as the workflow model.
A workflow is constructed as a DAG in which nodes are computational tasks and directed edges depict the computational or data constraints between tasks. Entry tasks are tasks with no incoming dependencies; once submitted to the system, they are immediately eligible for execution. Similarly, end tasks are nodes that have no outgoing edges. A collection of workflows submitted to the same infrastructure over a certain period of time is considered a workload.
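A minimal sketch of this model, using an illustrative child-list representation rather than the WTA on-disk format, identifies the entry and end tasks of a workflow as follows:

```python
def entry_and_end_tasks(children):
    """Return (entry_tasks, end_tasks) of a workflow DAG.

    children: mapping from task id to the ids of its child tasks
    (illustrative representation, not the WTA trace format).
    """
    all_tasks = set(children)
    has_parent = {c for kids in children.values() for c in kids}
    entry = sorted(all_tasks - has_parent)  # no incoming dependencies
    end = sorted(t for t, kids in children.items() if not kids)  # no outgoing edges
    return entry, end

# A small diamond-shaped workflow: a -> b, a -> c, b -> d, c -> d.
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(entry_and_end_tasks(dag))  # (['a'], ['d'])
```

Here "a" is the single entry task (eligible immediately on submission) and "d" the single end task.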
Although popular, we specifically do not focus in this work on BPMN and BPEL, Petri nets, hypergraphs, general undirected graphs, or cyclic graphs. These formalisms either include business and human-in-the-loop elements [1] or add complexity due to a large set of control structures such as loops, conditions, etc. [44], which we consider out of scope for this work.

Unified Trace Format
Creating a unified format (R-2) requires from the designer a careful balance between limiting the number of recorded fields and supporting a diverse set of usage scenarios for all stakeholders in Section 3.2. Modern logging and tracing infrastructure can capture thousands of metrics for each machine and workflow task involved [51], from which the designer must select. We specifically envision support for common system and workflow properties found in the typical scenarios considered in the top venues surveyed in Section 2, such as engineering a workflow engine [52], characterizing the properties of workloads of workflows [27], and designing and tuning new workflow schedulers [45].
Our unified format attempts to cover different trace domains, while preserving valuable information, such as resource consumption and NFRs, contributing to fulfilling R-1 and R-3. By analyzing the raw data formats, we carefully selected useful properties to include in our unified format, omitting low-level details such as cycles per instruction, page cache sizes, etc.
Answering RQ-2 and fulfilling R-2, our trace format is the first to support arbitrary NFRs at both task and workflow levels. For example, one of the LANL traces (introduced in Table 3) contains deadlines per workflow and the Google cluster data features task priorities; both are supported by the WTA unified format. Capturing these properties is important to test QoS-aware schedulers.
As depicted in Figure 5, the WTA format includes seven objects: Workload, Workflow, Task, TaskState, Resource, ResourceState, and DataTransfer. Each of these objects contains a version field, updated whenever the set of properties is altered (R-5).
Each trace is a single workload, consisting of multiple workflows and their arrival process. Workload properties include the number of workflows, tasks, users, and resources used (e.g., CPUs or cores), the domain and field when available, the authors list, start and end date, and overall resource consumption statistics. The workload description details the origin of the workload and other relevant information, including but not limited to its source, execution environment, non-functional requirements, etc.
Each workflow in the workload has a unique identifier, an arrival time, and contains a set of tasks and several properties, including scheduler used, number of tasks, critical path length, NFRs, and resource consumption. Each workflow also has the name of its domain of study, when possible.
The workflow structure is a DAG of tasks with control and data dependencies between tasks. Each Task has a unique identifier and lists its submission and waiting time, runtime and resource requirements, including required (compute) resources, memory, disk, network, and energy usage. Additionally, each task provides optional dictionaries for task-specific execution parameters and NFRs. To model dependencies between tasks, the WTA format maintains for each task a list of its predecessor (parent) and successor (child) tasks. Similarly, data dependencies are recorded as a list of data transfers.
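As an illustration of how the critical-path length recorded per workflow can be derived from per-task runtimes and parent lists, the following sketch processes tasks in topological order; the dictionary-based representation and names are our own illustration, not the WTA schema:

```python
from collections import deque

def critical_path_length(runtimes, parents):
    """Longest cumulative runtime along any chain of dependent tasks.

    runtimes: task id -> runtime; parents: task id -> list of parent ids.
    """
    children = {t: [] for t in runtimes}
    indegree = {t: len(parents.get(t, [])) for t in runtimes}
    for t, ps in parents.items():
        for p in ps:
            children[p].append(t)
    # Kahn's algorithm: visit tasks in topological order, propagating the
    # longest finish time over any path ending at each task.
    finish = {}
    ready = deque(t for t, d in indegree.items() if d == 0)
    while ready:
        t = ready.popleft()
        finish[t] = runtimes[t] + max((finish[p] for p in parents.get(t, [])), default=0)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return max(finish.values())

# Diamond workflow: a feeds b and c, which both feed d.
rt = {"a": 2, "b": 5, "c": 3, "d": 1}
par = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(critical_path_length(rt, par))  # 2 + 5 + 1 = 8
```

The critical path here is a -> b -> d, with cumulative runtime 8.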
Resource objects cover various resource types, such as cloud instances, cluster nodes, and IoT devices. A resource has a unique identifier and contains several properties, such as resource type (e.g., CPU, GPU, threads), number, processor model, memory, disk space, and operating system. An optional dictionary provides further details, such as instance type or cloud provider. The ResourceState event periodically snapshots the resource state, including availability and utilization. Analogous to the ResourceState, the TaskState periodically records the resource consumption of the task (the Task object records the resource demand). Different from the ResourceState, the TaskState snapshot contains averages over a period of time. This is because the system generally allocates resources to tasks, whereas the actual consumption of a task may vary over time.
Each DataTransfer describes a file transfer from a source to a destination task, which can be a local copy on the same resource, a network transfer from a remote source, etc. To support bandwidth analysis, a data transfer records its submission time, transfer time, and data size. Each data transfer also provides an optional dictionary with detailed event timestamps (e.g., pause, retry).
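As a sketch of how such objects might look in code, the following Python dataclasses mirror a small, illustrative subset of the Task and DataTransfer objects; the concrete field names and types are our assumptions, not the normative schema:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    """Illustrative subset of the WTA Task object."""
    version: str            # schema version of this object (supports R-5)
    id: int
    submission_time: int    # e.g., milliseconds since workload start
    runtime: int
    parents: List[int] = field(default_factory=list)    # predecessor tasks
    children: List[int] = field(default_factory=list)   # successor tasks
    nfrs: Dict[str, str] = field(default_factory=dict)  # arbitrary per-task NFRs
    params: Dict[str, str] = field(default_factory=dict)  # execution parameters

@dataclass
class DataTransfer:
    """Illustrative subset of the WTA DataTransfer object."""
    version: str
    source_task: int
    destination_task: int
    submission_time: int
    transfer_time: int
    size_bytes: int

# A task with a per-task NFR, e.g., a deadline, stored in the NFR dictionary.
t = Task(version="1.0", id=1, submission_time=0, runtime=120,
         nfrs={"deadline": "300"})
print(t.nfrs["deadline"])
```

The per-object version field is what allows individual components of the format to evolve independently.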

Mechanisms for Trace Selection
We address R-3 by assisting archive users in retrieving appropriate traces for their scenarios, using filter and selection mechanisms. The website is the most important such mechanism, containing an overview of all traces in a general table, visible in Figure 6, with the number of workflows, tasks, users, etc. This table is sortable and searchable, allowing website users to interact with the more than 90 traces currently in the WTA (column "#WL", row "Total" in Table 3).
We provide, online and as separate tools, a detailed report for each trace. Each report includes automatically generated statistics, such as the number of workflows and tasks, resource properties such as compute, memory, and IO, and job and task arrival-time and runtime distributions (see Section 4). The metrics featured in the report are reported as important by prior studies [43,49] and enable developers to select traces appropriate for their intended use case.
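To illustrate, the arrival-related portion of such a report can be computed from a workload's workflow arrival times alone; the function and field names below are our own sketch, not the WTA tooling's API:

```python
from statistics import mean, median

def arrival_report(arrival_times):
    """Summarize the workflow arrival process of a workload.

    arrival_times: sorted arrival timestamps (e.g., in seconds).
    Returns the workflow count, workload span, and inter-arrival statistics.
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return {
        "num_workflows": len(arrival_times),
        "span": arrival_times[-1] - arrival_times[0],
        "mean_interarrival": mean(gaps),
        "median_interarrival": median(gaps),
    }

# A small bursty pattern: three workflows close together, then a long pause.
print(arrival_report([0, 1, 2, 60]))
```

A mean inter-arrival time far above the median, as in this example, is a simple indicator of bursty arrivals.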

Tools for Analysis and Validation
We implement the unified trace format using the Parquet file format and the Snappy compression algorithm. Parquet is a binary file format supported by many big data tools, such as Apache Spark, Flink, Druid, and Hadoop [46]. Many programming languages also have libraries to parse this format, such as PyArrow for Python and parquet-avro for Java. Snappy compression (https://github.com/google/snappy) reduces the size of the dataset significantly and has low CPU usage during extraction.
Besides trace-selection support, and to address R-4, the WTA offers several tools to facilitate and incentivize the continuous growth of the archive. Most of these tools required significant engineering effort to develop, due to the typical challenges of big data processing (high volume, noisy data, diverse input formats, etc.). The WTA simplifies the upload of new traces by providing a set of parsing scripts for different trace sources, such as Google, Pegasus, and Alibaba. Parsing traces can become non-trivial once they grow in complexity and size. Such traces require big data tools, such as Apache Spark, and sufficient resources, e.g., a cluster, to compute. Noisy data raises another non-trivial issue: both Google's and Alibaba's cluster data contained anomalous fields, undocumented attributes, and non-DAG workflows. Some of these issues had not been discovered by their respective communities and were corrected in our parsing tools. Debugging, filtering, and correcting noisy big data requires significant compute power and detailed engineering.
Because traces may contain sensitive information, the WTA offers a trace anonymization tool, which helps users automatically replace privacy- and security-related information, to avoid an accidental reveal of proprietary information. Specifically, to remove sensitive information from trace files, we use two common techniques [37]: culling and transforming. Culling is done during trace conversion, by omitting parts of the raw trace data which do not match our workflow trace format. For the transformation, as presented in Table 2, our anonymization tool automatically scans the workflow trace file for sensitive data, such as IP addresses, file paths, names, etc., by string pattern matching. Besides these standard sensitive-data checks, the WTA offers the option to search for custom privacy-critical strings.
Finally, all matched strings are replaced by a salted SHA-256 hash key. This approach, using cryptographic hash functions, offers protection of sensitive data while preserving the relationships between the matched values in the same trace file [37]. Additionally, our tool hides potential relations to other trace files by adding a salt of length 16 to the hash-key generation, randomly generated on each tool run.
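The combination of pattern matching and salted hashing described above can be sketched as follows; the single IPv4 pattern and the 16-character digest truncation are our own illustrative simplifications of the actual tool, which covers the additional patterns of Table 2:

```python
import hashlib
import re
import secrets

# Illustrative pattern: IPv4 addresses only; the real tool also matches
# file paths, user names, and custom privacy-critical strings.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text, salt):
    """Replace each matched sensitive string with a salted SHA-256 digest.

    Using one salt per run preserves relationships within a trace file
    (equal values map to equal digests), while a fresh salt on each run
    hides relations to other trace files anonymized separately.
    """
    def repl(match):
        return hashlib.sha256(salt + match.group(0).encode()).hexdigest()[:16]
    return IP_PATTERN.sub(repl, text)

salt = secrets.token_bytes(16)  # fresh 16-byte salt on each tool run
line = "task 42 ran on 10.0.0.5, output copied to 10.0.0.5"
out = anonymize(line, salt)
# Both occurrences of 10.0.0.5 map to the same digest within this run.
```

Re-running with a new salt produces different digests for the same values, which is exactly the cross-file unlinkability property described above.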
To validate traces, the WTA provides a validation script that checks the integrity and summarizes important characteristics of a trace. During trace conversion, using the validation script, we successfully identified several parse bugs and inconsistencies in the data that we subsequently corrected.
Specifically, because tasks form the base of each trace, our tool checks whether all contained tasks are well defined. This means, for example, that all parsed control dependencies, such as children and parents, link only to existing tasks with valid properties. A task property is valid if the parsed property type matches the property-type definition and the property value is allowed, e.g., task runtime > 0. Based on and similar to this fundamental validation, our tool provides options to check the workflow and data-transfer properties to identify inconsistencies as well.
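A minimal sketch of such a task-level integrity check, under the assumption that a trace's tasks are represented as dictionaries with id, runtime, parents, and children fields (illustrative names, not the validation script's actual interface):

```python
def validate_tasks(tasks):
    """Return a list of human-readable integrity violations (empty if valid).

    tasks: list of dicts with 'id', 'runtime', 'parents', and 'children'.
    """
    errors = []
    ids = {t["id"] for t in tasks}
    for t in tasks:
        if t["runtime"] <= 0:  # property-value check, e.g., runtime > 0
            errors.append(f"task {t['id']}: non-positive runtime")
        for rel in ("parents", "children"):
            for other in t[rel]:
                if other not in ids:  # dependencies must link to existing tasks
                    errors.append(f"task {t['id']}: unknown {rel[:-1]} {other}")
    return errors

tasks = [
    {"id": 1, "runtime": 10, "parents": [], "children": [2]},
    {"id": 2, "runtime": -3, "parents": [1, 99], "children": []},  # two violations
]
print(validate_tasks(tasks))
```

Here task 2 triggers both checks: a non-positive runtime and a parent link (99) to a non-existent task.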
These tools help combat perceived barriers to sharing data described by Sayogo et al. [39]. Several technological barriers are addressed by using a unified format and validation (data architecture, quality, and standardization). Legal and policy barriers are more difficult to address. Our anonymization tool aids in overcoming the data-protection barrier, yet legal and other enforced policies may require tailored solutions.
Besides offering these tools, the WTA also hosts the trace data, addressing logistic and economic barriers. The increasing focus on sharing data artifacts by the community is lowering the barrier regarding competition for merit and reputation for quality, and bolsters the culture of open sharing. Finally, each trace has its own DOI through an additional upload to Zenodo (https://zenodo.org/), which can be cited and thus provides authors with the appropriate credit (incentive barrier).

Current Content
Having a diverse set of traces available is necessary for experimentation; different traces should be used to demonstrate the generality of a proposed approach (see Section 5). Gathering and parsing raw logs and other traces requires significant computing effort. Using a commodity cluster of 16 nodes (32 eight-core Xeon E5-2630 v3 CPUs and 1 TB RAM in total), several traces require up to a day to compute using big data tools such as Apache Spark. In total, the WTA team spent more than two person-months on converting traces to the unified trace format.
The WTA currently features 96 workloads from 10 different sources, with over 48 million workflows and 2 billion CPU core hours. Each workload is uniquely identified by a combination of the following properties, if available: source, runtime environment, application, and application parameters [31]. Tables 3 and 4 summarize these traces. From these tables we observe that the WTA contains a broad variety of traces, from different sources and domains, with varying numbers of workflows, properties, numbers of tasks, timespans, and core-hour counts. Although supported by our format, no trace currently includes information on energy consumption, highlighting the need for such traces [42]. These traces have been collected over the years by combining open-access data (logs, traces, etc.) and closed-access data, in collaboration with both industry and academia. This contributes to R-1.
This diversity enables new workflow management techniques and systems to be thoroughly tested for their feasibility, strengths, and, equally important, weaknesses.
An example of the utility of diverse traces arises in emerging fields. Serverless computing is a new and emerging field, so no serverless traces are available yet. One of our collaborators used the Shell (IoT) workload, which features small and short-running workflows, much like serverless functions, to experiment with a serverless workflow engine. Additionally, these traces yield further insights regarding task and job runtimes, sizes, arrival patterns, resource consumption, etc.
We encourage the community to continuously contribute traces to the archive, to improve the coverage of domains, to keep the archive representative of the jobs executed in the real world by both academia and industry, to offer more artifacts with which to demonstrate the robust performance of, e.g., scheduling policies, and to offer more diversity in trace statistics.

A Characterization of Workloads of Workflows
To answer RQ-3, we perform in this section a characterization of the workloads in the WTA. We characterize workloads using a variety of metrics and properties, including workflow size, resource usage, and structural patterns. Our characterization reveals significant differences between workloads from different domains and sources. Such differences further support our claim that the community needs to look beyond just scientific workloads, and consider a wider range of domains and sources for experimental studies.

Structural Patterns
O-9: Scientific, industrial, and engineering workflows exhibit various structural patterns, but at least 60% of tasks in a domain match the dominant pattern of that domain.
O-10: Industry workflows stand out by exhibiting primarily scatter patterns, as opposed to pipeline operations.
This characterization quantifies five structural patterns in workflows often used by researchers [7]: scatter (data distribution), shuffle (data redistribution), gather (data aggregation), pipeline, and standalone (process). Investigating these structural patterns is important to understand the types of applications being executed and to tune a system's performance. We exclude from this analysis the LANL, Two Sigma, and Google traces, which lack structural information, that is, task parent-child relationship information. Figure 7 depicts the structural patterns found per domain. From this figure, we observe that in each domain a dominant pattern emerges that accounts for 61-85% of tasks. In the scientific and engineering domains, the majority of tasks are simple pipelines. Interestingly, the industrial workflows consist primarily of scatter operations. This observation matches known properties of the Alibaba trace, which accounts for over 99% of the tasks with structural information we analyzed in this domain. In particular, the Alibaba trace includes MapReduce jobs, each consisting of many "map" tasks (scatter operations) and a smaller number of "reduce" tasks (gather operations).

Arrival Patterns
To investigate the weekly trends that may appear in workload traces, we depict in Figure 8, for several traces, the average number of tasks that arrive per day of the week. We omit the Askalon New source from this plot, as its traces contain only 4 or 5 data points, too few to plot a trend. We observe that the traces have significantly different arrival rates and patterns. The Alibaba trace features the highest task arrival rates, peaking at over 10,000,000 tasks per hour. The Google and Two Sigma workloads follow, with 10-10,000 tasks per hour. This shows that the industrial workloads included in this work have significantly more tasks per hour than the other compute environments, which agrees with companies such as Alibaba and Google operating at a global scale. Also interesting to note is that, besides Askalon Old 1, the non-industrial traces show significant fluctuations throughout the week, whereas Alibaba and Google do not. This might be due to the global, around-the-clock operation of Alibaba's and Google's services, which can lead to a more stable task arrival rate.

To observe differences in daily trends, we depict the average task arrival rate per hour of day in Figure 9. This figure reaffirms our observation that the two largest traces, Alibaba and Google, as well as the Askalon Old 1 trace, have a relatively stable arrival pattern throughout the day. In contrast, the Two Sigma 1 trace exhibits a typical office-hours pattern; task arrival rates increase around hour 7 and start dropping around hour 17. The same pattern occurs, to a lesser extent, in the Two Sigma 2 trace. The highly variable arrival rates of tasks in the LANL traces, as observed in Figure 8, are also evident in our analysis of daily trends. We study this in more depth in Section 4.6.

Level of Parallelism
With the structural patterns observed, we investigate whether the frequent occurrence of pipeline (pass-through) patterns translates into a high level of parallelism. The level of parallelism indicates how many tasks of a given workflow can maximally run in parallel, provided sufficient resources. Figure 10 depicts the level of parallelism per domain. Industrial workflows exhibit the highest 99th-percentile parallelism, up to hundreds of thousands of tasks. This is likely a consequence of the many MapReduce workflows, which are highly parallel by nature, present in the Alibaba trace. Alibaba also contains bag-of-tasks workflows, which by nature have a high parallelism. Scientific workflows exhibit low median parallelism but high 99th-percentile parallelism, featuring levels of parallelism up to thousands of tasks. Engineering traces exhibit a moderate median parallelism, between industry and scientific, with at most 1,000 concurrently running tasks. Figure 11 shows the level of parallelism per source. From this figure, we observe that Alibaba exhibits the highest levels of parallelism, as discussed previously. Second are the Pegasus and WorkflowHub workflows. These sources contain a variety of scientific applications, commonly known for their parallel structures, as observed in Section 4.1. Other traces demonstrate less parallelism, with up to 100 concurrently running tasks.
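The level of parallelism can be approximated from a trace's parent-child relationships. A minimal sketch follows, assuming tasks execute level by level (the input layout is illustrative, not the WTA schema):

```python
from collections import defaultdict
from functools import lru_cache

def level_of_parallelism(parents):
    """Approximate a workflow's level of parallelism as the width of its
    widest depth level, where a task's depth is the length of the longest
    chain of ancestors above it. `parents` maps task id -> list of parent ids.
    """
    @lru_cache(maxsize=None)
    def depth(tid):
        ps = parents[tid]
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    # Count how many tasks sit at each depth level; the widest level
    # bounds how many tasks can run concurrently (given enough resources).
    width = defaultdict(int)
    for tid in parents:
        width[depth(tid)] += 1
    return max(width.values())
```

For a MapReduce-style workflow with many "map" tasks feeding one "reduce" task, this returns the number of map tasks; for a sequential pipeline it returns 1.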
As the Shell trace consists entirely of sequential pipelines, its workflows do not exhibit any variation in parallelism.

Critical Path Length
O-19: Although highly parallel, industrial workflows exhibit longer critical paths than scientific workflows.
The critical path (CP) refers to the longest sequence of dependent tasks in a workflow, from any entry task to any exit task. By quantifying the CP length, we investigate whether workflow runtimes are dominated primarily by a few heavy tasks or by many small tasks. Figure 12 presents the results of this characterization per workload domain. From this figure we observe that the CP length of engineering workflows is the highest, which matches the parallelism observations in Sections 4.1 and 4.3. Interestingly, even though industrial workflows are often highly parallel, their critical paths are often longer than those of scientific workflows. This indicates that industrial workflows are bigger than scientific workflows, which our data supports. Figure 13 presents the results of the CP characterization per workload source. From this figure we observe that the CP length differs significantly per trace. Based on the prior findings, the engineering traces are expected to show longer critical paths; indeed, the Askalon Old traces contain the workflows with the longest critical paths. Alibaba workflows also exhibit long critical paths, indicating that their workflows, besides being highly parallel, also contain many stages of tasks. The other traces exhibit lower and more concentrated critical path lengths, yet remain clearly distinct from each other. As the Shell trace contains solely sequential workflows, its critical path length is one.
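Both the CP length and the runtime-weighted CP (used in the next subsection) can be computed with a simple dynamic program over the workflow DAG. A sketch, with an illustrative input layout:

```python
from functools import lru_cache

def critical_path(runtimes, parents):
    """Compute the critical-path length (task count) and critical-path
    runtime (sum of task runtimes along the heaviest chain) of a DAG.
    `runtimes` maps task id -> runtime (seconds);
    `parents` maps task id -> list of parent ids.
    """
    @lru_cache(maxsize=None)
    def chain_len(tid):
        # Longest chain (in tasks) ending at `tid`, inclusive.
        ps = parents[tid]
        return 1 + (max(map(chain_len, ps)) if ps else 0)

    @lru_cache(maxsize=None)
    def chain_rt(tid):
        # Heaviest chain (in summed runtime) ending at `tid`, inclusive.
        ps = parents[tid]
        return runtimes[tid] + (max(map(chain_rt, ps)) if ps else 0.0)

    return (max(chain_len(t) for t in parents),
            max(chain_rt(t) for t in parents))
```

Note that the longest chain by task count and the heaviest chain by runtime may be different paths; the two maxima are therefore computed independently.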

Critical Path Runtime
O-20: Engineering workflows have the highest critical path runtime.
O-21: Although highly parallel, industrial workflows exhibit longer runtimes than scientific workflows.
The critical path runtime is the sum of the runtimes of all tasks on the critical path. It is the minimum amount of time required to run a workflow on a given infrastructure. Figure 14 presents the results of this characterization per workload domain. We observe that the results are very similar to those obtained by characterizing the critical path length per domain, in Section 4.4. Engineering workflows have the highest critical path runtime, followed by industry and scientific workflows.

Burstiness
To investigate whether workloads exhibit bursty behavior, we use the Hurst exponent H. H quantifies the effect that previous values of a time series have on its present value. A value of H < 0.5 indicates a tendency of the series to move in the opposite direction of previous values, and thus to exhibit jittery behavior (sporadic bursts). A value of H > 0.5 indicates a tendency to move in the same direction, and thus toward well-defined peaks (sustained bursts). When H = 0.5, the series behaves like random Brownian motion.
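Our characterization scripts use the `hurst` Python library (Section A.2); the estimation it performs can be sketched with a minimal rescaled-range (R/S) estimator, applied to a 1-D series such as per-window task-arrival counts:

```python
import numpy as np

def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128)):
    """Estimate the Hurst exponent of a 1-D series via rescaled-range
    (R/S) analysis: H is the slope of log(R/S) against log(window size)."""
    series = np.asarray(series, dtype=float)
    log_w, log_rs = [], []
    for w in window_sizes:
        if w > len(series):
            continue
        rs_values = []
        # Split the series into non-overlapping windows of length w.
        for start in range(0, len(series) - w + 1, w):
            chunk = series[start:start + w]
            z = np.cumsum(chunk - chunk.mean())   # mean-adjusted cumulative sum
            r = z.max() - z.min()                 # range of the cumulative sum
            s = chunk.std()                       # standard deviation
            if s > 0:
                rs_values.append(r / s)
        if rs_values:
            log_w.append(np.log(w))
            log_rs.append(np.log(np.mean(rs_values)))
    slope, _ = np.polyfit(log_w, log_rs, 1)
    return slope
```

White noise yields H near 0.5, while a strongly trending series (a sustained burst) yields H near 1. Note that R/S estimates on short windows carry a known upward bias; the library applies corrections that this sketch omits.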
In this experiment, we inspect bursty behavior by computing the Hurst exponent for task arrivals. The results of this experiment are visible in Figure 16. From this figure, we observe that most traces exhibit bursty behavior for at least one of the small, medium, and large window sizes, and non-bursty behavior for at least one other window size. This is expected, as in most systems task arrivals vary at (sub-)second intervals. Interestingly, the LANL traces exhibit the most bursty behavior at medium window sizes. This might be due to national-laboratory workflows being submitted in waves: a wave is submitted all at once, leading to a burst, but each wave itself is processed smoothly. The workload is also stable over longer time periods, as evidenced by H ≈ 0.5 for larger windows. Finally, the two largest traces in this work, Alibaba and Google, exhibit increasingly bursty behavior for larger windows. This indicates that at larger timescales, these workloads vary more (in absolute numbers) than those of the other sources, which matches the observations in Section 4.2.

Task Interarrival Times
In this experiment, we inspect task interarrival times. Traces that lack task-level information, such as one of the LANL traces, are omitted. By quantifying task interarrival times, we gain insight into the speed at which schedulers in different environments must make decisions. Low task interarrival times translate to systems with a continuous stream of incoming tasks, possibly at a high rate. In such systems, schedulers must often make sub-second decisions to avoid delaying tasks, which could have important implications for QoS metrics such as job makespan and task turnaround time. High task interarrival times translate to systems with longer periods between the arrival of two consecutive tasks. This does not imply, however, that the number of arriving tasks is low: in situations where bags-of-tasks are submitted simultaneously, the interarrival times between their tasks are zero, yet the time between the arrival of two bags-of-tasks may be long.

Figure 17 depicts the CDF of task interarrival times per domain. From this figure we observe that almost all tasks in the industrial domain have an interarrival time of less than 10 milliseconds. This means industrial task schedulers must make decisions at the millisecond level. Interestingly, the industrial domain also exhibits the highest task interarrival times. The scientific domain shows roughly 30% of all tasks arriving within 10 milliseconds of one another, and roughly 93% within 100 milliseconds. For the engineering domain, roughly 21% of all tasks have an interarrival time of less than 10 milliseconds, around 80% less than 100 milliseconds, and 95% less than a second.

Overall, this demonstrates that all domains require scheduling operations to happen at the sub-second level, with industry having the highest need for well-performing schedulers. To observe differences per use case, Figure 18 depicts the task interarrival times per source. From this figure we observe that the Askalon (old and new) traces and LANL 1 exhibit mid-range interarrival times, as the majority of their tasks have interarrival times between 10 milliseconds and 1 second. All other sources show the majority of tasks having interarrival times below 10 milliseconds, stressing the need for high-performance schedulers. While most tasks arrive quickly after one another, there are significant outliers in the Two Sigma traces. This could indicate downtime of production systems, as the arrival pattern of Two Sigma is diurnal yet stable (see Section 4.2).
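The CDF points discussed above can be derived directly from task arrival timestamps. A sketch, with thresholds in seconds:

```python
import numpy as np

def interarrival_fractions(arrival_times, thresholds=(0.01, 0.1, 1.0)):
    """Given task arrival timestamps (seconds), return the fraction of
    interarrival times below each threshold, i.e. points on the CDF."""
    t = np.sort(np.asarray(arrival_times, dtype=float))
    gaps = np.diff(t)          # interarrival times between consecutive tasks
    return {th: float(np.mean(gaps < th)) for th in thresholds}
```

Simultaneously submitted bags-of-tasks produce zero-valued gaps, which is why a workload can combine a near-zero median interarrival time with long gaps between submission waves.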

Addressing Challenges of Validity
In this section, we discuss challenges to the validity of this work. We address these challenges through either trace-based simulation (the first) or argumentation (the others).
C-1. Trace diversity does not impact the performance of workflow schedulers. As outlined in Sections 3.7 and 4, the WTA traces are diverse. However, what is the impact of this diversity?
To demonstrate the impact of trace diversity on scheduler performance, we conduct a trace-based simulation study. We simulate workloads from five sources using two scheduler configurations, equipping the simulated scheduler with either the first-come-first-serve (FCFS) or the shortest-job-first (SJF) queue-sorting policy. For both scheduler configurations, we further use a best-fit task placement policy. We do not use a fixed resource environment, to prevent bias when sampling or scaling traces [17]. Instead, we tailor the amount of available resources for each trace to reach roughly 70% average resource utilization, based on the number of CPU (core) seconds in the trace and the trace's length. Although ambitious, 70% resource utilization is achievable in parallel HPC environments [26] and can be seen as a target for cloud environments. To evaluate the performance of each scheduler, we use three metrics commonly used to assess scheduler performance [32,15]: task response time (ReT), bounded task slowdown (BSD, using a lower bound of 1 second), and normalized workflow schedule length (NSL, the ratio between a workflow's response time and its critical path).
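The resource-sizing step can be sketched as follows; the function name is illustrative, and the 70% default is taken from the setup above:

```python
import math

def size_environment(total_core_seconds, trace_length_seconds,
                     target_utilization=0.7):
    """Choose the number of cores so that a trace reaches roughly the
    target average utilization:
        utilization = demand / (cores * trace_length)."""
    cores = total_core_seconds / (target_utilization * trace_length_seconds)
    return max(1, math.ceil(cores))
```

Sizing the environment per trace, rather than fixing one cluster size for all traces, avoids bias from sampling or scaling the traces themselves.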
We report the performance of each simulated scheduler in Table 6 per source. From this table we observe significant differences between schedulers and trace sources. In particular, we find that the relative performance of schedulers differs between trace sources. For example, SJF outperforms FCFS on the normalized schedule length metric by up to two orders of magnitude on traces from Askalon Old and Pegasus. In contrast, on traces from Askalon New and Shell, the scheduling policies perform similarly. For other metrics, these differences are present, but less pronounced. SJF performs better than FCFS on response time and slowdown for each trace source, but the differences in performance between the schedulers vary greatly across traces.
Overall, we kept the working environment fixed per trace, yet obtained significantly different results depending on the scheduler and input trace. Thus, our trace-based simulations give practical evidence that researchers must experiment with different traces to claim generality and feasibility of their proposed approaches.
C-2. Limited venue selection in the survey. Besides omitting venues that yielded no results for our initial query, we made sure that journals, workshops, and conferences of various levels of quality were covered. We believe this covers the systems community to a degree from which conclusions can be drawn. We specifically focus on articles published in the systems community, as specialized communities, e.g., bioinformatics, focus on systems that solve domain-specific problems, but rarely conduct in-depth experiments, including trace-based ones, to test system-level capabilities and behavior.
C-3. Level of data anonymization. The Google team published interesting data [37], but their anonymization approach, normalizing the values of both resource consumption and available resources, significantly reduces the usability of the traces and the characterization details they provide. We argue this type of anonymization is not preferable. When the available resources per machine (e.g., available disk space, memory, etc.) and the resource-consumption numbers are normalized, reusing traces for different environments becomes difficult. Researchers then need to make assumptions about the hardware on which the workflows were executed, as done in the work of Amvrosiadis et al. [5], or need to assume a homogeneous environment.
Instead, obfuscation techniques, such as multiplying both consumption and resources by a certain factor, allow for relative comparisons and for replaying the scheduling of the workload on the resources, while still concealing the original data.
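Such multiplicative obfuscation can be sketched as follows; the field names are illustrative. Because both demand and capacity are scaled by the same secret factor, all ratios (and thus utilization and replayability) are preserved:

```python
def obfuscate(records, factor):
    """Multiply resource-related values by a secret factor.
    Relative comparisons (ratios, utilization) are preserved, while the
    original magnitudes are concealed. `records` is a list of dicts."""
    scaled_keys = ("cpu_demand", "cpu_capacity")   # illustrative field names
    return [{k: (v * factor if k in scaled_keys else v)
             for k, v in record.items()}
            for record in records]
```

Normalization, by contrast, discards the shared scale entirely, which is exactly what makes normalized traces hard to reuse across environments.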
C-4. The Workflow Trace Format. A fourth challenge is the choice of properties included in the workload trace format. For each property encountered in other formats, we carefully decided whether or not to include it. Low-level details such as page caches are omitted so as not to unnecessarily complicate the traces. If future demands change, the per-object versioning schema will allow for such additions.

Related Work
We survey in this section the relevant body of work focusing on trace archives and on characterizing workloads. Differently from other archives, the WTA focuses on workloads of workflows, preserving workflow-level arrival patterns and task interdependencies not found in other archives. Differently from other characterization work, ours is the first to reveal and compare workflow characteristics across different domains and fields of application.
Open-access trace archives: Closest to this work is WorkflowHub [42], which archives traces of workflows executed with the Pegasus workflow engine and offers them in a unifying format containing structural information. WorkflowHub also provides a tool to convert Pegasus execution logs to traces, similar to our parsing tools. Different from this work, WorkflowHub's traces each include a single workflow, and thus not a workload with a job-arrival pattern. WorkflowHub also does not provide statistical insights per trace; thus, it does not meet requirements R-1 and R-3, and only partially meets R-4.
Also relatively close to this work, the ATLAS repository maintained by Carnegie Mellon University [4] contains two traces (the S3 traces in this work), with two other traces announced but not yet released (as announced, the S7 traces in this work). None of their published traces contains task-interdependency data, so, although overlapping with our S3 and S7, the ATLAS work is different in scope and in particular does not address workflows. Further, they do not consider different domains or fields, and their archive lacks a unified format, statistical insights, selection mechanisms, and tooling; thus, they do not meet our requirements R1-4.
Other trace archives with similarities to this work include the MyExperiment archive (ME) [19], the Parallel Workloads Archive (PWA) [13], and the Grid Workloads Archive (GWA) [23]. ME stores workflow executables, and semantic and provenance data, but does not provide execution traces as the WTA does, and thus has a different scope. The PWA includes traces collected from parallel production environments, which are largely dominated by tightly coupled parallel jobs and, more recently, by bag-of-tasks applications. The GWA includes traces collected from grid environments; differently from this work, these traces are dominated by bag-of-tasks applications and by virtual-machine lease-release data.
Workload characterization, definition, and modeling: There is much related and relevant work in this area, of which we compare only with the most closely related; other characterization work does not focus on comparing traces by domain and does not cover a set of characteristics as diverse as this work does, leading to our many findings. Closest to this work, the Google cluster traces have been analyzed from various points of view, e.g., [36,9,34]. Amvrosiadis et al. [3,4] compare the Google cluster traces with three other cluster traces, of 0.3-3 times the size and 3-60 times the duration, and find key differences; our work adds new views and quantitative data on diversity, through both survey and characterization techniques. Bharathi et al. [7] characterize workflow structures and the effect of workflow input sizes on those structures. Five scientific workflows are used to explain in detail the composition of their data and computational dependencies. Using this characterization, a generator for parameterized workflows is developed. Juve et al. [27] characterize six scientific workflows using workflow-profiling tools that investigate the resource consumption and computational characteristics of tasks. The teams of Feitelson and Iosup have provided many characterization and modeling studies for parallel [16], grid [22], and hosted-business [41] workloads; and Feitelson has written a seminal book on workload modeling [14]. In contrast, this work addresses in depth the topic of workloads of workflows.

Conclusion and Ongoing Work
Responding to the stringent need for diverse workflow traces, in this work we propose the Workflow Trace Archive (WTA), an open-access archive containing workflow traces. We conduct a survey of how the systems community uses workflow traces, by systematically inspecting articles accepted in the last decade at peer-reviewed conferences and in journals. We find that, of all articles that use traces, less than 40% use realistic traces, and less than 15% use any open-access trace. Additionally, the community focuses primarily on scientific workloads, possibly due to the scarcity of traces from other domains. These findings suggest existing limits to the relevance and reproducibility of workflow-based studies and designs.
We design and implement the WTA around five key requirements. At the core of the WTA is a unified trace format that, uniquely, supports both workflow- and task-level NFRs. The archive contains a large and diverse set of traces, collected from 10 sources and encompassing over 48 million workflows and 2 billion CPU core hours.
Finally, we provide deep insight into the WTA traces, through a statistical characterization revealing that: (1) there are large differences in workflow structures between scientific, industrial, and engineering workflows, (2) our two biggest traces, from Alibaba and Google, have the most stable arrival patterns in terms of tasks per hour, (3) industrial workflows tend to have the highest level of parallelism, (4) the level of parallelism per domain is clearly divided, (5) engineering workloads tend to have the most tasks on the critical path, (6) the three domains inspected in this work show distinct critical-path curves, and (7) to claim generality of an approach, one should test a system with a variety of traces with different properties, possibly from different domains.
In ongoing work, we aim to attract more organizations to contribute real-world traces to the WTA, and to encourage the use of the WTA content and tools in educational and production settings. One of our goals is to develop a library that system administrators can integrate into their systems to generate traces in our format. Our preliminary experience shows that developing such a library, even for a single system, requires significant engineering effort; it is thus left for future work. We aim to support other formalisms in the future, including directed graphs, BPMN workflows, etc., based on the community's needs. Furthermore, we aim to improve the trace format and the statistics we report for each trace, based on community feedback.

A.1.1 Survey Workflow Trace Usage
The data used in the survey and the tools used to visualize it are available as open-source software at https://github.com/lfdversluis/wta-analysis.

A.1.2 Parsing traces
To parse traces into the Parquet file format using the WTA workload format, the archive offers a parse script for each distinctive workload structure at https://github.com/lfdversluis/wta-tools. All parse scripts are available as open-source software and can be inspected for correctness and used as a foundation for other parse scripts. All parse scripts are located in the parse_scripts/parquet_parsers/ folder, and the filenames indicate which raw data source they parse. For example, the workflowhub_to_parquet.py Python script parses a trace from the WorkflowHub archive. All output can be found in parse_scripts/output_parquet/, with a sub-folder per source. Thus, the workflowhub_to_parquet.py script will output to parse_scripts/output_parquet/workflowhub/.

A.1.3 Validating Traces
To validate traces and their properties, the archive offers a trace-validation tool, which can be found in the parse_scripts directory of the GitHub repository containing the WTA tools. The tool expects traces in the Parquet file format using the unified trace format introduced in this work. To run the tool, run the validate_parquet_files.py Python script with the path of the trace directory to be validated as its argument. The tool prints its results to standard output (stdout) and exits with a non-zero exit code if validation fails.

A.2 Trace Characterization
All scripts used to characterize the contents of the WTA are available online at https://github.com/lfdversluis/wta-analysis. These scripts are offered as IPython notebook files, which include both the code and the visual outputs, i.e., graphs. To run all IPython notebook files, Jupyter (4.1) is required, and the Python libraries PySpark (2.4.3), plotnine (0.5.1), NumPy (1.16.2), Pandas (0.24.2), more-itertools (7.0.0), and hurst (0.0.5) should be installed on the machine running Jupyter.

A.2.1 Simulation Experiment
The simulator used in Section 5 is available online as open-source software at https://github.com/lfdversluis/wta-sim. The datasets used are available in the open-access archive introduced in this work.
The simulator is written in Kotlin 1.3 and was executed on a system running CentOS Linux release 7.4.1708 with an Intel(R) Xeon(R) E5-2630 v3 CPU @ 2.40GHz. The Parquet data was read using the Avro Parquet reader (1.10.1) from the Hadoop (3.2.0) project.