With the continued rise of scientific computing and the enormous growth in the size of data being processed, scientists must consider whether the processes for transmitting and storing data sufficiently ensure the integrity of scientific data. When integrity is not preserved, computations can fail, increasing computational cost due to reruns, or worse, results can be corrupted in a manner not apparent to the scientist, producing invalid science results. Technologies such as TCP checksums, encrypted transfers, checksum validation, RAID, and erasure coding provide integrity assurances at different levels, but they may not scale to large data sizes and may not cover a workflow end to end, leaving gaps in which data corruption can occur undetected.
In this talk, we will present our findings from the "Scientific Workflow Integrity with Pegasus" (SWIP) project by describing an approach to assuring data integrity, covering both malicious and accidental corruption, for workflow executions orchestrated by the Pegasus Workflow Management System (WMS). A key goal of SWIP is to provide assurance that any changes to input data, executables, and output data associated with a given workflow can be efficiently and automatically detected. Toward this goal, SWIP has integrated data integrity protection into a newly released version of Pegasus WMS, which automatically generates and tracks checksums both for input files when they are introduced and for files generated during execution. We will describe how we validate our integrity protection approach by leveraging Chaos Jungle, a toolkit providing an environment for validating integrity verification mechanisms by allowing researchers to introduce a variety of integrity errors during data transfers and storage. We will also provide an analysis of integrity errors and associated overheads that we encountered when running production workflows using Pegasus.
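The checksum bookkeeping described above can be sketched in a few lines. The following is a minimal illustration of the general technique, not Pegasus's actual API: the function names and the in-memory catalog are hypothetical stand-ins for the registry a WMS would maintain, and SHA-256 is assumed as the digest algorithm.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_checksum(path: Path, catalog: dict) -> None:
    """Record a file's checksum when it is introduced into the workflow."""
    catalog[str(path)] = sha256_of(path)

def verify_checksum(path: Path, catalog: dict) -> bool:
    """Recompute the checksum and compare it with the recorded value
    before the file is consumed by a downstream job."""
    return sha256_of(path) == catalog.get(str(path))
```

In this pattern, a checksum recorded when a file enters the workflow (or is produced by a job) is re-verified after each transfer or before each downstream use, so that corruption introduced anywhere in between is detected rather than silently propagated.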