Research project receives DOE funding to reduce size of scientific datasets


The amount of data produced each year by scientific user facilities, such as those at national laboratories and other government organizations, can reach several billion gigabytes. This flood of data has begun to outpace researchers' ability to analyze it effectively in pursuit of their scientific goals: an oversized problem when it comes to making new scientific breakthroughs.

To develop new mathematical and computational techniques to reduce the size of these datasets, the U.S. Department of Energy (DOE) awarded $13.7 million to nine projects under the Advanced Scientific Computing Research (ASCR) program in September 2021. A team led by Dr. Byung-Jun Yoon, associate professor in the Department of Electrical and Computer Engineering at Texas A&M University, received $2.4 million to address challenges related to moving, storing and processing the huge datasets produced and processed by scientific workflows.

The guiding principle of the project is to focus on the scientific objectives of each dataset and to retain the quantities of interest (QoI) that relate to those objectives. By optimizing the data representation while keeping the scientific goals in view, Yoon's team can preserve the information that matters most for scientific advancement despite a significant reduction in data size.

“Our idea is not only to drastically reduce the amount of data, but ultimately to preserve the purposes the data is meant to serve. That’s why we call it goal-based data reduction for science workflows. We want to reduce the amount of data without sacrificing the quantities of interest.”

Dr. Byung-Jun Yoon, Associate Professor, Department of Electrical and Computer Engineering, Texas A&M University

One of the first steps Yoon's team will take toward this goal is to use an information-theoretic approach to find compact representations of the data by exploiting its semantics and invariances. The team will also quantify the impact of data reduction on the achievement of the end goals, and use that measure to jointly optimize the models that make up the overall science workflows.
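As a toy illustration of why goal-based reduction differs from generic compression (the dataset, feature layout and QoI below are hypothetical, not from the project), the following Python sketch reduces the same data two ways and compares how well each preserves a quantity of interest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1000 samples x 50 features. Features 0-4 carry the
# quantity of interest (QoI); features 5-49 have high variance but
# are irrelevant to the scientific goal.
X = np.hstack([
    rng.normal(0.0, 0.1, size=(1000, 5)),   # low-variance, QoI-relevant
    rng.normal(0.0, 5.0, size=(1000, 45)),  # high-variance, irrelevant
])

def qoi(data):
    """Quantity of interest: mean of the first five features per sample."""
    return data[:, :5].mean(axis=1)

# Generic reduction: keep only the 5 highest-variance features,
# zeroing the rest (a goal-agnostic criterion).
top_var = np.argsort(X.var(axis=0))[-5:]
generic = np.zeros_like(X)
generic[:, top_var] = X[:, top_var]

# Goal-aware reduction: keep the 5 features the QoI actually depends on.
goal_aware = np.zeros_like(X)
goal_aware[:, :5] = X[:, :5]

# Impact of each reduction on the QoI (mean absolute error).
err_generic = np.abs(qoi(generic) - qoi(X)).mean()
err_goal = np.abs(qoi(goal_aware) - qoi(X)).mean()
print(err_generic, err_goal)
```

Both reductions keep the same number of features, but the variance-based one discards the low-variance features the QoI depends on, while the goal-aware one keeps the QoI intact, which is the kind of trade-off a goal-based metric is meant to expose.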

An example of how an overwhelming amount of data can become unmanageable is cryogenic electron microscopy (cryo-EM), a widely used method for analyzing molecular structure. A typical cryo-EM dataset consists of thousands of micrographs containing projection images of molecules in various orientations, amounting to several terabytes in size. Another example is X-ray scattering experiments, which are routinely carried out to analyze material structure. When performed in a mapping mode, where X-ray exposures are taken across the cross section of a sample, a single scattering map is a four-dimensional dataset that can contain approximately 10 billion values.
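To make that scale concrete, here is a rough back-of-the-envelope calculation; the grid dimensions and bit depth are assumptions chosen only to reach the roughly 10 billion values cited for a single 4D scattering map:

```python
# Hypothetical mapping-mode X-ray scattering acquisition:
# one detector frame per scan position over the sample cross section.
scan_y, scan_x = 100, 100      # real-space scan positions (assumption)
det_y, det_x = 1000, 1000      # detector pixels per exposure (assumption)

values = scan_y * scan_x * det_y * det_x   # total values in the 4D dataset
bytes_per_value = 4                        # 32-bit counts (assumption)
size_gb = values * bytes_per_value / 1e9

print(values)    # 10 billion values
print(size_gb)   # 40.0 GB for a single map
```

At 32 bits per value, one such map alone occupies about 40 GB, before any repetition across samples, conditions or experiments.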

“What excites me the most is that, probably for the first time, we are looking at this problem of data reduction from the perspective of the objective, which I think may not have been done by others,” Yoon said. “We provide a metric that can be used to objectively quantify the impact of data reduction, and then optimize the data reduction pipeline using that metric so that we can preserve the usability of the data to support the end goal. The ultimate performance that we can achieve by applying this idea to data reduction is also very exciting.”

The mission of the ASCR program is to discover, develop and deploy computing and networking capabilities to analyze, model, simulate and predict complex phenomena important to DOE and the advancement of science.

