Big data analytics has ushered in a new age of research in everything from astronomy to online shopping habits. But big data can become a big problem when the data is too big to efficiently store, process or analyze.

Tan Bui-Thanh, an ASE/EM professor and associate researcher at the Institute for Computational Engineering and Sciences (ICES), is part of a research team that has been awarded a $1.05 million grant from the U.S. Department of Energy (DOE) to help make data more manageable by developing mathematical methods that reduce large amounts of data down to the most essential and important parts.

Bui-Thanh is collaborating with three other researchers: Paul Constantine, professor at the Colorado School of Mines and principal investigator, and Youssef Marzouk and Qiqi Wang, both professors at the Massachusetts Institute of Technology.

"Big data comes from many DOE facilities that generate tons of data per day and it is expected to increase by a few orders of magnitude in the near future. How to extract the important information in the data is important to understanding the science," said Bui-Thanh, who calls big data analytics the fourth scientific paradigm, after theory, experiment and computation.

The grant, lasting three years and awarded through the DOE's Advanced Scientific Computing Research program, will focus specifically on reducing data related to inverse problems, a class of problems that use observed data to infer unknown parameters or initial conditions of physical phenomena, such as weather patterns or contaminated groundwater flow. The data are often used to construct complex computational models.

A graphic depicting pressure contour data from fluid flow over an airfoil with 30 percent of collected data removed. Researchers compared the reduced data with the complete data to identify data redundancy.

"We are looking for the input that induces output that is consistent with data that we observe," Bui-Thanh said.
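The idea Bui-Thanh describes can be sketched with a toy inverse problem. This is a hedged illustration only, not the team's actual methods or software: the forward model, decay rate and noise level below are invented for demonstration. We observe noisy outputs of a known physical model and search for the input parameter whose predicted output best matches those observations.

```python
import numpy as np

# Toy inverse problem (illustrative assumption, not from the grant):
# observations d come from a forward model G(m) plus noise, and we
# recover the unknown parameter m by minimizing the data misfit.

def forward(m, t):
    # Hypothetical forward model: exponential decay with unknown rate m.
    return np.exp(-m * t)

true_m = 1.5
t = np.linspace(0, 2, 50)
rng = np.random.default_rng(1)
d = forward(true_m, t) + 0.01 * rng.standard_normal(t.size)

# Grid search over candidate rates: pick the input whose output is
# most consistent with the observed data.
candidates = np.linspace(0.1, 3.0, 300)
misfits = [np.sum((forward(m, t) - d) ** 2) for m in candidates]
m_hat = candidates[np.argmin(misfits)]
print(m_hat)  # close to the true rate 1.5
```

Real inverse problems replace this brute-force grid search with large-scale optimization or Bayesian inference, which is where the data volumes become prohibitive.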

In the near future, inverse problems will use and produce data in the exabyte range (10^18 bytes), or about 500 to 3,000 times the amount of content stored in the Library of Congress, said Bui-Thanh. However, only a small portion of the total data is important in influencing answers or model outcomes, with the rest of the information taking up precious hardware, time and energy resources. To help reduce input and output data, Bui-Thanh and his collaborators are investigating mathematical and statistical methods that automatically identify, approximate, and regroup information from the most active subspace of the input, the most influential portion of the data.
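A minimal sketch of the active-subspace idea, in the style of Constantine's work: estimate the matrix C = E[∇f ∇fᵀ] from gradient samples, then keep only the eigenvectors with large eigenvalues, the directions along which the model output actually varies. The toy model and all names below are my own assumptions, not the team's code.

```python
import numpy as np

def f(x):
    # Toy model: the output depends on only one combination of inputs,
    # so the active subspace is one-dimensional by construction.
    return np.exp(0.7 * x[0] + 0.3 * x[1])

def grad_f(x):
    g = np.zeros_like(x)
    g[0] = 0.7 * f(x)
    g[1] = 0.3 * f(x)
    return g

rng = np.random.default_rng(0)
dim, n_samples = 10, 200
samples = rng.uniform(-1, 1, size=(n_samples, dim))

# Monte Carlo estimate of C = E[grad f grad f^T].
C = np.zeros((dim, dim))
for x in samples:
    g = grad_f(x)
    C += np.outer(g, g)
C /= n_samples

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
active_dir = eigvecs[:, -1]           # dominant direction: the active subspace
print(eigvals[-1] / eigvals.sum())    # fraction of variability it captures
```

In this toy case a single direction captures essentially all of the output's variability, so a 10-dimensional input can be reduced to one coordinate, the kind of dimension reduction that, at scale, yields the orders-of-magnitude savings the team is targeting.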

Using these methods, Bui-Thanh says he hopes to reduce the data required to compute inverse problems by three to six orders of magnitude.

Bui-Thanh and his collaborators are working with national labs, focusing on the topics of chemical kinetics and turbulent flame simulation. But by the end of the grant, they hope to have refined the methods to the point where they can be applied to a wide swath of inverse problems.