Cutting big data down to a usable size

July 6th, 2015 • Claudia Lutz

Next-generation DNA sequencing technologies have turned the vision of precision medicine into a plausible reality, but they also threaten to overwhelm computing infrastructures with unprecedented volumes of data. A recent $1.3M award from the National Institutes of Health will allow researchers at the University of Illinois and Stanford to help address this challenge by developing novel data compression strategies.

Anyone who has struggled with the logistics of working with, saving, or sharing large computer files can empathize with the challenge faced by today's biomedical researchers and medical practitioners. However, the genomic data files that these groups are beginning to produce on a routine basis are orders of magnitude larger than the average movie or digital photo; a single human genome sequence takes up around 140 GB.

Olgica Milenkovic, an Associate Professor of Electrical and Computer Engineering at Illinois, and Stanford Professor of Electrical Engineering Tsachy Weissman are co-PIs on the project. Milenkovic is an affiliate of the Biosystems Design and the Gene Networks of Neural and Developmental Plasticity research themes at the Carl R. Woese Institute for Genomic Biology (IGB). Electrical and computer engineers Deming Chen and Wen-Mei Hwu, and bioengineer and fellow IGB member Jian Ma are co-investigators; Ma is a member of the Cellular Decision Making in Cancer and the Gene Networks of Neural and Developmental Plasticity research themes, and an affiliate of the Biosystems Design theme.

"Precision medicine requires that genomic, proteomic and other types of health-care related data corresponding to many individuals be acquired, stored and archived for many years," said Milenkovic. "[Our goal is] to develop a suite of software solutions for the next generation of biological data repositories and labs, which are currently facing enormous challenges with data storage, transfer, visualization and wrangling."

The grant was one of several new software development awards within the NIH Big Data to Knowledge (BD2K) Initiative, which supports efforts to improve the production, analysis, management, and accessibility of biomedical Big Data of all kinds. It is the second BD2K grant involving Illinois faculty awarded this academic year; in September 2014, a collaboration between the Mayo Clinic and members of Illinois' CompGen Initiative received BD2K Center of Excellence funding.

The main goal of Milenkovic and Weissman's project will be the development of a suite of data compression software that will handle several types of genomic data, including DNA sequence data, metagenomic data, quality scores for sequences, and data from gene functional analyses (e.g., RNA-Seq). While compression of each data type requires a unique approach, the group hopes to identify aspects of compression strategies that are transferable across many types of genomic data.

The myriad data compression methods that exist share one basic principle: to represent the information stored in a data set in a more efficient way. A classic example is that a string of 50 As can be much more compactly represented as "A times 50."
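
The "A times 50" idea is known as run-length encoding. The short Python sketch below is purely illustrative and is not taken from the project's software; the function names rle_encode and rle_decode are made up for this example.

```python
from itertools import groupby


def rle_encode(seq: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(seq)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (character, count) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


encoded = rle_encode("A" * 50)
print(encoded)  # [('A', 50)]
```

Fifty characters shrink to a single pair here, but a sequence with few repeated letters in a row would gain little or nothing from this scheme, which is one reason practical tools use more elaborate methods.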

Genomic data sets pose a unique set of challenges and opportunities for data compression, because they have a large amount of repetition and a very small alphabet; for example, just four nucleotide bases, or "letters," in raw DNA sequence. As the example given above demonstrates, repetitions within data provide opportunities for shortcuts in representation.
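
To see why a four-letter alphabet helps, note that four symbols can be distinguished with just two bits, whereas plain text typically spends eight bits on each character. The following Python sketch, with the hypothetical helpers pack and unpack, assumes a sequence containing only the letters A, C, G and T; real sequence files also carry other symbols and metadata that this toy example ignores.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"  # index i holds the base whose 2-bit code is i


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpack can read every byte the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack(seq)
print(len(seq), "characters ->", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, because the last byte may be only partly filled.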

In addition, some genomic data sets are linked to a reference—a genome sequence from a similar species, the same species or even the same individual, to which DNA or RNA sequences can be compared. A reference-based algorithm can then encode only the differences between these sequences and the reference, rather than every nucleotide of the sequence, to greatly reduce the size of the data set.
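
A heavily simplified Python sketch of the reference-based idea follows. It assumes the sample and the reference are already aligned, have the same length, and differ only by single-letter substitutions; actual tools must also handle insertions, deletions and the alignment step itself. The function names are invented for illustration.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Record only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length, pre-aligned sequences")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def rebuild(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Reconstruct the sample by applying the stored differences to the reference."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
sample = "ACGTACCTACGA"
diffs = diff_against_reference(reference, sample)
print(diffs)  # [(6, 'C'), (11, 'A')]
```

Only two small entries need to be stored instead of all twelve letters, provided that whoever decodes the data has access to the same reference.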

Milenkovic, Weissman and colleagues will explore strategies that combine existing compression algorithms, focusing on those that handle these data characteristics well, with algorithms that will be newly developed as part of the project.

"We will cover the development of the algorithms, their analysis, prototyping of the software solutions and benchmarking on real data," said Milenkovic, describing her research group's role in the project. "We plan on collaborating with Mayo Clinic, and potentially other institutions, to promote the use of our methods" among biomedical researchers, she added.

Another challenge the project will address is the trade-off inherent in data compression: the more an algorithm reduces the size of a data set, the more computing power it typically requires. A major component of the project will be to develop parallel processing strategies to decrease the wait time for users of the resulting software.
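
One common way to parallelize compression, shown here only as a generic illustration and not as the project's approach, is to split the data into chunks and compress the chunks at the same time. The Python sketch below uses the standard zlib library as a stand-in compressor; its level setting (1 for fastest, 9 for smallest output) exposes the trade-off between size and computing effort. The function names, the chunk size and the worker count are arbitrary assumptions, and any actual speedup depends on the machine and the runtime.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, chunk_size: int, level: int, workers: int = 4) -> list[bytes]:
    """Split data into fixed-size chunks and compress them concurrently."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so the pieces can be rejoined later.
        return list(pool.map(lambda chunk: zlib.compress(chunk, level), chunks))


def decompress_chunks(blobs: list[bytes]) -> bytes:
    """Decompress each chunk and join the results in their original order."""
    return b"".join(zlib.decompress(blob) for blob in blobs)


data = b"ACGT" * 10_000  # toy input; real sequence data is far less regular
blobs = compress_chunks(data, chunk_size=8192, level=9)
print(len(data), "bytes ->", sum(len(blob) for blob in blobs), "bytes")
```

Compressing chunks independently usually costs some compression efficiency, since repetition that spans two chunks can no longer be exploited, so the chunk size is itself part of the trade-off.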

Provided by University of Illinois at Urbana-Champaign