By Marlene Cimons, National Science Foundation
In 1948, American mathematician and electronic engineer Claude Shannon published a paper that became the foundation of “information theory,” a concept that examines the limits of how society represents and transmits information. His groundbreaking ideas opened the way for today’s trillion-dollar communications industry.
In “A Mathematical Theory of Communication,” Shannon quantified the limits of compressing, storing and transmitting data, presaging the current Information Age, with its proliferation of high-speed Internet, CDs, DVDs and wireless technology. Shannon described the fundamental problem of communication as “reproducing at one point either exactly or approximately a message selected at another point.”
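To make the idea of a compression limit concrete, here is a minimal sketch of the entropy formula from Shannon's paper, H = -Σ p(x) log2 p(x). For a source that emits symbols independently, H is the smallest average number of bits per symbol that any lossless code can achieve. The function name and the sample message are illustrative assumptions, and estimating probabilities from symbol counts in one short message is a simplification.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(message: str) -> float:
    """Empirical Shannon entropy of a message, in bits per symbol.

    Assumption: symbols are treated as independent draws, with
    probabilities estimated from their frequencies in `message`.
    """
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical example message over a four-letter alphabet.
message = "ACGTACGTAAAA"
bits = entropy_bits_per_symbol(message)
print(f"{bits:.3f} bits per symbol, about {bits * len(message):.1f} bits in total")
```

A message that uses four symbols equally often needs 2 bits per symbol, while a message made of a single repeated symbol needs none, which is why skewed or repetitive data compresses well.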
Information theory was a big hit with communications engineers and other researchers, who have been building upon Shannon’s work ever since. Recently, scientists created the Science of Information Center at Purdue University to advance science and technology through a new quantitative understanding of information processing in a wide-ranging array of systems, including biological, physical, social and engineering.
“Shannon started this communication revolution with a paper and a mathematical model, from which DVDs, the Internet and CDs came,” says Wojciech Szpankowski, director of the new center and a computer sciences professor at Purdue. “Engineers put into existence what he predicted more than 60 years ago.”
The center “is working in the same spirit in trying to build something fundamental and apply it to real-life problems,” he adds. “To keep pace with rapid advances in networking, biology and quantum information processing, we need to rethink how we understand and integrate information. By assimilating elements of space, time, structure, semantics and context, we will deepen our understanding of information and apply these results to critical problems in society.”
The National Science Foundation supports the center, one of NSF’s Science and Technology Centers, with $25 million in funding over five years. Purdue University is the lead institution, with research partners at the Massachusetts Institute of Technology, Stanford University, the University of California at Berkeley, Princeton University, Howard University, Bryn Mawr College, the University of California at San Diego and the University of Illinois at Urbana-Champaign.
Researchers also are developing an education program, including a new course called “Science of Information,” which will introduce undergraduates to the theories and problems of information and communication.
Center researchers hope to define and develop the core principles that govern information transfer, and apply this knowledge to problems in the physical and social sciences, and in engineering, including, for example, financial transactions, patterns of consumer behavior, and communication among cells within molecular biology. The results could have wide-ranging applications in numerous fields, from disease detection to developing the next generation of wireless networks.
“Information provides the essential substrate and unifying theme for virtually all complex interacting systems,” Szpankowski says. “Understanding information flow, therefore, holds the key to comprehending and building more efficient systems.”
In view of this, the center focuses its research around three major areas: life sciences, communication and the extraction of knowledge from massive datasets. Predictably, the three overlap upon occasion.
For example, scientific researchers around the world are collecting huge amounts of genomic sequences at a rate that outstrips conventional storage capacities. “We are looking at the different types of data being generated, how they are processed, and what questions are later asked about them to come up with intelligent compression schemes that will allow long-term storage of this valuable but copious information stream,” says Tsachy Weissman, associate professor of electrical engineering at Stanford University, and a center researcher.
“More concretely, we have been studying the fundamental limits, and computationally viable schemes for approaching those limits, in compression of genomic data given highly correlated genomic data already available on a database,” Weissman adds. “For example, compression of one individual’s genome, given that the genome of another individual from the same species is already on the database.”
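The idea of compressing one genome given another can be illustrated with a small sketch. It is not the center's method: it swaps in the preset-dictionary feature of the general-purpose zlib library, which lets the compressor refer back to data that both sender and receiver already hold. The toy "genomes" are random strings, and the helper names, sequence length, mutation count and seed are all assumptions chosen for illustration.

```python
import random
import zlib


def compress(data: bytes, reference: bytes = b"") -> bytes:
    """Deflate `data`, optionally priming the compressor with `reference`."""
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def decompress(blob: bytes, reference: bytes = b"") -> bytes:
    """Inverse of compress(); needs the same reference that was used to compress."""
    if reference:
        decomp = zlib.decompressobj(zdict=reference)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def make_toy_genomes(length: int = 20_000, mutations: int = 20, seed: int = 0):
    """Build a random reference sequence and a copy with a few point mutations.

    The length stays below zlib's 32 KB window so the whole reference is usable.
    """
    rng = random.Random(seed)
    reference = bytes(rng.choice(b"ACGT") for _ in range(length))
    individual = bytearray(reference)
    for pos in rng.sample(range(length), mutations):
        # Replace the base at this position with a different one.
        individual[pos] = rng.choice([b for b in b"ACGT" if b != individual[pos]])
    return reference, bytes(individual)


reference, individual = make_toy_genomes()
alone = compress(individual)
given_reference = compress(individual, reference)
print("raw bytes:", len(individual))
print("compressed alone:", len(alone))
print("compressed given reference:", len(given_reference))
```

Because the two sequences differ in only a few positions, the compressor with the reference mostly encodes "copy from the reference" instructions, so its output is far smaller than compressing the individual's sequence alone. Real genomic data is much larger than zlib's window and has richer structure, which is part of why dedicated schemes are an active research topic.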
Biology, especially, “is at a crossroads, and a better understanding of the field might come from understanding the information flow among cells,” Szpankowski says. “We haven’t designed biological systems--it’s already been built by natural selection--but we should be able to understand how information is passed from one cell to another cell. In this sense, information flow involves a much larger world than many people realize.”
Molecular biologists collect vast amounts of data, but aren’t always able to extract the information they need to confirm a hypothesis about how cells behave.
“These databases contain valuable information about how cells work and how diseases develop, but it is hard to find these needles in the haystack with current computational and analysis methods,” Szpankowski says. “We need the next level of computation and data analysis, which will come from a better understanding of information.”
Conventional research in computational biology often focuses on identifying single markers associated with disease and phenotype, that is, the observable characteristics of an organism. “However, it is widely believed that such phenotypes result from emergent behavior, one that is better observed as an interacting sub-unit, rather than individually differentiated markers,” says Ananth Grama, professor of computer science at Purdue.
“The problem of identifying network signatures, while highly promising in theory, poses profound computational and data-related challenges,” Grama adds. “As part of the broader theme within the center on network modeling and analysis, we propose to develop novel models, methods and software that will fundamentally enhance our understanding of disease and phenotype, while potentially uncovering new targets for intervention.”
Doraiswami Ramkrishna, professor of chemical engineering at Purdue, hopes center researchers will be able to help him distill information that will enable him to predict specific gene expression within cellular metabolism, his field of study.
“We have lots of data, and I want to decipher from that data the information that will help me prove a theory,” he adds. “We want to know that we are correctly predicting which genes are expressing, and which are not. We need to be able to see whether what we predict is reflected in the data.”
In the larger communications arena, center researchers also are studying ways to ensure the timely delivery of messages, which is not always possible within today’s wireless networks. “The new challenge is to build protocols that will deliver the message within strict deadlines in a dynamically changing environment,” Szpankowski says. “We would like to understand how information is being transported in space, and can be delivered by a certain deadline.
“In order to send useful information, for example, when I am using my cell phone, we have to build a virtual connection between users,” he adds. “If we send too much information overhead, there is not enough bandwidth to send useful signals. The goal is to try to cut out the clutter, and to send enough information to construct a path between you and the other person within enough time to get the real information across.”
Finally, center researchers hope to design new ways to sift through and extract relevant knowledge from the flood of information that society encounters daily in countless information exchanges.
“When you query Google, for example, you get an answer, but are you getting the knowledge you need? What is knowledge? How is it created from information?” Szpankowski says. “We are dealing right now with huge datasets. Every company has a huge amount of data. The problem is: how do you extract relevant information from your quest? It’s not a problem to get AN answer. The problem is whether you get THE answer.”