This Science News Wire page contains a press release issued by an organization and is provided to you "as is" with little or no review from Science X staff.

Lehigh research team to investigate a 'Google for research data'

August 20th, 2018

There was a time, not that long ago, when the phrases "Google it" or "check Yahoo" would have been interpreted as sneezes, or perhaps as symptoms of an oncoming seizure, rather than as coherent thoughts.

Today, these are key to answering all of life's questions.

It's one thing to use the Web to keep up with a Kardashian, shop for ironic T-shirts, argue with our in-laws about politics, or do any of the myriad other things we do online today. But if you are a serious researcher looking for real data that can help you advance your ideas, how useful are the underlying technologies that support the search engines we've all come to take for granted?

"Not very," says Brian Davison, associate professor of computer science at Lehigh University. "They understand web pages, not datasets. And existing dataset search services are cumbersome, focusing on searching descriptions instead of data, and they cater to researchers looking within their own discipline."

Davison and his Lehigh research team envision a "dataset search engine" that can ultimately help many kinds of scientists locate data that they can use to perform exploratory analysis and test hypotheses. The team has won more than $500,000 in support from the National Science Foundation (NSF) for this endeavor, which formally launched on August 1, 2018, with an estimated completion date of July 31, 2021.

The interdisciplinary Domain-Agnostic Dataset Search team at Lehigh includes Davison as principal investigator (PI) and co-PIs Jeff Heflin, associate professor of computer science, and Haiyan Jia, assistant professor of journalism and communication in Lehigh's College of Arts and Sciences. Together, they are developing techniques that enable the discovery of relevant datasets, regardless of the searcher's area of expertise.

According to the group, the number of public dataset collections now available has grown so large that researchers find it difficult to track them within their own discipline, and simply impossible to do so across disciplines. To help researchers find data in a discipline-agnostic manner, this NSF-backed project will investigate new, promising approaches to full-content dataset search, utilizing what the team calls "user-centric methods to develop dataset search tools and novel methods of indexing a dataset's contents."

While some disciplines have carefully curated dataset collections with search capabilities, they are limited in scope and require researchers to know which collection to search. This makes it more difficult for researchers in other disciplines to find these datasets.

"By investigating domain-agnostic search techniques," says Davison, "we hope to enable the creation of a worldwide dataset search service, much like today's web search engines."

Through this project, the team hopes to provide the underlying technology and develop a prototype tool that demonstrates this kind of search in practice.

"Our goal," says Davison, "is that this work will one day help enable public dataset discovery and reuse, regardless of who produced the data or where it is stored—a way for researchers from all fields to organize, distribute, and access hard-won knowledge effectively, avoiding duplication of effort and enabling overall progress."

According to Heflin, data and data analytics are now an integral part of academic discovery across all areas of research and learning.

"We hope to help research communities be more efficient in their use of data to solve problems and create new knowledge," he says. "We envision a system as easy and powerful to use as Google, but used to explore datasets instead of Web pages, photos, and videos. This will be especially beneficial to research endeavors undertaken by social, physical, and data scientists."

Jia says that the design and development of the prototype will also involve professionals and practitioners in observational, interview, and experimental studies that inform and guide the process. These studies will include a set of instruments for evaluating the dataset search technology and interface from the user's perspective.

"A dataset search engine using these methods benefits society by helping researchers accelerate their work and reduce duplication of effort," she says. "We intend for the end result of this project to help any analyst locate and utilize relevant datasets. It will benefit others in 'research-adjacent' pursuits as well, such as journalists seeking ways to improve their reporting, and financial managers forecasting trends in the marketplace."

All three of the primary researchers on the team are affiliated with Lehigh's new Interdisciplinary Research Institute for Data, Intelligent Systems, and Computation (I-DISC). It is one of three new institutes launched by the University to create communities of scholars and catalyze crucial research in areas where Lehigh can take a leading position on the national and international stage and make lasting societal contributions.

"I-DISC was formed to support teams of researchers that combine fundamental data and computational approaches with those focused on critical applications," says Davison. "With its potential for broad impact across the research world, this project is a perfect fit for that vision."

The team's project formally kicked off on August 1, 2018, and extends through July of 2021. The researchers intend to incorporate results of this effort into Lehigh courses that delve into data science, search engines, data journalism, and semantic Web technologies.

Provided by Lehigh University

Citation: Lehigh research team to investigate a 'Google for research data' (2018, August 20) retrieved 30 June 2025 from https://sciencex.com/wire-news/296219178/lehigh-research-team-to-investigate-a-google-for-research-data.html