refget v2.0 links the hidden dictionaries of DNA
A widely-used tool that finds the exact references needed to pinpoint differences in our DNA just got a refresh.
On 17 July, the Standards Steering Committee of the Global Alliance for Genomics and Health (GA4GH) voted to release refget v2.0. With better compatibility for a range of reference genome names, formats, and systems, the new version of refget makes it easier than ever to retrieve verified genomic reference sequences.
A vital infrastructure
You may not even realise that you're using refget already.
"Almost anyone who uses a CRAM file uses refget," said Timothe Cezard, co-leader of the team that produced the new refget version and a project lead at EMBL's European Bioinformatics Institute (EMBL-EBI). "Compression, decompression, all CRAM tools—say, for conversion to other formats or direct analysis—sit on top of refget."
CRAM is a popular and efficient file format for storing DNA sequences, able to reduce storage costs by up to 50%. It achieves that staggering compression by relying on a reference sequence—a bit of DNA considered typical.
Compare your own genes against the reference, and you will begin to see variation: genetic differences that could lead to everything from freckles to a high risk of breast cancer.
Instead of storing all three billion base pairs of the reference sequence alongside the DNA being studied, CRAM files simply hold onto the reference sequence's name.
When it's time to decompress the data, refget steps in—helping you "get" the "reference" you need.
Solving the dictionary dilemma
CRAM is just one example of how refget removes dangerous uncertainty from genomic data.
"refget identifies each reference sequence using its inherent unique qualities, so you can always trust that a sequence contains what it says on the label," said Andrew Yates, a founding developer of refget and a team leader at EMBL-EBI.
"The consequences of comparing genomic data to incorrect or misaligned reference sequences are serious. Genetic variants may be classified as pathogenic or harmless incorrectly, and patients could receive improper care. Being exact matters," he said.
By assigning a unique identifier to reference sequences, refget solves a tricky naming problem in genomics.
Central authorities like the International Nucleotide Sequence Database Collaboration (INSDC), Ensembl, and the University of California, Santa Cruz (UCSC) Genome Browser use different naming conventions for the same reference sequence.
Think of how the Oxford English Dictionary and Merriam-Webster sometimes spell and define the same English word differently. Then try to convince a British English speaker to use "color" instead of "colour," and you will see the challenge of standardising nomenclature.
Non-unique names create even more uncertainty when analysing data. For instance, another common naming convention counts by chromosome number, starting with chromosome 1 as the largest. But the reference genomes of many organisms have a chromosome called "1." How do you know you're getting a human chromosome and not a mouse, for instance? Which "1" is the right one to use?
refget clears up any confusion.
"refget is very straightforward. You've got a name, you grab a sequence. You have a sequence, you construct the name," said Cezard. "You don't need to rely on any naming authority."
Why you need refget for any genomic analysis
For the initial refget release in 2018, the GA4GH Large Scale Genomics Work Stream tailored the API to support CRAM.
But Yates and team quickly realised that refget could smooth over issues in other genomic data formats and models. VCF and SAM also support refget identifiers, with growing community interest in using them.
"refget is a fundamental building block for GA4GH standards," said Yates. "It can solve problems beyond CRAM, for any file format or data model that requires a reference sequence. With refget, you know exactly what sequence you're talking about."
For instance, refget is already solving problems for the GA4GH Variation Representation Specification (VRS), which provides a framework for describing genetic variants that computers can easily compare and analyse.
Yates and Cezard worked closely with the VRS team to develop refget v2.0, which supports VRS sequence identifiers. Now hospitals, laboratories, and databases like NIH-funded ClinGen, which use VRS to represent and share genetic variants, link to reference sequences via refget.
"In large part due to refget, GA4GH VRS makes sharing and comparing variant data across institutions much more reliable. refget lets us pinpoint the exact reference sequence, which then helps us represent the variation unambiguously," said Larry Babb, a leader of the VRS team, principal software engineer at the Broad Institute of MIT and Harvard, and a GA4GH Driver Project Champion for ClinGen.
"By using refget identifiers in VRS, we can tackle important interoperability challenges that arise when comparing evidence from new reference sequences. The strategy has already worked in the real world in a project with the Atlas of Variant Effects Alliance," said Alex Wagner, the other VRS team leader, who is a principal investigator at Nationwide Children's Hospital and GA4GH Driver Project Champion for the Variant Interpretation for Cancer Consortium.
Another major resource for the genomics community, the European Nucleotide Archive (ENA), has already implemented refget v2.0.
ENA contains all sequenced DNA and RNA in the public domain—nearly three billion sequences. To decompress files from the database, researchers use the CRAM reference registry, which runs on refget.
The new version of refget will also debut in the Ensembl genome browser. This collection of more than 50,000 genomes (representing great diversity within and among species, from humans to maize to zebrafish) offers tools for analysis and comparison.
"refget is powering our new Ensembl infrastructure. These refget endpoints will be made available in the near future and will provide access to Ensembl hosted protein and transcript sequences," said Yates.
New features in v2.0
The latest version of refget expands the capabilities of the API, making it more accessible and compatible with other systems.
Work with the VRS team led to a new preferred algorithm for defining identifiers. Other new features detailed in the specification include recommended best practices (such as lowercase naming authority strings), and options when searching for a specific identifier (with or without a namespace).
One key change—aimed at expanding the groups who can benefit from refget—allows you to search not just by unique refget identifier, but by another naming convention. In dictionary terms, you can look up either "colour" or "color" and still retrieve the right definition.
"refget servers can now retrieve the same sequence using a different naming convention. The new version is interoperable with other systems that rely on naming authorities, so you can search even if you don't have access to the reference sequence itself," said Cezard.
"You can enter a name that isn't a refget identifier and still get the same verified, reliable sequence—which you can then recompute into a refget identifier," he added.
The new version includes technical solutions for handling non-unique names.
These major v2.0 upgrades don't entail major work for implementers: all existing refget clients can continue using the API. The only breaking change is a minimal one and makes refget servers compatible with the GA4GH Service Info API, which helps find web services for analysing genomic data.
refget for an entire genome
Building on the same principles as refget, the team is currently developing a new specification that verifies the identity of collections of sequences.
"refget defines a name for a single sequence, like a chromosome. Sequence Collections defines a name for a group of sequences, which we would often use for assemblies or a whole genome," said Cezard.
Sequence Collections will offer many new features beyond defining names, including searching within and comparing collections.
In the meantime, Cezard, Yates, and collaborators aim to strengthen support for refget in a wide range of genomic file formats, from BED to SAM to VCF, and demonstrate how useful refget can be.
"refget already safeguards a crucial step in genomic analysis for researchers around the world," said Yates. "This second version reinforces how vital the concept of unique identifiers really is, whether you are identifying a single reference sequence, an entire genome, or even a pangenome."
Provided by Global Alliance for Genomics and Health