Building a data repository that's responsive to researcher needs
NeSI is working with Genomics Aotearoa to develop a pilot data repository for the storage of genomic data generated from taonga species. As the repository takes shape, contributions of test datasets are helping identify challenges and opportunities around how researchers store, share, describe, publish, interact with, and archive data.
Dr Kim Handley, a Royal Society Te Apārangi Rutherford Discovery Fellow and Senior Lecturer in Environmental Microbial Genomics at the University of Auckland, is among the inaugural contributors to the repository. Her dataset comes from a Genomics Aotearoa-funded project to better understand the links between microbial life in stream, estuary, and sea ecosystems.
Through a collaboration between the University of Auckland School of Biological Sciences and the Cawthron Institute, Kim and her PhD student Hwee Sze Tee used genomic and mass spectrometry techniques in a study to identify the genes and proteins present in less well-studied benthic (river or lake bed dwelling) cyanobacteria, such as Microcoleus.
By studying these cyanobacteria in their natural habitat, they've been able to gain new insights into how cyanobacteria coexist with other bacteria and microbial eukaryotes, and into the cyanobacteria's various mechanisms for acquiring nutrients from the benthic mat environment. Studying Microcoleus during a 19-day bloom event over a New Zealand summer produced a wealth of useful genomic information. They learned that Microcoleus are equipped with diverse mechanisms for acquiring nitrogen and phosphorus, enabling them to proliferate and out-compete others in low-phosphorus waters, while taking advantage of nitrogen compounds likely introduced by agricultural runoff.
Figure 1. Underwater photograph of benthic Microcoleus mats coating a riverbed.
The results of this study were published in The ISME Journal and were also shared with GenBank, which is part of the International Nucleotide Sequence Database Collaboration, comprising the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at the National Center for Biotechnology Information (NCBI).
The availability of Kim's microbial dataset as early test data has given the data repository project team an opportunity to explore a number of functionality, process, and policy issues in the management of genomic data.
The Gen3 software behind the Genomics Aotearoa data repository was selected in early 2020, following a review of existing repository applications. Gen3 is an open-source platform developed initially by the Center for Translational Data Science (CTDS) at the University of Chicago to help accelerate scientific discovery through collaborative infrastructure. Early development partners included the US National Cancer Institute, where it powers the NCI Genomic Data Commons portal.
The metagenomic nature of Kim's data has required a refinement of the Gen3 data model through a hierarchical metadata dictionary. Physical storage, security, and access issues are being addressed within the context of NeSI HPC infrastructure. Genomics Aotearoa's kāhui-based process for deciding which datasets should be defined as taonga species data, and the access request process, are being incorporated into the Gen3 framework.
Kim's challenges in submitting her project's data and metadata to NCBI also have relevance for the Genomics Aotearoa repository.
"When we're writing our manuscripts and it's time to submit our data, it can be painful because you're doing so many other things trying to tie up all the loose ends and you've got to set aside a really decent chunk of time to grapple with all the metadata, all the issues with tidying up your data to submit. It can mean setting aside days if not weeks (albeit on and off) just to submit. Regardless, it's not an inconsequential amount of time."
With that in mind, while it would be useful to have a national repository where all New Zealand genomic data could be discovered in one place, she notes that it would ideally be connected to other existing online databases. Researchers have neither the time nor the desire to submit their data to multiple places, particularly if there are different upload requirements or processes for each repository. Relatedly, online repositories could be made more effective by allowing researchers to include their own annotations.
A New Zealand repository's effectiveness will also depend on its public interface, Kim says. When logging into GenBank and NCBI, it's a fairly quick and simple process to grab the data you need. So, consideration needs to be given to how projects, case studies, and datasets are stored, as well as how visitors can search, access, and use the data.
Insights such as these are essential to the ongoing development of the Genomics Aotearoa repository. While the initial focus will be on ensuring that taonga species genomes are located within New Zealand and are well managed under the kaitiakitanga of the kāhui and robust data management practices, the repository is expected to grow beyond taonga species data and become a national repository for genomic data.
As for Kim and Sze, they are looking to build on their work with the Cawthron Institute to continue studying Microcoleus. One of their follow-up projects is using isolates from the Cawthron collection, together with environmental genomes from New Zealand and California, to determine what is genomically different between toxic and non-toxic strains of Microcoleus. The goal is to better understand when proliferations of these cyanobacteria in our waterways are toxic.
For more information on the Genomics Aotearoa data repository, visit the Genomics Aotearoa website.