Genome data could outgrow YouTube, Twitter content by 2025 – report
Scientists have warned that the computing resources needed to handle genome data may soon exceed those required by giants like Twitter and YouTube. It has been estimated that between 100 million and 2 billion human genomes could be sequenced by 2025.
According to the report, published in the journal PLoS Biology, this means that between 2 and 40
exabytes of storage capacity will be needed by 2025 just for the
human genomes. And although the computer scientists believe that
these needs can be diminished with effective data compression,
“decompression times and fidelity are a major concern in
compressive genomics,” they say.
The team estimates that 300 hours of video are currently uploaded
to YouTube every minute, and this could “grow to
1,000–1,700 hours per minute (1–2 exabytes of video data per
year) by 2025 if we extrapolate from current trends.”
Twitter, meanwhile, currently generates 500 million tweets/day,
each about 3 kilobytes including metadata, the report states.
“While this figure is beginning to plateau, a projected
logarithmic growth rate would suggest a 2.4-fold growth by 2025,
to 1.2 billion tweets per day, 1.36 petabytes/year.”
In other words, data acquisition in these domains is expected to
grow by up to two orders of magnitude in the next decade, the
researchers say.
“Although total genomic data could far exceed the demands for
the others, with the right new innovations the net requirements
could be similar to the domains of astronomy and YouTube,”
according to the report.
The most practical, and perhaps only, solution for distributing
genome sequences at a population scale, the researchers say, is
to use “cloud-computing systems that minimize data movement
and maximize code federation.”
The report adds that new developments from companies like Google,
Amazon, and Facebook that include applications designed to
“fit the frameworks of distributed computing efficient data
centers and distributed storage and cloud computing
paradigms” are also expected to be part of the solution.
Last but not least, authentication, encryption, and other
security safeguards “must be developed” to ensure that
genomic data remain private, the researchers conclude.