Nature’s Databank – Engineering Education ASEE Prism

Researchers look to DNA, the essential blueprints of life, to capture a digital-age deluge of information.

By Charles Q. Choi

We’re drowning in data. Every digital photo and video taken, every e-mail and text message sent, every record on government and corporate servers, every file on computers and the Internet, every bit of data collected by sensors worldwide through the Internet of Things—all of it threatens to overwhelm current technologies used to save data. Whether it is stored magnetically on hard disks, optically on CDs and DVDs, or in a solid-state format such as flash memory, the world’s data is growing so fast that describing it requires an expanded set of measurement terms. This year, it is expected to reach 44 zettabytes—the equivalent of 44 trillion gigabytes—and grow to 175 zettabytes by 2025, according to market analysts at the International Data Corporation. If we were to rely solely on flash memory, by 2040, the materials needed to store the amount of data humanity is predicted to generate would exceed the world’s expected supply of microchip-grade silicon by 10 to 100 times.

And if you think the “cloud”—powerful remote networks of interconnected machines—has solved the problem, think again: The rows and rows of computers—Internet-connected servers—filling warehouse-size spaces require large tracts of land and vast amounts of power and water. “Demand for data storage is growing exponentially, but the capacity of existing storage media is not keeping up,” says Karin Strauss, a principal research manager at Microsoft. Moreover, all mainstream techniques for storing data break down over time, requiring regular replacement. For example, most digital archives are currently saved on magnetic tapes, which can store data for one-sixth the cost of hard disks but typically lose viability after about 30 years. Conventional data storage techniques also run the risk of becoming obsolete (think floppy disks and VCR tapes), which would leave today’s files unreadable by future generations.

But help is on the horizon. Drawing on the way nature—life itself—has recorded and stored information for billions of years, scientists are using strands of DNA to develop a radically different way of storing seemingly limitless quantities of data.

Immense Capacity

DNA can store an extraordinary amount of data—up to one exabyte, or 1 billion gigabytes, per cubic millimeter. “That’s something like 10 million to 1 billion times greater data density than any electronic media projected to come online in the future would be,” says James Tuck, a professor of electrical and computer engineering at North Carolina State University. In theory, just a few kilograms of DNA could store all of humanity’s data.
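To get a feel for that density, a rough back-of-the-envelope calculation helps. The sketch below assumes the upper-end figure of one exabyte per cubic millimeter and the 175-zettabyte forecast for 2025 cited earlier. It ignores any overhead for error correction, indexing, or packaging, so it is a best case rather than an engineering estimate.

```python
# Back-of-the-envelope volume estimate, using figures quoted in the article.
EXABYTE = 10**18    # bytes
ZETTABYTE = 10**21  # bytes

density_bytes_per_mm3 = 1 * EXABYTE  # upper-end DNA density from the article
world_data_bytes = 175 * ZETTABYTE   # forecast for 2025 from the article

volume_mm3 = world_data_bytes // density_bytes_per_mm3
volume_liters = volume_mm3 / 1_000_000  # 1 liter = 1,000,000 cubic millimeters

print(volume_mm3)     # 175000
print(volume_liters)  # 0.175
```

Under those assumptions, the entire forecast would occupy well under a liter.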

Each strand of DNA is made of strings containing four different kinds of molecules known as nucleotides—adenine, thymine, cytosine, and guanine, abbreviated as A, T, C and G. In electronics, data is typically encoded in series of binary digits, or bits, of zeroes and ones; in DNA, the pairs 00, 01, 10, and 11 can, for instance, be encoded as A, T, C and G. Researchers can synthesize DNA with whatever patterns they want and use DNA sequencing technology to read out the data. The DNA can then be stored someplace dark, cool, and dry, possibly in the form of a dry powder or encapsulated in glass beads.
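The two-bits-per-nucleotide idea can be sketched in a few lines of Python. This is only an illustration of the example mapping given above (00, 01, 10, and 11 as A, T, C, and G). The function names are made up for this sketch, and practical schemes add error correction and avoid sequences that are hard to synthesize or read.

```python
# Example mapping from the text: 00 -> A, 01 -> T, 10 -> C, 11 -> G.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of nucleotides, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse encode(): every four bases become one byte again."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # TACATCCT
print(decode(strand))  # b'Hi'
```

Each byte becomes four nucleotides, so the two-character message "Hi" turns into an eight-base strand.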

All in all, using DNA, the current contents of the entire accessible Internet—that is, everything not behind a password or some other barrier—could fit “in a shoebox or two,” Strauss says. Moreover, DNA is extremely durable. “Some DNA has managed to persist in less than ideal storage conditions for tens of thousands of years in mammoth tusks and bones of early humans,” she added. As a bonus, it has built-in protection against obsolescence—unlike a floppy disk—since people will always have an incentive to have technology that can read DNA due to its medical applications.

Signals from Space?

The notion that DNA could be used to store data may first have emerged from, of all things, a search for messages from aliens in outer space. In the late 1970s, two Japanese researchers, Hiromitsu Yokoo and Tairo Oshima, became intrigued by the possibility that DNA might serve as a channel used by another advanced civilization to send a message to Earth. Until then, the search for extraterrestrial communication had been limited to radio signals. But the pair argued in a 1979 paper that “biological media should not be neglected” as an information exchange system that aliens might favor, since these media “are free from such difficult problems as duration time, frequency and bandwidth, direction and directivity of antennas, and distance.”

Evidently straight-faced, the scientists examined the biology of the virus φX174. The first genome to be synthesized in a test tube, φX174 can infect and eventually destroy E. coli bacteria. They reasoned aliens might choose it for extraterrestrial communication because earthlings would be familiar with it. Ultimately, Yokoo and Oshima found “no significant pattern” in the φX174 virus pointing to a possible alien message, but their exploration inspired others to look at DNA in a new way: as a data repository. Among them was George Church, professor of genetics at Harvard Medical School. Looking back, Church suggests that a better place to look for DNA-implanted alien messages would be amoebas, since “their genomes are 500 times bigger than the human genome.”

As the cost of synthesizing DNA dropped over the decades, more scientists began tinkering with encoding data into DNA. In 1986, artist Joe Davis, along with Harvard geneticist Dan Boyd, crafted DNA encoding a simple image of a rune—an ancient Germanic symbol that predates use of the Latin alphabet. They chose the rune “algiz,” meaning life. In 2001, a team at Mount Sinai School of Medicine encoded the opening lines from Charles Dickens’s A Tale of Two Cities. In 2009, scientists at the University of Toronto encoded the text, music, and an image from “Mary Had a Little Lamb.”

Interest in the field exploded in 2012 when Church and his colleagues translated into DNA a 53,000-word-plus book on synthetic biology that Church had coauthored, as well as 11 images and a JavaScript program. This 650-kilobyte trove, at the time the largest amount of data artificially encoded into DNA, demonstrated that such a technique could vastly outperform mainstream data storage methods. In 2016, Microsoft and the University of Washington set a new record by storing 200 megabytes of data on DNA, including a music video and the top 100 books of Project Gutenberg. These successes don’t mean DNA is ready as a mainstream databank. “It’s a very unusual data storage system,” says Olgica Milenkovic, a professor of electrical and computer engineering at the University of Illinois at Urbana-Champaign. “Some of the problems that implementing it brings are really unknown in classical storage.” But, she adds: “That’s what makes it very exciting.”

Cost Cutting

The main challenge facing DNA data storage is reducing the cost of synthesizing DNA. Currently, the lowest cost for DNA synthesis is roughly $100 for a million base pairs, Church says. (A base pair is one of the pairs of nucleotides making up a double helix of DNA.) For the most promising initial commercial applications of DNA data storage, the cost of synthesis may have to drop at least two orders of magnitude, says Yaniv Erlich, chief science officer at the Israeli DNA ancestry firm MyHeritage. Still, “it’s feasible in the foreseeable future for the costs to go down by three or four orders of magnitude without any super new inventions—just clever engineering,” Erlich says. Church agrees, saying that DNA synthesis costs “could easily go down by a billion-fold.”

One way researchers are trying to reduce DNA synthesis costs is by copying nature. The conventional method for synthesizing DNA sequences, known as phosphoramidite chemistry, adds nucleotides onto strands of DNA one at a time. This slow, expensive process takes about six to 12 minutes per cycle of nucleotide addition. “It’s a very stop-and-go reaction,” says Henry Hung-Yi Lee, a postdoctoral fellow at Harvard Medical School. In contrast, nature synthesizes DNA all the time using enzymes known as polymerases. Adapting this to the lab could make DNA synthesis faster and cheaper. Lee is encouraged: “A lot of proteins have evolved to manipulate DNA,” he says. So is Sylvain Gariel, co-founder and chief operating officer of French biotechnology startup DNA Script in Paris, who says “the potential of this approach is going to be very great.” Right now his company can incorporate one nucleotide into a DNA strand every one or two minutes, but in nature, “the enzymes we use can incorporate tens or hundreds of nucleotides per second, so our aim is to go far beyond what we have now.”

A second challenge is that all DNA synthesis techniques are vulnerable to errors in writing data. However, conventional electronic data storage methods have long dealt with comparable problems, and researchers are adapting error-correction techniques from current industries—not only to fix errors but to improve DNA synthesis efficiency.

“I came from electrical engineering but have been working on genomics and synthetic biology for the past 15 years,” Lee says. He’s fascinated by the intersection between classical computing and molecular biology, and says “we can bring a lot of ideas from one to the other.” Lee and his colleagues have developed a DNA synthesis method that is much faster but less precise than standard DNA synthesis techniques; it tends to add a random number of nucleotides to each DNA strand. Although this would prove a problem if they relied on the precise sequence of nucleotides to encode data, the team instead relied on transitions from one nucleotide to another to represent data. For example, if a DNA strand switched from A to T, that might represent 0, while a switch from A to G might represent 1. With error-correction codes to help read this data and despite DNA synthesis error rates of more than 30 percent, “we could get to writing one base per second,” Lee says.
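A toy version of transition-based encoding shows why repeated nucleotides need not corrupt the message. The sketch below is not the team’s actual scheme. It assumes a made-up ordering of the four bases, chosen only so that A to T reads as 0 and A to G reads as 1 as in the example above. It also assumes a fixed starting base and a synthesis step that repeats each base one to four times. It leaves out error-correction codes entirely and does not handle bases that are dropped or substituted.

```python
import random
from itertools import groupby

ORDER = "ATGC"  # hypothetical ordering: one step ahead = 0, two steps ahead = 1
START = "A"     # assumed starting base, known to both writer and reader


def encode(bits: str) -> str:
    """Write each bit as a transition from the previous base to a new one."""
    strand = [START]
    for bit in bits:
        step = 1 if bit == "0" else 2
        strand.append(ORDER[(ORDER.index(strand[-1]) + step) % 4])
    return "".join(strand)


def noisy_synthesis(strand: str, rng: random.Random) -> str:
    """Mimic imprecise synthesis: each base is added one to four times."""
    return "".join(base * rng.randint(1, 4) for base in strand)


def decode(read: str) -> str:
    """Collapse runs of repeated bases, then read off the transitions."""
    collapsed = [base for base, _ in groupby(read)]
    bits = []
    for prev, cur in zip(collapsed, collapsed[1:]):
        step = (ORDER.index(cur) - ORDER.index(prev)) % 4
        if step not in (1, 2):
            raise ValueError(f"unexpected transition {prev}->{cur}")
        bits.append("0" if step == 1 else "1")
    return "".join(bits)


clean = encode("0110")
print(clean)  # ATCTG
noisy = noisy_synthesis(clean, random.Random(0))
print(decode(noisy))  # 0110
```

Because every bit forces a change of base, a run of identical bases can only be one symbol written several times, so the reader can collapse each run and still recover the message.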

File Retrieval

Another hurdle that DNA data storage faces is finding a way to retrieve a specific file from a collection of records without having to read every file in that collection, a feature known as “random access.” One way researchers have tried to overcome this problem is to attach short labels of about 20 nucleotides to data sequences. Matching DNA sequences can then bind to these labels to pick out the files they mark. However, there are only an estimated 30,000 unique labels of this kind available—too few for the vast archives for which DNA data storage is intended. To get around this problem, Tuck and others added two chemical handles to each DNA file. One binder sequence can grab onto one of these handles to select a fraction of files to examine, and another binder sequence can then lock onto the other handle to find the specific file in question from this initial selection. This increases the estimated number of potential file labels to roughly 900 million, Tuck says.

“It’s like having a postal address. You might use a city, state, and ZIP code to get an idea of where to go first, and then use the street address to find your final destination,” Tuck explains. “The downside of this is that adding these handles gives up space you could use for data, but the added benefits are worth it.”
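The jump from 30,000 labels to roughly 900 million follows from pairing two independent handles, since 30,000 times 30,000 is 900 million. The sketch below shows that arithmetic along with a toy two-step lookup. The handle sequences, file contents, and the `retrieve` function are invented for illustration, and the real selection happens chemically rather than in software.

```python
LABELS_PER_HANDLE = 30_000  # estimate quoted in the article

# Two independent handles multiply the number of distinct addresses.
print(LABELS_PER_HANDLE ** 2)  # 900000000

# Toy pool of files: (first handle, second handle, contents). All hypothetical.
pool = [
    ("AAGG", "CCTT", "cat photo"),
    ("AAGG", "GGAA", "tax records"),
    ("TTCC", "CCTT", "home video"),
]


def retrieve(pool, first_handle, second_handle):
    """Narrow the pool by the first handle, then pick by the second."""
    shortlist = [entry for entry in pool if entry[0] == first_handle]
    return [data for _, second, data in shortlist if second == second_handle]


print(retrieve(pool, "AAGG", "GGAA"))  # ['tax records']
```

As in the postal analogy, the first handle plays the role of the city and ZIP code, and the second plays the role of the street address.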

If DNA data storage becomes commercially viable, the most immediate applications will likely be to archive “very valuable data that needs to last for a long time,” such as government, corporate, scientific, and financial data, and medical and historical records, according to Strauss. It could also be used to store classic movies; in 2016, Technicolor collaborated with Church and his colleagues to encode the 1902 A Trip to the Moon, widely considered to be the first science fiction film. “Ultimately, any piece of data that needs persistence could be stored in DNA,” Strauss says.

If costs fall far enough, DNA storage could be incorporated into everyday objects. In what could be a scene from a spy thriller, Erlich and his colleagues encoded text and a short video into DNA and embedded it within transparent plexiglass, which they fashioned into lenses for a pair of reading glasses. The concealed information could be recovered from any tiny fragment of the plastic.

“You can’t put memories into materials with tapes or CDs,” notes Erlich. “DNA is virtually the only storage technology that doesn’t have a defined shape and that can be embedded in other types of material.” In another demonstration, he and his team showed they could 3-D-print a white plastic bunny loaded with silica beads, each of which encased DNA encoded with the blueprints for making the bunny. “Think of creating a medical implant that you could put all its information on—its size and shape, how it was manufactured—and if you move to a new country, you can sequence the DNA in it and retrieve that information,” Erlich says.

Moreover, Church and his colleagues are investigating ways to create “biological cameras” that can record visual, chemical, or other data directly as DNA. In 2017, they showed they could encode a digitized image of a human hand and even one of the first motion pictures ever made, that of a galloping horse, in the genomes of live bacteria by feeding them sequences of DNA. Afterward, they successfully read this data by extracting and analyzing the DNA from these cells.

In principle, one could envision incorporating such bacteria into paints or wallpaper that one could apply onto surfaces to record images, the presence of toxins, or other data. “We’re not just thinking about how to record digital data, but rethinking where data might come from,” Church contends. “We’re not limited by current recording devices.” A future application for DNA recorders might place them within the human body. Church and his colleagues already have tried this on mice, producing genetically modified rodents with cells that each maintained a record of its own development. “You could imagine the equivalent of a flight recorder inside every cell of the body, recording one’s personal history from egg to adult,” Church says.

If one could record all the activity from each neuron in a person’s brain, one might imagine duplicating a human brain, Church added. “A lot of people talk about uploading brains to silicon computers, but it might be much easier to duplicate biology than to remap a brain to a complete new modality, just as it might be easier to take a photo of the Mona Lisa than to capture its nuances in words,” Church says. Just as inventors of the automobile wanted more than a mechanical horse, he adds, DNA data storage should aim not to copy what electronics already do, but to launch its own revolution.

Charles Q. Choi is a New York-based freelance writer specializing in science.
Design by Nicola Nittoli