Abstract
Storage needs represent a significant burden on the economy and the environment. Some of this burden can potentially be offset by higher-density molecular storage. The potential of using DNA for storing data is immense. DNA can be harnessed as a high-density, durable archiving medium for compressing and storing the exponentially growing quantities of digital data that mankind generates. Several studies have demonstrated the potential of DNA-based data storage systems. These include exploration of different encoding and error correction schemes and the use of different technologies for DNA synthesis and sequencing. Recently, the use of composite DNA letters has been demonstrated to leverage the inherent redundancy in DNA-based storage systems to achieve higher logical density, offering a more cost-effective approach. However, the suggested composite DNA approach is still limited due to its sensitivity to the stochastic nature of the process. Combinatorial assembly methods were also suggested to encode information on DNA in high density, while avoiding the challenges of the stochastic system. These are based on enzymatic assembly processes for producing the synthetic DNA.
In this paper, we propose a novel method to encode information into DNA molecules using combinatorial encoding and shortmer DNA synthesis, in compatibility with current chemical DNA synthesis technologies. Our approach is based on a set of easily distinguishable DNA shortmers serving as building blocks and allowing for near-zero error rates. We construct an extended combinatorial alphabet in which every letter is a subset of the set of building blocks. We suggest different combinatorial encoding schemes and explore their theoretical properties and practical implications in terms of error probabilities and required sequencing depth. To demonstrate the feasibility of our approach, we implemented an end-to-end computer simulation of a DNA-based storage system, using our suggested combinatorial encodings. We use simulations to assess the performance of the system and the effect of different parameters.
Our simulations suggest that our combinatorial approach can potentially achieve up to 6.5-fold increase in the logical density over standard DNA based storage systems, with near zero reconstruction error.
Implementing our approach at scale to perform actual synthesis requires minimal alterations to current technologies. Our work thus suggests that the combination of combinatorial encoding with standard DNA chemical synthesis technologies can potentially improve current solutions, achieving scalable, efficient and cost-effective DNA-based storage.
1 Introduction
DNA is a promising candidate to serve as storage media for long-term data archiving due to its high information density, long-term stability, and robustness. In recent years, several studies have demonstrated the use of synthetic DNA for storing digital information on a megabyte scale, exceeding the physical density of current magnetic-tape based systems by roughly six orders of magnitude [1] [2].
Efforts in the field of DNA-based storage are mainly focused on using standard DNA synthesis and sequencing technologies, applying various encoding schemes to reduce error rate and ensure reliability [1] [2] [3] [4] [5] [6] [7]. Yet, despite the enormous benefits potentially associated with capacity, robustness, and size, existing DNA-based storage technologies create information redundancy. This is due to the nature of DNA synthesis and sequencing methodologies, which process multiple molecules that represent the same DNA sequence in parallel. Recent studies suggested exploiting this redundancy to increase the system’s logical density, by extending the standard DNA alphabet using composite letters and thereby encoding more than 2 bits per letter [8] [9] [10].
A composite DNA letter uses all four DNA bases (A, C, G, and T), combined or mixed in a specified predetermined ratio σ = (σA, σC, σG, σT). A resolution parameter k = σA + σC + σG + σT is defined to control the alphabet size. The full composite alphabet of resolution k, denoted Φk, is the set of all σ = (σA, σC, σG, σT) such that ∑i∈{A,C,G,T} σi = k. In order to “write” a composite letter onto a specific predefined position in a DNA sequence, the multiple molecules representing the custom encoded sequence must be synthesized while preserving the desired ratio between the different letters across all the molecules. To this end, current synthesis technologies are utilized, which produce multiple copies and thereby allow for the implementation of the mixtures. Using the extended composite alphabet increases the logical density of DNA-based storage, breaking the theoretical limit of two bits per synthesis cycle, as has been previously demonstrated [9] [10].
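To make the size of the composite alphabet concrete, the following minimal Python sketch enumerates Φk by brute force. The function name `composite_alphabet` is illustrative and does not come from the cited works, and the sketch assumes that ratios with zero entries are allowed. The count is compared with the standard stars-and-bars formula C(k + 3, 3).

```python
from itertools import product
from math import comb

def composite_alphabet(k):
    """All ratios (sA, sC, sG, sT) of non-negative integers that sum to k."""
    return [s for s in product(range(k + 1), repeat=4) if sum(s) == k]

# Resolution k = 6: compare the enumerated size with C(k + 3, 3).
phi_6 = composite_alphabet(6)
print(len(phi_6), comb(6 + 3, 3))
```

Both printed numbers agree, which illustrates how quickly the alphabet grows with the resolution parameter k.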
A recent study shows the power of using combinatorial assembly to encode information in DNA. Using proprietary high-throughput machinery based on enzymatic assembly, the authors encoded information at a write speed of a megabit per second. As a proof of principle, they wrote a 25 KB message into DNA and read it using commercially available nanopore sequencing technology [11] [12].
In this paper, we present a novel approach to encode information in DNA using combinatorial encoding and shortmer DNA synthesis. The method described herein leverages the advantages of combinatorial encoding schemes, while relying on existing DNA chemical synthesis methods with some modifications. Using shortmer DNA synthesis also minimizes the effect of synthesis and sequencing errors. We formally define shortmer-based combinatorial encoding schemes, explore different designs, and analyze their performance. One such design is a variation of combinatorial encoding, wherein the size of the subset is fixed as part of the design, allowing for higher confidence in the reading process. We use computer-based simulations to demonstrate an end-to-end DNA data storage system based on combinatorial shortmer encodings, and study its performance. Finally, we discuss the potential of combinatorial encoding schemes and the future work required to enable these schemes in DNA-based data storage systems.
2 Results
2.1 Shortmer combinatorial encoding for DNA storage
To improve the capacity of DNA-based storage systems, we suggest a novel method that not only effectively extends the DNA alphabet but also ensures a near-zero error rate. This encoding scheme is based on the following principles:
A set Ω of DNA k-mers that will serve as building blocks for the encoding scheme. Elements in Ω are designed to be sufficiently different from each other to minimize mix-up error probability. Hence, the set is designed to satisfy d(Xi, Xj) ≥ d, so that P(read Xi|write Xj) ≤ a, ∀Xi, Xj ∈ Ω, i ≠ j, where d(x, y) is the Hamming distance between x and y. Note that N = |Ω| ≤ 4^k. Table 4 demonstrates several examples of such sparse sets.
DNA synthesis using the k-mers in Ω, which can be non-standard reagents. This is compatible with current DNA synthesis methods. The synthesis produces mixtures of k-mers from Ω in each position, similar to the synthesis of composite DNA letters [9].
A large combinatorial alphabet Σ in which every letter is defined by a subset S of the k-mers in Ω. A letter σ ∈ Σ (representing a subset S) is an N-dimensional binary vector, and the indices σi = 1, 1 ≤ i ≤ N represent the k-mers included in the subset S. For example, σ = (0, 1, 0, 1, 1, 1) means that S = {X2, X4, X5, X6} (and |Ω| = N = 6). To write a combinatorial letter σ in a specific position, we synthesize a mixture of the four k-mers included in the subset S. To infer a combinatorial letter σ, a set of reads needs to be analyzed to determine which k-mers are observed (See Sections 2.2 and 2.3, for more details).
DNA barcodes that allow the grouping of reads representing the same sequence, for the inference of the combinatorial sequence. Our barcode length will be predefined as bc, representing optional barcodes in {A, C, G, T}^bc.
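The first and third principles can be sketched in a few lines of Python. This is only an illustration: the greedy construction below is an assumption, not the procedure used to build the sets in Table 4, and the names `greedy_kmer_set` and `letter_to_subset` are hypothetical. The sketch builds a trimer set with pairwise Hamming distance of at least 2 and maps a binary indicator vector σ to the subset S it represents.

```python
from itertools import product

def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def greedy_kmer_set(k, d):
    """Scan all k-mers in lexicographic order and keep each one that is
    at Hamming distance >= d from every k-mer kept so far."""
    omega = []
    for kmer in map("".join, product("ACGT", repeat=k)):
        if all(hamming(kmer, x) >= d for x in omega):
            omega.append(kmer)
    return omega

def letter_to_subset(sigma, omega):
    """Map a binary indicator vector sigma to the set of k-mers it selects."""
    return {x for bit, x in zip(sigma, omega) if bit}

omega = greedy_kmer_set(k=3, d=2)
print(len(omega), omega)

# The indicator vector from the text, zero-padded to length N = |omega|.
sigma = (0, 1, 0, 1, 1, 1) + (0,) * (len(omega) - 6)
print(sorted(letter_to_subset(sigma, omega)))
```

The particular trimers found this way need not coincide with those in Table 4; the point is only that the distance constraint is easy to verify programmatically.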
The extended combinatorial alphabets allow for higher logical density of the DNA-based storage system, while minimizing error rates. Figure 1 depicts a complete workflow of DNA-based storage with combinatorial shortmer encoding. This includes the following steps:
Combinatorial message encoding. A binary message is encoded using a large k-mer combinatorial alphabet (e.g., trimer-based alphabet of size |Σ| = 4096 letters, with N = |Ω| = 16). The complete message is broken into sequences of set length, each sequence is then marked with a standard DNA barcode and translated using the table presented in the Encode Legend (See Section 2.2, for details about the Binomial Encoding).
Error correction. 2D Systematic Reed-Solomon (RS) encoding is used for error correction. First, the barcode is encoded using RS(6,8) over GF(2^4) and the payload is encoded using RS(120,134) over GF(2^12). Next, the columns of each block of 42 sequences are encoded using RS(42,48) over GF(2^12) [13] [14].
DNA synthesis. DNA molecules pertaining to the designed sequences are synthesized using combinatorial k-mer DNA synthesis (See Figure 2). In each position of the payload, each DNA molecule will contain one of the trimers included in the designed combinatorial letter σ.
DNA sequencing. DNA molecules are sequenced using any DNA sequencing technology.
Combinatorial sequence reconstruction. The sequenced reads are grouped by the barcode sequence, and the combinatorial sequences are recovered (See Sections 2.3 and 3.1).
Error correction decoding. The recovered sequences are decoded to correct errors. The barcode, payload, and each block of sequences (48, in our example) are decoded separately.
Binary message decoding. The resulting combinatorial sequence is decoded back to a binary message.
The combinatorial shortmer encoding scheme is potentially based on using the standard phosphoramidite chemistry synthesis technology, with some alterations [15] [16] [17] [18] [19] [20]. Figure 2 demonstrates the required alterations:
The use of DNA shortmers as building blocks. The DNA print head will be fed from more than four cartridges, each containing a different building block (from Ω).
Synthesis of mixed building blocks in a single cycle. Each cycle in the synthesis process will be divided into two steps. First, all the desired building blocks will be added to a designated mixing chamber, and only then will the mixture of all desired building blocks be introduced (e.g., by injection) to the elongating molecules.
Specifically, Figure 2 exemplifies the synthesis of a message encoded using binomial encoding, with N = 16, K = 3, bc = 2. Note the N + 4 = 20 (4 for the standard nucleotides) cartridges on the print head and the designated mixing chamber. The synthesized (combinatorial) DNA sequence is ATσ39σ270, with AT being the barcode and σ39σ270 the payload, where σ39 = {AAT, AGC, CGT} and σ270 = {CAC, ATG, CTA}. As Figure 2 shows, first, the standard DNA letter barcode of length 2 (namely AT here) is synthesized (steps 1-5). Next, a single combinatorial letter is synthesized (steps 6-10). This includes K steps of adding trimers to the mixing chamber (steps 6-8), mixing (step 9) and, finally, phosphoramidite elongation (step 10). Then, a second combinatorial letter is synthesized (steps 11-15). At the end of the process, we can cleave the synthesized molecules from the surface (step 16).
2.2 Binary and binomial combinatorial alphabets
The main parameter that defines a combinatorial encoding scheme is the alphabet Σ or, more specifically, the set of valid subsets of Ω that can be used as letters. Here we define two general approaches for the construction of Σ: the binomial encoding and the binary encoding. We start with the former.
In the binomial encoding scheme, only subsets of Ω of size exactly K represent valid letters in Σ, so that every letter σ ∈ Σ consists of exactly K k-mers. This yields an effective output alphabet of size $|\Sigma| = \binom{N}{K}$ letters. Therefore, all the letters in the alphabet have the same Hamming weight: w(σ) = K, ∀σ ∈ Σ. This way, an r-bit binary message will require $\left\lceil r / \left\lfloor \log_2 \binom{N}{K} \right\rfloor \right\rceil$ synthesis cycles (and a DNA molecule of length $k \left\lceil r / \left\lfloor \log_2 \binom{N}{K} \right\rfloor \right\rceil$). Equivalently, every single letter in the output alphabet encodes $\left\lfloor \log_2 \binom{N}{K} \right\rfloor$ bits. Note that this calculation ignores the overhead caused by the error correction redundancy.
For example, consider a binary message encoded using a binomial encoding alphabet Σ, derived from using trimer building blocks, a set Ω of size N = 10 and K = 5 distinct trimers for each letter in Σ. In this case the alphabet size is $|\Sigma| = \binom{10}{5} = 252$. To encode whole blocks of bits, a subset of 2^7 = 128 letters is used, so that a single letter encodes 7 bits. A message of length r = 10 KB can be synthesized with $\lceil r/7 \rceil$ synthesis cycles, i.e., 11,429 cycles if 10 KB is taken as 80,000 bits. If we use trimers and 120 synthesis cycles to yield oligos of length 120 * 3 = 360nt, we will need approximately 100 such oligos.
Table 1 demonstrates the binomial alphabet (also used in Figure 1) with N = |Ω| = 16 and K = |S| = 5. The effective alphabet has $\binom{16}{5} = 4368$ letters. Since 4096 ≤ 4368, each synthesized letter encodes 12 bits.
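As an illustration of binomial encoding with N = 16 and K = 5, the sketch below maps 12-bit chunks to 5-subsets of {0, …, 15} and back. The mapping used here (the first 4096 subsets in lexicographic order) is an assumption made for the example; the actual assignment in Table 1 may differ. The names `encode` and `decode` are hypothetical.

```python
from itertools import combinations, islice
from math import comb

N, K = 16, 5
# floor(log2(C(N, K))) computed with integers only: 12, since 4096 <= 4368.
bits_per_letter = comb(N, K).bit_length() - 1

# Assumed mapping: the first 2**bits_per_letter K-subsets in lexicographic order.
letters = list(islice(combinations(range(N), K), 2 ** bits_per_letter))
index_of = {s: i for i, s in enumerate(letters)}

def encode(bits):
    """Split a bit string into 12-bit chunks and map each chunk to a K-subset."""
    assert len(bits) % bits_per_letter == 0, "pad the message first"
    return [letters[int(bits[i:i + bits_per_letter], 2)]
            for i in range(0, len(bits), bits_per_letter)]

def decode(subsets):
    """Inverse of encode: map K-subsets of indices back to a bit string."""
    return "".join(format(index_of[tuple(sorted(s))], f"0{bits_per_letter}b")
                   for s in subsets)

message = "000000000001" + "111111111111"
encoded = encode(message)
print(encoded)
print(decode(encoded) == message)
```

Each index in a subset stands for one trimer of Ω, so a 24-bit message becomes two combinatorial letters, i.e., two synthesis cycles.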
In the binary encoding scheme, all possible nonempty subsets of Ω represent valid letters in the alphabet. This yields an effective alphabet of size $|\Sigma| = 2^N - 1$ letters. This way, an r-bit binary message will require $\lceil r/(N-1) \rceil$ synthesis cycles (yielding a DNA molecule of length $k \lceil r/(N-1) \rceil$). Every single letter in the output alphabet encodes N – 1 = ⌊log2(|Σ|)⌋ bits. Note that this calculation ignores the overhead caused by the error correction redundancy.
For example, consider a binary message encoded using a binary encoding alphabet Σ, derived from using trimer building blocks and a set Ω of size N = 10. In this case, the alphabet size is |Σ| = 2^10 – 1 = 1023. The basic block capacity encodes 9 bits in every letter.
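The per-letter capacities of the two schemes can be compared with a short calculation. The sketch below is only illustrative; the function names are hypothetical, and error correction overhead is ignored, as in the text.

```python
from math import comb

def bits_per_letter(scheme, N, K=None):
    """floor(log2(|alphabet|)) for the binomial or the binary scheme."""
    size = comb(N, K) if scheme == "binomial" else 2 ** N - 1
    return size.bit_length() - 1

def synthesis_cycles(r_bits, scheme, N, K=None):
    """Number of combinatorial letters needed for an r-bit message."""
    b = bits_per_letter(scheme, N, K)
    return -(-r_bits // b)  # integer ceiling division

for scheme, N, K in [("binomial", 10, 5), ("binomial", 16, 5),
                     ("binary", 10, None), ("binary", 20, None)]:
    print(scheme, N, K, bits_per_letter(scheme, N, K),
          synthesis_cycles(80_000, scheme, N, K))
```

For comparison, standard DNA encoding stores at most 2 bits per synthesis cycle, so an 80,000-bit message needs at least 40,000 cycles.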
2.3 Reconstruction probabilities for binomial encoding
With binomial encoding, it is possible to collect reads and stop after observing all the K distinct k-mers at every position. This simplifies the analysis of the reconstruction probability and reduces error rates, making it the preferred combinatorial encoding system.
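Since every letter σ ∈ Σ consists of exactly the K participating k-mers, the number of reads required for observing at least one read of each k-mer follows the coupon collector distribution [21]. The number of reads required to achieve this goal can be described as a random variable $R = \sum_{i=1}^{K} R_i$, where R1 = 1 and $R_i \sim \mathrm{Geom}\left(\frac{K-i+1}{K}\right)$, i = 2,…, K. The expected number of required reads is then:
$$E[R] = \sum_{i=1}^{K} \frac{K}{K-i+1} = K \sum_{i=1}^{K} \frac{1}{i} = K \cdot H_K,$$
where $H_K$ is the K-th harmonic number.
The expected number of reads required for observing all the participating k-mers remains reasonable for the relevant values of K.
Using the independence of the Ri, we can derive that $\mathrm{Var}(R) = \sum_{i=1}^{K} \mathrm{Var}(R_i) \le \frac{\pi^2 K^2}{6}$. By Chebyshev's inequality, we get a (loose) upper bound on the probability of requiring more than E[R] + cK reads to observe at least one read of each k-mer:
$$P\left(R \ge E[R] + cK\right) \le P\left(|R - E[R]| \ge cK\right) \le \frac{\mathrm{Var}(R)}{c^2 K^2} \le \frac{\pi^2}{6c^2}.$$
When we examine an entire read of length l, assuming independence and not taking error correction into account, we get the following relationship between c and any desired confidence level 1 – δ:
$$\left(1 - \frac{\pi^2}{6c^2}\right)^{l} \ge 1 - \delta \iff c \ge \frac{\pi}{\sqrt{6\left(1 - (1-\delta)^{1/l}\right)}}.$$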
Table 2 presents several examples of the loose upper bound, derived as above, on the number of reads required to ensure, with probability 1 – δ, the reconstruction of a binomial message with K = 5 and a sequence of length l. As demonstrated in the simulations, these bounds are far from tight (Section 2.4).
Clearly, for reasonable values of K (i.e., K ≤ 10), a standard coverage of 100 reads per oligo yields low probabilities for missing one of the included k-mers. Note that with an online sequencing technology (e.g., nanopore sequencing) we can simply keep sequencing until K distinct k-mers have been confidently observed. The above bounds will then provide an estimate of the sequencing cost.
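The analysis above can be evaluated numerically with the sketch below. It assumes the standard coupon collector facts E[R] = K·H_K and Var(R) ≤ π²K²/6, together with a Chebyshev bound applied independently at each of the l positions; the function names are hypothetical, and the numbers produced are not claimed to reproduce Table 2.

```python
from math import pi, sqrt

def expected_reads(K):
    """E[R] = K * H_K for a coupon collector with K equally likely k-mers."""
    return K * sum(1 / i for i in range(1, K + 1))

def chebyshev_reads(K, l, delta):
    """Loose per-position read budget E[R] + c*K.

    Assumes Var(R) <= (pi * K)**2 / 6 and independent positions, so that c
    is the smallest value with (1 - pi**2 / (6 * c**2))**l >= 1 - delta.
    """
    per_position_failure = 1 - (1 - delta) ** (1 / l)
    c = pi / sqrt(6 * per_position_failure)
    return expected_reads(K) + c * K

print(round(expected_reads(5), 3))
for l in (1, 10, 120):
    print(l, round(chebyshev_reads(5, l, delta=0.01), 1))
```

The budget grows with the sequence length l because every position must be covered simultaneously, which is one reason why the Chebyshev bound is so conservative.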
To take into account the probability of observing a k-mer that is not included in the designed set of K k-mers, we can require at least m > 1 reads of each of the K k-mers to be observed. In this case, the derivation of the number of required reads is not as trivial, but it is expected to be approximated by E[R] ≅ K(log(K) + (m – 1) loglog(K)) [21]. Again, we obtain reasonable numbers for relevant values of K and m, and can use the selective nanopore approach to guide the process, avoiding reconstruction issues.
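When a closed-form expression is inconvenient, the required number of reads can be estimated by simulation. The following Monte Carlo sketch assumes that all K k-mers are equally likely and that there are no mix-up errors; the function names, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import random

def reads_until_covered(K, m=1, rng=random):
    """Draw k-mers uniformly until each of the K k-mers was seen >= m times."""
    counts = [0] * K
    missing = K          # k-mers that have not yet reached m observations
    reads = 0
    while missing:
        i = rng.randrange(K)
        counts[i] += 1
        reads += 1
        if counts[i] == m:
            missing -= 1
    return reads

def mean_reads(K, m, trials=5000, seed=0):
    """Monte Carlo estimate of the expected number of reads."""
    rng = random.Random(seed)
    return sum(reads_until_covered(K, m, rng) for _ in range(trials)) / trials

for m in (1, 2, 3):
    print(m, round(mean_reads(5, m), 2))
```

For m = 1 the estimate should be close to K·H_K, which is about 11.4 for K = 5; larger m increases the required number of reads only moderately.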
This analysis is based on neglecting mix-up errors (i.e., there are no incorrect k-mer readings). This assumption is based on the near-zero mix-up probability that is attained by the construction of Ω with a minimal Hamming distance, see Table 4.
2.4 Simulation of an end-to-end combinatorial shortmer storage system
To demonstrate our suggested encoding approach, we created an in-silico end-to-end system based on combinatorial shortmer encoding, simulated combinatorial DNA synthesis, simulated DNA sequencing, and message decoding (See Section 3). We simulated systems with different binomial alphabets and error probabilities, and measured the resulting reconstruction and decoding rates. Figure 3 depicts a schematic representation of our simulation workflow and indicates how the error rates are calculated.
The results of the simulation runs are summarized in Figure 4 and Figure 5. Each run included 30 repeats with random input texts of 10 KB each. The results presented are for an alphabet of size |Σ| = 4,096 (N = |Ω| = 16, K = 5), a barcode of length 12nt with 4nt of RS, and a payload of length 120 trimers with 14 trimers of payload RS. The crosswise RS consists of 6 trimers for every block of 42 payloads, in any given position. For each barcode, 1000 copies were synthesized (simulated). Errors are then introduced into the resulting sequences to represent synthesis and sequencing errors, as expected in actual usage [22].
The following is a summary of the conclusions from the simulation study:
As expected, higher synthesis and sequencing error probabilities produce a lower reconstruction rate.
Smaller samples of 10 and 20 reads per barcode did not allow for full reconstruction, even with zero error rate. Also, increasing the sampling rate results in better reconstruction. This demonstrates the crucial effect of random sampling on the overall performance of the system.
Substitution errors are easier to detect and correct than deletion and insertion errors. This is because substitution errors affect the nucleotide level rather than the trimer level. The minimal Hamming distance d = 2 of the trimer set Ω allows for the detection of single-base substitutions, since a single substituted base can never turn one trimer of Ω into another.
2D RS error correction significantly improved reconstruction rates.
3 Methods
3.1 Implementation of a combinatorial k-mer storage system
In this section, we address some of the process parameters used in our simulations of encoding, storing, and decoding. Table 3 presents a message encoding example.
The data are encoded using molecules of length 418nt. Each molecule contains a 16nt barcode (of which 4nt are dedicated to RS error correction) and a 402nt payload representing 402nt/3 = 134 trimers (of which 14 trimers are dedicated to RS error correction). To protect against sequence dropouts, every 48 molecules are treated as a single block for RS error correction.
1. Encoding
1.1. Data padding
To fit the binary data onto the molecules, its length must be divisible by the molecule size and the block size. If the division leaves a remainder, the data is padded with zeros to close this gap.
1.2. 2D-error correction using Reed-Solomon encoding
Reed-Solomon (RS) is used a total of three times. It is applied lengthwise on each sequence twice, first on the barcode sequence and then on the payload sequence. It is also used crosswise, across all the sequences in one block (See Section 3.2).
2. Synthesis and sequencing
2.1. Simulating the synthesis process
The synthesis of each combinatorial sequence was simulated separately. For a fixed sequence, we first draw the number of molecules that will represent it from X ~ N(μ = predetermined, variance σ^2 = 100). Let this number be x. Next, the k-mers occurring at each position (cycle) are generated. To do this, x members are selected from the subset representing the relevant letter σ. The size of this subset is K, so its members will most likely be represented many times. This random composition is achieved by drawing x independent times according to the discrete uniform distribution Y ~ U(1, K). The 1s in σ are indexed 1,…, K, and the appropriate k-mers are “synthesized” in accordance with the drawn indices.
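The synthesis simulation described above can be sketched as follows. This is a minimal illustration and not the simulation code used in this work: the function name `simulate_synthesis` is hypothetical, the barcode prefix is omitted, and rounding and clipping the normal draw to a non-negative integer is an added assumption.

```python
import random

def simulate_synthesis(letters, mean_copies, sd=10.0, rng=random):
    """Simulate the molecules synthesized for one combinatorial sequence.

    letters: one list of K k-mers (the subset S) per payload position.
    The number of molecules x is drawn from N(mean_copies, sd**2), with
    sd = 10 matching a variance of 100; at every position each molecule
    receives one k-mer drawn uniformly from the K k-mers of that letter.
    """
    x = max(0, round(rng.gauss(mean_copies, sd)))
    molecules = [[] for _ in range(x)]
    for subset in letters:
        for mol in molecules:
            mol.append(rng.choice(subset))
    return ["".join(mol) for mol in molecules]

# The two payload letters from the Figure 2 example.
payload = [["AAT", "AGC", "CGT"], ["CAC", "ATG", "CTA"]]
mols = simulate_synthesis(payload, mean_copies=1000, rng=random.Random(1))
print(len(mols), mols[:5])
```

Each simulated molecule carries exactly one trimer per position, so the designed subset is only visible across the whole population of molecules.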
2.2. Mixing
Once all of the molecules are synthesized, they are mixed to mimic real molecules in a container.
2.3. Error simulation
To replicate a real synthesis and sequencing process, several error scenarios were simulated. These include the three error types, deletion, insertion, and substitution of a letter in the sequence, each with a predefined error probability. A Bernoulli trial with success probability P, the predefined error rate, is then performed for every sequence and letter position to decide whether an error is introduced at that position. Substitution errors were implemented per nucleotide, whereas insertion and deletion errors were implemented on full k-mers. This method is closest to the expected error scenarios in combinatorial DNA synthesis and sequencing.
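A possible implementation of this error model is sketched below. It is an illustration under stated assumptions rather than the code used in this work: the function name `add_errors` is hypothetical, an inserted k-mer is assumed to be uniformly random and placed before the current k-mer, and a substituted base is assumed to be replaced by one of the three other bases with equal probability.

```python
import random

def add_errors(seq, k, p_sub, p_del, p_ins, rng=random):
    """Apply per-k-mer deletions and insertions and per-nucleotide substitutions."""
    out = []
    for i in range(0, len(seq), k):
        kmer = seq[i:i + k]
        if rng.random() < p_del:      # Bernoulli trial: delete the whole k-mer
            continue
        if rng.random() < p_ins:      # Bernoulli trial: insert a random k-mer
            out.append("".join(rng.choice("ACGT") for _ in range(k)))
        # One Bernoulli trial per nucleotide for substitutions.
        kmer = "".join(
            rng.choice([b for b in "ACGT" if b != base])
            if rng.random() < p_sub else base
            for base in kmer)
        out.append(kmer)
    return "".join(out)

rng = random.Random(7)
print(add_errors("AATCACAGCATG", k=3, p_sub=0.1, p_del=0.05, p_ins=0.05, rng=rng))
```

Because deletions and insertions act on whole k-mers, they preserve the trimer reading frame, whereas substitutions may produce trimers outside Ω.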
2.4. Reading and sampling
Several different samples were drawn, to analyze their impact on the accuracy of the data retrieved.
3. Decoding
3.1. Sequence retrieval
To retrieve the original sequence, first each sequence barcode undergoes RS error correction. Next, each sequence payload is reviewed individually, and undergoes RS, too. For sequences in the same block, RS is also done, crosswise on the block.
3.2. Grouping by barcode and determining σ in each position
Once barcode retrieval is complete, sequences are grouped by the same barcode. In each of the groups, all the sequences are reviewed at the same exact position, where we extract the K most common k-mers to determine the σ in that position. In the process of determining the K most common k-mers, we may encounter an invalid k-mer (not in Ω). Depending on the length of the payload sequence, the following steps are taken:
If the length of the sequence is equal to the predetermined length (Δ = SL*), the sequence is reviewed, and if an invalid k-mer that is not part of our alphabet is encountered, an Xdummy is inserted instead, followed by skipping 3nt.
If the length of the sequence is smaller than the predetermined length (Δ < SL*), this indicates a deletion in the sequence. We pad it with the dummy nucleotide R to restore it to the predetermined length, and then review the sequence.
If the length of the sequence is greater than the predetermined length (Δ > SL*), this indicates an insertion in the sequence. We truncate the sequence to restore it to the predetermined length, and then review the sequence.
* Δ – Current sequence length, SL – Predetermined sequence length
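The grouping and majority steps above can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: barcodes are assumed to be already corrected, short reads are padded at their end, dummy k-mers are excluded from the count, and ties among equally common k-mers are broken arbitrarily. The names `normalize`, `reconstruct`, and `DUMMY` are hypothetical.

```python
from collections import Counter, defaultdict

DUMMY = "XXX"  # hypothetical placeholder for an invalid k-mer (Xdummy in the text)

def normalize(payload, length):
    """Truncate long payloads; pad short ones with the dummy nucleotide 'R'."""
    return payload[:length].ljust(length, "R")

def reconstruct(reads, bc_len, n_letters, omega, K, k=3):
    """Group reads by barcode and infer the K most common k-mers per position."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[:bc_len]].append(normalize(read[bc_len:], n_letters * k))
    result = {}
    for barcode, payloads in groups.items():
        letters = []
        for pos in range(n_letters):
            counts = Counter()
            for p in payloads:
                kmer = p[pos * k:(pos + 1) * k]
                counts[kmer if kmer in omega else DUMMY] += 1
            counts.pop(DUMMY, None)  # assumption: dummy k-mers are never counted
            letters.append({x for x, _ in counts.most_common(K)})
        result[barcode] = letters
    return result

omega = {"AAT", "AGC", "CGT", "CAC", "ATG", "CTA"}
reads = ["ATAATCAC", "ATAGCATG", "ATCGTCTA", "ATAATCAC", "ATGGGCTA", "ATAGC"]
print(reconstruct(reads, bc_len=2, n_letters=2, omega=omega, K=3))
```

In this toy example the read containing the invalid trimer GGG and the truncated read ATAGC do not prevent both letters of barcode AT from being recovered.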
3.3. Missing barcode
After the reading process is complete, if a barcode is found missing, the missing barcode and a dummy sequence are added to enable Reed-Solomon to retrieve the data correctly.
4. Validation
3.2 Data Padding and Error Correction Using 2D Reed Solomon codes
The binary message is first padded, so that its total length is divisible by r (r is the number of bits per letter). Next it is encoded into a combinatorial message. The combinatorial message is broken down into sequences of length l, and another padding is done to complete a block of B – 1 full sequences. The padding information is included in the final single combinatorial sequence to complete a block of B sequences.
For the barcode sequence, a systematic (6, 8) RS code over GF(2^4) was used to transform the unique 12nt barcode to a 16nt sequence. The 120 combinatorial letter payload sequence was encoded using a (120, 134) RS code over GF(2^9), GF(2^12), or GF(2^13), matching the binomial alphabets that encode 9, 12, and 13 bits per letter, respectively. This resulted in sequences of length 134 combinatorial letters + 16nt barcodes. Using trimers, the (theoretical) resulting oligos have an overall length of 418nt.
To protect against sequence dropouts, we used RS error correction on the columns of the matrix (See Figure 6). In each block of 42 sequences, we apply a (42,48) RS code over GF(2^9), GF(2^12), or GF(2^13), again matching the binomial alphabets that encode 9, 12, and 13 bits per letter, respectively. This is applied in each column separately.
Figure 6 demonstrates the encoding of ~0.1 KB using the following parameters:
A (3,5) RS code over GF(4^2) for the barcodes.
A (12, 18) RS code over GF(2^9) for the binomial alphabet payload sequence.
A 10-sequence block encoded, column wise, using a (10,15) RS code over GF(2^9).
The 824 bits are first padded to 828 = 92 * 9 bits. The 92 combinatorial letter message is split into 7 sequences of 12 letters and an additional sequence of 8 letters. Finally, a complete block of 10 sequences (total of 10 * 12 = 120 letters) is created by padding with one additional sequence of 12 letters and including the padding information as the last sequence.
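The padding arithmetic can be checked with a short calculation. The sketch below reflects one reading of the padding rule in Section 3.2, namely that a single padding-information sequence closes the final block; the block size B = 10 is taken from the (10,15) column code listed above, and the function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def padding_plan(n_bits, bits_per_letter, l, B):
    """Padding needed to fit n_bits into blocks of B sequences of l letters."""
    n_letters = ceil_div(n_bits, bits_per_letter)
    data_seqs = ceil_div(n_letters, l)            # the last one may be partial
    total_seqs = ceil_div(data_seqs + 1, B) * B   # +1 for the padding-info sequence
    return {
        "padded_bits": n_letters * bits_per_letter,
        "letters": n_letters,
        "data_sequences": data_seqs,
        "letter_padding": data_seqs * l - n_letters,
        "padding_sequences": total_seqs - data_seqs - 1,
        "total_sequences": total_seqs,
    }

plan = padding_plan(n_bits=824, bits_per_letter=9, l=12, B=10)
print(plan)
```

For the 824-bit example, this gives 828 padded bits, 92 letters in 8 sequences (the last one padded by 4 letters), and one full padding sequence before the padding-information sequence.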
4 Discussion
4.1 Information capacities for selected encodings
Table 3 demonstrates the encoding of a 1 GB input file, with standard encoding and two combinatorial encoding schemes, binomial and binary, using six different alphabets. In the binomial encoding scheme, three different trimer alphabets were used. In the binary encoding scheme, three different trimer sets of sizes 10, 16, and 20 were used. All calculations are based on error correction parameters similar to those previously described [9] [10] [2] [14].
With these alphabets, up to 9.5-fold and 6.5-fold increases in information capacity are achieved per synthesis cycle and per DNA base, respectively, compared with standard DNA-based storage. We note that the 6.5-fold increase obtained using the binary coding also guarantees near-zero reconstruction error.
We briefly touched on analytically evaluating reconstruction errors when using the schemes proposed herein. Further research is required to obtain better estimates of the theoretical reconstruction rates or Levenshtein distances.
Finally, it is important to note that none of the methods described here has been tested in the lab for actual synthesis. This is mostly due to hardware limitations. Modifications to the existing phosphoramidite DNA synthesis machinery need to include a mixing chamber [25], such as the one depicted in Figure 2. They should also include access to multiple feeding containers, holding the N = |Ω| reagents necessary for the desired alphabet design.
6 Supplement
Table 4 presents examples of k-mer sets. On the left are two sets of trimers that have a minimal Hamming distance of 2, with |Ω1| = 16 and |Ω2| = 12. On the right is a set of 54 6-mers that have a minimal Hamming distance of 4.