Abstract |
Compared to conventional data storage media, DNA has several advantages, including high data density, energy efficiency, longevity, and ease of generating copies. However, challenges arising from the prevalence and variety of errors in the DNA data storage pipeline, which include substitutions, duplications, insertions, and deletions, must be addressed. This paper focuses on simultaneously correcting an arbitrary number of short tandem duplications and at most $p$ substitutions, where a short tandem duplication error consists of inserting a copy of a substring of length at most 3 immediately after it. Interacting with tandem duplications, the substitutions may affect segments of unbounded lengths in the stored sequence. However, if the codewords are irreducible, i.e., they do not have any short tandem repeats, the problem can be cast as correcting edits in at most $p$ substrings of bounded lengths. We construct irreducible codes with a structure that allows identifying where edit errors have occurred, which are then corrected using an MDS code. The rate of the proposed code correcting duplications and at most $p$ substitutions, when $\log p=o(\log n)$, is shown to be at least $\log(q-2) (1-o(1))$, where $q$ is the alphabet size and $n$ is the length of the code.
|