| Paper ID | D7-S6-T2.3 | 
    | Paper Title | Sketching and Sequence Alignment: A Rate-Distortion Perspective | 
	| Authors | Ilan Shomorony, University of Illinois at Urbana-Champaign, United States; Govinda Kamath, Microsoft Research New England, United States | 
  
    | Session | D7-S6-T2: Applications of Source Coding | 
  
    | Chaired Session: | Tuesday, 20 July, 23:40 - 00:00 | 
  
    | Engagement Session: | Wednesday, 21 July, 00:00 - 00:20 | 
  
    | Abstract | Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. A standard approach to speed up this task is to compute "sketches" of the DNA reads (typically via hashing-based techniques) that allow the efficient computation of pairwise alignment scores. We propose a rate-distortion framework to study the problem of computing sketches that achieve the optimal tradeoff between sketch size and alignment estimation distortion. We consider the simple setting of i.i.d. error-free sources of length n and introduce a new sketching algorithm called "locational hashing." While standard approaches in the literature based on min-hashes require B = (1/D) O(log n) bits to achieve a distortion D, our proposed approach only requires B = log^2(1/D) O(1) bits. This can lead to significant computational savings in pairwise alignment estimation. |