Reoptimization of the Consensus Pattern Problem under Pattern Length Modification

Jhoirene B. Clemente1*, Proceso L. Fernandez Jr.2, Richelle Ann B. Juayong1,
Jasmine A. Malinao1, Ivy D. Ordanel1, and Henry N. Adorna1

1Department of Computer Science, University of the Philippines Diliman,
Quezon City, NCR 1101 Philippines
2Department of Information Systems and Computer Science, Ateneo de Manila University,
Quezon City, NCR 1108 Philippines

*Corresponding author: jbclemete@upd.edu.ph

 

ABSTRACT

In Bioinformatics, finding conserved regions in genomic sequences remains to be a challenge not just because of the increasing size of genomic data collected but because of the hardness of the combinatorial model of the problem. One problem formulation is called the Consensus Pattern Problem (CPP). Given a set of t n-length strings S = {S1,…, St} defined over some constant size alphabet Σ and an integer l, where l ≤ n, the objective of CPP is to find an l-length string v and a set of l-length substrings si of each Si in S such that the total sum of d(si, v) is minimized for all 1 ≤ i ≤ t. Here d(x, y) denotes the Hamming distance between the two strings x and y. It is known that CPP is NP-hard i.e., unless P = NP, there is no polynomial-time algorithm that produces an optimal solution for CPP. In this study, we investigate a combinatorial setting called reoptimization in finding an approximate solution for this problem. We seek to identify whether a specific additional information can help in solving CPP. Specifically, we deal with the following reoptimization scenario. Suppose we have an optimal l-length consensus substring of a given set of sequences S. How can this information be beneficial in obtaining an (l + k)-length and (l – k)-length consensus for S? In this paper, we show that the reoptimization variant of the problem is still computationally hard even with k = 1. In response, we present four algorithms that make use of the given optimal solution – we prove that the first three algorithms produce solutions with quality that is bounded from above by an additive error that grows as the parameter k increases, while the fourth algorithm achieves a guaranteed approximation ratio. It has been shown that there is no efficient polynomial-time approximation scheme for CPP (Boucher 2015). In this paper, we show that we can save 𝒕(𝒏−(𝒍+𝒌)+𝟏)(𝒕(𝒍+𝟐𝒌)𝒓) steps in computation from the original running time of the known polynomial-time approximation scheme for CPP.