Question
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].Hash Table Bit Manipulation
Analysis
This question is straight-forward: just iterate thru all the 10-letter long subsequences, and then add into hashmap.
However, if you simply use String as hash key, it will exceed OJ run time.
Solution
Convert string to int is the right approach. To get the best performance, we use 2 bits to represent each DNA nucleotide. Integer is 32-bit long so, no problem with that.
Node that Integer.Max = 2,147,483,647. There’s total of 10 digitals, so using 1 digit to represent 1 letter is not feasible. Ref: BUPT-哲人善思
Code
The code is pretty difficult to write.
public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
List<String> res = new ArrayList<String>();
if (s == null || s.length() < 10) {
return res;
}
Map<Integer, Integer> map = new HashMap<Integer, Integer>();
// map<> stores int-to-int: first int is "ACGT" -> 1234
// second int is the init position of the pattern in s
int num = 0;
int mask = 0xFFFFFFFF;
mask >>>= (32 - 18);
for (int i = 0; i < s.length(); i++) {
num = mask & num;
// get the last 18 binary digits of num
num = (num << 2) | convertChatToInt(s.charAt(i));
// append num with the number representing charAt(i)
if (i >= 9) {
// insert or search num in the map
if (!map.containsKey(num)) {
map.put(num, (i - 9));
// (i - 9) is the start position of
// the 10-letter-long sequence
} else if (map.get(num) != -1) {
res.add(s.substring(map.get(num), map.get(num) + 10));
map.put(num, -1);
// after the sequence has been added, mark it in the map
} else {
// do nothing
}
}
}
return res;
}
int convertChatToInt(char ch) {
if (ch == 'A') {
return 1;
} else if (ch == 'C') {
return 2;
} else if (ch == 'G') {
return 3;
}
return 0;
}
}