In this practical work, we will program and study the Burrows Wheeler transform
In this work, we compare two strings iteratively, by comparing their characters by their so-called lexicographic order, (i.e., their alphabetic order A<B<C<....<Z). Notice that here we consider only the alphabet characters, and an "end of string" character, denoted \$ s.t. \\$ < A. Therefore, a string $S$ that is a prefix of a string $T$ is considered smaller: $S<T$.
Let $R(T,k)=T'$ be a function that takes a string $T$ and a integer $k$ as inputs and returns a string $T'$ that is a circular shift of string $T$, such that:
$T'[1,|T|-k] = T[k+1,|T|]$ and $T'[|T|-k + 1,|T|] = T[1,k]$
example:
for T = "dans l'herbe noire Les Kobolds vont$"
R(T,3) = "s l'herbe noire Les Kobolds vont$dan"
R(T,9) = "rbe noire Les Kobolds vont$dans l'he"
Example:
Original String: $S=$ ACATACAGATG\$
Sorted circular shifts:
\$ACATACAGATG
ACAGATG\$ACAT
ACATACAGATG\$
AGATG\$ACATAC
ATACAGATG\$AC
ATG\$ACATACAG
CAGATG\$ACATA
CATACAGATG\$A
G\$ACATACAGAT
GATG\$ACATACA
TACAGATG\$ACA
TG\$ACATACAGA
Burrow Wheeler Transform:
$BWT(S) = $ GT\$CCGAATAAA
Questions :
def circular_shift(text,k):
"""
Make a circular shift
Args:
text (str): string to be shifted
Returns:
str: shifted string
"""
return text[-k:]+text[:-k]
def BWT(text,verbose=False):
"""
Compute the BWT
Args:
text (str): string to be shifted
Return:
str: BWT
"""
if text[-1] != "$":
text += "$"
rotations = []
for k in range(1,len(text)+1):
circular_shifted_text = circular_shift(text,k)
rotations.append(circular_shifted_text)
rotations = sorted(rotations)
if verbose:
for circular_shifted_text in rotations:
print(circular_shifted_text)
bwt = "".join([r[-1] for r in rotations])
return(bwt)
text = "dans l'herbe noire les kobolds vont. le vent profond pleure, on veut croire."
shifted_text = circular_shift(text,10)
shifted_text
bwt = BWT(text)
bwt
BWT("ACATACAGATG",verbose=True)
from Bio import SeqIO
genome = list(SeqIO.parse("data/sarscov2_genome.fasta", "fasta"))[0]
BWT(str(genome.seq))
str(genome.seq)[:100]
The BWT is a algorithm that aims at preparing data to be compressed. In practice it rearranges a string into pieces of repeated characters, that can then be compressed using for example the run-length encoding. This technique aims at storing a sequence of repeated characters as a single value and the number of repeats.
Example
original string : AAAAAAATTTTTGGGGTGTTTTTT
compressed string : A7T5G4TGT6
Questions :
def run_length_encoding(S):
"""
Encode sequence using the Run Length mathod
Args:
text (str): string to be shifted
Return:
str: run length
"""
S+="$"
i = 1
result = ""
current_char = S[0]
count = 1
while i < len(S):
if S[i] == current_char:
count += 1
else:
result += current_char
if count > 1:
result += str(count)
current_char = S[i]
count = 1
i += 1
return(result)
run_length_encoding("AAAATTCCCGCGG")
genome_bwt = BWT(str(genome.seq))
len(str(genome.seq)),len(run_length_encoding(str(genome.seq))), len(run_length_encoding(genome_bwt))
Let $S$ be a string, and let $S'$ be the transformed version of $S$ after undergoing a given number of circular shift operations, and let us consider that the end of string character is at location $i$.
Questions :
Answer :
$S[1:i]$ is the suffix of $S[i+1:|S|]$ and $S[|S|]$ is the character just before the suffix $S[1:i]$. In other words, $S[1:i]$ is the (circular) right context of character $S[|S|]$. Therefore the BWT simply returns the previous character to each suffix in the suffix table (i.e., returns the character preceding a suffix, sorted by their suffix lexicographic order, i.e. sorted by their right context).
If we assume that some patterns exist in the string, i.e., some characters tend to co-occur with higher frequency, sorting characters according to their suffixes (i.e., their write context) will bring together similar characters. For example, let us imagine that we have a very long document about dogs, then just before a "og ..." there are chances to find a "d". and thus sorting letters according to their suffix will tend to bring together many "d", alternated with other characters (for example for words like "cog").
def suffix_list(T):
"""
Compute the suffix list
Args:
T (str): string
Return:
list of strings: suffix list
"""
suffix_list = [T[i:] for i in range(len(T))]
sorted(suffix_list,reverse=True)
return suffix_list
def suffix_table(T):
"""
Compute the suffix table
Args:
T (str): string
Return:
list of tuples (suffix,location): suffix table
"""
suffix_list = [T[i:] for i in range(len(T))]
suffix_table = sorted((e,i) for i,e in enumerate(suffix_list))
return suffix_table
def BWT_suffix_table(T,end_of_string="$"):
"""
Compute the BWT from the suffix table
Args:
T (str): string
end_of_string (char): end of string character to append
Return:
bwt (str): BWT
"""
T += end_of_string
ST = suffix_table(T)
bwt = ""
for s,i in ST:
bwt += T[i-1]
return(bwt)
T = "ACATACAGATG"
BWT(T)
BWT_suffix_table(T)
The BWT transformation is reversible, and the inverse operation does note require the storage of any additional data: only the end of string character is needed. The inverse BWT procedure is described hereafter:
Given the circular shift operation, we know that $\forall i$ if $S[i] = S_{bwt}[i]$, then $S[i+1] = S_{1}[i]$
Now, $\{\forall i \quad (S_{bwt}[i], S_{1}[i])\}$ denote the set of all pairs of successive characters in $S$.
Example :
$S_{bwt}$ = GT$CCGAATAAA
Sort to get $S_1$: $S_1$ = $AAAAACGGTT
Paste $S_{bwt}$ to get the pairs: $\{\forall i \quad (S_{bwt}[i], S_{1}[i])\}$ = {G\$, TA, \\$A, CA, CA, GA, AC, AC, TG, AG, AT, AT}
Sort the pairs to get the (first and) second column: \$A, AC, AC, AG, AT, AT, CA, CA, G\\$, GA, TA, TG
$S_{2}$ = ACCGTTAA\$AAG
Paste $S_{bwt}$ to get the triplets: $\{\forall i \quad (S_{bwt}[i], S_{1}[i], S_{2}[i])\}$ = {G\$A, TAC, \\$A, CAG, CAT, GAT, ACA, ACA, TG\$, AGA, ATA, ATG}
...
Questions :
def print_table(table):
for row in table:
print("".join(row))
def inverse_BWT(bwt, verbose=False, last_character="$"):
"""
Inverse the BWT
Args:
bwt (str): bwt of a string T
verbose (bool): if True print the process of the algorithm
last_character (char): which is the end of string character?
Return:
T (str): BWT^{-1} of bwt
"""
def paste(suffix_table,bwt):
new_columns = []
for i,s in enumerate(suffix_table):
new_columns.append(bwt[i]+s[:])
return(new_columns)
bwt = list(bwt)
table = sorted(bwt)
while len(table[0])<len(bwt):
table = paste(table,bwt)
table = sorted(table)
if verbose:
print("suffix table in progress....")
print_table(table)
if verbose:
print("suffix table")
for row in table:
print(row)
for row in table:
if row[-1] == last_character:
return(row)
inverse_BWT("GT$CCGAATAAA")
text
inverse_BWT(BWT(text))
There is a more efficient way to perform the Inverse Burrows Wheeler Transform, without needing to reconstruct the entire table (pasting and sorting)
Example:
\$ACATACAGATG
ACAGATG\$ACAT
ACATACAGATG\$
AGATG\$ACATAC
ATACAGATG\$AC
ATG\$ACATACAG
CAGATG\$ACATA
CATACAGATG\$A
G\$ACATACAGAT
GATG\$ACATACA
TACAGATG\$ACA
TG\$ACATACAGA
Answer:
\$ACATACAGATG$_1$
ACAGATG\$ACAT$_1$
ACATACAGATG\$$_1$
AGATG\$ACATAC$_1$
ATACAGATG\$AC$_2$
ATG\$ACATACAG$_2$
CAGATG\$ACATA$_1$
CATACAGATG\$A$_2$
G\$ACATACAGAT$_2$
GATG\$ACATACA$_3$
TACAGATG\$ACA$_4$
TG\$ACATACAGA$_5$
row 1 $\to$ \$$_1$ACATACAGATG$_1$
row 9 $\to$ G$_1$\$ACATACAGAT$_2$
row 12 $\to$ T$_2$G\$ACATACAGA$_5$
\$$_1$ACATACAGATG$_1$
A$_1$CAGATG\$ACAT$_1$
A$_2$CATACAGATG\$$_1$
A$_3$GATG\$ACATAC$_1$
A$_4$TACAGATG\$AC$_2$
A$_5$TG\$ACATACAG$_2$
C$_1$AGATG\$ACATA$_1$
C$_2$ATACAGATG\$A$_2$
G$_1$\$ACATACAGAT$_2$
G$_2$ATG\$ACATACA$_3$
T$_1$ACAGATG\$ACA$_4$
T$_2$G\$ACATACAGA$_5$
Let us consider the following example:
We set the indexes in the last column such that:
xxxxG$_1$
.....
yyyyG$_2$
.....
zzzzG$_3$
Given the BWT construction (sorting the strings by lexicographic order) we know that:
xxxx < yyyy < zzzz
Consequently, in the BWT the rows starting by G (circular shifts of the previous ones) will preserve the same order
.....
G$_1$xxxx
G$_2$yyyy
G$_3$zzzz
.....
Let $S_{bwt}$ the BWT of a string $S$, and let $A$ be the alphabet of $S$.
Let $S*T = (S[1], S[2], \dots, S[|S|], T[1], \dots, T[|T|])$ be the concatenation of strings $S$ and $T$
initialization
By definition, the first character of the First column is \$, its left context is stored in the first character of the Last column ($S_{bwt}[i]$), therefore this is the penultimate character in $S$ (just before \\$).
Let $X\leftarrow S_{bwt}[0]$ denote the current character
Repeat While $X \neq \$$
The main idea is to find the location of $X$ in the First column of the table (not computed explicitly), in order to find its left context in the Last column, i.e., $S_{bwt}$. Notice that the characters in the first column are: 1) sorted into a lexicographic order 2) the instances of the same letter share the same appearance index than those in $S_{bwt}$.
Questions:
from collections import Counter
def occurrence_indexer(S):
K = []
last_index = {}
for s in S:
if s not in last_index:
last_index[s] = 0
K.append(last_index[s])
last_index[s] += 1
return(K)
def last2first(counts,k,X):
return k + sum([counts[char] for char in counts if char < X])
def efficient_inverse_BWT(bwt,end_of_string="$"):
"""
Inverse the BWT
Args:
bwt (str): bwt of a string T
last_character (char): which is the end of string character?
Return:
T (str): BWT^{-1} of bwt
"""
K = occurrence_indexer(bwt)
counts = Counter(bwt)
X = bwt[0]
k = K[0]
S = end_of_string
while X != end_of_string:
S = X+S
j = last2first(counts,k,X)
X = bwt[j]
k = K[j]
return(S)
inverse_BWT("GT$CCGAATAAA")
efficient_inverse_BWT("GT$CCGAATAAA")
Using a similar technique as in the efficient inverse BWT, it is possible to detect the presence of a substring $T$ in a string $S$ using the following principle
initialization
while $e < f$ and $i>0$:
Question:
def last2first(counts,k,X):
return k + sum([counts[char] for char in counts if char < X])
def get_first_occurrence(L,X):
for i,l in enumerate(L):
if l == X:
return(i)
def pattern_matching_BWT(S,pattern):
"""
Search a patter in a String using the BWT
Args:
S (str): string
pattern (str): pattern
Return:
bool: true if the pattern is in the string
"""
L = BWT_suffix_table(T)
K = occurrence_indexer(L)
counts = Counter(L)
e = 0
f = len(L)
i = len(pattern) - 1
while e < f and i >= 0:
X = pattern[i]
first_occurence_in_L_ef = get_first_occurrence(L[e:f],X)
if first_occurence_in_L_ef is None:
return False
else:
r = first_occurence_in_L_ef+e
#print(first_occurence_in_L_ef,L[e:f],r,X,L[r])
last_occurence_in_L_ef = get_first_occurrence(L[e:f][::-1],X)
if last_occurence_in_L_ef is None:
return False
else:
s = f-last_occurence_in_L_ef-1
#print(last_occurence_in_L_ef,L[e:f],s,X,L[s])
e = last2first(counts,K[r],X)
f = last2first(counts,K[s],X)+1
i -= 1
#print(r,s,e,f)
return(i<0)
pattern_matching_BWT('ACATACAGATG','CATAC')