%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
Let $S$ and $T$ be two strings of same length. The Hamming distance between these two strings is simply $d(T,S) = |\{i \quad \forall i \in \{1,..., |S|\} \; | \; T[i]\neq S[i])$. This distance represent the number of positions where both strings differ, or in other words, the minimum number of substitutions (e.g., errors) that changed one string into the other.
Questions:
def hamming_distance(S,T):
"""
Hamming distance
Args:
S (str): first string
T (str): second string
Returns:
d (int): distance
"""
if len(S) != len(T):
return None
d = 0
for i,s in enumerate(S):
d += s != T[i]
return d
hamming_distance("dans l'herbe noire les kobolds vont",
"dans l'herbr noird lzs kobolds vont")
hamming_distance("dans l'herbe noire les kobolds vont ",
" dans l'herbe noire les kobolds vont")
In order to compare strings that have different lengths, or strings that undergo other kinds of modification (e.g., insertions, deletionss), the Levenshtein distance can be used.
Let $S$ and $T$ be two strings with possibly $|S| \neq |T|$, and let $\mathbb{1}_{S[i] \neq T[j]}$ be the indicator function that is equal to one when $S[i] \neq T[j]$ and 0 otherwise.
Let $\mathcal{L}_{S,T}(i,j)$ be the Levenshtein distance between the first $i$ characters of $S$ and the first $j$ characters of $T$:
The Wagner–Fischer algorithm, can be used to compute the Levenshtein distance between $S$ and $T$. This dynamic programming algorithm, stores a matrix with the Levenshtein distances between all prefixes of both strings. The last value of the matrix (at row $|S|$ and column $|T|$) stores the Levenshtein distance between both strings.
Questions:
def Levenstein_distance(S,T,return_matrix=False):
"""
Levenshtein distance
Args:
S (str): first string
T (str): second string
return_matrix (bool): If true return matrix, else return distance
Returns:
F (np.array, np.array): If return_matrix is true F is the score matrix
S_aligned, T_aligned (str,str): If return_matrix is false return only the distance
"""
L = np.zeros((len(S)+1,len(T)+1))
L[0,:] = range(len(T)+1)
L[:,0] = range(len(S)+1)
for i in range(1,len(S)+1):
for j in range(1,len(T)+1):
comparison = int(S[i-1]!=T[j-1])
L[i,j] = min([L[i-1,j] + 1, L[i,j-1] + 1, L[i-1,j-1] + comparison])
if return_matrix:
return L
return(L[-1,-1])
Levenstein_distance("dans l'herbe noire les kobolds vont ",
" dans l'herbe noire les kobolds vont le vent profond")
L = Levenstein_distance("dans l'herbe noire les kobolds vont ",
" dans l'herbe noire les kobolds vont le vent profond",
return_matrix=True)
sns.heatmap(L)
This algorithm performs a global sequence alignment, i.e., it finds the best alignment over the entire length of two sequences $S$ and $T$. Intuitively the algorithm seeks an alignment that maximizes the number of element-to-element matches. This dynamic programming algorithm is similar to the Wagner–Fischer algorithm. The underlying scoring system defined by Needleman and Wunsch is equal to 1 for matches, -1 for mismatches or indels. Let $F_{S,T}(i,j)$ be the total number of matches scoring between the first $i$ elements of $S$ and the first $j$ elements of $T$:
If $i=0$ or $j=0$ $F_{S,T}(i,j) = -max(i,j)$
Otherwise:
Record the values of $F_{S,T}(i,j)$ in a matrix
For each element in the matrix, record the decision that was made to reach this value:
In order to find the best alignment,
To find the best alignment we start from the last element of the matrix (at row $|S|$ and column $|T|$) and we move backwards until we reach the first one (at row $0$ and column $0$). At each position we check which decision leaded to the actual solution (many solutions are sometimes possible, in this case you can keep all of them).
Questions:
def NWmatrix2strings(S,T,D):
"""
Needleman-Wunch matrix to aligned strings
Args:
S (str): first string
T (str): second string
D (np.array or list of lists): Decision matrix
Returns:
S_aligned (str): first string aligned
T_aligned (str): second string aligned
"""
S_WITH_GAP = -1
T_WITH_GAP = 1
S_T_ALIGN = 0
i = D.shape[0]-1
j = D.shape[1]-1
S_aligned = ""
T_aligned = ""
while i > 0 or j > 0:
#print(i,j,T[j-1],S[i-1],D[i,j])
if D[i,j] == S_WITH_GAP:
S_aligned = S[i-1]+S_aligned
T_aligned = "_"+T_aligned
#print(S_aligned,T_aligned)
i -= 1
elif D[i,j] == T_WITH_GAP:
T_aligned = T[j-1]+ T_aligned
S_aligned = "_"+S_aligned
#print(S_aligned,T_aligned)
j -= 1
elif D[i,j] == S_T_ALIGN:
S_aligned = S[i-1] + S_aligned
T_aligned = T[j-1] + T_aligned
#print(S_aligned,T_aligned)
j -= 1
i -= 1
return(S_aligned,T_aligned)
def Needleman_Wunch(S,T,return_matrix=False):
T_WITH_GAP = 1
S_WITH_GAP = -1
S_T_ALIGN = 0
decision = [T_WITH_GAP,S_WITH_GAP,S_T_ALIGN]
F = np.zeros((len(S)+1,len(T)+1))
F[0,:] = range(len(T)+1)
F[:,0] = range(len(S)+1)
F *= -1
D = np.zeros((len(S)+1,len(T)+1))
D[0,1:] = T_WITH_GAP
D[1:,0] = S_WITH_GAP
for i in range(1,len(S)+1):
for j in range(1,len(T)+1):
comparison = int(S[i-1]==T[j-1])*2-1 # * 10
options = [F[i,j-1] - 1, F[i-1,j] - 1, F[i-1,j-1] + comparison]
F[i,j] = options[0]
D[i,j] = T_WITH_GAP
for o,val in enumerate(options):
if val > F[i,j]:
F[i,j] = val
D[i,j] = decision[o]
if return_matrix:
return F,D
else:
return NWmatrix2strings(S,T,D)
S = "dans l'herbe noire les gros kobolds vont "
T = " dans l'herbe noire les kobolds vont le vent profond"
F,D = Needleman_Wunch(S,T,True)
NWmatrix2strings(S,T,D)
sns.heatmap(F)
sns.heatmap(D)
Smith-Waterman algorithm performs local sequence alignment, i.e., it finds alignments shorter than the entire sequences. This kind of alignments is particularly useful when we compare two sequences that are very different, but that may share some local regions with high similarity. The Smith-Waterman algorithm is very similar to the Needleman-Wunsch method.
If $i=0$ or $j=0$ $F_{S,T}(i,j) = 0$
Otherwise:
Moreover, instead of starting at position $|S|$ and $|T|$ to traceback the alignment, one should start at the maximal position in the matrix. And the traceback stops as soon as we reach a 0.
Questions:
def SWmatrix2strings(S,T,D,F):
"""
Smith-Waterman matrix to aligned strings
Args:
S (str): first string
T (str): second string
D (np.array or list of lists): Decision matrix
Returns:
S_aligned (str): first string aligned
T_aligned (str): second string aligned
"""
S_WITH_GAP = -1
T_WITH_GAP = 1
S_T_ALIGN = 0
i,j = np.unravel_index(F.argmax(),F.shape)
print(i,j)
S_aligned = ""
T_aligned = ""
while i > 0 and j > 0:
#print(i,j,T[j-1],S[i-1],D[i,j])
if D[i,j] == S_WITH_GAP:
S_aligned = S[i-1]+S_aligned
T_aligned = "_"+T_aligned
#print(S_aligned,T_aligned)
i -= 1
elif D[i,j] == T_WITH_GAP:
T_aligned = T[j-1]+ T_aligned
S_aligned = "_"+S_aligned
#print(S_aligned,T_aligned)
j -= 1
elif D[i,j] == S_T_ALIGN:
S_aligned = S[i-1] + S_aligned
T_aligned = T[j-1] + T_aligned
#print(S_aligned,T_aligned)
j -= 1
i -= 1
elif D[i,j] == NO_ALIGN:
break
return(S_aligned,T_aligned)
def Smith_Waterman(S,T,return_matrix=False):
"""
Needleman-Wunch
Args:
S (str): first string
T (str): second string
return_matrix (bool): If true return matrix, else return the aligned strings
Returns:
F,D (np.array, np.array): If return_matrix is true F is the score matrix and D the decision matrix
S_aligned, T_aligned (str,str): If return_matrix is false first and second strings aligned
"""
T_WITH_GAP = 1
S_WITH_GAP = -1
S_T_ALIGN = 0
NO_ALIGN = 2
decision = [T_WITH_GAP,S_WITH_GAP,S_T_ALIGN,NO_ALIGN]
F = np.zeros((len(S)+1,len(T)+1))
D = np.zeros((len(S)+1,len(T)+1))
D[0,1:] = T_WITH_GAP
D[1:,0] = S_WITH_GAP
for i in range(1,len(S)+1):
for j in range(1,len(T)+1):
comparison = int(S[i-1]==T[j-1])*2-1
options = [F[i,j-1] - 1, F[i-1,j] - 1, F[i-1,j-1] + comparison, 0]
F[i,j] = options[0]
D[i,j] = T_WITH_GAP
for o,val in enumerate(options):
if val > F[i,j]:
F[i,j] = val
D[i,j] = decision[o]
if return_matrix:
return F,D
else:
return SWmatrix2strings(S,T,D,F)
S = "un buisson gifle l'oeil au passant. dans l'herbe noir(e les gros kobolds vont "
T = " dans l'herbe noire les kobolds vont le vent profond pleure on veut croire"
F,D = Smith_Waterman(S,T,True)
SWmatrix2strings(S,T,D,F)
sns.heatmap(F)
sns.heatmap(D)
In this section we use the Levenshtein distance to study the similarity between sequences from a database.
pip install python-Levenshtein
(in Colab)clustermap
function from seaborn
to represent the results, take a look at the documentation and explain the function.
Illustration for Needleman-Wunsch: https://www.wikiwand.com/en/Needleman%E2%80%93Wunsch_algorithm
Illustration for Smith-Waterman: https://www.wikiwand.com/en/Smith%E2%80%93Waterman_algorithm