In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Comparing distances

1) String distances

1.1) Hamming distance

Let $S$ and $T$ be two strings of same length. The Hamming distance between these two strings is simply $d(T,S) = |\{i \quad \forall i \in \{1,..., |S|\} \; | \; T[i]\neq S[i])$. This distance represent the number of positions where both strings differ, or in other words, the minimum number of substitutions (e.g., errors) that changed one string into the other.

Questions:

  • Write a program that computes the Hamming distance between two strings
  • Compute its complexity
  • Describe the limits of this method? Propose an example to explain your point
In [2]:
def hamming_distance(S,T):
    """
    Hamming distance
    
    Args:
        S (str): first string
        T (str): second string

    Returns:
        d (int): distance
    """
    if len(S) != len(T):
        return None
    d = 0
    for i,s in enumerate(S):
        d += s != T[i]
    return d
In [3]:
hamming_distance("dans l'herbe noire les kobolds vont",
                 "dans l'herbr noird lzs kobolds vont")
Out[3]:
3
In [4]:
hamming_distance("dans l'herbe noire les kobolds vont ",
                 " dans l'herbe noire les kobolds vont")
Out[4]:
36

1.2) Levenshtein distance

In order to compare strings that have different lengths, or strings that undergo other kinds of modification (e.g., insertions, deletionss), the Levenshtein distance can be used.

Let $S$ and $T$ be two strings with possibly $|S| \neq |T|$, and let $\mathbb{1}_{S[i] \neq T[j]}$ be the indicator function that is equal to one when $S[i] \neq T[j]$ and 0 otherwise.

Let $\mathcal{L}_{S,T}(i,j)$ be the Levenshtein distance between the first $i$ characters of $S$ and the first $j$ characters of $T$:

  • if $min(i,j)=0$: $\quad \mathcal{L}_{S,T}(i,j)= max(i,j)$
  • Otherwise:
$$ \mathcal{L}_{S,T}(i,j)= min \begin{cases} \mathcal{L}_{S,T}(i-1,j) + 1\\ \mathcal{L}_{S,T}(i,j-1) + 1\\ \mathcal{L}_{S,T}(i-1,j-1) + \mathbb{1}_{S[i] \neq T[j]} \end{cases} $$

The Wagner–Fischer algorithm, can be used to compute the Levenshtein distance between $S$ and $T$. This dynamic programming algorithm, stores a matrix with the Levenshtein distances between all prefixes of both strings. The last value of the matrix (at row $|S|$ and column $|T|$) stores the Levenshtein distance between both strings.

Questions:

  • Explain each line of the function
  • Implicitly, this algorithm assigns a scoring function to each kind of modification, which are these scores?
  • Compute the complexity of the algorithm that aims at computing this function
  • Program this function in python
  • Compare two strings, and represent the matrix using imshow
In [41]:
def Levenstein_distance(S,T,return_matrix=False):
    """
    Levenshtein distance

    Args:
        S (str): first string
        T (str): second string
        return_matrix (bool): If true return matrix, else return distance

    Returns:
        F (np.array, np.array): If return_matrix is true F is the score matrix 
        S_aligned, T_aligned (str,str): If return_matrix is false return only the distance
        
    """
    L = np.zeros((len(S)+1,len(T)+1))
    L[0,:] = range(len(T)+1)
    L[:,0] = range(len(S)+1)
    for i in range(1,len(S)+1):
        for j in range(1,len(T)+1):
            comparison = int(S[i-1]!=T[j-1])
            L[i,j] = min([L[i-1,j] + 1, L[i,j-1] + 1, L[i-1,j-1] + comparison])
    if return_matrix:
        return L
    return(L[-1,-1])
In [29]:
Levenstein_distance("dans l'herbe noire les kobolds vont ",
                   " dans l'herbe noire les kobolds vont le vent profond")
Out[29]:
16.0
In [30]:
L = Levenstein_distance("dans l'herbe noire les kobolds vont ",
                        " dans l'herbe noire les kobolds vont le vent profond",
                        return_matrix=True)
In [31]:
sns.heatmap(L)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fafa8bfb850>

3) Alignment

3.1) Needleman-Wunsch - Global Alignment

This algorithm performs a global sequence alignment, i.e., it finds the best alignment over the entire length of two sequences $S$ and $T$. Intuitively the algorithm seeks an alignment that maximizes the number of element-to-element matches. This dynamic programming algorithm is similar to the Wagner–Fischer algorithm. The underlying scoring system defined by Needleman and Wunsch is equal to 1 for matches, -1 for mismatches or indels. Let $F_{S,T}(i,j)$ be the total number of matches scoring between the first $i$ elements of $S$ and the first $j$ elements of $T$:

  • If $i=0$ or $j=0$ $F_{S,T}(i,j) = -max(i,j)$

  • Otherwise:

$$ \mathcal{F}_{S,T}(i,j)= max \begin{cases} \mathcal{F}_{S,T}(i-1,j) - 1\\ \mathcal{F}_{S,T}(i,j-1) - 1\\ \mathcal{F}_{S,T}(i-1,j-1) - \mathbb{1}_{S[i] \neq T[j]} + \mathbb{1}_{S[i] = T[j]} \end{cases} $$
  • Record the values of $F_{S,T}(i,j)$ in a matrix

  • For each element in the matrix, record the decision that was made to reach this value:

    • $\mathcal{F}_{S,T}(i-1,j)$
    • $\mathcal{F}_{S,T}(i,j-1)$
    • or $\mathcal{F}_{S,T}(i-1,j-1)$).
  • In order to find the best alignment,

  • To find the best alignment we start from the last element of the matrix (at row $|S|$ and column $|T|$) and we move backwards until we reach the first one (at row $0$ and column $0$). At each position we check which decision leaded to the actual solution (many solutions are sometimes possible, in this case you can keep all of them).

    • If the decision was $F_{S,T}(i-1,j)$ then it means that $S[i]$ is aligned with a gap
    • If the decision was $F_{S,T}(i,j-1)$ then $T[j]$ is aligned with a gap
    • If it was $F_{S,T}(i-1,j-1)$ then $S[i]$ and $T[j]$ are aligned. You can use the records you made in the previous step or recompute the possibilities ... both are possible. \end{itemize}

Questions:

  • Program the Needleman-Wunsch algorithm, the function should take two strings as input and should return the decisions matrix, as well as the scores matrix
  • Program a function that receives the strings as well as the decision function and returns an aligned version of both strings (deleted characters being replaced by "_")
  • Compute the complexity of this algorithm
  • Change the underlying scoring parameters, and run some tests to assess the impact
In [33]:
def NWmatrix2strings(S,T,D):
    """
    Needleman-Wunch matrix to aligned strings

    Args:
        S (str): first string
        T (str): second string
        D (np.array or list of lists): Decision matrix

    Returns:
        S_aligned (str): first string aligned
        T_aligned (str): second string aligned
    """
    S_WITH_GAP = -1
    T_WITH_GAP = 1
    S_T_ALIGN = 0
    i = D.shape[0]-1
    j = D.shape[1]-1
    S_aligned = ""
    T_aligned = ""
    while i > 0 or j > 0:
        #print(i,j,T[j-1],S[i-1],D[i,j])
        if D[i,j] == S_WITH_GAP:
            S_aligned = S[i-1]+S_aligned
            T_aligned = "_"+T_aligned
            #print(S_aligned,T_aligned)
            i -= 1
        elif D[i,j] == T_WITH_GAP:
            T_aligned = T[j-1]+ T_aligned
            S_aligned = "_"+S_aligned
            #print(S_aligned,T_aligned)
            j -= 1
        elif D[i,j] == S_T_ALIGN:
            S_aligned = S[i-1] + S_aligned
            T_aligned = T[j-1] + T_aligned
            #print(S_aligned,T_aligned)
            j -= 1
            i -= 1
    return(S_aligned,T_aligned)
    
def Needleman_Wunch(S,T,return_matrix=False):

    T_WITH_GAP = 1
    S_WITH_GAP = -1
    S_T_ALIGN = 0
    decision = [T_WITH_GAP,S_WITH_GAP,S_T_ALIGN]
    F = np.zeros((len(S)+1,len(T)+1))
    F[0,:] = range(len(T)+1)
    F[:,0] = range(len(S)+1)
    F *= -1
    D = np.zeros((len(S)+1,len(T)+1))
    D[0,1:] = T_WITH_GAP 
    D[1:,0] = S_WITH_GAP 
    for i in range(1,len(S)+1):
        for j in range(1,len(T)+1):
            comparison = int(S[i-1]==T[j-1])*2-1 # * 10
            options = [F[i,j-1] - 1, F[i-1,j] - 1, F[i-1,j-1] + comparison]
            F[i,j] = options[0]
            D[i,j] = T_WITH_GAP
            for o,val in enumerate(options):
                if val > F[i,j]:
                    F[i,j] = val
                    D[i,j] = decision[o]
    if return_matrix:
        return F,D
    else:
        return NWmatrix2strings(S,T,D)
In [34]:
S = "dans l'herbe noire les gros kobolds vont "
T = " dans l'herbe noire les kobolds vont le vent profond"
F,D = Needleman_Wunch(S,T,True)
NWmatrix2strings(S,T,D)
Out[34]:
("_dans l'herbe noire les gros kobolds vont _______________",
 " dans l'herbe noire les _____kobolds vont le vent profond")
In [35]:
sns.heatmap(F)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf6a560a90>
In [36]:
sns.heatmap(D)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf3921ae50>

3.2) Smith-Waterman - local Alignment

Smith-Waterman algorithm performs local sequence alignment, i.e., it finds alignments shorter than the entire sequences. This kind of alignments is particularly useful when we compare two sequences that are very different, but that may share some local regions with high similarity. The Smith-Waterman algorithm is very similar to the Needleman-Wunsch method.

  • If $i=0$ or $j=0$ $F_{S,T}(i,j) = 0$

  • Otherwise:

$$ \mathcal{F}_{S,T}(i,j)= max \begin{cases} \mathcal{F}_{S,T}(i-1,j) - 1\\ \mathcal{F}_{S,T}(i,j-1) - 1\\ \mathcal{F}_{S,T}(i-1,j-1) - \mathbb{1}_{S[i] \neq T[j]} + \mathbb{1}_{S[i] = T[j]}\\ 0 \end{cases} $$

Moreover, instead of starting at position $|S|$ and $|T|$ to traceback the alignment, one should start at the maximal position in the matrix. And the traceback stops as soon as we reach a 0.

Questions:

  • Explain the reasons behind the difference between both methods
  • Program the Smith-Waterman algorithm, the function should take two strings as input and should return the decisions matrix, as well as the scores matrix
  • Program a function that receives the strings as well as the decision and the score matrices and returns an aligned version of both strings (deleted characters being replaced by "_")
  • Compute the complexity of this algorithm
  • Change the underlying scoring parameters, and run some tests to assess the impact
In [42]:
def SWmatrix2strings(S,T,D,F):
    """
    Smith-Waterman matrix to aligned strings

    Args:
        S (str): first string
        T (str): second string
        D (np.array or list of lists): Decision matrix

    Returns:
        S_aligned (str): first string aligned
        T_aligned (str): second string aligned
        
    """    
    S_WITH_GAP = -1
    T_WITH_GAP = 1
    S_T_ALIGN = 0
    i,j = np.unravel_index(F.argmax(),F.shape)
    print(i,j)
    S_aligned = ""
    T_aligned = ""
    while i > 0 and j > 0:
        #print(i,j,T[j-1],S[i-1],D[i,j])
        if D[i,j] == S_WITH_GAP:
            S_aligned = S[i-1]+S_aligned
            T_aligned = "_"+T_aligned
            #print(S_aligned,T_aligned)
            i -= 1
        elif D[i,j] == T_WITH_GAP:
            T_aligned = T[j-1]+ T_aligned
            S_aligned = "_"+S_aligned
            #print(S_aligned,T_aligned)
            j -= 1
        elif D[i,j] == S_T_ALIGN:
            S_aligned = S[i-1] + S_aligned
            T_aligned = T[j-1] + T_aligned
            #print(S_aligned,T_aligned)
            j -= 1
            i -= 1
        elif D[i,j] == NO_ALIGN:
            break
    return(S_aligned,T_aligned)

def Smith_Waterman(S,T,return_matrix=False):
    """
    Needleman-Wunch 

    Args:
        S (str): first string
        T (str): second string
        return_matrix (bool): If true return matrix, else return the aligned strings

    Returns:
        F,D (np.array, np.array): If return_matrix is true F is the score matrix and D the decision matrix
        S_aligned, T_aligned (str,str): If return_matrix is false first and second strings aligned
        
    """
    T_WITH_GAP = 1
    S_WITH_GAP = -1
    S_T_ALIGN = 0
    NO_ALIGN = 2
    decision = [T_WITH_GAP,S_WITH_GAP,S_T_ALIGN,NO_ALIGN]
    F = np.zeros((len(S)+1,len(T)+1))
    D = np.zeros((len(S)+1,len(T)+1))
    D[0,1:] = T_WITH_GAP 
    D[1:,0] = S_WITH_GAP 
    for i in range(1,len(S)+1):
        for j in range(1,len(T)+1):
            comparison = int(S[i-1]==T[j-1])*2-1
            options = [F[i,j-1] - 1, F[i-1,j] - 1, F[i-1,j-1] + comparison, 0]
            F[i,j] = options[0]
            D[i,j] = T_WITH_GAP
            for o,val in enumerate(options):
                if val > F[i,j]:
                    F[i,j] = val
                    D[i,j] = decision[o]
    if return_matrix:
        return F,D
    else:
        return SWmatrix2strings(S,T,D,F)
In [43]:
S = "un buisson gifle l'oeil au passant. dans l'herbe noir(e les gros kobolds vont "
T = " dans l'herbe noire les kobolds vont le vent profond pleure on veut croire"
F,D = Smith_Waterman(S,T,True)
SWmatrix2strings(S,T,D,F)
78 37
Out[43]:
(" dans l'herbe noir(e les gros kobolds vont ",
 " dans l'herbe noir_e les _____kobolds vont ")
In [45]:
sns.heatmap(F)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf50c3a590>
In [46]:
sns.heatmap(D)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fafa8dae7d0>

Real World Application

Clustering of microRNA sequences

In this section we use the Levenshtein distance to study the similarity between sequences from a database.

  • Install the Levenshtein Python library: pip install python-Levenshtein (in Colab)
  • Download some miRNA sequences from the miRBase database: http://www.mirbase.org/ftp.shtml
  • Compute the pairwise distances between the different sequences.
  • Use the clustermap function from seaborn to represent the results, take a look at the documentation and explain the function.
In [ ]:
 

More material

In [ ]: