Burrows Wheeler Transform

Sergio Peignier (sergio.peignier@insa-lyon.fr)

In this practical work, we will program and study the Burrows Wheeler transform

Lexicographic order

In this work, we compare two strings iteratively, by comparing their characters by their so-called lexicographic order, (i.e., their alphabetic order A<B<C<....<Z). Notice that here we consider only the alphabet characters, and an "end of string" character, denoted \$ s.t. \\$ < A. Therefore, a string $S$ that is a prefix of a string $T$ is considered smaller: $S<T$.

Circular Shift

Let $R(T,k)=T'$ be a function that takes a string $T$ and a integer $k$ as inputs and returns a string $T'$ that is a circular shift of string $T$, such that:

$T'[1,|T|-k] = T[k+1,|T|]$ and $T'[|T|-k + 1,|T|] = T[1,k]$

example:

for T = "dans l'herbe noire Les Kobolds vont$"

R(T,3) = "s l'herbe noire Les Kobolds vont$dan"

R(T,9) = "rbe noire Les Kobolds vont$dans l'he"

  • Program function $R(T,k)$ in python

Burrows Wheeler Transform (BWT)

  • To compute the Burrows Wheeler Transform (BWT) we must:
    • Assume that the string contains a special character "End of string" that only occurs at the end of the end of the string and nowhere else (here: "\$")
    • Apply the function $R(T,k), \quad \forall k\in [1,...,|T|]$ and store the outputs (i.e., compute and store all the circular shifts)
    • Sort the strings into lexical order
    • Extract the last letter of each string

Example:

Original String: $S=$ ACATACAGATG\$

Sorted circular shifts:

\$ACATACAGATG

ACAGATG\$ACAT

ACATACAGATG\$

AGATG\$ACATAC

ATACAGATG\$AC

ATG\$ACATACAG

CAGATG\$ACATA

CATACAGATG\$A

G\$ACATACAGAT

GATG\$ACATACA

TACAGATG\$ACA

TG\$ACATACAGA

Burrow Wheeler Transform:

$BWT(S) = $ GT\$CCGAATAAA

Questions :

  • Program the BWT in python
  • Apply the BWT to encode the virus genome downloaded during the previous lecture
In [25]:
def circular_shift(text,k):
    """
    Make a circular shift
    
    Args:
        text (str): string to be shifted

    Returns:
        str: shifted string
    """
    return text[-k:]+text[:-k]

def BWT(text,verbose=False):
    """
    Compute the BWT
    
    Args:
        text (str): string to be shifted
    
    Return:
        str: BWT
    """
    if text[-1] != "$":
        text += "$"
    rotations = []
    for k in range(1,len(text)+1):
        circular_shifted_text = circular_shift(text,k)
        rotations.append(circular_shifted_text)
    rotations = sorted(rotations)
    if verbose:
        for circular_shifted_text in rotations:
            print(circular_shifted_text)
    bwt = "".join([r[-1] for r in rotations])
    return(bwt)
In [26]:
text = "dans l'herbe noire les kobolds vont. le vent profond pleure, on veut croire."
In [27]:
shifted_text = circular_shift(text,10)
shifted_text
Out[27]:
"ut croire.dans l'herbe noire les kobolds vont. le vent profond pleure, on ve"
In [28]:
bwt = BWT(text)
bwt
Out[28]:
"tss.ee,dtens.letedro n$lrblrrvhllvo'oo  o  poo aeokrnrb fv  eiuipcendunnee   "
In [29]:
BWT("ACATACAGATG",verbose=True)
$ACATACAGATG
ACAGATG$ACAT
ACATACAGATG$
AGATG$ACATAC
ATACAGATG$AC
ATG$ACATACAG
CAGATG$ACATA
CATACAGATG$A
G$ACATACAGAT
GATG$ACATACA
TACAGATG$ACA
TG$ACATACAGA
Out[29]:
'GT$CCGAATAAA'
In [32]:
from Bio import SeqIO
genome = list(SeqIO.parse("data/sarscov2_genome.fasta", "fasta"))[0]
BWT(str(genome.seq))
Out[32]:
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACTGTCTTCTTTCCTTGGTTCCTCCCTGGTTGTTGACTTTTCTTGTTTCTGGATTCCTGAAGTCTGGCTGAACATTCTAGAAGGCTTCTTCGGAGGTTCAAGGAAATCGGATCGTGAGCCTCAGTGATATTTTCACAGCACCGGGTGCAGACAAGAAACATAACGCCAAGTTAATGCGACCCAGTTAAGTGTCTCTTAGGCCCTGAAATATTGCCAGCGATACAGTTTCCGACTTTCACGACGCACCTTTATGGAACCAAATAATATCTACCGAACCGCATAATAATAGTACTATGAATGCGTATCAAACCTTTGTATTCTTACTGGATACCGATTCCCTTTATAAGACATACCAATTGTTTACGTACTCTTTTTTTGCTGTCAACTTCGGGGCACAGAAAACGTGAATGCCACGACCTGCTCCATTTAGAAGCGCAACCCATGACATATATTAACAAATAATACCCCATGCGCGATCGGAAACATAAGATTAATTCCAGCAGGTTTATCGTTTTCTCAATTTTCTTTACTAGCATATTATCTACCGACACATAACGTCTTGTTTCACAATCCGAGATTCCAAGCGGTATTGGTATCTAGAGATCATTGTCCCGTCACAAATACAAGTTTAACACTAGTGACAGTTACCAATTTTTGCATAGCGCCCAAGCGAGTCCACCTAAAAACTTTTAATTGAGGGCTTTGGAAACCAGCGCTACATGCGAAATGAATAGTCCAATTTGAAGATGGAGGGCTAGCATCAACCATTTGTTAGTTAATCGAACAAATCACCCAATAATTCGGTCAAACCCTCTGTCGCAGAATTGACCAGTCTTCGCCAGGGCAGCAGTCTTCTCTTCGTACGATACGATCGTATACGGTGGCCCTACGCTTTGCCTTTTGGCTTCTCTCCATTAGCGAAATATCCCAACGCAGCATGCCCGATCATTGCTCAGAAACACTATAGCTTATTCTGTGCCTGACACATAGAACTCTTACACAACTTCACTATTATAATTAAAATGTAACTTATTAATGGTCCCGCTCTACAAACCGATGCTTCTTACACAACAACCGCATCTCATGACGATATTAGACCCTCATCTGGCAGAGTACTATACGGCCGTCCCCCCTTCAATGCCACATAAAGGCACCAAAACACTCCCCGGCAAGAAATCGTAGCATACAAAACCTCCACCGCAGACAAGAAATTTATTCATGCCCAACGATCCCACGTCCACAAGTGATCCCGACTCACATGATAAAAAAAAACCTAAAGTTAGTTTAGCGTGCAATACCTCGGGCGAGAATACAAGTATACGCATAACTTACCCCCTCCCCACATCCACAACTAACGCCCCCTCCAGCCGGGTGTCCCAACTCGACTCACGGTGCTACTCATAAGTACCCCAGTGTATGCCCCAACAAATGCCCAAAAGTCAGACGATCAATCCCAGGCACGACTAGTGCCTCGGTCGCAGACATAAGGTAGCGCATTGCTCAAACTCTAACCCTAGGGCTTGGATAATTTTAATGAACCAAAGGATCACACGCAGGGGTGGGGCGAGAGGGCATAACTTGCCAGCAAGCCCAAATTATATCGTCGAACGGAGCAATCACGAAACAGACGGTTGACTGGAACGGCAGGAGGACTCTCCACGTTCAGTTTCCAGTAAAAAACAAGGCAGGCGCACATAAACCTTAGAAAGAAGCAGGAAGCAGCGATTAATCGATATTAAGGAATATCAGGCGCACAGTTTCTAGTCAAACGCCATGACGCCAGGGGGCAGCAGGGTTAAGACAATACCACCCCGAACCCCTTGTATAATACAAATAACAACAATACAGGTCAGACATTGCGAGAAATGAATTACCTCTTTTTCTCTCGACGAATGAGCGAAGCGCAAGGAGGATCAAAAGGGCTATTACAGAGAGCAGTGGAATCCAAGAGCTATTTACGCGCAAGGTGGTCACCGCAGTGAAACGCAAGATTAAGAGGATCTAAAGAGGCAATAGAAAAGCTTAGTTCGCCTAGAAAGGAGGTGGGGACACTAAAACGGATTAAGGGCCAGGAGGATAAAATTACGGCCAATCAATTGCTTGATTTCTTTTGTTGTAATAGTCAGTTGTCGGAAGTCTGCAAGAAATACAATCAACTTGAATACGATGAATACCGCTGTTAAATCTGTCTCTTTGACAAATGGACGATCTCCTCGTTAAAAGCAGCCCGAAACAAATTACAAAAAAGCTACCTAGATAAAATTTATGACGGGCCAAATCATCCATTACTAAGCGTAGCCATCCCTTCCGATGTTAGATAACGAGGACGTAGCGCTCTTCTCTATATAATAAATGGTTCCTACTTTTTTCTTCCAGCTACTATTCTTTCCGAATGCCTATATAGTACCCTAATTATTGCTACCTGCCGCTAAAGAATACTCCATACGTTTGGACTTGCTACACAATCTGACAAAAACACCCAATACTTATTTTTTATTCTGTCTATCAGGGCCTAGGACATAATGCGGTATATAGCGTATAACGGATACGCAACATCTTGCTCAATTTCACTATCTTGACCGTAAACCAATTTACCAAAAAGAGATACCTCCTGGTGGGAGTCTTCGAATTTGCTAGGATCTATTCGAACCTGAATAAGTACAATCAGTCCCCACTGAATAATAGGGAGAACATGCCACGATTTAATCACTAACCGATGTAGGCATCCTAAATAAAAATTGAACATTGAACATGAAATTAAACAATCGCTCGGCCCACACGCTACTTGCTCAACATACCAGGCAACGTAAGGCAACCGACACTGTTGTTTAATTTCCGACCAAGCGTTCAGCGGATTCTTTAATCACTTAATATATACTACATGTAAAGTTCAGGCTATAATAAAGAGTTAAAATCCGGCTGGTCATTTCGGTTGAAACTAACACTTACTTAGAGGTTGGTTGCGAAACTGTATTGAAAGGATACTATGGTCCTAAACCCCAACTAAATATACTACATACGGTATATACTAATTCTCCAGCAGTTCATCATCACGGTGTTAATAGGATGTACCGTAGACCATAAATAGATGAGACATTTGGAAAACATAAGTGGGCGTGAACCTATAGAACTACTTTAATTGCAACCGTTTTTTCTTTGTAGCCGAGGTAAGCTTTATAATAAAGGGACATAATTCCATTATCGTAGCCTTCCGTAGCAAAAGCCCTGTCGACCATTTCAACCACAAGGTAGCGTTTATCCAATATGGGTTCGCTAGGATCCTGTGATAGTAGGGGTCCCGAACCTTACTTATGTTTTCACTTAGCCTATTCCCCCCGATTGCTAATGATCTCGATACGCAGTATAGCAAGATACGTTTTAGTTAAAAGAACTTGACCCAATTTCACTAAATGCTGTCTTATTAATTAGCAAGACTTTGTTCTTTGGATTTAACACTTACGTTAGCTCCCTAAATAACTCGGGATGAGTCACTAGGTTAACACCGAGCGCTAAGACAGTTAGAGCTGATTTGTCGGCAAGTTAAAACAGAAAGTATTCAAAAGACATACCTGGCCGCGAGTATAGGTAACGCATCAACATGCTGTAAAGAAGAATGCATGGGCTAATGGACGGACATGGGATTATAATTCATGATACTTACATACAGCTACATATTATATACGTAATAGGCATAGTTCAATATCATTGTTTCCAGTAACGACAATCGCACCGCATTAAAGTTTTTTCTCACTACCGATGCAACAAAGTTTAAAGCCTCATTAAACATTCCTACAAGCGAACGGAATCTTTTGCTCCATTTATATAGAACCTGTTCCCCTAATGATGGATGAAGTGTCTGGAAGATAGATTACTTTTCGCACAGCACCTGCAAGTCGCTTCAATCCCATCCATCCGCTCTAAAATCTAAACAGCAATACCAACAGCCGAGAACATTGTCAAAGGACTTGCCATCTTCTCATCCTGAATGCCGCTATTCGTTAAGTAAGCTCTAACTCGTTCCCAGTTCCACTCCGCTAAAACAATAGCGTGCTTGGTGAGTCCATCTTAGTACACTTCACTCAGTCTTACTCCCGCTTAATCTGCACCTATCGGCATGGATAGAAATAGGCGAGGTCAAAAGATCCTCGCCCCATTCACTACTGATTCGCCTTGACAAGTTTTAATTCAGAGTCTTTATAATCGTTACTAGACATCCTATCTTTTAACCCTAAATATTTTTACATTTTATTCGCCTATGTTACATTCAAGCTGTTATACCAAGGGAAAGCTCTCTCCTTTTCCTTTACTGATAGCCCCCTCCTTCTCAGTAGATTTAACTTTCGCGCCCTCGTAGTTTGTCGGGCCTGAATTAACCAACCACGTCATTTGTGAGTCATAACTTCCGGTAAAATAGAAGATATGTTTGTCACCTCAGTGTACCCAAACCGTCAGATACTTTTGATAGCGCTACGATCATAATTATATATGTTGGCCTCTCACCGTATTACACCTTTATTTAAAAACTTTTATATTATACAAAAAAAGGAAAAATTCACTACACAGTACTATTCCTATCATCTTCCAACTTGAAAATTACCAACCTAGTAAGGAGTTCCAATACACACAGAATCCGTACGGTCCCATTCATTATTAACTCACCTAAAGAATTTGAATGTTGACCGGCCTCCACAGTGTAGACAAGTGCCGTGCAGTCGAGGCATTAAGACGACCCAAATCTCGCTGACTCCATATTCACCATGCATTTCTCCTTATAGACTAATACATTCATGCTATAGGAAGAAATTAAATATATGTTGGTGATCTACTCCCTATCATGATGCTATACGGCTGACCTTCCATTGCGCTCATAGGGATGTGGGATTACAGACGTATATCCGAAGATAAACTCATAAAACAAAATGTCATATCACATAGAAACACATCACTATACAGTTAGACCCAAAAACCGAAAATTATCATATGGAAACAAGCCTGAGCTGACTCATAAGTCTCACGAGTCGGCATTTGAGCAACCATAACACGATAAACGCGCACTCTATAACTGATTTTAGTTAATGAATCCCTCTCCCCTTCTCTACGAATCCCACGGAGGAAATTCCATAATACGAACGCGACTACTACGCAAGCGAAACTTGTTCGAACCCACAACATGGGTTAACCTTTATCACAGCTCACTCTACCTCGAAACAACACTTTAGGCAAACGAATAATCGTACACAATACGCGAGATTGAGCAAACCTTCCCATTAAAAATTCACGTGAAGTCCCTCCTTCATACAAATATCCAAGCAGTCGTTTACCCCTAATTTTCATCATTCACCCTCTTGGCTAGAATCATACCCGAACAAAGTTCCCTAACCTCATTACATCGTCGATCTCCTAAGTAGAAACTCATCTGCCTCCAAGCAACACTATTCCCCTGTCTGTTCGAACAATACTCAATATTCTCTGAAAACCTGGACTCAGGGACCACGAGACATCGTGCATTAAACTCCCCCTAGTTTTCTTCTGCCCAGTCGTTGTATATGATCAGATACCACATACTACCGATATAGAGTCGTAAAACGCCTAGTTATAAGGAAGACGTACACTCTCAGTCGATCTCTCTACCACAAATGTGTCGTTATCAGTAGCGAGCTATTATATGATCAGCCCATAAGCTTTTAATACAGTGATTGAAATAATCACGCACGTTACTTCCATATGATACACCGAGGAATGAACCTAAGGTCAACTGTATTACTTACGAAGAACCGAAAAAACACCATACAATGGAGGCATATATAAGGGCCCAATTTAAGTGCAAGCGTTTGTTACCAGTAAGTGGCCACGACCTGGACCCAACACACGCCCTGCCACATTACAAAACGGACCTGATAAAACAGAATAGACGCGCAGGCCCGGATGGACTACCGTAGTAAGCCGAAACAACCTGTTAGGCAACCCAAAGAAGCAAGCAACGAAAGCCAAAAAGGCTCACGCATCGGTAGAGTTAAAAAGGTATATACTGGCTAACACCTCTCACTACCACGTCATGACAACAGCAAGCCCAAATATGCCGAATCACTAAGAATTTTGATTATCTTTCGTATCAGTCCGACTCTTCCTTTTAATTAACGGGGGCCTCCACTGGCCGTGTGAATACTGTGTCAACCACTAAGACCTGCTAAGACCGCCAAGTGGAACGCAAGAAAAGGTACTTAAGGAGAAACCCTTACTACCAGAGCTCACCAAAGTGTCGCGAAAATTGTCAGCATCTTCTAGTGTTTACGAGTCACTAAATACAATATCATATTGTTTTATAAATAAATGATAAATTTTTTTTGTTCACTACGTGGAATATGTCCTCACTTAAGTTCTCCTGGATGGATCTATTGCGGTGGAAAGGCCAGCAACGTATTTCCAGCCTGTATAGTCCTGCTTATAACCAAATCCGCTTAATCAACCGGGCGGTATGAACACGGTAGGGAAACCAACACCATGAGACAATGAAGTAGTATAACTCCAAAATACCTTATCATTTATGAAGCAATCGAAATACCTGAGTGCAGTTTCCACTTCACGGACCCACTGATCGTTGATGTGCAAGTTAACCATTGACAGGTATACTTTATCCCCGTTACCTTTTAGAGGGTGAGATAAGTTTCGATATGGGCCTGTTATAGAGTGCATTCGTAAATATCATGTTCTATAACTAAATTAGGTACGTAACATAGTCGCTTTGTCCAACGTGTCGCTTATTTTCTATATCATGCTCAATTCGAGGCCCAAGCCAACTTCAGTCTTCCACCTTACTATGTTGCCGCAAATCACGCATATCATATGTCTCAATATCGCACGGTAAGTCAATAATGTGGAAGATCTATGCCTAACGGAAGAATATAGCCTATCTTTCCGATACACCATATCCTAATAAACAAATGCTAAGCTGCTCTGGATACAGAAACCGTACATTCTTTCAAATTCAGTTGGCCCGGACACTAGCTCAGGGGCTAATAAATACAGCTTTTGTCCAACAAGCGCCACGTATCATAGTGTTAACCTATTCAGAACAACAAGAACAACACGTCCTACCTAATCCTTTTCCGGCCCGAAGGGCAAGTAACCAGGCAGCACCGCATGTTACTCTATCAAACTGATCAGTATCACAATCATAATACCCAATGTCCATGCAGAAACCAAGAGCCATTAACTATCCGTCCTCGAAGAAACATGGACAGATCTCAAATACCAAAAGTGCATGACGAAGTTGAAGCCATGAGATTTCATCTCTAGCGTGCTTCGCCACATTATCTGTTCCTAATCAACCACAATCGACTGGGGGTCCTACGAAGCATGCTAAAGACCCGAAAGCCCTTGGACGCCAAGATTTACTAACTGTCAATAAATAGAAGCGTCTCCTACTTTACTGAATAGTGCGCTTGTGTCGGTAGATTCGTTGCGAAAATGTAATGATGCATGAATGGCGGGTATAGTTTTGACGCCTCTATTTACATATAATTTTTCCGTATTGTTGAATCAGTGGTTAGAAAAGAAGTTGGAGAATGACTAACAGCTGTATCTTGATTTCAAGGTGAGTTATAGTGTGTTGTTATCAACAACGACAGGACCATGGCAGCTACGCGTAAGTCTTTCCGACTTGCGTAAAGATATCACTATCTATTGTCCACGATAGACAATTCTCGAAAACGATAGAGGATCAGAGTACGAAGGAGGGAGAGCGGTTCCGGTCAGATATATTTACACGTGAAAATGGTCATGGTATTAATTAAAGATGCAATAATTCCCTGCAGACGACCTAATAGAATCATGGGAATCAGAGCCACAACTAAGAAAGTTTTGGGGACGTGTGCCCATTGAATAGGACTAGAATTAATTTGCGGGTCGAGAAGAGGGAAATAGGGAAATTCAGCAACCATAGGTATAGTTAATATATTCACTGTGCGGGTACAAGTCAAAATTCTCCCGATAAGCCTATCTCCATTTGTAAAGCAAATTCTAATAAAAGTTGTAAATGCCTCTTACCCTAAGTTAAGACCTCGCGGTATTTAAAATCTGTTCCTACGAAACGAATCGACGATGCAGGCGCCTGAATAAGATATAGCCGGAGGTACGGGATGATAAAGAACCAACCTTCGACAGCAAT$TACATAACAATCCTTTCAATTCCTTACAAATGTCCTACACTAGTCTATCCTATATTCCTGAGAAGAGTTTATAAAATTACTAAACTTTGGCTCTTAAAAGCTACTCCTAAAAAAATCATGACCTTAATTCCTCTGGATTAGTAACAATTACATGAGTAATTAATGGTTAAAAATAGGTATGGCCAACATTAGAATCTGACGCACTAATCACAGGACGACCGTAAAAAAAATTCTAGGCCTAAAAGTATGGCCACAACACGAAAGTCATTTGACACATATGTTACAATCATCTAATCCGCTGCGATTAACTTCGTCGACACGGGTACACCGTATTTTGCGGTTTATCTAATTTAATGAACTTGGTAACCTAAATAAAATCGTATGTAACCCGTCTTATGCGGCTTACACTAAATGATAAAATTAGCCCTCAGAGGTATATTCCGTAGCTTGTGTACTTGGGTGTGATAAATGACGAACGACCACAAGCAAAACTTACTAAGATTATAAAATTATTAAGTATCTAATTAGACAGACTACGCAAAGAAAGGAGAACTCTGAAAAGAACATAAGACGCCAAATAATATATACTATGTGCGCGATCATAGGTTCCGCTACGAATGTTAGTAATCCGAATGACGGATAATCAAACGTAGCAAAACCTCCCCCTGCGTCCCAGTGTATATAGTAGCAGACAGTGCTCCCGCGCCATATTGACTCACTTTTGAACGAGAATGTAGCGGACAAGCACAACGTTGAAACGTACTTGATTCTTATATAATTATCTAGCAAACTGGGAAAGACGCACCGGTGCTGTAGTTCATCTATGGTGAAACAGTATTAGTCATACGTAGTTCGTATAAGTAGGGGGTGAAATAATAAATAATGAATTGATAAATATATCTAAAATAGAACAAAAAAATAAAACATCATTAGCACACAAGTGTAGACAATAGGGTCAACAACGATACAATAACTCCATTATGTAATGAAGTTCTCCATAATACGCCCAAAACAAACAACGATTTTGTCTTGAGTAGATAAAATGCCTAGGCCATGACCTAGACATTGAAGATTTTTGATGATGCGATTAAGAAAAGAAGAAGTAACCTTATTTCATTTAACTTGAGAAACTTGAATATGAGATACTCTAATTGTAGTCTTATACCTGGCTGTCCTTCACACCAATTAATGAAAATGAACGTATACTCACATATTTAAATAATATGAAACTTCTAGTTTAAAAGTTCCGCTTGGGCATAATAACTTAATGAATTAGGAGAAATACGCTTAAATCTCTCAGGTTCGAAGTTAACAAAGGTTACGATGAAAATAGAAGGTATTATATTGAAAATAAAAAGATACACGATGGTATCTATAACAGACTACCGACAAACCGAAATATTTAAAATCGGAATCACAATCAGAAATAAAGTACAGGCCGACTGCTAAATATACTACCGAGCACACTCGCCAAGGTGAACATCTGACGGGAAGGCCAGAAAATTAGTCCTCAGCCAAAAATAATGTTCGATATGTTAAAAGAGTACGCTACAGTAATATTTCGAAGTTTAGCTAAAATAGTTTCTACGTGTGGTTGCAGAAGATAATCAGTACATGCTACCCAGGAGAACTCGAAGAAAAGAATAGTCACCTGAGTATACTATGAATAAAGTTCGGACTATAAGGGATCCCATAGTTAAACGAATCATAATGTGTACGAACTTTCTACCGAATGTAACTGCGTAATACGGAAAAACACATTACCATATTGGGACTACAATTAGCCAATTTAGGGTGTGATAGGTAGGCAACTAATTAAGGTCCAGACGTGGGCCCTAAACACTTACATAAGGTAGAACCAAGTCCGTGAAGACATCAAATACTAGATGAACTGCTAGGATAATATAAGTCATGGTGAAGTGCAAAGAAACACGAGAAGCTAGAATAAAAATTCAGAACATGAGTTTTGAAGATATGCTGGACGCGGTCTTTTAGAAAAACAAAGGTGCAAATTATTGTCCCTTAGGAGGTCATTTGTAGTGCGCCCTATAACCTCAAGAGGAATTAACACTGGGCACAACGTAACTCAAGACTTTTTGAAATGTCTTGAGGGAGAATTTCAATCGTACTATTTAGTTGTATCTCATTGCATACGCCCAAAACATTGTCAACTTTAAATAAAACTATTAAGAGAGGCACAACAAAAGTTTCACAGCGGTAGCAATATAATAAAATCTACTCAGTCAACAATGCCTAAATCTGATGGCTAAGTTAGAATGACAACAACGAAGGCGATGCGGAATCCTTTTTATAGCAGACATCATAACTCTATCCTGTTCAACATAACATAGCTTGTCAAATAGAATCTTTAAAACATCGGTTTGTAAGAGGCCACAGGTATTCAGAAGGAAAGAAGAGAAAAAAAATACTGGACCAAGCAAGTACATTCATTCAATTAGCCGATCCTTACACTTCAAATTAAAATAATGTGGATAAACGTTAGCGGTTCTTAATCATTGATTTTAGACAGAGCGCTTTATATTCCCAGATAACAATGAGAAATTAATATACAACAATGGCCTAATATTCTAGATCTGACTTTCATAAGCATGATGAACTCGAATTACGCATGTAAATATCATAAAGACTTATAAGCACATCTCTAAGAATACACGTATAATGAGGAAATCATGATAGAACAAAGTTCCAAAACGCAAGTCTCAGGAATTTCGGACAGTTTAATAAAGATCCAAAATATCGATTGAAATAAGAAACCCTGAGTCGTCACCTAACTATGCCCGAGGTGGCCATTCTAACTAAAAAAAATAATGTTATGACCTTACAGCTACCCTATGGTATACACTACTTATGATTCAGAGTAATAGAAATTGATAAGTATGTCCAGATATCAGCTAACAACAAGGAGAAAAAGTGATGTGAAAGACTTAATGAGCAGGTCAATGGTACTTGGAGAACAAGAGACAACTATAAGTAAAGTGTGACAGGGAAGGATGTAAGGAAACAAGGGTTTAGAATAAATAGATTCGACTTAAGTAAAATTCGGATAAAGAACATTTACAGAATAGTAGTACAAGAAAATACATAAAAAAGTAAAAGGGGTAAAAATAGGTTTTGAAACAATTAGGTGGTCTGTGGATCTTAAGACAGCAGCAAACACTTTTAGAAGATTGATCCCATAGAAAATACAGAATTTTAAAACATATTACAAAGAGAAGCTGAGAACTGCATCAAAGAGACAAAATAGATTAATGGAAATTTATAAAAGTGTCAAATAATCTATATGATTCGAATATAGAGAAGAGGTCGACTCAATCGAATCAGAAGAATGCTACGGAGCATCACGCCTAAATTCTAAGAGTTAGTGAGGAGATCTAATATATGCTCCAACAATTGAACCAAAAAATAAGAGTTTAGATACGGTGACTAAAAGAATGAATTACTGCAAGTGGAGTCCGGTTTTTGAATCGCGGATGATCGGTGCACCATAAATAAAACAACCTGTATGATACAATAAGGCTACGAGAAACCTCGAAAATACCAACGTAGTTTGCCAATTTACGGGAATAAGCGAACGAACTGAATCGTATGAGCCAACTAATTGTGATGTATTATGAGAATTACATAGATAACGAATATCGAAGGATCGCATCGAAAAGAATGATGATAATAGTGCGAAGAATAAGATGGAAGTCGGAAAAAACAAGATAGCGTGGCAGCATAAGAATAAAGTCGATCACGCTAGAAAAAGCAGTGACAAAGGATCATCAATTCGTCATGATAGACACATTTATATGTTCTCAGTGTTACTGACAACTCTAATGATACACTGGATAACTTACATCTGAAACATAGAACCTACCTGATTAGAGAGTAGTCGCTATCTAAACAGCGGTGCAATATTACGTAGTAGCTTATTGGATGAAATTAATATTCGTTGAAGACCCAATATCTCACATGAAAGTGACTGAGAAAAACTGGGATTTTAATCAAAGTAGGACATAGATAAAGGTGCGTTTAAGGGGATTCAAAATTCGCAAGCCTATCGACGCAAAAATGCATAAAACATCGGCAGTAGTAAATATGAACTAATGACTATTGCGAAAGTGACCGGTAATAGAACTTTAGCTGAGCATCTTCTAACAAGTGACGCAGCAATTGTAATAGTAAGGTACTTGCTCGTCACCTGAAACGACCCATACTACATGTTCAAGCTGTGGATAGATCGTGAACCAGTTGGTTATGATCCAACTGGGCATTTCTTACAGGTCGGCACACTATGCCAGTGGACAGAGTAGAAAAACATAACGAAGAAAAACTACAGTCAGCTTCTTGTATAACTCCCCTAAGTGTCTATAGGAGGTCTAAAACGATCTCTGAAAACAATCAGTCGCGGAGTGTTGTAGACTCAAATACTACGAGTAACGATTGGCGCGTCAGTTCGAGAATTGTAAAGTCACTGAGATTATCAAGACGGTTAGAGGAAATCGTTCATCGGCAAGCATGGACATATGTAGTAAGTATATGAGGGACGTGGACAGCACGAAGCTCAATACGCTCTCCGTAAGAATGCTACACTTGTTGTTTGCGGAGATTTTGTGTTAGTTTGTTGTGGGTCTGGTATCGGACTGGGTAAGGATACATTAAAGACCAGAAGCTTGATTGAACGCAGTACTCAAAGGATATGGCCATAATCTACGATAGCAGCGCAAGAATGACGGAATAAATAAAAATTAAACAGACACTTGAGAATTCTGTAGAGCTAAAACGTGCATCGATTCATCGAAACTCTGTCAGAACAGAAGCTCACTCCCACCTATCATAATGGCCACTAAAAACTATATTGTCCGAAGGGAGATTAAATATGTAAGGGACTGTGGAACCGCAAGTAGACCAGTGTTGACTACTATGTAATGCTTCATTTTTGATGGGTTATAGTAAAGCTTTATCTATATGTCATCTTATACCTGTCGCGTAACTGGAATAAAAGCCACCTATGAACAAAAGTTGGCCCTTAATTAAAGTGAGTCTTCTACGTTTTAGATAGGATCGGACAAAAAAAACAACGCTCTAGATAACTAGATTGAACGGAGTGGTACCAGAGGTTAGGATGTAGAAGGGTCACATCTGGAAGGTGCGGGGTAGCGATCGGACTGTCGATAGGGCGGTCTGCGCGCTGATACCATGGCGGCGGGCACATTGGTAGGAGGAGCATGGACGGTGTAAGATTATACAAGTGAGACTATAGGGCACGTAGATAGTGGTTAATAGTTTATATGGAGACAGGAAGGGATGTGCAGGGGCGATCGAAAGGAGTCACCATGGAGAAAAGTGAGACTAAAGTGAGGAAAATATAGGGTCGGATAGGAGGAAGGCTATGGCAAGGAATGAAAGATACGAATTCTTCCTGAAGATTGAAACAGTCTGGATTAGAATGGGTTACATGGGTGACGGACTCGGACCGGGGGTGAGCTAATTAATAATGCGATGGAGGTAACATTGAAGCACCCTGATAGAAGACTTTAAAACGTATTTATCGAATTCACAATCTATCCAGCAAGCTTTATGGACTAAGTGTAGTATCTCTATGATGACTTTTAAATTAATTATGGAGTGAAGAGATACGAGGATTAAGAGTCACGAAATAGTATCTAAGCATGGGTTTTGGGTTATGATCAAGCCATTGAACATATGCGTTACAAGAAACGATAGACGTGTTGATGAAGGGGTGCACACTATCACGGGGAATCTTTCTAAAAGTAGGTCAAAATATCATGCATGGTAAATCTGTATCTTTAACATAACGATTCAGTTATTCACTAGCCCCCTTGTACAGATTAGAGGTTGCTAGACGATAGTTTGTCTAATCTTGCTGAATAATTAAGTTCGTAGTATTGAATCGTACACATTAGTGCTTGTCAGTTCCTCCAGGAAACCGGATGAGATAAATATCAGAAAAGGGGTTGAATAAGAACTTAGTAGGATAGGTTTGGATCTTAATACATACCGTTGAAATGAATTTGTAAGAGTGAGTCCGTTAGTCAATTGAGTAAATTCTCTATATACGAAACGGTACTCTAAAGATCACCATCTGTGTAGTAAAAACTATCTGGTGAAGCGGTCTATGGCCGGACAAAAGAGGGCTTAAATATACGACACAAAACAGAATGGTGGAACCCTCTAAACGCGGCAGGGCCGGGCGATTTAGGCCTTCTAGATTGAGCTCTTTTGTAATTAATCAAGAGATTAATTTATGATTATGTGGAAAAAGTTTAAGATTTATATAAAAAATTTAATAATAAGAAATTTAGTTTAAATATTTTAGACATGCAGATTGAAGAAATTCGTAACTATAATGTTAGTCAGTACATTATGTACATAAAAGTTGTATATAGTACGCGATCTGATTCTGGTTTGATGCCCTTTATAAAGTGATGACTGGGTGAACATTTGTGGCGATATTTTGTCAGGGAGTGCCATGTTAGCAGATTAAGATTTTTTAAGCTGAATTTTTATTCGTATTGGGTTAGCTAATACTACTTATATTATTTGTAATAATAATAAAAATAATGACAAAAATATAAACAAGTATATTAATTAATTGTTTAAGATATAAATATAGAATAGCGGAGTGCGTTATTAAAATAACAGTTCAAAAGGTTTATAGAGAAGAAATAGATTGTTAGAATATTTAGAATAGAACTAGAGTGTTGTAATAAGAATGTTAATTAATTTAACGTTGTGGGTGAAATTTGATAGGGTCGTAGTTTTTTAAAGTAGATGGGGAGAATAAAGAGTAGTGGGTTAATAATGGAATGGATTAATTGTGTAATAGTTTTTTTATTGAGAATAGAAGTTTCTTCGGTTAAGATAAAAATTTTTTGTATTTAATTTGGTATTAAGATATTTAAGATATAGCAATTTATGAAGGGTGTTTCAAATAGTCTTGTTCTTTATTGTTGAAGGGGATTAGTAGGGAACCACTTGGAATATAACAAGTACAATAATAGGTCCTGGTATTTGAAAGTCTTTAAATAAAATTTCTTGTAAGTTGTATAAATTTATAAAATAATTATAGGTATTGTAGTTTAAGTTTTAGTGAATTAACGTAGACTGATGAATAGTATAATATGAATATAAGGAGAATAATGTATCTATGTAGAAAGGGTAAGATGAAAAATTGGAAAAAAAGGAGTTTTTATTAAATGGATAGGAAGGAATAGAATGGAATTGACTTAAAAGTCTATGTTAAGGGTTGACTATTAATGATTTATACGACAATAGCAGGAAAGAAAGATGTATAGTCAATTTTGTATAAAAAAGGTATGGAGGAATAATTTATATATAATCTAAATCAGTATTAAAATATGTGTAATTGAGTATAAGGATAGAAAAATTTGATTTATTTTTAATTACGTTATTTCTTAGGTCCGCGTTTTGCAATGTAATATTTTCATATTTTCAACAATATCTTTGTACATACTAGGATAACTAGAGAAACATATAATTAGTTTCAAACCAAATAGCTATCATTTCAATTATTTATTTGTATTTTGAGTATTGATTTAGTTATAATTATATTTATTTGATTTTTTAATTATATAATGTAAGATAAAATTCTATCAATGTAATAGATAATGAGTAATTTTGTATATATTCTATTTGTAGTAGAGTTATTATTCTATATTAATTGTTTTTTAATTATTAATTTGGTATTTTTTATATGTGACATATTTTTTAATCTTAAAAATTTATATAAAATTGTATTTTATTGGGTATAAAAGAAGTTCTATGATCCATTATCTAAATTTGATGAAATGCGGTTAATAATAATGCATAGTGATATGCTTAAGTGCCGGTGATGATTTGTCATCTTATCTTATTCTTCTTAATAGTTGTTTTGGTTTAATGAAATATCGAGATCAATCGGCTTTTCTGAATTAGCTATTGTTCTTTATCGGCTAATATTTGAAGAGTGTGGAAATATGTCTATTTTTTTGCTGTTATCATAGTGGGTATCGGTAAGCCTATACTGGGTTTGATTATTTGGTTTTTATATTCTAATTTTTGGGGCGAATTTTGTATTGGTTGTTTGATTTTCAATTTTATCTCTATGTACAGTATAGACCATATGGAGTCATGTCGTGGTAACCGACTTTGGTCTAAATTGACTAGTGTAACATGTGTAGTCGTACGTGTGTCTATTATTATGATAGGTTTTCATCTAAGATGTGGTTGTTTTACTGATTCTTTTATTATGGTTATTCGAATTTGATAAATATAGGTTCAATGGTAGTGATAAATCGTTTTTACGAGTGTGATTTTACGTTTTGTTGAGTTTTTTTCTTATCAGAATGAGATTTTTCTTTTCTTAAAGTTTTGAATTAATTTAATTATAATAGATCCCTTTGGCTTAGCAGTAACTCGGTTACTTAATTACTTGTGATGGTGGTACTATTACTTACTATTTATAGAAAGTTTGCAGGTGATAATTTAAATGTATTTACTGTTTTGAATTTTTATTTATATTACAGTAAGTTGGACGTTTACTTAACCATGTGGTGTTATTAGAAATATTCTATGCATAAGTGTTATGAATTTTTTGATAGGCTATTTATTGGTTGTAGTTCTTTTAGTTTAGGTTTTAATCTTATATTTATATTTTTTAGTGCTATGTTTAAATGAGTTTTTTTTCTAGTGATAATATTTTTGTAATTTTGTAAATTAAGTTATTTTTAGTTGTTTTACTGGCTCACTTTAGGTTTTAATAGAATTAAGTAGAAAATGTAATATATTTTATGTGTAGTATTTAACACTTGTTTTGATTAACAGGATTGTTGTAAGTATAAGTAAATGTTTATTTGATGATGGTTTATTGAGTTAGATTGTCTTGATGCCTTAGCAGGCATTATGGTAGAGGTTTAATTATATAATGTATTGTAGAACATATAGTTATTTGTTGTTAGTTATCAGTTGTTTGGATTAGATGTTTTCGGAGTATTTATTTAATTATTTATTTTTTTTTTGTTATGAAATTTTTTTTATTGGTCAAATTTTGCTTTTAACCTCAATAAATACGAATATAAAGTTGTGCATATTTTATTTTGAATTTTTTGAATTATTTCCATATGGATTTTTATTTGGTTGTTTAAAAATTTAAATGGGTATATATAAATAATGTTTATAATTGAAATATTTTGAGTAAATACGTGTTAATCTGAAATTCATAATATTATTACAATTTTATGTTATGTGATTTTTACTTTCATTATACATGAATTGAGATTGTATCGTATTGTATGTTTTCACTTGTTTGGTCAGTATAGATATTTTTCTTTCATGTATATCCTTAGTTATCTTATTATAATGAAACGATTAGTTTTTCACTTTAGAATTTTTTCTGTATGCGACTATCATATCATTGTTTATAAGTTTGGTATAATAAGGTTAGGATATCAATATATAGAATTTATAAGACTTTTTTTGTTTATAATTTATCTGGCTTTTTATTGAATTTCGTTAGAGTATAGTTAAAAATAATTTTATTAAAAATTCTGTATGTCGTTTATATGAACGAAATGTAATATTTAAATGACACGTTTTTATCAATTAATAAGGCTAAACTTATTAATATTAATTACATTGTAAAACTTTATTATTCTTATTTATTATTATGACTATTTTTTTTCAGTCAATTTGGTTTCATATTTTGTATCTTCTGTTGTGATTTACTTGATGGAATTTTCACATTATTTACGAGTTTATTATTTATTTTTGTTTATTAAAATATCTATAACTAATATAATGTTTTACTAATTTATATTTTATGGCTATAGTTTTTTATTCTCTTTTATAATTTTTATGTTTCTACCTTTTTTTTTGTTGTGTTATTATTAATTCTGATTATGTTTGTATAGGGTATATGGATATTTTAGGTATTATCATTGTTTTTTGATATATTGTTATATTTTTTTGATTAGTTAATTCCATGAGTTAATACTATTTAAAAATGTTTATATGATTAATTTACGTAAACTAGATTATTTTGTTAGATAATTTGATTAATTTAACTGAATGATAAGGTTATGGTATTAGTGAATAGCCACGGTTATGTGAGTTCGAGTTGGGAATGTTTCGTATTTAGGGGTGTTTTTTGATATGCGTCAAGGTAATGCTGGGTGGCAGCATCGTTGCTAGACGTGGAGGATACTTCTGGATGGTTAAGGGGTTATTAGAGGGATGAGTAATTTGTGAGAAGTTGAAGGTTTATTATTGTGAGTATCGGTAACATTTTTAATTATCTGGAATTCTGTGTGAGTGAGTTATTTGTTACTCGGGTTTAAGTATCTTTTGTAAATATTTTTTATAATAGTATTTCTTGTATTGTTTTTTTGATATGGGGTATATGTTTTATACATTCATATATAGTAGTCTTGTATACATGTTCTTTACGAGCCAGTGTTTTGTTTTTGTCGTTTCACGTATCATCTTCTTGTATGGTGTATTTATAGTAGTTGGGTATATTAATAGTAAATGCGGTTTATCGAATGTATAGCGGGTTGGATTGCAAGGATTTGATAATCTGACGTCAATGTTATTTATTATTTAACATATTTATTGTTTCTAATTTTATAAGGATAATATATCTCTATTTTAATGGTTGTTGAAATTTACGTATGGAATTTAAGATTTATTTCTCGTTGTGAAATTATTCTGATTTTTAGTATTCTTTGTATTATGAATAATCAATATTGCGCTCTTTCTTATATTTATTTTTTTTTCATTTATATTATTCTGACACAATGTTTTTTAATAACTACATAATTATTTTATTTGGTTTTAAGTGGTTGAAATAGCGCATTTAGGACGACGGGATTTATAAAGATATCTGGATTCGGATTGTTTTGGATGCTAAATCTGAGGTTTTTGGTAGTTCTCTTGGAGTGTCTGGGAGTTATAGGATGGGGGGGGGGGTTATAATTAGTAGTATATTAGTACGGAAGCTAAACGCGCAGTTTAGCTCAAAACTAATTAAGCTTCGCGTAATATAATTCCATGATGGTACTGGAGAATGTGTCGAGAACTAATATTTTAGTGGTTGCTTGTAGCAATCAGCTGTTGGTTTAACTGTGTATCTTATTTAAAATTTTAGTACGTTATGGCGTTTCGTGATGGTTCCTGGCAGCTTTGACTAGAGAGGAATATGTCAGATATTTCAGTTTGTCGTGCAGCGCAAGGATTGTTTTGAGGTACTGATGAAAGTATCGAAGATACATTATTTAGTTCAGGAAAAATAAATTTTCATAGTTTTTAAGAGGTAAGATGAGTGATTTTGGTGTGATCCTGCGGTTGGAGCCTAGCTGGGCGTGGAGGGCGGTGACAGGAGTGCATAGTATTGTATTATCTTTTTTTTTTTATTTATTAATTATTTGTCATCTTTTTTTAATTAATTATATAAGTTAATAAAAGATGTTTGTAGTTGTTTTTAATTTTAGTTAGAATTATTTTGTGTTTTTTAGTTCTAATTATTGCAGAGCTATATGGGTATTTTTGTTTGTACGGTTGGTTAGCTGGTTAATATTGGGTTTATTTTGGATGGGCCGGGCTTATGGGACTAATTTAGTCGCTGGAGCTAGTTTCGGGTAGTTTTACGCTTTGTGTATTTTAAGATCATATAATCGTTAATATTTTTTCTTTAAGTACGATATTTTGGCGATTATAATGTTTATGATACATCATTAGCGGAATAATTTTTTTTTCAAGATTAAGGTTAGAGTGTGGTATAAGTTGTTTTTGTTTTATTTGATAGTTGTGTTAGTGTGTTCTTTATAAACATAGGATTTGATGAGAGTATATATCATTATTTTTTAATGATGAGTATAGGTGTTTAGTGTTTCATTAACAAGAAGGAGTACTGGTGTTTGATCTCGTTATATATGTTGAATTAGATTGAGATCGGAACAGAAATTTGTGTAGTCTTATGATGTTGTACGTTTGTAAATATTATGATGTTGTGTTTGCATTGGGTGTATAGCTTAATGATTACAGCTAAATTATTCGTGTATTGTCTAGATTTTTGGGATAAGTAAGTGCTTGGCTCTTACTCACCCCTTAAATTTCTTAGTCGTGGCAAGTTGTTTTTCGCTGAACTATTTTATGTACCCTTGGGATTTGATTTTACGATCCACCATGGAACTTTTTTTGAAGGCCTCCACATCACCTCCTGACTCAACATCCCACTTGCAATTTTCTAATCTCTACGCAATCCGGGCCTTGAACCTCGTTTTGCATGTGTGCCGACATAGTGCTACCTGTTGGAATTACTTACTACTAAACATTCAATTCTCCTGTGTGACGGCGCTTTATAGTAAGAAACCTTCCCAGCGAGTATCCAAATAAAAATAGTGTTAGCTCCTGCAGCCTGCTCAGCGGTAGATGTTACTTTTATTTTTTGCCAGCCAAGGAACTTTTTAGTGTATATGTGTCGGGCATAAATTACGCGTTCTATACGTTACCGCTTTGTTGCCAATGTCCTTTAATTTTTTCTTTAGACTTATGTATCGTTCCAACTGACTGGTTGTTCTTTCTGCTTAACGCTTCTTTGCGAAACATTGAATACTGATTGTGTTCTAAACTCCAAGCTCATCTTTAGCGATGCTATACTCAATACCTCTTCTAACCCGACATCGTGTTGTTTTTAGTGGCCACTTGCTGCCTGGCTTATCTGTCGTCAGCCATGTTTCTTCATAAAGGACATGAGGCGCCTCTCAAGCACGCTTTATGTTAAGTCCTCTACCTGAACAGGATTTTCTAATCTCTTTATTCTCGATTGGAGGTAGTATGCTTGGTGTTTAGGAGTCCTTATCATTCAGGTCCTTGTTCTATAGTGGCACGTCTCTAGCGGTCGCTCCCTTATTTTGTTTAGTGTTATTCCCAGGTTCGAATAGGAATTTGTTCTGCTTACTTGAAATATGACTTGTATTCACTCGACTAAGAGTCGTACACTAGGCTCTTTACTTGTTGTCCTTGGAAATTCTTAAGGGGTGTGATGCCTTCTCTGATGTTCTGCAGAACATGTTCTTACTGACGGCCTATTATAATTATTTTTGCCTCCCTATTGTTTATAGATCGATTTAAATTTTAAAGTTCAATGTTTTCGCTGTATGGTTTCTAGTTGGGGTTCGTAAGTGGTTCGCAGCTGGATTACTCTTTCCCTCGTTATTCCTTCGTACTGTTGCCCAAGGTCCTTACACTCACCTACGGTTGTCTCCCGTCGAAGTTTCCCTGTACTCTTTTATGCAGCGCGTGCTACATCCGCTCGGTACGAGTTGGGTGGCCTCGCTTAGAAATAGGTGTGTTTACTTTATCCTTCATTCGGTCGTGGCGTTACTCTGTAATCCTTATTTCGTGATTGTGCCGTGTGGTCTGGTGCTATGTTCTAATTTGTTTCCAAAGTGGCTCTTGTGGACGTCTTGTAGGTCGAAGATTGCGGGCTTCCATTAAAATCGCTTATATTCTCTGCAGGCCTATGATATTCACGTTGTGAGGACGCAGAAACTGTTGATTGGCGAACAGTCTCAAGTAGAGACAATGAGTCTATGATCGCTAGCCTTTACAGCGTGTTACCTCGGTTAGGGTCTATGAATGTTTCTTGACGACCTGACGTAGGTTATGTCTGCAGCAAGTTTGTTTCCTCAAACACTGGCCCAATGGCCATTGCTGTGCGTTGAGTTACCCATTTGATCCCGGAGTGTCGTCCTTAGCTATTCCGAGTGTGCAGGGTGCCACGTTAGAGATGATGGAGACGGAGACAGCTGGCTGCCACGAGCGAGGGCAGGTCTTATCTAAATATGCGAGCCGTTTCTGTCCCTGTTATCTATTTCTATGTACCTTAGGAGTTTTACTTACTCCCGGCTCGCTTTTGTCGGTATGCTTCCGTCCCAGCCCCTTCTCATAATTGCTGCAAATAAGCTGCCCCGAGTAATTTCGATTATTCTTATAATTACTTCCCGCCAACCTTAACACAACTCAGCTGATTTTACTAGTTGCTTTTTCTTTCTTTTGTTTTTTGCTTCCGTACCTTTCCGCTACTACCTAATGGTTCTTATATTTTGCTCATGCTTTTGTAGGCCCCCATGTTCTGCCGATACATTCAGGTCTTCTTTGAAAAGCTTTTACTATTGTTTCATTGGGTATCACTCCTAAGCTGCTTATCTATTACCTTTGTCTATCTTGACGCTTGTTCTTATCTACTTTTTAGTGCTTCGAGCCCCCAGCCATGATTTCTCACTCCAACACCTCTATGTTTCATTAATTCGTGTTTTTCTGCCCTAGCACTTCCCCAGTTTATCCTGTCGCATTTTCCTAACGGCACTCGTCGCTGTAGATGCCCGAACACTCCTTGCGTATCCAGTTAGTCGTACTTGATATTTCGGGTCTCTTTTAATCGCATTATGCCCCCTCATGTGCTCATAAACATACGAATCTGATTGCGCCATCTCATATCCACCCATGCTGCGTGGAAAGTGCGGGATCATCCGAGTCCTGTTACCTGTTTCTGCTACTCTTATTTTTTCATTTCACTATTTCAAATTTAATAGCGACGCTCATACATCGCAGTTATCCATACCAGTGCAGCCTACGCTTCCTCCCCCTTCTAGTCATTGTAGTCTGGTCTATTCTTTCTGATATAAGACTGTTGGAGTATTAATTATTGCATGTCAAGATGCAGTGAATGGAGATATGTTTGTCCGGCTATTCATGTTACCGCACTTACATCCGAAAATATCTTATTTAGAACCGTCTTGTTCGTCTAAGGAATAGACAAATCTTAGCAAACGACAACAAGAACCGCTCCTCTCTTTATACCTCTTATGTGCAGGTTTTTACCCGAGTATACCATACGAGGCAAAGGTTTTTATCTACAGATGTTCGTATTGTCAATATCCAAGATCTGCCCAGTCCTTCCCCGGTTGCTGGTGTTATAGCACGGTCCACTCATTGATTTGGTTACAAGGCTCTCATAGAGCGTAGTTGGTCGGCTTACCTTTCATTAGTTTTTAGGAGTGCACGGCCCCATGATCAATATAATTCGGTTCTTTTTAGATTCTTTTATCTAGTCTGTTGGGTTTGCGTAGTCAGAGCTGTATTAGATATAAAAGCACAATAGTGGTGTGTTGATTCTATGGTCAACGATCAGTTTTTTAGAATACGAACGTGTTTAAAGTCTCTCGACGAGTGAATTCTGTAGCCGATCGTTCTTTATTGTGCCAAGTCTTATTCCTGGCGCTTGACCTGACTTTCCAGGTGATTCCCCGATTTACTCCATACTGCACCATTATACGGAGCTTCAATATAGTCGTACTTTTGTTAACGGCATGTCGAATATTATTCTATTTTCCGCTAAAGAAGGTAATTCGATTTACGAAAACTTTGAGATCTAGACAGTCGCGCTACGTTCTCTGAGGTTGTTATACTTCTTATTACCATGATCTATACTACACAACCCCGACTACCTAGTTAATCTATTATCGGGTATAACGACCTAGCCGAAGATTTATAGTGTCCGATAAATGCCATCCATTGGACTGAAAATTTTCTGTGTTTTATTCGCTCTATCCACATGAGTTTGATGGGGAAAATAGTTTTTTGAGACTTTTATTCGATTCTTTAGTATTAGCAGCAGGTTTTCGAGGTTATTTTACTTTATGCGGTTTGGACTTTTCTTACTATTTTATGTGTTTTATGTTCACAGACTCTTGATGCTGGGGTCTAATTTACGAATTATTCTTTCATTTCGGCAACGGGCGTTCCTTGCTTTTATGATAGTATTGTACTCATGTTTAGTCATGCTGGTTTTGAGGGACTTTTTAGACTATCTGTTATTCTTTGATTGCGTGGAACAACTGTCAATCGAACTGAAGCATTGTGTCACTGTGTAGCAATCTGTAGTTTCGCTATAATTTGGTGTATATTCTAGCCGACTCTTGGCGAATAATATCCAACTTCTCTGGTAAGTCTAGGATACTCGAGACTTGCTGGGTTTATAAGACGACATATTCTCAAGAAACATATGCAAAAGATCCGAAGTACGACGTTGATTTCTTATTAGCCATATTCGTTCGCAGCTTTGTGATTTCAGTAAACTCAAATCTCAGTTTTACGGCTCGAATAAATGAGGCGGTACGCAGTTAGCCTGATATTACGACTCAACGGTAAGCACACGCCCTTTCAACTAAACACTACATGAGTCGGACAAAATACATCTTAAAGTAATCATCCGAGTTGATAAATAAGAGATGACATTGTCTACTCCATATAGCTTGATATGGCGTAGGTGAAATGGGCTGGACCGAAGAGGACTCCTTAGTTGGACAGTTTCTAAAAGACGACTGCATACGAGGTGCCGTAGTGCTATCAGTTAGTTTTATTGCTGACCGTGACGTAGCCGTGAAATCATGATATCTAACAGGATTTGTTGAAATTGATCTTGGTAGTTTGAATACGAACAATTACCATGGATGAATGCTTATTGATTTCAAGTTGTACGATCACCTTTCTAATTGGCATATCAGTAGCGATATATGCCGACTAAGTTCTCGTGGAAATGCACGATTAATCTTAATAACATAAGTCATCGAGGTCGTGTCCTTTAATGATACGTGGCCAAACCCATATGATATTCAAGACTGTCGTGCTTGAGAGTTTTGAAAACTTATGGTATACCACTCAATTTTCTCATTGATATGTAGTTGAATAAAAATTGCTGGAACGTTTAATGACCGTGACTTGTGATACAGGGACGCTTGCTTGCATCGACACGAGATCGTTCCATATTCCGAGGAGATCAACGTGGATCGTTTTGATCGTCCCGCGTCTATACGAAACTATCCGGCTCTTGCTCCTCTGGCGGCATATTACCACGTTGGCGATTTATCAACTTTCTATATCAGGAATGATGATATTCAAGGCTCAACCTAATACTGAACAATAAAATGTCCAACTCAAGCTACTCACGGACTTGTCGCAGATCGATCAGTTTTATGAGGTAGACGTCGGGAATTAAGGGCGGTAGTCATCCTCACAAGAATTATCGTAAACCTTTATTCGTATTCGGCAAATCTACGGTATGGTGTAAGGATAATCTCCTATTATACAAGTGATGATATCGAATCGCATACTATGATTGTTTATGTTTATAACTTGTGATGAACATACTCGAGTTTAAATGACGACCTTTTTCAATTAGAAAGTACTAATCTCAGTTGGAATTGTGGTGGAATTAAAGGCAGGTCATATCTCTTCACAATGTGGTTAAACCTTTTATAGAAACCGTAAGGATGTGAAAACTTATGGTTTCGTAAGGAGGGGTCCCAAAGTTACGCCCCCAAACCGGAGATTATAATACATCACCAAGTCTCGGGTAACCTGAAGCACCTGAGCAAGTGTGCAAGTCATTAGTAGCATTTTAACTGACATCTTTTGTAAATCTGGTTATTTCACGATGCGTATTATTTGACAGGTCGGAAACAAAAGACTGCAACTTGACTAGATCGGCAAAGAATAATCCCTCTTGAAGTGCAGTTACAAGCAAGGAATTCAAGCCTCGGCCTGTGGGATAGAATTTCTCTTAATATGTGGCTTGTCTTTTTGTATAATTATATGATTTCGTCAAGATACCCGGATGTCATGTAACACCAGCCCTGTTTGATGCTAAATGAAACCTCGCGGGTTTCGTTGACCGATCCTTCTCTTCCAGGTCATCTATGTAACCATCAGGGATGGCTGTAGTTCGTAGGATTTTCTCATTCATTTCTCTAGTTGATGATATTTTTCCTACGAAGAAAGCTAAATTAGACTAAAATCCTTGTATGTACCCGTTGAGAAATAATTTGTATCAGAATGCGAATTGGGATAGTACAAGAAGTTCCTATATCCTTTATAAAGCACTTGACATTCTTTAGTACTGCGGAACATAATACATCAAGTGCACTTCGTTCATCATTTCTCTACTGAGTCTGTAAGATGTCGTCCTGTTATTGTGTTATTCGATAGTATATAATTCGCTAAGATTAACATCGGAGTAGATGTGTTCTACAAACCTATTTACACGCGATGTTGTAGCATTGGTTAGCACCTGTTTAGTTTATTGTGTATATGGTAGTCGTATTCTCGATTTCTGTAGTTCTCGTTGGTACTTAGTCTACATGTTGATAAGCTCGTTCGAGCCTTCGCACGCTCCTAATTAAGGTGTGCCGGGAGCTTTTCAGGTTGGTTAACTGATTCGGCGTTCAGATTGAGGTGTATCTGTCAATCGGTCCCGACAGTCAATCTCTCGCTAAGTGAGAAAAGGACGGGTGTACGTGACCAAAGCAGCCTGGGCGCGCACGTACCTTGCACTGCGGGGGTTTGAAGTCCATATAGCCCTTTGTAGCGGACGGCGGTGCATAAATATGCCTCGCTGATACTTATGCGGGCCCCGATATAGTATTATCAATGATTGTGAGATTTGCCGATTATATTTCATCTGACAGCGAGAACTACTGCTAGTTATACTACCCATCAGCTCGCGCGATGGGTATAATCCGAATCTCTACTGGTTTTGGGTCGATTTTCTGCCGTAGCGTGTCTTTTTACTTACCAACCTGAAATAATTTTTAGAGGGTGTTTGGTTTTGGTTGACCCCCTTGGGAAAAACATGTGCGGTATTGGTGATTTATAATGTGATAGTTGGTTAGACCACAGTCTCTTGTTGCTTCCTAGGTTGGGGAATCTCAGCTGCGAGTGATATAGGAAAGTTTCTATTGAGTACAAGCTAGCCATTCAGTGAATATGAAAATAAATTCCCACGGATCCAAATCAAACAATGGTCCTTGGTGCATTAAGCAGACTAGCCCCCGCCTGACATTAGGGCTTTTGTTTTCCCTTACGCACGATCTACTTAGTTAGTGAGCATGTTAGCGATGGTGCTGTATTCTAATTAAACGTGGCCCGCTGATAGCAACGAATTGGATTTATGTTTTGTTATTTCTGCTATACCTTACGTTGTTCCCTCCCATCTGTGTACTAGTTCTATTCTACGAAACTTGTTCAATCCCTAACAATGACGTCCTGAATTTGCACACTATAGCGATGGATCTGCTGTACAATAAACATGGTGCATGAGTTCAATATTTGCGAGAAATCCCCCTAACCTCAAATCAGTTTTGTTTGTACTACCTCCACGATTATCTTGCGAGCTCACAAGCGACCGAGCCCTTGAGGACACTTGGCCCTTGCCCATTTTATATAATACATTTCCAATTTATATATATTGTTAGAACTCCCTGGCACTTGCATGCCGGCTAAATACAATCTTTACTTGGTAAACAGCTTAGATTATAAGTAATGGACGCTACTTTTAATCCGTCATTCCAGTCCAATAATCGATGATGTACTCCTTTGTTAAACTCAGCACGCGTCTCGGTAACTCTTCTCCAATCACATCGACATCTTCGATAGCGTTGCTTTCAGCTGTGCCCTGAGCACTATGCTTTACTGTCACAAGCTTTAACGTGTGTATTGCAGCAATACCATTGGGAGTAATTGATCCGGGTTGCGTAAACACATTGCACTACCGCCTCGGGGTTCGTTTCTCCACTTCTAGGGAACTCGAACCCCAATGTCTATACAATCTGCACTCATAGTTGGCGAATTCTCAATTCCTTAGTTACACGTATGATAGCGTAAAAGGCCAATCTTCCATCATTCCCCAAAATCTGAGAGGTTTGTTTCCTAGGCATTCTCTCTATATGATACTGCTCTCGCACAAAGCCCAGACCAGCTAATTATGGTTAATCGGTCAGGGAACGCTGCCTATATGCCTGTGTGCTCGTGACCGCAGTTCGGCGCCGAAACGTCTTATGGCGATGCGCCGTAGGCATGGGTGTACCTGTTGATCGGCGGGGATGTGACCCACATTGATATGATTGAGTTTCCTGTGGGGATGACCTTTTTATTGGGGTAGTGACTTTGCGTCGATTCTTGAGGGATGCATACTAGGGCTAACACTTAGAAGCTCGTATATATGGTGTTATGGTACGCCGTCCGAATCTGCTGGTGTCATTTCTAAGTGTACCAGAACTACAAAGCTTTATACTCTAGACTGTGTTGGCGTTAGAGCTACATGTAGGTATTGCTTCCTATAGATACAGTGCTATCGTGTTGCCGAGTTAATTCCTCACAGTCGCTACACTTTCCCTACTCGGTATGAGCAGTTTTCAGGAATTGTAATCAACAAGACCTTTGTATGTAATACGTATCATTCAAATTTGATACAGTTGAAGTGTAGGAGCCGTAGCGGTAATTGGCATGATCATTTTCATAGGTTATTGCCATGACCTAAGACTTGTCGCGAGGTATGTTGGTGCCTGCAATATCGCGGCTTTCTTTCGCTGGTAGGCCACCCTTTATCATCGTTTGCTTTTGATAGCCAATCCGGTAGAACCTCGCGGCATTGGTATGCTTGTGCGGTCCTATATCAGGGGGAATTTCCACCTCTTTTACGTCTGCCAACTCATCGAATTCTAATGAAAGCCCTGCTTGCTGTACAATTAAGGAAGTCCTATTAATGCACGAGCATGGAGGGTTGTGCCGATTAGATAGATTTATGAAAACAAGGTATTCCAATAGGACGCCCCAATTGAATCGTCTCAACAAGTACTCTTTATTGGTGGTTCTGTCTACGTAGTTCTTCCGTTCCCTTCTAACTGATTCGCCACGCTATCTACATTCTACGAGGAATCACTGGTATTGCATTTGCGTTCGACTGGTGAGTGTGTCGTTTCGGCAGCGCGACACACAACCTTGCACGGCCTCGTCACAGTTCAGTTTCGGTCCGAAAACCCCTTTCTATTGTGGACTCAGGATGGCGCCATATATGTCGTTCGGACGGCGCATGCTTTGTCGGACTACCGTTGGTGTACTCAGCTCGATTGAGTGGGCAGATGTAAGAAATCAAATGACGTTATCTTAATCTTAAGTAAGGACACACCCGTATTTCATTGATATAAGCTGCGCTTGCCTAGGCTGTGTCATAGGTTTATGAATCTAAGCCTCTCTGTCCTAAAGTGCTCTGTCTGTATAGACTCCGCCGGATATTTCATCTCTAATAACAAATCCTGATACCTCTGACCATAGCTTGTATCCTTCCCGTGCTAGTTTCACATATAGAATGGAGGATTTATTCCTTTATCCCTCCCTCTTATGAATCCTCCTCGAAGCATTGTCAATTCCTTCTATTGTTTTCATTGCTTCATTCAGGCCTATGGTAGTTTGCAATTGGGCATGAAACTTCGGAATCCTAATTTATTGTAAGGACACATTTCTGTTCTGTGCTTAGTTTACCGGCAGTTACAGCAGTTCGCAGGGGGCCAAAAGCGTGGCCCGGAGGGCAGTGTCTAAGTGGTATAGAAGTAGCGGCAATGACCTTAAAAGAGCTCCACTACCTTGTCTGCCATGGTCACGTTGTGTCTAACTGGTCGCTATCTACCCCCCACACCCGATTCCCGCTGTATGGGGCCACAAGCACGTAGATATGCTAATTCCCATACACTATCTAGTTAGAAACAGCTAATACTTCAACCTCGAGGACCGCTCGAGGGATGGCAAGAGGACGACGGTGCCCGGAAAAGGAGTACCGCCGCTTC'
In [33]:
str(genome.seq)[:100]
Out[33]:
'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC'

Data Compression

The BWT is a algorithm that aims at preparing data to be compressed. In practice it rearranges a string into pieces of repeated characters, that can then be compressed using for example the run-length encoding. This technique aims at storing a sequence of repeated characters as a single value and the number of repeats.

Example

original string : AAAAAAATTTTTGGGGTGTTTTTT

compressed string : A7T5G4TGT6

Questions :

  • Write a simple function that performs the Run length encoding
  • Estimate the complexity of your function
  • Compress the genome before and after applying the BWT
In [34]:
def run_length_encoding(S):
    """
    Encode sequence using the Run Length mathod
    
    Args:
        text (str): string to be shifted
    
    Return:
        str: run length
    """
    S+="$"
    i = 1
    result = ""
    current_char = S[0]
    count = 1
    while i < len(S):
        if S[i] == current_char:
            count += 1
        else:
            result += current_char
            if count > 1:
                result += str(count)
            current_char = S[i]
            count = 1
        i += 1
    return(result)
In [35]:
run_length_encoding("AAAATTCCCGCGG")
Out[35]:
'A4T2C3GCG2'
In [36]:
genome_bwt = BWT(str(genome.seq))
len(str(genome.seq)),len(run_length_encoding(str(genome.seq))), len(run_length_encoding(genome_bwt))
Out[36]:
(29903, 27727, 27321)

BWT and suffix tables: Understanding the BWT

Let $S$ be a string, and let $S'$ be the transformed version of $S$ after undergoing a given number of circular shift operations, and let us consider that the end of string character is at location $i$.

Questions :

  • What does the substring $S[1:i]$ represent for the substring $S[i+1:|S|]$, and particularly for the last character $S[|S|]$?
  • Describe the BWT in terms of suffixes
  • Assuming that there are some regularities in the string (some elements tend to be followed by some particular elements with higher frequency), explain why the BWT tends to bring together the same characters.
  • Conceive an algorithm that takes a string and its suffix table as an input, and returns its BWT.
  • Compute the complexity of this method

Answer :

$S[1:i]$ is the suffix of $S[i+1:|S|]$ and $S[|S|]$ is the character just before the suffix $S[1:i]$. In other words, $S[1:i]$ is the (circular) right context of character $S[|S|]$. Therefore the BWT simply returns the previous character to each suffix in the suffix table (i.e., returns the character preceding a suffix, sorted by their suffix lexicographic order, i.e. sorted by their right context).

If we assume that some patterns exist in the string, i.e., some characters tend to co-occur with higher frequency, sorting characters according to their suffixes (i.e., their write context) will bring together similar characters. For example, let us imagine that we have a very long document about dogs, then just before a "og ..." there are chances to find a "d". and thus sorting letters according to their suffix will tend to bring together many "d", alternated with other characters (for example for words like "cog").

In [37]:
def suffix_list(T):
    """
    Compute the suffix list
    
    Args:
        T (str): string
    
    Return:
        list of strings: suffix list
    """
    suffix_list = [T[i:] for i in range(len(T))]
    sorted(suffix_list,reverse=True)
    return suffix_list

def suffix_table(T):
    """
    Compute the suffix table
    
    Args:
        T (str): string
    
    Return:
        list of tuples (suffix,location): suffix table
    """
    suffix_list = [T[i:] for i in range(len(T))]
    suffix_table = sorted((e,i) for i,e in enumerate(suffix_list))
    return suffix_table

def BWT_suffix_table(T,end_of_string="$"):
    """
    Compute the BWT from the suffix table
    
    Args:
        T (str): string
        end_of_string (char): end of string character to append
    
    Return:
        bwt (str): BWT
    """
    T += end_of_string
    ST = suffix_table(T)
    bwt = ""
    for s,i in ST:
        bwt += T[i-1]
    return(bwt)
In [38]:
T = "ACATACAGATG"
BWT(T)
Out[38]:
'GT$CCGAATAAA'
In [39]:
BWT_suffix_table(T)
Out[39]:
'GT$CCGAATAAA'

Naive Inverse Burrows Wheeler Transform

The BWT transformation is reversible, and the inverse operation does note require the storage of any additional data: only the end of string character is needed. The inverse BWT procedure is described hereafter:

  • Let $S$ be the original string and let $BWT(S) = S_{bwt}$ (we only know $S_{bwt}$)
  • Given only $S_{bwt}$, we can easily reconstruct the first column of the circular shifts table: indeed, $S_{bwt}$ contains all the characters of $S$, and we know that the first column also contains all the characters of $S$ sorted. Then, simply sort the characters of $S_{bwt}$ to get the first column. Let $S_1$ denote this first column.
  • Given the circular shift operation, we know that $\forall i$ if $S[i] = S_{bwt}[i]$, then $S[i+1] = S_{1}[i]$

  • Now, $\{\forall i \quad (S_{bwt}[i], S_{1}[i])\}$ denote the set of all pairs of successive characters in $S$.

  • Again, we know that the first and second columns of the table contain the pairs are successive characters sorted, then we simply need to sort these pairs to get the first and second column of the table
  • Similarly, we simply need to paste the characters in $S_{bwt}$ to those of the first two columns to get the set of all triplets of successive characters in $S$, and sort them to get the first three columns and so on, until reconstructing the entire table...
  • Once the table is reconstructed, the row containing the "end of string" character in the last position is the original string.

Example :

$S_{bwt}$ = GT$CCGAATAAA

Sort to get $S_1$: $S_1$ = $AAAAACGGTT

Paste $S_{bwt}$ to get the pairs: $\{\forall i \quad (S_{bwt}[i], S_{1}[i])\}$ = {G\$, TA, \\$A, CA, CA, GA, AC, AC, TG, AG, AT, AT}

Sort the pairs to get the (first and) second column: \$A, AC, AC, AG, AT, AT, CA, CA, G\\$, GA, TA, TG

$S_{2}$ = ACCGTTAA\$AAG

Paste $S_{bwt}$ to get the triplets: $\{\forall i \quad (S_{bwt}[i], S_{1}[i], S_{2}[i])\}$ = {G\$A, TAC, \\$A, CAG, CAT, GAT, ACA, ACA, TG\$, AGA, ATA, ATG}

...

Questions :

  • Program the inverse Burrow Wheeler Transform
  • Compute the complexity of this algorithm
In [40]:
def print_table(table):
    for row in table:
        print("".join(row))

def inverse_BWT(bwt, verbose=False, last_character="$"):
    """
    Inverse the BWT
    
    Args:
        bwt (str): bwt of a string T
        verbose (bool): if True print the process of the algorithm
        last_character (char): which is the end of string character?
    
    Return:
        T (str): BWT^{-1} of bwt
    """
    def paste(suffix_table,bwt):
        new_columns = []
        for i,s in enumerate(suffix_table):
            new_columns.append(bwt[i]+s[:])
        return(new_columns)
    bwt = list(bwt)
    table = sorted(bwt)
    while len(table[0])<len(bwt):
        table = paste(table,bwt)
        table = sorted(table)
        if verbose:
            print("suffix table in progress....")
            print_table(table)
    if verbose:
        print("suffix table")
        for row in table:
            print(row)
    for row in table:
        if row[-1] == last_character:
            return(row)
In [41]:
inverse_BWT("GT$CCGAATAAA")
Out[41]:
'ACATACAGATG$'
In [42]:
text
Out[42]:
"dans l'herbe noire les kobolds vont. le vent profond pleure, on veut croire."
In [43]:
inverse_BWT(BWT(text))
Out[43]:
"dans l'herbe noire les kobolds vont. le vent profond pleure, on veut croire.$"

Inverse Burrows Wheeler Transform

There is a more efficient way to perform the Inverse Burrows Wheeler Transform, without needing to reconstruct the entire table (pasting and sorting)

First Last (FL) property

  • Consider the following example. Index the letters of the Last column by the order of appearance of each letter (if you have seen k letters A then index the new letter A as A$_{k+1}$)
  • Search the correspondence of indexes in the First column, to do so, you can use the entire table
  • What do you notice? Why does this property appears?

Example:

\$ACATACAGATG

ACAGATG\$ACAT

ACATACAGATG\$

AGATG\$ACATAC

ATACAGATG\$AC

ATG\$ACATACAG

CAGATG\$ACATA

CATACAGATG\$A

G\$ACATACAGAT

GATG\$ACATACA

TACAGATG\$ACA

TG\$ACATACAGA

Answer:

\$ACATACAGATG$_1$

ACAGATG\$ACAT$_1$

ACATACAGATG\$$_1$

AGATG\$ACATAC$_1$

ATACAGATG\$AC$_2$

ATG\$ACATACAG$_2$

CAGATG\$ACATA$_1$

CATACAGATG\$A$_2$

G\$ACATACAGAT$_2$

GATG\$ACATACA$_3$

TACAGATG\$ACA$_4$

TG\$ACATACAGA$_5$

  • We know that there is only one sign \$ so its index in the first column is immediate $_1$

row 1 $\to$ \$$_1$ACATACAGATG$_1$

  • Then we know that \$$_1$ is the right context of G$_1$, so we can deduce the index of the G at the left of the \$ in the first column

row 9 $\to$ G$_1$\$ACATACAGAT$_2$

  • G$_1$ is the right context of T$_2$ according to the table, then looking the T that has G\$ as right context in the table allows to index T$_2$

row 12 $\to$ T$_2$G\$ACATACAGA$_5$

  • and so on ... finally:

\$$_1$ACATACAGATG$_1$

A$_1$CAGATG\$ACAT$_1$

A$_2$CATACAGATG\$$_1$

A$_3$GATG\$ACATAC$_1$

A$_4$TACAGATG\$AC$_2$

A$_5$TG\$ACATACAG$_2$

C$_1$AGATG\$ACATA$_1$

C$_2$ATACAGATG\$A$_2$

G$_1$\$ACATACAGAT$_2$

G$_2$ATG\$ACATACA$_3$

T$_1$ACAGATG\$ACA$_4$

T$_2$G\$ACATACAGA$_5$

  • We notice that the order of the indexes is preserved. This is called the First Last property.

Let us consider the following example:

We set the indexes in the last column such that:

xxxxG$_1$

.....

yyyyG$_2$

.....

zzzzG$_3$

Given the BWT construction (sorting the strings by lexicographic order) we know that:

xxxx < yyyy < zzzz

Consequently, in the BWT the rows starting by G (circular shifts of the previous ones) will preserve the same order

.....

G$_1$xxxx

G$_2$yyyy

G$_3$zzzz

.....

Efficient Inverse Burrows Wheeler Transform

Let $S_{bwt}$ the BWT of a string $S$, and let $A$ be the alphabet of $S$.

Let $S*T = (S[1], S[2], \dots, S[|S|], T[1], \dots, T[|T|])$ be the concatenation of strings $S$ and $T$

initialization

  • For each character $X \in A$ count its number of occurrences $\# X$ in $S_{bwt}$
  • Index the occurrences of $S_{bwt}$ by their order of appearance, i.e., the first occurrence of $X$ is indexed 1, the second is indexed 2, and so on (e.g., for the sequence (A,A,T,A,C,C) the indexes are (1,2,1,3,1,2)). Let $K$ represent the list of indexes, such that $K[i]$ is the index of $S_{bwt}[i]$
  • By definition, the first character of the First column is \$, its left context is stored in the first character of the Last column ($S_{bwt}[i]$), therefore this is the penultimate character in $S$ (just before \\$).

  • Let $X\leftarrow S_{bwt}[0]$ denote the current character

  • And let $k \leftarrow 1$ be its index (it is the first occurrence of $X$ in $S_{bwt}$)
  • $S \leftarrow (\$)$

Repeat While $X \neq \$$

The main idea is to find the location of $X$ in the First column of the table (not computed explicitly), in order to find its left context in the Last column, i.e., $S_{bwt}$. Notice that the characters in the first column are: 1) sorted into a lexicographic order 2) the instances of the same letter share the same appearance index than those in $S_{bwt}$.

  • $S \leftarrow (X)*S$
  • Let $j$ be the location of $X$ in the First column. $j\leftarrow k+\sum_{Y} \# Y $ for all character $Y \in A$ such that $Y<X$
  • The character at $S_{bwt}[j]$ has $X$ as right context, and thus it is its the left context.
  • $X \leftarrow S_{bwt}[j]$, $k \leftarrow K[j]$

Questions:

  • Implement this algorithm in python
  • What is the complexity of this algorithm?
In [44]:
from collections import Counter 

def occurrence_indexer(S):
    K = []
    last_index = {}
    for s in S:
        if s not in last_index: 
            last_index[s] = 0
        K.append(last_index[s])
        last_index[s] += 1
    return(K)

def last2first(counts,k,X):
    return k + sum([counts[char] for char in counts if char < X])
    
def efficient_inverse_BWT(bwt,end_of_string="$"):
    """
    Inverse the BWT
    
    Args:
        bwt (str): bwt of a string T
        last_character (char): which is the end of string character?
    
    Return:
        T (str): BWT^{-1} of bwt
    """    
    K = occurrence_indexer(bwt)
    counts = Counter(bwt)
    X = bwt[0]
    k = K[0]
    S = end_of_string
    while X != end_of_string:
        S = X+S
        j = last2first(counts,k,X)
        X = bwt[j]
        k = K[j]
    return(S)
    
    
In [45]:
inverse_BWT("GT$CCGAATAAA")
Out[45]:
'ACATACAGATG$'
In [46]:
efficient_inverse_BWT("GT$CCGAATAAA")
Out[46]:
'ACATACAGATG$'

String search using the BWT

Using a similar technique as in the efficient inverse BWT, it is possible to detect the presence of a substring $T$ in a string $S$ using the following principle

initialization

  • $L \leftarrow BWT(S)$
  • $F \leftarrow$ First column of the BW table
  • $e \leftarrow 1$
  • $f \leftarrow |F|$
  • $i \leftarrow |T|$

while $e < f$ and $i>0$:

  • $X\leftarrow T[i]$
  • $r \leftarrow $ first position of $T[i]$ in L[e,f]
  • $s \leftarrow $ last position of $T[i]$ in L[e,f]
  • $e \leftarrow $ index of the r-th occurrence of $T[i]$ in F
  • $f \leftarrow $ index of the s-th occurrence of $T[i]$ in F
  • $i \leftarrow i-1$

Question:

  • It is possible to avoid computing and storing the first column, propose a method to do this (take a look to the Efficient Inverse Burrows Wheeler Transform)
  • Program this function in python
  • Assuming that we store the index of the characters in the original string (as in a suffix array), it is possible to use the previous method to detect the location of the instances of the substring $T$ in $S$.
In [47]:
def last2first(counts,k,X):
    return k + sum([counts[char] for char in counts if char < X])
    
def get_first_occurrence(L,X):
    for i,l in enumerate(L):
        if l == X:
            return(i)
        
def pattern_matching_BWT(S,pattern):
    """
    Search a patter in a String using the BWT
    
    Args:
        S (str): string
        pattern (str): pattern
    
    Return:
        bool: true if the pattern is in the string    
    """
    L = BWT_suffix_table(T)
    K = occurrence_indexer(L)
    counts = Counter(L)
    e = 0
    f = len(L)
    i = len(pattern) - 1
    while e < f and i >= 0:
        X = pattern[i]
        first_occurence_in_L_ef = get_first_occurrence(L[e:f],X)
        if first_occurence_in_L_ef is None:
            return False
        else:
            r = first_occurence_in_L_ef+e
        #print(first_occurence_in_L_ef,L[e:f],r,X,L[r])
        
        last_occurence_in_L_ef = get_first_occurrence(L[e:f][::-1],X)
        if last_occurence_in_L_ef is None:
            return False
        else:
            s = f-last_occurence_in_L_ef-1
        #print(last_occurence_in_L_ef,L[e:f],s,X,L[s])
        
        e = last2first(counts,K[r],X)
        f = last2first(counts,K[s],X)+1
        i -= 1
        #print(r,s,e,f)
    return(i<0)
In [48]:
pattern_matching_BWT('ACATACAGATG','CATAC')
Out[48]:
True
In [ ]: