-
Notifications
You must be signed in to change notification settings - Fork 2
multi‐string BWT
By default, gsufsort appends a TERMINATOR symbol #
to the concatenated input string, resulting in: T = T_1 $_1 T_2 $_2 ... T_m $m #
However, if the tool is built with the option make TERMINATOR=0
, then the resulting string does not include the final #
terminator:
T = T_1 $_1 T_2 $_2 ... T_m $_m
Note that all $_i
symbols are stored using the same value ($
), but they are ordered based on their position in T. This ensures that sorting T_1 $ T_2 $ ... T_m $ #
is equivalent to sorting T = T_1 $_1 T_2 $_2 ... T_m $m #
.
Given an input file with the following m strings:
$ more dataset/mdolEBWT.txt
CTGA
TG
GTCC
TCA
CGACC
CGA
The (multi-string) BWT output with TERMINATOR=1 is the following:
$ make clean; make TERMINATOR=1
$ ./gsufsort dataset/mdolEBWT.txt --gsa --bwt --print
## gsufsort ##
## store_to_disk ##
mdolEBWT.txt.4.4.gsa 224 bytes (n = 28)
mdolEBWT.txt.bwt 28 bytes (n = 28)
## print ##
i GSA BWT suffixes
0 (6, 0) $ #
1 (0, 4) A $
2 (1, 2) G $
3 (2, 4) C $
4 (3, 3) A $
5 (4, 5) C $
6 (5, 3) A $
7 (0, 3) G A$
8 (3, 2) C A$
9 (5, 2) G A$
10 (4, 2) G ACC$
11 (2, 3) C C$
12 (4, 4) C C$
13 (3, 1) T CA$
14 (2, 2) T CC$
15 (4, 3) A CC$
16 (5, 0) $ CGA$
17 (4, 0) $ CGACC$
18 (0, 0) # CTGA$
19 (1, 1) T G$
20 (0, 2) T GA$
21 (5, 1) C GA$
22 (4, 1) C GACC$
23 (2, 0) $ GTCC$
24 (3, 0) $ TCA$
25 (2, 1) G TCC$
26 (1, 0) $ TG$
27 (0, 1) C TGA$
When TERMINATOR=0, the (multi-string) BWT differs only in the symbol $_m that does not appear in the first position, and the symbol # replaces $_m.
$ make clean; make TERMINATOR=0
$ ./gsufsort dataset/mdolEBWT.txt --gsa --bwt --print
## gsufsort ##
## store_to_disk ##
mdolEBWT.txt.4.4.gsa 216 bytes (n = 27)
mdolEBWT.txt.bwt 27 bytes (n = 27)
## print ##
i GSA BWT suffixes
0 (0, 4) A $
1 (1, 2) G $
2 (2, 4) C $
3 (3, 3) A $
4 (4, 5) C $
5 (5, 3) A $
6 (0, 3) G A$
7 (3, 2) C A$
8 (5, 2) G A$
9 (4, 2) G ACC$
10 (2, 3) C C$
11 (4, 4) C C$
12 (3, 1) T CA$
13 (2, 2) T CC$
14 (4, 3) A CC$
15 (5, 0) $ CGA$
16 (4, 0) $ CGACC$
17 (0, 0) $ CTGA$
18 (1, 1) T G$
19 (0, 2) T GA$
20 (5, 1) C GA$
21 (4, 1) C GACC$
22 (2, 0) $ GTCC$
23 (3, 0) $ TCA$
24 (2, 1) G TCC$
25 (1, 0) $ TG$
26 (0, 1) C TGA$
The same $
symbol is used to avoid increasing the alphabet size. However, their positions can be tracked using the document array (the first component of the GSA).