Skip to content

multi‐string BWT

Felipe A. Louza edited this page Mar 20, 2025 · 1 revision

multi-string BWT

By default, gsufsort appends a TERMINATOR symbol # to the concatenated input string, resulting in: T = T_1 $_1 T_2 $_2 ... T_m $m #

However, if the tool is built with the option make TERMINATOR=0, then the resulting string does not include the final # terminator: T = T_1 $_1 T_2 $_2 ... T_m $_m

Note that all $_i symbols are stored using the same value ($), but they are ordered based on their position in T. This ensures that sorting T_1 $ T_2 $ ... T_m $ # is equivalent to sorting T = T_1 $_1 T_2 $_2 ... T_m $m #.

Example:

Given an input file with the following m strings:

$ more dataset/mdolEBWT.txt 
CTGA
TG
GTCC
TCA
CGACC
CGA

The (multi-string) BWT output with TERMINATOR=1 is the following:

$ make clean; make TERMINATOR=1
$ ./gsufsort  dataset/mdolEBWT.txt --gsa --bwt --print
## gsufsort ##
## store_to_disk ##
mdolEBWT.txt.4.4.gsa 	224 bytes (n = 28)
mdolEBWT.txt.bwt 	28 bytes (n = 28)
## print ##
i	GSA		BWT	suffixes
0	(6, 0)   	$	#
1	(0, 4)   	A	$
2	(1, 2)   	G	$
3	(2, 4)   	C	$
4	(3, 3)   	A	$
5	(4, 5)   	C	$
6	(5, 3)   	A	$
7	(0, 3)   	G	A$
8	(3, 2)   	C	A$
9	(5, 2)   	G	A$
10	(4, 2)   	G	ACC$
11	(2, 3)   	C	C$
12	(4, 4)   	C	C$
13	(3, 1)   	T	CA$
14	(2, 2)   	T	CC$
15	(4, 3)   	A	CC$
16	(5, 0)   	$	CGA$
17	(4, 0)   	$	CGACC$
18	(0, 0)   	#	CTGA$
19	(1, 1)   	T	G$
20	(0, 2)   	T	GA$
21	(5, 1)   	C	GA$
22	(4, 1)   	C	GACC$
23	(2, 0)   	$	GTCC$
24	(3, 0)   	$	TCA$
25	(2, 1)   	G	TCC$
26	(1, 0)   	$	TG$
27	(0, 1)   	C	TGA$

When TERMINATOR=0, the (multi-string) BWT differs only in the symbol $_m that does not appear in the first position, and the symbol # replaces $_m.

$ make clean; make TERMINATOR=0
$ ./gsufsort  dataset/mdolEBWT.txt --gsa --bwt --print
## gsufsort ##
## store_to_disk ##
mdolEBWT.txt.4.4.gsa 	216 bytes (n = 27)
mdolEBWT.txt.bwt 	27 bytes (n = 27)
## print ##
i	GSA		BWT	suffixes
0	(0, 4)   	A	$
1	(1, 2)   	G	$
2	(2, 4)   	C	$
3	(3, 3)   	A	$
4	(4, 5)   	C	$
5	(5, 3)   	A	$
6	(0, 3)   	G	A$
7	(3, 2)   	C	A$
8	(5, 2)   	G	A$
9	(4, 2)   	G	ACC$
10	(2, 3)   	C	C$
11	(4, 4)   	C	C$
12	(3, 1)   	T	CA$
13	(2, 2)   	T	CC$
14	(4, 3)   	A	CC$
15	(5, 0)   	$	CGA$
16	(4, 0)   	$	CGACC$
17	(0, 0)   	$	CTGA$
18	(1, 1)   	T	G$
19	(0, 2)   	T	GA$
20	(5, 1)   	C	GA$
21	(4, 1)   	C	GACC$
22	(2, 0)   	$	GTCC$
23	(3, 0)   	$	TCA$
24	(2, 1)   	G	TCC$
25	(1, 0)   	$	TG$
26	(0, 1)   	C	TGA$

The same $ symbol is used to avoid increasing the alphabet size. However, their positions can be tracked using the document array (the first component of the GSA).

Clone this wiki locally