SAM and BAM formats
Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016
SAM, BAM formats
Raw sequence data: FASTQ files → Mapping (Bowtie, BWA or others) → BAM/SAM files
• After mapping the FASTQ file to the reference genome, you will end up with a SAM or BAM alignment file.
• SAM stands for Sequence Alignment/Map format.
• A single SAM/BAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run, and a BAM file can be indexed to allow rapid access. This means that the raw sequencing data can be fully recapitulated from the SAM/BAM file.
SAM Format
(Li Shen, 2014)
SAM, BAM formats
• Working with SAM directly is rarely helpful and takes up too much space, which is why, in principle, we use only BAM.
• A BAM file (.bam) is the binary version of a SAM file (it saves storage and is faster to manipulate).
SAM, BAM formats
§ A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.
§ SAM files can be opened using a text editor or viewed using the UNIX "more" command.
§ Most alignment programs will supply:
- a header: describing the format version, the sorting order of the reads, and the genomic sequences to which the reads were mapped
- an alignment section: containing the information for each sequence about where/how it aligns to the reference genome
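The two-part layout described above (header, then alignment section) can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it relies on the SAM convention that header lines start with "@", and the `split_sam` function name and the small example text are assumptions made up for this sketch.

```python
def split_sam(text):
    """Split SAM text into header lines and alignment lines."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)      # header lines begin with '@'
        else:
            alignments.append(line)  # everything else is an alignment record
    return header, alignments


# Hypothetical SAM content, for illustration only.
example = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:ref\tLN:45\n"
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\n"
)
header, alignments = split_sam(example)
print(len(header), "header lines,", len(alignments), "alignment line(s)")
```

For real data, a dedicated tool or library is the safer choice; the sketch only shows how the two sections can be told apart.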
SAM, BAM formats
(Figure: example SAM file showing the header, followed by the alignment section with 11 tab-separated columns)
SAM Format
QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL
http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM
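As a rough illustration of the 11 columns, the sketch below splits one alignment line on tabs and names each field. The `SamRecord` type, the `parse_alignment` function, and the example line are assumptions made up for this sketch; any columns beyond the 11th are kept as optional tags without further interpretation.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple


def parse_alignment(line):
    """Parse one tab-separated SAM alignment line into a SamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=tuple(fields[11:]),  # optional fields, left unparsed here
    )


# Hypothetical alignment line, for illustration only.
record = parse_alignment(
    "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
)
print(record.qname, record.flag, record.rname, record.pos, record.cigar)
```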
SAM format
QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM format (2)
FLAG: bitwise FLAG (ideal for compression).
11 boolean flags (12 in the current specification, which adds 0x800 for supplementary alignments), all stored in a single column.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM flag: example
(Figure: SAM file)
Read mapped to position 7: FLAG 163 (= 1 + 2 + 32 + 128):
- Read is the second read in the pair (128)
- Read is paired and properly paired (1 + 2)
- Its mate is mapped to position 37 on the reverse strand (32)
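The decomposition of FLAG 163 can be reproduced with a bitwise test on each flag bit. The sketch below is only an illustration: the bit values follow the SAM specification, while the short labels and the `decode_flag` name are wording chosen for this example.

```python
# Bit values from the SAM specification; labels are paraphrased for this sketch.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


for label in decode_flag(163):
    print(label)
```

For 163 this lists the four properties from the example: paired, properly paired, mate on the reverse strand, and second in pair.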
Decoding SAM flags
Explain flag tool: https://broadinstitute.github.io/picard/explain-flags.html
SAM format (3)
MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer.
The MAPQ value can be used to figure out how unique an alignment is in the genome.
✓ A large number (>10) indicates it's likely the alignment is unique.
✓ 255 indicates that the mapping quality is not available.
(http://samtools.github.io/hts-specs/SAMv1.pdf)
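To make the MAPQ formula concrete, here is a small sketch that converts between an error probability and a MAPQ value. The function names are assumptions made up for this example, and real aligners may cap or estimate MAPQ in their own ways; the code only applies the formula as stated.

```python
import math


def mapq_from_error_prob(p_wrong):
    """MAPQ = -10 * log10(Pr{mapping position is wrong}), rounded."""
    if not 0.0 < p_wrong <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return round(-10 * math.log10(p_wrong))


def error_prob_from_mapq(mapq):
    """Invert the formula; 255 means the mapping quality is not available."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)


print(mapq_from_error_prob(0.001))  # 1 chance in 1000 of a wrong position
print(error_prob_from_mapq(10))     # the ">10" threshold as a probability
```

Read this way, a MAPQ of 10 corresponds to a 1 in 10 chance that the mapping position is wrong, and 30 to 1 in 1000.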
SAM format: CIGAR string
• The CIGAR string is a sequence of numbers and letters describing how the bases of the read align to the reference: which bases align with the reference (either a match or a mismatch), which are deleted from the reference, and which are insertions that are not in the reference.
More information about these formats is available here:
http://samtools.sourceforge.net
https://samtools.github.io/hts-specs/SAMv1.pdf
SAM format: CIGAR string
Mapped and unmapped reads are imported into SAM/BAM format.
The standard CIGAR description of pairwise alignment defines three operations: ‘M’ for alignment match, ‘I’ for insertion compared with the reference and ‘D’ for deletion.
POS: 5
CIGAR: 3M1I3M1D2M
(NB: The POS indicates that the read aligns starting at position 5 on the reference)
The CIGAR:
3M = 3 bases in the read sequence align with the reference.
1I = The next base in the read does not exist in the reference.
1D = The reference base does not exist in the read sequence.
http://genome.sph.umich.edu/wiki/SAM
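The CIGAR example above can be worked through in code: each operation either consumes bases of the read, bases of the reference, or both. The sketch below is a minimal illustration under that rule. The slide only names M, I and D; the other operation letters (N, S, H, P, =, X) come from the SAM specification, and the function names are assumptions made up for this example.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")  # operations that use up bases of the read
CONSUMES_REF = set("MDN=X")   # operations that use up bases of the reference


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I3M1D2M' into (length, op) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError("not a valid CIGAR string: " + repr(cigar))
    return [(int(n), op) for n, op in CIGAR_PATTERN.findall(cigar)]


def cigar_lengths(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


pos = 5
read_len, ref_len = cigar_lengths("3M1I3M1D2M")
end = pos + ref_len - 1  # last reference position covered (1-based)
print(read_len, ref_len, end)
```

For POS 5 and CIGAR 3M1I3M1D2M, the read is 9 bases long (3 + 1 + 3 + 2) and covers 9 reference bases (3 + 3 + 1 + 2), so the alignment spans reference positions 5 to 13.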
SAM format: CIGAR string
Examples of CIGAR strings for different types of alignments
(Figure: alignments and the corresponding SAM file; Li et al., 2009)
SAM format (5)
RNEXT: Name of mate (mate pair information for paired-end sequencing)
PNEXT: Position of mate (mate pair information)
Obviously, the chromosome and position are important. The CIGAR string is also important for knowing where gaps (insertions, deletions, or skipped regions such as introns) might exist in your read.
(http://samtools.github.io/hts-specs/SAMv1.pdf)