Often, I need to extract the protein sequence from a PDB structure. At first glance this may seem like a very straight forward task. But there are several pitfalls we need to be aware of.
Extracting the residues from the atomic data yields unnatural sequences
Your first thought, as was mine when I first started in this field, may be to simply extract the residues from the atomic data. The issue is this will yield unnatural sequences due to unmodeled (disordered) regions in the structure:
Take for example PDBID: 2GNK
PDBID: 2GKN | Unmodeled region (residues 38-54) shown in green
The region in green is a disordered unmodeled region (NOTE: disordered and unmodeled do exactly mean the same thing but it is often the case that a disordered region is unsolved/unmodeled). Thus, because it is unsolved there are no atomic coordinates present for this region. If we were to extract the residues from the coordinate data, it would skip this region because it is not represented in the atomic data.
For example, notice that in the PDB file it jumps from residue 37 to 55.
...
ATOM 274 N GLY A 37 -15.075 39.805 6.661 1.00 44.73 N
ATOM 275 CA GLY A 37 -16.284 39.652 5.872 1.00 48.53 C
ATOM 276 C GLY A 37 -17.423 38.816 6.407 1.00 50.47 C
ATOM 277 O GLY A 37 -17.543 37.645 6.067 1.00 56.21 O
ATOM 278 N PHE A 55 -13.565 41.983 11.476 1.00 42.29 N
ATOM 279 CA PHE A 55 -12.505 40.970 11.325 1.00 43.12 C
ATOM 280 C PHE A 55 -12.714 39.735 12.211 1.00 43.54 C
ATOM 281 O PHE A 55 -12.850 39.879 13.442 1.00 47.11 O
ATOM 282 CB PHE A 55 -11.171 41.606 11.629 1.00 44.34 C
...
Therefore, while the true expressed sequence of the structure would look like this (truncated to +/- 5 residues around disordered region for explanatory purposes)
34 VKGFGRQKGNAELYRGAEYSVNFLPKV 59
While your extracted sequence would look like this
34 VKGFG*****************FLPKV 59
If you are simply using the extracted sequence to map a structural representative to an alignment, this may not be such a problem as you don’t have structural data for that region in the first place, and you could let the alignment algorithm “figure it out” and add gaps to that region (albeit this is not a good practice).
But if you instead intend to use the extracted sequence as input to some predictor, the missing region becomes a significant issue as most predictors rely on the context around a residue. Feeding a predictor the sequence VKGFGFLPKV will generate inaccurate results as it no longer has the context of the residues of the unmolded disordered region that actually exist in the true sequence.
SEQRES does not need to match the ATOM records
Okay, so extracting the residues from the coordinate data has its problems. What about the SEQRES header or provided fasta file from the PDB website.
The SEQRES header is supposed to contain the “primary” sequence used in the experiment. The actual expressed protein is derived from the “primary” sequence. This means that the solved protein can be a fragment, extended or mutated version of the “primary” sequence.
But there is no gurantee that the Nth residue in the SEQRES header corresponds to the Nth residue in the cordinate data. For example, the 6th residue in 5TFR’s SEQRES HEADER is a glycine but the 6 residue in the atom data is a glutamate.
_pdbx_poly_seq_scheme in mmCIF files contains a residue level mapping between the primary sequence and the cordinate data
...
loop_
_pdbx_poly_seq_scheme.asym_id
_pdbx_poly_seq_scheme.entity_id
_pdbx_poly_seq_scheme.seq_id
_pdbx_poly_seq_scheme.mon_id
_pdbx_poly_seq_scheme.ndb_seq_num
_pdbx_poly_seq_scheme.pdb_seq_num
_pdbx_poly_seq_scheme.auth_seq_num
_pdbx_poly_seq_scheme.pdb_mon_id
_pdbx_poly_seq_scheme.auth_mon_id
_pdbx_poly_seq_scheme.pdb_strand_id
_pdbx_poly_seq_scheme.pdb_ins_code
_pdbx_poly_seq_scheme.hetero
A 1 1 GLY 1 -3 ? ? ? A . n
A 1 2 SER 2 -2 ? ? ? A . n
A 1 3 HIS 3 -1 ? ? ? A . n
A 1 4 MET 4 0 ? ? ? A . n
A 1 5 GLY 5 1 ? ? ? A . n
A 1 6 GLY 6 2 ? ? ? A . n
A 1 7 GLY 7 3 ? ? ? A . n
A 1 8 THR 8 4 ? ? ? A . n
A 1 9 GLY 9 5 ? ? ? A . n
A 1 10 GLU 10 6 6 GLU GLU A . n
A 1 11 THR 11 7 7 THR THR A . n
A 1 12 LEU 12 8 8 LEU LEU A . n
A 1 13 GLY 13 9 9 GLY GLY A . n
...
This secttion of a mmCIF file contains a mapping between the primary sequence and the cordinate data. Another useful piece of data with can extract from this is a mapping between the author chains codes and those assigned by PDB.
- asym_id represents the chain code as assigned by PDB
- mon_id represents the SEQRES (primary) sequence
- auth_seq_num represents the residue number as labeled in the cordinate data (with missing residues represented as a “?” )
- pdb_mon_id represents the corresponding residue in the cordinate data
- pdb_strand_id represents the chain code as assigned by the author
- pdb_ins_code represents the insertion code (This is sometimes used to preserve the sequence numbering when an insertion is present in the cordinate data but not the primary sequence)
With this mapping, you can now use the full primary sequence of protein structures but maintain a mapping back to the cordinate data. This is important as it allows us to maintain the full context sequence represented by a given protein structure.