Validity checking antibody sequence data

NextMove Software’s Sugar & Splice product is a toolkit for handling biopolymers, including oligopeptides, oligonucleotides and oligosaccharides (and combinations thereof). Among its many applications is the ability to rapidly identify likely errors in antibody sequences during registration. Therapeutic antibodies are increasingly important to the pharmaceutical industry, with about 8 of the top 20 selling drugs being monoclonal antibody therapeutics. Antibodies are large covalently bound molecules, formed of four protein chains cross-linked by disulfide bonds: two longer “heavy” chains and two shorter “light” chains. Almost always (with the exception of synthetic bispecific antibodies), the sequences of the two heavy chains in each antibody are identical, and the sequences of the two light chains are identical.

The most defining characteristic of these heavy and light chains, as implied by their names, is their length. Heavy chains are almost always between 432 and 456 amino acids long, and light chains are almost always between 204 and 220 amino acids long. This naturally suggests a very simple sanity check during antibody sequence registration: that the heavy and light chains each fall within their expected length range. A major benefit of therapeutic antibodies over arbitrary peptide and protein therapeutics is that they are not recognized as “foreign” by the patient’s immune system, thereby avoiding the side-effects associated with an immune response. To achieve this, an antibody drug must look very much like a native human antibody, which places tight constraints on the length and composition of its constituent protein chains. In practice, this simple check is effective at finding real problems in trusted data sources, as the three example failures below show.
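
The length check described above can be sketched in a few lines of Python. This is only an illustration, not the Sugar & Splice implementation: the function name `chain_length_ok` and the `EXPECTED_LENGTHS` table are hypothetical, and the ranges are simply the ones quoted in the paragraph above.

```python
# Expected length ranges (in amino acids), taken from the text above.
EXPECTED_LENGTHS = {
    "heavy": (432, 456),
    "light": (204, 220),
}


def chain_length_ok(sequence, chain_type):
    """Return True if the sequence length lies within the expected range."""
    if chain_type not in EXPECTED_LENGTHS:
        raise ValueError(f"unknown chain type: {chain_type!r}")
    low, high = EXPECTED_LENGTHS[chain_type]
    return low <= len(sequence) <= high


# Placeholder sequences: only their lengths matter for this check.
print(chain_length_ok("A" * 214, "light"))  # a length inside the light chain range
print(chain_length_ok("A" * 394, "light"))  # a length well outside it
```

A registration system would more likely report the offending length alongside the verdict, so that a curator can see at a glance how far outside the range a sequence falls.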

[1] Canakinumab light chain
Novartis’ canakinumab (sold under the tradename Ilaris) is a human antibody against IL-1β for the treatment of cryopyrin-associated periodic syndrome (CAPS). The sequence of the light chain in both DrugBank (DB06168) and ChEMBL (CHEMBL1201834) is 394 amino acids long, much longer than a typical light chain. Inspection of the DrugBank FASTA records reveals both the source of the error and its correction.

>8836_H|canakinumab|Homo sapiens||H-GAMMA-1 (VH(1-118)+CH1(119-216)+HINGE-REGION(217-231)+CH2(232-341)+CH3(342-448))|||||||448||||MW 49253.6|MW 49253.6|
QVQLVESGGGVVQPGRSLRLSCAASGFTFSVYGMNWVRQAPGKGLEWVAIIWYDGDNQYY
ADSVKGRFTISRDNSKNTLYLQMNGLRAEDTAVYYCARDLRTGPFDYWGQGTLVTVSSAS
TKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGL
YSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCDKTHTCPPCPAPELLGGPS
VFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAKTKPREEQYNST
YRVVSVLTVLHQDWLNGKEYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSREEMT
KNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRWQQ
GNVFSCSVMHEALHNHYTQKSLSLSPGK
>8836_L|canakinumab|Homo sapiens||L-KAPPA (V-KAPPA(1-107)+C-KAPPA(108-214))|||||||214||||MW 23357.9|MW 23357.9|
QVQLVESGGGVVQPGRSLRLSCAASGFTFSVYGMNWVRQAPGKGLEWVAIIWYDGDNQYY
ADSVKGRFTISRDNSKNTLYLQMNGLRAEDTAVYYCARDLRTGPFDYWGQGTLVTVSSAS
TKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGL
EIVLTQSPDFQSVTPKEKVTITCRASQSIGSSLHWYQQKPDQSPKLLIKYASQSFSGVPS
RFSGSGSGTDFTLTINSLEAEDAAAYYCHQSSSLPFTFGPGTKVDIKRTVAAPSVFIFPP
SDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLT
LSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC

Notice that the first three lines of the light chain entry (8836_L) are identical to the first three lines of the heavy chain entry (8836_H), probably due to a cut’n’paste error. The correct light chain sequence starts with “EIVL”; removing the three duplicated lines leaves a sequence 214 amino acids long, which both matches the length specified in the DrugBank FASTA header line and falls within our allowed length range.
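
A duplication of this kind can also be spotted mechanically, by measuring how many leading residues the heavy and light chain records share. The following is a minimal sketch under stated assumptions: `parse_fasta` and `common_prefix_length` are hypothetical helper names, the record identifier is assumed to be the text before the first `|` in the header, and the toy records stand in for the real entries.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record identifier to sequence.

    The identifier is taken to be the header text before the first '|'.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split("|")[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def common_prefix_length(a, b):
    """Return the number of leading characters shared by a and b."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Toy records: the "light" entry wrongly begins with the heavy chain's first line.
toy = """>toy_H|example heavy chain
QVQLVESGGG
VVQPGRSLRL
>toy_L|example light chain
QVQLVESGGG
EIVLTQSPDF
"""
chains = parse_fasta(toy)
print(common_prefix_length(chains["toy_H"], chains["toy_L"]))
```

Heavy and light chains are different proteins, so a long shared prefix is suspicious; how long is long enough to flag is a judgment call that this sketch leaves to the caller.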

[2] Adalimumab heavy chain
Abbott Laboratories’ (now AbbVie’s) Adalimumab, sold under the tradename Humira, is a human antibody against TNF-α for autoimmune disorders. Here the ChEMBL entry (CHEMBL1201580) lacks any sequence annotation for either chain, and DrugBank’s entry (DB00051) contains a heavy chain sequence of only 224 amino acids, too short to be the full sequence. This partial sequence is actually just the Fab (antigen binding) portion of Adalimumab’s heavy chain. Searching online reveals a presentation from Jeremy Fry in 2012 that repeats this Fab fragment sequence (on slide 7) together with the text “Full sequence information for Humira is not in the public domain”.

Fortunately, this is no longer the case: the full sequence of the heavy chain has now been published. It can be found in a bitmap image (figure 8) in a PDF whitepaper from Thermo Fisher Scientific, who use Adalimumab’s heavy chain as an example case study for intact antibody sequencing with the Orbitrap mass spectrometry hardware.

Here (possibly for the first time in machine-readable form) is the full sequence of Adalimumab’s heavy chain.

>Adalimumab_H
EVQLVESGGGLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLEWVSAITWNSGHIDY
ADSVEGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAKVSYLSTASSLDYWGQGTLVTVS
SASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQS
SGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHTCPPCPAPELLG
GPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNVYVDGVEVHNAKTKPREEQY
NSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSRD
ELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSR
WQQGNVFSCSVMHEALHNHYTQKSLSLSPGK

[3] Cixutumumab heavy chain
ImClone Systems’ Cixutumumab is a human antibody against IGF-1R. The ChEMBL entry (CHEMBL1743001) provides a heavy chain sequence that is 460 amino acids long. Whilst this is only slightly above the upper limit of the expected range, it is enough to flag the sequence as suspicious. The suspicion is confirmed by sequence alignment, for example to the heavy chain of rafivirumab, shown on the lower lines in the sequence alignment below.

SNSAlign (Needleman-Wunsch sequence alignment) version 0.9beta14
Copyright (C) 2014 NextMove Software Limited

Sequence 1 Length: 460
Sequence 2 Length: 456
Alignment Type: 1
Gap penalty:    -10
Extend penalty: -2
Alignment score: 1542
Identity:    78.73% (359/456)
Similarity:  81.58% (372/456)

    EVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY 60
    |||||||||||||||||||||||||||||: | ::|||||||||||||||||||||||||
    EVQLVQSGAEVKKPGSSVKVSCKASGGTFNRYTVNWVRQAPGQGLEWMGGIIPIFGTANY 60

    LRFLEWSTQDGTAALGCLVKVPSSSLGTQTPSVFLFPPKPKTKPREEQYNKAKGQPREPQ 120
                                                                
    ------------------------------------------------------------ 60

    ENNYKTTPPVQKSLSLSPGKAQKFQGRVTITADKSTSTAYMELSSLRSEDTAVYYCAR-- 178
                        ||:||||:|||||:||||||||||||||:|||||:|||  
    --------------------AQRFQGRLTITADESTSTAYMELSSLRSDDTAVYFCAREN 100

    ----APHYYYY-YMDVWGKGTTVTVSSASTKGPSVFPLAPSSKSTSG----------DYF 223
          :||:  : | ||:|| |||||||||||||||||||||||||          |||
    LDNSGTYYYFSGWFDPWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYF 160

    PEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVV----------TYICNVNHKPSNTK 273
    ||||||||||||||||||||||||||||||||||||          ||||||||||||||
    PEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTK 220

    VDKKVEPKSCDKTHTCPPCPAPELLGG----------KDTLMISRTPEVTCVVVDVSHED 323
    |||:|||||||||||||||||||||||          |||||||||||||||||||||||
    VDKRVEPKSCDKTHTCPPCPAPELLGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHED 280

    PEVKFNWYVDGVEVHNA----------STYRVVSVLTVLHQDWLNGKEYKCKVSNKALPA 373
    |||||||||||||||||          |||||||||||||||||||||||||||||||||
    PEVKFNWYVDGVEVHNAKTKPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALPA 340

    PIEKTIS----------VYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQP--- 420
    |||||||          ||||||||||||||||||||||||||||||||||||||||   
    PIEKTISKAKGQPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENN 400

    -------LDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYT--------- 460
           ||||||||||||||||||||||||||||||||||||||||         
    YKTTPPVLDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLSPG 456
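
For readers unfamiliar with the technique, the following is a rough sketch of Needleman-Wunsch global alignment. It is not SNSAlign: it assumes a simple match/mismatch score and a linear gap penalty (the parameter values are arbitrary), whereas the output above reports separate gap and extend penalties and a similarity measure.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic programming table.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


# Toy example: the second sequence lacks two residues present in the first.
total, upper, lower = needleman_wunsch("ACDEFGHIK", "ACDGHIK")
print(upper)
print(lower)
```

With a linear gap penalty, long deletions are penalized in proportion to their length, which is one reason production aligners typically prefer a cheaper extension penalty, as in the output above.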

Alas, this mysterious pattern of gaps doesn’t indicate the discovery of a novel human immunoglobulin transposon by the ChEMBL team in Hinxton, but another cut’n’paste error. In the UniProt sequence file format, and often in patent filings, protein sequences are written in blocks of ten amino acids. The telltale alignment above (which looks very pretty as a dotplot) indicates that one of the columns of this sequence file has been shuffled. If one looks closely, many of the 10-mers missing from the later lines appear consecutively in the anomalous large insertion close to the start.
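
Shuffling of this kind can be made visible without a full alignment, by cutting the suspect sequence into blocks of ten and asking where each block occurs in a trusted reference. The sketch below is illustrative only: the helper names `locate_blocks` and `blocks_out_of_order` are hypothetical, the data is a toy sequence rather than the Cixutumumab record, and it assumes the shuffled blocks match the reference exactly.

```python
def locate_blocks(suspect, reference, block=10):
    """For each full block of the suspect sequence, return its offset in the
    reference sequence, or -1 if the block does not occur there."""
    positions = []
    for start in range(0, len(suspect) - block + 1, block):
        positions.append(reference.find(suspect[start:start + block]))
    return positions


def blocks_out_of_order(positions):
    """Return True if any located block maps earlier than the one before it."""
    found = [p for p in positions if p >= 0]
    return any(later < earlier for earlier, later in zip(found, found[1:]))


# Toy data: six distinct 10-residue blocks, with the second and third swapped.
blocks = [residue * 10 for residue in "ACDEFG"]
reference = "".join(blocks)
shuffled = "".join([blocks[0], blocks[2], blocks[1]] + blocks[3:])

offsets = locate_blocks(shuffled, reference)
print(offsets)
print(blocks_out_of_order(offsets))
```

In a correctly transcribed sequence the offsets increase steadily; blocks that map backwards suggest that the ten-residue columns were rearranged during copying.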

As the above examples demonstrate, antibody sequence databases appear to contain a number of serious errors that can be caught or flagged by even simple quality checks. Fortunately, more advanced antibody validation algorithms are able to identify far more subtle problems in large biologics registration systems, some of which contain millions of entries. Hopefully, corrections for the above issues will shortly be seen in the DrugBank, ChEMBL and IUPHAR databases.