{"id":983,"date":"2014-07-07T13:44:22","date_gmt":"2014-07-07T12:44:22","guid":{"rendered":"http:\/\/nextmovesoftware.com\/blog\/?p=983"},"modified":"2014-07-07T13:44:22","modified_gmt":"2014-07-07T12:44:22","slug":"validity-checking-antibody-sequence-data","status":"publish","type":"post","link":"https:\/\/nextmovesoftware.com\/blog\/2014\/07\/07\/validity-checking-antibody-sequence-data\/","title":{"rendered":"Validity checking antibody sequence data"},"content":{"rendered":"<p>NextMove Software&#8217;s <a href=\"http:\/\/www.nextmovesoftware.com\/sugarnsplice.html\">Sugar &amp; Splice product<\/a> is a toolkit for handling <a href=\"http:\/\/en.wikipedia.org\/wiki\/Biopolymer\">biopolymers<\/a>, including oligopeptides, oligonucleotides and oligosaccharides (and combinations thereof).  Amongst its many possible applications is the ability to rapidly identify possible errors in the sequences of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Antibody\">antibodies<\/a> (during registration).  <a href=\"http:\/\/en.wikipedia.org\/wiki\/Therapeutic_antibodies\">Therapeutic antibodies<\/a> are increasingly important to the pharmaceutical industry, with about 8 of the 20 top-selling drugs being monoclonal antibody therapeutics.  Antibodies are large covalently bound molecules, formed of four protein chains cross-linked by <a href=\"http:\/\/en.wikipedia.org\/wiki\/Disulfide_bond\">disulfide bonds<\/a>.  These polypeptide chains consist of two longer <a href=\"http:\/\/en.wikipedia.org\/wiki\/Immunoglobulin_heavy_chain\">&#8220;heavy&#8221; chains<\/a> and two shorter <a href=\"http:\/\/en.wikipedia.org\/wiki\/Immunoglobulin_light_chain\">&#8220;light&#8221; chains<\/a>.  
Almost always (with the exception of synthetic <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bispecific_antibody\">bispecific antibodies<\/a>), the sequences of the two heavy chains in each antibody are identical, and the sequences of the two light chains are identical.<\/p>\n<p>The most defining characteristic of these heavy and light chains, as implied by their names, is their length.  Heavy chains are almost always between 432 and 456 amino acids long, and light chains are almost always 204 to 220 amino acids long.  This naturally suggests a very simple sanity check during antibody sequence registration: that the light and heavy chains each fall within their expected length range.  A major benefit of therapeutic antibodies over arbitrary peptide and protein therapeutics is that they are not recognized as &#8220;foreign&#8221; by the patient&#8217;s immune system, thereby avoiding the side-effects associated with an immune response.  To achieve this, an antibody drug must look very much like a native human antibody, which places tight constraints on the length and composition of its constituent protein chains.  This remarkably simple check is in practice effective at finding real problems in trusted data sources, as shown by the three example failures below.<\/p>\n<p><b>[1] Canakinumab light chain<\/b><br \/>\nNovartis&#8217; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Canakinumab\">canakinumab<\/a> (sold under the tradename Ilaris) is a human antibody against IL-1&beta; for the treatment of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Cryopyrin-associated_periodic_syndrome\">cryopyrin-associated periodic syndrome (CAPS)<\/a>.  The sequence of the light chain in both <a href=\"http:\/\/www.drugbank.ca\/drugs\/DB06168\">DrugBank (DB06168)<\/a> and <a href=\"https:\/\/www.ebi.ac.uk\/chembl\/compound\/inspect\/CHEMBL1201834\">ChEMBL (CHEMBL1201834)<\/a> is 394 amino acids long, much longer than a typical light chain.  
Inspection of the DrugBank <a href=\"http:\/\/en.wikipedia.org\/wiki\/FASTA_format\">FASTA records<\/a> reveals the source of the error and the correction.<\/p>\n<pre>\r\n&gt;8836_H|canakinumab|Homo sapiens||H-GAMMA-1 (VH(1-118)+CH1(119-216)+HINGE-REGION(217-231)+CH2(232-341)+CH3(342-448))|||||||448||||MW 49253.6|MW 49253.6|\r\nQVQLVESGGGVVQPGRSLRLSCAASGFTFSVYGMNWVRQAPGKGLEWVAIIWYDGDNQYY\r\nADSVKGRFTISRDNSKNTLYLQMNGLRAEDTAVYYCARDLRTGPFDYWGQGTLVTVSSAS\r\nTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGL\r\nYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKRVEPKSCDKTHTCPPCPAPELLGGPS\r\nVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNWYVDGVEVHNAKTKPREEQYNST\r\nYRVVSVLTVLHQDWLNGKEYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSREEMT\r\nKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSRWQQ\r\nGNVFSCSVMHEALHNHYTQKSLSLSPGK\r\n&gt;8836_L|canakinumab|Homo sapiens||L-KAPPA (V-KAPPA(1-107)+C-KAPPA(108-214))|||||||214||||MW 23357.9|MW 23357.9|\r\nQVQLVESGGGVVQPGRSLRLSCAASGFTFSVYGMNWVRQAPGKGLEWVAIIWYDGDNQYY\r\nADSVKGRFTISRDNSKNTLYLQMNGLRAEDTAVYYCARDLRTGPFDYWGQGTLVTVSSAS\r\nTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGL\r\nEIVLTQSPDFQSVTPKEKVTITCRASQSIGSSLHWYQQKPDQSPKLLIKYASQSFSGVPS\r\nRFSGSGSGTDFTLTINSLEAEDAAAYYCHQSSSLPFTFGPGTKVDIKRTVAAPSVFIFPP\r\nSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLT\r\nLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC\r\n<\/pre>\n<p>Notice that the first three lines of the light chain entry (8836_L) are identical to the first three lines of the heavy chain entry (8836_H), probably due to a cut&#8217;n&#8217;paste error.  
The correct sequence for the light chain starts with &#8220;EIVL&#8221;; this is confirmed by the resulting sequence being 214 amino acids long, which both matches the length specified in the DrugBank FASTA header line and falls within our allowed length range.<\/p>\n<p><b>[2] Adalimumab heavy chain<\/b><br \/>\nAbbott Laboratories&#8217; (now AbbVie&#8217;s) <a href=\"http:\/\/en.wikipedia.org\/wiki\/Adalimumab\">Adalimumab<\/a>, sold under the tradename Humira, is a human antibody against TNF-&alpha; for autoimmune disorders.  Here the <a href=\"https:\/\/www.ebi.ac.uk\/chembl\/compound\/inspect\/CHEMBL1201580\">ChEMBL entry (CHEMBL1201580)<\/a> lacks any sequence annotation for either chain, and <a href=\"http:\/\/www.drugbank.ca\/drugs\/DB00051\">DrugBank&#8217;s entry (DB00051)<\/a> contains a heavy chain sequence of only 224 amino acids, too short to be the full sequence.  This partial sequence fragment is actually just the Fab (antigen binding) domain of Adalimumab&#8217;s heavy chain.  
Searching online reveals a <a href=\"http:\/\/www.hesiglobal.org\/files\/public\/Committee%20Presentations\/PATC\/Fry-for%20website-APPROVED.pdf\">presentation from Jeremy Fry in 2012<\/a> that repeats this Fab fragment sequence (on slide 7) together with text &#8220;Full sequence information for Humira is not in the public domain&#8221;.<\/p>\n<p>Fortunately, this is no longer the case as the full sequence of the heavy chain has now been published, found in a bitmap image (figure 8) in a <a href=\"http:\/\/planetorbitrap.com\/download.php?filename=52d9bccd14347.pdf\">PDF whitepaper from Thermo Fisher Scientific<\/a> who use Adalimumab&#8217;s heavy chain as an example case study for intact antibody sequencing with the Orbitrap mass spectrometry hardware.<\/p>\n<p>Here (possibly for the first time in machine-readable form) is the full sequence of Adalimumab&#8217;s heavy chain.<\/p>\n<pre>\r\n&gt;Adalimumab_H\r\nEVQLVESGGGLVQPGRSLRLSCAASGFTFDDYAMHWVRQAPGKGLEWVSAITWNSGHIDY\r\nADSVEGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAKVSYLSTASSLDYWGQGTLVTVS\r\nSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQS\r\nSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTHTCPPCPAPELLG\r\nGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHEDPEVKFNVYVDGVEVHNAKTKPREEQY\r\nNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALPAPIEKTISKAKGQPREPQVYTLPPSRD\r\nELTKNQVSLTCLVKGFYPSDIAVEWESNGQPENNYKTTPPVLDSDGSFFLYSKLTVDKSR\r\nWQQGNVFSCSVMHEALHNHYTQKSLSLSPGK\r\n<\/pre>\n<p><b>[3] Cixutumumab heavy chain<\/b><br \/>\nImClone Systems&#8217; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Cixutumumab\">Cixutumumab<\/a> is a human antibody against IGF-1R.  The <a href=\"https:\/\/www.ebi.ac.uk\/chembldb\/compound\/inspect\/CHEMBL1743001\">ChEMBL entry (CHEMBL1743001)<\/a> provides a heavy chain sequence that is 460 amino acids long.  Whilst not much longer than the permissible range, it does flag this sequence as suspicious.  
This is confirmed by sequence alignment, for example to the heavy chain of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Rafivirumab\">rafivirumab<\/a> shown on the lower lines in the sequence alignment below.<\/p>\n<pre>\r\nSNSAlign (Needleman-Wunsch sequence alignment) version 0.9beta14\r\nCopyright (C) 2014 NextMove Software Limited\r\n\r\nSequence 1 Length: 460\r\nSequence 2 Length: 456\r\nAlignment Type: 1\r\nGap penalty:    -10\r\nExtend penalty: -2\r\nAlignment score: 1542\r\nIdentity:    78.73% (359\/456)\r\nSimilarity:  81.58% (372\/456)\r\n\r\n    EVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY 60\r\n    |||||||||||||||||||||||||||||: | ::|||||||||||||||||||||||||\r\n    EVQLVQSGAEVKKPGSSVKVSCKASGGTFNRYTVNWVRQAPGQGLEWMGGIIPIFGTANY 60\r\n\r\n    LRFLEWSTQDGTAALGCLVKVPSSSLGTQTPSVFLFPPKPKTKPREEQYNKAKGQPREPQ 120\r\n                                                                \r\n    ------------------------------------------------------------ 60\r\n\r\n    ENNYKTTPPVQKSLSLSPGKAQKFQGRVTITADKSTSTAYMELSSLRSEDTAVYYCAR-- 178\r\n                        ||:||||:|||||:||||||||||||||:|||||:|||  \r\n    --------------------AQRFQGRLTITADESTSTAYMELSSLRSDDTAVYFCAREN 100\r\n\r\n    ----APHYYYY-YMDVWGKGTTVTVSSASTKGPSVFPLAPSSKSTSG----------DYF 223\r\n          :||:  : | ||:|| |||||||||||||||||||||||||          |||\r\n    LDNSGTYYYFSGWFDPWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYF 160\r\n\r\n    PEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVV----------TYICNVNHKPSNTK 273\r\n    ||||||||||||||||||||||||||||||||||||          ||||||||||||||\r\n    PEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTK 220\r\n\r\n    VDKKVEPKSCDKTHTCPPCPAPELLGG----------KDTLMISRTPEVTCVVVDVSHED 323\r\n    |||:|||||||||||||||||||||||          |||||||||||||||||||||||\r\n    VDKRVEPKSCDKTHTCPPCPAPELLGGPSVFLFPPKPKDTLMISRTPEVTCVVVDVSHED 280\r\n\r\n    PEVKFNWYVDGVEVHNA----------STYRVVSVLTVLHQDWLNGKEYKCKVSNKALPA 373\r\n    |||||||||||||||||          |||||||||||||||||||||||||||||||||\r\n    
PEVKFNWYVDGVEVHNAKTKPREEQYNSTYRVVSVLTVLHQDWLNGKEYKCKVSNKALPA 340\r\n\r\n    PIEKTIS----------VYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQP--- 420\r\n    |||||||          ||||||||||||||||||||||||||||||||||||||||   \r\n    PIEKTISKAKGQPREPQVYTLPPSREEMTKNQVSLTCLVKGFYPSDIAVEWESNGQPENN 400\r\n\r\n    -------LDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYT--------- 460\r\n           ||||||||||||||||||||||||||||||||||||||||         \r\n    YKTTPPVLDSDGSFFLYSKLTVDKSRWQQGNVFSCSVMHEALHNHYTQKSLSLSPG 456\r\n<\/pre>\n<p>Alas, this mysterious pattern of gaps doesn&#8217;t indicate the discovery of a novel human immunoglobulin <a href=\"http:\/\/en.wikipedia.org\/wiki\/Transposable_element\">transposon<\/a> by the ChEMBL team in Hinxton, but another cut&#8217;n&#8217;paste error.  In <a href=\"http:\/\/en.wikipedia.org\/wiki\/UniProt\">UniProt<\/a> sequence file format, and often in patent filings, protein sequences are written in blocks of ten amino acids.  The telltale alignment above (which looks very pretty as a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dot_plot_%28bioinformatics%29\">dotplot<\/a>) indicates that one of the columns of this sequence file has been shuffled.  If one looks closely, many of the 10-mers missing from the later lines appear consecutively in the anomalous large insertion close to the start.<\/p>\n<p>As demonstrated by the above examples, antibody sequence databases appear to contain a number of serious errors that can be caught\/flagged by even simple quality checks.  Fortunately, advanced antibody validation algorithms are able to identify far more subtle problems in large biologics registration systems, sometimes containing millions of entries.  
Hopefully, corrections for the above issues will shortly be seen in the DrugBank, ChEMBL and IUPHAR databases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NextMove Software&#8217;s Sugar &amp; Splice product is a toolkit for handling biopolymers, including oligopeptides, oligonucleotides and oligosaccharides (and combinations thereof). Amongst its many possible applications is the ability to rapidly identify possible errors in the sequences of antibodies (during registration). Therapeutic antibodies are increasingly important to the pharmaceutical industry, with about 8 out of the &hellip; <a href=\"https:\/\/nextmovesoftware.com\/blog\/2014\/07\/07\/validity-checking-antibody-sequence-data\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Validity checking antibody sequence data<\/span><\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/983"}],"collection":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/comments?post=983"}],"version-history":[{"count":13,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/983\/revisions"}],"predecessor-version":[{"id":998,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/posts\/983\/revisions\/998"}],"wp:attachment":[{"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/media?parent=983"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/categories?p
ost=983"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nextmovesoftware.com\/blog\/wp-json\/wp\/v2\/tags?post=983"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}