Refining the impact of genetic evidence on clinical success



Definition of metrics

Except where otherwise noted, we define genetic support of a drug mechanism (that is, a T–I pair) as a genetic association mapped to the corresponding target gene for a trait that is ≥0.8 similar to the indication (see MeSH term similarity below). We define P(G) as the proportion of drug mechanisms satisfying this definition of genetic support. P(S) is the proportion of programmes in one phase that advance to a subsequent phase (for instance, phase I to phase II). Overall P(S) from phase I to launched is the product of P(S) at each individual phase. RS is the ratio of P(S) for programmes with genetic support to P(S) for programmes lacking genetic support, which is equivalent to a relative risk or risk ratio. Thus, if N denotes the total number of programmes that have reached the reference phase, X denotes the number of those that advance to a later phase of interest, and the subscripts G and !G indicate the presence or absence of genetic support, then P(G) = N_G/(N_G + N_!G); P(S) = (X_G + X_!G)/(N_G + N_!G); RS = (X_G/N_G)/(X_!G/N_!G). RS from phase I to launched is the product of RS at each individual phase. The count of ‘programmes’ for X and N is T–I pairs throughout, except for Fig. 3d, which uses D–I pairs to specifically interrogate P(G) for cases in which the same drug has been developed for different indications. For clarity, we note that whereas other recent studies (refs. 22,25) have examined the fold enrichment and overlap between genes with human genetic support and genes encoding a drug target, without regard to similarity, herein all of our analyses are conditioned on the similarity between the drug’s indication and the genetically associated trait.
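
As an illustration of these definitions, the following Python sketch computes P(G), P(S) and RS from the four counts for a single phase transition, and multiplies per-phase RS values to obtain an overall RS. The class name, function names and example counts are hypothetical; they are not taken from the study's code or data.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class PhaseCounts:
    """Counts of T-I pairs for one phase transition (illustrative only)."""
    n_g: int      # N_G: supported pairs that reached the reference phase
    x_g: int      # X_G: supported pairs that advanced
    n_not_g: int  # N_!G: unsupported pairs that reached the reference phase
    x_not_g: int  # X_!G: unsupported pairs that advanced


def p_g(c: PhaseCounts) -> float:
    """P(G) = N_G / (N_G + N_!G)."""
    return c.n_g / (c.n_g + c.n_not_g)


def p_s(c: PhaseCounts) -> float:
    """P(S) = (X_G + X_!G) / (N_G + N_!G)."""
    return (c.x_g + c.x_not_g) / (c.n_g + c.n_not_g)


def rs(c: PhaseCounts) -> float:
    """RS = (X_G / N_G) / (X_!G / N_!G), a risk ratio."""
    return (c.x_g / c.n_g) / (c.x_not_g / c.n_not_g)


def overall_rs(phases: list[PhaseCounts]) -> float:
    """Overall RS as the product of RS at each individual phase."""
    return prod(rs(c) for c in phases)


# Made-up counts for two consecutive phase transitions.
first = PhaseCounts(n_g=100, x_g=60, n_not_g=900, x_not_g=450)
second = PhaseCounts(n_g=60, x_g=30, n_not_g=450, x_not_g=150)
print(p_g(first), p_s(first), rs(first), overall_rs([first, second]))
```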

Drug development pipeline

Citeline Pharmaprojects (ref. 26) is a curated database of drug development programmes including preclinical, all clinical phases and launched (approved and marketed) drugs. It was queried via API (22 December 2022) to obtain information on drugs, targets, indications, phases reached and current development status. T–I pair was the unit of analysis throughout, except where otherwise indicated in the text (D–I pairs were examined in Fig. 3d). Current development status was defined as ‘active’ if the T–I pair had at least one drug still in active development, and ‘historical’ if development of all drugs for the T–I pair had ceased. Targets were defined as genes; as most drugs do not directly target DNA, this usually refers to the gene encoding the protein target that is bound or modulated by the drug. We removed combination therapies, diagnostic indications and programmes with no human target or no indication assigned. For most analyses, only programmes added to the database since 2000 were included, whereas for the count and similarity of launched indications per target, we used all launches for all time. Indications were considered to possess ‘genetic insight’ (meaning the human genetics of this trait or similar traits have been successfully studied) if they had ≥0.8 similarity to (1) an OMIM or IntOGen disease, or (2) a GWAS trait with at least 3 independently associated loci, on the basis of lead SNP positions rounded to the nearest 1 megabase. For calculating RS, we used the number of T–I pairs with genetic insight as the denominator. The rationale for this choice is to focus on indications for which there exists the opportunity for human genetic evidence, consistent with the filter applied previously (ref. 5). However, we observe that our findings are not especially sensitive to the presence of this filter, with RS decreasing by just 0.17 when the filter is removed (Extended Data Fig. 3g,h). Note that the criteria for determining genetic insight are distinct from, and much looser than, the criteria for mapping GWAS hits to genes (see L2G scores under OTG below). Many drugs had more than one target assigned, in which case all targets were retained for T–I pair analyses. As a sensitivity test, running our analyses restricted to only drugs with exactly one target assigned yielded very similar results (Supplementary Figures).
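
The locus-counting part of the genetic insight criterion can be sketched as follows. This is a minimal Python illustration that assumes lead SNPs are supplied as (chromosome, base-pair position) tuples and that two lead SNPs count as the same locus when they share a chromosome and a rounded megabase; the function names and input format are hypothetical.

```python
def count_independent_loci(lead_snps):
    """Count distinct loci among (chromosome, position_bp) lead SNPs.

    Positions are rounded to the nearest megabase using integer arithmetic,
    and SNPs sharing a chromosome and rounded megabase count as one locus.
    """
    return len({(chrom, (pos + 500_000) // 1_000_000) for chrom, pos in lead_snps})


def gwas_trait_has_enough_loci(lead_snps, min_loci=3):
    """True if the trait has at least `min_loci` loci (3 in the text above)."""
    return count_independent_loci(lead_snps) >= min_loci


# Made-up lead SNPs: the first two fall in the same rounded megabase.
snps = [("1", 1_200_000), ("1", 1_400_000), ("1", 5_000_000), ("2", 1_100_000)]
print(count_independent_loci(snps), gwas_trait_has_enough_loci(snps))
```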

OMIM

OMIM is a curated database of Mendelian gene–disease associations. The OMIM Gene Map (downloaded 21 September 2023) contained 8,671 unique gene–phenotype links. We restricted to entries with phenotype mapping code 3 (‘the molecular basis for the disorder is known; a mutation has been found in the gene’), removed phenotypes with no MIM number or no gene symbol assigned, and removed duplicate combinations of gene MIM and phenotype MIM. We used regular expression matching to further filter out phenotypes containing the terms ‘somatic’, ‘susceptibility’ or ‘response’ (drug response associations) and those flagged as questionable (‘?’), or representing non-disease phenotypes (‘[’). A set of OMIM phenotypes are flagged as denoting susceptibility rather than causation (‘{’); this category includes low-penetrance or high allele frequency association assertions that we wished to exclude, but also germline heterozygous loss-of-function mutations in tumour suppressor genes, for which the underlying mechanism of disease initiation is loss of heterozygosity, which we wished to include. We therefore also filtered out phenotypes containing ‘{’ except for those that did contain the terms ‘cancer’, ‘neoplasm’, ‘tumor’ or ‘malignant’ and did not contain the term ‘somatic’. Remaining entries present in OMIM as of 2021 were further evaluated for validity by two curators, and gene–disease combinations for which a disease association was deemed not to have been established were excluded from all analyses. All of the above filters left 5,670 unique G–T links. MeSH terms for OMIM phenotypes were then mapped via the EFO OWL database following an approach previously described (ref. 27), with further mappings from Orphanet, full text matches to the full MeSH vocabulary and, finally, manual curation, for a cumulative mapping rate of 93% (5,297 of 5,670). Because distinct phenotype MIM numbers sometimes mapped to the same MeSH term, this yielded 4,510 unique gene–MeSH links.
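
The text-based phenotype filters can be sketched in Python as below. This is one literal reading in which each filter is applied in sequence; how the ‘susceptibility’ term filter interacts with the ‘{’ exception is not specified above, so that ordering is an assumption. The function name and the example phenotype strings are hypothetical and are not real OMIM entries.

```python
import re

# Terms that exclude a phenotype outright (per the description above).
EXCLUDE_TERMS = re.compile(r"somatic|susceptibility|response", re.IGNORECASE)
# Terms that rescue a '{'-flagged phenotype.
CANCER_TERMS = re.compile(r"cancer|neoplasm|tumor|malignant", re.IGNORECASE)


def keep_phenotype(phenotype: str, mapping_code: int) -> bool:
    """Return True if a phenotype string passes the text filters."""
    if mapping_code != 3:
        return False
    if EXCLUDE_TERMS.search(phenotype):
        return False
    if "?" in phenotype or "[" in phenotype:
        return False
    if "{" in phenotype:
        # 'somatic' is already excluded above, so only the cancer terms remain.
        return bool(CANCER_TERMS.search(phenotype))
    return True


# Illustrative strings only.
for text in ["Example disorder", "?Example disorder", "[Example trait]",
             "{Example tumor predisposition}", "{Example common trait}"]:
    print(text, keep_phenotype(text, mapping_code=3))
```

Under this reading, a ‘{’-flagged entry whose text also contains ‘susceptibility’ is dropped by the term filter before the exception is considered; a different ordering would retain it.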

OTG

OTG is a database of GWAS hits from published studies and biobanks. OTG version 8 (12 October 2022) variant-to-disease, L2G, variant index and study index data were downloaded from EBI. Traits with multiple EFO IDs were excluded as these generally represent conditional, epistasis or other complex phenotypes that would lack mappings in the MeSH vocabulary. Of the top 100 traits with the greatest number of genes mapped, we excluded 76 as having no clear disease relevance (for example, ‘red cell distribution width’) or no obvious marginal value (for example, excluded ‘trunk predicted mass’ because ‘body mass index’ was already included). Remaining traits were mapped to MeSH using the EFO OWL database, full text queries to the MeSH API, mappings already manually curated in PICCOLO (see below) or new manual curation. In total, 25,124 of 49,599 unique traits (51%) were successfully mapped to a MeSH ID. We included associations with P < 5 × 10^−8. OTG L2G scores used for gene mapping are based on a machine learning model trained on gold standard causal genes (ref. 28); inputs to that model include distance, functional annotations, expression quantitative trait loci (eQTLs) and chromatin interactions. Note that we do not use Mendelian randomization (ref. 29) to map causal genes, and even gene mappings with high L2G scores are necessarily imperfect. OTG provides an L2G score for the triplet of each study or trait with each hit and each possible causal gene. We defined L2G share as the proportion of the total L2G score assigned to each gene among all potentially causal genes for that trait–hit combination. In sensitivity analyses we considered L2G share thresholds from 10% to 100% (Fig. 1b and Extended Data Fig. 3a), but main analyses used only genes with ≥50% L2G share (which are also the top-ranked genes for their respective associations). OTG links were parsed to determine the source of each OTG data point: the EBI GWAS catalog (ref. 30) (n = 136,503 hits with L2G share ≥0.5), Neale UK Biobank (http://www.nealelab.is/uk-biobank; n = 19,139), FinnGen R6 (ref. 31) (n = 2,338) or SAIGE (n = 1,229).
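
The L2G share defined above amounts to normalizing each gene's L2G score by the summed score of all candidate genes at the same trait–hit combination. A minimal Python sketch follows; the row layout, function names, identifiers and scores are hypothetical and do not reflect the OTG file schema.

```python
from collections import defaultdict


def l2g_shares(rows):
    """Compute L2G share per (study, hit, gene).

    rows: iterable of (study_id, lead_variant, gene, l2g_score).
    Share = gene score / total score across candidate genes for that hit.
    """
    rows = list(rows)
    totals = defaultdict(float)
    for study, variant, _gene, score in rows:
        totals[(study, variant)] += score
    return {
        (study, variant, gene): (score / totals[(study, variant)]
                                 if totals[(study, variant)] > 0 else 0.0)
        for study, variant, gene, score in rows
    }


def genes_passing(shares, threshold=0.5):
    """Keep (study, hit, gene) triplets whose share meets the threshold."""
    return {key for key, share in shares.items() if share >= threshold}


# Made-up scores for two hits in one study.
example = [
    ("STUDY_1", "hit_1", "GENE_A", 0.5),
    ("STUDY_1", "hit_1", "GENE_B", 0.25),
    ("STUDY_1", "hit_1", "GENE_C", 0.25),
    ("STUDY_1", "hit_2", "GENE_D", 0.25),
    ("STUDY_1", "hit_2", "GENE_E", 0.5),
]
print(sorted(genes_passing(l2g_shares(example))))
```

The default threshold of 0.5 mirrors the ≥50% cut-off used in the main analyses; passing other values reproduces the kind of threshold sweep described for the sensitivity analyses.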

PICCOLO

PICCOLO (ref. 32) is a database of GWAS hits with gene mapping based on tests for colocalization without full summary statistics by using Probabilistic Identification of Causal SNPs (PICS) and a reference dataset of SNP linkage disequilibrium values. As described (ref. 32), gene mapping uses quantitative trait locus (QTL) data from GTEx (n = 7,162) and a variety of other published sources (n = 6,552). We included hits with GWAS P < 5 × 10^−8, and with eQTL P < 1 × 10^−5, and posterior probability H4 ≥ 0.9, as these thresholds were determined empirically (ref. 32) to strongly predict colocalization results.

Genebass

Genebass (ref. 33) is a database of genetic associations based on exome sequencing. Genebass data from 394,841 UK Biobank participants (the ‘500K’ release) were queried using Hail (19 October 2023). We used hits from four models: pLoF (predicted loss-of-function) or missense|LC (missense and low confidence LoF), each with sequencing kernel association test (SKAT) or burden tests, filtering for P < 1 × 10^−5. Because the traits in Genebass are from UK Biobank, which is included in OTG, we used the OTG MeSH mappings established above.

IntOGen

IntOGen is a database of enrichments of somatic genetic mutations within cancer types. We used the driver genes and cohort information tables (31 May 2023). IntOGen assigns each gene a mechanism in each tumour type; occasionally, a gene will be classified as a tumour suppressor in one type and an oncogene in another. We grouped by gene and assigned each gene its modal classification across cancers. MeSH mappings were curated manually.
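
Assigning each gene its modal classification can be sketched as below. This is a minimal Python illustration; the input layout, gene names and tumour type labels are hypothetical, and the handling of ties (the classification seen first wins) is an assumption, as the text does not state how ties were resolved.

```python
from collections import Counter, defaultdict


def modal_mechanism(rows):
    """Assign each gene its most frequent mechanism across tumour types.

    rows: iterable of (gene, tumour_type, mechanism).
    Ties go to the mechanism encountered first for that gene.
    """
    by_gene = defaultdict(Counter)
    for gene, _tumour_type, mechanism in rows:
        by_gene[gene][mechanism] += 1
    return {gene: counts.most_common(1)[0][0] for gene, counts in by_gene.items()}


# Made-up classifications across tumour types.
example = [
    ("GENE_A", "type_1", "tumour suppressor"),
    ("GENE_A", "type_2", "tumour suppressor"),
    ("GENE_A", "type_3", "oncogene"),
    ("GENE_B", "type_1", "oncogene"),
]
print(modal_mechanism(example))
```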

MeSH term similarity

MeSH terms in either Pharmaprojects or the genetic associations datasets that were Supplementary Concept Records (IDs beginning with ‘C’) were mapped to their respective preferred main headings (IDs beginning with ‘D’). A matrix of all possible combinations of drug indication MeSH IDs and genetic association MeSH IDs was constructed. MeSH term Lin and Resnik similarities were computed for each pair as described (refs. 34,35). Similarities of −1, indicating infinite distance between two concepts, were assigned as 0. The two scores were regressed against each other across all term pairs, and the Resnik scores were adjusted by a multiplier such that both scores had a range from 0 to 1 and their regression had a slope of 1. The two scores were then averaged to obtain a combined similarity score. Similarity scores were successfully calculated for 1,006 of 1,013 (99.3%) unique MeSH terms for Pharmaprojects indications, corresponding to 99.67% of Pharmaprojects T–I pairs, and for 2,260 of 2,262 (99.9%) unique MeSH terms for genetic associations, corresponding to >99.9% of associations.
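
The combination of the two similarity scores can be sketched as follows. The text does not specify the regression form, so this Python sketch makes two assumptions that should be read as one possible interpretation only: the multiplier is the no-intercept least-squares slope of Lin on Resnik, and rescaled Resnik scores are capped at 1. The function name and example values are hypothetical.

```python
def combine_similarities(lin, resnik):
    """Average Lin and rescaled Resnik similarities for matched term pairs.

    Assumptions (not stated in the text): the Resnik multiplier is the
    through-origin least-squares slope of Lin on Resnik, and rescaled
    Resnik values are capped at 1.
    """
    # Similarities of -1 (infinite distance) are set to 0.
    lin = [0.0 if s == -1 else s for s in lin]
    resnik = [0.0 if s == -1 else s for s in resnik]

    denom = sum(r * r for r in resnik)
    multiplier = sum(l * r for l, r in zip(lin, resnik)) / denom if denom else 1.0

    scaled = [min(1.0, r * multiplier) for r in resnik]
    return [(l + r) / 2 for l, r in zip(lin, scaled)]


# Made-up scores for four term pairs; the third pair has no common ancestor.
print(combine_similarities([0.25, 0.5, -1, 1.0], [0.5, 1.0, -1, 2.0]))
```

A combined score of ≥0.8 from such a function would correspond to the similarity threshold used to define genetic support.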

Therapeutic areas

MeSH terms for Pharmaprojects indications were mapped onto 16 top-level headings under the Diseases [C] and Psychiatry and Psychology [F] branches of the MeSH tree (https://meshb.nlm.nih.gov/treeView), plus an ‘other’. The signs/symptoms area corresponds to C23 Pathological Conditions, Signs and Symptoms and contains entries such as inflammation and pain. Many MeSH terms map to >1 tree positions; these multiples were retained and counted towards each therapy area, except for the following conditions: for terms mapped to oncology, we deleted their mappings to all other areas; and ‘other’ was used only for terms that mapped to no other areas.
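
The two exceptions to multiple counting can be expressed as a short rule. The Python sketch below assumes each MeSH term arrives with the set of therapy areas its tree positions map to; the function name and the area labels other than ‘oncology’ and ‘other’ are hypothetical.

```python
def assign_therapy_areas(mapped_areas):
    """Resolve the therapy areas counted for one MeSH term.

    - Terms mapped to oncology are counted under oncology only.
    - Terms mapped to no area are counted under 'other'.
    - Otherwise every mapped area is retained.
    """
    areas = set(mapped_areas)
    if "oncology" in areas:
        return {"oncology"}
    if not areas:
        return {"other"}
    return areas


print(assign_therapy_areas({"oncology", "respiratory"}))
print(assign_therapy_areas({"cardiovascular", "endocrine"}))
print(assign_therapy_areas(set()))
```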

Analysis of T2D GWASs

We included 19 genes from OMIM linked to Mendelian forms of diabetes or syndromes with diabetic features. For Vujkovic et al. (ref. 18), we considered as novel any genes with a novel nearest gene, novel coding variant or a novel lead SNP colocalized with an eQTL with H4 ≥ 0.9. Non-novel nearest genes, coding variants and colocalized lead SNPs were considered established variants. For Suzuki et al. (ref. 19), we used the available L2G scores that OTG had assigned for the same lead SNPs in previously reported GWASs for other phenotypes, yielding mapped genes with L2G share >0.5 for 27% of loci. Genes were considered novel if absent from the Vujkovic analysis. Together, these approaches identified 217 established GWAS genes and 645 novel ones (469 from Vujkovic and 176 from Suzuki). We identified 347 unique drug targets in Pharmaprojects reported with a T2D or diabetes mellitus indication, including 25 approved. We reviewed the list of approved drugs and eliminated those for which there were questions around the relevance of the drug or target to T2D (AKR1B1, AR, DRD1, HMGCR, IGF1R, LPL, SLC5A1). Because Pharmaprojects ordinarily specifies the receptor as target for protein or peptide replacement therapies, we also remapped the minority of programmes for which the ligand, rather than receptor, had been listed as target (changing INS to INSR, GCG to GCGR). To assess the proportion of programmes with genetic support, we first grouped by drug and selected just one target, preferring the target with the earliest genetic support (OMIM, then established GWASs, then novel GWASs, then none). Next we grouped by target and selected its highest phase reached. Finally, we grouped by highest phase reached and counted the number of unique targets.
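
The three grouping steps at the end of this paragraph can be sketched in Python as below. The input layout, phase labels, drug and gene names are hypothetical, and breaking ties between equally supported targets alphabetically is an assumption not stated in the text.

```python
from collections import Counter

# Earliest genetic support first, as ordered in the text.
SUPPORT_ORDER = ["OMIM", "established GWAS", "novel GWAS", "none"]
# Phase labels are illustrative.
PHASE_ORDER = ["Preclinical", "Phase I", "Phase II", "Phase III", "Launched"]


def count_targets_by_highest_phase(drugs, support_by_target):
    """Count unique targets by highest phase reached.

    drugs: dict mapping drug -> (list of targets, highest phase of that drug).
    support_by_target: dict mapping target -> support category.
    """
    highest = {}
    for _drug, (targets, phase) in drugs.items():
        # Step 1: one target per drug, preferring the earliest genetic support.
        chosen = min(
            targets,
            key=lambda t: (SUPPORT_ORDER.index(support_by_target.get(t, "none")), t),
        )
        # Step 2: per target, keep the highest phase reached.
        if (chosen not in highest
                or PHASE_ORDER.index(phase) > PHASE_ORDER.index(highest[chosen])):
            highest[chosen] = phase
    # Step 3: count unique targets at each highest phase.
    return Counter(highest.values())


drugs = {
    "drug_1": (["GENE_A", "GENE_X"], "Launched"),
    "drug_2": (["GENE_X"], "Phase II"),
    "drug_3": (["GENE_Y", "GENE_X"], "Phase I"),
}
support = {"GENE_A": "OMIM", "GENE_Y": "novel GWAS"}
print(count_targets_by_highest_phase(drugs, support))
```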

Universe of possible genetically supported G–I pairs

In all of our analyses, targets are defined as human gene symbols, but we use the term G–I pair to refer to possible genes that one might attempt to target with a drug, and T–I pair to refer to genes that are the targets of actual drug candidates in development. To enumerate the space of possible G–I pairs, we multiplied the n = 769 Pharmaprojects indications considered here by the ‘universe’ of n = 19,338 protein-coding genes, yielding a space of n = 14,870,922 possible G–I pairs. Of these, n = 101,954 (0.69%) qualify as having genetic support per our criteria. A total of 16,808 T–I pairs have reached at least phase I in an active or historical programme, of which 1,155 (6.9%) are genetically supported. This represents an enrichment compared with random chance (OR = 11.0, P < 1.0 × 10^−15, Fisher’s exact test), but in absolute terms, only 1.1% of genetically supported G–I pairs have been pursued. A genetically supported G–I pair may be less likely to attract drug development interest if the indication already has many other potential targets, and/or if the indication is but the second-most similar to the gene’s associated trait. Removing associations with many GWAS hits and restricting to the single most similar indication left a space of 34,190 possible genetically supported G–I pairs, 719 (2.1%) of which had been pursued. This small percentage might yet be perceived to reflect competitive saturation, if the vast majority of indications are undevelopable and/or the vast majority of targets are undruggable. We therefore asked what proportion of genetically supported G–I pairs had been developed to at least phase I, as a function of therapy area cross-tabulated against Open Targets predicted tractability status or membership in canonically ‘druggable’ protein families, using families from ref. 22 as well as UniProt pkinfam for kinases (ref. 36). We also grouped at the level of gene, rather than G–I pair (Extended Data Fig. 8).
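
The size of the G–I universe and the percentages quoted in this paragraph follow directly from the stated counts, as the short Python check below shows. Only numbers given in the text are used; the variable names are illustrative.

```python
# Counts as stated in the text.
n_indications = 769
n_genes = 19_338
supported_pairs = 101_954
phase1_pairs = 16_808
phase1_supported = 1_155
restricted_supported = 34_190
restricted_pursued = 719

universe = n_indications * n_genes  # all possible G-I pairs

pct_supported = 100 * supported_pairs / universe
pct_phase1_supported = 100 * phase1_supported / phase1_pairs
pct_supported_pursued = 100 * phase1_supported / supported_pairs
pct_restricted_pursued = 100 * restricted_pursued / restricted_supported

print(universe)
print(round(pct_supported, 2), round(pct_phase1_supported, 1))
print(round(pct_supported_pursued, 1), round(pct_restricted_pursued, 1))
```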

Druggability and protein families

Antibody and small molecule druggability status was taken from Open Targets (ref. 37). For antibody tractability, Clinical Precedence, Predicted Tractable–High Confidence and Predicted Tractable–Medium to Low Confidence were included. For small molecules, Clinical Precedence, Discovery Precedence and Predicted Tractable were included. Protein families were from sources described previously (ref. 22), plus the pkinfam kinase list from UniProt (ref. 36). To make these lists non-overlapping, genes that were both kinases and also enzymes, ion channels or nuclear receptors were considered to be kinases only.

Statistics

Analyses were conducted in R 4.2.0. For binomial proportions P(G) and P(S), error bars are Wilson 95% CIs, except for P(S) for phase I–launch, for which the Wald method is used to compute the confidence intervals on the product of the individual probabilities of success at each phase. RS uses Katz 95% CIs, with the phase I–launch RS based on the number of programmes entering phase I and succeeding in phase III. Effects of continuous variables on probability of launch were assessed using logistic regression. Differences in RS between therapy areas were tested using the Cochran–Mantel–Haenszel chi-squared test (cmh.test from the R lawstat package, v.3.4). Pipeline progression of D–I pairs conditioned on the highest phase reached by a drug was modelled using an ordinal logit model (polr with Hess = TRUE from the R MASS package, v.7.3-56). Correlations across therapy areas were tested by weighted Pearson’s correlation (wtd.cor from the R weights package, v.1.0.4); to control for the amount of data available in each therapy area, the number of genetically supported T–I pairs having reached at least phase I was used as the weight. Enrichments of T–I pairs in the utilization analysis were tested using Fisher’s exact test. All statistical tests were two-sided.
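
For readers unfamiliar with the two interval methods, the Python sketch below implements the standard Wilson score interval for a binomial proportion and the standard Katz log interval for a risk ratio. It is an illustration of the textbook formulas rather than the R code used in the study, it does not cover the Wald interval on the product of phase probabilities, and the function names and example counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def wilson_ci(x, n, level=0.95):
    """Wilson score interval for a binomial proportion x/n."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p = x / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def katz_ci(x_g, n_g, x_not_g, n_not_g, level=0.95):
    """Katz log interval for the risk ratio (x_g/n_g) / (x_not_g/n_not_g)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    rr = (x_g / n_g) / (x_not_g / n_not_g)
    se = sqrt(1 / x_g - 1 / n_g + 1 / x_not_g - 1 / n_not_g)
    return rr * exp(-z * se), rr * exp(z * se)


# Made-up counts.
print(wilson_ci(50, 100))
print(katz_ci(60, 100, 450, 900))
```

The Katz interval is symmetric on the log scale, so the product of its two limits equals the squared risk ratio.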

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


