Identify pan-cancer genes

Jul 16, 2019

Will there ever be a cure for cancer? Probably; at the very least, we are getting closer. Thanks to advances in technologies such as next-generation sequencing, CRISPR, and deep learning, many people believe that combining unprecedented amounts of biological data with advanced machine learning algorithms can revolutionize drug development.

I plan to write a series of posts about how to analyze large biological data sets for drug discovery, using either simple methods or advanced machine learning algorithms. This post demonstrates one example: identifying cancer essential genes for drug target discovery.

The hypothesis is simple (and probably naive): perturbing genes that are essential for the viability of cancer cell lines can inhibit cancer cell growth, and genes that are essential across multiple cancer cell lines might be good therapeutic targets for treating many types of cancer, regardless of where in the body the cancer started or the type of tissue from which it developed.

A recent ambitious project, the Cancer Dependency Map (DepMap), provides some good data sets. The goal of the project is “to create a comprehensive preclinical reference map connecting tumor features with tumor dependencies to accelerate the development of precision treatments”.

Currently, the project releases genome-scale CRISPR knockout screens for 17,634 genes in 563 cell lines (Data/Achilles_gene_effect.csv). The website also lists a data set collected from other projects using RNAi screens (D2_combined_gene_dep_scores.csv). Since reproducibility is a big issue in biology, I used both data sets. Genes that rank high in both are likely to be truly important, regardless of the genetic screening technology and data processing method.

The two files are matrices of dependency scores. A dependency score is derived from the observed cell viability after knocking down (RNAi) or knocking out (CRISPR) one specific gene in one cell line. A lower score means higher essentiality.
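To make the matrix layout concrete, here is a minimal Python sketch that reads such a file into a per-gene list of scores. It is not the code used for this post. The function name `load_scores`, the `genes_in_rows` flag, and the toy gene and cell line names are all made up for illustration, and which axis holds the genes in each real file is an assumption you should check against the file headers.

```python
import csv
import io

def load_scores(handle, genes_in_rows):
    """Read a dependency matrix into {gene: [score, ...]}, skipping missing values."""
    reader = csv.reader(handle)
    header = next(reader)[1:]            # labels of the non-index columns
    scores = {}
    for row in reader:
        label, values = row[0], row[1:]
        for column, raw in zip(header, values):
            if raw in ("", "NA"):
                continue                 # no measurement for this gene/cell line pair
            gene = label if genes_in_rows else column
            scores.setdefault(gene, []).append(float(raw))
    return scores

# Toy matrix (hypothetical values): cell lines in rows, genes in columns.
toy_csv = io.StringIO(
    ",GENE_A,GENE_B\n"
    "CELL_1,-1.2,0.1\n"
    "CELL_2,-0.8,NA\n"
)
scores = load_scores(toy_csv, genes_in_rows=False)
print(scores)
```

In this toy example GENE_A has the more negative scores, so it would be the more essential gene.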

I simply calculated the average of dependency scores for each gene across all the cancer cell lines in each data set (without considering that different types of cancer may have different numbers of cell lines), and then combined the two data sets by a robust rank aggregation method (RRA).
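The two steps can be sketched in Python as follows. This is a simplified illustration of the order-statistic idea behind RRA, not the code used for this post and not a replacement for an established RRA implementation. The function names and the toy scores are hypothetical, and the correction step (multiplying the smallest p-value by the number of lists and capping at 1) is my understanding of how the score is usually adjusted.

```python
from math import comb
from statistics import mean

def normalized_ranks(scores, genes):
    """Rank genes by mean score (lowest = most essential), scaled to (0, 1]."""
    means = {g: mean(scores[g]) for g in genes}
    ordered = sorted(means, key=means.get)
    return {g: (i + 1) / len(ordered) for i, g in enumerate(ordered)}

def rho_score(ranks):
    """Simplified RRA-style score for one gene from its normalized ranks.

    For each k, compute the probability that the k-th smallest of n uniform
    random ranks is <= the observed k-th smallest rank. Take the minimum,
    multiply by n as a correction, and cap the result at 1.
    """
    r = sorted(ranks)
    n = len(r)
    p_values = [
        sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))
        for k, x in enumerate(r, start=1)
    ]
    return min(1.0, min(p_values) * n)

# Toy per-gene scores across cell lines (hypothetical values).
crispr = {"GENE_A": [-1.2, -0.8], "GENE_B": [0.1, 0.0],
          "GENE_C": [-0.5, -0.3], "GENE_D": [-0.1, 0.1]}
rnai = {"GENE_A": [-0.9, -1.1], "GENE_B": [0.05, -0.05],
        "GENE_C": [-0.2, -0.1], "GENE_D": [-0.6, -0.4]}

# Only genes measured in both data sets can be compared.
shared = sorted(set(crispr) & set(rnai))
rank_lists = [normalized_ranks(crispr, shared), normalized_ranks(rnai, shared)]
rho = {g: rho_score([ranks[g] for ranks in rank_lists]) for g in shared}

# A smaller score means the gene ranks consistently near the top.
for gene in sorted(shared, key=rho.get):
    print(gene, round(rho[gene], 4))
```

With only four toy genes the scores are coarse, but GENE_A, which ranks first in both lists, comes out on top. With thousands of genes the same calculation separates consistently top-ranked genes much more sharply.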

The dependency scores from the two data sets are generally correlated.
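One way to check this kind of agreement is a rank correlation between the per-gene mean scores of the two data sets. The Python sketch below uses Spearman correlation, which is my choice for illustration since the post does not name a specific measure. The function names and the toy means are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1    # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy per-gene mean scores (hypothetical values).
crispr_mean = {"GENE_A": -1.0, "GENE_B": 0.05, "GENE_C": -0.4, "GENE_D": 0.0}
rnai_mean = {"GENE_A": -1.0, "GENE_B": 0.0, "GENE_C": -0.15, "GENE_D": -0.5}

genes = sorted(set(crispr_mean) & set(rnai_mean))
correlation = spearman([crispr_mean[g] for g in genes],
                       [rnai_mean[g] for g in genes])
print(round(correlation, 3))  # 0.8 for this toy data
```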

Many functional analyses can be done after ranking the genes by their essentiality across multiple cancer cell lines. For simplicity, I list only the top 10 genes here, along with some existing studies that support each one as a potential cancer drug target.

SNRPD1 (small nuclear ribonucleoprotein D1 polypeptide). siRNA-mediated depletion of SmE (SNRPE) or SmD1 (SNRPD1) led to a marked reduction of cell viability in breast, lung, and melanoma cancer cell lines … These findings provide a rationale to develop new therapeutic agents targeting spliceosome core components in oncology (PubMed).
RAN (RAN, member RAS oncogene family). The knockdown of Ran GTPase expression inhibits the proliferation and migration of breast cancer cells (PubMed). Ran GTPase protein promotes human pancreatic cancer proliferation by deregulating the expression of Survivin and cell cycle proteins (PubMed).
PSMA3 (proteasome subunit alpha 3). Exosome-transmitted PSMA3 and PSMA3-AS1 promote proteasome inhibitor resistance in multiple myeloma (PubMed). Secretomic profiling of cells from hollow fiber bioreactor reveals PSMA3 as a potential cholangiocarcinoma biomarker (PubMed).
KIF11 (kinesin family member 11). Characterization of KIF11 as a novel prognostic biomarker and therapeutic target for oral cancer (PubMed). KIF11 is required for proliferation and self-renewal of docetaxel resistant triple-negative breast cancer cells (PubMed).
SNRNP200 (small nuclear ribonucleoprotein U5 subunit 200). Donor-derived U5 snRNP200 complex-recognizing AML (acute myeloid leukemia)-specific antibodies may contribute to antitumor responses (PubMed).
HSPE1 (heat shock protein family E (Hsp10) member 1). Elevated expression of HSP10 protein inhibits apoptosis and associates with poor prognosis of astrocytoma (PubMed). High expression of heat shock protein 10 correlates negatively with estrogen/progesterone receptor status and predicts poor prognosis in invasive ductal breast carcinoma (PubMed).
PRPF19 (pre-mRNA processing factor 19, PRP19). Prp19 facilitates invasion of hepatocellular carcinoma via p38 mitogen-activated protein kinase/twist1 pathway (PubMed).
RUVBL1 (RuvB like AAA ATPase 1). RUVBL1, a novel C-RAF-binding protein, activates the RAF/MEK/ERK pathway to promote lung cancer tumorigenesis (PubMed). RUVBL1 directly binds actin filaments and induces formation of cell protrusions to promote pancreatic cancer cell invasion (PubMed).
PSMD7 (proteasome 26S subunit, non-ATPase 7). PSMD7 downregulation induces apoptosis and suppresses tumorigenesis of esophageal squamous cell carcinoma via the mTOR/p70S6K pathway (PubMed).
COPB1 (coatomer protein complex subunit beta 1). Silencing of ER-Golgi trafficking genes (ARF4, COPB1 and USO1) attenuated the metastatic phenotype in vitro and lung colonization in vivo (breast cancer, PubMed).

Looks great! Within just a few minutes, I identified many potential cancer drug targets that might be useful across various types of cancer. (Of course, generating these data sets, further validating the candidate genes, and designing and testing drugs require much more time and resources. In addition, a big confounding factor is that the data came only from cancer cell lines, so many of the top genes may also be essential to normal cells, and drugs targeting them may have toxic effects.)

Part 2: https://medium.com/@encodebox/identify-pan-cancer-genes-part-2-58514c1eb2e2

R code:

Identify pan-cancer genes


Written by Encode Box
