Identify pan-cancer genes — part 2
In the previous post, I used genetic screen data to identify genes that are essential to multiple cancer cell lines. These genes could be good candidates for developing drugs that treat multiple types of cancer. This gene essentiality analysis reminded me of another important concept in biology: the gene network.
A gene network represents interactions between genes. Similar to a social network, these interactions can be shown as a graph in which nodes are genes and edges are interactions. Edges can be directed or undirected, weighted or unweighted, depending on the type of network. Gene networks are very useful for studying gene function and for drug discovery. They also have many connections with deep learning (artificial neural networks). I will write multiple posts in the future to explain gene networks in more detail.
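To make the graph picture concrete, here is a minimal Python sketch of an undirected, unweighted gene network stored as an adjacency map. It is an illustration only, not the R code used for this post; the function name and the gene names (`GENE_A` and so on) are hypothetical.

```python
from collections import defaultdict


def build_undirected_graph(edges):
    """Return {gene: set of neighboring genes} from (gene_a, gene_b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        # Undirected: record the interaction in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical toy interactions, for illustration only.
toy_edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
]
toy_graph = build_undirected_graph(toy_edges)
```

A directed network would store only `graph[a].add(b)`, and a weighted one could map each neighbor to a number instead of using a set.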
This post continues the work of identifying pan-cancer genes by adding gene network information. Gene interactions can be learned from experiments or inferred from data. For example, a genetic screen experiment can identify genetic interactions. If knocking down two genes together causes a phenotypic change, e.g. cell death, but knocking down either gene alone does not have the same effect, we can tell there is some interaction between the two genes. Unfortunately, the genetic screen data used in the previous post was not designed for detecting genetic interactions: only one gene was knocked down or knocked out in each cell. Don’t worry. I did not intend to use that data to build a gene network anyway.
In a gene network, a gene with many neighbors is called a hub. It is similar to a hub airport on a flight map or a celebrity in a social network. The hypothesis I am going to test is that genes with more neighbors are more essential. If the hypothesis is correct, the top hub genes are good candidates for cancer drug targets.
I downloaded the human gene network from BioGRID, which contains over a million protein and genetic interactions across many species, curated from publications. I then compared node degree (the number of neighbors a gene has) with gene essentiality. Rank scores from the previous post were used to represent gene essentiality in cancer cell lines.
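As a rough illustration of how node degree can be computed from a list of interaction records, here is a small Python sketch. It assumes the records have already been reduced to pairs of gene symbols. It also assumes that repeated records for the same pair and self-interactions should not count toward degree, which is a choice made for this sketch and not necessarily what the analysis in this post did. The gene names are hypothetical.

```python
def node_degrees(interactions):
    """Count distinct neighbors per gene from (gene_a, gene_b) pairs."""
    neighbors = {}
    for a, b in interactions:
        if a == b:
            continue  # assumption: a self-interaction does not add a neighbor
        # Sets collapse repeated records of the same pair.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {gene: len(nbrs) for gene, nbrs in neighbors.items()}


# Hypothetical records, including a duplicate pair and a self-interaction.
records = [
    ("G1", "G2"),
    ("G2", "G1"),
    ("G1", "G3"),
    ("G1", "G1"),
    ("G1", "G4"),
    ("G3", "G4"),
]
degrees = node_degrees(records)

# Genes sorted from highest to lowest degree; the top entries are the hubs.
hubs = sorted(degrees, key=lambda g: (-degrees[g], g))
```

Note that a gene appearing only in self-interactions gets no entry at all in this sketch.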
Despite the big dark mess on the right of the scatter plot, it is very interesting to see the strong correlation on the left. So I decided to zoom in and focus on the top 1000 genes by essentiality rank score.
WOW! The result is surprisingly good. The correlation is statistically significant (P-value < 2.2e-16 in the linear model). Note that the plot clearly shows a violation of one linear regression assumption, homogeneity of variance, so the model fit should be read with some caution. The significant correlation between gene network node degree and gene essentiality indicates that gene network analysis is a powerful tool for drug discovery. (A similar result was observed by another group.)
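For readers who want to see what the linear model is estimating, here is a minimal Python sketch of an ordinary least-squares fit together with the Pearson correlation. It is not the R code used for this post, and the numbers are made up. Treating rank 1 as the most essential gene is also an assumption of this sketch.

```python
import math


def linear_fit(x, y):
    """Return (slope, intercept, pearson_r) for a least-squares line y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Hypothetical values: essentiality rank (1 = most essential) and node degree.
essentiality_rank = [1, 2, 3, 4, 5]
node_degree = [50, 40, 35, 20, 10]
slope, intercept, r = linear_fit(essentiality_rank, node_degree)
```

Because the variance is not constant in the real data, a rank-based measure such as Spearman correlation would be a reasonable cross-check on the same two columns.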
One further application of this correlation is to combine the gene network with the genetic screen data to prioritize drug targets. Using the same robust rank aggregation (RRA) method, the top 10 genes ranked on both genetic screen and network degree information are listed below, each with cancer-related findings reported for it:
- VCP (valosin containing protein): cancer cell–accelerated fibroblast migration, prostate cancer
- PRPF8 (pre-mRNA processing factor 8): myeloid neoplasm
- CDC5L (cell division cycle 5 like): colorectal cancer, cervical tumors, and osteosarcoma
- PCNA (proliferating cell nuclear antigen): many types of cancer cells
- DDB1 (damage specific DNA binding protein 1): mediates p16 activation during oncogenic checkpoint response
- RPA1 (replication protein A1): hepatocellular carcinoma, colon cancer
- XPO1 (exportin 1): pancreatic cancer, lung cancer
- RPS6 (ribosomal protein S6): pancreatic cancer, lung cancer
- POLR2A (RNA polymerase II subunit A): triple negative breast cancer, colorectal cancer
- RPA2 (replication protein A2): breast cancer, colon cancer
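To give a feel for how two rankings can be merged, here is a simplified Python sketch of the rho score at the core of robust rank aggregation. It converts each position to a normalized rank and takes the smallest order-statistic probability under a uniform null, with a Bonferroni-style correction. It is my own simplified reading of the method, not the R implementation used for this post. It also assumes every ranking contains exactly the same genes, and the gene names are hypothetical.

```python
from math import comb


def order_stat_cdf(x, k, n):
    """P(the k-th smallest of n independent Uniform(0,1) values is <= x)."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


def rho_score(normalized_ranks):
    """Smallest order-statistic probability, multiplied by n and capped at 1."""
    r = sorted(normalized_ranks)
    n = len(r)
    best = min(order_stat_cdf(r[k - 1], k, n) for k in range(1, n + 1))
    return min(best * n, 1.0)


def aggregate(rankings):
    """rankings: ordered gene lists (best first) over the same set of genes."""
    size = len(rankings[0])
    normalized = {gene: [] for gene in rankings[0]}
    for ranking in rankings:
        for position, gene in enumerate(ranking, start=1):
            normalized[gene].append(position / size)
    scores = {gene: rho_score(vals) for gene, vals in normalized.items()}
    # Lower score = ranked consistently near the top; ties broken by name.
    order = sorted(scores, key=lambda g: (scores[g], g))
    return order, scores


# Hypothetical rankings: one from the screen, one from network degree.
by_screen = ["G1", "G2", "G3", "G4"]
by_degree = ["G1", "G3", "G2", "G4"]
order, scores = aggregate([by_screen, by_degree])
```

With only two input lists the scores saturate quickly, so the method is more informative when more rankings are combined.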
Most of these genes, like those in the list from the previous post, are cell cycle related. It is OK that the gene lists from the two posts are different: biological data sets usually have very high variance. Top genes based on a few data sets are probably not the best drug targets, but at least they can demonstrate that the hypothesis or analysis method works (more or less).
I didn’t find a good gene network specific to cancer, so the same caveat as in the first post applies: essential genes identified from a general gene network are probably also important for normal cells.
R code: