Applying deep learning to transcriptome-based supervised learning
Deep learning has brought breakthroughs in image, video, speech and text analysis [1], and it has attracted considerable interest in biology [2]. Several applications have shown it to be significantly and convincingly superior to traditional machine learning methods in tasks such as disease diagnosis from image data [3], variant calling [4], and regulatory DNA element detection [5].
One area, however, is still in its infancy: supervised learning on transcriptomic data. Transcriptomics and other -omics hold great promise in biology and medicine, e.g. for understanding disease, discovering drug targets and biomarkers, and stratifying patients.
Deep learning has been very successful at end-to-end supervised learning, and in principle it could predict phenotypes (e.g. disease type, subtype, stage, and progression) directly from whole transcriptomes. Ideally, the model would detect complex signals from many genes and their interactions without the bias and variation introduced by filtering and selecting “relevant” genes, whether through traditional differential expression analysis, correlation analysis, machine learning models, or manual selection based on prior knowledge.
However, transcriptomic data lack the advantages of those successful deep learning applications in biology, which usually have either a large number of samples or a small number of input features. For example, in regulatory DNA element detection, a single input sample is only a short stretch of one-hot encoded nucleotides. By contrast, a transcriptome covers more than 20,000 protein-coding genes and even more non-coding genes, and despite rapid advances in sequencing technologies, transcriptomic datasets usually have very small sample sizes and large systematic sources of variation.
This high dimensionality and noise make it difficult to apply any machine learning algorithm successfully to transcriptomic data. Smart algorithms and data processing techniques may be able to overcome these issues. For example, although not in transcriptomics, a deep learning model trained on only 30 images (512×512 pixels) achieved high accuracy in biomedical image segmentation [6], thanks to the innovative U-Net architecture and extensive data augmentation.
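By analogy, one might try augmenting expression profiles as well, e.g. by jittering log-expression values with small Gaussian noise. The sketch below is purely hypothetical (the function name, noise scale, and number of copies are all assumptions) and is not a technique evaluated in the studies cited here.

```python
import numpy as np

def augment_expression(X, y, n_copies=5, noise_sd=0.1, seed=0):
    """Hypothetical augmentation: add Gaussian noise to log-expression profiles.

    X: (n_samples, n_genes) log-expression matrix; y: (n_samples,) labels.
    Returns the original samples plus n_copies noisy replicates of each.
    """
    rng = np.random.default_rng(seed)
    Xs, ys = [X], [y]
    for _ in range(n_copies):
        Xs.append(X + rng.normal(scale=noise_sd, size=X.shape))
        ys.append(y)
    return np.vstack(Xs), np.concatenate(ys)
```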
There have already been some attempts to adapt deep learning algorithms for transcriptome-based supervised learning; they fall into three categories:
- feedforward neural network (FNN)
- 2-dimensional convolutional neural network (2D CNN)
- graph convolutional network (GCN)
The FNN is arguably the most straightforward way to use deep learning for transcriptome-based supervised learning: simply take all expression values at the gene or transcript level as the input layer and optionally add a few fully-connected hidden layers. It may have advantages over traditional machine learning techniques because, by the universal approximation theorem, an FNN can in principle approximate essentially any function [7].
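As a minimal sketch (in PyTorch, with hypothetical layer widths and a generic 20,000-gene input), such a model is just a stack of fully-connected layers on top of the raw expression vector:

```python
import torch.nn as nn

n_genes, n_classes = 20000, 2   # assumed input size and number of phenotype classes

# Minimal FNN sketch: whole transcriptome in, phenotype logits out.
fnn = nn.Sequential(
    nn.Linear(n_genes, 512),    # first fully-connected hidden layer
    nn.ReLU(),
    nn.Dropout(0.5),            # some regularization against overfitting
    nn.Linear(512, 64),         # second hidden layer
    nn.ReLU(),
    nn.Linear(64, n_classes),   # output logits for the phenotype classes
)
```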
Overfitting is probably the major issue for transcriptome-based FNN models. A direct application of a multi-layer FNN to distinguishing cancer samples from controls did not outperform LASSO, a simpler linear model [8]. Reducing the number of input variables may help, e.g. using only differentially expressed genes [8, 9] or the “landmark genes” defined in the LINCS project [10], or aggregating genes to the pathway level [10]. Another study reduced the dimensionality by using an autoencoder to transform transcriptomic data into a lower-dimensional representation and then applied an FNN (the encoder part) and other methods for supervised learning [11]; this strategy worked well in a breast cancer detection task. Instead of reducing the input size, a creative idea, the graph-embedded deep feedforward network (GEDFN), modifies the fully-connected FNN architecture into a sparsely-connected one [12]. In GEDFN, each neuron in the first hidden layer represents a gene, mirroring the input layer, and a neuron in the input layer connects to a neuron in the first hidden layer only if the corresponding genes have a known gene-gene interaction (GGI). (The GGI relationships can be represented by a graph in which genes are nodes and their functional interactions are edges. Such a graph is usually called a gene network, e.g. a gene regulatory network, protein-protein interaction network, or co-expression network, and it can be retrieved from existing databases or inferred from a variety of data sources.) GEDFN demonstrated good performance both in simulation and in predicting estrogen receptor status in breast cancer [12].
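The core of the GEDFN idea can be sketched as a linear layer whose weight matrix is masked by the gene-network adjacency matrix. The code below is my own illustration in PyTorch under that assumption, not the authors' implementation; the adjacency matrix is assumed to be a dense 0/1 torch tensor.

```python
import torch
import torch.nn as nn

class GraphMaskedLinear(nn.Module):
    """Sparsely-connected layer in the spirit of GEDFN: input gene i feeds
    hidden neuron j only if adjacency[i, j] is nonzero (or i == j)."""

    def __init__(self, adjacency):
        super().__init__()
        n_genes = adjacency.shape[0]
        mask = adjacency.clone().float()
        mask.fill_diagonal_(1.0)            # keep each gene's self-connection
        self.register_buffer("mask", mask)  # fixed mask, not trained
        self.weight = nn.Parameter(0.01 * torch.randn(n_genes, n_genes))
        self.bias = nn.Parameter(torch.zeros(n_genes))

    def forward(self, x):
        # zero out weights for gene pairs absent from the gene network
        return x @ (self.weight * self.mask) + self.bias
```

A full model would then stack ordinary fully-connected layers and a classifier on top of this masked layer.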
CNNs, on the other hand, already build in sparse interactions and parameter sharing, so complex dimension reduction procedures may not be necessary. The CNN was the architecture that sparked the deep learning breakthroughs [13], and its ability to learn local stationary structures and compose them into multi-scale hierarchical patterns could be leveraged for transcriptomic data analysis.
Specifically, genotype information propagates to the phenotype level through a rich hierarchy of biological subsystems, which can be represented by hierarchical pathways or a hierarchical, scale-free gene network [14]. In addition, genes in the same local pathway or network module often have similar expression patterns. The critical question for applying CNNs to transcriptome-based supervised learning is how to organize the genes so that they are locally and hierarchically connected, as pixels are in images.
Existing attempts to use CNNs for transcriptome-based supervised learning are limited to 2D CNNs, which convert the transcriptome into image-like data based on either the genes' chromosomal positions [15] or KEGG pathways [16]. It makes little sense to apply a 2D CNN to genes linearly ordered by chromosomal position, which is essentially a 1D sequence. OmicsMapNet, in contrast, assigned genes to a 2D treemap with a four-layer hierarchy based on KEGG pathways and showed promising results on brain cancer subtype identification [16].
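To make the reshaping idea concrete, here is a minimal sketch (not OmicsMapNet's actual architecture) that treats a gene expression vector, already assigned to fixed pixel positions by some ordering such as a pathway treemap, as a one-channel image and applies a small 2D CNN; all sizes are assumptions.

```python
import torch
import torch.nn as nn

side = 142        # 142 x 142 = 20,164 pixels, enough to hold ~20,000 genes
n_classes = 3     # hypothetical number of tumor subtypes

# Assumes each sample has already been mapped to a (1, side, side) "gene image";
# the mapping itself (treemap, chromosome order, ...) is the critical design choice.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling over the gene image
    nn.Flatten(),
    nn.Linear(32, n_classes),
)

x = torch.randn(8, 1, side, side)   # toy batch of 8 gene-image samples
logits = cnn(x)                     # shape: (8, n_classes)
```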
The third idea, the GCN, performs convolution and pooling directly on a graph [17], in this case a gene network.
Besides gene networks, many other types of data can be represented as graphs, e.g. chemical structures, social networks, the world wide web, and knowledge graphs, and GCNs are an emerging area in the machine learning community. In principle, any GCN-based supervised learning algorithm could be applied to transcriptomic data, but only one study has explored the feasibility [18]. The authors concluded that GCNs are useful for data with a small number of input features, but that using all genes did not work well; moreover, performance depended heavily on the quality of the graph. I also tried a couple of GCN algorithms, but none of them performed better than random guessing (results not shown). Nevertheless, GCN algorithms are still developing rapidly and have shown promising results in other fields, so they are worth pursuing further.
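For reference, a single graph convolution of the common Kipf-Welling form can be sketched as below. This is a generic layer operating on one sample's gene network, with each gene carrying its expression value as a node feature (in_dim = 1); it is not the architecture used in [18].

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One Kipf-Welling-style graph convolution:
    H_out = ReLU(A_norm @ H_in @ W), with A_norm the symmetrically
    normalized gene-network adjacency (self-loops added)."""

    def __init__(self, adjacency, in_dim, out_dim):
        super().__init__()
        a = adjacency.float() + torch.eye(adjacency.shape[0])   # add self-loops
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h):
        # h: (n_genes, in_dim) node features, e.g. each gene's expression value
        return torch.relu(self.a_norm @ self.linear(h))
```

Stacking a couple of such layers and pooling over the nodes would give a graph-level representation to feed into a classifier.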
In sum, there have been some explorations in this field, but I have not yet found any significant or convincing results. At the same time, many ideas remain untested, or at least unpublished. I am also validating some of my own ideas, and I believe breakthroughs will come soon (though not necessarily from me).
Another important goal is to extract the relevant genes from the model for drug target and biomarker discovery. Interpretability is an active research topic in deep learning, and we can borrow many ideas from that cutting-edge work, e.g. saliency maps.
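As one example, a gradient-based saliency map, assuming a differentiable classifier like the FNN sketch above, scores each gene by how strongly its expression value influences the predicted class logit; the function name and arguments here are illustrative.

```python
import torch

def gene_saliency(model, x, target_class):
    """Score each gene by |d logit / d expression| for one sample.

    x: a (1, n_genes) expression profile; returns a (n_genes,) tensor of scores.
    """
    x = x.clone().detach().requires_grad_(True)   # track gradients w.r.t. the input
    logits = model(x)
    logits[0, target_class].backward()            # gradient of the chosen class logit
    return x.grad.abs().squeeze(0)
```

Genes with the largest saliency values would be candidate targets or biomarkers, though such attributions still need careful biological validation.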
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015, doi: 10.1038/nature14539.
[2] T. Ching et al., “Opportunities and obstacles for deep learning in biology and medicine,” Journal of the Royal Society Interface, vol. 15, no. 141, p. 20170387, 2018.
[3] J. De Fauw et al., “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature Medicine, vol. 24, no. 9, p. 1342, 2018.
[4] R. Poplin et al., “A universal SNP and small-indel variant caller using deep neural networks,” Nature Biotechnology, vol. 36, p. 983, 2018, doi: 10.1038/nbt.4235.
[5] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, p. 831, 2015, doi: 10.1038/nbt.3300.
[6] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, 2015: Springer, pp. 234–241.
[7] B. C. Csáji, “Approximation with artificial neural networks,” Faculty of Sciences, Eötvös Loránd University, Hungary, vol. 24, p. 48, 2001.
[8] D. Urda, J. Montes-Torres, F. Moreno, L. Franco, and J. M. Jerez, “Deep learning to analyze RNA-Seq gene expression data,” in International Work-Conference on Artificial Neural Networks, 2017: Springer, pp. 50–59.
[9] K. K. Wong, R. Rostomily, and S. T. Wong, “Prognostic Gene Discovery in Glioblastoma Patients using Deep Learning,” Cancers, vol. 11, no. 1, p. 53, 2019.
[10] A. Aliper, S. Plis, A. Artemov, A. Ulloa, P. Mamoshina, and A. Zhavoronkov, “Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data,” Molecular pharmaceutics, vol. 13, no. 7, pp. 2524–2530, 2016.
[11] P. Danaee, R. Ghaeini, and D. A. Hendrix, “A deep learning approach for cancer detection and relevant gene identification,” in Pacific Symposium on Biocomputing 2017, 2017: World Scientific, pp. 219–229.
[12] Y. Kong and T. Yu, “A graph-embedded deep feedforward network for disease outcome classification and feature selection using gene expression data,” Bioinformatics, vol. 34, no. 21, pp. 3727–3737, 2018.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[14] A.-L. Barabási and Z. N. Oltvai, “Network biology: understanding the cell’s functional organization,” Nature Reviews Genetics, vol. 5, no. 2, p. 101, 2004.
[15] B. Lyu and A. Haque, “Deep learning based tumor type classification using gene expression data,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2018: ACM, pp. 89–96.
[16] S. Ma and Z. Zhang, “OmicsMapNet: Transforming omics data to take advantage of Deep Convolutional Neural Network for discovery,” arXiv preprint arXiv:1804.05283, 2018.
[17] M. Henaff, J. Bruna, and Y. LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
[18] F. Dutil, J. P. Cohen, M. Weiss, G. Derevyanko, and Y. Bengio, “Towards gene expression convolutions using gene interaction graphs,” arXiv preprint arXiv:1806.06975, 2018.