Autoencoder in biology — review and perspectives
The autoencoder is a widely used deep learning architecture. A good introduction can be found here. In this post, I would like to share my perspectives on the application of autoencoders in biology.
Autoencoders have many connections to biology. One example is the L1000 platform developed in the Connectivity Map project [1], in which 1,000 landmark genes are sufficient to recover 81% of the information in the full transcriptome, which significantly lowers the cost of transcriptome profiling. These 1,000 genes can be loosely considered dimension-reduced latent variables. A more direct application of autoencoders to transcriptomics [2] used an autoencoder with 30 latent variables and found that the latent variables are connected to different biological aspects of the cell, such as different pathways or biological processes. These findings hint at the possible existence of some simple but powerful rules that govern all biological processes. Autoencoders may be a good way to approximate these hidden rules ever more closely.
There have been many other use cases of autoencoders in biology. An autoencoder can be used as a dimension reduction method for unsupervised clustering and visualization, similar to principal component analysis (PCA). For example, the latent variables from a variational autoencoder (VAE) trained on TCGA pan-cancer RNA-seq data preserved well-known relationships across cancer types and subtypes [3]. More and more applications have appeared in single-cell RNA-seq data analysis, perhaps because of the large sample size within a single dataset. One early study found that the VAE had superior performance and broader compatibility compared to other dimension reduction and visualization methods [4].
However, single-cell RNA-seq data often have frequent dropout events and substantial batch effects. Recent studies have focused on modifying the autoencoder to address these two challenges. For example, one trick is to make the output layer parameterize a negative binomial (NB) model with a dropout rate and to define the reconstruction loss as the negative log-likelihood of the input counts under that model, instead of reconstructing the input data directly [5]. Another work used a recurrent autoencoder to iteratively impute zero-valued entries of the input layer [6]. To correct the batch effect, one proposed idea is to add another loss term that makes sure samples cluster by biological relevance rather than by batch [7]. A manuscript published yesterday (today is Oct 05, 2019) used a hierarchical zero-inflated Poisson mixed model to perform data normalization, dropout imputation, and batch effect correction simultaneously [8]. The model was estimated by expectation maximization (EM), but an autoencoder should be able to do the same job, since it has been pointed out that EM = VAE. It would be interesting to compare the performance of the autoencoder with that of EM.
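To make the "NB model with a dropout rate" idea more concrete, here is a minimal sketch of a zero-inflated negative binomial loss for a single count, written in plain Python. It is only an illustration under my own assumptions: the function names and the mean/dispersion parameterization (`mu`, `theta`, `pi`) are chosen here for readability and are not taken from the code of [5]. In a real model these three values would be produced per gene by the decoder, and the loss would be summed over genes and cells.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated NB.

    pi (0 <= pi < 1) is the dropout probability: with probability pi
    the count is forced to zero, otherwise it is drawn from the NB.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        likelihood = pi + (1.0 - pi) * nb   # a zero can come from either source
    else:
        likelihood = (1.0 - pi) * nb        # a non-zero count must come from the NB
    return -math.log(likelihood)

# Hypothetical decoder outputs for three genes of one cell (made-up numbers).
counts = [0, 3, 12]
mus    = [0.5, 2.0, 10.0]
thetas = [1.0, 1.5, 2.0]
pis    = [0.6, 0.1, 0.05]

# The reconstruction loss replaces the usual squared error.
loss = sum(zinb_nll(x, m, t, p) for x, m, t, p in zip(counts, mus, thetas, pis))
```

The point of the design is that an observed zero is explained partly by dropout and partly by low expression, so the network is not pushed to reproduce technical zeros exactly.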
Autoencoders can also be used for supervised learning, similar to principal component regression (PCR). The latent space with a reduced dimension may preserve most of the biological information, and supervised learning on it is easier than on all the genes. I mentioned one example, used for breast cancer detection, in my previous post. Another advantage is that the autoencoder is essentially an unsupervised method, so well-annotated data and well-designed experiments are not required. Potentially, we can learn the latent representation of a biological system from large, public omics data in an unsupervised way, and then use transfer learning (reusing the encoder part) to build a specific supervised model for a small dataset. However, two studies that combined transfer learning with autoencoders did not report very promising results [9, 10], whereas a similar transfer learning study based on matrix factorization (not an autoencoder) showed some good results [11]. The authors of that study pointed out that an autoencoder should be able to achieve similar results. More work is necessary to fully explore the potential.
Another promising use case is the generative model. A recent “kinda” breakthrough using deep learning in drug discovery is the rapid identification of DDR1 kinase inhibitors [12]. The most important part of the workflow is using a VAE to learn “a mapping of chemical space, a set of discrete molecular graphs, to a continuous space of 50 dimensions”. The decoder, together with the multidimensional Gaussian distribution in the latent space, was able to generate a large number of novel chemical structures. The follow-up reinforcement learning then discovered new compounds with the desired chemical features.
Another VAE-based generative model was able to predict single-cell perturbation responses in silico [13]. The hypothesis is that the perturbation effect is the same for all cell types in the latent space (which may not be true). So if some cell types lack a perturbation experiment, simple vector arithmetic in the latent space gives the perturbation effect, and the decoder can then generate the perturbed RNA-seq samples for those cell types.
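The latent vector arithmetic described above can be sketched in a few lines of plain Python. This is my own toy illustration, not the implementation from [13]: the function names are hypothetical, the 2-D latent codes are made up, and the encoder and decoder steps are left out (the inputs are assumed to be already-encoded cells, and the output would still need to be passed through the decoder).

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length latent vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def predict_perturbed_latent(ctrl_a, pert_a, ctrl_b):
    """Shift the control cells of type B by the perturbation vector
    estimated from type A (where both conditions were measured)."""
    mean_ctrl = mean_vector(ctrl_a)
    mean_pert = mean_vector(pert_a)
    # Perturbation effect in latent space, assumed shared across cell types.
    delta = [p - c for p, c in zip(mean_pert, mean_ctrl)]
    return [[z + d for z, d in zip(cell, delta)] for cell in ctrl_b]

# Toy latent codes (made-up numbers): two cells per group.
ctrl_a = [[0.0, 1.0], [2.0, 1.0]]   # cell type A, control
pert_a = [[1.0, 3.0], [3.0, 3.0]]   # cell type A, perturbed
ctrl_b = [[5.0, 5.0], [6.0, 4.0]]   # cell type B, control only

pred_b = predict_perturbed_latent(ctrl_a, pert_a, ctrl_b)
print(pred_b)  # [[6.0, 7.0], [7.0, 6.0]]
```

Everything here rests on the shared-effect hypothesis: if cell type B responds differently from cell type A, the shifted codes will decode to the wrong expression profile.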
As seen in many of the above examples, the variational autoencoder (VAE) holds great potential in biological applications. In a VAE, each sample is encoded as a probability distribution over the latent space instead of a single vector as in the traditional autoencoder. A probability distribution is better suited to modeling complex and dynamic biological systems. Besides dimension reduction, the probability distribution in the latent space can be used to generate novel samples, as shown in the DDR1 kinase inhibitor example. In addition, we can use the distribution to interpolate samples between different conditions, e.g. from healthy to mild disease symptoms, then to a severe disease stage. (I haven't seen any applications yet, but it is very interesting to explore.) The idea of the VAE actually came from the variational Bayesian method [14]. It is a great example of how combining statistics and machine learning can bring breakthroughs.
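For readers who want to see what "a distribution in the latent space" means in practice, here is a small plain-Python sketch of the three operations mentioned above, assuming the usual diagonal Gaussian encoder from [14]: sampling a latent vector with the reparameterization trick, the closed-form KL penalty against a standard normal prior, and linear interpolation between two latent points. The function names are my own, and the healthy/severe codes are made-up placeholders rather than real data.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def interpolate(z_start, z_end, steps):
    """Evenly spaced latent points from z_start to z_end (steps >= 2)."""
    return [[a + (i / (steps - 1)) * (b - a) for a, b in zip(z_start, z_end)]
            for i in range(steps)]

rng = random.Random(0)

# Hypothetical encoder output for one sample: mean and log-variance.
mu, log_var = [0.5, -1.0], [0.0, -2.0]
z = sample_latent(mu, log_var, rng)        # one stochastic latent code
penalty = kl_to_standard_normal(mu, log_var)

# Hypothetical mean codes for a "healthy" and a "severe" sample.
path = interpolate([0.0, 0.0], [2.0, 4.0], steps=3)
print(path)  # [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
```

Each point on `path` would be passed through the decoder to obtain an expression profile for an intermediate state. Whether such decoded profiles are biologically meaningful is exactly the open question raised above.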
Furthermore, I have a great interest in using autoencoders for gene network construction. After training the autoencoder, it is feasible to estimate the effect of one gene on another by computing the gradient of one gene in the output layer with respect to another gene in the input layer. The combination of an artificial neural network and a gene network should be able to increase the accuracy of gene networks built from high-dimensional and noisy data, tackle the interpretability issue in deep learning for biology, and ultimately help us understand the “rules” in biology. Using an autoencoder to construct a gene network has several advantages. First, it can learn complex non-linear relationships between genes. Second, it can learn all gene-gene interactions simultaneously and take all other genes into consideration when computing the relationship between one pair of genes (unlike a common co-expression network, which ignores the other genes). Third, training the model with mini-batch gradient descent does not need more memory per training step when more data are available, which allows us to leverage the huge number of public datasets. This is not the case for many other gene network construction methods.
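The gradient idea can be illustrated with a deliberately tiny autoencoder in plain Python. This is only a sketch under strong assumptions: one tanh hidden layer, no biases, three "genes", and hand-picked weights that stand in for a trained model. For this architecture the Jacobian has a simple closed form; with a real network one would let an automatic differentiation framework compute it instead.

```python
import math

def forward(x, W_enc, W_dec):
    """Tiny autoencoder: h = tanh(W_enc x), y = W_dec h (biases omitted)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_enc]
    y = [sum(w * hi for w, hi in zip(row, h)) for row in W_dec]
    return h, y

def gene_gene_jacobian(x, W_enc, W_dec):
    """J[i][j] = d y_i / d x_j: sensitivity of output gene i to input gene j,
    evaluated at the expression profile x.

    Chain rule: J[i][j] = sum_k W_dec[i][k] * (1 - h_k^2) * W_enc[k][j].
    """
    h, _ = forward(x, W_enc, W_dec)
    return [
        [sum(W_dec[i][k] * (1.0 - h[k] ** 2) * W_enc[k][j]
             for k in range(len(h)))
         for j in range(len(x))]
        for i in range(len(W_dec))
    ]

# Hypothetical weights for 3 genes and 2 latent units (not a trained model).
W_enc = [[0.5, -0.2, 0.1],
         [0.3, 0.8, -0.5]]
W_dec = [[1.0, 0.0],
         [0.2, -0.7],
         [-0.4, 0.6]]
x = [0.1, -0.3, 0.5]            # one (scaled) expression profile

J = gene_gene_jacobian(x, W_enc, W_dec)
# J[i][j] is a candidate edge weight from gene j to gene i at this sample.
```

Note that the Jacobian depends on the input `x`, so a network built this way is sample-specific. Averaging `J` over many samples is one possible way to get a global network, but that is my assumption and not an established recipe.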
References:
[1] Subramanian, Aravind, et al. “A next generation connectivity map: L1000 platform and the first 1,000,000 profiles.” Cell 171.6 (2017): 1437–1452.
[2] Abdolhosseini, Farzad, et al. “Cell Identity Codes: Understanding Cell Identity from Gene Expression Profiles using Deep Neural Networks.” Scientific Reports 9.1 (2019): 2342.
[3] Way, Gregory P., and Casey S. Greene. “Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders.” BioRxiv (2017): 174474.
[4] Wang, Dongfang, and Jin Gu. “VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder.” Genomics, proteomics & bioinformatics 16.5 (2018): 320–331.
[5] Eraslan, Gökcen, et al. “Single-cell RNA-seq denoising using a deep count autoencoder.” Nature communications 10.1 (2019): 390.
[6] Deng, Yue, et al. “Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning.” Nature methods 16.4 (2019): 311.
[7] Wang, Tongxin, et al. “BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes.” Genome biology 20.1 (2019): 1–15.
[8] Zhang, Yiliang, et al. “SCRIBE: a new approach to dropout imputation and batch effects correction for single-cell RNA-seq data.” BioRxiv (2019): 793463.
[9] López-García, Guillermo, et al. “A transfer-learning approach to feature extraction from cancer transcriptomes with deep autoencoders.” International Work-Conference on Artificial Neural Networks. Springer, Cham, 2019.
[10] Smith, Aaron M., et al. “Deep learning of representations for transcriptomics-based phenotype prediction.” BioRxiv (2019): 574723.
[11] Taroni, Jaclyn N., et al. “MultiPLIER: a transfer learning framework for transcriptomics reveals systemic features of rare disease.” Cell systems 8.5 (2019): 380–394.
[12] Zhavoronkov, Alex, et al. “Deep learning enables rapid identification of potent DDR1 kinase inhibitors.” Nature biotechnology 37.9 (2019): 1038–1040.
[13] Lotfollahi, Mohammad, F. Alexander Wolf, and Fabian J. Theis. “scGen predicts single-cell perturbation responses.” Nature methods 16.8 (2019): 715.
[14] Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).