A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets

Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of distances. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods.

A Novel Deep-Learning Based Approach to Identify, Localize, and Grade Malignant Prostate Lesions

Prostate cancer (PCa) is one of the most widespread and deadly cancers among American men. Currently, patients who are suspected of having PCa are recommended to undergo a biopsy, a mpMRI scan, or a MRI scan. These scans are often not reliable and can have a low detection rate. Furthermore, pathologists almost always have trouble grading the aggressiveness of the prostate tumor. These tumors are classified by the Gleason Grade scale or the ISUP Group Grade scale, which ranks the aggressiveness of the tumor on a scale from 1 to 5 where higher numbers represent a more dangerous tumor. To help identify, segment, and grade prostate tumors, recent studies point to the use of machine learning or artificial intelligence. We present a novel method to both grade and localize prostate tumors in multiparametric magnetic resonance images (mpMRI). Using a convolutional neural network (CNN) to identify mpMRI images that contained the tumors, a Faster region-based convolutional neural network (Faster RCNN) to segment the tumors, and a CNN to grade the tumors by differentiating tumors into high-grade and low-grade tumors, we present a novel method to localize and grade the tumors that performs at least as well as current machine learning techniques while using faster networks and requiring less data. Specifically, we used deep learning to segment the tumors, whereas previous studies had used machine learning, which is more computationally expensive than our method. Our CNN that identified whether mpMRI images had tumors or not had an accuracy of 98.7%. The Faster RCNN used had a sensitivity of 0.972, a false positive rate of 0.263, and an area under curve (AUC) of 0.9007. The CNN used to grade the tumors had an accuracy of 97.6%, which is similar to other machine learning models. In conclusion, these three models provide a novel approach to classifying and grading prostate tumors while requiring less data.

Time series Gene Network Analyses of Explosive Breaching in U.S. Military Warfighters

Background: Injuries from exposure to blast explosions rose dramatically during the Iraq and Afghanistan wars due to increase use of IEDs resulting in blast-related neurotrauma. To investigate the effect of blast on gene regulatory networks, we use timeseries gene expression data from military “breachers” exposed to controlled, low-level blast explosives during training.
Methods: Blood samples were collected from male participants (age 30.2 ± 7.4 years) during a 3-day breacher training: baseline (day1), pre- and-post- breaching (day2), and follow-up (day3) at a U.S. Army training site. Blood samples were obtained and RNA-seq was performed for every time point. RNA-seq read counts were transformed to logCPM by fitting a linear model with the variable of interest (i.e., pre-post breaching operations) and adjusted for subject effect using limma. Potential batch effects was eliminated using surrogate variable analysis. To investigate whether exposure to explosive blast affects gene expression in a coordinated fashion, network analysis was performed via multiscale embedded gene co-expression network analysis (MEGENA) using variably expressed genes (sd >= 0.25) across all subjects/ time. Principal component (PC) analysis was performed on each network using a linear model and fitted on the 1st PC across timepoints for pre-post and pre-follow-up comparisons separately.
Results & Conclusions. To investigate whether prior exposure load including history of traumatic brain injury (TBI) or cumulative career breaching in military service impacts responsivity to an explosion acutely, interaction models including prior exposure load by time (pre-post breaching) analyses were performed. Compared to those with no history of TBI, those with prior TBI history showed blunted acute gene expression responsivity (i.e., no significant alterations) following exposure to blast explosives. Network analyses identified 5 unnested-networks with significant interaction between TBI/Breaching history and time (pre-post/follow-up, p<0.05) that included genes involved in inflammation and adaptive immunity. Semaphorin 7A (a hub gene within one of these networks) is a membrane associated protein that plays an important role in mediating the connection between the CNS and the immune system. Emerging data from both experimental and human studies have shown that a number of Sema loci are involved in response to CNS injury. Although SEMA7A has not been previously implicated in CNS injury, it shows increased expression following blast sub-acutely in subjects with prior history of TBI. These findings show the power of network approaches to identify transcriptional alterations directly in response to exposure to explosives, which is translationally significant in furthering our understanding of blast-related neurotrauma.

A mathematical model of the molecular mechanism of PZA in Mycobacterium tuberculosis

Pyrazinamide (PZA) is one of the most important drugs used in first and second-line treatments against tuberculosis to eliminate bacteria in latent state, reducing relapse incidence. Pyrazinoic acid (POA) is the active form of PZA, generated by hydrolysis of PZA mediated by the enzyme pyrazinamidase (PZase), encoded by the gene pncA. Its antibiotic effect requires an acidic environment, discovered by the high in vivo and poor in vitro sterilizing activity, potentially due the acidic environment experienced by M. tuberculosis inside granulomes.
The molecular targets of this drug are still unknown, and are thought to be more than one. The current mechanism, based on POA accumulation, states that PZA enters the cell through passive diffusion, is hydrolyzed to POA (mediated by PZase), and expelled to the extracellular environment by efflux pumps. Outside, expelled POA is protonated to HPOA in the acidic environment, and re-enters the bacteria by a membrane potential gradient. In this way, the external acidic environment helps to recover and accumulate intracellular POA, setting up a cycle that leads to a slight acidification of the cytoplasm, which together with the intracellular accumulation of POA act on several potential targets and metabolic pathways. However, cytoplasmic acidification is not strictly required. Previous studies have shown that overexpression of pncA or efflux pump inhibitors eliminate the requirement of an acidic environment for susceptibility in vitro.
To better understand the contribution of an acidic extracellular environment to the internal accumulation of POA, and to study alternative scenarios of resistance/susceptibility, we modelled the PZA/POA metabolic pathway using a system of nonlinear differential equations, fitting experimental parameters. Our results indicate that, in equilibrium (laboratory) conditions, POA accumulation is independent of external pH and only depends on the ratio between the rates of POA production and efflux. In addition, the acidic environment has a significant contribution in the internal accumulation of total-POA (defined as POA+HPOA) only when the ratio efflux-rate/diffusion of external POA and HPOA is greater than 1. Interestingly, the rate of POA production could help to increase the total-POA accumulation independently of the POA efflux rate and external pH. In addition, although low POA efflux rates accumulate more internal total-POA than an acidic environment, the cytoplasm will not be acidified, suggesting that a reduction of internal pH is not sufficient to produce the bactericidal effect. Finally, we performed simulations to evaluate possible consequences in growth rate under different PZA concentrations.

scFEA: A graph neural network model to estimate cell-wise metabolic using single cell RNA-seq data

The metabolic heterogeneity, and metabolic interplay between cells and their microenvironment have been known as significant contributors to disease treatment resistance. Our understanding of the intra-tissue metabolic heterogeneity and cooperation phenomena among cell populations is unfortunately quite limited, without a mature single cell metabolomics technology. To mitigate this knowledge gap, we developed a novel computational method, namely scFEA (single cell Flux Estimation Analysis), to infer single cell fluxome from single cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a comprehensively reorganized human metabolic map as focused metabolic modules, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multi-layer neural networks to fully capitulate the non-linear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq dataset with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this dataset demonstrated the consistency between predicted flux and metabolic imbalance with the observed variation of metabolites in the matched metabolomics data. We also applied scFEA on publicly available single cell melanoma and head and neck cancer datasets, and discovered different metabolic landscapes between cancer and stromal cells. The cell-wise fluxome predicted by scFEA empowers a series of downstream analysis including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regards to their impact on the whole metabolic flux, and inference of cell-tissue and cell-cell metabolic communications.

A molecular taxonomy of tumors independent of tissue-of-origin

Cancer is an organism-level disease, impacting processes from cellular metabolism and the microenvironment to systemic immune response. Nevertheless, efforts to distinguish overarching mutational processes from interactions with the cell of origin for a tumor have seen limited success, presenting a barrier to individualized medicine. Here we present a novel, pathway-centric approach, extracting somatic mutational profiles within and between tissues, largely orthogonal to cell of origin, mutational burden, or stage. Known predisposition variants are equally distributed among clusters, and largely independent of molecular subtype. Prognosis and risk of death vary jointly by cancer type and cluster. Analysis of metastatic tumors reveals that differences are largely cluster-specific and complementary, implicating convergent mechanisms that combine familiar driver genes with diverse low-frequency lesions in tumor-promoting pathways, ultimately producing distinct molecular phenotypes. The results shed new light on the interplay between organism-level dysfunction and tissue-specific lesions.

Latent representation of Hi-C intra-chromosomal interactions using Hi-C-LSTM

The organization of the genome in 3D space inside the nucleus plays an important role in general functional characteristics of the genome. Chromosome Conformation Capture (3C) and Hi-C techniques, have enabled us to quantify the strength of interactions between loci that are nearby in space. A Hi-C assay outputs a matrix of interaction strengths for every pair of genomic positions. This huge amount of data requires the development of computational techniques to discover hidden relationships between genome structure and function.

Representation learning methods provide a way to understand 3D organization. These methods assign a low-dimensional vector of features to each genomic position that summarizes the 3D organization properties of that position. Representations learnt from Hi-C datasets can serve multiple purposes. They can capture the existing elements that drive 3D conformation, identify spatially clustering genomic regions, and identify relationships between 3D structure and functional phenomena. Several existing methods for Hi-C representation learning have been developed recently, including SNIPER and SCI. However, these existing methods do not take into account the linear structure of the genome, which hampers their performance.

In this work, we propose a method, Hi-C-LSTM, that produces low-dimensional representations of the Hi-C intra-chromosomal contacts, assigning a vector of features to each genomic position that represents that position's contact activity with all other positions in the given chromosome. We do this by using a deep long-short term memory (LSTM) recurrent neural network. This LSTM model takes the representations as input and outputs predicted Hi-C contact strength for each pair of positions. The key benefit of the LSTM is its sequential structure, which allows the model to take the linear structure of the genome into account.

We find that representation learning using an LSTM structure results in extremely effective representations according to several metrics. First, we find that Hi-C-LSTM captures more information from the input Hi-C matrix than existing methods. Second, we find that Hi-C-LSTM identifies which genomic loci drive 3D organization. Third, we find that Hi-C-LSTM can be used to perform in-silico experiments to evaluate the effect of removing or editing genomic loci.

VSS: Variance-stabilized signals for sequencing-based genomic signals

A sequencing-based genomic assay such as ChIP-seq outputs a real-valued signal for each position in the genome that measures the strength of activity at that position.
Read counts of genomic assays have a nonuniform mean-variance relationship, which poses a challenge to their analysis. For example, a locus with 1,000 reads in one experiment might get 1,100 reads in a replicate experiment by chance, whereas a locus with 100 reads might usually see no more than 110 reads by chance in a replicate. This property means that, for example, the difference in read count between biosamples is a poor measure of the difference in activity. To handle this issue, most statistical models of genomic signals such as those used in peak calling, model the mean-variance relationship of read counts explicitly using, for example, a negative binomial distribution. These statistical models can account for this pattern, but learning these models is computationally challenging. Therefore, many applications including imputation and segmentation and genome annotation (SAGA) instead use Gaussian models and use a transformation such as log or inverse hyperbolic sine (asinh) to stabilize variance.
In this study, we proposed VSS, a method that produces units for sequencing-based genomic signals that have the desirable property of variance stability. We found that the transformations that are currently used to stabilize variance like log(x+1) and asinh(x) do not fully do so. In fact, we found that the mean-variance relationship of genomic signals varies greatly between data sets, indicating that no single transformation can be applied to all data sets uniformly. Instead, variance stability requires a method such as VSS that empirically determines the experiment-specific mean-variance relationship.
We showed that VSS successfully stabilizes variance in genomic data sets. Further, we found that using variance-stabilized data improves the performance of Gaussian models such as SAGA.
Variance-stabilized signals will aid in all downstream applications of genomic signals. In particular, they are valuable for two reasons.
First, VSS signals allow downstream methods to use squared error loss or Gaussian likelihood distributions, which are much easier to optimize than the existing practice of implementing a model that accounts for the mean-variance relationship.
This will improve tasks that currently use Gaussian models, such as chromatin state annotation and imputation. Second, VSS signals can be easily analyzed by eye because the viewer does not need to take the mean-variance relationship into account when visually inspecting the data.

Continuous chromatin state feature annotation of the human epigenome

Sequencing-based genomics assays can measure many types of genomic biochemical activity, including transcription factor binding, chromatin accessibility, transcription, and histone modifications. Data from sequencing-based genomics assays is now available from hundreds of human cellular conditions, including varying tissues, individuals, disease states, and drug perturbations. Semi-automated genome annotation (SAGA) methods are widely used to understand genome activity and gene regulation. These algorithms take as input a collection of sequencing-based genomics data sets from a particular tissue. They output an annotation of the genome that assigns a label to each genomic position.

All existing SAGA methods output a discrete annotation that assigns a single label to each position. This discrete annotation strategy has several limitations. First, discrete annotations cannot represent the strength of genomic elements. Variation among genomic elements in intensity or frequency of activity of cells in the sample is captured in variation in the intensity of the associated marks. Such variation is lost if all such elements are assigned the same label. Second, a discrete annotation cannot represent combinatorial elements that simultaneously exhibit multiple types of activity. To model combinatorial activity, a discrete annotation must use a separate label to represent each pair (or triplet etc.) of activity types.

We propose a method that uses a Kalman filter state space model to efficiently annotate the genome with chromatin state features. That is, our method outputs a vector of real-valued chromatin state features for each genomic position, where each chromatin state feature putatively represents a different type of activity. Continuous chromatin state features have a number of benefits over discrete labels. First, these features preserve the underlying continuous nature of the input signal tracks. Second, in contrast to discrete labels, continuous features can easily capture the strength of a given element. Third, continuous features can easily handle positions with combinatorial activity by assigning a high weight to multiple features. Fourth, chromatin state features lend themselves to expressive visualizations because they project complex data sets onto a small number of dimensions.

We also propose several measures of the quality of a chromatin state feature annotation relative to genes and gene expression, and we compare epigenome-ssm to existing SAGA methods (ChromHMM, Segway) according to these quality measures. We found that epigenome-ssm produces the highest-quality annotations of the methods we compared. Particularly, an epigenome-ssm annotation is a better predictor of gene expression and enhancer activity.

Learning a genome-wide score of human-mouse conservation at the functional genomics level

Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we develop a computational method, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF). LECIF integrates thousands of human and mouse functional genomic annotations to learn a score of evidence for conservation at the functional genomics level. While LECIF leverages data from diverse sources, it does not require explicit matching of experiments across species by biological origin or type.

To do so, LECIF trains an ensemble of neural networks that take human and mouse regions aligning at the sequence level as positive pairs and randomly matched human and mouse regions that do not align to each other but somewhere else in the other species as negative pairs. Each human or mouse region is characterized by a species-specific feature vector with thousands of values corresponding to ChromHMM chromatin state annotations, signals from RNA-seq experiments, and peak calls from DNase-seq, transcription factor ChIP-seq, histone mark ChIP-seq, and CAGE experiments. We used the trained classifier used to provide a LECIF score for all aligning regions.

The LECIF score is highly predictive of pairs of regions that align at the sequence level, even when controlling for properties of regions that align in general. The score captures correspondence of biologically similar annotations between human and mouse, even though LECIF was not explicitly given such information. While the LECIF score is moderately correlated with sequence constraint scores, it captures distinct
information on conserved properties. The LECIF score is higher in regions previously shown to have similar phenotypic properties in human and mouse at the genetic and epigenetic level. We expect the human-mouse LECIF score will be an important a resource for studies using mouse as a model organism.