Gene Expression Analysis of Dexamethasone’s Treatment Effects on Epidermal Keratinocytes (i.e. skin cells) and A549 Cell Line (i.e. a lung cancer cell type) from the Human Body

Group Member: Jason Ye, TzuChuan Huang, Jialin Ma, Xiaotong Jiang


Glucocorticoid (GC) dexamethasone is known as a highly effective anti-inflammatory, immunosuppressant and a decongestant drug (Bordag et al, 2015). For decades, it has been applied to a variety of disease conditions, ranging from auto-immune diseases and allergic reactions to cancers (Bordag et al, 2015). It can, thus, immensely improve both prognosis and quality of life (Bordag et al, 2015). Unfortunately, glucocorticoid therapy can cause serious systemic side effects, especially in chronic and high dose applications, leading to decreased life expectancy and increased health care costs (Bordag et al, 2015).

Gene expression patterns provide us with insights on how drugs and cells interact. In this study, we chose two datasets dealing with epidermal keratinocytes and A549 cell line as our subjects of interest, and studied their gene expression profiles upon treatment by dexamethasone. By doing so, we hoped to unravel some side effects of dexamethasone.

A keratinocyte is the predominant cell type in the epidermis, the outermost skin layer, constituting 90% of the cells found there ( The primary function of keratinocytes is the formation of a barrier against damages, such as those from fungi and UV radiation ( A number of structural proteins (e.g. filaggrin, keratin), enzymes (e.g. proteases), lipids and antimicrobial peptides (e.g. defensins) work together to maintain the important barrier function of the skin ( 

A549 cells, as found in the lung tissue of their origin, are squamous and responsible for the diffusion of some substances, such as water and electrolytes, across alveoli ( A549 cells are widely used as a transfection host and as a type II pulmonary epithelial cell model for drug metabolism ( When grown for sufficient time in a cell culture, A549 cells may begin to differentiate, as signaled by the presence of multilamellar bodies (


  • To analyze the treatment effect of dexamethasone on epidermal keratinocytes and A549 cell line (adenocarcinomic human alveolar basal epithelial cells) by comparing differentially expressed genes.
  • To determine the effect of synthetic glucocorticoid (GC) dexamethasone on pathway regulation based on molecular response, as GC serves as a therapeutic agent for skin conditions.





A549 lung cancer cell line (RNA-seq, Reddy 2015):

Control samples: A549 cell line, untreated, 3 technical replicates

– Treatment samples: A549 cell line, treated with dexamethasone for 4 hours, 4 technical replicates

Epidermal keratinocytes (microarray, accession # GSE26487 in NCBI’s GEO database, Stojadinovic et al, 2011):

– Control sample: Keratinocytes, untreated for 24 hours, technical replicate #1

– Control sample: Keratinocytes, untreated for 24 hours, technical replicate #2

– Treatment sample: Keratinocytes, treated with dexamethasone for 24 hours, technical replicate #1

– Treatment sample: Keratinocytes, treated with dexamethasone for 24 hours, technical replicate #2

Data processing and quality control

During analysis of A549 cell line, we used transcript abundance quantification methods called Salmon (Patro et al. 2017) to estimate abundances without aligning reads, followed by the tximport package for assembling estimated count and offset matrices for use with Bioconductor differential gene expression packages. We then converted the data into DESeqDataSet class and removed the empty rows. For PCA plot and SD-mean plot, we used regularized log transformation to reduce the variance caused by lowly expressed genes.

Before analysis of epidermal keratinocytes, we created plots to assess the quality of the data. Boxplots show differences in probe intensity behavior between arrays. MA plots show to what extent the variability in expression depends on expression level. To remove such dependency, we conducted normalization on the data. The standard method for normalization is RMA. The RMA method proceeds with background subtraction, normalization and summarization using a deconvolution method for background correction and quantile normalization, and the RMA (robust multi-chip average) algorithm for summarization (Klaus and Reisenauer, 2018).

Differential Expression Analysis

For A549 cell line, differentially expressed genes were determined using the DESeq2 package.  False discovery rate threshold was set to 0.05, while log2 fold change threshold was set to 2.  Genes were considered differentially expressed if they met both the threshold for adjusted p-value and that for fold change.

For epidermal keratinocytes, differentially expressed genes were determined using the limma package. A linear model was fitted between the control and treatment groups to identify differentially expressed genes. Since data on epidermal keratinocytes used the same set of biological replicates in the control as in the treatment groups, sample bias was also limited in this way. Genes were considered differentially expressed if the p-value adjusted to multiple testing fell below 0.05.

Hierarchical Clustering

With distance between each pair of samples computed, we needed the hierarchical clustering algorithm to join these samples into groups. Heatmaps are ubiquitous in genomics literature. They are generally used to represent the varying levels of gene expression across samples. We created our heatmaps using an R package called pheatmap, which stands for “Pretty Heatmaps” (Kolde 2018). This package creates clustered heatmaps and offers control over heatmap appearance and dimensions.

Principle component analysis

Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data. It accomplishes this reduction by identifying directions, called principal components, along which the variance in data is maximal. We performed principal component analysis on genes differentially expressed in epidermal keratinocytes and A549 cell line treated by dexamethasone, with respect to the control cells. We used the prcomp method in R, extracted the first two principal components, and determined the amount of variance between the treatment and control groups, for both A549 cell line and epidermal keratinocytes.


Data cleaning and preprocessing

Prior to analyzing the datasets, we needed to ensure that our data was preprocessed to the same state, so that we could move on to the next steps without fear of bias. For Salmon’s quantified genes from A549 cell line, we first removed the empty rows. We found, from the result, that 12,360 empty rows existed, and by removing them, 37,361 rows remained. Afterwards, regularized log transformation was applied to reduce the variance on lowly expressed genes. We could now visualize the control and treatment representations upon applying the rlog transformation method. In Figure 2 (a), we clearly observed that the samples could be clustered by control and treatment conditions. PCA plot (Figure 2 (b)) of the data also shows that PC1 could easily separate our treatment and control groups, explaining 87% of the variance in data.


Figure 2.(a) Heatmap of distances between samples in A549 cell line; (b) PCA plot of control and treated samples in A549 cell line


In epidermal keratinocytes, we first obtained a heatmap representation of the sample-to-sample distance (data not shown). Upon finding that samples did not strongly cluster by control and treatment conditions, we normalized the data to make conditions more consistent for comparison among all samples. Figure 3 shows the heatmap of sample-to-sample distance after normalization, in which based on the colors shown, some distinctions in expression levels could now be seen between control samples and samples treated with dexamethasone.


Figure 3. Heatmap of sample-to-sample distance after normalization for epidermal keratinocytes

With normalization carried out, we noticed from the boxplot in Figure 4 that the mean and variance of our control and treatment groups became more consistent upon normalization. In Figure 5, PCA plot indicated that we could separate control from treatment groups along the second PC axis after normalization.


Figure 4. Boxplots before and after normalization for epidermal keratinocytes


Figure 5. (a) PCA plot of raw data for epidermal keratinocytes; (b) PCA plot of the normalized data


Differential Expression Analysis

A549 cell line:

We ran the differential expression pipeline on the A549 cell line and obtained log2fold (LFC) changes and p-values for the gene list. We set the adjusted p-value cutoff to 0.05. Now, the output result in Table 1 showed that at a false discovery rate (FDR) of 5%, 22% of the genes were up-regulated, while 21% were down-regulated, out of 25,001 genes.

Condition Read count Percentage
LFC > 0 (up-regulated) 5536 22%
LFC < 0 (down-regulated) 5371 21%
Outliers 54 0.22%
Low counts 5817 23%

Table 1. Gene overview at a false discovery rate (FDR) of 5% for A549 cell line

We examined the fold change distribution induced by treatment in an MA-plot and a histogram of the p-values (Figure 6). In the MA-plot, each gene is represented as a dot, and specific genes with p-values under 0.05 are shown in red. Also, the histogram is consistent with Table 1 in that more than 10,000 genes exhibited significant change, with p-values less than 0.05, after treatment with dexamethasone under the p-value test.


Figure 6. (a) MA plot of mean of normalized count against fold change in A549 cell line; (b) histogram of p-values in A549 cell line

Epidermal keratinocytes:

A large number of probes displayed expression intensities in the background range for the epidermal keratinocytes. Thus, without preprocessing, we could not easily detect real changes across the array. The first step involved removing the data with low variance in the background range of intensity. By doing so, we could filter the data that may contribute to false positive results. In Figure 7, the red bar denotes threshold value separating out the low and high intensities. If more than two samples of a gene had intensities higher than the threshold value, the gene would be kept, deemed to have higher than background intensities. We could then observe that 8,565 out of 12,625 genes had higher expression values than the threshold (Table 2).


Figure 7. Histogram of median intensities in epidermal keratinocytes

  Not selected Selected
Gene counts 4060 8565

Table 2. Genes selected for further differential expression analysis based on the threshold expression intensity


Identification of differentially expressed genes

In A549 cell line, based on previous observations, we selected the genes that had adjusted p-values below 0.05, and ordered them by their fold changes.

With at least hundreds of genes being differentially expressed, we were interested in finding the ones with relatively large fold changes. We selected the genes with absolute log2 fold changes higher than 2. Figure 8 shows a histogram of the log2 fold change distribution in genes with adjusted p-value lower than 0.05. Yellow bars denote genes with fold changes less than 2, which were thus filtered. After this filtering, 10,907 genes remained and were selected for further study of possible differential expression, out of what originally amounted to 25,001 genes.


Figure 8. Histogram of selected genes with log2 fold changes

Analysis of epidermal keratinocytes:

In limma package, we could apply t-test or ANOVA to identify differential expression in epidermal keratinocytes. In this package, we could fit a linear model to the expression data of each gene. Considering batch effects between replicates, we first created a matrix to remove them. Afterwards, we applied this linear model and computed moderated t–statistics by calling the eBayes() function. The eBayes() function would compute a t-statistic and a corresponding p-value for each gene.

Upon fitting the model, we visualized the distribution of p-values and log2 fold changes in a volcano plot, Figure 9 (a), to determine model quality. Figure 9 (b) shows a histogram of p-value distribution — the peak near zero indicates an abundance of p-values smaller than 0.05, corresponding to genes likely to be differentially expressed from dexamethasone treatment. Due to a large number of genes having p-values within 0.05, we increased our stringency in avoiding false positive results by setting the adjusted p-value cutoff to 0.05, without setting a cutoff for either the fold change or the true p-value.


Figure 9. (a) Distribution of fold changes and p-values from epidermal keratinocytes in a Volcano plot; (b) histogram of p-value distribution from epidermal keratinocytes

Table 3 below shows the number of differentially expressed genes with p-values within 0.001 and adjusted p-values within 0.05.

P value Insignificant Significant
0.05 (Adjusted p-value) 8,117 448
0.001 8,205 360

Gene clustering

A549 cell line:

We obtained a heatmap to show the hierarchical clustering of differentially expressed genes. Rather than using absolute expression strengths, we plotted the heatmap by deviation of each gene’s expression in a specific sample from the average across all samples. It is seen in Figure 10 (a) that control and treated samples were clearly separated, and the genes were divided into two clusters. Genes in the top cluster were mostly up-regulated genes, appearing in higher levels of expression with dexamethasone treatment than without. Genes in the bottom cluster were mostly down-regulated genes with lower expression levels after treatment. As well, we were interested in genes with largest expression changes after treatment with dexamethasone. Figure 10 (b) shows the top 30 genes that were likely most affected by dexamethasone treatment. These genes showed the highest variance across all samples, where most of these genes were up-regulated from control through treatment with dexamethasone.


Figure 10. (a) Heatmap of genes across all samples in A549 cell line; (b) heatmap of genes with the highest variance across all samples in A549 cell line

Epidermal keratinocytes:

For epidermal keratinocytes, we could roughly divide the genes into three clusters, in which control and treated samples were clearly separated. In Figure 11, we could also observe that the gene cluster near the top showed comparatively high expression levels, whether those genes fell into control or treatment groups. Gene cluster near the middle showed comparatively low expression levels upon dexamethasone treatment. Relatively speaking, gene cluster near the bottom showed a mixture of low and high expression levels from the treatment.


Figure 11. Heatmap of genes across all samples for epidermal keratinocytes

PCA plot

After gene selection, we analyzed the genes with PCA plots. Figure 12 (a) shows the selected genes in A549 cell line. It is clear that those genes can be separated into two parts by the second PC axis. Since we filtered out the genes with low fold changes, the second PC axis aligned with fold changes shown earlier in an MA plot, with the two clusters being up-regulated and down-regulated genes. As shown in Figure 12 (b), second PC axis readily separated the genes from epidermal keratinocytes into two clusters as well.


Figure 12. (a) PCA plot of genes in A549 cell line; (b) PCA plot of genes in epidermal keratinocytes

Pathway Analysis

Below is a diagram of the pathway clusters for differentially expressed genes in A549 cell line:


Below is a diagram of the pathway clusters for differentially expressed genes in epidermal keratinocytes:


Diagrams on pathway clusters shown above for our two datasets were drawn using the ReactomePA package from R (Yu and He, 2016).

Many of the differentially expressed genes in either of the two datasets fell either into cancer-related pathways or into pathways dealing with immune systems. As seen in the clusters of enriched pathways above, interleukin 4/13 (IL 4/13) signaling pathways are present in both the A549 lung cancer cell line and the epidermal keratinocytes. Literature searches on the two signaling pathways reveal that these pathways are related to tumor growth and immune response. This is consistent with our finding of enriched pathways being mostly those dealing with cancer and immune systems.

More Detailed Literature Survey

Finally, we searched further into literature on the potential effects of dexamethasone treatment on IL-4/13 signaling. IL-4/13 serve as important signaling molecules in human’s immune system. These molecules bear relations with adaptive immune response (Waters et al, 2018), inflammation, allergic response (Gour and Wills-Karp, 2015), and tumor growth (Kawakami et al, 2002).

Studies from Tomita et al. (1999) show that dexamethasone suppresses gene expression and production of IL-13 molecules by human mast cell line and lung mast cells. This agrees with our pathway analysis that also shows IL-13 signaling to be an enriched pathway for the A549 data dealing with lung cancer cells.

Discussion and Conclusion

From our findings, we came to the following conclusions:

  1. A549 lung cancer cell line showed far more significant response to dexamethasone treatment than epidermal keratinocytes, upon comparing the numbers of differentially expressed genes from dexamethasone treatment between the two datasets.
  2. Interleukin 4/13 (IL 4/13) signaling pathway was found to be enriched in both epidermal keratinocytes and A549 cell line.
  3. Both similarities and differences thus exist in how the two cell types respond to dexamethasone treatment, despite the two cell types themselves being highly different from each other.
  4. Dexamethasone can, hence, affect cells of widely different types in a human body.

Lists of Differentially Expressed Genes

Below is a list of top ten differentially expressed genes from epidermal keratinocytes, upon treatment with dexamethasone:

Gene ID Symbol Gene name log2(fold change) p-value Adjusted p-value
37989_at PTHLH parathyroid hormone like hormone -4.19 0.0e+00 0.00
38428_at MMP1 matrix metallopeptidase 1 -4.24 0.0e+00 0.00
40522_at GLUL glutamate-ammonia ligase 3.11 0.0e+00 0.00
36629_at TSC22D3 TSC22 domain family member 3 2.92 0.0e+00 0.00
41193_at DUSP6 dual specificity phosphatase 6 -2.83 0.0e+00 0.00
1081_at ODC1 ornithine decarboxylase 1 -3.47 0.0e+00 0.00
39528_at RRAD RRAD, Ras related glycolysis inhibitor and calcium channel regulator 2.50 0.0e+00 0.00
37701_at RGS2 regulator of G protein signaling 2 2.81 0.0e+00 0.00
39070_at FSCN1 fascin actin-bundling protein 1 -2.41 0.0e+00 0.00
37324_at TFRC transferrin receptor 2.71 0.0e+00 0.00

Below is a list of top ten differentially expressed genes from A549 cell line, upon treatment with dexamethasone:

Gene ID Symbol Base mean log2(FC) p-value log10(p-value) Adj. p-value log10(adj. p-value)
ENSG00000235655 NA 3077 15.6 0.00 -37.8 0.00 -36.5
ENSG00000282230 ADAM9 2889 15.6 0.00 -27.7 0.00 -26.6
ENSG00000244398 NA 1582 14.7 0.00 -34.3 0.00 -33.0
ENSG00000283485 NA 3148 -14.7 0.00 -45.6 0.00 -44.1
ENSG00000280830 ADI1 1388 14.5 0.00 -33.2 0.00 -32.0
ENSG00000234851 NA 1.20×10^4 -14.2 0.00 -83.0 0.00 -81.0
ENSG00000235043 NA 803 13.7 0.00 -23.4 0.00 -22.4
ENSG00000204388 HSPA1B 737 13.6 0.00 -29.6 0.00 -28.5
ENSG00000255072 PIGY 732 13.6 0.00 -29.3 0.00 -28.2
ENSG00000220842 NA 659 13.4 0.00 -24.5 0.00 -23.5

Source code

The full source code to conduct this study is available at The results are also available as reports at and


  1. A549 cell. (2018, June 27). Retrieved November 9, 2018, from
  2. Bordag, N., Klie, S., Jurchott, K., Vierheller, J., Schiewe, H., Albrecht, V., . . . Selbig, J. (2015). Glucocorticoid (dexamethasone)-induced metabolome changes in healthy males suggest prediction of response and side effects. Scientific Reports, 1-12. srep15954(2015)
  3. Gour, N., & Wills-Karp, M. (2015). IL-4 and IL-13 signaling in allergic airway disease. Cytokine,75(1), 68-78.
  4. Kawakami, M., Kawakami, K., Stepensky, V. A., Maki, R. A., Robin, H., Muller, W., . . . Puri, R. K. (2002). Interleukin 4 receptor on human lung cancer. Clinical Cancer Research, 8(11), 3503-3511.
  5. Keratinocyte. (2018, July 18). Retrieved November 9, 2018, from
  6. Klaus, B., & Reisenauer, S. (2018, September 14). An end to end workflow for differential gene expression using Affymetrix microarrays. In Bioconductor. Retrieved November 9, 2018, from
  7. Kolde, R. (2018). Package ‘pheatmap.
  8. Reddy, T. (2015, March 16). ENCBS527MAH / cell line. Retrieved November 9, 2018, from
  9. Stojadinovic, O., Lee, B., Vouthounis, C., Vukelic, S., Pastar, I., Blumenberg, M., . . . Tomic-Canic, M. (2011, January 7). Effects of glucocorticoids in epidermal keratinocytes. Retrieved November 9, 2018, from
  10. Tomita, M., Itoh, H., Kobayashi, T., Onitsuka, T., & Nawa, Y. (1999). Expression of mast cell proteases in rat lung during helminth infection: mast cells express both rat mast cell protease II and tryptase in helminth infected lung. International Archives of Allergy and Immunology, 120(4), 303-309.
  11. Waters, R. S., Perry, J. S. A., Han, S., Bielekova, B., & Gedeon, T. (2018). The effects of interleukin-2 on immune response regulation. Mathematical Medicine and Biology, 1-52.
  12. Yu, G., & He, Q.-Y. (2016). ReactomePA: An R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems, (2), 477-479.

Microarray Analysis of T. thermophilus at Different Stages of Growth

6150 Group 4:  Tatyana Kiryutina, Tianci Li, Jiahan Zhan, Junkai Yang, Madeline Gray


Thermus thermophilus is a thermophilic, gram negative bacterium that can live either anareobically or aerobically and was originally isolated from a hot spring. They may reproduce in the temperature range of 50° to 82° Celsius and in the wide range between pH 3.4 and 9.0. They are often used as a model organism for extremophiles for genetic manipulation and systems biology studies, and are particularly useful when studying temperature responses in microorganisms.

In this project, we chose two microarray datasets sampling T. thermophilus under different conditions and compared the gene expression of the wild type strains from both studies. The microarray datasets described and analyzed were retrieved from accession numbers GSE22567 and GSE12436 in the GEO database. Both of these projects compare gene expression in wild type samples to knockout mutants. These mutants include mutS, a DNA mismatch repair protein and ttcsp1 a cold-shock protein believed to play a role in transcriptional regulation under colder temperatures (45°C). Even though both studies use the same strain (HB8), they grow the wild types (and samples) under slightly different conditions.  In GSE12436, the bacteria were grown in a rich media (TR) for 16 hours, sampled at 70°C and after 30 additional minutes at 45°C; in GSE22567, a slightly different TR media was used, bacteria were only grown at 70°C, and were sampled after 5 hours[1].

As the media used in each were of very similar composition and samples were taken at a constant temperature, we compared the wild type gene expression across both experiments to look for possible differences in expression due to stage of growth. After 5 hours of growth, bacteria are generally considered to be in the logarithmic stage of growth, whereas 16 hours of growth is considered stationary phase when the carrying capacity is reached. By comparing the gene expression profiles of T. thermophilus in exponential and stationary phases, we gain valuable insight on how growth conditions affect gene activity in thermophilic bacteria.





Figure 1. Workflow of gene expression analysis of microarray datasets GSE22567 and GSE12436. Both datasets were run through the pipeline individually first, then combined to compare data between wild type sample sets of each experiment.


Collection and Differentially Expressed Genes Analysis

To obtain the microarray datasets, we searched and selected datasets through the ExAtlas web interface[2].  The microarray datasets GSE22567 and GSE12436 were run through the above pipeline individually (comparing only the wild-type strains and the mutants within each dataset) and then an overlap analysis was run comparing wild type strains from both experiments. We picked the datasets through ExAtlas and then used ExAtlas to run an ANOVA to generate matrices ranking the intensities of each gene probed. We then used ExAtlas to further organize the genes in the matrix by generating a list of differentially expressed genes under default parameters of the program (False Discovery Rate <0.05, Fold Change threshold: 2, using Median as the baseline).  We used the same parameters for each individual dataset analysis and for the overlap analysis of both datasets.


Heatmap & PCA Analyses

For all three datasets, we used ExAtlas to generate a heat map with hierarchical clustering as well as PCA analyses. The microarray datasets were again analyzed with individual gene lists and the overlap gene list. The microarray and RNA-seq clustering analyses parameters were set to filter rows and columns, cluster in hierarchy, and use a fold change threshold of 2; all PCA analyses parameters utilized the default settings of ExAtlas (FDR <0.05, Fold change threshold: 2, Correlation: 0.7).


Pathway Analysis

Lists of differentially expressed genes were output as locus IDs from ExAtlas, and then converted to gene IDs through the David 6.8 online interface[3]. From these gene ID lists, we used default grouping parameters (Kappa Similarity: Term overlap of 4 and similarity threshold of 0.35) in David 6.8 to cluster genes into functional groupings for each data set separately, then comparing both microarray datasets, (wild-type samples only). We then used KEGG to identify local and global metabolic pathways of the differentially expressed genes to gain insight as to the changes occurring under different experimental conditions.


Results and Discussion


Gene set N genes upregulated N genes downregulated N total
Deletion of ttcsp1 0 min 42 64 106
Deletion of ttcsp1 45oC 30min 15 19 34

Table 1. Table of differentially expressed genes from GSE12436 generated by ExAtlas.

As this dataset appears to have not been used to publish, we do not have a defined set of results to compare our expression analysis to. There is a related paper published by the same authors[4] using different datasets analyzing the same gene in the same organism, but this paper analyzed replicates of each condition individually and does not provide a comparison across sample sets as we have. Among the differentially expressed genes found, we were able to use David 6.8 to identify four main functional groupings that were impacted in this dataset; they include ABC transporters, carbon metabolism, methane metabolism, and “metabolism in diverse environments”. The ttcsp1 mutants were found to have down regulated genes used in degrading aromatic compounds, though this grouping was not found to be differentially expressed or down regulated in the wild type samples. 


Gene set N genes upregulated N genes downregulated N total
Deletion of mutS 3 2 5

Table 2. Table of deferentially expressed genes from GSE22567 generated by ExAtlas.

 As we used different analysis tools and different parameters to compare the controls and mutants within each data set, our results are somewhat different from those published. However, our analysis did return genes in the same family as those mentioned in the published data. For GSE12436 in particular, the list does not include the exact same genes (only similar genes in the same family) as the full experiment and dataset was analyzed in a Super Series. As the differential gene expression results of the mutants were not comparable between the two datasets, we decided to focus our analysis and further pursue differential expression between wild type samples of T. thermophilus of both datasets. After lookinng up genes in David and KEGG, we found that the upregulated genes play a role in thiamine metabolism while the downregualted gene in the mutant was DNA mismatch repair protein mutS (which is to be expected).


GSE22567 Dataset: Differential expression in mutS mutants


Figure 2. Principle Component Analysis comparing gene expression in Thermus thermophilus strain HB8 wild type and mutS mutants. The PCA plots above were generated through ExAtlas with default settings (FDR: 0.05, Fold change threshold: 2, Correlation: 0.7). The wild type strain samples do not cluster in any of the components in either dimensional representation, particularly with the second replicate as an outlier; this suggests a possible inconsistency in experimental methods and/or replicates.


Figure 3. Hierarchical clustering of gene expression in Thermus thermophilus strain HB8 wild type and mutS mutants. The heat map shows hierarchical clustering of the genes for each replicate of the two treatments (with and without the mutS gene). Consistent with the PCA results, this clustering suggests that the wild type second replicate is inconsistent with other wild type samples. While there are some areas of clustering among samples of the same treatment, overall this analysis suggests there is a high degree of variability in gene expression even across samples of the same treatment.

After analyzing this dataset, we found three genes commonly upregulated in the mutS mutants and two genes commonly down regulated in the mutS mutants; the upregulated genes play a role in Thiamine metabolism as discovered by clustering in David and pathway analysis in KEGG. The two down regulated genes include the mutS gene of interest, that plays a role in DNA mismatch repair, and another unknown gene. These genes were commonly upregulated in the wild type replicates, as to be expected.


GSE12436 Dataset: Differential expression in ttcsp1 mutants at 70° and 40°


Figure 4. Principle Component Analysis comparing gene expression in Thermus thermophilus strain HB8 wild type and ttcsp1 mutants at 70° and 40°. The PCA plots above were generated through ExAtlas with default settings (FDR: 0.05, Fold change threshold: 2, Correlation: 0.7). The plot on the left depicts each sample set in 2D and the plot on the right in 3D with replicates already combined in each. As each sample set was treated with different conditions (variables including temperature (represented in the sample names as 0 minutes=70°C and 30 minutes=45°) and presence/absence of ttcsp1 gene), the sample sets vary as expected. The samples could be suggested to cluster both by temperature (the 45°C samples) and by genotype (wild type samples) in the 2D plot, depending on the component of interest- however there appears to be no clustering in the 3D plot.



Figure 5. Hierarchical clustering of gene expression in Thermus thermophilus strain HB8 wild type and ttcsp1 mutants at 70° and 40°. The heat map shows hierarchical clustering of the genes for each of the four treatments (with and without ttcsp1 gene; at 70°/0 minutes and 45°/30minutes). The clustering shows clearer distinctions between expressed genes at each temperature regardless of ttcsp1 gene presence, which is to be expected for normal cell temperature responses. This map shows there are still differences in gene expression in the ttCsp1 mutants during the temperature changes, suggesting that this gene may not play a role in the temperature response as previously believed.



GSE22567 and GSE12436 Wild Type Samples- Cross Comparison


WT 0min_up WT 45oC 30min_down
wild-type strain_3_down 9 11

Table 3. Table of overlapping deferentially expressed genes between GSE22567 and GSE12436 generated by ExAtlas.



Figure 6. Principle Component Analysis comparing gene expression in Thermus thermophilus strain HB8 wild type from GSE22567 and GSE12436 datasets. The PCA plots above were generated through ExAtlas with default settings (FDR: 0.05, Fold change threshold: 2, Correlation: 0.7). The plots depicts each sample set of wild type strains of Thermus thermophilus HB8 from both datasets; the samples labeled as “WT 45oC 30min” and “WT 0min” are sample sets from dataset GSE12436, and the sample labeled as “wild type strain” is from dataset GSE22567. The samples “WT 0min” and “wild type strain” were both grown at 70°C, while sample “WT 45oC 30min” was collected after temperature shift to 45°C. It is possible that differences in the growth phase between the experiments explain the variance across wild type sample sets sampled at the same temperature.




Figure 7. Hierarchical clustering of gene expression in Thermus thermophilus strain HB8 wild type from GSE22567 and GSE12436 datasets. The heat map shows hierarchical clustering of genes for each of the wild type sample sets from both datasets. The clustering of genes suggests similar expression patterns primarily between both wild type strains at 70°, with the wild type strain at 45° showing nearly opposite expression patterning for a large cluster of genes.


 Functional Annotation and Pathway Analysis for Microarray Data

 Wild type strains of Thermus thermophilus HB8 were grown in study GSE22567 at 70°C and in GSE112436 at 70° and 45°C. There were 9 common genes that are differentially expressed in both datasets. Among these, the most prominently annotated genes include cytochrome c oxidases (or complex IV) proteins that exist in the transmembrane region of the respiratory electron transport chain. These isoforms catalyze reaction with EC number, which is the final step in the mitochondrial electron transfer chain. It is a key reaction in electron transport because it catalyzes transfer of electrons to oxygen, converting it into water. This reaction causes two protons to translocate across membrane to increase electrochemical potential required for ATP synthase to synthesize ATP.


Figure 8. Enzyme complexes of oxidative phosphorylation. Complex IV reaction noted in red text. This was retrieved from searching locus ID TTHA1134 in KEGG (organism code ttj).



Figure 9. Global pathway analysis showing oxidative phophorylation among other metabolic pathways in T. thermophilus. Cytochrome C oxidase proteins are critical proteins of this pathway. This figure shows all metabolic pathways mapped for Thermus thermophilus in the KEGG pathways database. Encased in the black rectangle is the specific oxidative phosphorylation pathway in which cytochrome c oxidases play a role. This pathway in particular was found to be differentially regulated under differing culture and growth conditions between both studies.

In the GSE22567 study these two genes were downregulated in the wild-type strain of thermus thermophilus HB8 yet upregulated in the GSE12436 study under the same temperature of 70°C in TR medium. There are several differences in growth conditions to note. First, there were slightly different compositions of polypeptone, yeast extract, NaCl, and metals necessary to solidify gellan gum (CaCl2 and MgCl2) for solid culture medium. pH adjustments were conducted with NaOH, but up to pH 7.2 in GSE12436 and pH 7.5 in GSE22567. The main difference was in growth time. In GSE12436 growth was conducted for 16 hours to collect cultures at the stationary phase of bacteria life cycle, and in GSE2567 cultures were collected at after 5 hours of growth during the exponential phase. According to the results, in the younger strain (5 hours) the activity of complex IV is downregulated and in older strains (16 hours) it is upregulated. In addition, GSE12436 was a temperature study in which cells were also collected 30 min after temperature shift at 45 C. The two complex IV isoforms were downregulated for wild type thermus thermophilus HB8 strains at both 45 C (GSE12436) and 70 C (GSE12436).

As cytochrome C oxidase generally plays a role in respiration and resides in the mitochondrial membrane, our findings of differential regulation of this complex at different growth stages and temperatures suggests that the cells are likely altering their metabolism. They could be shifting between respiration and fermentation, or responding to oxidative stress from environmental changes in the media due to an increase in population density & activity.



  1. Fukui K, Wakamatsu T, Agari Y, Masui R et al. Inactivation of the DNA repair genes mutS, mutL or the anti-recombination gene mutS2 leads to activation of vitamin B1 biosynthesis genes. PLoS One2011 Apr 28;6(4):e19053.
  2. Sharov, A.A., Schlessinger, D., and Ko, M.S.H. 2015. Exatlas: An interactive online tool for meta-analysis of gene expression data. J. Bioinform. Comput. Biol., DOI: 10.1142/S0219720015500195.
  3. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.  [PubMed]
  4. Tanaka, T. , Mega, R. , Kim, K. , Shinkai, A. , Masui, R. , Kuramitsu, S. and Nakagawa, N. (2012), A non‐cold‐inducible cold shock protein homolog mainly contributes to translational control under optimal growth conditions. The FEBS Journal, 279: 1014-1029. doi:1111/j.1742-4658.2012.08492.x
  5. Castresana J, Lübben M, Saraste M, Higgins DG (June 1994). “Evolution of cytochrome oxidase, an enzyme older than atmospheric oxygen” (PDF). The EMBO Journal13 (11): 2516–25. PMC 395125PMID 8013452.








Differential gene expression in naturally occurring and engineered neuroendocrine small cell lung cancers (SCLCs)

Group Members: Nishant Gerald, Preethi Gowrishankar, Prerna Jain, Gabriel Leventhal-Douglas, Siddartha Sharma


In 2015, cancer was the leading cause of death in the US, with lung cancer being the leading cause of death by cancer (US Cancer Statistics Working Group). Small cell lung cancer (SCLC) makes up between 10-15% of all lung cancers and is generally quite aggressive, metastasizing at a much faster rate than non-small cell lung cancers (American Cancer Society).

Park et. al have worked on altering the following five oncogenic drivers (hereafter referred to as PARCB) in healthy epithelial tissue cells (lung and prostate) to transform them into tumorigenic cell lines:

  • P53 – a tumor suppressor protein, most frequently mutated in human cancer and plays a crucial role in tumorigenesis (Surget et al, 2013)
  • myr ATK1 – myristoylated serine/threonine kinase1
  • RB1 – a short hairpin RNA  
  • c-Myc – which belongs to a family of proto-oncogenes which synthesize transcription factors which contribute to cell proliferation (NCBI)
  • BCL2 – which codes for regulatory proteins that control apoptosis (Tsujimoto et. al)

After investigating the engineered SCLC samples created with the oncogenic drivers detailed above, our goal was to compare expression profiles from naturally occurring and engineered small cell neuroendocrine cancers, and more specifically small cell lung cancer cell lines, to check for commonalities in expression patterns. This information could them be used to identify key contributors to the carcinogenic behavior of the cells. Subsequently, we aimed to identify points of vulnerability in pathways that could potentially be targeted by drug therapies.




Figure 1. Workflow diagram

Initially, our analysis was done on both datasets together, as shown in the middle fork in Figure 1. However, the lack of commonalities in the differentially expressed genes found between the two datasets led us to also complete the analytical pipeline with each dataset individually in order to compare the overall pathways impacted in each of the datasets.

Dataset Selection

Our datasets (GSE118206 and GSE60052) were chosen for their similarities in cell line and assay type.  Both these datasets contained RNA-seq data for small cell lung cancer (SCLC). However, GSE118206 comprised of samples from tumors that were genetically engineered, while the samples in GSE60052 were naturally occurring. We chose four lung tumor samples and two control samples from each dataset for our analysis, detailed in Table 1.

Sample ID

Sample Name Sample Type Dataset
SRR1797218/GSM1464282 B08-3483NA Control GSE60052
SRR1797219/GSM1464283 B08-3758NA Control GSE60052
SRR7651701/GSM3321025 NHBE1 RNA-seq Control GSE118206
SRR7651702/GSM3321026 NHBE4 RNA-seq Control GSE118206
SRR1797225/GSM1464289 11A Naturally Occurring GSE60052
SRR1797226/GSM1464290 12A Naturally Occurring GSE60052
SRR1797227/GSM1464291 13A Naturally Occurring GSE60052
SRR1797228/GSM1464292 14A Naturally Occurring GSE60052
SRR7651710/GSM3321034 NHBE1-PARCB1 RNA-seq Engineered GSE118206
SRR7651711/GSM3321035 NHBE1-PARCB2 RNA-seq Engineered GSE118206
SRR7651712/GSM3321036 NHBE4-PARCB1 RNA-seq Engineered GSE118206
SRR7651713/GSM3321037 NHBE4-PARCB2 RNA-seq Engineered GSE118206

Table 1. SCLC samples included in analysis.

The engineered tumors arose from one of the two cell lines in the normal human bronchial epithelium (NHBE) controls for that dataset (NHBE1 and NHBE4) and were engineered by the manipulation of a set of five pre-identified oncogenic drivers (PARCB).

Differentially Expressed Genes

We utilized Kallisto ( to index the reference transcriptome and complete transcript quantification in conjunction with Sleuth ( in order to obtain a list of genes that were differentially expressed across the samples of interest. Stringent q values (~0.0015) were used to filter out the most significantly expressed genes and control for false positives. 

Principal Component Analysis and Heatmap

Principal component analysis and heatmap creation was carried out using ClustVis (, which is an R-based web tool. Counts data obtained from sleuth was parsed to obtain differential counts across different samples within a dataset, and was subsequently used for analysis.

Pathway Analysis – Local

Conducted with KEGG Mapper to find relevant pathways according to both differentially expressed gene lists; DS1 (Engineered SCLC) & DS2 (Clinical SCLC). Search Pathway was the mapping tool used to identify significant genes. The reference pathway used was KO (KEGG Orthology) to denote any cellular process differences between datasets. The KEGG pathways generated highlight genes from our input lists in RED.

Pathway Analysis – Global

Global Pathway Analysis was carried out using the Reactome Software. Lists of differentially expressed genes were used as input for three separate pathway analyses:

  1. First set of genes came from engineered cancer cell lines.
  2. Second list of genes came from physiological cancer cell lines.
  3. A third set was used to check for genes expressed when both cell lines were were run together.

After adding differentially expressed gene names to the input, all non-human identifiers were converted to human equivalents before running analysis. Pathways were analyzed from the results obtained.


Differential Expression


Table 2. Tabulated lists of the top 10 differentially expressed genes (upregulated in green and downregulated in red) for: A) Dataset 1 (Engineered); B) Dataset 2 (Naturally occurring)


Screen Shot 2018-11-09 at 11.05.09 AM

Table 3. Principal component analysis across all datasets (individual and combined) shows that majority of the segregation is explained by the first two principal components.

Screen Shot 2018-11-06 at 12.06.08 AM

Figure 2. PCA of Dataset1 and Dataset 2. While the engineered SCLC samples segregated from their controls, clinical tumor samples and the controls all clustered together. We see that engineered SCLC samples cluster along their origin cell lines (NHBE1 and NHBE4).

Screen Shot 2018-11-06 at 12.06.36 AM

Figure 3. PCA of the first dataset shows clustering of the cell lines from the above plot in more detail. There is segregation between the controls and engineered SCLC samples, and between the origin cell lines within the cancerous samples. Much of this difference is explained by PC1, which accounts for 84.7% of the variation between the samples.

Screen Shot 2018-11-06 at 12.07.03 AM

Figure 4. PCA of the clinical dataset showed that the controls and cancerous samples segregated along PC1, which accounted for 75% of the variation between the samples. Unlike the engineered dataset, there seemed to be more variation between the controls, as they were clearly separated along the axis of PC2, which represents 9.5% of the variation between the samples. PC2 only represents a very small portion of the variation between the samples compared to PC1, so the difference between the controls and the cancerous samples is much more significant than the difference between the two controls.



Figure 5. Heatmap of Dataset 1 (Engineered) showed a distinct difference between the controls and the tissue samples. Within the cancer samples, there was a significant difference in expression between the two cell lines. The difference in expression profile in the cancerous cells seems to be tied more closely to the origin cell line rather than to the genetic factor that was used to create the cell line.


Figure 6. Heatmap of Dataset 2 (clinical). The top half of the differentially expressed genes are all under-expressed in the naturally occurring lung cancer cells, while the bottom half of the genes are almost all overexpressed. This pattern is similar to the one seen for the samples in the engineered dataset.

Pathway Analysis – Local

While each dataset’s pathway analysis yielded multiple significant pathways, one common pathway was significantly observed (Small cell lung cancer).


Figure 7. Small cell lung cancer pathway analysis for Dataset 1. Dataset 1 excludes the E2F transcription factor (involved in the cell cycle) while containing genetic alterations for tumor suppressors. Genes involved in this pathway observed from our data include LAMA4 and RB1.


Figure 8. Small cell lung cancer pathway analysis for dataset two contains common highlighted elements with dataset 1. In addition, the E2F transcription factor for RB (in cell cycle) is present. All differentially expressed genes observed from this dataset are  E2F2, FN1 and LAMB1.

Pathway Analysis – Global

The global pathway analysis done by Reactome provides over-representation data that determines whether certain pathways are over-represented or not.


Figure 9. Global pathway map for both datasets

For combined samples, out of all the differentially expressed genes, identifiers were found for 90 genes and a total of 697 pathways were hit by at least one of them.


Figure 10. Global pathway map for Dataset 1 (Engineered)

For engineered samples, out of all the differentially expressed genes, identifiers were found for 61 genes and a total of 489 pathways were hit by at least one of them.

In the pathway map for the engineered cell lines, we see over-representation in protein metabolism and transport alongside cell communication and apoptotic pathways.


Figure 11. Global pathway map for Dataset 2 (Naturally occurring)

For clinical samples, out of all the differentially expressed genes, identifiers were found for 76 genes and a total of 478 pathways were hit by at least one of them.

In the global pathway map for the natural lung cancer cell lines, we see that disease pathways and immune pathways are over-represented which makes sense for physiological lung cancer cell lines. We also see over-representation in Signal transduction pathways alongside other over-represented pathways.

[R-HSA-5683826] Surfactant metabolism-1

Figure 12. Surfactant metabolism pathway

A major pathway over-represented in the differentially expressed genes from the clinical dataset is the Surfactant Metabolism Pathway. Surfactant reduces the surface tension of fluid in the lungs and helps make the small air sacs in the alveoli more stable which keeps them from collapsing when an individual exhales. The Surfactant proteins A, B, C and D make up the surfactants alongside phospholipids. They also play an important role in innate host defense by binding and clearing invading microbes from the lung. Mutation in the genes involved in these processes can result in respiratory distress syndrome, lung proteinosis, interstitial lung diseases and chronic lung diseases.

We see that the surfactant proteins are downregulated in the differentially expressed genes which leads to reduced quantity of the release of the surfactant proteins causing various lung problems.

SFTPA2 and SFTPB are especially downregulated in the set of available differentially expressed genes and lower expression of these proteins leads to Idiopathic Pulmonary Fibrosis, pulmonary surfactant metabolism dysfunction 1 (SMDP1) and respiratory distress syndrome (RDS)

[R-HSA-5626467] RHO GTPases activate IQGAPs-1

Figure 14. IQCAP activation pathway. (One of the pathways showing expression of ACTB)

In the engineered cell lines, we see a host of over-expressed pathways such as activation of scaffolding proteins IQCAP’s which integrate multiple signalling pathways coordinate various cellular level activities, gap junction trafficking and regulation which coordinates cellular signalling, metabolism, secretion and contraction, cell junction organization and cell-extracellular matrix interactions which have an expression of ACTB. ACTB encodes β-actin, an abundant cytoskeletal housekeeping protein.

ACTB is highly downregulated in our set of differentially expressed genes. Downregulation of ACTB leads to loss of expression of the abovementioned pathways and disrupts the cytoskeletal structure and functions. The downregulation of ACTB in the engineered cell lines can help create tumorous formations in the tissue by causing loss of function of the Beta-Actin gene


In analyzing the differential gene expression of the cells from these two datasets, we were expecting to discover significant commonalities given that both the datasets consisted of neuroendocrine SCLC samples. 

One gene, AHNAK, was found to be downregulated in both the engineered and clinical datasets. AHNAK is a gene that codes for the Ahnak protein, which acts as a tumor suppressor. It does this by inhibiting the I-Smad mediated pathway, resulting in the downregulation of c-Myc and cyclin D (Lee, et. al.). Park et al. used c-Myc as one of the five key oncodrivers used to engineer the tumors in the dataset. Additionally, the downregulation of cyclin D by Ahnak results in the dephosphorylation of the Rb protein, which in turn results in cell cycle arrest. RB1, the gene that codes for the Rb protein, was found to be dysregulated in 62% of the patients from the clinical dataset (Jiang, et. al.). Thus, the downregulation of AHNAK is in line with what we would expect to see in carcinogenic samples and is consistent with the findings from the two papers.

Despite this similarity, overall our analysis showed that there were very clear differences in the expression profiles and impacted pathways between the two datasets. The mechanism of tumor formation seems to be very different in the two datasets, with a the engineered samples seeming to be focused on protein metabolism and transport while the clinical samples seemed to be affected in the surfactant metabolism pathway. This difference may be due to the fact that the method in which the tumors were created in engineered samples, while efficient, is not necessarily indicative of the method in which tumors are formed in naturally occurring SCLC.

Additionally, we found that the cancerous cell samples in the engineered dataset shared more similarities with the samples that shared their respective origin cell line that those that shared the mode of cancer induction, indicating that there may not be an overarchingly characteristic expression profile of SCLC. Rather, it seems that the cancerous activity of these samples is unique from cell line to cell line.


  1. American Cancer Society. Cancer Facts & Figures 2018. Atlanta, Ga: American Cancer Society; 2018.
  2. BJiang L, Huang J, Higgs BW, Hu Z et al. Genomic Landscape Survey Identifies SRSF1 as a Key Oncodriver in Small Cell Lung Cancer. PLoS Genet 2016 Apr;12(4):e1005895.
  3. Krajewska, M., Krajewski, S., Epstein, J.I., Shabaik, A., Sauvageot, J., Song, K., Kitada, S., Reed, J.C., Immunohistochemical analysis of bcl-2, bax, bcl-X, and mcl-1 expression in prostate cancers. Am. J. Pathol. 148, 1567–1576 (1996). pmid:8623925
  4. Lin, Y., Fukuchi, J., Hiipakka, R.A.,  Kokontis, J.M., Xiang, J., Up-regulation of Bcl-2 is required for the progression of prostate cancer cells from an androgen-dependent to an androgen-independent growth stage. Cell Res. 17, 531–536 (2007). doi:10.1038/cr.2007.12 pmid:17404601
  5. Park JW, Lee JK, Sheu KM, Wang L et al. Reprogramming normal human epithelial tissues to a common, lethal neuroendocrine cancer lineage. Science2018 Oct 5;362(6410):91-95.
  6. Surget S, Khoury MP, Bourdon JC (December 2013).Uncovering the role of p53 splice variants in human malignancy: a clinical perspective.
  7. Tsujimoto Y, Finger LR, Yunis J, Nowell PC, Croce CM (Nov 1984). Cloning of the chromosome breakpoint of neoplastic B cells with the t(14;18) chromosome translocation
  8. Tan HL, Sood A, Rahimi HA, Wang W, Gupta N, Hicks J, Mosier S, Gocke CD, Epstein JI, Netto GJ, Liu W, Isaacs WB, De Marzo AM, Lotan TL, Rb loss is characteristic of prostatic small cell neuroendocrine carcinoma. Clin. Cancer Res. 20, 890–903 (2014). doi:10.1158/1078-0432.CCR-13-1982 pmid:24323898
  9. U.S. Cancer Statistics Working Group. U.S. Cancer Statistics Data Visualizations Tool, based on November 2017 submission data (1999-2015): U.S. Department of Health and Human Services, Centers for Disease Control and Prevention and National Cancer Institute;, June 2018.

Differential Gene Expression by Gender in Lithium Responders

Differential Gene Expression by Gender in Lithium Responders

Kira Combs, Aparna Maddala, Collin Spencer, Vunya Srinivasa


Image result for bipolar disorder chart

Figure 1: Mood cycling exhibited by patients with bipolar disorder.

Bipolar Disorder & Treatments

Bipolar disorder (BPD) is one of the most difficult mental disorders to effectively treat. BPD consists of rotating manic and depressive episodes. Conventional psychotropic medications for depression are nominally better than placebos and cannot treat manic episodes associated with BPD. Lithium was the first medication to treat mania and has continued to be efficacious 60 years later. However, akin to all psychotropics, lithium does not have a similar effect on each patient. (Rybakowski, 2014) Genome wide association studies identifying common variants associated with Li response have been unreplicable and insignificant. All mental disorders exist on a dynamic spectrum, so it is entirely possible that common genetic variants may not be found with current technology or at all. Expression profiling, then, provides a more comprehensive view of the pathology of psychiatric diseases. Significant literature over the past 10 years has better elucidated the role of inflammation in depression and other mental disorders. Dysregulation of the Hypothalamus-Pituitary-Adrenal axis and the role of inflammatory cytokines has heralded a new era of psychiatric treatments that focus on the glutamatergic system. How lithium fits into this new paradigm is not well characterized.  Furthermore, how positive responders differ in expression profiles is unknown. The study our project is based on seeks to explore what biomarkers are associated with positive lithium response and hypothesize how lithium may act to improve mood. 

Overview of Dataset

Beech et. al (2014) captured expression profiles from blood samples for 60 patients before and after one month of treatment with Li. These samples were accompanied by treatment response assessed with the Clinical Global Impression Scale for Bipolar Disorder-Severity before and after the 6 month period. The overall goal of the study was to identify biomarkers associated with positive response to Lithium treatment. Researchers used Illumina BeadStudio software to generate probe and expression profiles from microarray data. Further mixed model analysis of variance was conducted to identify genes whose expression changed differentially over the experiment. Identified genes were linked to pathways through network analysis with DAVID. The team identified 62 genes with differential induction between treatment responders and non-responders. This was used to create a systems map of how Lithium contributes to mood regulation. They identified dysregulation of the glutamatergic pathway and an increase in the balance of anti-apoptotic to pro-apoptotic gene expression. This is consistent with data from other mental disorders implicating glutamatergic and GABAergic pathways in disease pathology.

Figure 2: Mechanism of lithium action on the gutamatergic pathway. RD Beech et. al, Gene-expression differences in peripheral blood between lithium responders and non-responders in the Lithium Treatment-Moderate dose Use Study (LiTMUS), The Pharmacogenomics Journal (2014), 14, 182-191

Purpose & Methodology


Contemporary studies report an increased risk in women of bipolar II/hypomania, rapid cycling and mixed episodes (Diflorio & Jones, 2010). The goal of our project is to identify differences in gene expression involved in positive response to lithium between male and female patients with bipolar disorder (BPD). Furthermore, we create a set of differentially expressed genes and explore what overarching biological pathways they are implicated with compared to the original study.


Figure 3: Analysis pipeline. Clustering and PCA were conducted using ExAtlas. DAVID was used to conduct local pathway analysis, while Reactome was used to conduct global pathway analysis.

Differentially Expressed Gene Analysis

ExAtlas was used to identify differentially expressed genes between male and female positive responders after 1 month to Li+OPT or OPT treatment. Pairs of expression profiles between males and females were compared through a meta-analysis of samples to generate differentially expressed genes between the groups. A FDR threshold of 0.05 controlled for any false positives with a fold change > 2 used as default criteria of statistical significance. Additionally, a differentially expressed gene analysis was conducted for the male samples and the female samples separately. Once again, a FDR threshold of 0.05 controlled for any false positives with a fold change > 2 used as default criteria of statistical significance.

Clustering and PCA

Clustering analysis was conducted through ExAtlas. Three different heatmaps were generated, one for the male dataset, one for the female dataset, and one for the two datasets together. This allowed for sufficient comparison between the two groups. ExAtlas was also used to conduct a principal component analysis (PCA), which was used to examine the variance between various sample groups.


Global Pathway analysis was conducted in Reactome, and our list of differentially expressed genes was uploaded into DAVID for local pathway analysis. After receiving the generated list of relevant pathways, the four pathways with the lowest p-values were chosen for further examination.

Results and Discussion

Differentially Expressed Genes

Upon conducting differentially expressed gene analysis with a fold change threshold of 2, we found a total of 5335 significantly regulated genes. 2559 (48%) of those genes were upregulated, while (2776) 52% were downregulated.

Total number of significantly regulated genes 5335 (FC > 2)
Number of upregulated genes 2559 (48%)
Number of downregulated genes 2776 (52%)

Table 1: Significantly regulated genes uncovered through differentially expressed gene analysis.

Clustering and Heatmap

heatmap (2)

Figure 4: Clustering analysis heatmap comparing male and female nonresponders to responders. The first column represents a female nonresponder, while the ninth column represents a male nonresponder. The remaining samples are responders to either treatment.

Figure 4 is a visual representation of our clustering analysis. Several of the samples exhibit large areas of significantly downregulated genes, but none of these areas occur in the same genes. This indicates a highly variable response to treatment between individual responders.

Screen Shot 2018-10-31 at 4.19.28 PM

Figure 5: Overlap heatmap comparing male and female samples. The vertical samples are female, while the male samples are horizontal.

To compare the gene expression between male and female samples, we separated all the male and female samples and created an overlap heatmap between the two (Figure 5). Once again, this heatmap emphasizes the high degree of variance between samples. However, the first male sample had a high degree of overlap with the first two female samples, and all of these samples were responders. Therefore, these three responders seemed to have highly similar gene expression profiles.

Principal Component Analysis

Principal component analysis (PCA) was then used to examine the variance between the various sample groups. The first principal component captured 11.3% of the variance, while the second principal component captured 10.7% of the variance. 32.6% of variance was captured by the first 3 principal components. Therefore, our results from the PCA do not appear to be significant. This confirms the results of the clustering analysis, as no easily visible patterns were apparent in the heatmaps.

Percent Cumulative Percent
PC1 11.3%
PC2 10.7% 22.0%
PC3 10.6% 32.6%

Table 2: Principal component analysis of the samples. The first three principal components captured 32.6% of the variance, so our results are not significant.


Figure 6: PC2 Scatterplot. There is a large amount of clustering in the scatterplot, and the results of our PCA were insignificant (32.6% of variance captured by first 3 PCs).

The scatterplot of our PC2 results still shows a large amount of clustering. The largest cluster contains both male and female patients from both the Li + OPT and the OPT treatment groups, suggesting that there was no universal difference between males and females or Li + OPT and OPT alone. All of the samples outside of the cluster were responders, suggesting that the gene expression for some responders was markedly different from the remaining samples. However, there did not seem to be trends between these responders. Once again, there was a high degree of individual variation in treatment response.

Pathway Analysis

The top gene pathways identified by DAVID for our differentially expressed genes lists included: maturity onset diabetes of the young, olfactory transduction, neuroactive ligand-receptor interaction, and the cAMP signaling pathway. 

**All up-regulated genes that were common between our differentially expressed genes list and the pathways we used for analyzed are indicated with red highlighting of the gene symbol**


Figure 4: Genes of the Maturity Onset of Diabetes of the Young pathway; P-value = 3.5e-5

The maturity onset diabetes of the young (MODY) pathway represents genes involved in the development of the MODY disease condition which is caused by mutation of at least five genes encoding transcription factors. 

We observed that bipolar disorder appears to be co-morbid with diabetes and outside literature does confirm that there is a correlation. Bipolar seems to be frequently correlated with physiological ailments associated with its risk factors like obesity. This includes hypertension, alcoholism, and diabetes mellitus. This provides an explanation for why the diabetes pathway shows up among the most significant pathways associated with our genes list.


Figure 5: Genes of the Olfactory Transduction Pathway; P-value = 1.1e-4

The olfactory transduction pathways includes the genes involved in transduction of the binding of an odorant molecule into an electrical signal detected by the brain. The pathway involves activation of a type III adenylyl cyclase which increases intracellular levels of cAMP. Outside literature on the subject suggests that bipolar disorder can be detected via observation altered calcium signaling in olfactory neurons and it presents as a mental disorder, affecting neuronal cell signaling.

Figure 6: Genes of the Neuroactive Ligand-receptor Interaction pathway; P-value = 2.6e-4

The neuroactive ligand-receptor interaction pathway includes all of the genes involved in cell signaling between neurons. Up-regulated genes within this pathway included amino acid transport genes, corroborating the results of our reference study which also showed some up-regulation of genes within the same family. 

Figure 7: Genes of the cAMP Signaling pathway; P-value = 5.1e-4

The cAMP signaling pathway represents the general pathway for cAMP intracellular signaling which is activated by adenylyl cyclase and GPCR.  The cAMP signaling pathway is a second-messenger cell-signaling pathway that is used in a large variety of cells and regions of the body; therefore, alteration of this pathway can lead to the development of one of many pathological conditions, including mental disorders like bipolar disorder. Of significance within the cAMP signaling pathway is the up-regulation of genes controlling calcium handling and hyper-excitability of hippocampal neurons, specifically PLC and BDNF respectively, more indication of the presence of the disease condition.

Global Pathway Analysis

Figure 8: Global Pathway Analysis of our Differentially Expressed Gene List using Reactome

According to our global pathway analysis, we observed up-regulation of pathways classified under Immune System, Signal Transduction, Metabolism, and Disease. Darker blue pathways indicated particularly over-represented pathways, appearing mostly within Signal Transduction and Metabolism, but also within Disease to a lesser extent. These observations correlate well with our reference paper which looked at Bipolar Disorder and also points to its potential for comorbidity with physiological disorders. Medications such as mood stabilizers, anticonvulsants, and antipsychotics used commonly to treat bipolar disorder have been associated with detrimental metabolic effects leading to development of risk factors such as diabetes and obesity which very quickly can lead to disease conditions like CVD (cardiovascular disease). This explains many of the overrepresented pathways that we observed within our global pathway analysis. Overrepresented pathways were also observed within genes responsible for metabolism of proteins; this would be an interesting link to investigate further.

Overall, the pathways we identified as most significant correlated well with the disease condition that our reference paper investigated; however, there were no common specific genes between our analysis and the reference paper.


Analysis showed no significant difference in response to treatment between males and females, and no difference in the two types of treatments. Gene expression was highly individual and seemed to have no correlation to treatment type or gender. However, all individuals with highly different gene expression were responders to OPT. OPT is not a standardized treatment and is individually optimized. This makes it harder to predict response to treatment, and this could have had an effect on the highly individualized gene expression we saw. Additionally there was no elimination of baseline gene expression, making our number of differentially expressed genes more substantial. Overall, this analysis indicates that there is not much basis for genetic predictions of treatment response in regards to bipolar disorder, and future research should explore this further.


Amy M Kilbourne et. al, “Burden of General Medical Conditions Among Individuals with Bipolar Disorder”, Bipolar Disorders: An International Journal of Psychiatry and Neurosciences (2004)

Arianna Diflorio & Ian Jones, “Is sex important? Gender differences in bipolar disorder”, International Review of Psychiatry (2010), 22:5, 437-452,

Chang-Gyu Hahn et. al, “Aberrant Intracellular Calcium Signaling in Olfactory Neurons from Patients with Bipolar Disorder”, American Journal of Psychiatry (2005)

Fabregat, A., S. Jupe, L. Matthews, K. Sidiropoulos, M. Gillespie, P. Garapati, R. Haw, B. Jassal, F. Korninger, B. May, M. Milacic, C.D. Roca, K. Rothfels, C. Sevilla, V. Shamovsky, S. Shorser, T. Varusai, G. Viteri, J. Weiser, G. Wu, L. Stein, H. Hermjakob, and P. D’Eustachio. 2018. The Reactome Pathway Knowledgebase. Nucleic acids research. 46:D649-d655.

Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1-13.

Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.

Janusz Rybakowski ,“Response to lithium in bipolar disorder: clinical and genetic findings” ACS chemical neuroscience vol. 5,6 (2014): 413-21.

JW Newcomer, “Medical Risk in Patients with Bipolar and Schizophrenia”, The Journal of Clinical Psychiatry (2006), 67, 25-30:36-42

Kuo Yan, “The cyclic AMP signaling pathway: Exploring targets for successful drug discovery (Review)”, Molecular Medicine Reports (2016), 13(5): 3715-3723

RD Beech et. al, “Gene-expression differences in peripheral blood between lithium responders and non-responders in the Lithium Treatment-Moderate dose Use Study (LiTMUS)”, The Pharmacogenomics Journal (2014), 14, 182-191

Sharov, A.A., Schlessinger, D., and Ko, M.S.H. 2015. Exatlas: An interactive online tool for meta-analysis of gene expression data. J. Bioinform. Comput. Biol.

Wang, J., Vasaikar, S., Shi, Z., Greer, M., & Zhang, B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Research.


Differential Expression of CD99 in Males and Females

Project Group: Taylor Cooper, Josiah Davidson, Quincy Faber, & Jake Holt


Gene expression can be studied to determine risk for certain diseases. Many proteins are considered biomarkers; their level of expression can indicate disease. CD99 is a transmembrane protein that functions in cell migration, death, and differentiation. It is specifically part of the leukocyte-endothelial transmigratory pathway. CD99 on both the leukocyte as well as the endothelial cells must bind in order for diapedesis to occur (1). Because of its role in transmigration its overexpression is common in many cancers including Ewing Sarcoma, lymphoblastic lymphoma/leukemia, and malignant gliomas, where it allows migration, invasion, and metastasis (2).

Our project goal was to analyze the effect of gender on tissues in the frontal cortical regions of the brain. Our dataset consisted of the microarray expression profiles from Affymetrix of 41 brain tissue samples extracted post-mortem (3).





Figure 1. Methodology flow chart

Analysis of Differentially Expressed Genes:

Microarray expression profiles of the brain tissues were analyzed using AtlAnalyze v 2.1.0. Our data set was split into males and females: 21 males and 20 females. We used default settings for the analysis fold change 2. This gave us the expression profile matrix which was used to find a list of upregulated and downregulated genes.

Control of False Positives:

Benjamini-hochberg was used to control the false discovery rate.


AltAnalyze was used to create a hierarchial clustering heat map. Java TreeView was used to export image files of the clustering results. The column clustering method used was hopach and the row clustering method used was weighted. Rows were normalized relative to the row median.

Principal Component Analysis:

Principle Component Analysis was used to visualize trends. This was done in AltAnalyze on the total data set and the data was displayed on a three-dimensional map. The False Discovery Rate was set at 0.05 and the fold change was 2.

Pathway Analysis:

DAVID was used to find pathway analysis results. We used the Kegg pathway. Reactome was used to find the global picture of our pathway analysis results.


Results and Discussion:

The data set was segregated into two comparison groups based on sex. Age was not taken into account. Since the two groups both consisted of brain tissue from Homo sapiens, it was not necessary to include a 3rd control that would essentially function as a negative control.

Table 1 lists the differentially expressed genes between the two comparison groups. All of the upregulated genes, bar two are Y-linked, which is intuitive. Given that females do not have a Y chromosome, it is expected that all the Y-linked genes would be upregulated when compared with a female comparison group. Therefore, attention was paid to the two genes that were not Y-linked: Lysine Demethylase 5D and CD99. The down regulated genes are all X inactivated, and again, given that we are comparing two groups with different sex genotypes, this is also in accordance with what we expect.

Differential Genes:

Upregulated Genes Downregulated genes
DEAD-box helicase 3, Y-linked X inactive specific transcript (non-protein coding) – 214218_s_at
ubiquitin specific peptidase 9, Y-linked X inactive specific transcript (non-protein coding) – 221728_x_at
ribosomal protein S4, Y-linked 1 X inactive specific transcript (non-protein coding) – 224589_at
eukaryotic translation initiation factor 1A,


X inactive specific transcript (non-protein coding) – 224588_at
neuroligin 4, Y-linked X inactive specific transcript (non-protein coding) – 227671_at
lysine demethylase 5D X inactive specific transcript (non-protein coding) – 224590_at
ubiquitin specific peptidase 9, Y-linked
CD99 molecule (Xg blood group)

Table 1. List of Differentially Expressed Genes


Figure 2. Data set heatmap with hierarchical clustering of rows and columns of all 14 significantly differentially expressed genes.

The cluster analysis heatmap, generated by AltAnalyze, contains all fourteen differentially regulated genes. The genes for the CD99 molecule and lysine demethylase break the expected pattern of X linked genes being down regulated in men and up regulated in women and y linked genes being up regulated in men and down regulated in women.

Principle Component Analysis:



Figure 3. 3D PCA

All the variance between samples can be attributed to whether the patient was male or female. The male and female samples clustered together into two groups. With a PC1 of 90.7 percent, the pathway analysis is significant. There is a clear separation between male and female samples, with the exception of a couple outliers.

Pathway Analysis:


Figure 4. Genes involved in cell adhesion molecules.

DAVID pathway analysis shows that CD99 is involved in cell adhesion between endothelial cells and between leukocyte and endothelial cells. The adhesion between leukocyte and endothelial cells is important in the transendothelial migration of leukocyte cells, as the second pathway diagram shows.


Figure 5. Genes involved in Leukocyte Transendothelial Migration

In endothelial cells, CD99 is expressed on the surface and plays an important role in Leukocyte migration. The forward migration of leukocyte cells between the junctions of endothelial cells (diapedesis) has been shown to be regulated by CD99 in conjunction with other molecules (PECAM-1). CD99 functions solely in diapedesis and plays no earlier role in endothelial to leukocyte adhesion during leukocyte migration (4). CD99 was shown by Watson et al. to have a role in LBRC recycling and leukocyte transendothelial migration. CD99 is found in at the endothelial border in an association with sAC, the scaffolding protein ezrin, and protein kinase A (PKA). Engagement of CD99 causes activation of Sac and release of cAMP.  PKA is activated, followed by LBRC recycling and leukocyte transendothelial migration (1).


Figure 6. Global Pathway


While fourteen genes were identified as being differentially regulated between men and women, twelve of the fourteen were either x or y linked. We chose to only consider CD99 because it is encoded by the MIC2 gene located in the pseudoautosomal region on the X and Y chromosomes, it has also been correlated with many types of cancer including: Ewing Sarcoma, lymphoblastic lymphoma/leukemia, and malignant gliomas (1).

According to our results, CD99 is upregulated in men, which is concurrent with previous studies (5). It has been shown to be highly expressed in brains with malignant glioma and is involved with the migration of tumor cells (6). Statistics show that brain cancer is more common in men than in women. According to the National Cancer Institute, from 2011 to 2015, the number of new cases of brain and other nervous system cancers per 100,000 people was 7.5 for men and 5.4 for women (7).

Further research could shine light on the link between expression of CD99 in the brain and the development of cancerous brain tumors, as well as the mechanism of CD99 upregulation in men.



  1. Schenkel, A., Mamdouh, Z., Chen, X., Liebman, R. and Muller, W. (2002). CD99 plays a major role in the migration of monocytes through endothelial junctions.
  2. Pasello, M., Manara, M.C. & Scotlandi, K. J. 2018. CD99 at the crossroads of  physiology and pathology. Cell Commun. 12: 55-68. doi:10.3390/genes9030159.
  3. Lu, T., L. Aron, J. Zullo, Y. Pan, H. Kim, Y. Chen, T. Yang, H. Kim, D. Drake, X.S. Liu, D.A. Bennett, M.P. Colaiacovo, and B.A. Yanker. 2014. REST and stress resistance in ageing and Alzheimer’s disease. Nature 507: 448-454. doi:10.1038/nature13163
  4. (2015). CD99: An endothelial passport for leukocytes. The Journal of experimental medicine, 212(7), 977.
  5. Lefevre, N., F. Corazza, J. Duchateau, J. Desir, & G. Casimir. 2012. Sex differences in inflammatory cytokines and CD99 expression following in vitro lipopolysaccharide stimulation. Shock 48(1): 37-42. doi: 10.1097/SHK.0b013e3182571e46
  6. Seoul, H., Chang, J., Yamamoto, J., Romagnuolo, R., Suh, Y., Weeks, A., Agnihotri, H., Smith, C., Rtka, J. 2012. Overexpression of CD99 Increases the Migration and Invasiveness of Human Malignant Glioma Cells. Genes Cancer. 3(9-10): 535–549. doi:10.1177/1947601912473603.
  7. “Cancer Stat Facts: Brain and Other Nervous System Cancer.” NCI: Surveillance, Epidemiology, and End Results Program,

RNA-Seq Analysis of Drug Treated Mycobacterium tuberculosis

BIOL 6150- Group 1

Aditya Devarakonda, Mansi Gupta, Owen Hale, Kirti Karunakaran and Shrinkhla Sharma


The advent of RNA sequencing has allowed for snapshots of real time metabolic activity to be taken and analyzed using the amount of transcripts produced as a proxy for gene expression. This allows for organisms to be investigated quickly under a multitude of different conditions and gauge the different effects these conditions have of them. For our analysis, we selected a two sets of data from a study involving the effects of various antibiotics on Mycobacterium tuberculosis. For our analysis in particular, we have focused on the differences between the antibiotics Ciprofloxacin (CIP) and Ethambutol (EMB), both of which work by inhibiting growth.

Mycobacterium tuberculosis (MTB) is a bacteria responsible for the infection known as Tuberculosis in humans. The infection generally begins in a latent form in the lungs that later activates and causes respiratory distress in the form of hemoptysis and a productive cough. While is generally treated by an antibiotic or a cocktail including a few, many who are infected are still unable to recover once the infection is in its active state. Many strains of this bacteria are also resistant to certain antibiotics, thus making them especially difficult to eradicate from the lung. One of the most common antibiotics administered for treatment of Tuberculosis is the bactericidal medication known as Ciprofloxacin. It is commonly administered in patients who have infections which have become resistant to antibiotics. Another bacteriostatic medication which is used with less resistant forms of the infection is Ethambutol. Ethambutol is typically used in conjunction with other antibiotics to help eradicate the infection.

The data for analysis was obtained from the dataset GSE107381 ( which contained data regarding RNA expression of MTB with multiple different antibiotic treatments at time points from 0, 4, and 24 hour periods. Being that the Ciprofloxacin is bactericidal, a substance that kills bacteria, and Ethambutol is bacteriostatic, an agent that prevents bacteria from reproducing, the goal of our analysis is to determine the type of response each of these treatments illicit from MTB, and how they contrast in the effects they have on the microbe.


Method Workflow

Analysis of Raw Data and Differential Expression Analysis

In order to first obtain a count table of the data, we ran the raw RNA sequencing data through Salmon using the tool through bioconda. This analysis resulted in a count table for the expression of the genes in MTB during the control condition, Ethambutol after 4 hours, and Ciprofloxacin after 4 hours. After completing this step, we were able to use the DeSEQ2 tool found on to find a list of all of the genes which were differentially expressed from the control treatment in both of the antibiotic treatment conditions. Our cutoff for a gene to be classified as differentially expressed was a fold change threshold of 1.5 and an FDR of 0.05 using the Benajmini Hochberg adjustment. Based on the fold change, we are able to determine which genes are upregulated and which genes are down regulated (Table 1).

Heat Mapping and Hierarchical Clustering:

Hierarchical clustering is a method in which different genes can be grouped based on the level at which they are up or down regulated across the experimental groups. With the set of count data obtained from the Salmon tool performed previously, we were able to perform a hierarchical clustering analysis of both the expression in the Ethambutol and Ciprofloxacin groups. The tool we used to perform the clustering and produce the heat maps was found on Clustering was performed via the average linkage method using the Euclidean distance measurement method.

Principal Component Analysis (PCA) Clustering:

For our Principal Component Analysis (PCA), we took the differentially expressed genes found earlier in the Ethambutol and Ciprofloxacin treatment groups and plotted the on a two-dimensional graph to see if there were any visual groupings present between the control and experimental groups for each separate treatment. The same fold change threshold of 1.5 and false discovery rate of 0.05 was applied to each group in this analysis as well. We were able to perform this analysis through the use of the use of ClustVis ( The matrix of differentially expressed genes was uploaded as a text file for each of the experimental groups.

 Pathway Analysis:

We performed our pathway analysis using DAVID ( After uploading the list of differentially expressed genes of both treatments from our DESeq2 output dataset, we were able to perform Kegg pathway analysis to see which differentially expressed genes were implicated in certain pathways, as well as what effects these pathways had on the ability of the cell to function. We are also able to compare the pathways implicated in the Ethambutol and Ciprofloxacin treatment groups.

Results and Discussion:

Understanding gene expression patterns has allowed us to better understanding the mechanisms in many different organisms. In this project, we use gene expression to better understand how the drugs Ciprofloxacin(CIP) and Ethambutol(EMB) interacts in Mycobacterium tuberculosis. The genome of Mycobacterium tuberculosis contains a total of 4113 genes. Through our analysis of the raw data set from the we were able to find a total of 1234 genes which were upregulated in the Ciprofloxacin(CIP) treatment group and 1024 genes which were upregulated in the Ethambutol(EMB) treatment group (FDR ≤ 0.05, FCR ≥ ±1.5) as seen in Table 1. Of the 1234 upregulated genes in the CIP treatment group, 688 were upregulated and 546 were downregulated. In the EMB treatment group, 422 of the genes were upregulated while 602 genes were downregulated.

Table 1: The number of up regulated and down regulated differentially expressed genes in Mycobacterium tuberculosis.

After determining which genes were differentially expressed, we analyzed the results through clustering. Figure 1 is a visual depictions of our hierarchical clustering data. This heat map shows the differential expression between the control data points and the experimental groups for the differentially expressed genes in each of the data sets. There are apparent groups of genes which are strongly differentially expressed between each of the control groups as well as the treatment groups for both treatments. While when compared to the control group, we see that generally the same gene clusters are upregulated and downregulated respectively in both of the experimental groups, we see that generally the magnitude of the up or down regulation of the gene cluster is greater in the Ciprofloxacin experimental groups. Another interesting point to note through the analysis of the data is that while the data obtained from the three experimental trials of each treatment appears to be fairly consistent across replicates, the data resultant from the third replicate of the control data set (denoted by CTRL3) appears to be showing a varied expression profile when compared to the previous two control datasets. To a lesser degree, the first replicate of the EMB (denoted by EMB1) treatment also appears to be less consistent with the results of the second and third replicates. This may be indicative of some error in the experimental processes or data collections methods of the study. This also has an impact on the results of the hierarchical clustering as the establishment of baseline expression values is computed using all of the values of a give row. Having one control group skew the data may also impact our ability to visualize the true nature and extent of the differential expression between the control and experimental treatment groups.

Figure 1: Heat map representing the hierarchical clustering data of the control (CTRL), Ciprofloxacin(CIP), and Ethambutol(EMB) in Mycobacterium tuberculosis.

After examining out data with clustering methods, we then look at our results through principal component analysis which is used to explain differences and variances between the different treatment groups. Table 2 shows the percentage of variance captured by each of the principal components from component 1 to 4. These results are quite significant as the first principal component captures about 52% of the variance and the second principal component captures approximately 39% of the variance in the genes for both the CIP and EMB treatment groups. This is interesting to see as we see similar patterns in the heat map above in terms of the differential expression relative to each of these groups when compared to the control. Through the first three principal components, 94-95% of the total variance of the genes is captured for both groups respectively.

Table 2: Principal component analysis was completed and determined the percentage of variance captured for the different treatments.

To better visualize the percentage of variance between the different treatments, scatter plots were created. Figures 2 and 3 are two dimensional representations of the first and second principal components. We are able to see very defined clustering on control replicates 1 and two (CTRL1 and CTRL2) in both of the plots with almost near perfect overlap. The third control replicate (CTRL3) is very far removed from the other two via both the first and second principal components. This corresponds with what we see in the heat mapping data above as the expression profile of CTRL3 was quite different from the initial two controls. We have far more defined groups with the treatment groups in both charts. While the Ciprofloxacin treatment replicates are focused very tightly, we see a slightly broader grouping of the Ethambutol treatment replicates. These results also correspond well to what we visualized in our heat map.

Figures 2 and 3: Scatter plots representing the principal component analysis.

Using the differentially expressed gene data, pathway analysis was completed to see if there were any differences or similarities between the Ciprofloxacin(CIP) and Ethambutol(EMB). Figure 4 shows the Selenocompound Metabolism pathway. In both the EMB and CIP treatments, the cystathione beta lyase (CBL) and sulfate adenylyltransferase ( genes are differentially expressed. CBL is responsible for a reaction which produces a precursor to S-adenosylmethionine which is an important component of DNA replication. The other differentially expressed gene in this pathway, Sulfate adenylyltransferase, plays a role in tRNA synthesis and translational regulation. These two genes within this pathway seem to be upregulated as a mechanism to counteract the general responses of both EMB and CIP. Ciprofloxacin operates by inhibiting DNA gyrase and halting DNA replication which prevents cell division and ultimately leads to cell death as breakage of double stranded DNA in bacteria leads to cell death. The response of the cell to upregulate genes involved in replication may be a means for the DNA to form new, uninhibited gyrases, and translational genes may be upregulated in order to produce these new proteins. Ethambutol functions by preventing bacterial cell wall formation and allowing them to be more permeable to external sources. The upregulation of translational genes may be a method of increasing the protein synthesis by the cell to be able to continue to form a cell wall.

Figure 4: The Selenocompound Metabolism pathway seen in Mycobacterium tuberculosis.

In addition to the Selenocompound Metabolism Pathway another pathway was also seen, but only in Ciprofloxacin(CIP). Figure 5 represents the purine metabolism pathway, and the genes denoted with stars are the genes which are upregulated in the CIP treatment groups. These genes were not implicated in the ethambutol treatment, although many of the same genes were differentially expressed in both groups with some varying differences in magnitude. This could be indicative of the gene expression differences between bactericidal and bacteriostatic antibiotics. As the genes in the cycle regulate the formation of nucleotides, they are implicated in DNA replication which Ciprofloxacin directly targets. Also being specifically purines, we see that this pathway can also be a part of ATP production in order to power various pathway response to counter (and potentially break down or pump out Ciprofloxacin) or as energy to induce apoptosis.

Figure 5: The purine metabolism pathway seen in the Ciprofloxacin treatment. The red stars indicate genes that were differentially expressed.


Overall, through the various methods of analysis, we see that the mechanisms of response to both the bactericidal Ciprofloxacin and the bacteriostatic Ethambutol are very similar. Both have a similar number of differentially expressed genes which was observed through the Salmon and DeSEQ2 analysis. We also see that through the heat maps, the genes which are differentially expressed in each treatment group relative to the control are similar, albeit they differ more strongly in the CIP treatment groups than in the EMB treatment groups (See Figure 6 for detail). The Principal Component Analysis for the two treatments are also very similar, and both treatments are defined clusters distinct from the control. The main difference we see between the two treatments comes in the form of the pathway activation. The purine metabolism pathway was differentially expressed in the CIP group but not in the EMB. This difference may play a key role in the difference between the bacteriostatic and bactericidal effects respectively. Overall from this, we can see that cell response to these two antibiotics, irrespective of mechanism, are incredibly similar. We can attribute this to the idea that when the cells are placed into a state of stress, they activate many non-specific gene pathways and not as many specific gene pathways to deal with said stress. Other studies substantiate this claim by showing that in response to stress, bacteria are able to upregulate the genes responsible for producing the target of an antibiotic, but also activate hundreds of other genes in cascades to launch a large scale response. While the original study found an upregulation of the SOS response genes, we do not see this same general upregulation through our pathway analysis. This may be a result of different gene filtering values or differences in the analytical pipeline that we used compared to what was performed in the study.  Further analysis of the data could lead to a more specific “transcriptomic fingerprint” for each of the antibiotic treatments. As a modification to future analysis with this data set, it may be worth analyzing the differences in the data in the third control group (CTRL3) compared to the other two controls to see if there was some analytical error in data processing or data collection. As many of the analysis methods above use all of the groups included in the data set to establish a baseline, error in data from one group could have a compounding effect on the whole analysis.

Venn Diagram

Figure 6: The Venn diagram above shows the number of differentially expressed genes which are unique to each treatment and the genes which are differentially expressed in both treatments.


  1. Afgan, et al.The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update, Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W537–W544, doi:10.1093/nar/gky379
  2. Babicki, S., Arndt, D., Marcu, A., Liang, Y., Grant, J,. Maciejewski, A. and Wishart, D. Heatmapper: web-enabled heat mapping for all. Nucleic Acids Res. 2016 May 17 (epub ahead of print). doi:10.1093/nar/gkw419
  3. Beggs, W. H. (1979). Ethambutol. Mechanism of Action of Antibacterial Agents,43-66. doi:10.1007/978-3-642-46403-4_4
  4. Boot, M., Commandeur, S., Subudhi, A. K., Bahira, M., Smith, T. C., Abdallah, A. M., … Bitter, W. (2018). Accelerating Early Antituberculosis Drug Discovery by Creating Mycobacterial Indicator Strains That Predict Mode of Action. Antimicrobial Agents and Chemotherapy,62(7). doi:10.1128/aac.00083-18
  5. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.  [PubMed]
  6. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1-13.  [PubMed]
  7. Lebel, M. (1988). Ciprofloxacin: Chemistry, Mechanism of Action, Resistance, Antimicrobial Spectrum, Pharmacokinetics, Clinical Trials, and Adverse Reactions. Pharmacotherapy: The Journal of Human Pharmacology and Drug Therapy,8(1), 3-30. doi:10.1002/j.1875-9114.1988.tb04058.x
  8. Love, M., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. In Genome Biology, 15 (12). [doi:10.1186/s13059-014-0550-8][Link]
  9. Melnikow, E., Schoenfeld, C., Spehr, V., Warrass, R., Gunkel, N., Duszenko, M., . . . Ullrich, H. (2008). A compendium of antibiotic-induced transcription profiles reveals broad regulation of Pasteurella multocida virulence genes. Veterinary Microbiology,131(3-4), 277-292. doi:10.1016/j.vetmic.2008.03.007
  10. Metsalu, Tauno and Vilo, Jaak. Clustvis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap.Nucleic Acids Research, 43(W1):W566–W570, 2015. doi: 10.1093/nar/gkv468.
  11. Patro, R. (2017, April). Salmon (Version 10.2) [Computer software]. Retrieved November 4, 2018, from
  12. Rhee, K. Y., & Gardiner, D. F. (2004). Clinical Relevance of Bacteriostatic versus Bactericidal Activity in the Treatment of Gram-Positive Bacterial Infections. Clinical Infectious Diseases,39(5), 755-756. doi:10.1086/422881
  13. Sandhu, G. (2011). Tuberculosis: Current situation, challenges and overview of its control programs in India. Journal of Global Infectious Diseases,3(2), 143. doi:10.4103/0974-777x.81691

Differential Gene Expression Analysis in Non-Small Cell Lung Cancer

Group Members: Ramit Bharanikumar, Sachin Sarath Y Kothandaraman, Sachin Kumar, Zainab Arshad


Of the three main types of lung cancers, non-small cell lung cancer (NSCLC), small cell lung cancer and lung carcinoid tumor, NSCLC is reported to be the most common, constituting 85% of all lung cancer patients [1]. Due to the absence of clinical signs and symptoms during the early stages of this disease, it is normally diagnosed at an advanced stage resulting in high mortality [2]. Improvements in our understanding of key differencing in expression of gene and their functional significance have the potential to impact NSCLC diagnosis, prognostication and treatment.

RNA-seq analysis of differentially expressed genes across various cell types in tumor and adjacent normal samples from the same patient may reveal driver genes in development and progression of NSCLC [3]. This will ultimately aid clinicians in making more informed decisions regarding treatment of NSCLC, which may increase the survival rates of patients.

The aim of our study is to identify differentially expressed genes in the immature monocytic myeloid cells (IMMC), Epithelial cells (Epi), and Neutrophils (Neu) of NSCLC patient samples against adjacent normal lung tissue sample RNA-seq data. The study validates the published work of Li et al, where the analysis was integrated with alternative splicing data [4]. Further, we aim to identify genes that are commonly expressed in the above cells, that could potentially help identify key pathways in disease development and progression.


Dataset GSE68795: This is a RNA-seq dataset downloaded from GEO (Gene Expression Omnibus) database, which was generated by Durrans et al[5]. This dataset has a total of 17 samples, containing three immature monocytic myeloid cell (IMMC) samples, two epithelial cell (Epi) samples and two neutrophil (Neu) samples from lung cancer patients, as well as three IMMC samples, three Epi samples and four Neu samples from adjacent normal lung tissues.


Fig 1: Dataset GSE68795 – obtained from GEO database




Fig 2: Workflow of RNA-Seq analysis pipeline and software used

Differential Expression Analysis:


We used Salmon to perform transcript quantification of our RNAseq data. Salmon is a tool for quantifying the expression of transcripts using RNA-seq data. Salmon uses algorithms coupling the concept of quasi-mapping with a two-phase inference procedure [6] to provide accurate expression estimates.  The tool runs in two phases; indexing of the reference transcriptome and quantification of reads. For this study we use the ENSEMBL human reference transcriptome (version GRCh38/hg38). The sequenced reads were mapped to the indexed reference. Gene expression values in Transcripts per Million(TPM) for each of the 17 samples were obtained in this step.

Generating a Transcript ID  to Gene ID Mapping file:

We used a python script to create a Transcript ID to Gene ID mapping file using the gene annotation file for GRCh38 from ENSEMBL(gff3 format). This file was used in the following stage to obtain the individual gene expression data from the quantified transcript values obtained in Salmon.


DESeq2 was used to obtain the differentially expressed genes from the TPM values. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates[7]. The output files from Salmon, and the Gene ID to Transcript ID mapping file created were used as input for DESeq2. Gene-level count matrices for use with DESeq2 were generated by importing the quantification data using the tximport package. DESeq2 was then used to test the significance of differential expression of genes based on TPM. Genes with adjusted p-values < 0.01 were categorized as significant or differentially expressed gene. DESeq2 employs Benjamini-Hochberg procedure to controls false discovery rate. Genes with fold change (FC) > 2 were considered upregulated and those with FC < 0.5 were considered as downregulated.

Principal Component Analysis:

Principal Component Analysis was done on individual cell data sets using the normalized counts value as the parameter across the tumor and adjacent samples for the differentially expressed genes. DEGs with p-adjusted values < 0.01 were taken into consideration. The tool used for PCA was Clustvis [8]. The plots are formed across the principal components displaying the most variability. The plots for the Neutrophil, Epithelial and IMMC cells were formed by data through gene matrices, with gene ids serving as the row headers and normalized count values for each sample type serving as the column values.

The preprocessing of the data required the transformation of normalized count values to ln(X+1) in case of the minimum value in the data being 0. The row scaling method used was “unit variance” and the PCA method was “SVD with imputation”.


The heatmaps and hierarchical clustering was also generated using Clustvis [8]. The hierarchical clustering is generated using the pairwise Euclidean distance elimination method for each of the genes provided in the sample sets. This leads to the similar profiled genes being clustered together. This is also represented using the heatmap where the expression profiles for the tumor samples are supposed to be significantly different from the ones for the adjacent samples. The input used is the cumulative normalized count matrix for all the samples in a particular cell type.

Pathway Analysis:

For global pathway analysis we used Reactome[10] and local pathway analysis we used KEGG(Kyoto Encyclopedia of Genes and Genomes)[11]. The input for reactome 


Differentially Expressed Genes:

On determining the list of differentially expressed genes between the normal adjacent and the tumor samples, out of the total 35874 genes, we found 879 genes as differentially expressed in neutrophils, 403 in epithelial cells and 529 in immature monocytic myeloid cells. Below are the upregulated and downregulated genes for each sample type.

Sample Type Differentially expressed Genes


Upregulated Genes

(FC > 2)

Downregulated Genes

(FC < 0.5)

Neutrophils 879 288 591
Epithelial cells 403 208 195
Immature monocytic myeloid cells 529 270 259

Table1: List of Differentially Expressed Genes from DESeq2


Fig 3: Common DE genes across samples

Since we had a stringent p-value of 0.01 to identify genes that are have a significant expression value, we were able to identify 5 genes that were common among the differentially expressed genes between the sample types. Among these common DE genes, 2 were consistently upregulated and 3 downregulated across the samples.

Heatmap and PCA:

The hierarchical clustering for Neutrophil data shows that fairly both the tumor samples are clustered together whereas the adjacent ones are closely related. There is a slight difference in expression profile for the 3rd adjacent sample which is reflected in the clustering chart. The outliner is defined by the separate clustering of a bunch of profiled genes in the specific adjacent cell sample.


Fig 4: Heatmap and hierarchical clustering for the Neutrophil samples

The results are similar in the Epithelial cell samples, where the tumor samples show a uniform expression profile and the adjacent samples follow suit.


Fig 5: Heatmap and hierarchical clustering for the Epithelial samples

IMMC cell samples observe a different behavior, with one of the adjacent and the tumor samples being a little further off on the expression profile and hence clustered hierarchically separately with more branches procuring on the profiled gene clustering. The samples are 3rd adjacent and 2nd tumor ones.


Fig 6: Heatmap and hierarchical clustering for the IMMC samples

PCA shows the trend confirmed by the heatmaps and hierarchical clustering. The samples for the adjacent cells for neutrophils, epithelial and IMMC have more close grouping and the tumor samples are differentiated on the principal component 2 axis, if so. The results of PCA for all three cell types show a genuine grouping on the tumor and adjacent sample sides. Also, the 2 principal components provide an approximate coverage of 85-93% leading to other PCs being statistically insignificant.


Fig 7,8,9: Principal Component Analysis plots for Neutrophil, Epithelial and IMMC samples  

The following table suggests the coverage provided by the 2 principal components across the 3 data sets.

Neutrophil Epithelial IMMC
PC1 84.8 87.7 77.6
PC2 5.7 5.1 7.2
Cumulative 90.5 92.8 84.8

Table 2: The principal component coverage analysis for each data set

Pathway Analysis:

For global pathway analysis we used Reactome[10] and local pathway analysis we used KEGG(Kyoto Encyclopedia of Genes and Genomes)[11].


  • Neutrophil Degranulation:



When we conducted literature review we identified that neutrophils have a prominent role in tumor angiogenesis. Tumor cells produce various cytokines and chemokines that attract leukocytes and they have been shown to contribute to tumor promotion. Neutrophils are among the first cells to arrive at sites of inflammation and promote tumor initiation and tumor growth[12].  One of the key mechanisms involves activation of the IL-8/CXCR2 pathway and subsequent recruitment of neutrophils and release of neutrophil elastase[13].


Neutrophils contain multiple proteases capable of modifying the ECM and NE contribute to local tumor invasion and angiogenesis. Cell-cell contact between TANs and tumor cells lead to signals back onto tumor cells, via JAK-STAT transduction, to synthesize and secrete additional VEGFα into the microenvironment.



                                                Cytokine-Cytokine receptor interaction


                                                 Screen Shot 2018-11-05 at 9.22.05 AM             

                                                          JAK-STAT Pathway


  •  Kinesin Molecular Motor Proteins in Cancer:



Kinesin superfamily proteins, or KIFs, are microtubule-dependent molecular motors whose movement is critical to a variety of cellular processes including mitosis, meiosis, and axonal transport. So far, 45 KIF genes grouped into 15 families have been identified, and at least twice as many proteins are thought to exist due to alternative splicing. Deregulation of kinesins promotes tumor growth by promoting out-of-control mitosis.Nearly all KIF family proteins have been implicated in tumor growth, and levels of several KIFs are altered in various cancers[14].

Screen Shot 2018-11-05 at 9.30.02 AM

Map of different Kinesins interactions

  • Surfactant Metabolism in Cancer:



Pulmonary surfactant is essential for life as it lines the alveoli to lower surface tension, thereby preventing atelectasis during breathing. Surfactant is enriched with a relatively unique phospholipid, termed dipalmitoylphosphatidylcholine, and four surfactant-associated proteins, SP-A, SP-B, SP-C, and SP-D. The hydrophobic proteins, SP-B and SP-C, together with dipalmitoylphosphatidylcholine, confer surface tension–lowering properties to the material. The more hydrophilic surfactant components, SP-A and SP-D, participate in pulmonary host defense and modify immune responses.Smoking is related to a making the surfactant proteins dysunctional leading to them not filtering out carcinogens and other infectious molecules in the air. Also alleles of the surfactants genes that lead to dysfunctional proteins are associated with cancer where their association of the surfactant protein B intron 4 variants and/or its flanking loci with mechanisms that may enhance lung cancer susceptibility.


On synthesis, pro-SFTPB quickly undergoes proteolytic cleavage by cysteine proteases in the endoplastic reticulum resulting in the synthesis and secretion of a 9-kD noncollagenous hydrophobic SFTPB, which is the functional mature form of SFTPB. Lung tumor cells (especially adenocarcinomas) have dysregulated SFTPB synthesis leading to the overexpression of pro-SFTPB with muted ability to post-translationally modify the precursor into the mature hydrophobic form[15].


  • Transcriptional regulation of white adipocytes:



Cancer-associated cachexia (CAC) is a complex, multifactorial syndrome that negatively impacts patient quality of life and prognosis.It is a disorder characterized by loss of body weight with specific losses of skeletal muscle and adipose tissue. Cachexia is driven by a metabolic changes, including elevated energy expenditure, excess catabolism and inflammation. Adipose tissue dysfunction can lead to CAC.

Screen Shot 2018-11-05 at 1.57.02 PM

                           Influence of microenvironment on adipose tissue leading to CAC


Cancer and its microenvironment influence white, beige, and brown adipose tissue. WAT undergoes lipolysis, which results in a direct loss of adipose tissue mass and also contributes to skeletal muscle mass loss. As the process progresses, skeletal muscle loss may act as positive feedback for further adipose tissue lipolysis. WAT may also undergo browning to undergo trans-differentiation to beige adipose tissue, expressing uncoupling protein 1 (UCP1) and thereby expending greater amounts of energy. Similarly, existing classical brown adipose tissue may be activated, resulting in greater UCP1 expression with a resulting increase in energy expenditure. Collectively, these changes result in a net-negative energy balance, which contributes to the development and progression of CAC[16].



In conclusion, our hypothesis upon conducting RNA-Seq analysis on the 17 samples by grouping them together as cancer and normal samples yielded key insights on how the differentially expressed genes between the two groups affects their phenotype. Some of the pathways altered because of the differential expression of the genes include but are not limited to – neutrophil degranulation, where through the JAK-STAT pathway and cytokine-cytokine receptor interaction neutrophils promoted tumor growth; deregulation of kinesins promotes tumor growth by promoting out-of-control mitosis;alleles of the surfactants genes that lead to dysfunctional proteins that enhanced lung cancer susceptibility.The modified surfactant protein B serves as a metabolite biomarker has already been tested to screen for lung cancer;finally modified transcriptional regulation of adipose differentiation leads to increased Cancer Associated Cachexia.



  1. Santarpia M, González-Cao M, Viteri S, Karachaliou N, Altavilla G, Rosell R. Programmed cell death protein-1/programmed cell death ligand-1 pathway inhibition and predictive biomarkers: Understanding transforming growth factor-beta role. Transl Lung Cancer Res. 2015;4:728–742.
  2. Non-small Cell Lung Cancer Collaborative Group. Chemotherapy in non-small cell lung cancer: a meta-analysis using updated data on individual patients from 52 randomised clinical trials. BMJ1995;311:899-909
  3. Li Y, Xiao X, Ji X, Liu B, Amos CI. RNA-seq analysis of lung adenocarcinomas reveals different gene expression profiles between smoking and nonsmoking patients. Tumour Biol. 2015;36(11):8993-9003.
  4. He X, Wang J, Li Y. Efficacy and safety of docetaxel for advanced non-small-cell lung cancer: A meta-analysis of Phase III randomized controlled trials. Onco Targets Ther. 2015;8:2023–2031. doi: 10.2147/OTT.S85648.
  5. Li Z, Zhao K, Tian H. Integrated analysis of differential expression and alternative splicing of non-small cell lung cancer based on RNA sequencing. Oncol Lett. 2017;14(2):1519-1525.
  6. Durrans A, Gao D, Gupta R, Fischer KR et al. Identification of Reprogrammed Myeloid Cell Transcriptomes in NSCLC. PLoS One 2015;10(6):e0129123
  7. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods.
  8. Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi: 10.1186/s13059-014-0550-8.
  9. Metsalu T, Vilo J. ClustVis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap. Nucleic Acids Res. 2015;43(W1):W566-70.
  10. Haw R, Wu G, Milacic M, et al. REACTOME KNOWLEDGEBASE: A PLATFORM FOR PATHWAY AND NETWORK ANALYSIS[C]// International Meeting of the. 2014:45-46.
  11. Kanehisa M, Goto S (2000).“KEGG: Kyoto Encyclopedia of Genes and Genomes”Nucleic Acids Res. 28(1): 27–30. 10.1093/nar/28.1.27
  12. Hagerling C, Werb Z. Neutrophils: Critical components in experimental animal models of cancer. Semin Immunol. 2016;28(2):197-204.
  13. Gong, L. et al. Promoting effect of neutrophils on lung tumorigenesis is mediated by CXCR2 and neutrophil elastase. Mol. Cancer 12, 154 (2013).
  14. Yu, Y. & Feng, Y. M. The role of kinesin family proteins in tumorigenesis and progression: potential biomarkers and molecular targets for cancer therapy. Cancer 116, 5150–5160 (2010).
  15. Sin DD, Tammemagi CM, Lam S, et al. Pro-surfactant protein B as a biomarker for lung cancer prediction. J Clin Oncol. 2013;31(36):4536-43.
  16. Vaitkus JA, Celi FS. The role of adipose tissue in cancer-associated cachexia. Exp Biol Med (Maywood). 2016;242(5):473-481.


Group 5: Investigating the role of BMI1 and miR-221 in Human Prostate Cancer


In the era of modern high-throughput sequencing, RNA Sequencing (also, referred to as whole exome sequencing) is turning out to be a powerful tool for gene expression analysis. It involves studying the total complement of RNAs and quantifying the transcripts present in a particular cell type. The other widely used method for studying gene expression profiles is microarray [1].


Prostate cancer is a disease of the prostate, a walnut-sized gland in the male reproductive system. It is the second most frequently diagnosed cancer. One of the main reason for overtreatment of this disease is the lack of reliable biomarkers to distinguish between types of prostate cancer. According to TCGA (The Cancer Genome Atlas), it is a highly heterogeneous disease as about 26 % of the total cases cannot be characterized to a molecular subtype [2]. The subtype of prostate cancer that was studied by Zho et alwas Castrate-resistant prostate cancer (CRPC), which keeps on growing even when the testosterone levels are reduced i.e. they are not dependent on testosterone for growth.

Our goal is to investigate two different sets of prostate cancer which respond differently to different treatment measures. Our goal is to analyze the gene expressions in two different studies on different types of prostate cancer cells and try to find out if there are any genes which have been upregulated or downregulated in both the studies suggesting a possible relation between pathways.


For this purpose, we are going to study two datasets,

  1. Microarray dataset based on MiR-221 mediated gene expression in human PCa cell (4 samples):
  2. RNA-Seq datasets which are based on Castration-Resistant Prostate Cancer Cells and BMI1 and RNF2 knockdown:

The first dataset is taken from a study which analyzed mRNA expression profiles in miR-221 expressing prostate cancer cells and identified that besides the down-regulation of oncogenic genes PMEPA1 and PRUNE an increased expression of genes, known to be associated with cell exposure to interferons is observed. It was found that miR-221 regulates invasiveness and apoptosis in prostate cancer via STAT1/STAT3-mediated activation of the JAK/STAT signaling pathway as miR-221 directly inhibits the expression of SOCS3 and IRF2, two oncogenes that negatively regulate this signaling pathway.

The second dataset is taken from a study of BMI1, a polycomb group (PcG) protein which is known to play an important role in epigenetic regulation of cell differentiation and cancer stem cell self-renewal. Polycomb complex protein BMI1 is expressed in humans by BMI1 gene on chromosome 10. In prostate cancer, BMI1 is upregulated in prostatic luminal epithelial cells. It is known that BMI, which is a key component of polycomb repressive complex 1 (PRC1) shows its oncogenic functions by enhancing the enzymatic activities of RING1B. In this study, the authors have found out that BMI1 regulates AR-signaling pathway independently of PRC1. Among the genes, down-regulated/up-regulated by BMI1 or RING1B knockdown, the level of AR-induced gene expression levels (PSA and TMPRSS2) is significantly decreased.


Screen Shot 2018-11-05 at 11.40.53 PMFig: Workflow of our analysis

Methodology for RNA-Seq:

  1. Dataset was accessed from GEO database and raw sequencing data was downloaded from Sequence Read Archive (SRA).
  2. We used GRCh38 transcriptome of human as a reference for mapping the sequencing reads in Kallisto and Salmon.
  3. We used both Salmon and Kallisto to quantify the reads obtained after RNA sequencing.
  4. Differential gene expression analysis was performed using DESeq2 in case of Salmon and in case of Kallisto we used Sleuth.
  5. We used R implementation in Sleuth to do clustering and do Principal Component Analysis.  

Methodology for Microarray:

  1. We loaded the microarray GEO dataset to Exatlas and used it to create the scoring matrix
  2. List of differentially expressed genes is created using this scoring matrix.
  3. Overlap analysis, Principal component analysis, clustering was performed.

Measures for Controlling False Positives

RNA-Seq Data Analysis:

We kept a check of differences in library sizes and library composition by using DESeq2 for differential gene expression analysis on the quantification obtained from Salmon. DESeq2 uses a scaling factor for each sample, it first takes the log of the expression value for each gene in all samples. This log function eliminates all genes that are only transcribed in one sample type and uses the geometric mean which helps smooth over outlier read counts. It also takes into account, the genes with more reads uses a median to downplay those genes and hence, putting more emphasis on moderately expressed genes. While using DESeq2 we should be careful of outliers, when there are only 2 samples per category. Furthermore, it uses the Benjamin Hochberg method to control false positives.

Sleuth, on the other hand, uses likelihood ratio test (LRT) which estimates the ratio of the 2 likelihoods for each transcript and produces a q-value (i.e. p-value adjusted for false discovery rate, FDR) and it is used as a measure for significance.

Microarray Data Analysis:

We used a false-discovery rate (FDR) threshold of 0.05 and a fold change threshold of 2. In Exatlas, FDR is viewed as an equivalent of a P-values. In addition to that, it tests simultaneously null-hypothesis for all genes. It is a middle ground between the p-value and Bonferroni correction.

Also, we had only two replicates for each sample in case of microarray data we conducted an overlap analysis

Analysis of Raw Data and Differential Expression Analysis

The cutoff of a gene to be classified as differentially expressed as a fold change threshold of 1.5 and a False Discovery Rate (FDR) of 0.05 using the Benjamini Hochberg adjustment. Based on the fold change, we are able to determine which genes are upregulated and which genes are down-regulated in the treatment samples.


Comparison with the Published Study:

Microarray data analysis

In the published study, 63 genes are down-regulated and 282 genes are up-regulated.

In our analysis using Exatlas, 272 genes are down-regulated and 384 genes are up-regulated.

When compared to the published results, there are 219 genes that are common in our analysis of up-regulated genes and 29 genes that are common among the down-regulated genes.

RNA-Seq Data Analysis

In the published results, there are 2,923 genes which are differentially expressed. However, when we calculate the number of differentially expressed genes using DESeq2 with a p-value less than 0.05, we discover around 400 genes. We further narrowed it down using log2 Fold Change, and we found only 115 differentially expressed genes.

Differentially Expressed Gene Analysis

RNA-seq analysis was done using BMI1 or RING1B knockdown of C4-2 cells data. The authors found 1,794 up-regulated genes and 1,897 down-regulated genes from BMI1 knockdown. However, when we performed the analysis using DESeq2, we found 115 differentially expressed genes, of which 9 were up-regulated and 106 were down-regulated. We also used Kallisto to find differentially expressed genes and discovered 139 genes are differentially expressed. Out of these 139 genes, 2 are up-regulated and 137 are down-regulated, which is consistent with the results obtained from DESeq2.

Here, we believe that the potential reason for obtaining a smaller number of genes in our analysis compared to the published results is because we have used more stringent criteria, applying a q-value parameter of less than 0.05.

For microarray data analysis, we used Exatlas, which uses a scoring matrix to find differentially expressed genes.


Fig: We discovered 30 common genes are differentially expressed in both RNA-seq and Microarray data analysis.

For microarray data analysis, We used Exatlas which uses a scoring matrix to find differentially expressed genes.

Heatmap and Hierarchical Clustering

Hierarchical Clustering: RNA-Seq Data

For the 139 differentially expressed genes that we have obtained after DESeq2 analysis, we retrieved the data for transcript counts (TPM) from Kallisto analysis for the treatment and control groups to perform clustering and generating the heat map.

Fig: Heatmap of significantly expressed genes obtained via RNA-Seq

Heatmap: Microarray

We clustered the samples and generated the heat map for the 656 differentially expressed genes using Exatlas.

heatmap2Fig: Heatmap of microarray dataset. Curly brackets show consistent behavior with control and sample.

Principal Component Analysis

After getting transcript quantifications using Kallisto, we performed downstream analysis, which included PCA analysis, using sleuth [7]


We found that more than 95% of the variance is captured by PC1 and PC2, and a clear difference can be observed between the control cluster and the knockdown clusters of BMI1 and RNF2.


Fig: PCA plots for Microarray dataset obtained by ExAtlas. ~96% of variance is captured by PC1 and PC2.

Pathway analysis

For pathway analysis, we used webgestalt (Web based Gene Set Analysis Toolkit) ( ) to find out the pathways affected by upregulated and downregulated genes. 

We used the KEGG and Reactome to analyze the pathways using the overrepresentation network analysis.

In the pathway analysis for microarray, the following categories were enriched positively:

  1. Cytokine Signaling in the Immune system
  2. Interferon alpha/beta/gamma signaling
  3. Cytokine-cytokine receptor interaction
  4. TNF signaling pathway

and, negatively:

  1. p53 signaling pathway
  2. JAK-STAT signaling pathway
  3. Proteoglycans in cancer
  4. PI3K-Akt signaling pathway
  5. Protein processing in endoplasmic reticulum
  6. MicroRNAs in Cancer

In the pathway analysis for the RNA-Seq dataset, the following categories were enriched negatively,

with respect to BM1 knockdown:

  1. Downregulation of ERBB4 signaling
  2. RNA Polymerase III Transcription, Abortive, Refractive and Termination
  3. Interferon gamma signaling
  4. RNA transport
  5. PI3K-Akt signaling pathway
  6. MicroRNAs in Cancer

and, with respect to RNF1 knockdown:

  1. PI3K-Akt signaling pathway
  2. MicroRNAs in cancer
  3. RNA transport


Fig: List of genes (RNA-Seq) involved in Biological Processes and their molecular function

fig2Fig: List of genes (RNA-Seq) involved in Biological Processes and their molecular function

prostate_highlighted.PNGFig: Prostate Cancer pathway analysis using KEGG for Microarray dataset and RNA-Seq.


Fig: JAK-STAT signaling pathway analysis using KEGG for Microarray dataset.

WhatsApp Image 2018-11-06 at 11.11.35 AM

Fig: Global pathway from Reactome highlighting cytokine signaling in the immune system

fig6Fig: Cytokine signaling pathway from Reactome

hsa04120 (1).png

Fig: WWP1 involved in the ubiquitin-mediated proteolysis.


We know Polycomb group proteins BMI1 and RING1 are abundantly expressed in Prostate cancer [8]. We also found similar upregulated behaviour of RNF1 from our microarray and RNA-seq analysis which is associated with multimeric polycomb group protein complex and associated with BMI1.  Zho et al. discovered role of BMI1 in prostate cancer independent of the polycomb repressive complex 1 via involvement in the Androgen receptor (AR) signaling pathway. Consequently, from our DESeq2 and Sleuth analysis of the RNASeq dataset we have found differentially expressed genes such as “GFR”, “P10”, “NKX3.1”, “AR”, GSK3″, “p21”, “CREB”, “TMPRSS2”, “CDKN1A”, “EGFR”, “GSK3B”, “CREB3L2” which are involved in Androgen receptor signaling. However, only “CREBRF” regulatory factor is differentially expressed in our microarray dataset and all other genes seem to not be differentially expressed in the microarray dataset. One of the potential reasons behind this could be similar expression in the microarray control and treatment samples. Apart from the AR signaling pathway, in both sets of studies, we saw the immune system pathway and interferon signaling seems to be a common pathway. According to the paper by Mohanty S. K. et al., STAT3 and STAT5 pathway can be targeted as a target for CRPC, we see JAK2 and STAT pathway coming up in the pathway analysis. From the significant list, we found mitochondrial genes, such as WWP1, have the potential to be a future biomarker for Castration Resistant Prostate Cancer (CRPC), which is involved in the ubiquitin-mediated proteolysis, as it was found to upregulated in CRPC cell samples.


  1. Kneitz B, Krebs M, Kalogirou C, Schubert M et al. Survival in patients with high-risk prostate cancer is predicted by miR-221, which regulates proliferation, apoptosis, and invasion of prostate cancer cells by inhibiting IRF2 and SOCS3. Cancer Res 2014 May 1;74(9):2591-603. PMID: 24607843
  2. Zhu S, Zhao D, Yan L, Jiang W et al. BMI1 regulates androgen receptor in prostate cancer independently of the polycomb repressive complex 1. Nat Commun 2018 Feb 5;9(1):500. PMID: 29402932
  3. Mohanty, S.K., Yagiz, K., Pradhan, D., Luthringer, D.J., Amin, M.B., Alkan, S. and Cinar, B., 2017. STAT3 and STAT5A are potential therapeutic targets in castration-resistant prostate cancer. Oncotarget, 8(49), p.85997.5689662
  4. Zhang, B., Kirov, S.A., Snoddy, J.R. (2005). WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res, 33(Web Server issue), W741-748.
  5. Patro, Rob, et al. “Salmon provides fast and bias-aware quantification of transcript expression.” Nature methods 14.4 (2017): 417.
  6. Bray, Nicolas L., et al. “Near-optimal probabilistic RNA-seq quantification.” Nature biotechnology 34.5 (2016): 525.
  7. Pimentel, Harold, et al. “Differential analysis of RNA-seq incorporating quantification uncertainty.” Nature methods 14.7 (2017): 687.
  8. van Leenders, Geert JLH, et al. “Polycomb-group oncogenes EZH2, BMI1, and RING1 are overexpressed in prostate cancer with adverse pathologic and clinical features.” European urology 52.2 (2007): 455-463.
  9. Wang, Lina, et al. “Effects of microRNA-221/222 on cell proliferation and apoptosis in prostate cancer cells.” Gene572.2 (2015): 252-258.


Differential gene expression of C.elegans treated with different diet


The goal of our project is to investigate how diet will affect the gene expression of C.elegans. Regulation of fat-content is an important physiological process. Free Fatty acids (FAs) have both functional and structural function which can be consider as bioactive components and play a role in metabolism. C. elegans is used as a model to examine the effects of several FAs on body fat accumulation. And it turns out that both omega-3 and omega-6 fatty acids induced a reduction of fat content in C. elegans, with linoleic(LNA), gamma-linolenic(GLA) and dihomo-gamma-linolenic acids(DGLA) being the most effective ones. LNA, GLA and DGLA are sequential metabolites in omega-6 PUFA synthesis pathway, and DGLA was found to be the primary one.Some researches shows that the life span of C.elegans, can impacted by reduction of nutrition uptake.  In our project, we analyze the genes of C.elegans which were fed with chemical inhibitor of glycolysis–2-deoxy-Dglucose (DOG). Since DOG cannot be metabolized by the glycolytic pathway and thus less glucose is available in the ATP production, therefore, those DOG-treated C.elegans are undergo the glucose restriction diet.  In a recent research conducted by Zarse [1] show that metabolic level is undergo remarkable change by the impairment of insulin/IGF1 signaling,  therefore, this may induce the life span extension of C.elegens. They also reveal that mitochondrial metabolic regime is shift to oxidative proline metabolism  after the reduction of insulin/IGF1 signaling, this may lead to the increase of antioxidant defense by elevating of  superoxide dismutase (SOD) and catalase(CTL) and then resulting in the extension of life span. Our goal is to explore the underlining molecular mechanism of DOG-driven extension by analyzing RNA-seq data from C.elegans with different ages. We then compare DOG-treated data with control to investigate the age effect in transcriptomice level of C.elegans.


First we collect to gene data sets from two different diet treatment group, GSE46344 for hight fat diet treated C. elegans while GSE102802 for DOG-treated C. elegans. Our study is accomplished by several Bioinformatics tools. Including Principal Component Analysis (PCA), Heat map, scatter-plot and pathway analysis.

PCA: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. Here, we divide RNA-seq data from C.elegans with different ages into different clusters by PCA. And then in each cluster, we can find these C.elegans have some same characteristics. Therefore, it is easy to give a prediction of age about some unknown RNA-seq.

Heat map: A heat map is a well-received approach to illustrate gene expression data. It is an impressive visual exhibit that addresses explosive amounts of NGS data. A colored matrix display represents the matrix of values as a grid; the number of columns is equal to the number of genes being analyzed, and the number of rows is equal to the number of chips. The boxes of the grid are colored according to the numerical value in the corresponding matrix cell. The map reveals patterns among the rows and columns and it’s easy to cluster these RNA-seq. hierarchical clustering is a good choice to find the association between patterns.

Scatter-plot: Here, we set the median as a benchmark to find how many genes are overexpressed or underexpressed. The significance criteria are that FRD is less than 0.05 and fold-change is less than two. For high fat diet, it is obvious that the number of significant underexpressed genes is more than significant overexpressed ones. For glucose, it’s inverse.

Pathway: In bioinformatics research, pathway analysis is used to identify related proteins within a pathway or building pathway de novo from the proteins of interest. In our project, we shows how gpx-2 gene is up-regulated in 20 days, which is helpful when studying differential expression of a gene in different period . By examining the changes in gene expression in a pathway, its biological causes can be explored.



High fat diet treated C. elegans data

After the analysis, we found that the peroxisomal beta oxidation is the main mechanism involved in the observation, that fatty acids could induced a reduction of fat content in C. elegans

Principal Component Analysis (PCA)
for file GSE102802-matrix



This figure shows the number of differentially expressed genes in day 1 compare with day 10, We can see there are 41 overexpressed genes and 24 underexpressed genes.

WeChat Image_20181111142028

WeChat Image_20181111142041

WeChat Image_20181111142052

Heat map for GSE102802




acox-1 codes acyl-CoA                         daf-22 codes ketoacyl-CoA thiolase



DOG-Treated C. elegans data

After the Analysis, the genes of the electron transport chain are heavily effected by the reduction of Glucose level in the glycolysis pathway. Those genes have been found associated with age regulation in many species.  For example, gene gpx-2 , which is responsible for the encode of  the antioxidant enzyme glutathione peroxidase (GPx), The down regulated of gpx-2 may lead to increase of free radicals. There are researches [2] show that the decreasing number of gene gpx-2 is positive associated with age increasing, therefore, this cannot shows that DOG-treated diet result in the down regulation of gpx-2 genes. Another gene that cyp-35x family are up-regulated in response to xenobiotics. This gene is down regulated in all the four age stages.  The genes encoding the antioxidant enzymes superoxide dismutase (sod-1) was compare in the DOG-treated worms and in the controls. The sod-1 gene is the major Superoxide dismutase in mitochondria. The oxidative phosphorylation (‘‘Oxphos’’) pathway plays an important role in life span extending effect of impaired glycolysis in C.elegans. In a recent paper, Schulz et al. [3] shows that this pathway can be activated by impairment of glycolysis pathway, followed by a temporarily increased production of reactive oxygen species (ROS). And then this lead to the activation of the molecular defense mechanism against ROS, and then increasing the life span.  Therefore, if  life expansion is determined by the efficiency of ROS and if improve removal of ROS can extend lifespan by DOG-treatment, we may supposed that DOG-treatment can influence the expression of sod-1 gene.  In our result , we found that sod-1 gene is up-regulated in day 20,  this means that DOG-treatment lead to the significant up-regulation of sod-1 gene and therefore increase the encoding of antioxidant enzymes and finally lead to the life expansion.







Figure : Change of gene expression


This figure shows the number of differentially expressed genes in day 1 compare with day 10. We can see there are 3347 overexpressed genes and 4248 underexpressed genes.

day 20

This figure shows the number of differentially expressed genes in day 1 compare with day 10. We can see there are 3617 overexpressed genes and 6728 underexpressed genes. There are much more underexpressed genes in day 20 compare with day 10.


Figure  : This figure shows that gpx-2 gene, which is responsible for the encoding of Glutathione peroxidase 2, is up-regulated in day 20. Gpx-2 genes is responsible for the encode of the antioxidant enzyme glutathione peroxidase (GPx) . The down regulated of gpx-2 may lead to increase of free radicals.


Figure: This figure shows that cyp-35A5 gene, which belongs to the cytochrome P450 family and associated with aging, is up-regulated in day 20.


Figure: This figure shows that the sod-1 gene, which is responsible for the encoding of   antioxidant enzymes superoxide dismutase, is up-regulated  in day 20.





The oxidative stress and oxidative damge can be reduced during aging by increasing uptake of the glutathione precursors glycine and cysteine since this can restore glutathione synthesis and concentration.  Therefore, life span may be elongated by  the reducing of oxidative stress.



In the scatter plot result, we analyzed the effect of three kind of fatty acid on gene expression, and there are 41 over expressed gene for DGLA, and 44 for LNA and 17 for GLA. So we can know that both DGLA and LNA have more significant influence on gene expression. Then we checked the over expressed gene, and we found, with LNA, the increase of expression was significant for these two gene, which can increase fat metabolism, and with LNA and DGLA, all the tested genes showed a significantly increased expression.

For the two gene, acox-1 and daf-22, on which all the fatty acids show significant influence, we can see that DGLA and LNA increased its expression more significantly. And it the pathway, we could see the role they play and how they increased the metabolism of fat, and therefore induce a reduction in body fat accumulation.


The number of DEG is not significant between DOG-treated and control until day 20. We found that the administration of DOG changes the set of hubs in the mTOR pathway: three genes (gpx-2, cyp-35A5, sod-1) become hubs under DOG-treatment, while two other genes (mop-25.1, ife-2) lose their hub status when standard diet is replaced by DOG-feeding. Those genes that up regulated in our project  is critical in decreasing the number of free radicals or have antioxidant properties. in our project, gpx-2 is up regulated which means it may reduce the number free radicals and therefore elongating lifespan.  

Most of the genes changing hub status under DOG-treatment have earlier been associated with effects on life span in C. elegans.

In the paper, researches found over 200 genes that are differentially expression between two groups. Since there are no enough data available for us to analysis, we are unable to totally replicate the result of the paper.



  1. Zarse K, Schmeisser S, Groth M, Priebe S, Beuster G, et al. (2012) Impaired
    insulin/igf1 signaling extends life span by promoting mitochondrial l-proline
    catabolism to induce a transient ros signal. Cell Metab 15: 451–65.
  2. Espinoza, S. E., Guo, H., Fedarko, N., DeZern, A., Fried, L. P., Xue, Q. L., … & Walston, J. D. (2008). Glutathione peroxidase enzyme activity in aging. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences63(5), 505-509.
  3. Schulz TJ, Zarse K, Voigt A, Urban N, Birringer M, et al. (2007) Glucose
    restriction extends caenorhabditis elegans life span by inducing mitochondrial
    respiration and increasing oxidative stress. Cell Metab 6: 280–293.
  4. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and
    quantifying mammalian transcriptomes by rna-seq. Nat Methods 5: 621–628.
  5. Robinson MD, Oshlack A (2010) A scaling normalization method for differential
    expression analysis of rna-seq data. Genome Biol 11: R25.
  6.  Anders S, Huber W (2010) Differential expression analysis for sequence count
    data. Genome Biology 11: R106.


Exploratory Genome Analysis of Cichlid’s Brain RNA-Seq during Bower Building


Cichlids are recently speciated and strikingly diverged species (Kocher 2004), among which sand-dwelling cichlid species have developed bower-building behavior. There are primarily two bower constructing behaviors: pit-digging (depressions in the sand) and castle-building (mounds in the sand). Notably, F1 hybrids of the castle-building Mchenga conophoros and the bower-building Copadichromis virginalis display both parental behaviors in sequence. For this project, we use whole brain gene expression (RNA-Seq) data from the F1 hybrids during the pit-building phase, castle-building phase, and data from when the fish is isolated and is not constructing bowers.


The raw RNA-seq data associated with our GEO dataset, GSE97490, includes SRR5440904, SRR5440905, SRR5440906, SRR5440907, SRR5440908, SRR5440909. The raw RNA-seq data was obtained using the command line sra-toolkit package. The data was converted to two fastq files per each sample again using sra-toolkit. Next, kallisto was used to quantify read files, or in other words, to count the abundance of each transcript (Bray 2016).

The abundance files created by kallisto were then sent to sleuth, a package in R that uses statistical tools to find the list of differentially expressed genes. Sleuth was used to find genes that are differentially expressed genes (q-value = 0.005) between all three phases of bower construction. The design and functioning of sleuth are described in Pimentel et al. (2017). Sleuth is remarkably effective at reducing the false discovery rate (FDR): according to Pimentel et al. (2017), “sleuth and voom found very few false positives, whereas other methods generated many (sleuth and voom are the only methods with a median of less than 5 false positives at all FDR ranges tested at both the gene and isoform level, whereas the next best method has 95”. (Full text:

To cluster the differentially expressed genes, we used Cluster 3.0. We chose hierarchical clustering with centroid linkage. To better represent the data visually, we normalized our data and centered it around the mean. Then, Treeview was used to visualize the clustering and expression of the differentially expressed genes

We used DAVID (Database for Annotation, Visualization and Integrated Discovery) to analyze our large set of genes through clustering (Huang 2009). We used the list to find the enriched pathways and we also clustered the enriched Gene Ontology GO terms. Using REVIGO, we created a Treemap of the enriched GO terms. Then the string database (string-db) was used to find gene-gene interaction networks which can give insight into the gene lists.

We later used Reactome to find the enriched pathways in the full list of differentially genes and sublists of interest that have similar expression patters. Reactome was preferred over DAVID’s pathway analysis for its robust format conversion.

Workflow (1)

Fig 1. Workflow: the set of software tools used and output files created. Software is represented in green and output files in blue.


Differential gene expression (q-value = 0.005)

Genes were considered differentially expressed if their q-values were below 0.005. THis produced a total of 715 differentially expressed genes. Q values are a better control for false positives than p values. Q values essentially are adjusted p values that reduce the false discovery rate (FDR). In our case, using a q-value of 0.005 means that 0.5% of significant tests will result in false positives.

Digging Building Isolated
Upregulated 314 269 367
Downregulated 387 333 176

Table 1. The number of genes up or down regulated during each phase of bower construction, only genes that are both up or down-regulated in both samples are counted.

Hierarchical Clustering

As expected, the samples were clustered by their treatment or phase during bower-building (i.e. building clustered with building, digging clustered with digging, and isolated clustered with isolated). Since our list of differentially expressed genes is 715 genes long, below we also present sub-segments of interest from the full heatmap.


Fig 2. Heatmap and hierarchical clustering of the full set of differentially expressed genes.


Fig 3. Differentially expressed genes where genes in the building and digging phase are upregulated, whereas the genes in isolated fish are down-regulated. Green = downregulated and Red = upregulated.



Fig 4. Differentially expressed genes where genes in the building phase have a different expression pattern compared to the digging phase. Green = downregulated and Red = upregulated.

Principal Component Analysis

pca_conditionFig 5A. The PCA of samples based on their conditions. As we can see, there is no good segregation between them and the clustering is not true to heatmap hierarchical clustering

Even though the PC1 is capturing 98% of the variance, the segregation of data points in the scatter plot (Fig 5A) is not good and not very informative. But we can see that the building phase is overlapping with itself.

PCA of Samples
Proportion of Variance (%) 98.07 1.84 0.09
Cummulative Proportion (%) 98.07 99.90 99.99

Table 2. The proportion of variance in the PCA of all 6 samples.


Fig 5B. The PCA of samples based on their transcript counts. These results are same as out heatmap hierarchical clustering

In Fig 5B., even though there is no clear separation between the data clusters, the centroids of the clusters are separated. Each phase of bower construction grouped with itself, but digging and isolated also grouped together. We attempted to identify the outlying genes, but unfortunately, data was not available for most of them. However, we did identify one outlier, calmodulin which is used in calcium-mediated processes. We don’t know why calmodulin has a different expression pattern than the other genes because calmodulin is abundant in the brain (“Calmodulin-binding proteins in brain” 2003), so it’s not an obscure molecule. This could be worth further investigation.

PCA of Genes
Proportion of Variance (%) 75.44 24.19 2.95
Cummulative Proportion (%) 75.44 99.62 99.92

Table 3. The proportion of variance in the PCA of all the genes.


Pathway Analysis:

The analysis of the DEG list in DAVID didn’t provide any enriched local pathways in KEGG. This was due to the fact that most of the DEGs were not identified by DAVID. So, we found the most enriched GO terms (p < 0.05) and analyzed them using REVIGO. REVIGO finds the higher GO terms for a set of given GO terms along with their p-values. Figures 6A and 6B show the treemaps based on enriched GO terms of biological process and cellular components.


Fig 6A. Treemap of enriched biological processes GO terms, produced by REVIGO. Sizes are proportional to the absolute value of log(p-value) and the colors are arbitrary.

cellular processes

Fig 6B. Treemap of enriched cellular components GO terms, produced by REVIGO. Sizes are proportional to the absolute value of log(p-value) and the colors are arbitrary.

The term of our interest are ‘Calcium ion regulated exocytosis of neurotransmitters’ in biological processes and ‘presynapse’ in cellular components. The other enriched GO terms were not considered for the analysis as we do not have any experimental results to corroborate their enrichment.

String Database:

string db no nodes.png

Fig 7. Gene-Gene interaction of all DEGs which are not disconnected nodes, produced by String-DB. The non-grey nodes are of enriched pathways. Red = Ribosome, Violet = Oocyte Meiosis and Green = Cell Cycle.

Analysis of DEGs in String Database yielded 3 enriched local pathways. They are Ribosome (FDR = 3.13E-08), Oocyte Meiosis (FDR = 0.00763) and Cell cycle (FDR = 0.0101). These seem to be generic pathways which were not very informative in explaining the aberrant behaviour of F1 hybrids. For further analysis, we analyzed the DEGs upregulated or downregulated in Digging phase and Building phase separately.

Reactome Analysis:

We used Reactome Analysis to seperately analyze the DEGs belonging only to Digging or Building phase. However, the DEGs of the digging phase were not recognized by Reactome.


Fig 8A. All the enriched pathways in the building phase

Analysis of DEGs of building phase gave us Pathways enriched in Metabolism of proteins, Vesicle-mediated transport, Cell cycle, and Neuronal System (p < 0.05). We further looked into the genes which were enriched in the neuronal system’s pathways.


Fig 8B. Enrichment of genes in the neural system.

The following genes were significantly upregulated in the building phase:

  1. Creatine kinase B-type (CKB) (q < 0.005): Energy transducer in the brain and other tissues.
  2. Kinesin-like protein (KIF1A) (q < 0.005): Motor for anterograde axonal transport of synaptic vesicle precursors

These revelations support that more neurotransmitters are being released and relatively more metabolic energy is being used in the building phase.


Distinct regulation patterns are observed in the gene expression clustering across different behavioral groups. Both hierarchical clustering and the PCA group the gene expression patterns of the digging and isolated phase together (Fig 5B). Combing with the fact that hybrid cichlids display digging phase prior to building phase, this could suggest the digging phase is closer to a “default” stage for male bower-building cichlids at the presence of females.

Many genes up-regulated in building phase compared to isolated and digging are involved in neuronal system pathways (Fig 8B). A biological explanation for this could be that the visual stimulus of a completed pit induces a neuro-signaling cascade, which leads to the activation of neural circuits causing a behavioral shift toward castle-building. The shift in neural circuits can be the cause of more expression of CKB, which is relevant is energy transduction in the brain.

We found enriched biological processes GO terms for “calcium ion regulated exocytosis of neurotransmitters” (Fig 6A).  We also found enriched cellular processes GO terms for “presynapse” (Fig 6B).  This is further evidence that there is a neurological explanation for bower construction.

Not only did we observe enrichment in genes involved in the nervous system, but also strangely, genes involved in gene transcription. Currently, we do not have an explanation for this enrichment.


The project confirmed that the bower-building cichlid behavior difference has a neuronal basis by analyzing the gene expression during the different stages of bower construction. This proposes a possible model for the behavioral transition in F1 hybrid of two closely-related and behaviorally-different species, and provides a list of candidate genes for further study in the bower-building mechanism.


  1. Kocher, T.D, Adaptive evolution and explosive speciation: the cichlid fish model. Nat. Rev. Genet, 5, 288-298 (2004).
  2. York, R. A., Patil, C., Hulsey, C. D., Streelman, J. T., & Fernald, R. D. (2015). Evolution of bower building in Lake Malawi cichlid fish: Phylogeny, morphology, and behavior. Frontiers in Ecology and Evolution, 3. doi:10.3389/fevo.2015.00018
  3. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525-527(2016), doi:10.1038/nbt.3519
  4. Pimentel, H., & McGee, W. (2018). Sleuth: Tools for investigating RNA-Seq. Retrieved November 5, 2018, from
  5. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1-13.
  6. Calmodulin-binding proteins in brain. (2003, March 21). Retrieved from