Analyzing differential expression using RNA-seq data
Genes store and control the flow of information in cells, but they cannot function on their own; proteins do most of the real work. Before any gene can be expressed and function, it must first be transcribed to mRNA. Analyzing RNA sequences and quantifying gene expression levels can therefore be very helpful in studying the regulatory mechanisms of gene expression.
Unlike microarray data, which gives us a robust yet relatively simple “Yes” or “No” answer, RNA-seq data can give us a “How much” answer. Moreover, because it relies on next-generation sequencing, RNA-seq can detect novel RNA sequences, which may lead directly to the discovery of new genes. And as sequencing depth increases, rare transcripts can be detected, making even lowly expressed genes visible to us.
How does RNA-seq work?
The first step of RNA-seq on most NGS platforms is cDNA library preparation. Controlling the sample preparation conditions at this step is crucial, since it largely determines how well the resulting RNA-seq data represents the real transcripts in the sample cells. Moreover, unlike small RNAs (miRNAs, piRNAs, siRNAs, etc.), which can be sequenced directly after adaptor ligation, larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies [2]. Different fragmentation methods introduce different biases at this stage.
Fig. 1 Flowchart of a typical RNA-seq experiment [1]
After amplification, the samples are sequenced and the data sets are generated. As with all high-throughput sequencing technologies, the output is huge, and data sets of this size present serious bioinformatics challenges. The efficiency of both the hardware and the software used to handle them becomes crucial. Any downstream result generated from a data set tends to have a size proportional to the original data. As for the core of the software, the algorithms used to analyze the data, the running time of each program becomes extremely sensitive to its time complexity. Besides time efficiency, memory efficiency is also a key measure of bioinformatics software. The output therefore needs to be organized in structures that safely store the key information while minimizing redundancy, and the programs need to be designed to produce high-confidence results efficiently.
Analyzing RNA-seq data
RNA-seq data generally correlates well with microarray data. Despite this general credibility, the bias in RNA-seq data is not negligible. When analyzing RNA-seq data, the most common source of mismatches between reads and the reference is sequencing error. The types of sequencing error depend on the sequencing platform, and so does the read length. Most RNA-seq methods generate short reads, which presents a huge challenge when those reads are mapped to the reference genome. Because the reads are so short, many of them become multi-location mappable reads (MMRs), and reliable methods are needed to filter out certain MMRs and assign each remaining read to a high-confidence location. Furthermore, since transcripts are typically tissue-specific or cell type-specific, mapping to a single reference genome may not be the ideal way to analyze the transcriptome of a specific cell type. Some research groups therefore use ab initio approaches to analyze the transcriptomes of specific cell types.
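One simple way to handle multi-location mappable reads can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular aligner: the input format (tuples of read ID, chromosome, position, alignment score), the function name resolve_multimapped, and the min_margin threshold are all hypothetical. A read is kept if it maps to a single location, or if its best alignment clearly beats the runner-up; otherwise it is discarded as ambiguous.

```python
from collections import defaultdict


def resolve_multimapped(alignments, min_margin=5):
    """Keep one confident location per read (illustrative sketch).

    alignments: iterable of (read_id, chrom, pos, score) tuples, where a
    higher score means a better alignment. This input format is an
    assumption made for the example.
    A read is kept if it has a single alignment, or if its best alignment
    beats the second best by at least `min_margin`; otherwise it is dropped.
    """
    by_read = defaultdict(list)
    for read_id, chrom, pos, score in alignments:
        by_read[read_id].append((score, chrom, pos))

    kept = {}
    for read_id, hits in by_read.items():
        hits.sort(reverse=True)  # best score first
        if len(hits) == 1 or hits[0][0] - hits[1][0] >= min_margin:
            _, chrom, pos = hits[0]
            kept[read_id] = (chrom, pos)
    return kept


# Made-up alignments: read1 is unique, read2 is an exact tie between two
# locations, read3 has one clearly better location.
alignments = [
    ("read1", "chr1", 100, 60),
    ("read2", "chr1", 500, 40),
    ("read2", "chr7", 900, 40),
    ("read3", "chr2", 250, 55),
    ("read3", "chr5", 730, 30),
]
resolved = resolve_multimapped(alignments)
print(resolved)
```

Here read2 is dropped because its two candidate locations are indistinguishable, while read1 and read3 each keep a single location. Discarding ambiguous reads is the most conservative choice; other strategies redistribute them across candidate locations instead.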
References
[1] S. Marguerat and J. Bähler, “RNA-seq: from technology to biology,” Cellular and Molecular Life Sciences, vol. 67, pp. 569-579, 2010
[2] Z. Wang, M. Gerstein, and M. Snyder, “RNA-seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57-63, 2009
Genetics joins hands with Metabolomics
Introduction to Metabolomics
Metabolomics is an emerging dimension of “omics” research focused on the study and analysis of small-molecule metabolites in biological systems, with the aim of providing insight into the mechanisms of human health and disease. Your metabolomic profile not only gives a functional readout of your current health but can also help predict your future disease risks. It is, essentially, a real-time window into your health and well-being.
Metabolomics is especially relevant to the early diagnosis of metabolic disorders such as diabetes and cardiac and renal diseases, because the body shows signs of acute metabolic disturbance years before the actual clinical symptoms surface.
As a field, metabolomics took off slowly in the early and mid 2000s and has grown exponentially since, alongside the growth of related broad, data-intensive fields such as genomics, proteomics, and transcriptomics.
The metabolome is highly diverse and volatile. Hence, the biggest challenge of a metabolomic study is to maintain consistency, reduce variation between subjects, and optimize information recovery. Several recent technological advancements in experimental design and instrumentation have helped in this effort.
Metabolomics also has the potential to be valuable as a predictor of treatment response and survival, thereby supporting personalized medicine based on metabolic profiling rather than on genomic or proteomic disease investigation strategies. Dr. Gerszten and his colleagues at Harvard made a remarkable breakthrough by identifying markers that flag people who are prone to diabetes more than a decade before any clinical manifestation appears.
According to Dr. Gerszten, the director of clinical and translational medicine in MGH’s cardiology division and an associate professor of medicine, metabolites are the most proximal reporters of any disease status or phenotype as they are downstream of genetic variation, transcriptional changes and post-translational modifications of proteins (or enzymes). They also capture the environment. Hence, the metabolome is as important as the human genome for understanding diseases and it is extremely crucial to understand the metabolic pathways.
Genetics meets Metabolomics
Metabolomics and genomics have both been evolving rapidly over the last few years. Studying the complex interactions that occur in biological systems means dealing with an enormous number of genes and proteins (or enzymes), which makes the process exhausting and tedious. The good news is that metabolomics gives access to the real end-point metabolites, which offer the easiest way to understand gene expression and function.
Genomic studies have demonstrated that individuals within a population vary a great deal, and much of this variation is accounted for by single point mutations, called Single Nucleotide Polymorphisms (SNPs), across their genomes. However, the complexity of the interactions between these mutations in related genes makes it possible to predict phenotypic variation in an individual only in a few selected cases, where a sizeable effect of the polymorphism can be explained. In a recent study, Professor Karsten Suhre and Dr. Christian Gieger of Helmholtz Zentrum München, in collaboration with colleagues from the Wellcome Trust Sanger Institute in the UK and King’s College London under the leadership of Nicole Soranzo, performed a genome-wide association study with metabolite data from a population of individuals. They discovered a significantly large number of associations between frequent SNPs and considerably differing metabolomic profiles, explaining up to 12% of the observed variance.
They compared the biochemical pathways of different enzymes with the metabolic profiles associated with genetic variants in the genes coding for those enzymes, and identified new genetic risk loci as markers for complex common diseases like type 2 diabetes mellitus. Thus, knowing an individual’s metabolic phenotype (metabotype) together with his or her genotypic profile can define new therapeutic approaches that are better tailored to individual variation and to lifestyle-based differences. These genetically determined metabotypes may underlie the risk for a certain medical phenotype, the response to a given drug treatment, or the reaction to a nutritional intervention or environmental challenge. This demonstrates the immense potential of combining genetics with metabolomics in opening new approaches to personalized medicine.
Gibson’s lab at Georgia Tech
We at Gibson’s lab are working on an integrative genomics project with a cohort of 155 individuals, in association with the Center for Health Discovery and Well Being and Emory Institute. As part of this project, we are conducting a Genome Wide Association Study (GWAS) with the metabolite data for this cohort. The main steps of our project are as follows:
Data collection: We have performed Fourier Transform Mass Spectrometry experiments on the individual blood samples at Emory to obtain their metabolite concentrations.
Data Normalization: We adopted an advanced normalization algorithm called Supervised Normalization of Microarrays (SNM) to normalize the data. In the normalization model, we fitted all the biological variables like Age, BMI, Ethnicity and Gender and adjusted for the effects of experiment run dates for the individuals.
Data Correlation and Clustering: We performed Principal Components Analysis and Hierarchical Clustering to check whether individuals cluster according to any of the biological characteristics of the samples. We observed significant clustering of individuals that did not entirely correlate with any of the biological variables.
GWAS: We have now begun cross-correlation analysis between the metabolite information and the genotype data for the same set of individuals. We expect to find significant SNP associations with metabolite abundance and to trace the biological relevance of those associations.
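To make the GWAS step more concrete, here is a minimal Python sketch of testing a single SNP against a single metabolite. It is not our actual pipeline: the additive genotype coding (0, 1, or 2 copies of the minor allele), the function name snp_metabolite_association, and the toy numbers are all assumptions made for illustration. It fits a simple least-squares line and reports the slope and the fraction of variance in metabolite abundance explained by the genotype.

```python
def snp_metabolite_association(genotypes, abundances):
    """Simple linear regression of metabolite abundance on genotype.

    genotypes: minor-allele counts per individual (0, 1 or 2); this
    additive coding is an assumption of the sketch.
    abundances: normalized metabolite abundance for the same individuals.
    Returns (slope, r_squared), where r_squared is the fraction of the
    variance in abundance explained by the genotype.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_m = sum(abundances) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((m - mean_m) ** 2 for m in abundances)
    sxy = sum((g - mean_g) * (m - mean_m)
              for g, m in zip(genotypes, abundances))
    if sxx == 0 or syy == 0:
        # No variation in genotype or abundance: nothing to associate.
        return 0.0, 0.0
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Made-up data for six individuals, for illustration only.
genotypes = [0, 0, 1, 1, 2, 2]
abundances = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r_squared = snp_metabolite_association(genotypes, abundances)
print(f"slope = {slope:.2f}, variance explained = {r_squared:.1%}")
```

A real analysis would repeat this for every SNP and metabolite pair, compute p-values, and correct for the very large number of tests, which this sketch leaves out.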
CISH, Infectious Disease and Science in Action
Our immune system is a wonderful thing. It affords us protection against an unimaginable range of pathogens, all by use of the impressive genetic acrobatics that underlie the diversity of expressed MHC proteins, T-cell receptors, and immunoglobulin molecules. In this context, immune gene expression heterogeneity is an extremely apparent source of personal variability in immune system function. But what other factors may influence a person’s immune system function? There are a large number of signal transduction pathways involved in immune response that might harbor genetic variants that influence a person’s immune system response to pathogens.

Interleukin-2 Signalling
Enter CISH, which stands for Cytokine-Inducible SRC Homology 2 domain-containing protein. One of the most consistently up-regulated genes in response to interleukin-2 (IL2) stimulation, CISH binds to activated IL2 receptors and prevents the activation of Signal Transducer and Activator of Transcription 5 (STAT5), a protein that promotes downstream cytokine signalling. In this way, CISH inhibits downstream cytokine signalling, effectively breaking the signal transduction cycle.
In a recent paper by Khor et al., researchers studied whether polymorphisms found near and within the CISH gene affect CISH expression in response to IL2 stimulation. In thousands of samples from cases and controls from populations in Africa and Asia dealing with infectious diseases (malaria, tuberculosis, and bacteremia), they studied the reactions of cultured peripheral blood mononuclear cells (PBMCs) to IL2 activation. After stimulating the PBMCs, the researchers measured the levels of CISH expressed and correlated that expression level with the CISH genotypes at each SNP and between different combinations of SNPs.
Their data analysis shows that subjects with certain alleles have markedly reduced expression of CISH, and that these “risk” alleles for lower expression are found at higher frequencies in subjects with a disease than in control subjects. This means that individuals with risk alleles at one or more of the SNP sites were at a significantly elevated risk for the diseases covered by this study. One risk allele conferred an 18% greater risk, while four or more risk alleles conferred an 81% elevated risk. That a single gene could convey differences in immune competency when dealing with relatively disparate diseases is a tantalizing result.
But why would lower levels of CISH result in greater risk of infection? What is the functional mechanism conferring this increased risk? Considering that CISH inhibits downstream cytokine signalling, shouldn’t decreased CISH expression result in a stronger immune response? Khor and coauthors suggest that the positive relationship between CISH and immune function may be due to mediation of the immune response, citing the negative effects of overreactive immune systems, or due to IL2’s induction of increased pathogen growth rates.
The wonderful thing about science is that everyone has a theory. After publication of the original article, further correspondence from other researchers trickled in and enriched the scientific conversation. One correspondent writes in to suggest that CISH reduces the number of regulatory T-cells, which allows for greater freedom in T-cell receptor variance and thus a more robust immune response. Another suggests that CISH-mediated conversion of naive CD4+ T-cells to T-helper 2 cells may be a functional mechanism underlying the observed results. Finally, a third group writes in to indicate that they believe Khor et al. have overlooked a serious flaw in their model: tuberculosis causes only a cell-mediated immune response, they say, while malaria causes a humoral immune response in addition to the cell-mediated response. Could the inconsistency at site -292 between other cohorts and the Gambian Tuberculosis cohort be signal and not noise?
The authors of the original study reply that even while including the Gambian cohort, which saw no significant difference in allele frequency at -292 between cases and controls, pooled analysis still indicates a significant difference in allele frequency between all cases and controls. This is a relatively convincing argument, though, as always, more data would settle the matter a little more firmly. The authors also discuss the two mechanistic theories suggested by correspondents and basically agree that CISH mediation of general immune competence is an area that merits further study.
One of the key elements of any scientific endeavor should be logical dialogue between parties to search for bias and interchange ideas. This paper, the correspondents’ responses and the authors’ replies really illustrate that point. One avenue of investigation not specifically covered by this study is the immune response to viral infections and whether viral diseases might show similar or different patterns of immune competence correlated with CISH expression levels. As always, one answer spawns countless questions, which is good news for graduate students everywhere.
Role of Transposable elements in causing Prostate cancer
Transposons are sequences of DNA that can move or transpose themselves to new positions within the genome of a single cell. The mechanism of transposition can be either “copy and paste” or “cut and paste”. Transposition can create phenotypically significant mutations and alter the cell’s genome size. Barbara McClintock’s discovery of these jumping genes early in her career earned her a Nobel prize in 1983.
Recent years have witnessed an increase in research on the detection of structural variants (SVs) and their association with human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a previously unimaginable point, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a class of SVs that is very important to the study of human evolution and disease. High-throughput sequencing data can, however, be used to discover both the loci and the classes of transposons inserted into genomes.
Our group came across one such algorithm, called VariationHunter, which is based on combinatorial algorithms. On the simulated test data used, it has been shown to discover >85% of transposon insertion events with a precision of >90%.
Watch the following video to know more about the algorithm and its implications : -
This is an initiative to understand the long-ignored part that transposons play in causing and advancing prostate cancer.
Team : -
Dr. I.K.Jordan, Kevin Lee, Deepak Purushotham
The Study of Phylogenetic Limiting Similarity: Do more closely-related protist species hold similar effects on their predators?
Introduction
As an undergraduate researcher, I work with Professor Jiang in the Biology department at Georgia Tech. Our research primarily focuses on phylogenetic community ecology. I first became interested in Prof. Jiang’s research through his engaging lectures in the Ecology class I am currently taking. When I asked to join his research team this semester, he offered me a topic that extends Darwin’s hypothesis of phylogenetic limiting similarity: to observe whether more closely related protist species have similar overall effects on their predators, for example on the predators’ population size.
Methods
In our experiment, eight protist species were grown with 6 replicates each: Tetrahymena pyriformis, Glaucoma scintillans, Colpidium kleini, Colpidium striatum, Paramecium aurelia, Paramecium caudatum, Paramecium multimicronucleatum, and Spirostomum teres. Prior to the actual experiment, these eight species were cultured separately on three bacterial species: Bacillus cereus, Bacillus subtilis, and Serratia marcescens. The microcosms were 250 mL glass jars, each filled with 100 mL of medium. The medium was prepared by dissolving 0.55 g of North Carolina protozoan pellet in 1 L of deionized water, autoclaving it, and then inoculating it with the bacterial assemblage. Two wheat seeds were added to each jar as a food source. Once the population density had stabilized in the stock cultures, 100 individuals of each protist species were transferred into the 48 glass jars to begin the experiment. The protists were grown alone for the first 3 weeks, so that the carrying capacity and growth rate of the prey could be determined before the predators were added. The predator used in the experiment was Stentor, a blue-green protist shaped like a large trumpet. This predator is known to be voracious, consuming everything in its way. From the stock cultures, 5 predators were isolated and transferred to each jar. To calculate the population density, approximately 0.3 mL of medium was drawn from each jar and diluted to count the number of prey and Stentor. For the predators, the growth rate, carrying capacity, body size, and population density will be calculated at the end of the study.
Results
As shown in Figure 1, the results show strong fluctuations in population density, especially in the protist populations. T. pyriformis and G. scintillans had the most strongly fluctuating population densities; these two species are well known for being “crazy,” because they can reproduce and grow exponentially within a short time. The population density of P. caudatum was the lowest throughout the experiment, although it showed some increase toward the end. There are some important trends in the fluctuations. Population density increased slightly between Oct-10 and Oct-11, because the old medium was replaced with new medium on Oct-10, providing fresh nutrients to the protists. Predators were added on Oct-11, and from Oct-12 onward there is something of a decrease in population density, as displayed in Figure 1. The largest decrease occurred in C. kleini. However, some species, G. scintillans and P. caudatum, showed a slight increase. The number of predators counted varies greatly, and the predator populations need more time to stabilize. The average population density of the predator species will be calculated differently.
Discussion
Since this research is still in progress, conclusions will be drawn at the end of this semester, when the data are more stable. At this time I therefore cannot offer a critical discussion of the findings, but I can suggest some ideas behind the trends shown in Figure 1. The large decrease in the population density of C. kleini may be due to overexploitation of the prey by the predator. This is supported by the fact that the highest number of predators was counted in the C. kleini media on Oct-27, the most recent counting date. The increase in population density in P. caudatum and G. scintillans may be due to the nutrient renewal from replacing the medium on Oct-11, right before the predators were added. For these species, nutrient availability may be a more important factor in growth than the presence of predators.
This study is very interesting, and as of now there is nothing to add to the media; I just need to keep counting the prey and predator species and replace the media on a weekly basis so that the prey have fresh nutrients to grow. Some questions that this study raises are how the time it takes the predators to drive each species extinct will differ, and how similar it will be among more closely related prey species. I also wonder whether C. kleini will succeed in growing back to its original population density, and how long its return time will be compared to other species that did not decrease as much. To answer these questions, more data is needed as the population density of the predators increases and stabilizes, because in the ~0.3g of sample media taken from each jar for counting, the number of predators is mostly zero, with only a couple of exceptions. However, the number of predators found per sample is gradually increasing, and I wonder how much further it can rise. When this number stabilizes, the carrying capacity, growth rate, body size, and population density of the predators will be determined, and the overall results will be discussed.
GeneTack Frameshift Detection
Ab Initio Frameshift Detection
When reading nucleotides, there are several possible reading frames, depending on how you group the nucleotides into codons. For example, an mRNA has three possible reading frames, each starting on a different nucleotide, which leads to three different possible codon sequences. Because of this phenomenon, overlapping genes with different reading frames can exist in the same sequence. An open reading frame (ORF) is a stretch of a reading frame that does not contain a stop codon. Insertions or deletions whose length is not a multiple of 3 cause frameshift mutations and displace the stop codons. Same-strand overlapping ORFs can either be caused by frameshifted genes producing multiple ORFs, or they can be true overlapping or adjacent genes. This blog entry summarizes a paper entitled “GeneTack: Frameshift Identification in Protein-Coding Sequences by the Viterbi Algorithm,” published in the Journal of Bioinformatics and Computational Biology (Vol. 8, No. 3).
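The idea of reading frames, and of a frameshift displacing the stop codon, can be illustrated with a short Python sketch. The toy sequence and the helper names codons_in_frame and first_stop are made up for this example and are not part of GeneTack.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_in_frame(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def first_stop(seq, frame):
    """Return the index of the first stop codon in the frame, or None."""
    for index, codon in enumerate(codons_in_frame(seq, frame)):
        if codon in STOP_CODONS:
            return index
    return None


# A toy gene: start codon, two sense codons, stop codon.
gene = "ATGGCCAAATGA"
for frame in range(3):
    print(frame, codons_in_frame(gene, frame))

# Deleting a single nucleotide (the C at index 4) shifts the frame, so the
# original stop codon is no longer read in frame 0.
mutant = gene[:4] + gene[5:]
print(first_stop(gene, 0), first_stop(mutant, 0))
```

In frame 0 the original sequence reads ATG GCC AAA TGA and ends at the stop codon, while the one-base deletion turns it into ATG GCA AAT followed by a dangling GA, so translation would run past the intended end of the gene.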
Introduction
Frameshifts found in protein-coding sequences can either be real (the result of mutation) or the result of a sequencing error. Traditionally, two groups of methods have been used to detect frameshifts of both kinds: comparative genomics and single-sequence (ab initio) methods. Comparative genomics methods, or similarity searches, search known protein databases for hits to the translation of the ORF, which means they cannot detect frameshifts in genes with no known homologs. Ab initio methods are not hindered by this limitation. The paper presents a new algorithm for intron-less nucleotide sequences, in particular those of prokaryotic genomes. Prokaryotic genomes are suitable for ab initio methods since they have one long, continuous ORF for each gene.
GeneTack is a program designed to run on DNA fragments in which all genes are located on the same strand. It uses a Hidden Markov Model and a dynamic programming algorithm known as the Viterbi Algorithm. To process actual sequence data, GeneTack-GM wraps GeneTack and uses GeneMarkS to parse whole genomes into fragments with collinear genes. The program predicts both natural and error-related frameshifts, although its predictions of natural frameshifts are not as accurate as those of other programs, since it does not use signaling sequence information.
GeneTack Algorithm
The GeneTack Algorithm can be broken into 3 steps:
- The algorithm takes in a fragment of the genomic sequence containing collinear genes in the direct strand.
- A probabilistic Hidden Markov Model is produced allowing for different scenarios.
- The Viterbi Algorithm is applied to the HMM to determine the maximum likelihood path.
In step 1, a frameshift may result in the prediction of two adjacent genes. GeneTack attempts to discriminate between correctly predicted adjacent genes and adjacent genes produced by a sequence error (i.e., a single gene split by a frameshift). A probabilistic Hidden Markov Model is constructed that allows for three scenarios: the presence of true overlapping genes, true non-overlapping adjacent genes, and adjacent genes predicted due to the presence of a frameshift. The HMM, shown in Figure 1, consists of 28 states divided into 4 groups.
- States 1, 2, and 3 emit protein coding sequences related to the 3 possible global reading frames.
- The state denoted as “n/c” emits a non-coding sequence.
- The states denoted as “i-j” emit sequences where two adjacent genes overlap, where i and j correspond to the global reading frames of the upstream and downstream genes, respectively.
- 18 states emit the nucleotides of start (triangle) and stop (square) codons.
The HMM is similar to a state transition diagram, with each hidden state emitting a single nucleotide. The initial hidden state in analysis should be either “n/c,” “start,” or “stop.” Each transition has an associated probability value.
The Viterbi Algorithm is a dynamic programming algorithm with a variety of applications. When applied to a graph (like the one defined by the HMM), the Viterbi Algorithm can be used to determine the maximum likelihood path. Using the probability values associated with each “node” in the graph (the different states) and with each transition, it can calculate the probabilities of the various paths efficiently. Once the maximum likelihood path is discovered, the transitions between the various states in the path indicate specific information:
- Direct transitions between states 1, 2, and 3 correspond to frameshifts
- Transitions between 1, 2, and 3 passing through the “n/c” state indicate non-overlapping adjacent genes
- Transitions between 1, 2, and 3 that pass through an i-j state indicate overlapping adjacent genes
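To show how the Viterbi Algorithm recovers a most likely state path, here is a minimal Python sketch on a deliberately tiny HMM. This is not GeneTack's 28-state model: the two states ("noncoding" and "coding"), the transition and emission probabilities, and the toy sequence are all invented for illustration. Log-probabilities are used to avoid numerical underflow, so every probability in the model must be non-zero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path for a non-empty sequence.

    All probabilities must be non-zero, because the scores are kept as
    sums of log-probabilities.
    """
    # best[t][s]: log-probability of the best path ending in state s at t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy model (hypothetical numbers): non-coding DNA is AT-rich, coding is GC-rich.
states = ("noncoding", "coding")
start_p = {"noncoding": 0.5, "coding": 0.5}
trans_p = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
emit_p = {
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
path = viterbi("AATTGCGCGC", states, start_p, trans_p, emit_p)
print(path)
```

The decoded path stays in the non-coding state for the four AT-rich nucleotides and switches to the coding state for the six GC-rich ones. In GeneTack the same kind of traceback is read differently: it is the transitions between the coding-frame states that are interpreted, as listed above.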
GeneTack-GM Algorithm
In order to run GeneTack, one needs to estimate parameters and parse the sequence into fragments. GeneTack-GM accomplishes this by using GeneMarkS. Figure 2 depicts the logic of operations in the GeneTack-GM program.
Initially, GeneMarkS is run for several iterations to determine the HMM parameters. At completion of the “training process,” GeneMarkS defines the set of predicted genes. The output is used to split the sequence into fragments, which the GeneTack program analyzes to identify possible frameshifts. Finally, several filters are applied to reduce the number of false positives. It should be noted slight modifications were made to the algorithm for high GC-content (guanine-cytosine) genes.
Datasets
GeneTack-GM was assessed on 17 prokaryote genomes with GC-content ranging from 28% to 75%. The original E. coli genome, used for the training process (estimating program parameters), was not included in any of the datasets. From the 17 prokaryote genomes, datasets were generated to test performance at different gene lengths by simulating a frameshift in a randomly selected gene at a random position of the gene, at least a certain distance (the insensitivity zone) from either gene end. Dataset_1000 simulated frameshifts in 400 genes of length greater than 1,000bp. Dataset_600_1000 simulated frameshifts in 200 genes of length 600-1,000bp.
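The simulation described above can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name simulate_frameshift is hypothetical, the default zone of 180bp is borrowed from the Dataset_1000 value discussed in the Analysis section, and the choice of a single-nucleotide insertion or deletion with equal probability is an assumption, since the summary does not say which kinds of frameshift were simulated.

```python
import random


def simulate_frameshift(gene, zone=180, rng=None):
    """Introduce one single-nucleotide indel at a random position.

    The position is kept at least `zone` nucleotides away from both ends
    of the gene (the insensitivity zone). Choosing between an insertion
    and a deletion with equal probability is an assumption of this sketch.
    Returns (mutated_sequence, position, kind).
    """
    rng = rng or random.Random()
    if len(gene) <= 2 * zone:
        raise ValueError("gene is too short for this insensitivity zone")
    pos = rng.randrange(zone, len(gene) - zone)
    if rng.random() < 0.5:
        mutated = gene[:pos] + gene[pos + 1:]
        kind = "deletion"
    else:
        mutated = gene[:pos] + rng.choice("ACGT") + gene[pos:]
        kind = "insertion"
    return mutated, pos, kind


# A made-up 1,200bp "gene", with a fixed seed so the run is repeatable.
gene = "ACGT" * 300
mutated, pos, kind = simulate_frameshift(gene, zone=180, rng=random.Random(0))
print(kind, "at position", pos, "new length", len(mutated))
```

Recording the position of each simulated frameshift is what later allows predicted coordinates to be compared against the known ones.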
Results
GeneTack was compared to two other frameshift detection programs, FrameD and FSFind. All three programs were applied to both datasets. The coordinates of each predicted frameshift were compared to the coordinates of the known simulated frameshift. A True Positive (TP) is a predicted frameshift within 50bp of an actual frameshift. A False Positive (FP) is a predicted frameshift further than 50bp from an actual frameshift. A False Negative (FN) is a known frameshift with no predicted frameshift within 50bp. Performance was calculated using the conventions of Sensitivity (Sn) and Specificity (Sp). Sensitivity, Sn = TP/(TP+FN), is defined with respect to the actual number of frameshifts. Specificity, Sp = TP/(TP+FP), is defined with respect to the number of predictions made. The average of these two values was used to evaluate performance. Table 3 lists the results for Dataset_1000.
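The evaluation rules above translate directly into a few lines of Python. This is a sketch under assumptions rather than the paper's evaluation script: the function name evaluate_predictions is hypothetical, all coordinates are assumed to lie on one sequence, "within 50bp" is taken to include exactly 50bp, and at most one prediction is assumed to fall near each actual frameshift.

```python
def evaluate_predictions(predicted, actual, window=50):
    """Compute (Sn, Sp, average) from frameshift coordinates.

    predicted, actual: lists of positions on the same sequence.
    A prediction is a true positive if it lies within `window` bp of an
    actual frameshift, otherwise a false positive. An actual frameshift
    with no prediction within `window` bp is a false negative.
    """
    tp = sum(1 for p in predicted
             if any(abs(p - a) <= window for a in actual))
    fp = len(predicted) - tp
    fn = sum(1 for a in actual
             if not any(abs(p - a) <= window for p in predicted))
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp, (sn + sp) / 2


# Made-up coordinates: two predictions are close to real frameshifts, two
# are not, and the real frameshift at 3000 is missed.
predicted = [120, 980, 2000, 2500]
actual = [100, 1000, 3000]
sn, sp, average = evaluate_predictions(predicted, actual)
print(f"Sn = {sn:.3f}, Sp = {sp:.3f}, average = {average:.3f}")
```

With these numbers TP = 2, FP = 2, and FN = 1, so Sn = 2/3 and Sp = 1/2. Note that Sp as defined here is what many other fields call precision.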
In the Dataset_1000, GeneTack-GM outperformed FrameD and FSFind by a margin of 9.4% and 9.1%, respectively. In Dataset_600_1000, GeneTack outperformed FrameD and FSFind by 5.9% and 9.9%, respectively.
Analysis
GeneTack-GM provides the most accurate frameshift detection. However, certain questions arise. One such example is whether GeneTack can be used to predict programmed frameshifts. Some genes have evolved sequences that induce frameshifting by altering the ribosomal frame during protein translation, known as programmed frameshifts. To determine this, GeneTack was applied to 23 sequences with +1 and -1 annotated programmed frameshifts. GeneTack was able to successfully predict frameshifts in 18 of these sequences.
It is difficult to detect frameshifts near the start or end of a gene, in regions known as insensitivity zones. For Dataset_1000, an insensitivity zone of 180bp was used, but how was this value determined? To set the insensitivity zone, frameshifts were introduced at 5-nucleotide steps from 1 to 200bp from the gene border. The accuracy of detection steadily increased with the offset from the gene end, leveling off around 90% at a distance of 180bp in genes longer than 1000bp. It was also observed that GeneTack detects frameshifts close to the end of a gene better than those close to the start. This phenomenon can be explained by the fact that it is easier to predict adjacent genes downstream than upstream, which correlates with the number of start and stop codons in the genetic code.
A variety of filters were applied to the GeneTack predictions. In the preceding data, the filters removed 72% of false positives while keeping 91% of true positives. However, one might ask how to improve the filters. One possibility is to make the filter parameters dependent on GC-content.
Can GeneTack-GM be adapted to other genomic sequences of intron-less genes? One possible application in metagenomics is to use GeneTack with heuristic models to predict genes in short sequences.
This paper was originally revised on February 12th, 2010. I was unable to find any new frameshift detection algorithms produced since then.
Identification of 4-nitrobenzoate reductase and 4-hydroxylaminobenzoate lyase enzymes involved in the bacterial degradation of chloramphenicol
An interesting line of research in Dr. Spain’s environmental engineering lab is the identification of 4-nitrobenzoate reductase and 4-hydroxylaminobenzoate lyase enzymes involved in the bacterial degradation of chloramphenicol. The topic is worth mentioning because chloramphenicol is one of the best known naturally occurring compounds, and it is worthwhile to work out the pathways needed for complete degradation of the molecule. It is surprising that some microorganisms are resistant to this antibiotic, and even more fascinating that some bacteria truly subsist on chloramphenicol, using it as a source of carbon, nitrogen, and energy, as reported by Lingens and Oltmanns1. Previous studies reported bacteria that subsist on chloramphenicol but did not establish which pathway they use. Other previous studies predicted that chloramphenicol degradation proceeds through 4-nitrobenzoate. The main goal of this research is to isolate a microorganism that uses chloramphenicol as its food and energy source and to identify the pathways involved in the degradation.
Results
A soil bacterium was isolated, identified, and grown to discover its enzymatic pathways.
The chloramphenicol degrader was isolated; selective enrichment with chloramphenicol allowed isolation of Nocardia JS674. Nocardia JS674 is able to subsist on chloramphenicol, at concentrations up to 800 µM, with optimal growth at concentrations ranging from 300-400 µM. During growth in minimal medium supplemented with chloramphenicol, an increase in cell density accompanied the removal of chloramphenicol and the release of ammonia (Figure 1A).
Figure 1
To ascertain whether the pathway is induced in the presence of chloramphenicol, the ability of glucose-grown JS674 to degrade the antibiotic was also tested. Resting cell suspensions of JS674 grown with pyruvate as the carbon source were not able to degrade chloramphenicol. Resting cells grown with chloramphenicol, however, readily removed the antibiotic (Figure 1B). Resting suspensions of chloramphenicol-grown cells also readily degraded 4-nitrobenzoate (4NBA), indicating that 4NBA is an intermediate in the chloramphenicol degradation pathway. In contrast, chloramphenicol-adapted cells were not able to degrade para-aminobenzoic acid (PABA), indicating that PABA is not an intermediate in the pathway.
The genome of Nocardia JS674 was sequenced to identify potential genes involved in chloramphenicol degradation. A draft genome sequence of JS674 was completed using Illumina technology to identify the genetic basis of chloramphenicol degradation. Sequencing yielded a total of 4,470,844 paired-end reads that generated 767 contigs with an average length of 9,818 nucleotides. Homology searches against the draft genome sequence identified a gene encoding 4-nitrobenzoate reductase within a 42 kbp contig. Upstream of the predicted 4NBA reductase, and divergently transcribed from it, was an open reading frame encoding a 4-hydroxylaminobenzoate lyase. The enzymes predicted from JS674 show high sequence similarity to characterized 4-nitrobenzoate reductase and 4-hydroxylaminobenzoate lyase enzymes from Ralstonia pickettii and from Pseudomonas putida TW3 (44 and 57%, respectively).
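The contig statistics above imply an approximate size for the draft assembly. The snippet below is simple arithmetic on the reported numbers; because the average contig length is a rounded value, the result is an estimate rather than an exact assembly size.

```python
n_contigs = 767            # reported number of contigs
mean_contig_len = 9_818    # reported average contig length (nt)
gene_contig_len = 42_000   # contig carrying the reductase gene (42 kbp)

# Total draft assembly length implied by the count and the mean.
approx_assembly_bp = n_contigs * mean_contig_len

# Share of the draft assembly covered by the 42 kbp contig.
gene_contig_share = gene_contig_len / approx_assembly_bp

print(f"approx. assembly size: {approx_assembly_bp:,} bp "
      f"(~{approx_assembly_bp / 1e6:.2f} Mbp)")
print(f"42 kbp contig share: {gene_contig_share:.2%}")
```

So the draft covers roughly 7.5 Mbp, and the contig of interest is well above the average contig length.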
After the bacterium was successfully isolated, identified, and grown, enzyme assays were performed. 4NBA reductase activity was high in cell extracts derived from chloramphenicol-grown cultures, but not apparent in uninduced cultures (Table 1; Figure 3). The reductase activity depends on a nicotinamide cofactor and exhibits a slight preference for NADPH over NADH (Table 1). Cell extracts from chloramphenicol-grown cultures were also able to transform 4-hydroxylaminobenzoate (4HABA), indicating that the lower pathway for chloramphenicol degradation proceeds through 4-nitrobenzoate and that a reductive pathway generates a hydroxylaminobenzoate intermediate. The pathway is not induced when cells are grown on glucose or pyruvate.
Table 1: 4-Nitrobenzoate reductase activity in JS674
| Growth Substrate | Cofactor | Specific Activity (nmol/min·mg) |
| Chloramphenicol | NADPH | 135 ± 18 |
| Chloramphenicol | NADH | 72 ± 1 |
| Glucose | NADPH | not detected |
| Pyruvate | NADPH | not detected |
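As a rough worked example, the cofactor preference in Table 1 can be expressed as a ratio of specific activities. The sketch below assumes that the ± values are independent standard deviations and uses standard first-order error propagation for a quotient; the source does not say how the uncertainties were obtained, so the error estimate is illustrative only.

```python
import math

# Specific activities from Table 1 (nmol/min·mg): (mean, reported ± value)
nadph = (135.0, 18.0)
nadh = (72.0, 1.0)

# Fold preference for NADPH over NADH.
ratio = nadph[0] / nadh[0]

# First-order propagation for a quotient, assuming independent errors
# (an assumption; the source does not describe the ± values).
rel_err = math.sqrt((nadph[1] / nadph[0]) ** 2 + (nadh[1] / nadh[0]) ** 2)
ratio_err = ratio * rel_err

print(f"NADPH/NADH activity ratio: {ratio:.2f} ± {ratio_err:.2f}")
```

Under these assumptions the extract is a little under twice as active with NADPH as with NADH.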
Figure 3
This study is incomplete and the research is ongoing. Most of what I am expected to do in the lab is described above. I have isolated a couple of bacteria using different soil inocula and am expected to proceed as described above. The study raises many questions. I wonder whether these are the only enzymatic pathways involved in chloramphenicol degradation and whether other enzymes are known to break down chloramphenicol; the study names only a few that are widely known, such as chloramphenicol acetyltransferase and a chloramphenicol hydrolase. If cells do not degrade chloramphenicol after growth on glucose or another sugar carbon source, how do they degrade it when exposed to it in minimal medium? Do they have a set of genes that is activated in the absence of sugar? I also wonder what would happen if chloramphenicol-exposed cells were grown on glucose. Would they forget “how” to use chloramphenicol as a food source after exposure to glucose? If they switch back and forth, this must be a significant evolutionary process. But soils are typically rich in carbon and nutrients, so why did these organisms start to degrade chloramphenicol, an antibiotic, when plenty of other carbon sources are available? Further research to identify the pathway and the corresponding genes in the bacterial genome is necessary to understand how microorganisms degrade chloramphenicol.
References
1 Lingens, F. and Oltmanns, O. (1966) Biochim. Biophys. Acta 130, 336.
Use of Nanomaterials in Cancer Drug Delivery
Dr. Jang’s lipid bilayer research team focuses on the use of polymeric nanomaterials in cancer drug delivery. This research topic is interesting and has significant value. For example, a major obstacle for chemotherapy is the inability to deliver adequate doses of drugs to the affected areas in the body. The toxicity of cancer drugs limits their dose; however, rapid clearance from circulation means that large doses are required for the drugs to be effective.
Polymeric nanomaterials have the potential to improve upon present chemotherapy delivery methods. They reduce side effects while increasing dosage, increase residence time in the body, offer a tunable release, and can deliver multiple drugs in one carrier. However, traditional nanomaterials lack rapid drug release at the intended site. Currently, “smart” technologies that enhance the benefits of typical nanomaterials are being simulated. Temperature- and pH-responsive drug delivery devices were reviewed as methods for triggering the release of encapsulated drugs, while aptamer and ligand conjugation were discussed as methods for targeted and intracellular delivery.
Figure 1: Diagram of a nanomaterial encapsulating drugs
Ligands: Site-Targeted Nanomaterials
- Attaching ligands to the nanoparticle surface helps deliver drugs to targeted cells.
- This method takes advantage of the overexpression of various receptors on tumor cell surfaces.
- E.g. folate-conjugated PEG-co-poly micelles loaded with the anti-cancer drug doxorubicin.
- Resulted in increased cytotoxicity and decreased tumor growth.
Aptamers: Site-Targeted Nanomaterials
- Aptamers are DNA and RNA sequences that recognize specific target analytes.
- Aptamers help deliver high drug doses to diseased cells with minimal toxicity to healthy cells.
- Cancer is a genetic disease, and aptamers provide a way for screening.
- E.g. Cancer-detecting assays using fluorescent imaging utilize aptamers conjugated with dye-doped silica nanoparticles.
pH-Responsive: Site-Triggered Nanomaterials
- The tumor microenvironment has a lower pH.
- Drug release from micelles can be targeted to acidic environments by conjugating the polymer to the drug with an acid-cleavable linkage.
- Micelle-like particles and liposomes with pH sensitivity are great delivery vehicles for anticancer drugs, DNA, RNA, proteins, and peptides.
Thermoresponsive: Site-Triggered Nanomaterials
- Hyperthermia has been investigated as a method for triggered drug release to targeted areas in thermoresponsive liposomes.
- Hyperthermia is defined as temperatures between 37 °C (physiological temperature) and 42 °C.
- When the liposomes pass through the area with increased temperature, they release their encapsulated drugs.
- DPPC is one of the common phospholipids seen in liposomes.
Conclusion
- Smart technologies in polymer nanomaterials offer a unique way to deliver chemotherapy drugs to their intended target without affecting healthy cells.
- Progression of these techniques will eventually lead to increased accuracy in delivering higher doses and more toxic drugs.
- Challenges like premature drug release and false cell targeting should be addressed.
The Rhox Homeobox Gene Cluster Is Imprinted and Selectively Targeted for Regulation by Histone H1 and DNA Methylation
Epigenetic mechanisms tightly control gene expression to regulate a wide variety of events, including placental function, embryonic growth, tissue differentiation, and tissue remodeling. The only known epigenetic modification of DNA in mammals is methylation. DNA methylation plays a key role in many gene silencing events, including X-chromosome inactivation, genomic imprinting, and silencing of retrotransposons and heterochromatin.
A component of chromatin that has recently been suggested to have a role in targeting DNA methylation to specific genomic sites is the linker histone H1. H1 binds to nucleosome core particles and protects an additional ∼20 bp of DNA (linker DNA) from nuclease digestion. The precise functions of H1 have proven difficult to define.
Studies show that H1 reduces nucleosome sliding and access to transcription factors in vitro. This has led to the view that H1 globally represses transcription.
But the precise mechanism responsible for repression is not known. And H1 appears to act on some of its targets by promoting DNA methylation.
Here, we report a major target of H1-mediated repression: a newly discovered homeobox gene cluster on the mouse X chromosome. This reproductive homeobox (Rhox) gene cluster contains over 30 genes that are selectively expressed in postnatal and adult reproductive tissues. The founding member of this complex, Pem (Rhox5), is also expressed in a cell type- and tissue-specific manner during embryonic development. It is likely that Rhox5 has redundant roles with other Rhox genes during embryonic development, as targeted deletion of Rhox5 does not result in embryonic defects.
We provide evidence that H1 also regulates the other genes in the Rhox cluster by promoting their methylation. We propose that the Rhox gene cluster is a useful model system to understand how regulators of chromatin structure control gene expression and genomic imprinting.
Rhox5 has two promoters: a distal promoter (Pd) expressed during embryogenesis and a proximal promoter (Pp) expressed in somatic cells.
We used RT-PCR to examine expression of the Rhox genes. It revealed that ES cells express Pd transcripts but almost no Pp transcripts. Pd expression was strongly upregulated as a result of H1 depletion. This effect of H1 is specific to the Pd, as Pp transcripts were not significantly upregulated in response to H1 depletion.
We used ChIP analysis to address whether H1 acts directly to promote methylation of the Pd. It revealed that H1 was present at high levels at the Pd in control ES cells but was at background levels at the Pd in H1-TKO cells, consistent with the notion that H1 directly promotes methylation of the Pd.
We used bisulfite sequencing to examine the methylation status of the Rhox5 promoter. It revealed that the minimal region of the Pd required for expression in transfected cell lines has four CpGs, and that the CpG sequences in the minimal Pd are hypermethylated in wild-type ES cells and hypomethylated in H1-depleted ES cells.
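The logic behind reading methylation from bisulfite data can be sketched as follows. Bisulfite treatment converts unmethylated cytosines to uracil, which is read as T after sequencing, while methylated cytosines remain C. The sketch below is a simplified illustration, not the analysis used in the study: it assumes reads that are already aligned to the reference, on the same strand, of the same length, and fully converted, and the 16 nt reference with four CpGs is made up.

```python
def cpg_positions(reference):
    """Indices of the C in every CpG dinucleotide of the reference."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]


def methylation_fraction(reference, reads):
    """Fraction of reads showing C (methylated) at each CpG cytosine.

    Reads showing anything other than C or T at a site are ignored;
    a site with no informative reads gets None.
    """
    fractions = {}
    for i in cpg_positions(reference):
        informative = [r[i] for r in reads if r[i] in "CT"]
        if informative:
            fractions[i] = sum(b == "C" for b in informative) / len(informative)
        else:
            fractions[i] = None
    return fractions


# Hypothetical reference with four CpGs (at indices 1, 5, 9, 13).
reference = "ACGTTCGACCGATCGA"
# Fully methylated clone: CpG cytosines stay C, the non-CpG C becomes T.
methylated_read = "ACGTTCGATCGATCGA"
# Fully unmethylated clone: every C is read as T.
unmethylated_read = "ATGTTTGATTGATTGA"

wild_type = methylation_fraction(reference, [methylated_read] * 3 + [unmethylated_read])
depleted = methylation_fraction(reference, [unmethylated_read] * 4)
```

With these toy clones, every CpG is 75% methylated in the "wild-type" set and 0% in the "depleted" set, which mirrors the hypermethylated versus hypomethylated contrast described above.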
All these results showed that H1 specifically promotes methylation and silencing of the Pd.
By showing that the Rhox5 promoter expressed in ES cells, the Pd, is both demethylated and transcriptionally induced upon H1 depletion, and by showing with a cassette methylation procedure that DNA methylation directly represses Pd transcription, we demonstrate that the Rhox homeobox gene cluster is a major target of H1-mediated repression in ES cells.
Meta-analysis of Combined Microarray and NGS data – Po-Yen Wu
Microarray technology has been widely used for gene expression profiling. This technology is attractive because of its maturity and because of the large number of publicly available datasets. However, there are some inherent limitations to microarrays. Starting from 2005, after 454 Life Sciences introduced its large-scale parallel pyrosequencing system, next-generation sequencing (NGS) technology has become more and more prevalent in the studies of both genomics and transcriptomics. NGS is fascinating because it can identify and quantify rare transcripts without prior knowledge of a particular gene. It can also provide information regarding alternative splicing and sequence variation in identified genes.
A known problem in obtaining statistically significant inferences from gene expression profiling data, e.g., detecting differentially expressed genes (DEGs), is that the feature size (number of genes) is typically far larger than the sample size. Given the advantages of both platforms described above, we thought it worthwhile to combine datasets from microarray and NGS technologies to increase the sample size. In this way we can take advantage of both the large number of available microarray samples and the higher sensitivity of NGS samples for detecting DEGs.
We used publicly available microarray and NGS samples from the GEO and SRA databases. The microarray dataset was normalized using the web-based caCORRECT tool and the RMA algorithm. The NGS dataset was aligned to the human genome reference GRCh37.61 using the BWA aligner. RPKM normalization was also a necessary step in preprocessing the NGS dataset. We applied several platform-specific algorithms for DEG detection to the combined datasets as well as to the individual microarray and NGS datasets to investigate whether cross-platform meta-analysis is feasible. These algorithms are usually designed to work well for a particular platform. For microarray data, common methods for identifying DEGs include Significance Analysis of Microarrays (SAM), linear models with empirical Bayes, and Rank Products (RP). Poisson-based models (e.g., Audic-Claverie (AC), a Poisson model with a likelihood ratio test, and a negative binomial model with an exact test) are appropriate for NGS digital expression values. Therefore, several processing steps are required to make meta-analysis feasible, i.e., we need to make the expression distributions of the two platforms as close as possible.
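For readers unfamiliar with RPKM (reads per kilobase of transcript per million mapped reads), the normalization mentioned above can be written as a one-line formula. The sketch below uses the standard definition; the function name and the example counts are made up for illustration and are not taken from the study.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp gene in a 10-million-read library.
example = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(f"RPKM = {example:.1f}")
```

Dividing by gene length and library size makes counts comparable between long and short genes and between samples sequenced to different depths, which is a precondition for comparing NGS values with microarray intensities.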
Table I summarizes the performance of each DEG detection algorithm on the different datasets. We tested each method on five datasets, representing different combinations of normalized microarray and NGS data. We observed that AC statistics, DEGseq, and Rank Products detect more DEGs when microarray and NGS data are combined. This suggests that these methods may be more robust to heterogeneous combinations of microarray and NGS data. The other three methods, edgeR, SAM, and Empirical Bayes, did not perform as well on the combined datasets as on the individual datasets. Some meta-analysis methods (e.g., SAM) result in an increased FDR compared to that of the individual data, regardless of the normalization method. These methods depend on assumptions about expression values (i.e., expression values for each sample are assumed to be drawn from particular distributions). Statistically, the dynamic range and data distributions differ between microarray and NGS platforms. Normalization methods such as RPKM and quantile normalization are sometimes not enough to overcome these differences. Methods such as Rank Products are designed for meta-analysis and are less stringent about distribution similarity between datasets.
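The reason a rank-based method tolerates platform differences can be seen in a minimal sketch of the rank product idea: each gene is ranked within each dataset, and the ranks (not the raw values) are combined as a geometric mean. This is a simplified illustration, not the implementation used in the study; it breaks ties by input order, omits the permutation step normally used to assess significance, and uses made-up log fold changes.

```python
import math


def ranks_descending(values):
    """Rank 1 for the largest value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_product(fold_changes_by_dataset):
    """Geometric mean of each gene's rank across datasets (small = consistently up)."""
    k = len(fold_changes_by_dataset)
    n_genes = len(fold_changes_by_dataset[0])
    per_dataset = [ranks_descending(fc) for fc in fold_changes_by_dataset]
    return [math.prod(r[g] for r in per_dataset) ** (1.0 / k) for g in range(n_genes)]


# Hypothetical log fold changes for four genes on two platforms.
microarray_fc = [2.1, 0.3, -1.0, 1.5]
ngs_fc = [3.0, 0.1, -0.5, 0.9]

rp = rank_product([microarray_fc, ngs_fc])
```

Because only the ordering within each dataset matters, the two platforms can have very different dynamic ranges and still contribute equally to the combined score.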
Results of our exploratory analysis indicate that it is feasible to use meta-analysis methods for microarray and NGS data. Such analysis takes advantage of both the large number of available microarray samples and the higher sensitivity of NGS samples for detecting DEGs. However, the success of meta-analysis may also depend on normalization or other preprocessing steps.











