Years: 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007 | 2006 | 2004 | 2003 | 2002 | 2000


2017 (5)

» A neurogenetic model for the study of schizophrenia spectrum disorders: The International 22q11.2 Deletion Syndrome Brain Behavior Consortium.
Gur R, Bassett A, McDonald-McGinn D, Bearden C, Chow E, Emanuel B, Owen M, Swillen A, van den Bree M, Vermeesch J, Vorstman J, Warren S, Lehner T, Morrow B, The International 22q11.2 Deletion Syndrome Brain Behavior Consortium (2017) Mol Psychiatry (In press).  website     disease study | review / perspective
ABSTRACT: Rare copy number variants contribute significantly to the risk for schizophrenia, with the 22q11.2 locus consistently implicated. Individuals with the 22q11.2 deletion syndrome (22q11DS) have an estimated 25-fold increased risk for schizophrenia spectrum disorders, compared to individuals in the general population. The International 22q11DS Brain Behavior Consortium is examining this highly informative neurogenetic syndrome phenotypically and genomically. Here we detail the procedures of the phenomic effort to characterize the neuropsychiatric and neurobehavioral phenotypes associated with 22q11DS, focusing on schizophrenia and subthreshold expression of psychosis. The genomic approach includes a combination of whole genome sequencing and genome-wide microarray technologies, allowing the investigation of all possible DNA variation and gene pathways influencing the schizophrenia-relevant phenotypic expression. A phenotypically rich data set provides a psychiatrically well-characterized sample of unprecedented size (n=1,616) that informs the neurobehavioral developmental course of 22q11DS. This combined set of phenotypic and genomic data will enable hypothesis testing to elucidate the mechanisms underlying the pathogenesis of schizophrenia spectrum disorders.

» Translation fidelity coevolves with longevity.
Ke Z, Mallik P, Johnson AB, Luna F, Zhang ZD, Gladyshev V, Seluanov A, Gorbunova V (2017) Aging Cell (In press).    aging | cell biology
ABSTRACT: Whether errors in protein synthesis play a role in aging has been a subject of intense debate. It has been suggested that rare mistakes in protein synthesis in young organisms may result in errors in the protein synthesis machinery, eventually leading to an increasing cascade of errors as organisms age. Studies that followed generally failed to identify a dramatic increase in translation errors with aging. However, whether translation fidelity plays a role in aging remained an open question. To address this issue, we examined the relationship between translation fidelity and maximum lifespan across 17 rodent species with diverse lifespans. To measure translation fidelity, we utilized sensitive luciferase-based reporter constructs with mutations in an amino acid residue critical to luciferase activity, wherein misincorporation of amino acids at this mutated codon re-activated the luciferase. The frequency of amino acid misincorporation at the first and second codon positions showed a strong negative correlation with maximum lifespan. This correlation remained significant after phylogenetic correction, indicating that translation fidelity coevolves with longevity. These results give new life to the role of protein synthesis errors in aging: although the error rate may not significantly change with age, the basal rate of translation errors is important in defining lifespan across mammals.

» Cyclin C regulates adipogenesis by stimulating transcriptional activity of CCAAT/enhancer binding protein alpha.
Song Z, Xiaoli AM, Zhang Q, Zhang Y, Yang ES, Wang S, Chang R, Zhang ZD, Yang G, Strich R, Pessin JE, Yang F (2017) J Biol Chem (In press).  pubmed     cell biology
ABSTRACT: Brown adipose tissue (BAT) is important for maintaining energy homeostasis and adaptive thermogenesis in rodents and humans. As disorders arising from dysregulated energy metabolism, such as obesity and metabolic diseases, have increased, so has interest in the molecular mechanisms in adipocyte biology. Using a functional screen, we identified cyclin C (CycC), a conserved subunit of the Mediator complex, as a novel regulator for brown adipocyte formation. siRNA-mediated CycC knockdown (KD) in brown preadipocytes impaired the early transcriptional program of differentiation, and genetic knockout (KO) of CycC completely blocked the differentiation process. RNA-seq analyses of CycC-KD revealed a critical role of CycC in activating genes co-regulated by peroxisome proliferator activated receptor gamma (PPARγ) and CCAAT/enhancer binding protein alpha (C/EBPα). Overexpression of PPARγ2 or addition of the PPARγ ligand rosiglitazone rescued the defects in CycC-KO brown preadipocytes, and efficiently activated the PPARγ-responsive promoters in both wild-type (WT) and CycC-KO cells, suggesting that CycC is not essential for PPARγ transcriptional activity. In contrast, CycC-KO significantly reduced C/EBPα-dependent gene expression. Unlike for PPARγ, overexpression of C/EBPα could not induce C/EBPα target gene expression in CycC-KO cells or rescue the CycC-KO defects in brown adipogenesis, suggesting that CycC is essential for C/EBPα-mediated gene activation. CycC physically interacted with C/EBPα and this interaction was required for C/EBPα transactivation domain activity. Consistent with the role of C/EBPα in white adipogenesis, CycC-KD also inhibited differentiation of 3T3-L1 cells into white adipocytes. Together, these data indicate that CycC activates adipogenesis by stimulating the transcriptional activity of C/EBPα.

» Transcriptomic dynamics of breast cancer progression in the MMTV-PyMT mouse model.
Cai Y, Nogales-Cadenas R, Zhang Q, Lin JR, Zhang W, O'Brien K, Montagna C, Zhang ZD (2017) BMC Genomics 18(1):185.  pubmed   reprint     disease study | functional genomics
ABSTRACT: Background: Malignant breast cancer with complex molecular mechanisms of progression and metastasis remains a leading cause of death in women. To improve diagnosis and drug development, it is critical to identify panels of genes and molecular pathways involved in tumor progression and malignant transition. Using the PyMT mouse, a genetically engineered mouse model that has been widely used to study human breast cancer, we profiled and analyzed gene expression from four distinct stages of tumor progression (hyperplasia, adenoma/MIN, early carcinoma and late carcinoma) during which malignant transition occurs. Results: We found remarkable expression similarity among the four stages, meaning that genes altered in the later stages already showed traces of alteration at the beginning of tumor progression. We identified a large number of differentially expressed genes in PyMT samples of all stages compared with normal mammary glands, enriched in cancer-related pathways. Using co-expression networks, we found panels of genes as signature modules with some hub genes that predict metastatic risk. Time-course analysis revealed genes whose expression changes during the shift to malignant stages. These may provide additional insight into the molecular mechanisms beyond pathways. Conclusions: Our analyses with the PyMT mouse model shed new light on transcriptomic dynamics during breast cancer malignant progression.

» Network analysis of mitonuclear GWAS reveals functional networks and tissue expression profiles of disease-associated genes.
Johnson SC, Gonzales B, Milholland B, Zhang Q, Zhang Z, Suh S (2017) Hum Genet 136:55-65.  pubmed   reprint     disease study | systems biology
ABSTRACT: Evidence that mitochondria play a role in disease largely arises from observed measures of dysfunction in pathological conditions and genome-wide association studies (GWAS) associating mitochondrial factors with disease risk. While a wealth of data suggests mitochondria are important in diverse human pathologies, the specific mechanistic relationship between this organelle and disease is often elusive. Using GWAS data we examined the genetic signature of the nuclear encoded mitochondrial proteome in human diseases. We found unique mitochondrial pathways associating with each disease, characterized by distinctive protein-protein interaction networks and enrichment of unique gene ontology term sets. Genome-wide RNA-sequencing expression data from 32 human tissues revealed unique tissue specific expression profiles for each disease. Finally, we examined mitonuclear GWAS risk alleles in cancer using eQTL data, finding that the directional impact of risk alleles provides a coherent model for the contextual role of GWAS risk variants. These unbiased, whole genome based assessments provide new insight into the mechanistic role of mitochondria in different human diseases.

2016 (5)

» Cell culture-based profiling across mammals reveals DNA repair and metabolism as determinants of species longevity.
Ma S, Upneja A, Galecki A, Tsai Y, Burant CF, Raskind S, Zhang Q, Zhang ZD, Seluanov A, Gorbunova V, Clish CB, Miller RA, Gladyshev VN (2016) eLife 5:e19130.  pubmed   reprint     aging | comparative genomics
ABSTRACT: Mammalian lifespan differs by >100-fold, but the mechanisms associated with such longevity differences are not understood. Here, we conducted a study on primary skin fibroblasts isolated from 16 species of mammals and maintained under identical cell culture conditions. We developed a pipeline for obtaining species-specific ortholog sequences, profiled gene expression by RNA-seq and small molecules by metabolite profiling, and identified genes and metabolites correlating with species longevity. Cells from longer-lived species up-regulated genes involved in DNA repair and glucose metabolism, down-regulated proteolysis and protein transport, and showed high levels of amino acids but low levels of lysophosphatidylcholine and lysophosphatidylethanolamine. The amino acid patterns were recapitulated by further analyses of primate and bird fibroblasts. The study suggests that fibroblast profiling captures differences in longevity across mammals at the level of global gene expression and metabolite levels and reveals pathways that define these differences.

◸ journal highlighted article 
» Integrated post-GWAS analysis shed new light on the disease mechanisms of schizophrenia.
Lin JR, Cai Y, Zhang Q, Zhang W, Nogales-Cadenas R, Zhang ZD (2016) Genetics 204(4):1587-1600.  pubmed   reprint   website     analytical method | disease study
ABSTRACT: Schizophrenia is a severe mental disorder with a large genetic component. Recent genome-wide association studies (GWAS) have identified many schizophrenia-associated common variants. For most of the reported associations, however, the underlying biological mechanisms are not clear. The critical first step for their elucidation is to identify the most likely disease genes as the source of the association signals. Here, we describe a general computational framework of post-GWAS analysis for complex disease gene prioritization. We identify 132 putative schizophrenia risk genes in 76 risk regions spanning 120 schizophrenia-associated common variants, 78 of which have not been recognized as schizophrenia disease genes by previous GWAS. Even more significantly, 29 of them are outside the risk regions, likely under regulation of transcriptional regulatory elements therein contained. These putative schizophrenia risk genes are transcriptionally active in both brain and the immune system and highly enriched among cellular pathways, consistent with leading pathophysiological hypotheses about the pathogenesis of schizophrenia. With their involvement in distinct biological processes, these putative schizophrenia risk genes with different association strengths show distinctive temporal expression patterns and play specific biological roles during brain development.

» MicroRNA expression and gene regulation drive breast cancer progression and metastasis in PyMT mice.
Nogales-Cadenas R, Cai Y, Lin JR, Zhang Q, Zhang W, Montagna C, Zhang ZD (2016) Breast Cancer Res 18(1):75.  pubmed   reprint   website     disease study | functional genomics
ABSTRACT: BACKGROUND: MicroRNAs (miRNAs) are small non-coding RNA molecules of about 22 nucleotides whose function is to silence the expression of their target genes. Numerous studies have shown that miRNAs are not only key regulators in important cellular processes but also drivers in the development of many diseases including, especially, cancer. Estrogen receptor positive luminal B is the second most common but the least studied subtype of breast cancer. Only a few studies have examined the expression profiles of miRNAs in luminal B breast cancer, and their regulatory roles in cancer progression have yet to be investigated. METHODS: In this study, using the PyMT mice, a widely used luminal B breast cancer model, we profiled miRNA expression at four time points that represent different key developmental stages of cancer progression. We considered the expression of both miRNAs and mRNAs at these time points to improve the identification of regulatory targets of miRNAs. By combining gene functional and pathway annotation with miRNA-mRNA interactions, we created a PyMT-specific tripartite miRNA-mRNA-pathway network and identified novel Functional-Regulatory Programs (FRPs). RESULTS: We identified 151 differentially expressed miRNAs with a strict dual nature of either up- or down-regulation during the whole course of disease progression. Among 82 newly discovered breast cancer-related miRNAs, 35 can potentially regulate 271 protein-coding genes based on their sequence complementarity and expression profiles. We also identified miRNA-mRNA regulatory modules driving specific cancer-related biological processes. CONCLUSION: In this study we profiled the expression of miRNAs during breast cancer progression in the PyMT mouse model. By integrating miRNA and mRNA expression profiles, we identified differentially expressed miRNAs and their target genes involved in several cancer hallmarks. We applied a novel clustering method to an annotated miRNA-mRNA regulatory network and identified network modules involved in specific cancer-related biological processes.

» Systems-level analysis of human aging genes shed new light on mechanisms of aging.
Zhang Q, Nogales-Cadenas R, Lin JR, Zhang W, Cai Y, Vijg J, Zhang ZD (2016) Hum Mol Gen 25(14):2934-2947.  pubmed   reprint     aging | systems biology
ABSTRACT: Although studies over the last decades have firmly connected a number of genes and molecular pathways to aging, the aging process as a whole still remains poorly understood. To gain novel insights into the mechanisms underlying aging, instead of considering aging genes individually, we studied their characteristics at the systems level in the context of biological networks. We calculated a comprehensive set of network characteristics for human aging-related genes from the GenAge database. By comparing them with other functional groups of genes, we identified a robust group of aging-specific network characteristics. To find the structural basis and the molecular mechanisms underlying this aging-related network specificity, we also analyzed protein domain interactions and gene expression patterns across different tissues. Our study revealed that aging genes not only tend to be network hubs, playing important roles in communication among different functional modules or pathways, but also are more likely to physically interact and be co-expressed with essential genes. The high expression of aging genes across a large number of tissue types also points to a high level of connectivity among aging genes. Unexpectedly, contrary to the depletion of interactions among hub genes in biological networks, we observed close interactions among aging hubs, which renders the aging subnetworks vulnerable to random attacks and thus may contribute to the aging process. Comparison across species reveals the evolutionary process of the aging subnetwork. As organisms become more complex, the complexity of their aging mechanisms increases and their aging hub genes are more functionally connected.

» Prioritization of schizophrenia risk genes by a network-regularized logistic regression method.
Zhang W, Lin JR, Nogales-Cadenas R, Zhang Q, Cai Y, Zhang ZD (2016) Lecture Notes in Bioinformatics 9565:434-445.  reprint     analytical method | disease study
ABSTRACT: Schizophrenia (SCZ) is a severe mental disorder with a large genetic component. While recent large-scale microarray- and sequencing-based genome wide association studies have made significant progress toward finding SCZ risk variants and genes of subtle effect, the interactions among them were not considered in those studies. Using a protein-protein interaction network both in our regression model and to generate a SCZ gene subnetwork, we developed an analytical framework with Logit-Lapnet, the graphical Laplacian-regularized logistic regression, for whole exome sequencing (WES) data analysis to detect SCZ gene subnetworks. Using simulated data from a sequencing-based association study, we compared the performance of Logit-Lapnet with that of other logistic regression (LR)-based models. We use Logit-Lapnet to prioritize genes according to their coefficients and select top-ranked genes as seeds to generate the gene subnetwork that is associated with SCZ. The comparison demonstrated not only the applicability but also the better performance of Logit-Lapnet in scoring disease risk genes using sequencing-based association data. We applied our method to SCZ whole exome sequencing data and selected top-ranked risk genes, the majority of which are either known SCZ genes or genes potentially associated with SCZ. We then used the seed genes to construct SCZ gene subnetworks. This result demonstrates that by ranking genes according to their disease contributions our method scores and thus prioritizes disease risk genes for further investigation. An implementation of our approach in MATLAB is freely available for download at:

2015 (6)

» DNA repair in species with extreme lifespan differences.
MacRae SL, Croken MM, Calder RB, Aliper A, Milholland B, White RR, Zhavoronkov A, Gladyshev VN, Seluanov A, Gorbunova V, Zhang ZD, Vijg J (2015) Aging 7(12):1171-1184.  pubmed   reprint     aging | comparative genomics | evolutionary genomics
ABSTRACT: Differences in DNA repair capacity have been hypothesized to underlie the great range of maximum lifespans among mammals. However, measurements of individual DNA repair activities in cells and animals have not substantiated such a relationship because utilization of repair pathways among animals (depending on habitats, anatomical characteristics, and lifestyles) varies greatly between mammalian species. Recent advances in high-throughput genomics, in combination with increased knowledge of the genetic pathways involved in genome maintenance, now enable a comprehensive comparison of DNA repair transcriptomes in animal species with extreme lifespan differences. Here we compare transcriptomes of liver, an organ with high oxidative metabolism and abundant spontaneous DNA damage, from humans, naked mole rats, and mice, with maximum lifespans of ~120, 30, and 3 years, respectively, with a focus on genes involved in DNA repair. The results show that the longer-lived species, human and naked mole rat, share higher expression of DNA repair genes, including core genes in several DNA repair pathways. A more systematic approach of signaling pathway analysis indicates statistically significant upregulation of several DNA repair signaling pathways in human and naked mole rat compared with mouse. The results of this present work indicate, for the first time, that DNA repair is upregulated in a major metabolic organ in long-lived humans and naked mole rats compared with short-lived mice. These results strongly suggest that DNA repair can be considered a genuine longevity assurance system.

» RNA:DNA hybrids in the human genome have distinctive nucleotide characteristics, chromatin composition, and transcriptional relationships.
Nadel J, Athanasiadou R, Lemetre C, Wijetunga NA, O Broin P, Sato H, Zhang Z, Jeddeloh J, Montagna C, Golden A, Seoighe C, Greally JM (2015) Epigenetics Chromatin 8:46.  pubmed   reprint     epigenomics | functional genomics
ABSTRACT: BACKGROUND: RNA:DNA hybrids represent a non-canonical nucleic acid structure that has been associated with a range of human diseases and potential transcriptional regulatory functions. Mapping of RNA:DNA hybrids in human cells reveals them to have a number of characteristics that give insights into their functions. RESULTS: We find RNA:DNA hybrids to occupy millions of base pairs in the human genome. A directional sequencing approach shows the RNA component of the RNA:DNA hybrid to be purine-rich, indicating a thermodynamic contribution to their in vivo stability. The RNA:DNA hybrids are enriched at loci with decreased DNA methylation and increased DNase hypersensitivity, and within larger domains with characteristics of heterochromatin formation, indicating potential transcriptional regulatory properties. Mass spectrometry studies of chromatin at RNA:DNA hybrids show the presence of the ILF2 and ILF3 transcription factors, supporting a model of certain transcription factors binding preferentially to the RNA:DNA conformation. CONCLUSIONS: Overall, there is little to indicate that RNA:DNA hybrids depend on co-transcriptional formation, with results from the ribosomal DNA repeat unit instead supporting the intriguing model of RNA generating these structures in trans. The results of the study indicate heterogeneous functions of these genomic elements and provide new insights into their formation and stability in vivo.

» From gene expression to disease phenotypes: network-based approaches to study complex human diseases.
Zhang Q, Zhang W, Nogales-Cadenas R, Lin JR, Cai Y, Zhang ZD (2015) Transcriptomics and Gene Regulation, Translational Bioinformatics 9:115-140.  reprint     review / perspective
ABSTRACT: Gene expression is a fundamental biological process under tight regulation at all levels in normal cells. Its dysregulation can cause abnormal cell behaviors and result in diseases, and thus gene expression profiling and analysis have been widely used to provide the first clue about the molecular mechanisms of human diseases. Because genes and their products interact with and regulate one another, it is essential to analyze gene expression data and understand the genetics of disease in a biological network context. In this chapter, we first introduce the state-of-the-art gene expression analysis with network integration and the joint analysis of mRNA and miRNA expression to understand disease regulatory mechanisms, and then discuss how disease genes are predicted by incorporating knowledge of gene regulation and characterized in biological networks.

» Whole-Genome Sequencing and Integrative Genomic Analysis Approach on Two 22q11.2 Deletion Syndrome Family Trios for Genotype to Phenotype Correlations.
Chung JH, Cai J, Suskin BG, Zhang Z, Coleman K, Morrow BE (2015) Hum Mutat 36(8):797-807.  pubmed   reprint     disease study | genetic variation | genomic sequencing
ABSTRACT: The 22q11.2 deletion syndrome (22q11DS) affects 1:4,000 live births and presents with highly variable phenotype expressivity. In this study, we developed an analytical approach utilizing whole-genome sequencing (WGS) and integrative analysis to discover genetic modifiers. Our pipeline combined available tools in order to prioritize rare, predicted deleterious, coding and noncoding single-nucleotide variants (SNVs), and insertion/deletions from WGS. We sequenced two unrelated probands with 22q11DS, with contrasting clinical findings, and their unaffected parents. Proband P1 had cognitive impairment, psychotic episodes, anxiety, and tetralogy of Fallot (TOF), whereas proband P2 had juvenile rheumatoid arthritis but no other major clinical findings. In P1, we identified common variants in COMT and PRODH on 22q11.2 as well as rare potentially deleterious DNA variants in other behavioral/neurocognitive genes. We also identified a de novo SNV in ADNP2 (NM_014913.3:c.2243G>C), encoding a neuroprotective protein that may be involved in behavioral disorders. In P2, we identified a novel nonsynonymous SNV in ZFPM2 (NM_012082.3:c.1576C>T), a known causative gene for TOF, which may act as a protective variant downstream of TBX1, haploinsufficiency of which is responsible for congenital heart disease in individuals with 22q11DS.

» Comparative analysis of genome maintenance genes in naked mole-rat, mouse, and human.
MacRae SL, Zhang Q, Lemetre C, Seim I, Calder RB, Hoeijmakers J, Suh Y, Gladyshev VN, Seluanov A, Gorbunova V, Vijg J, Zhang ZD (2015) Aging Cell 14(2):288-291.  pubmed   reprint     aging | comparative genomics | evolutionary genomics
ABSTRACT: Genome maintenance (GM) is an essential defense system against aging and cancer, as both are characterized by increased genome instability. Here, we compared the copy number variation and mutation rate of 518 GM-associated genes in the naked mole-rat (NMR), mouse, and human genomes. GM genes appeared to be strongly conserved, with copy number variation in only four genes. Interestingly, we found NMR to have a higher copy number of CEBPG, a regulator of DNA repair, and TINF2, a protector of telomere integrity. NMR, as well as human, was also found to have a lower rate of germline nucleotide substitution than the mouse. Together, the data suggest that the long-lived NMR as well as human have more robust GM than mouse, and identify new targets for the analysis of the exceptional longevity of the NMR.

» INK4 locus of the tumor-resistant rodent, the naked mole rat, expresses a functional p15/p16 hybrid isoform.
Tian X, Azpurua J, Ke Z, Zhang Z, Vijg J, Gladyshev V, Gorbunova V, Seluanov A (2015) Proc Natl Acad Sci U S A 112(4):1053-1058.  pubmed   reprint     cell biology
ABSTRACT: The naked mole rat (Heterocephalus glaber) is a long-lived and tumor resistant rodent. Tumor resistance in the naked mole rat is mediated by extracellular matrix component hyaluronan of very high molecular weight (HMW-HA). HMW-HA triggers hypersensitivity of naked mole rat cells to contact inhibition, which is associated with induction of the INK4 locus leading to cell cycle arrest. The INK4a/b locus is among the most frequently mutated in human cancer. This locus encodes three distinct tumor suppressors: p15INK4b, p16INK4a, and ARF. While p15INK4b has its own open reading frame, p16INK4a and ARF share common second and third exons with alternative reading frames. Here we show that in the naked mole rat the INK4a/b locus encodes an additional product that consists of p15INK4b exon 1 joined to p16INK4a exons 2 and 3. We have named this isoform pALTINK4a/b (for alternative splicing). We show that pALTINK4a/b is present in both cultured cells and naked mole rat tissues, but is absent in human and mouse cells. Additionally, we demonstrate that the pALTINK4a/b expression is induced during early contact inhibition and upon a variety of stresses such as UV, gamma irradiation, loss of substrate attachment and expression of oncogenes. When overexpressed in naked mole rat or human cells pALTINK4a/b has stronger ability to induce cell cycle arrest than either p15INK4b or p16INK4a. We hypothesize that the presence of the fourth product, pALTINK4a/b of the INK4a/b locus in the naked mole rat contributes to the increased resistance to tumorigenesis of this species.

2014 (2)

» Comparative genetics of longevity and cancer: insights from long-lived rodents.
Gorbunova V, Seluanov A, Zhang Z, Gladyshev VN, Vijg J (2014) Nat Rev Genet 15(8):531-540.  pubmed   reprint     aging | review / perspective
ABSTRACT: Mammals have evolved a dramatic diversity of aging rates. Within the single order of Rodentia maximum lifespans differ from four years in mice to 32 years in naked mole rats. Cancer rates also differ significantly, from cancer-prone mice to virtually cancer-proof naked and blind mole rats. Recent progress in rodent comparative biology, in combination with the emergence of whole genome sequence information, has opened opportunities for the discovery of genetic factors controlling longevity and cancer susceptibility.

» Mosaic epigenetic dysregulation of ectodermal cells in autism spectrum disorder.
Berko ER, Suzuki M, Beren F, Lemetre C, Alaimo CM, Calder RB, Ballaban-Gil K, Gounder B, Kampf K, Kirschen J, Maqbool SB, Momin Z, Reynolds DM, Russo N, Shulman L, Stasiek E, Tozour J, Valicenti-McDermott M, Wang S, Abrahams BS, Hargitai J, Inbar D, Zhang Z, Buxbaum JD, Molholm S, Foxe JJ, Marion R, Auton A, Greally JM (2014) PLoS Genet 10(5):e1004402.  pubmed   reprint     disease study | epigenomics
ABSTRACT: DNA mutational events are increasingly being identified in autism spectrum disorder (ASD), but the potential additional role of dysregulation of the epigenome in the pathogenesis of the condition remains unclear. The epigenome is of interest as a possible mediator of environmental effects during development, encoding a cellular memory reflected by altered function of progeny cells. Advanced maternal age (AMA) is associated with an increased risk of having a child with ASD for reasons that are not understood. To explore whether AMA involves covert aneuploidy or epigenetic dysregulation leading to ASD in the offspring, we tested a homogeneous ectodermal cell type from 47 individuals with ASD compared with 48 typically developing (TD) controls born to mothers of ≥35 years, using a quantitative genome-wide DNA methylation assay. We show that DNA methylation patterns are dysregulated in ectodermal cells in these individuals, having accounted for confounding effects due to subject age, sex and ancestral haplotype. We did not find mosaic aneuploidy or copy number variability to occur at differentially-methylated regions in these subjects. Of note, the loci with distinctive DNA methylation were found at genes expressed in the brain and encoding protein products significantly enriched for interactions with those produced by known ASD-causing genes, representing a perturbation by epigenomic dysregulation of the same networks compromised by DNA mutational mechanisms. The results indicate the presence of a mosaic subpopulation of epigenetically-dysregulated, ectodermally-derived cells in subjects with ASD. The epigenetic dysregulation observed in these ASD subjects born to older mothers may be associated with aging parental gametes, environmental influences during embryogenesis or could be the consequence of mutations of the chromatin regulatory genes increasingly implicated in ASD. The results indicate that epigenetic dysregulatory mechanisms may complement and interact with DNA mutations in the pathogenesis of the disorder.

2013 (4)

» Naked mole-rat has increased translational fidelity compared with the mouse, as well as a unique 28S ribosomal RNA cleavage.
Azpurua J, Ke Z, Chen IX, Zhang Q, Ermolenko DN, Zhang ZD, Gorbunova V, Seluanov A (2013) Proc Natl Acad Sci U S A 110(43):17350-17355.  pubmed   reprint     comparative genomics | evolutionary genomics
ABSTRACT: The naked mole-rat (Heterocephalus glaber) is a subterranean eusocial rodent with a markedly long lifespan and resistance to tumorigenesis. Multiple data implicate modulation of protein translation in longevity. Here we report that 28S ribosomal RNA (rRNA) of the naked mole-rat is processed into two smaller fragments of unequal size. The two breakpoints are located in the 28S rRNA divergent region 6 and excise a fragment of 263 nt. The excised fragment is unique to the naked mole-rat rRNA and does not show homology to other genomic regions. Because this hidden break site could alter ribosome structure, we investigated whether translation rate and amino acid incorporation fidelity were altered. We report that naked mole-rat fibroblasts have significantly increased translational fidelity despite having comparable translation rates with mouse fibroblasts. Although we cannot directly test whether the unique 28S rRNA structure contributes to the increased fidelity of translation, we speculate that it may change the folding or dynamics of the large ribosomal subunit, altering the rate of GTP hydrolysis and/or interaction of the large subunit with tRNA during accommodation, thus affecting the fidelity of protein synthesis. In summary, our results show that naked mole-rat cells produce fewer aberrant proteins, supporting the hypothesis that the more stable proteome of the naked mole-rat contributes to its longevity.

» SubNet: a Java application for subnetwork extraction.
Zhang Q, Zhang ZD (2013) Bioinformatics 29(19):2509-2511.  pubmed   reprint   website     analytical method | software / pipeline / database | systems biology
ABSTRACT: The extraction of targeted subnetworks is a powerful way to identify functional modules and pathways within complex networks. Here, we present SubNet, a Java-based standalone program for extracting subnetworks given a basal network and a set of selected nodes. Designed with a graphical user-friendly interface, SubNet combines four different extraction methods, which offers the possibility to interrogate a biological network according to the question investigated. Of note, we developed a method based on the highly successful Google PageRank algorithm to extract the subnetwork using the node centrality metric, to which possible node weights of the selected genes can be incorporated.

» A brief introduction to tiling microarrays: principles, concepts, and applications.
Lemetre C, Zhang ZD (2013) Tiling Arrays: Methods and Protocols (Tin-Lap Lee and Alfred Chun Shui Luk eds.) Methods in Molecular Biology, vol. 1067:3-19.  pubmed   reprint     review / perspective
ABSTRACT: Technological achievements have always contributed to the advancement of biomedical research. It has never been more so than in recent times, when the development and application of innovative cutting-edge technologies have transformed biology into a data-rich quantitative science. This stunning revolution in biology primarily ensued from the emergence of microarrays over two decades ago. The completion of whole-genome sequencing projects and the advance in microarray manufacturing technologies enabled the development of tiling microarrays, which gave unprecedented genomic coverage. Since their first description, several types of application of tiling arrays have emerged, each aiming to tackle a different biological problem. Although numerous algorithms have already been developed to analyze microarray data, new method development is still needed not only for better performance but also for integration of available microarray data sets, which without doubt constitute one of the largest collections of biological data ever generated. In this chapter we first introduce the principles behind the emergence and the development of tiling microarrays, and then discuss with some examples how they are used to investigate different biological problems.

» The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes.
Montgomery SB, Goode DL, Kvikstad E, Albers CA, Zhang ZD, Mu XJ, Ananda G, Howie B, Karczewski KJ, Smith KS, Anaya V, Richardson R, Davis J; 1000 Genomes Project Consortium, MacArthur DG, Sidow A, Duret L, Gerstein M, Makova KD, Marchini J, McVean G, Lunter G (2013) Genome Res 23(5):749-761.  pubmed   reprint     genetic variation | genomic sequencing
ABSTRACT: Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

2012 (3)

» An integrated encyclopedia of DNA elements in the human genome.
The ENCODE Project Consortium (2012) Nature 489:57-74.  pubmed   reprint   website     functional genomics
ABSTRACT: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

» A systematic survey of loss-of-function variants in human protein-coding genes.
MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG; 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB, Tyler-Smith C (2012) Science 335(6070):823-828.  pubmed   reprint     genetic variation | genomic sequencing
ABSTRACT: Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

» The Einstein Genome Gateway using WASP - a high throughput multi-layered life sciences portal for XSEDE.
Golden A, McLellan AS, Dubin RA, Jing Q, O Broin P, Moskowitz D, Zhang Z, Suzuki M, Hargitai J, Calder RB, Greally JM (2012) Stud Health Technol Inform. 175:182-191.  pubmed     software / pipeline / database
ABSTRACT: Massively-parallel sequencing (MPS) technologies and their diverse applications in genomics and epigenomics research have yielded enormous new insights into the physiology and pathophysiology of the human genome. The biggest hurdle remains the magnitude and diversity of the datasets generated, compromising our ability to manage, organize, process and ultimately analyse data. The Wiki-based Automated Sequence Processor (WASP), developed at the Albert Einstein College of Medicine (hereafter Einstein), uniquely manages to tightly couple the sequencing platform, the sequencing assay, sample metadata and the automated workflows deployed on a heterogeneous high performance computing cluster infrastructure that yield sequenced, quality-controlled and 'mapped' sequence data, all within the one operating environment accessible by a web-based GUI interface. WASP at Einstein processes 4-6 TB of data per week and since its production cycle commenced it has processed ~ 1 PB of data overall and has revolutionized user interactivity with these new genomic technologies, who remain blissfully unaware of the data storage, management and most importantly processing services they request. The abstraction of such computational complexity for the user in effect makes WASP an ideal middleware solution, and an appropriate basis for the development of a grid-enabled resource - the Einstein Genome Gateway - as part of the Extreme Science and Engineering Discovery Environment (XSEDE) program. In this paper we discuss the existing WASP system, its proposed middleware role, and its planned interaction with XSEDE to form the Einstein Genome Gateway.

2011 (3)

» Identification of genomic indels and structural variations using split reads.
Zhang ZD, Du J, Lam H, Abyzov A, Urban AE, Snyder M, Gerstein M (2011) BMC Genomics 12(1):375.  pubmed   reprint     analytical method | software / pipeline / database
ABSTRACT: BACKGROUND: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. RESULTS: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. 
CONCLUSIONS: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.

» ACT: aggregation and correlation toolbox for analyses of genome tracks.
Jee J, Rozowsky J, Yip KY, Lochovsky L, Bjornson R, Zhong G, Zhang Z, Fu Y, Wang J, Weng Z, Gerstein M (2011) Bioinformatics 27(8):1152-1154.  pubmed   reprint   website     analytical method | software / pipeline / database
ABSTRACT: We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 Genomes Project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation, i.e., how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts. AVAILABILITY: ACT is available at

» Mapping copy number variation by population-scale genome sequencing.
Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HY, Leng J, Li R, Li Y, Lin CY, Luo R, Mu XJ, Nemesh J, Peckham HE, Rausch T, Scally A, Shi X, Stromberg MP, Stutz AM, Urban AE, Walker JA, Wu J, Zhang Y, Zhang ZD, Batzer MA, Ding L, Marth GT, McVean G, Sebat J, Snyder M, Wang J, Ye K, Eichler EE, Gerstein MB, Hurles ME, Lee C, McCarroll SA, Korbel JO; 1000 Genomes Project (2011) Nature 470(7332):59-65.  pubmed   reprint     genetic variation | genomic sequencing
ABSTRACT: Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.

2010 (3)

» Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model.
Zhang ZD, Gerstein MB (2010) BMC Bioinformatics 11(1):539.  pubmed   reprint     analytical method | software / pipeline / database
ABSTRACT: BACKGROUND: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale. RESULTS: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms. CONCLUSIONS: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.

» A map of human genome variation from population-scale sequencing.
The 1000 Genomes Project Consortium (2010) Nature 467(7319):1061-1073.  pubmed   reprint   website     genetic variation | genomic sequencing
ABSTRACT: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

» Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates.
Zhang ZD, Frankish A, Hunt T, Harrow J, Gerstein M (2010) Genome Biol 11(3):R26.  pubmed   reprint     comparative genomics | evolutionary genomics | pseudogene | software / pipeline / database
ABSTRACT: BACKGROUND: Unitary pseudogenes are a class of unprocessed pseudogenes without functioning counterparts in the genome. They constitute only a small fraction of annotated pseudogenes in the human genome. However, as they represent distinct functional losses over time, they shed light on the unique features of humans in primate evolution. RESULTS: We have developed a pipeline to detect human unitary pseudogenes through analyzing the global inventory of orthologs between the human genome and its mammalian relatives. We focus on gene losses along the human lineage after the divergence from rodents about 75 million years ago. In total, we identify 76 unitary pseudogenes, including previously annotated ones, and many novel ones. By comparing each of these to its functioning ortholog in other mammals, we can approximately date the creation of each unitary pseudogene (that is, the gene 'death date') and show that for our group of 76, the functional genes appear to be disabled at a fairly uniform rate throughout primate evolution - not all at once, correlated, for instance, with the 'Alu burst'. Furthermore, we identify 11 unitary pseudogenes that are polymorphic - that is, they have both nonfunctional and functional alleles currently segregating in the human population. Comparing them with their orthologs in other primates, we find that two of them are in fact pseudogenes in non-human primates, suggesting that they represent cases of a gene being resurrected in the human lineage. CONCLUSIONS: This analysis of unitary pseudogenes provides insights into the evolutionary constraints faced by different organisms and the timescales of functional gene loss in humans.

2009 (5)

» EBNA1 regulates cellular gene expression by binding cellular promoters.
Canaan A, Haviv I, Urban AE, Schulz VP, Hartman S, Zhang Z, Palejev D, Deisseroth AB, Lacy J, Snyder M, Gerstein M, Weissman SM (2009) Proc Natl Acad Sci U S A 106(52):22421-22426.  pubmed   reprint     functional genomics
ABSTRACT: Epstein-Barr virus (EBV) is associated with several types of lymphomas and epithelial tumors including Burkitt's lymphoma (BL), HIV-associated lymphoma, posttransplant lymphoproliferative disorder, and nasopharyngeal carcinoma. EBV nuclear antigen 1 (EBNA1) is expressed in all EBV associated tumors and is required for latency and transformation. EBNA1 initiates latent viral replication in B cells, maintains the viral genome copy number, and regulates transcription of other EBV-encoded latent genes. These activities are mediated through the ability of EBNA1 to bind viral-DNA. To further elucidate the role of EBNA1 in the host cell, we have examined the effect of EBNA1 on cellular gene expression by microarray analysis using the B cell BJAB and the epithelial 293 cell lines transfected with EBNA1. Analysis of the data revealed distinct profiles of cellular gene changes in BJAB and 293 cell lines. Subsequently, chromatin immune-precipitation revealed a direct binding of EBNA1 to cellular promoters. We have correlated EBNA1 bound promoters with changes in gene expression. Sequence analysis of the 100 promoters most enriched revealed a DNA motif that differs from the EBNA1 binding site in the EBV genome.

» Integrating sequencing technologies in personal genomics: optimal low cost reconstruction of structural variants.
Du J, Bjornson RD, Zhang ZD, Kong Y, Snyder M, Gerstein MB (2009) PLoS Comput Biol 5(7):e1000432.  pubmed   reprint   website     analytical method
ABSTRACT: The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mappability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.

» PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data.
Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein MB (2009) Genome Biol 10(2):R23.  pubmed   reprint   website     analytical method | software / pipeline / database
ABSTRACT: Personal-genomics endeavors, such as the 1000 Genomes project, are generating maps of genomic structural variants by analyzing ends of massively sequenced genome fragments. To process these we developed Paired-End Mapper (PEMer). This comprises an analysis pipeline, compatible with several next-generation sequencing platforms; simulation-based error models, yielding confidence-values for each structural variant; and a back-end database. The simulations demonstrated high structural variant reconstruction efficiency for PEMer's coverage-adjusted multi-cutoff scoring-strategy and showed its relative insensitivity to base-calling errors.

» Rapid in vivo exploration of a 5S rRNA neutral network.
Zhang ZD, Nayar M, Ammons D, Rampersad J, Fox GE (2009) J Microbiol Methods 76(2):181-187.  pubmed   reprint     structural rnas
ABSTRACT: A partial knockout compensation method to screen 5S ribosomal RNA sequence variants in vivo is described. The system utilizes an Escherichia coli strain in which five of eight genomic 5S rRNA genes were deleted in conjunction with a plasmid which is compensatory when carrying a functionally active 5S rRNA. The partial knockout strain is transformed with a population of potentially compensatory plasmids each carrying a randomly generated 5S rRNA gene variant. The ability to compensate the slow growth rate of the knockout strain is used in conjunction with sequencing to rapidly identify variant 5S rRNAs that are functional as well as those that likely are not. The assay is validated by showing that the growth rate of 15 variants separately expressed in the partial knockout strain can be accurately correlated with in vivo assessments of the potential validity of the same variants. A region of 5S rRNA was mutagenized with this approach and nine novel variants were recovered and characterized. Unlike a complete knockout system, the method allows recovery of both deleterious and functional variants. The method can be used to study variants of any 5S rRNA in the E. coli context including those of E. coli.

» PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls.
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB (2009) Nat Biotechnol 27(1):66-75.  pubmed   reprint   website     analytical method | software / pipeline / database
ABSTRACT: Chromatin immunoprecipitation (ChIP) followed by tag sequencing (ChIP-seq) using high-throughput next-generation instrumentation is rapidly replacing chromatin immunoprecipitation followed by genome tiling array analysis (ChIP-chip) as the preferred approach for mapping of sites of transcription-factor binding and chromatin modification. Using two deeply sequenced data sets for human RNA polymerase II and STAT1, each with matching input-DNA controls, we describe a general scoring approach to address unique challenges in ChIP-seq data analysis. Our approach is based on the observation that sites of potential binding are strongly correlated with signal peaks in the control, likely revealing features of open chromatin. We develop a two-pass strategy called PeakSeq to compensate for this signal caused by open chromatin, as revealed by inclusion of the controls. The first pass identifies putative binding sites and compensates for genomic variation in the 'mappability' of sequences. The second pass filters out sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Our scoring procedure enables us to optimize experimental design by estimating the depth of sequencing required for a desired level of coverage and demonstrating that more than two replicates provide only a marginal gain in information.

2008 (4)

» Modeling ChIP sequencing in silico with applications.
Zhang ZD, Rozowsky J, Snyder M, Chang J, Gerstein M (2008) PLoS Comput Biol 4(8):e1000158.  pubmed   reprint   website     analytical method
ABSTRACT: ChIP sequencing (ChIP-seq) is a new method for genomewide mapping of protein binding sites on DNA. It has generated much excitement in functional genomics. To score data and determine adequate sequencing depth, both the genomic background and the binding sites must be properly modeled. To develop a computational foundation to tackle these issues, we first performed a study to characterize the observed statistical nature of this new type of high-throughput data. By linking sequence tags into clusters, we show that there are two components to the distribution of tag counts observed in a number of recent experiments: an initial power-law distribution and a subsequent long right tail. Then we develop in silico ChIP-seq, a computational method to simulate the experimental outcome by placing tags onto the genome according to particular assumed distributions for the actual binding sites and for the background genomic sequence. In contrast to current assumptions, our results show that both the background and the binding sites need to have a markedly nonuniform distribution in order to correctly model the observed ChIP-seq data, with, for instance, the background tag counts modeled by a gamma distribution. On the basis of these results, we extend an existing scoring approach by using a more realistic genomic-background model. This enables us to identify transcription-factor binding sites in ChIP-seq data in a statistically rigorous fashion.

» Rapid evolution by positive Darwinian selection in T-cell antigen CD4 in primates.
Zhang ZD, Weinstock G, Gerstein M (2008) J Mol Evol 66(5):446-456.  pubmed   reprint     comparative genomics | evolutionary genomics
ABSTRACT: CD4, an integral membrane glycoprotein, plays a critical role in the immune response and in the life cycle of simian and human immunodeficiency virus (SIV and HIV). Pairwise comparisons of orthologous human and mouse genes show that CD4 is evolving much faster than the majority of mammalian genes. The acceleration is too great to be attributed to a simple relaxation of the action of purifying selection alone. Here we show that the selective pressure acting on CD4 is highly variable between regions in the protein and identify codon sites under strong positive selection. We reconstruct the coding sequences for ancestral primate CD4s and model tertiary structures of all ancestral and extant sequences. Structural mapping of positively selected sites shows they distribute on the surface of the D1 domain of CD4, where the exogenous SIV gp120 protein binds. Moreover, structural models of the ancestral sequences show substantially larger variation in the interfacial electrostatic charge on CD4 and in the surface complementarity between CD4 and gp120 in CD4 lineages from primates with natural SIV infections than those without. Thus, positive selection on CD4 among primates may reflect forces driven by SIV infection and could provide a link between changes in sequence and structure of CD4 during evolution and the interaction with the immunodeficiency virus.

» Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome.
Wu JQ, Du J, Rozowsky J, Zhang Z, Urban AE, Euskirchen G, Weissman S, Gerstein M, Snyder M (2008) Genome Biol 9(1):R3.  pubmed   reprint     functional genomics
ABSTRACT: BACKGROUND: Recent studies of the mammalian transcriptome have revealed a large number of additional transcribed regions and extraordinary complexity in transcript diversity. However, there is still much uncertainty regarding precisely what portion of the genome is transcribed, the exact structures of these novel transcripts, and the levels of the transcripts produced. RESULTS: We have interrogated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing. We analyzed annotated known gene regions, but primarily we focused on novel transcriptionally active regions (TARs), which were previously identified by high-density oligonucleotide tiling arrays and on random regions that were not believed to be transcribed. We found RACE sequencing to be very sensitive and were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. We also observed many instances of sense-antisense transcripts; further analysis suggests that many of the antisense transcripts (but not all) may be artifacts generated from the reverse transcription reaction. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Of previously unannotated random regions, 17% were shown to produce overlapping transcripts. CONCLUSION: We conclude that RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional. Furthermore, it is estimated that 9% of the novel transcripts encode proteins.

» Analysis of nuclear receptor pseudogenes in vertebrates: how the silent tell their stories.
Zhang ZD, Cayting P, Weinstock G, Gerstein M (2008) Mol Biol Evol 25(1):131-143.  pubmed   reprint   website     functional genomics | pseudogene
ABSTRACT: Transcription factor pseudogenes have not been systematically studied before. Nuclear receptors (NRs) constitute one of the largest groups of transcription factors in animals (e.g., 48 NRs in human). The availability of whole-genome sequences enables a global inventory of the NR pseudogenes in a number of vertebrate model organisms. Here we identify the NR pseudogenes in 8 vertebrate organisms and make our results available online. The assignments reveal that NR pseudogenes as a group have characteristics related to generation and distribution contrary to expectations derived from previous large-scale pseudogene studies. In particular, 1) despite its large size, the NR gene family has only a very small number of pseudogenes in each of the vertebrate genomes examined; 2) despite the low transcription levels of NR genes, except for one, all other NR pseudogenes identified in this study are retropseudogenes; and 3) no duplicated NR pseudogenes are found, contrary to the fact that the NR gene family was expanded through several waves of gene duplication events. Our analyses further reveal a number of interesting aspects of NR pseudogenes. Specifically, through careful sequence analysis, we identify remnant introns in 2 mouse retropseudogenes, psiRev-erbbeta and psiLRH1. Generated from partially processed pre-mRNAs, they appear to be rare examples of highly unusual semiprocessed pseudogenes. Second, by comparing the genomic sequences, we uncover a pseudogene that is unique to the human lineage relative to chimpanzee. Generated by a recent duplication of a segment in the human genome, this pseudogene is a duplicated-processed pseudogene, belonging to a new pseudogene species. Finally, FXRbeta was nonfunctionalized in the human lineage and thus appears to be an example of a rare unitary pseudogene. By comparing orthologous sequences, we dated the FXR-FXRbeta duplication and the nonfunctionalization of FXRbeta in primates.

2007 (8)

» Divergence of transcription factor binding sites across related yeast species.
Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M (2007) Science 317(5839):815-819.  pubmed   reprint   website     functional genomics
ABSTRACT: Characterization of interspecies differences in gene regulation is crucial for understanding the molecular basis of both phenotypic diversity and evolution. By means of chromatin immunoprecipitation and DNA microarray analysis, the divergence in the binding sites of the pseudohyphal regulators Ste12 and Tec1 was determined in the yeasts Saccharomyces cerevisiae, S. mikatae, and S. bayanus under pseudohyphal conditions. We have shown that most of these sites have diverged across these species, far exceeding the interspecies variation in orthologous genes. A group of Ste12 targets was shown to be bound only in S. mikatae and S. bayanus under pseudohyphal conditions. Many of these genes are targets of Ste12 during mating in S. cerevisiae, indicating that specialization between the two pathways has occurred in this species. Transcription factor binding sites have therefore diverged substantially faster than ortholog content. Thus, gene regulation resulting from transcription factor binding is likely to be a major cause of divergence between related species.

» Transcription factor binding site identification in yeast: a comparison of high-density oligonucleotide and PCR-based microarray platforms.
Borneman AR, Zhang ZD, Rozowsky J, Seringhaus MR, Gerstein M, Snyder M (2007) Funct Integr Genomics 7(4):335-345.  pubmed   reprint     functional genomics
ABSTRACT: In recent years, techniques have been developed to map transcription factor binding sites using chromatin immunoprecipitation combined with DNA microarrays (chIP chip). Initially, polymerase chain reaction (PCR)-based DNA arrays were used for the chIP chip procedure; however, high-density oligonucleotide (HDO) arrays, which allow for the production of thousands more features per array, have emerged as a competing array platform. To compare the two platforms, data from chIP chip analysis performed for three factors (Tec1, Ste12, and Sok2) using both HDO and PCR arrays under identical experimental conditions were compared. HDO arrays provided increased reproducibility and sensitivity, detecting approximately three times more binding events than the PCR arrays while also showing increased accuracy. The increased resolution provided by the HDO arrays also allowed for the identification of multiple binding peaks in close proximity and of novel binding events such as binding within ORFs. The HDO array platform provides a far more robust array system by all measures than PCR-based arrays, all of which is directly attributable to the large number of probes available.

» Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.
The ENCODE Project Consortium (2007) Nature 447(7146):799-816.  pubmed   reprint   website     functional genomics
ABSTRACT: We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

» Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies.
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M (2007) Genome Res 17(6):898-909.  pubmed   reprint     functional genomics
ABSTRACT: Recent progress in mapping transcription factor (TF) binding regions can largely be credited to chromatin immunoprecipitation (ChIP) technologies. We compared strategies for mapping TF binding regions in mammalian cells using two different ChIP schemes: ChIP with DNA microarray analysis (ChIP-chip) and ChIP with DNA sequencing (ChIP-PET). We first investigated parameters central to obtaining robust ChIP-chip data sets by analyzing STAT1 targets in the ENCODE regions of the human genome, and then compared ChIP-chip to ChIP-PET. We devised methods for scoring and comparing results among various tiling arrays and examined parameters such as DNA microarray format, oligonucleotide length, hybridization conditions, and the use of competitor Cot-1 DNA. The best performance was achieved with high-density oligonucleotide arrays, oligonucleotides ≥50 bases (b), the presence of competitor Cot-1 DNA and hybridizations conducted in microfluidics stations. When target identification was evaluated as a function of array number, 80%-86% of targets were identified with three or more arrays. Comparison of ChIP-chip with ChIP-PET revealed strong agreement for the highest ranked targets with less overlap for the low ranked targets. With advantages and disadvantages unique to each approach, we found that ChIP-chip and ChIP-PET are frequently complementary in their relative abilities to detect STAT1 targets for the lower ranked targets; each method detected validated targets that were missed by the other method. The most comprehensive list of STAT1 binding regions is obtained by merging results from ChIP-chip and ChIP-sequencing. Overall, this study provides information for robust identification, scoring, and validation of TF targets using ChIP-based technologies.

» Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions.
Zhang ZD, Paccanaro A, Fu Y, Weissman S, Weng Z, Chang J, Snyder M, Gerstein MB (2007) Genome Res 17(6):787-797.  pubmed   reprint   poster   website     analytical method | functional genomics
ABSTRACT: The comprehensive inventory of functional elements in 44 human genomic regions carried out by the ENCODE Project Consortium enables for the first time a global analysis of the genomic distribution of transcriptional regulatory elements. In this study we developed an intuitive and yet powerful approach to analyze the distribution of regulatory elements found in many different ChIP-chip experiments on a 10 to 100 kb scale. First, we focus on the overall chromosomal distribution of regulatory elements in the ENCODE regions and show that it is highly nonuniform. We demonstrate, in fact, that regulatory elements are associated with the location of known genes. Further examination on a local, single-gene scale shows an enrichment of regulatory elements near both transcription start and end sites. Our results indicate that overall these elements are clustered into regulatory rich islands and poor deserts. Next, we examine how consistent the nonuniform distribution is between different transcription factors. We perform on all the factors a multivariate analysis in the framework of a biplot, which enhances biological signals in the experiments. This groups transcription factors into sequence-specific and sequence-nonspecific clusters. Moreover, with experimental variation carefully controlled, detailed correlations show that the distribution of sites was generally reproducible for a specific factor between different laboratories and microarray platforms. Data sets associated with histone modifications have particularly strong correlations. Finally, we show how the correlations between factors change when only regulatory elements far from the transcription start sites are considered.

» Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome.
Trinklein ND, Karaoz U, Wu J, Halees A, Force Aldred S, Collins PJ, Zheng D, Zhang ZD, Gerstein MB, Snyder M, Myers RM, Weng Z (2007) Genome Res 17(6):720-731.  pubmed   reprint     functional genomics
ABSTRACT: The regulation of transcriptional initiation in the human genome is a critical component of global gene regulation, but a complete catalog of human promoters currently does not exist. In order to identify regulatory regions, we developed four computational methods to integrate 129 sets of ENCODE-wide chromatin immunoprecipitation data. They collectively predicted 1393 regions. Roughly 47% of the regions were unique to one method, as each method makes different assumptions about the data. Overall, predicted regions tend to localize to highly conserved, DNase I hypersensitive, and actively transcribed regions in the genome. Interestingly, a significant portion of the regions overlaps with annotated 3'-UTRs, suggesting that some of them might regulate anti-sense transcription. The majority of the predicted regions are >2 kb away from the 5'-ends of previously annotated human cDNAs and hence are novel. These novel regions may regulate unannotated transcripts or may represent new alternative transcription start sites of known genes. We tested 163 such regions for promoter activity in four cell lines using transient transfection assays, and 25% of them showed transcriptional activity above background in at least one cell line. We also performed 5'-RACE experiments on 62 novel regions, and 76% of the regions were associated with the 5'-ends of at least two RACE products. Our results suggest that there are at least 35% more functional promoters in the human genome than currently annotated.

» What is a gene, post-ENCODE? History and updated definition.
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M (2007) Genome Res 17(6):669-681.  pubmed   reprint   poster     functional genomics | review / perspective
ABSTRACT: While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century--from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.

» Tilescope: online analysis pipeline for high-density tiling microarray data.
Zhang ZD, Rozowsky J, Lam HY, Du J, Snyder M, Gerstein M (2007) Genome Biol 8(5):R81.  pubmed   reprint   poster   website     software / pipeline / database
ABSTRACT: We developed Tilescope, a fully integrated data processing pipeline for analyzing high-density tiling-array data. In a completely automated fashion, Tilescope will normalize signals between channels and across arrays, combine replicate experiments, score each array element, and identify genomic features. The program is designed with a modular, three-tiered architecture, facilitating parallelism, and a graphic user-friendly interface, presenting results in an organized web page, downloadable for further analysis.

2006 (4)

» A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and chIP-chip experiments: systematically incorporating validated biological knowledge.
Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M (2006) Bioinformatics 22(24):3016-3024.  pubmed   reprint   website     analytical method
ABSTRACT: MOTIVATION: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing. METHODOLOGY: In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively). RESULTS: For the practical sampling and training strategies, we show how the size and noise in the validated training data affects the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the maximally performing gold-standard. This latter result has strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of the genome-scale tiling array experiments.

» The DNA sequence, annotation and analysis of human chromosome 3.
Muzny DM, et al (2006) Nature 440(7088):1194-1198.  pubmed   reprint     genomic sequencing
ABSTRACT: After the completion of a draft human genome sequence, the International Human Genome Sequencing Consortium has proceeded to finish and annotate each of the 24 chromosomes comprising the human genome. Here we describe the sequencing and analysis of human chromosome 3, one of the largest human chromosomes. Chromosome 3 comprises just four contigs, one of which currently represents the longest unbroken stretch of finished DNA sequence known so far. The chromosome is remarkable in having the lowest rate of segmental duplication in the genome. It also includes a chemokine receptor gene cluster as well as numerous loci involved in multiple human cancers such as the gene encoding FHIT, which contains the most common constitutive fragile site in the genome, FRA3B. Using genomic sequence from chimpanzee and rhesus macaque, we were able to characterize the breakpoints defining a large pericentric inversion that occurred some time after the split of Homininae from Ponginae, and propose an evolutionary history of the inversion.

» The finished DNA sequence of human chromosome 12.
Scherer SE, et al (2006) Nature 440(7082):346-351.  pubmed   reprint     genomic sequencing
ABSTRACT: Human chromosome 12 contains more than 1,400 coding genes and 487 loci that have been directly implicated in human disease. The q arm of chromosome 12 contains one of the largest blocks of linkage disequilibrium found in the human genome. Here we present the finished sequence of human chromosome 12, which has been finished to high quality and spans approximately 132 megabases, representing approximately 4.5% of the human genome. Alignment of the human chromosome 12 sequence across vertebrates reveals the origin of individual segments in chicken, and a unique history of rearrangement through rodent and primate lineages. The rate of base substitutions in recent evolutionary history shows an overall slowing in hominids compared with primates and rodents.

» Microbial identification by mass cataloging.
Zhang Z, Jackson GW, Fox GE, Willson RC (2006) BMC Bioinformatics 7:117.  pubmed   reprint     structural rnas
ABSTRACT: BACKGROUND: The public availability of over 180,000 bacterial 16S ribosomal RNA (rRNA) sequences has facilitated microbial identification and classification using hybridization and other molecular approaches. In their usual format, such assays are based on the presence of unique subsequences in the target RNA and require a prior knowledge of what organisms are likely to be in a sample. They are thus limited in generality when analyzing an unknown sample. Herein, we demonstrate the utility of catalogs of masses to characterize the bacterial 16S rRNA(s) in any sample. Sample nucleic acids are digested with a nuclease of known specificity and the products characterized using mass spectrometry. The resulting catalogs of masses can subsequently be compared to the masses known to occur in previously-sequenced 16S rRNAs allowing organism identification. Alternatively, if the organism is not in the existing database, it will still be possible to determine its genetic affinity relative to the known organisms. RESULTS: Ribonuclease T1 and ribonuclease A digestion patterns were calculated for 1,921 complete 16S rRNAs. Oligoribonucleotides generated by RNase T1 of length 9 and longer produce sufficient diversity of masses to be informative. In addition, individual fragments or combinations thereof can be used to recognize the presence of specific organisms in a complex sample. In this regard, 140 strains out of 1,921 organisms (7.3%) could be identified by the presence of a unique RNase T1-generated oligoribonucleotide mass. Combinations of just two and three oligoribonucleotide masses allowed 54% and 72% of the specific strains to be identified, respectively. An initial algorithm for recovering likely organisms present in complex samples is also described. CONCLUSION: The use of catalogs of compositions (masses) of characteristic oligoribonucleotides for microbial identification appears extremely promising. RNase T1 is more useful than ribonuclease A in generating characteristic masses, though RNase A produces oligomers which are more readily distinguished due to the large mass difference between A and G. Identification of multiple species in mixtures is also feasible. Practical applicability of the method depends on high performance mass spectrometric determination, and/or use of methods that increase the one dalton (Da) mass difference between uracil and cytosine.

2004 (2)

» Genome sequence of the Brown Norway rat yields insights into mammalian evolution.
The Rat Genome Project Consortium (2004) Nature 428(6982):493-521.  pubmed   reprint     genomic sequencing
ABSTRACT: The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.

» Genomic analysis of the nuclear receptor family: new insights into structure, regulation, and evolution from the rat genome.
Zhang Z, Burch PE, Cooney AJ, Lanz RB, Pereira FA, Wu J, Gibbs RA, Weinstock G, Wheeler DA (2004) Genome Res 14(4):580-590.  pubmed   reprint   poster     comparative genomics | functional genomics
ABSTRACT: Completion of the Rattus norvegicus genome sequence enabled a global inventory and analysis of the nuclear receptors (NRs) in three mammalian species. Forty-nine NR members were found in mouse, 48 in human. Forty-seven were found in the rat, with gaps at the locations expected for the other two. Pairwise comparisons of their distribution in rat, mouse, and human identified 11 syntenic NR gene blocks, including three small clusters of two or three closely related genes, each spanning 40 kb to 1700 kb. The exon structure of the ligand-binding domain suggests that exon shuffling has played a role in the evolution of this family. An invariant splice junction in all members of the NR family except LXRbeta suggests a functional role for the intron. The ligand-binding domains of PXR and CAR are among the most divergent in the family. Their higher nucleotide substitution rates may be related to the central role played by these two NRs in the metabolism of the foreign compounds and may have resulted from limited positive selection.

2003 (1)

» Common 5S rRNA variants are likely to be accepted in many sequence contexts.
Zhang Z, D'Souza LM, Lee YH, Fox GE (2003) J Mol Evol 56(1):69-76.  pubmed   reprint     structural rnas
ABSTRACT: Over evolutionary time RNA sequences which are successfully fixed in a population are selected from among those that satisfy the structural and chemical requirements imposed by the function of the RNA. These sequences together comprise the structure space of the RNA. In principle, a comprehensive understanding of RNA structure and function would make it possible to enumerate which specific RNA sequences belong to a particular structure space and which do not. We are using bacterial 5S rRNA as a model system to attempt to identify principles that can be used to predict which sequences do or do not belong to the 5S rRNA structure space. One promising idea is the very intuitive notion that frequently seen sequence changes in an aligned data set of naturally occurring 5S rRNAs would be widely accepted in many other 5S rRNA sequence contexts. To test this hypothesis, we first developed well-defined operational definitions for a Vibrio region of the 5S rRNA structure space and what is meant by a highly variable position. Fourteen sequence variants (10 point changes and 4 base-pair changes) were identified in this way, which, by the hypothesis, would be expected to incorporate successfully in any of the known sequences in the Vibrio region. All 14 of these changes were constructed and separately introduced into the Vibrio proteolyticus 5S rRNA sequence where they are not normally found. Each variant was evaluated for its ability to function as a valid 5S rRNA in an E. coli cellular context. It was found that 93% (13/14) of the variants tested are likely valid 5S rRNAs in this context. In addition, seven variants were constructed that, although present in the Vibrio region, did not meet the stringent criteria for a highly variable position. In this case, 86% (6/7) are likely valid. As a control we also examined seven variants that are seldom or never seen in the Vibrio region of 5S rRNA sequence space. In this case only two of seven were found to be potentially valid. The results demonstrate that changes that occur multiple times in a local region of RNA sequence space in fact usually will be accepted in any sequence context in that same local region.

2002 (2)

» Identification of characteristic oligonucleotides in the bacterial 16S ribosomal RNA sequence dataset.
Zhang Z, Willson RC, Fox GE (2002) Bioinformatics 18(2):244-250.  pubmed   reprint     structural rnas
ABSTRACT: MOTIVATION: The phylogenetic structure of the bacterial world has been intensively studied by comparing sequences of 16S ribosomal RNA (16S rRNA). This database of sequences is now widely used to design probes for the detection of specific bacteria or groups of bacteria one at a time. The success of such methods reflects the fact that there are local sequence segments that are highly characteristic of particular organisms or groups of organisms. It is not clear, however, the extent to which such signature sequences exist in the 16S rRNA dataset. A better understanding of the numbers and distribution of highly informative oligonucleotide sequences may facilitate the design of hybridization arrays that can characterize the phylogenetic position of an unknown organism or serve as the basis for the development of novel approaches for use in bacterial identification. RESULTS: A computer-based algorithm that characterizes the extent to which any individual oligonucleotide sequence in 16S rRNA is characteristic of any particular bacterial grouping was developed. A measure of signature quality, Q(s), was formulated and subsequently calculated for every individual oligonucleotide sequence in the size range of 5-11 nucleotides and for 15mers with reference to each cluster and subcluster in a 929 organism representative phylogenetic tree. Subsequently, the perfect signature sequences were compared to the full set of 7322 sequences to see how common false positives were. The work completed here establishes beyond any doubt that highly characteristic oligonucleotides exist in the bacterial 16S rRNA sequence dataset in large numbers. Over 16,000 15mers were identified that might be useful as signatures. Signature oligonucleotides are available for over 80% of the nodes in the representative tree.

» NCIR: a database of non-canonical interactions in known RNA structures.
Nagaswamy U, Larios-Sanz M, Hury J, Collins S, Zhang Z, Zhao Q, Fox GE (2002) Nucleic Acids Res 30(1):395-397.  pubmed   reprint   website     software / pipeline / database | structural rnas
ABSTRACT: The secondary and tertiary structure of an RNA molecule typically includes a number of non-canonical base-base interactions. The known occurrences of these interactions are tabulated in the NCIR database, which can be accessed online. The number of examples is now over 1400, which is an increase of >700% since the database was first published. This dramatic increase reflects the addition of data from the recently published crystal structures of the 50S (2.4 Å) and 30S (3.0 Å) ribosomal subunits. In addition, non-canonical interactions observed in published crystal and NMR structures of tRNAs, group I introns, ribozymes, RNA aptamers and synthetic oligonucleotides are included. Properties associated with these interactions, such as sequence context, sugar pucker conformation, glycosidic angle conformation, melting temperature, chemical shift and free energy, are also reported when available. Out of the 29 anticipated pairs with at least two hydrogen bonds, 28 have been observed to date. In addition, several novel examples, not generally predicted, have also been encountered, bringing the total of such pairs to 36. Added to this list are a variety of single, bifurcated, triple and quadruple interactions. The most common non-canonical pairs are the sheared GA, GA imino, AU reverse Hoogsteen, and the GU and AC wobble pairs. The most frequent triple interaction connects N3 of an A with the amino of a G that is also involved in a standard Watson-Crick pair.

2000 (1)

» Database of non-canonical base pairs found in known RNA structures.
Nagaswamy U, Voss N, Zhang Z, Fox GE (2000) Nucleic Acids Res 28(1):375-376.  pubmed   reprint   website     software / pipeline / database | structural rnas
ABSTRACT: Atomic resolution RNA structures are being published at an increasing rate. It is common to find a modest number of non-canonical base pairs in these structures in addition to the usual Watson-Crick pairs. This database summarizes the occurrence of these rare base pairs in accordance with standard nomenclature. The database contains information such as sequence context, sugar pucker conformation, anti/syn base conformations, chemical shift, pKa values, melting temperature and free energy. Of the 29 anticipated pairs with two or more hydrogen bonds, 20 have been encountered to date. In addition, four unexpected pairs with two hydrogen bonds have been reported, bringing the total to 24. Single hydrogen bond versions of five of the expected geometries have been encountered among the single hydrogen bond interactions. In addition, 18 different types of base triplets have been encountered, each of which involves three to six hydrogen bonds. The vast majority of the rare base pairs are antiparallel with the bases in the anti configuration relative to the ribose. The most common are the GU wobble, the Sheared GA pair, the Reverse Hoogsteen pair and the GA imino pair.