Student Projects

Here we showcase student projects from courses related to our Bioinformatics, MS program. More projects will be added in the coming weeks.

Guided Studies in Bioinformatics, BI7583

Michael D'Eletto

About the Project:
Functional Annotation of Multiple Gene Models

Abstract: 
After generations of sequencing the genomes of various organisms, there exists an abundance of sequencing data that must be analyzed and annotated. Bioinformaticians are left with the challenge of using open-source programs to align and assemble these millions of reads. From these genome assemblies, functional properties of individual genes must be annotated before being loaded into databases like GenBank. Numerous annotation pipelines have been developed; however, emphasis on extensive downstream functional annotation has been lacking. Software such as the MAKER pipeline [1] provides gene models based on multiple sources of evidence, but stops short of providing any functional information. Other tools, such as DAVID [2], are accessible only via a web site and hence would require submitting large amounts of data over the web, something many companies are not comfortable with. Tools such as AutoFACT [3] are not currently maintained and are primarily aimed at RNA transcript annotation. Corporations also face special needs in that they (1) require high levels of security for their information and (2) are not always able to pay for software that may be free for academics. In addition, the level of support, documentation, maintenance, and integration for bioinformatics tools varies greatly and is often too low for a small bioinformatics group to work with.

Rama Srinivasan

About the Project
Identification of Novel Peptides from the Venom Duct Transcriptome of Marine Snail Cinguloterebra anilis

Abstract
Molecules produced in nature that are biologically active continue to be the source and inspiration for a vast number of drugs, diagnostics, and pharmacological tools. However, it remains challenging not only to find new organisms that produce natural products, but also to identify all of the bioactive molecules produced by these organisms.

Marine snails have proven to be good sources of neuroactive peptides in the past. Whereas toxins from species like cone snails have been moderately well categorized, toxins from the vermivorous Terebrid snails remain more poorly characterized.

Working in collaboration with the Holford Lab at Hunter College of CUNY, I focus on discovering neuroactive peptides from the venom tissues of the snail Cinguloterebra anilis. We are working on Illumina RNA-Seq data from the C. anilis venom duct, and aim to assemble, annotate, and filter our way to discovering new toxins, later progressing to physiological assays.

Oscar L Rodriguez

About the Project
Joint Automated Genome Annotation of 73 Human Cell Types

Abstract
The ENCODE consortium produced functional genomics data in many cell types. Our goal is to annotate the active genomic functional elements in this diverse set of cell types. The challenge is that many of these cell types have little data available. We aim to leverage existing high quality annotations from six well-studied cell types in the production of annotations for the remaining cell types.

Novel classification and visualization of genome-wide expression patterns in known breast cancer subtypes | Alexander R. Mankovich, Class of 2014

Introduction to cancer subtyping and signatures for outcome prediction:
Breast cancer research, while making steady advances in the disease's diagnosis and the discovery of new therapies, is still limited in its capacity to characterize disease subtypes in full. Five molecular subtypes have been described in the past: HER2+/ERBB2+, basal-like, Luminal A, Luminal B, and normal-like. There are several approaches used to classify these subtypes: histopathology, which examines tissue to assign a grade and the particular physiological manifestation of the tumor; molecular pathology, which measures key proteins expressed by the majority of tumor cells; genetic analysis, which identifies genome-wide changes in tumor cells (such as copy number alterations); and gene expression analysis, which examines the particular genes driving tumor biology. These four approaches are used together to assign a patient's tumor a detailed subclassification that informs clinical outlook, such as risk of metastasis, likelihood of recurrence, and potential curative therapies.

Utilizing various analytical, statistical, and visual methods, RNA-seq expression signatures can more precisely guide clinical understanding of the driving forces behind tumor biology and further demarcate diverse breast cancer subtypes based on signature motifs and their associated prognostic or predictive factors - such as possible therapies, metastatic potential, recurrence risk, and survival probability. I propose to create a framework which generates long-range expression signatures from tumor samples, selects signatures which are alike, identifies significant correlating prognostic and predictive factors, and visualizes those relationships in a biologically intuitive manner.

STAT-GPS: a complete functional genome annotation tool focusing on extensive downstream analysis of genes | Michael D’Eletto, Class of 2014

After generations of sequencing the genomes of various organisms, there exists an abundance of sequencing data that must be analyzed and annotated. Bioinformaticians are left with the challenge of using open-source programs to align and assemble these millions of reads. From these genome assemblies, functional properties of individual genes must be annotated before being loaded into databases like GenBank. Numerous annotation pipelines have been developed; however, emphasis on extensive downstream functional annotation has been lacking. Software such as the MAKER pipeline provides gene models based on multiple sources of evidence, but stops short of providing any functional information. Other tools, such as DAVID, are accessible only via a web site and hence would require submitting large amounts of data over the web, something many companies are not comfortable with. Tools such as AutoFACT are not currently maintained and are primarily aimed at RNA transcript annotation. Corporations also face special needs in that they (1) require high levels of security for their information and (2) are not always able to pay for software that may be free for academics. In addition, the level of support, documentation, maintenance, and integration for bioinformatics tools varies greatly and is often too low for a small bioinformatics group to work with.

This thesis is a continuation of a graduate project centered on the development of an extensive functional annotation pipeline that emphasizes downstream analysis of genes. Initial development of the pipeline focused on primary annotations involving ab initio gene prediction and protein/EST alignment to known hits in various databases. These primary annotations merely scratched the surface of the overall function of each annotated gene. Continued development of the pipeline has delved into the functional and structural analyses of each gene and its proteins, as well as prediction of regulatory, non-coding elements in the DNA. These analyses include, but are not limited to: (1) automated homology modeling, (2) pathway assignment, (3) ncRNA prediction, and (4) de novo promoter element discovery.

This pipeline, known as STAT-GPS (Solazyme Total Annotation Tool for Genomic and Protein Sequences), utilizes a combination of open-source software and remote servers to attain the most reliable, accurate, and thorough functional annotation possible. The program, which is developed in Python, is intended for both genomic sequences and RNA transcripts, although genomic sequences are the main target. The source code is available for download and redistribution at https://github.com/mdeletto/STAT-GPS. A formal paper intended for publication in the journal Bioinformatics is being written concurrently and will include supplementary data about the efficiency of this pipeline.

Malcolm Houtz, Class of 2015

In 2011, Gan et al. published work indicating that different accessions of Arabidopsis thaliana use alternate gene models to those annotated in the reference genome. An implication of this finding is that a large proportion of genes predicted to be damaged or knocked out (using the reference genome annotation) in non-reference accessions were in fact not influenced by these mutations. The transcriptomes were reassembled for 18 accessions, and new annotation files were created.

Using RNA-Seq data already sequenced and assembled by the Purugganan Laboratory, we propose to study and potentially re-annotate the transcriptomes of 4 rice accessions.

The first phase of the project involves matching gene IDs with known polymorphisms or indels to a large FPKM matrix. A summarized categorization of expressed and unexpressed genes will be delivered. The summary will give an indication of expression levels for genes predicted to be damaged. Each accession was tested under many different conditions, so summarizing at several levels may make sense. This piece of the project is intended to extend Malcolm’s very basic R skills.
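The matching step described above can be sketched roughly as follows. The project itself plans to use R; Python is used here only for illustration. The gene IDs, condition names, and the FPKM cutoff of 1.0 are all hypothetical, and the rule "expressed if the cutoff is reached in any condition" is an assumption rather than the project's stated criterion.

```python
# Illustrative sketch only: gene IDs, conditions, and the cutoff are made up.
FPKM_THRESHOLD = 1.0  # assumed cutoff for calling a gene "expressed"


def summarize_expression(fpkm, damaged_ids, threshold=FPKM_THRESHOLD):
    """Sort predicted-damaged genes into expressed / unexpressed / missing.

    fpkm: dict mapping gene ID -> dict of condition name -> FPKM value.
    damaged_ids: gene IDs that carry a known polymorphism or indel.
    A gene counts as expressed if it reaches the threshold in any condition.
    """
    summary = {"expressed": [], "unexpressed": [], "missing": []}
    for gene_id in sorted(set(damaged_ids)):
        if gene_id not in fpkm:
            # Gene has a predicted lesion but no row in the FPKM matrix.
            summary["missing"].append(gene_id)
        elif max(fpkm[gene_id].values(), default=0.0) >= threshold:
            summary["expressed"].append(gene_id)
        else:
            summary["unexpressed"].append(gene_id)
    return summary


fpkm_matrix = {
    "gene_A": {"control": 12.4, "drought": 3.1},
    "gene_B": {"control": 0.0, "drought": 0.2},
    "gene_C": {"control": 5.5, "drought": 7.9},
}
damaged = ["gene_A", "gene_B", "gene_X"]
result = summarize_expression(fpkm_matrix, damaged)
print(result)
```

Summarizing per condition rather than across all conditions would only require replacing the max over conditions with a lookup of one condition's value.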

If a significant number of genes that are predicted to be damaged are in fact expressed, transcriptomes will be reassembled and annotated. Using an existing General Feature Format (GFF) file, we will find additional, novel transcripts and create new GFFs for each of the 4 sequenced accessions.

Although familiar software (Cufflinks) does allow the discovery of novel transcripts, the method for updating an existing GFF with additional transcripts is currently unclear.

Final deliverables will be pipelines for transcript reassembly and updating GFFs with additional annotations.

Oscar Rodriguez, Undergrad Class of 2014

Background: 
The ENCODE consortium produced functional genomics data in many cell types. Our goal is to annotate the active genomic functional elements in this diverse set of cell types. The challenge is that many of these cell types have little data available. We aim to leverage existing high quality annotations from six well-studied cell types in the production of annotations for the remaining cell types.

Approach:
We use the genome annotation software Segway to perform annotations, augmented with entropic graph-based regularization (EGBR) to leverage existing annotations. We chose cell types that had at least two out of four distinct types of assays (DNase-seq, RNA-seq, histone modification ChIP-seq, and transcription factor ChIP-seq).
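The selection rule for cell types (at least two of the four assay types) can be illustrated with a short Python sketch. The cell type names and assay lists below are hypothetical placeholders, not the project's actual data, and the function name is introduced only for this example.

```python
# The four assay categories named in the text.
ASSAY_TYPES = {
    "DNase-seq",
    "RNA-seq",
    "histone modification ChIP-seq",
    "transcription factor ChIP-seq",
}


def eligible_cell_types(assays_by_cell_type, min_types=2):
    """Return cell types with at least `min_types` distinct assay categories."""
    eligible = []
    for cell_type, assays in assays_by_cell_type.items():
        # Count distinct recognized categories; repeats of one assay count once.
        n_types = len(set(assays) & ASSAY_TYPES)
        if n_types >= min_types:
            eligible.append(cell_type)
    return sorted(eligible)


# Hypothetical inventory of available data per cell type.
inventory = {
    "cell_type_1": ["DNase-seq", "RNA-seq", "histone modification ChIP-seq"],
    "cell_type_2": ["DNase-seq", "DNase-seq"],
    "cell_type_3": ["RNA-seq", "transcription factor ChIP-seq"],
}
selected = eligible_cell_types(inventory)
print(selected)
```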

Results:
We will produce functional annotations of 73 cell types.  These annotations will be made publicly available on the UCSC Genome Browser.  In addition, the project has successfully migrated the Segway+EGBR annotation software to the DNAnexus cloud computing platform.

Pipeline for SNP Detection | Ernest Szeto, Summer 2010

About the Project:
The advent of inexpensive short-read sequencing technology has made the detection of single nucleotide polymorphisms (SNPs) more widely available.

In this project, we take Solexa reads and align them to a well-curated, finished yeast genome in order to detect mutated genes. The amino acid change resulting from each SNP is reported.
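As an illustration of how an amino acid change can be derived from a SNP, the Python sketch below translates the affected codon before and after the substitution using the standard genetic code. It assumes a forward-strand coding sequence without introns and a 0-based SNP position within that sequence; the example sequence and the function name are hypothetical and not taken from the project's pipeline.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_change(cds, pos, alt_base):
    """Describe the effect of a SNP as e.g. 'E2V' (ref, residue number, alt).

    cds: coding sequence on the forward strand, starting at the first codon.
    pos: 0-based position of the SNP within cds.
    alt_base: the substituted base.
    """
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    residue_number = codon_start // 3 + 1  # 1-based protein position
    return f"{CODON_TABLE[ref_codon]}{residue_number}{CODON_TABLE[alt_codon]}"


# Hypothetical coding sequence: ATG GAA GTT encodes M, E, V.
change = amino_acid_change("ATGGAAGTT", 4, "T")  # GAA -> GTA
print(change)
```

A synonymous SNP shows up as the same letter on both sides of the residue number, which makes silent changes easy to filter out.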

A pipeline has been developed for mapping reads to reference genes in a reference genome for purposes of SNP detection. A pipeline provides a systematic framework for analyzing multiple data sets dealing with the same problem.

Because bioinformatic data can be noisy, quantitative techniques are used to rank the candidates for gene mutation. In this case, the percentage coverage and depth (percCov, depth) are used to produce the ranked candidate list. The candidate list is made available for human inspection to ascertain the biological meaning of the SNP candidates. The list is annotated with gene product name, amino acid change, and affected Pfam domain to aid in interpretation.
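The text does not say how percCov and depth are combined, so the Python sketch below shows just one plausible ordering: sort by percentage coverage first and break ties by depth, both descending. The candidate records, the optional minimum cutoffs, and the function name are hypothetical.

```python
def rank_candidates(candidates, min_perc_cov=0.0, min_depth=0):
    """Rank SNP candidates by (percCov, depth), highest first.

    Candidates below either minimum are dropped before ranking.
    This ordering is an assumption, not the pipeline's documented rule.
    """
    kept = [
        c for c in candidates
        if c["percCov"] >= min_perc_cov and c["depth"] >= min_depth
    ]
    return sorted(kept, key=lambda c: (c["percCov"], c["depth"]), reverse=True)


# Hypothetical candidate list.
candidates = [
    {"gene": "gene_1", "percCov": 98.0, "depth": 40},
    {"gene": "gene_2", "percCov": 99.5, "depth": 12},
    {"gene": "gene_3", "percCov": 98.0, "depth": 55},
    {"gene": "gene_4", "percCov": 45.0, "depth": 3},
]
ranked = rank_candidates(candidates, min_perc_cov=50.0, min_depth=5)
for rank, cand in enumerate(ranked, start=1):
    print(rank, cand["gene"], cand["percCov"], cand["depth"])
```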