Student Projects

Bioinformatics, MS Online

Below is a showcase of current NYU Tandon School of Engineering student projects from courses related to our Bioinformatics master's degree program. Please check back often to learn more about our new student projects.

Check out the new BioStar Handbook!

Danny Simpson, MS 2016

From Mollusks to Medicine: A Venomics Approach for the Discovery and Characterization of Therapeutics from Terebridae Peptide Toxins

Rajeeva Lochan Musunuri, MS 2015

Validating somatic structural variants with local assembly
Interning at the New York Genome Center

Detecting structural variants (SVs) from sequencing data is complex and suffers from a high false-negative rate. It is therefore necessary to use multiple orthogonal methodologies (such as read depth, read pairs, and split reads) to detect structural variants. When searching for somatic SVs in cancer samples (tumor/normal paired analysis), a false-negative call in the normal sample will lead to a false-positive somatic call in the tumor. This can be problematic because SVs are known to be highly relevant in cancer development and metastasis.

Previous studies have shown that assembly-based methods have the highest resolution in determining SV breakpoints, with base-pair precision. In this project, I have created a modular framework for validating and identifying SV calls by performing local assembly of the reads around the breakpoints with different assembly tools such as TIGRA, SGA, SPAdes, CORTEX, and FERMI. The framework provides a way to obtain a high-quality, clinically actionable set of structural variant calls.

Marina Hoashi MS Class of 2015

Mammals have evolved to nourish their offspring exclusively with maternal milk for around half of the lactation period, a crucial infant developmental window. In view of the oral-breast contact during lactation and the altered oral microbiota in Caesarean section (C-section) born infants, we expected differences in milk composition by delivery mode. Here we performed a cross-sectional study of microbes and glycosylation patterns in human milk at different times postpartum, and found differences by time after birth only in women who delivered vaginally. These results warrant further research into the role of microbes in milk glycosylation and its developmental functions.

Rama Srinivasan


About the Project
Identification of Novel Peptides from the Venom Duct Transcriptome of Marine Snail Cinguloterebra anilis

Molecules produced in nature that are biologically active continue to be the source and inspiration for a vast number of drugs, diagnostics, and pharmacological tools. However, it remains challenging not only to find new organisms that produce natural products, but also to identify all of the bioactive molecules produced by these organisms.

Marine snails have proven to be good sources of neuroactive peptides in the past. Whereas toxins from species like cone snails have been moderately well categorized, toxins from the vermivorous Terebrid snails remain more poorly characterized.

Working in collaboration with the Holford Lab at Hunter College of CUNY, I focus on discovering neuroactive peptides from the venom tissues of the snail Cinguloterebra anilis. We are working on Illumina RNA-Seq data from the C. anilis venom duct, and aim to assemble, annotate, and filter our way to discovering new toxins, later progressing to physiological assays.

Oscar L Rodriguez


About the Project
Joint Automated Genome Annotation of 73 Human Cell Types

The ENCODE consortium produced functional genomics data in many cell types. Our goal is to annotate the active genomic functional elements in this diverse set of cell types. The challenge is that many of these cell types have little data available. We aim to leverage existing high quality annotations from six well-studied cell types in the production of annotations for the remaining cell types.

Novel classification and visualization of genome-wide expression patterns in known breast cancer subtypes | Alexander R. Mankovich, Class of 2014

Introduction to cancer subtyping and signatures for outcome prediction:
Breast cancer research, while making steady advances in the disease's diagnosis and the discovery of new therapies, is still limited in its capacity to characterize disease subtypes in full. Five molecular subtypes have been described in the past: HER2+/ERBB2+, basal-like, Luminal A, Luminal B, and normal-like. There are several approaches used to classify these subtypes: histopathology, arising from the examination of tissue to assign a grade and particular physiological manifestation of the tumor; molecular pathology, which measures key proteins expressed by the majority of tumor cells; genetic analysis, which identifies genome-wide changes in tumor cells (such as copy number alterations); and gene expression analysis, which examines particular genes driving tumor biology. These four approaches are used together to assign a patient's tumor to a detailed subclassification that informs clinical outlook, such as risk of metastasis, likelihood of recurrence, and potential curative therapies.

Utilizing various analytical, statistical, and visual methods, RNA-seq expression signatures can more precisely guide clinical understanding of the driving forces behind tumor biology and further demarcate diverse breast cancer subtypes based on signature motifs and their associated prognostic or predictive factors - such as possible therapies, metastatic potential, recurrence risk, and survival probability. I propose to create a framework which generates long-range expression signatures from tumor samples, selects signatures which are alike, identifies significant correlating prognostic and predictive factors, and visualizes those relationships in a biologically intuitive manner.

STAT-GPS: a complete functional genome annotation tool focusing on extensive downstream analysis of genes | Michael D’Eletto, Class of 2014

After generations of sequencing the genomes of various organisms, there exists an abundance of sequencing data that must be analyzed and annotated.  Bioinformaticians are left with the challenge of using open-source programs to align and assemble these millions of reads.  From these genome assemblies, functional properties of individual genes must be annotated before being loaded into databases like GenBank.  Numerous annotation pipelines have been developed; however, emphasis on extensive downstream functional annotation has been lacking.  Software such as the MAKER pipeline provides gene models based on multiple sources of evidence, but stops short of providing any functional information.  Other tools, such as DAVID, are accessible only via a web site and hence would require submitting large amounts of data over the web, something many companies are not comfortable with.  Tools such as AutoFACT are not currently maintained and are primarily aimed at RNA transcript annotation.  Corporations also face special needs in that they (1) require high levels of security for their information and (2) are not always able to pay for software that may be free for academics.  In addition, the level of support, documentation, maintenance, and integration for bioinformatics tools varies greatly and is often too low for a small bioinformatics group to work with.

This thesis is a continuation of a graduate project that revolved around the development of an extensive functional annotation pipeline emphasizing downstream analysis of genes.  Initial development of the pipeline focused on primary annotations involving ab initio gene prediction and protein/EST alignment to known hits in various databases.  These primary annotations merely scratched the surface of the overall function of each annotated gene.  Continued development of the pipeline has delved into the functional and structural analyses of each gene and its proteins, as well as prediction of regulatory, non-coding elements in the DNA.  These analyses include, but are not limited to: (1) automated homology modeling, (2) pathway assignment, (3) ncRNA prediction, and (4) de novo promoter element discovery.

This pipeline, known as STAT-GPS (Solazyme Total Annotation Tool for Genomic and Protein Sequences), utilizes a combination of open-source software and remote servers to attain the most reliable, accurate, and thorough functional annotation possible.  The program, which is developed in Python, is intended for both genomic and RNA transcripts, although genomic transcripts are the main goal.  The source code is available for download and redistribution on GitHub.  A formal paper intended for publication in the journal Bioinformatics is being written concurrently and will include supplementary data about the efficiency of this pipeline.

Malcolm Houtz, Class of 2015

In 2011, Gan et al. published work indicating that different accessions of Arabidopsis thaliana use gene models that differ from those annotated in the reference genome. An implication of this finding is that a large proportion of genes predicted to be damaged or knocked out (using the reference genome annotation) in non-reference accessions were in fact not affected by these mutations. The transcriptomes were reassembled for 18 accessions, and new annotation files were created.

Using RNA-Seq data already sequenced and assembled by Purugganan Laboratory, we propose to study and potentially re-annotate the transcriptomes of 4 rice accessions.

The first phase of the project involves matching gene IDs with known polymorphisms or indels to a large FPKM matrix. A summarized categorization of expressed and unexpressed genes will be delivered. The summary will give an indication of expression levels for genes predicted to be damaged. Each accession was tested under many different conditions, so summaries at different levels may make sense. This piece of the project is intended to extend Malcolm’s very basic R skills.
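The categorization step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's own code: the gene IDs, the FPKM values, the 1.0 FPKM cutoff, and the function name `summarize_damaged_genes` are all hypothetical, and the rule "expressed if any condition reaches the cutoff" is just one of several possible summary levels.

```python
# Hypothetical cutoff: the source does not state what counts as "expressed".
FPKM_THRESHOLD = 1.0


def summarize_damaged_genes(fpkm_by_gene, damaged_gene_ids, threshold=FPKM_THRESHOLD):
    """Categorize genes predicted to be damaged by their expression.

    fpkm_by_gene maps a gene ID to a list of FPKM values, one per condition.
    A gene is called expressed if at least one condition reaches the threshold.
    Genes absent from the matrix are reported separately.
    """
    summary = {"expressed": [], "unexpressed": [], "not_in_matrix": []}
    for gene_id in sorted(damaged_gene_ids):
        values = fpkm_by_gene.get(gene_id)
        if values is None:
            summary["not_in_matrix"].append(gene_id)
        elif values and max(values) >= threshold:
            summary["expressed"].append(gene_id)
        else:
            summary["unexpressed"].append(gene_id)
    return summary


# Illustrative toy data only (not real measurements).
fpkm_by_gene = {
    "gene_A": [0.0, 0.2, 0.1],
    "gene_B": [5.3, 12.1, 0.0],
    "gene_C": [1.0, 0.4, 0.9],
}
damaged_gene_ids = {"gene_A", "gene_B", "gene_C", "gene_D"}

result = summarize_damaged_genes(fpkm_by_gene, damaged_gene_ids)
for category, genes in result.items():
    print(category, len(genes), genes)
```

Summarizing per condition instead of per gene would only require replacing the `max(values)` test with a per-column comparison.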

If a significant number of genes that are predicted to be damaged are in fact expressed, transcriptomes will be reassembled and annotated. Using an existing General Feature Format (GFF) file, we will find additional, novel transcripts and create new GFFs for each of the 4 sequenced accessions.

Although familiar software (Cufflinks) does allow the discovery of novel transcripts, the method for updating an existing GFF with additional transcripts is currently unclear.

Final deliverables will be pipelines for transcript reassembly and updating GFFs with additional annotations.

Oscar Rodriguez, Undergrad Class of 2014


The ENCODE consortium produced functional genomics data in many cell types. Our goal is to annotate the active genomic functional elements in this diverse set of cell types. The challenge is that many of these cell types have little data available. We aim to leverage existing high quality annotations from six well-studied cell types in the production of annotations for the remaining cell types.

We use the genome annotation software Segway to perform annotations, augmented with entropic graph-based regularization (EGBR) to leverage existing annotations. We chose cell types that had at least two out of four distinct types of assays (DNase-seq, RNA-seq, histone modification ChIP-seq, and transcription factor ChIP-seq).
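The selection rule above (at least two of the four assay types) could be expressed along these lines. This is only a sketch under stated assumptions: the assay labels, the cell type names, and the function `select_cell_types` are hypothetical placeholders, and it does not reflect how the project actually queried ENCODE metadata.

```python
# The four assay categories named in the text; the exact label strings are assumptions.
ASSAY_TYPES = {"DNase-seq", "RNA-seq", "histone ChIP-seq", "TF ChIP-seq"}
MIN_ASSAY_TYPES = 2


def select_cell_types(assays_by_cell_type, min_types=MIN_ASSAY_TYPES):
    """Return the cell types that have at least `min_types` distinct assay types.

    assays_by_cell_type maps a cell type name to an iterable of assay labels.
    Repeated experiments of the same assay type count once, and labels
    outside the four recognized categories are ignored.
    """
    selected = []
    for cell_type, assays in assays_by_cell_type.items():
        distinct_types = set(assays) & ASSAY_TYPES
        if len(distinct_types) >= min_types:
            selected.append(cell_type)
    return sorted(selected)


# Placeholder metadata for illustration only.
assays_by_cell_type = {
    "cell_type_1": ["DNase-seq", "RNA-seq", "TF ChIP-seq"],
    "cell_type_2": ["DNase-seq", "DNase-seq"],
    "cell_type_3": ["histone ChIP-seq", "RNA-seq"],
    "cell_type_4": ["other-assay", "RNA-seq"],
}

selected = select_cell_types(assays_by_cell_type)
print(selected)
```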

We will produce functional annotations of 73 cell types.  These annotations will be made publicly available on the UCSC Genome Browser.  In addition, the project has successfully migrated the Segway+EGBR annotation software to the DNAnexus cloud computing platform.
