Research

My research sits at the intersection of population health, data science, and disease prevention. I investigate the epidemiology of major health conditions with the goal of translating findings into actionable public health strategies.

📈Cancer Epidemiology

I conduct research on the patterns, causes, prevention, and control of cancer in diverse populations. This includes identifying risk factors, understanding disparities, and evaluating the effectiveness of screening and prevention programs.

🦠Infectious Diseases

My expertise also covers the epidemiology of infectious diseases. This includes outbreak investigation, transmission dynamics, vaccine efficacy, and the impact of infectious agents on population health, particularly in relation to cancer and neurological conditions.

🧬Cancer Bioinformatics

A significant portion of my work involves cancer bioinformatics, where I draw on large-scale public datasets. I have substantial hands-on experience in developing and applying machine learning and deep learning models to analyze complex biological data, aiming to uncover novel insights into cancer mechanisms, identify biomarkers, and improve prediction models.

🧠Neuroepidemiology

I investigate the distribution and determinants of neurological diseases within populations. This area of my research seeks to understand risk factors, identify susceptible groups, and contribute to the prevention and management of neurological disorders.

Bulk Transcriptomics

My research interest centers on bulk transcriptomics, with a focus on understanding gene expression dynamics in complex diseases such as cancer and neurodegenerative disorders. I am particularly interested in leveraging large-scale public RNA-seq datasets to identify molecular signatures, characterize transcriptional heterogeneity, and uncover biomarkers that can inform diagnosis, prognosis, and therapeutic strategies. By integrating advanced computational approaches with biological insights, my work aims to bridge the gap between high-throughput transcriptomic data and clinically relevant applications.
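As a rough illustration of the kind of first-pass calculation behind a molecular signature, the sketch below normalizes raw counts to counts per million (CPM) and computes a log2 fold change between two groups. The gene names, sample names, and counts are made up for the example, and the pseudocount of 1 is an assumption. A real analysis would use a dedicated statistical framework such as DESeq2, edgeR, or limma rather than a bare ratio of means.

```python
import math

# Toy raw read counts (made-up numbers): gene -> counts per sample.
counts = {
    "GENE_A": {"tumor_1": 500, "tumor_2": 700, "normal_1": 100, "normal_2": 140},
    "GENE_B": {"tumor_1": 300, "tumor_2": 300, "normal_1": 300, "normal_2": 260},
    "GENE_C": {"tumor_1": 200, "tumor_2": 1000, "normal_1": 600, "normal_2": 600},
}


def counts_per_million(counts):
    """Scale each sample so that its counts sum to one million."""
    samples = sorted({s for per_gene in counts.values() for s in per_gene})
    library_size = {
        s: sum(per_gene[s] for per_gene in counts.values()) for s in samples
    }
    return {
        gene: {s: per_gene[s] / library_size[s] * 1e6 for s in samples}
        for gene, per_gene in counts.items()
    }


def log2_fold_change(cpm, group_a, group_b, pseudocount=1.0):
    """log2 of (mean CPM in group_a) over (mean CPM in group_b), per gene."""
    result = {}
    for gene, per_sample in cpm.items():
        mean_a = sum(per_sample[s] for s in group_a) / len(group_a)
        mean_b = sum(per_sample[s] for s in group_b) / len(group_b)
        # The pseudocount avoids division by zero for unexpressed genes.
        result[gene] = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    return result


cpm = counts_per_million(counts)
lfc = log2_fold_change(cpm, ["tumor_1", "tumor_2"], ["normal_1", "normal_2"])
for gene in sorted(lfc, key=lfc.get, reverse=True):
    print(f"{gene}: log2FC = {lfc[gene]:.2f}")
```

A fold change alone says nothing about statistical significance; it is only the effect-size half of a differential expression result.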

Single Cell Genomics

I specialize in single-cell genomic technologies such as single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq (scATAC-seq), which enable high-resolution analysis of cellular heterogeneity and gene regulation. I design single-cell experiments and apply advanced statistical and computational methods to analyze complex single-cell datasets. I regularly use both Scanpy (Python) and Seurat (R) to build robust, reproducible analysis pipelines that uncover meaningful biological insights from diverse cell populations.
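A typical early step in a single-cell pipeline is per-cell quality control. The sketch below is a dependency-free illustration of that idea only; it does not use Scanpy or Seurat, both of which provide their own QC utilities. The barcodes, counts, and thresholds (`min_genes=2`, `max_pct_mito=20.0`) are assumptions chosen for the toy data, not recommended defaults.

```python
# Toy count data (made-up): cell barcode -> {gene: UMI count}.
cells = {
    "cell_1": {"CD3E": 5, "MS4A1": 0, "LYZ": 2, "MT-CO1": 1, "MT-ND1": 0},
    "cell_2": {"CD3E": 0, "MS4A1": 0, "LYZ": 1, "MT-CO1": 6, "MT-ND1": 3},
    "cell_3": {"CD3E": 0, "MS4A1": 4, "LYZ": 0, "MT-CO1": 0, "MT-ND1": 1},
}


def qc_metrics(gene_counts, mito_prefix="MT-"):
    """Total counts, number of detected genes, and percent mitochondrial counts."""
    total = sum(gene_counts.values())
    n_genes = sum(1 for c in gene_counts.values() if c > 0)
    mito = sum(c for g, c in gene_counts.items() if g.startswith(mito_prefix))
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "n_genes": n_genes, "pct_mito": pct_mito}


def filter_cells(cells, min_genes=2, max_pct_mito=20.0):
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    kept = {}
    for barcode, gene_counts in cells.items():
        metrics = qc_metrics(gene_counts)
        if metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= max_pct_mito:
            kept[barcode] = gene_counts
    return kept


kept = filter_cells(cells)
for barcode, gene_counts in cells.items():
    status = "kept" if barcode in kept else "dropped"
    print(barcode, qc_metrics(gene_counts), status)
```

A high mitochondrial fraction is commonly read as a sign of a stressed or dying cell, which is why it is used as a filter here; appropriate cutoffs depend on tissue and protocol.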

What I Use

Cancer Bioinformatics

Data Sources

The Cancer Genome Atlas (TCGA): TCGA is a comprehensive collection of multi-dimensional cancer genomics data covering multiple cancer types.
Gene Expression Omnibus (GEO): GEO is a public repository hosted by the National Center for Biotechnology Information (NCBI) containing a vast collection of gene expression data, including cancer datasets.
National Cancer Institute (NCI) Genomic Data Commons (GDC): GDC is an open-access data portal providing access to a wide range of cancer genomics datasets.
CZ CELLxGENE Discover (cellxgene.cziscience.com): Chan Zuckerberg CELL by GENE Discover lets users download and visually explore reference-quality data to understand the functionality of human tissues at the cellular level.
10x Genomics: Publicly available single-cell and in situ datasets from 10x Genomics.
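Several of the sources above can be queried programmatically. As a hedged sketch, the snippet below builds (but does not send) a request URL for the GDC API `files` endpoint, using the nested `op`/`content` filter structure that the GDC API accepts. The helper name `build_gdc_files_query` is hypothetical, and the field names and example values (`cases.project.project_id`, `files.data_type`, `TCGA-BRCA`, `Gene Expression Quantification`) are written from memory and should be checked against the current GDC documentation before use.

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_files_query(project_id, data_type, size=10):
    """Return a GET URL for the GDC files endpoint (hypothetical helper).

    Field names are assumptions based on the GDC API documentation.
    """
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {"field": "files.data_type", "value": [data_type]},
            },
        ],
    }
    params = {
        "filters": json.dumps(filters),  # filters travel as a JSON string
        "fields": "file_id,file_name",
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_FILES_ENDPOINT}?{urlencode(params)}"


url = build_gdc_files_query("TCGA-BRCA", "Gene Expression Quantification")
print(url)
```

The URL can then be fetched with any HTTP client. Keeping query construction separate from the network call makes the filter easy to inspect and adjust.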

Analysis Tools

UCSC Xena: An online exploration tool for public and private, multi-omic and clinical/phenotype data
GEO2R: GEO2R is an interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions. Results are presented as a table of genes ordered by P-value, and as a collection of graphic plots to help visualize differentially expressed genes and assess data set quality. GEO2R uses a variety of R packages from the Bioconductor project. Bioconductor is an open-source software project based on the R programming language that provides tools for the analysis of high-throughput genomic data.
GEPIA2: GEPIA2 is a web-based tool for analyzing gene expression data in cancer. It stands for Gene Expression Profiling Interactive Analysis 2 and is an updated version of the original GEPIA tool. GEPIA2 allows users to explore gene expression patterns, perform survival analyses, and visualize gene expression data across various cancer types.
TIMER2.0: TIMER is a comprehensive resource for the systematic analysis of immune infiltrates across diverse cancer types. This version of the web server provides immune infiltrate abundances estimated by multiple immune deconvolution methods and allows users to dynamically generate high-quality figures for comprehensive exploration of tumor immunological, clinical, and genomic features.
UALCAN: UALCAN is a web-based platform that provides interactive and comprehensive analysis of cancer transcriptome data. It enables users to explore gene expression patterns, perform survival analyses, and compare gene expression between tumor and normal samples across different cancer types. UALCAN utilizes data from The Cancer Genome Atlas (TCGA) to facilitate cancer research and provide insights into tumor biology.
cBioPortal for Cancer Genomics: cBioPortal hosts a large collection of cancer genomics datasets, allowing users to explore and visualize the data.
GREIN (GEO RNA-seq Experiments Interactive Navigator): GREIN is an interactive web platform that provides user-friendly options to explore and analyze GEO RNA-seq data. GREIN is powered by a back-end computational pipeline for uniform processing of RNA-seq data and by a large number (>6,000) of already processed datasets. These datasets were retrieved from GEO and reprocessed consistently by the back-end GEO RNA-seq experiments processing pipeline (GREP2).
UCSC Cancer Genomics Browser: The UCSC Cancer Genomics Browser offers a comprehensive collection of cancer genomics data integrated with genomic annotations.
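Several of the tools above offer survival analyses, which are typically displayed as Kaplan-Meier curves. For readers unfamiliar with the estimator, here is a minimal sketch of the product-limit calculation in plain Python. It is an illustration of the general technique, not the implementation used by any of these web tools, and the follow-up times and event indicators are made up.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times:  follow-up duration for each subject
    events: 1 if the event was observed, 0 if the subject was censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        n_at_risk -= removed
    return curve


# Made-up example: four subjects, one censored at time 2.
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

Censored subjects do not lower the curve themselves, but they shrink the risk set, which makes each later event count for more.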

R Packages

TCGAbiolinks: An R/Bioconductor package for integrative analysis with GDC data. TCGAbiolinks accesses the National Cancer Institute (NCI) Genomic Data Commons (GDC) through its GDC Application Programming Interface (API) to search, download, and prepare relevant data for analysis in R.
maftools: Summarize, Analyze and Visualize MAF Files. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner from either TCGA sources or any in-house studies as long as the data is in MAF format.
SummarizedExperiment: The SummarizedExperiment container contains one or more assays, each represented by a matrix-like object of numeric or other mode. The rows typically represent genomic ranges of interest and the columns represent samples.
MutationalPatterns: Comprehensive genome-wide analysis of mutational processes. The package covers a wide range of patterns including mutational signatures, transcriptional and replicative strand bias, lesion segregation, genomic distribution, and association with genomic features, which are collectively meaningful for studying the activity of mutational processes.
GenVisR: Short for "Genomic Visualizations in R," this tool provides visualization capabilities tailored to a variety of genomic data types, including data common in cancer research such as somatic mutations, copy number variations, and more.
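To make the MAF format mentioned above concrete, the sketch below tallies a tiny, made-up MAF excerpt: how many samples carry a mutation in each gene, and how often each variant class occurs. It is written in plain Python to stay dependency-free and is not the maftools API; the helper name `summarize_maf` is hypothetical, and the sample barcodes and rows are invented. Only three standard MAF columns are used (`Hugo_Symbol`, `Variant_Classification`, `Tumor_Sample_Barcode`).

```python
import csv
import io
from collections import Counter

# Tiny made-up MAF excerpt (tab-separated) with three standard columns.
maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE-01\n"
    "TP53\tNonsense_Mutation\tSAMPLE-02\n"
    "PIK3CA\tMissense_Mutation\tSAMPLE-01\n"
    "TP53\tMissense_Mutation\tSAMPLE-03\n"
    "PIK3CA\tSilent\tSAMPLE-03\n"
)


def summarize_maf(handle):
    """Count mutated samples per gene and occurrences of each variant class."""
    samples_per_gene = {}
    classes = Counter()
    # Skip comment lines such as the leading "#version" header.
    rows = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        classes[row["Variant_Classification"]] += 1
        samples_per_gene.setdefault(row["Hugo_Symbol"], set()).add(
            row["Tumor_Sample_Barcode"]
        )
    mutated_samples = {gene: len(s) for gene, s in samples_per_gene.items()}
    return mutated_samples, classes


mutated_samples, classes = summarize_maf(io.StringIO(maf_text))
print(mutated_samples)
print(classes.most_common())
```

Counting distinct samples per gene, rather than rows, avoids inflating a gene's frequency when one sample carries several mutations in it.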

Data Wrangling

readxl for importing data into R
dplyr, tidyr and others from the tidyverse for data preparation.

Data Visualization

ggplot2 for the vast majority of the graphics, together with hrbrthemes for styling.
patchwork to put graphics together.
ggraph and igraph for most of the network-related graphics.
plotly and other HTML widgets for interactive graphics.
RColorBrewer, viridis, and colormap to control color in charts.
ggrepel and other ggplot2 extensions that make your life simpler.
heatmaply for most of the heatmaps.

Publication-ready Tables

gtsummary for creating publication-ready descriptive and analytical tables.
gt to customize tables and export them as Word (.docx) or LaTeX (.tex) files.

Reproducible Research

R Markdown to produce statistical reports.
Quarto to build about 95% of my course websites and other sites.

Statistical Modeling

easystats for easy statistical modeling, visualization, and reporting

Data Science, Machine Learning and Deep Learning

NumPy for scientific computing.
Pandas for data wrangling and analysis
Matplotlib for data visualization
Seaborn for advanced statistical visualizations
Plotly for interactive data visualization
researchpy to summarize data and perform statistical tests.
Dask for big data analysis
scikit-learn for machine learning
scikit-image for life science image manipulation