This repository contains the final project of Group 5 for the DNA/RNA Dynamics course (MSc in Bioinformatics, University of Bologna, a.y. 2024/2025). It features a complete pipeline for analyzing Illumina 450K methylation data in R. The pipeline includes preprocessing (with the preprocessNoob method), quality control, normalization, statistical analysis (a Mann-Whitney U test with a p-value threshold of 0.05), principal component analysis (PCA), and the identification of differentially methylated positions (DMPs) between control (CTRL) and disease (DIS) samples.
The project was carried out in R and RStudio, analysing data from the Illumina HumanMethylation450k platform (input_data). The necessary packages were installed in our R environment:
install.packages("BiocManager")
BiocManager::install("minfi")
install.packages("ggplot2")
install.packages("knitr")
install.packages("factoextra")
install.packages("cluster")
install.packages("qqman")
install.packages("gplots")
This document outlines the step-by-step workflow of our DNA/RNA methylation project, comparing CTRL and DIS sample groups using Illumina microarray data.
- Install and load required R packages: minfi, BiocManager, knitr.
- Clean the R environment and set the working directory.
- Load the sample sheet using read.metharray.sheet() to import metadata.
- Read raw data using read.metharray.exp() to generate the RGset object.
- Extract fluorescence intensity data for Red (Cy5) and Green (Cy3) channels from RGset using getRed() and getGreen().
- Store the data into two separate dataframes: Red and Green.
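The loading and extraction steps above could be wrapped as follows. This is only a minimal sketch: the helper name load_methylation_data is hypothetical (it is not part of the project code), and it assumes that minfi is installed and that the sample sheet and IDAT files sit in the input_data folder.

```r
# Hypothetical helper (not from the project): loads the sample sheet and the
# raw IDAT data, then extracts the Red and Green channel intensities.
# Assumes minfi is installed and `base_dir` holds the sample sheet and IDATs.
load_methylation_data <- function(base_dir = "input_data") {
  # Sample sheet (metadata), one row per sample
  targets <- minfi::read.metharray.sheet(base_dir)
  # Raw intensities for all samples listed in the sample sheet
  RGset <- minfi::read.metharray.exp(targets = targets)
  # Fluorescence intensities: one row per probe address, one column per sample
  Red <- data.frame(minfi::getRed(RGset))
  Green <- data.frame(minfi::getGreen(RGset))
  list(targets = targets, RGset = RGset, Red = Red, Green = Green)
}
```

Calling load_methylation_data("input_data") would return the metadata, the RGset object and the two intensity dataframes in a single list.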
- Classify sample quality based on the percentage of failed probes:
  - High quality: < 0.01%
  - Good quality: < 0.2%
  - Low quality: > 0.2%
  - Critical quality: around 1% (may require exclusion)
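The classification above can be illustrated with a small sketch on simulated data. Several things here are assumptions rather than the project's actual code: the matrix detP is made up and stands in for what minfi::detectionP(RGset) would return, a probe is counted as failed when its detection p-value exceeds 0.01, and "around 1%" is read as 1% or more.

```r
# Simulated detection p-values (made-up numbers): rows are probes,
# columns are samples.
set.seed(1)
n_probes <- 10000
detP <- matrix(runif(n_probes * 4, min = 0, max = 0.001), nrow = n_probes,
               dimnames = list(NULL, paste0("S", 1:4)))

# Inject a known number of failed probes per sample: 0, 5, 50 and 150
n_fail <- c(S1 = 0, S2 = 5, S3 = 50, S4 = 150)
for (s in names(n_fail)) {
  detP[seq_len(n_fail[[s]]), s] <- 0.5
}

# Assumption: a probe fails when its detection p-value exceeds 0.01
failed_pct <- colMeans(detP > 0.01) * 100

# Bins follow the list above; a value of exactly 0.2% is placed in "good"
classify_quality <- function(pct) {
  if (pct >= 1) {
    "critical"
  } else if (pct > 0.2) {
    "low"
  } else if (pct >= 0.01) {
    "good"
  } else {
    "high"
  }
}
quality <- vapply(failed_pct, classify_quality, character(1))
```

With the injected failures, the four simulated samples fall into the high, good, low and critical classes respectively.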
- Split samples into CTRL and DIS groups using metadata.
- Create MSet.raw objects for each group.
- Compute Beta values with getBeta() and M values with getM().
- Calculate and plot mean methylation values for both groups.
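To make the Beta and M value step concrete, here is a sketch on simulated intensities. The matrices meth and unmeth are made up and stand in for the methylated and unmethylated signals of an MSet.raw object; the formulas are the standard definitions (Beta as the methylated fraction, M as the log2 ratio), which, as far as we understand, is what getBeta() and getM() compute with their default settings.

```r
# Simulated methylated / unmethylated intensities (made-up numbers)
set.seed(42)
n_probes <- 1000
samples <- c("CTRL_1", "CTRL_2", "DIS_1", "DIS_2")
group <- c("CTRL", "CTRL", "DIS", "DIS")
dims <- list(paste0("cg", seq_len(n_probes)), samples)
meth <- matrix(rgamma(n_probes * 4, shape = 2, scale = 2000) + 1,
               nrow = n_probes, dimnames = dims)
unmeth <- matrix(rgamma(n_probes * 4, shape = 2, scale = 2000) + 1,
                 nrow = n_probes, dimnames = dims)

# Beta value: fraction of methylated signal, between 0 and 1
beta <- meth / (meth + unmeth)
# M value: log2 ratio of methylated to unmethylated signal
M <- log2(meth / unmeth)

# Mean Beta per probe within each group
mean_beta_ctrl <- rowMeans(beta[, group == "CTRL"])
mean_beta_dis <- rowMeans(beta[, group == "DIS"])

# Density estimates that can be drawn with plot() and lines()
d_ctrl <- density(mean_beta_ctrl)
d_dis <- density(mean_beta_dis)
```

The two quantities are linked by M = log2(Beta / (1 - Beta)), so either one can be recovered from the other when no offset is used.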
- Normalise data and assess the quality of normalisation.
- Perform Principal Component Analysis (PCA) to assess batch effects (e.g., by Sentrix_ID).
- PCA suggested that normalization did not fully correct batch effects.
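The PCA check for batch effects might look like the sketch below. It uses base R's prcomp() on a simulated Beta matrix; the sample names, the two chip labels and the artificial chip-specific shift are all invented for illustration and do not come from the project data.

```r
# Simulated phenotype table and Beta matrix (made-up values)
set.seed(7)
n_probes <- 500
pheno <- data.frame(
  Sample = paste0("S", 1:8),
  Group = rep(c("CTRL", "DIS"), each = 4),
  Sentrix_ID = rep(c("chipA", "chipB"), times = 4)
)
probe_mean <- runif(n_probes, 0.2, 0.8)
beta <- probe_mean + matrix(rnorm(n_probes * 8, sd = 0.02), nrow = n_probes)
colnames(beta) <- pheno$Sample

# Add an artificial shift to every sample on chipB to mimic a batch effect
is_b <- pheno$Sentrix_ID == "chipB"
beta[, is_b] <- beta[, is_b] + 0.05

# PCA on samples: prcomp() expects samples in rows, hence the transpose
pca <- prcomp(t(beta), scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scores for the first two components, annotated with batch and group
scores <- data.frame(pca$x[, 1:2],
                     Sentrix_ID = pheno$Sentrix_ID,
                     Group = pheno$Group)
```

Plotting PC1 against PC2 and colouring the points by Sentrix_ID (for example with ggplot2) shows whether samples separate by chip rather than by CTRL/DIS status, which is the pattern that indicates a residual batch effect.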
- Perform a Mann-Whitney U test for each probe, comparing CTRL vs DIS.
- Create a dataframe containing p-values.
- Filter probes with p ≤ 0.05 for significance.
- Plot the distribution of p-values.
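The per-probe testing steps above can be sketched with base R's wilcox.test(), which implements the Mann-Whitney U test for two independent groups. The Beta matrix below is simulated, with the first ten probes deliberately shifted in the DIS group; the object names are illustrative and may differ from those in the project scripts.

```r
# Simulated Beta matrix: 200 probes, 4 CTRL and 4 DIS samples (made-up values)
set.seed(123)
n_probes <- 200
group <- factor(rep(c("CTRL", "DIS"), each = 4))
beta <- matrix(runif(n_probes * 8, 0.2, 0.4), nrow = n_probes,
               dimnames = list(paste0("cg", seq_len(n_probes)),
                               paste0("S", 1:8)))
# Make the first 10 probes clearly hypermethylated in DIS
beta[1:10, group == "DIS"] <- beta[1:10, group == "DIS"] + 0.5

# Mann-Whitney U test for one probe (one row of the Beta matrix)
mann_whitney_p <- function(x) wilcox.test(x ~ group)$p.value

# Apply the test to every probe and collect the p-values in a dataframe
pvals <- apply(beta, 1, mann_whitney_p)
results <- data.frame(probe = rownames(beta), p_value = pvals)

# Keep probes at or below the 0.05 threshold
significant <- results[results$p_value <= 0.05, ]

# Binned p-values for the distribution plot (set plot = TRUE to draw it)
p_hist <- hist(results$p_value, breaks = 20, plot = FALSE)
```

One point worth keeping in mind with this test: with four samples per group, the smallest attainable two-sided exact p-value is 2/70 (about 0.029), so only probes whose groups are completely separated pass the 0.05 threshold in this toy setting.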
- Generate a heatmap using the top 100 most significant probes.
- Apply hierarchical clustering.
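A sketch of the last two steps, again on simulated data. Selecting the 100 probes with the smallest p-values and clustering samples with hclust() on Euclidean distances with complete linkage are assumptions made for this example; the distance and linkage settings used in the project may differ.

```r
# Simulated Beta matrix: 300 probes, 4 CTRL and 4 DIS samples (made-up values)
set.seed(99)
n_probes <- 300
group <- rep(c("CTRL", "DIS"), each = 4)
beta <- matrix(runif(n_probes * 8, 0.2, 0.4), nrow = n_probes,
               dimnames = list(paste0("cg", seq_len(n_probes)),
                               paste0(group, "_", 1:4)))
# First 100 probes differ between groups
beta[1:100, group == "DIS"] <- beta[1:100, group == "DIS"] + 0.4

# Per-probe Mann-Whitney p-values
pvals <- apply(beta, 1, function(x) {
  wilcox.test(x[group == "CTRL"], x[group == "DIS"])$p.value
})

# Matrix restricted to the 100 probes with the smallest p-values
top100 <- beta[order(pvals)[1:100], ]

# Hierarchical clustering of samples (columns) on the selected probes
hc <- hclust(dist(t(top100)), method = "complete")
clusters <- cutree(hc, k = 2)
```

The top100 matrix can then be passed to heatmap.2() from gplots (for example with trace = "none" and a ColSideColors vector encoding CTRL/DIS) to draw the heatmap together with the dendrograms.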
Project members:
- Marco Cuscunà
- Marco Centenaro
- Michele Carbonieri
- Marina Mariano
- Luca Cagnini
- Massimo Lanari