GWAS Analysis Demonstration in R
GWAS Analysis Demonstration in R
Load the required libraries
library(rrBLUP)
library(BGLR)
library(DT)
library(SNPRelate)
library(dplyr)
library(qqman)
library(poolR)
Marker and Phenotypic Data
Description of the data
- Here we will be using rice data available at http://ricediversity.org/data/index.cfm.
- The marker data includes 44,100 SNP markers for 413 diverse accessions/genotypes of O. sativa.
- The phenotypic data includes 34 traits phenotyped for 413 acessions.
- The details about data and description of study can be found in manuscript by Zhao et al (2011).
- Here we will extract the first trait and use for GWAS analysis
Read the marker data
Description of the code
- We will upload the marker data files saved in working directory in datafiles folder.
- The data was downloaded from http://ricediversity.org/data/sets/44kgwas/.
- When we unzip the file downloaded from link above, you will find three files .fam (contains names about the acessions/genotypes), .map (contains map information including chromosome, position and marker name) and .ped (conatins the marker allele data in 0, 1, 2, and 3 format; 2 represents missing data). The files are plink converted files.
- 1 represents heterozygous, 0 and 3 represents homozygous for major and minor allele.
- We will read all the three data files. For reading .ped we will use BGLR package.
- We will recode the marker data to 0 (homozygous, major allele), 1 (Heterozygous) and 2 (homozygous, minor allele) format and missing as NA.
- Finally, we will convert the marker data into matrix and transpose it
# Read the marker data files. Marker data
rm(list = ls())
setwd("~/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/Data/RiceDiversity_44K_Genotypes_PLINK")
Geno <- read_ped("sativas413.ped")
p = Geno$p
n = Geno$n
Geno = Geno$x
# Acession information
FAM <- read.table("sativas413.fam")
# Map information
MAP <- read.table("sativas413.map")
# Recode the data in ped file
Geno[Geno == 2] <- NA # Converting missing data to NA
Geno[Geno == 0] <- 0 # Converting 0 data to 0
Geno[Geno == 1] <- 1 # Converting 1 to 1
Geno[Geno == 3] <- 2 # Converting 3 to 2
# Convert the marker data into matrix and transponse and check dimensions
Geno <- matrix(Geno, nrow = p, ncol = n, byrow = TRUE)
Geno <- t(Geno)
dim(Geno)
[1] 413 36901
Read the phenotypic data
Description of the code
- Reading the phenotypic data directly from website.
- Then viewing the first 5 rows and columns.
- Also checking the dimensions of file.
- Data file details:
- NFSTV ID represents the name of acessions
- Total number of traits in the file are 36
- Total number of acessions are 413
- NA represents missing data
- Extract the first trait (flowering time) and drop the missing ones and match with marker file
R code
rice.pheno <- read.table("http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt",
header = TRUE, stringsAsFactors = FALSE, sep = "\t")
# See first few columns and rows of the data
rice.pheno[1:5, 1:5]
HybID NSFTVID Flowering.time.at.Arkansas Flowering.time.at.Faridpur
1 081215-A05 1 75.08333 64
2 081215-A06 3 89.50000 66
3 081215-A07 4 94.50000 67
4 081215-A08 5 87.50000 70
5 090414-A09 6 89.08333 73
Flowering.time.at.Aberdeen
1 81
2 83
3 93
4 108
5 101
dim(rice.pheno)
[1] 413 38
# datatable(rice.pheno, rownames = FALSE, filter='top', options =
# list(pageLength = 3, scrollX=T)) assign the row names to marker file and
# compare it with phenotypic file
rownames(Geno) <- FAM$V2
table(rownames(Geno) == rice.pheno$NSFTVID)
TRUE
413
# Now let us extract the first trait and assign it to object y
y <- matrix(rice.pheno$Flowering.time.at.Arkansas) # # use the first trait
rownames(y) <- rice.pheno$NSFTVID
index <- !is.na(y)
y <- y[index, 1, drop = FALSE] # 374
Geno <- Geno[index, ] # 374 x 36901
table(rownames(Geno) == rownames(y))
TRUE
374
Quality Control check of Marker Data
Description
- In this section we will describe how to perform the qulaity control of marker data before GWAS.
- We will ipute the marker data and filter for minor allele frequency
Imputing: Replacing missing values with mean values
Description
- Here we will be using for loop to indetify the missing ones and convert them with mean values across columns.
for (j in 1:ncol(Geno)) {
Geno[, j] <- ifelse(is.na(Geno[, j]), mean(Geno[, j], na.rm = TRUE), Geno[,
j])
}
Filtering for minor alleles
Description
- Here we will droping the alleles with frequency < 5% and saving the object as Geno1
- Second we will match and drop the markers from map info file and save it MAP1
# Filter for minor alleles
p <- colSums(Geno)/(2 * nrow(Geno))
maf <- ifelse(p > 0.5, 1 - p, p)
maf.index <- which(maf < 0.05)
Geno1 <- Geno[, -maf.index]
dim(Geno1)
[1] 413 33701
# Check the number of markers dropped Match with the map info file and save
# it as MAP1
setwd("~/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/Data/RiceDiversity_44K_Genotypes_PLINK")
MAP <- read.table("sativas413.map")
dim(MAP)
[1] 36901 4
MAP1 <- MAP[-maf.index, ]
dim(MAP1)
[1] 33701 4
Population structure
Description
- Now, we will look inot the popultion structure which may lead to suprious associations if not accounted.
- Here, we will be using PCA which provides fast and effective way to diagnose the population structure.
- We will run the PCA analysis using SNPRelate Package and plot it to see the structure.
- First we will create matrix and assing the row and colum names to the geno file. The we will create the gds formate file required by SNPRelate package and perform the PCA analysis.
- We will add the population information to the PCA output file and plot it.
# Create geno matrix file and assign the row and column names from fam and
# map files
Geno1 <- as.matrix(Geno1)
rownames(Geno1) <- FAM$V2
sample <- row.names(Geno1)
length(sample)
[1] 413
colnames(Geno1) <- MAP1$V2
snp.id <- colnames(Geno1)
length(snp.id)
[1] 33701
# create gds formate file with marker and sample ids and save it as 44k.gds
snpgdsCreateGeno("44k.gds", genmat = Geno1, sample.id = sample, snp.id = snp.id,
snp.chromosome = MAP1$V1, snp.position = MAP1$V4, snpfirstdim = FALSE)
# Now open the 44k.gds file
geno_44k <- snpgdsOpen("44k.gds")
snpgdsSummary("44k.gds")
The file name: /Users/waseemhussain/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/R_codes/44k.gds
The total number of samples: 413
The total number of SNPs: 33701
SNP genotypes are stored in SNP-major mode (Sample X SNP).
# Now perform the pca analysis and plot it
pca <- snpgdsPCA(geno_44k, snp.id = colnames(Geno1))
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (non-autosomes or non-selection)
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 413 samples, 33,701 SNPs
using 1 (CPU) core
PCA: the sum of all selected genotypes (0,1,2) = 8315382
CPU capabilities: Double-Precision SSE2
Sun Mar 3 15:27:27 2019 (internal increment: 1812)
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed in 1s
Sun Mar 3 15:27:28 2019 Begin (eigenvalues and eigenvectors)
Sun Mar 3 15:27:28 2019 Done.
pca <- data.frame(sample.id = row.names(Geno1), EV1 = pca$eigenvect[, 1], EV2 = pca$eigenvect[,
2], EV3 = pca$eigenvect[, 3], EV3 = pca$eigenvect[, 4], stringsAsFactors = FALSE)
# Plot the PCA
plot(pca$EV2, pca$EV1, xlab = "eigenvector 3", ylab = "eigenvector 4")
# Now let us add the population information to the plot. Here we will be
# using the population information from the PCA file available online
pca_1 <- read.csv("http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.germplasm.csv",
header = TRUE, skip = 1, stringsAsFactors = FALSE) # 431 x 12
pca_2 <- pca_1[match(pca$sample.id, pca_1$NSFTV.ID), ]
table(pca_1$sample.id == pca_2$NSFTV.ID)
< table of extent 0 >
# Extract the population information and add the pca output file
pca_population <- cbind(pca_2$Sub.population, pca)
colnames(pca_population)[1] <- "population"
# Plot and add the population names
plot(pca_population$EV1, pca_population$EV2, xlab = "PC1", ylab = "PC2", col = c(1:6)[factor(pca_population$population)])
legend(x = "topright", legend = levels(factor(pca_population$population)), col = c(1:6),
pch = 1, cex = 0.6)
GWAS analysis
Description
- GWAS analysis will be done in rrBLUP package.
- we will prepare the genotypic file including markers, and map information
- We will run the GWAS analysis in rrBLUP package
- Correct for multiple testing
- And finally we will look for significant markers and draw the Manhattan plot.
Run the GWAS analysis
# create the geno file for rrBLUP package GWAS analysis
geno_final <- data.frame(marker = MAP1[, 2], chrom = MAP1[, 1], pos = MAP1[,
4], t(Geno1 - 1), check.names = FALSE) # W = \in{-1, 0, 1}
dim(Geno1)
[1] 413 33701
# create the pheno file
pheno_final <- data.frame(NSFTV_ID = rownames(y), y = y)
# Run the GWAS analysis
GWAS <- GWAS(pheno_final, geno_final, min.MAF = 0.05, P3D = TRUE, plot = FALSE)
[1] "GWAS for trait: y"
[1] "Variance components estimated. Testing markers."
Correct for Multiple Testing
- We will be using Li and Ji (2005) method for multiple testing and choose significance threshold level
- The reason we use Li an Ji (2005) is to make sure to account for correlated markers (LD between markers) and determine independent number of tests
- As it requires lot of memomary, we will run the anlysis seperately for each chromosome and then finally sum them all
- Below is the code for multiple tesing:
# Read the genotypic file and create a matrix for each chromosome
corr.matrix1 <- cor(Geno1[, 1:5888])
corr.matrix2 <- cor(Geno1[, 5889:9439])
corr.matrix3 <- cor(Geno1[, 9440:13464])
corr.matrix4 <- cor(Geno1[, 13465:16091])
corr.matrix5 <- cor(Geno1[, 16092:18701])
corr.matrix6 <- cor(Geno1[, 18702:21660])
corr.matrix7 <- cor(Geno1[, 21661:23587])
corr.matrix8 <- cor(Geno1[, 23588:25668])
corr.matrix9 <- cor(Geno1[, 25669:27501])
corr.matrix10 <- cor(Geno1[, 27502:29121])
corr.matrix11 <- cor(Geno1[, 29122:31752])
corr.matrix12 <- cor(Geno1[, 31753:33719])
# Now use the meff function from pacakge to get effective number of tests
# for each chromosome
meff_liji_1 <- meff(corr.matrix1, method = "liji")
meff_liji_2 <- meff(corr.matrix2, method = "liji")
meff_liji_3 <- meff(corr.matrix3, method = "liji")
meff_liji_4 <- meff(corr.matrix4, method = "liji")
meff_liji_5 <- meff(corr.matrix5, method = "liji")
meff_liji_6 <- meff(corr.matrix6, method = "liji")
meff_liji_7 <- meff(corr.matrix7, method = "liji")
meff_liji_8 <- meff(corr.matrix8, method = "liji")
meff_liji_9 <- meff(corr.matrix9, method = "liji")
meff_liji_10 <- meff(corr.matrix10, method = "liji")
meff_liji_11 <- meff(corr.matrix11, method = "liji")
meff_liji_12 <- meff(corr.matrix12, method = "liji")
# Now sum up all the effective tests to get effective number of independent
# tests
Meff <- sum(meff_liji_1, meff_liji_2, meff_liji_3, meff_liji_4, meff_liji_5,
meff_liji_6, meff_liji_7, meff_liji_8, meff_liji_9, meff_liji_10, meff_liji_11,
meff_liji_12)
Determine the significance threshold level
\[ \alpha_{p}= 1- (1- \alpha{e})^1/Meff\\ Meff= \text {is the effective number of independent tests}\\ \]
Meff = 3948
p_threshold = (1 - (1 - 0.05))^1/3948
p_threshold
[1] 1.266464e-05
Extracting significant markers
GWAS_1 <- GWAS %>% filter(y != "0")
# List of significant SNPs
GWAS_1 %>% filter(y < 1e-04)
marker chrom pos y
1 id1004849 1 6219761 2.870307e-05
2 id1021674 1 34699187 4.652320e-05
3 id1022236 1 35386431 2.366215e-05
4 id2005536 2 12153311 3.525448e-05
5 id2005948 2 14158673 8.091814e-05
6 id10000932 10 3091111 1.454170e-05
7 id12002616 12 6227524 7.627332e-05
8 id12004093 12 10604192 9.459465e-05
Note the markers crossing the significance threshold level.
Draw the Manhattan plot
manhattan(x = GWAS_1, chr = "chrom", bp = "pos", p = "y", snp = "marker", col = c("blue4",
"orange3"), suggestiveline = -log10(1e-04), logp = TRUE)
=====================================END==========================