GWAS Analysis Demonstration in R

Marker and Phenotypic Data
- Read the marker data
- Read the phenotypic data
Quality Control check of Marker Data
- Imputing: Replacing missing values with mean values
- Filtering for minor alleles
Population structure
GWAS analysis

Load the required libraries

library(rrBLUP)
library(BGLR)
library(DT)
library(SNPRelate)
library(dplyr)
library(qqman)
library(poolR)

Marker and Phenotypic Data

Description of the data

Here we will be using rice data available at http://ricediversity.org/data/index.cfm.
The marker data includes 44,100 SNP markers for 413 diverse accessions/genotypes of O. sativa.
The phenotypic data includes 34 traits phenotyped for 413 acessions.
The details about data and description of study can be found in manuscript by Zhao et al (2011).
Here we will extract the first trait and use for GWAS analysis

Read the marker data

Description of the code

We will upload the marker data files saved in working directory in datafiles folder.
The data was downloaded from http://ricediversity.org/data/sets/44kgwas/.
When we unzip the file downloaded from link above, you will find three files .fam (contains names about the acessions/genotypes), .map (contains map information including chromosome, position and marker name) and .ped (conatins the marker allele data in 0, 1, 2, and 3 format; 2 represents missing data). The files are plink converted files.
1 represents heterozygous, 0 and 3 represents homozygous for major and minor allele.
We will read all the three data files. For reading .ped we will use BGLR package.
We will recode the marker data to 0 (homozygous, major allele), 1 (Heterozygous) and 2 (homozygous, minor allele) format and missing as NA.
Finally, we will convert the marker data into matrix and transpose it

# Read the marker data files.  Marker data
rm(list = ls())
setwd("~/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/Data/RiceDiversity_44K_Genotypes_PLINK")
Geno <- read_ped("sativas413.ped")
p = Geno$p
n = Geno$n
Geno = Geno$x
# Acession information
FAM <- read.table("sativas413.fam")
# Map information
MAP <- read.table("sativas413.map")
# Recode the data in ped file
Geno[Geno == 2] <- NA  # Converting missing data to NA
Geno[Geno == 0] <- 0  # Converting 0 data to 0
Geno[Geno == 1] <- 1  # Converting 1 to 1
Geno[Geno == 3] <- 2  # Converting 3 to 2
# Convert the marker data into matrix and transponse and check dimensions
Geno <- matrix(Geno, nrow = p, ncol = n, byrow = TRUE)
Geno <- t(Geno)
dim(Geno)

[1]   413 36901

Read the phenotypic data

Description of the code

Reading the phenotypic data directly from website.
Then viewing the first 5 rows and columns.
Also checking the dimensions of file.
Data file details:
- NFSTV ID represents the name of acessions
- Total number of traits in the file are 36
- Total number of acessions are 413
- NA represents missing data
Extract the first trait (flowering time) and drop the missing ones and match with marker file

R code

rice.pheno <- read.table("http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt", 
    header = TRUE, stringsAsFactors = FALSE, sep = "\t")
# See first few columns and rows of the data
rice.pheno[1:5, 1:5]

       HybID NSFTVID Flowering.time.at.Arkansas Flowering.time.at.Faridpur
1 081215-A05       1                   75.08333                         64
2 081215-A06       3                   89.50000                         66
3 081215-A07       4                   94.50000                         67
4 081215-A08       5                   87.50000                         70
5 090414-A09       6                   89.08333                         73
  Flowering.time.at.Aberdeen
1                         81
2                         83
3                         93
4                        108
5                        101

dim(rice.pheno)

[1] 413  38

# datatable(rice.pheno, rownames = FALSE, filter='top', options =
# list(pageLength = 3, scrollX=T)) assign the row names to marker file and
# compare it with phenotypic file
rownames(Geno) <- FAM$V2
table(rownames(Geno) == rice.pheno$NSFTVID)


TRUE 
 413

# Now let us extract the first trait and assign it to object y
y <- matrix(rice.pheno$Flowering.time.at.Arkansas)  # # use the first trait 
rownames(y) <- rice.pheno$NSFTVID
index <- !is.na(y)
y <- y[index, 1, drop = FALSE]  # 374
Geno <- Geno[index, ]  # 374 x 36901
table(rownames(Geno) == rownames(y))


TRUE 
 374

Quality Control check of Marker Data

Description

In this section we will describe how to perform the qulaity control of marker data before GWAS.
We will ipute the marker data and filter for minor allele frequency

Imputing: Replacing missing values with mean values

Description

Here we will be using for loop to indetify the missing ones and convert them with mean values across columns.

for (j in 1:ncol(Geno)) {
    Geno[, j] <- ifelse(is.na(Geno[, j]), mean(Geno[, j], na.rm = TRUE), Geno[, 
        j])
}

Filtering for minor alleles

Description

Here we will droping the alleles with frequency < 5% and saving the object as Geno1
Second we will match and drop the markers from map info file and save it MAP1

# Filter for minor alleles
p <- colSums(Geno)/(2 * nrow(Geno))
maf <- ifelse(p > 0.5, 1 - p, p)
maf.index <- which(maf < 0.05)
Geno1 <- Geno[, -maf.index]
dim(Geno1)

[1]   413 33701

# Check the number of markers dropped Match with the map info file and save
# it as MAP1
setwd("~/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/Data/RiceDiversity_44K_Genotypes_PLINK")
MAP <- read.table("sativas413.map")
dim(MAP)

[1] 36901     4

MAP1 <- MAP[-maf.index, ]
dim(MAP1)

[1] 33701     4

Population structure

Description

Now, we will look inot the popultion structure which may lead to suprious associations if not accounted.
Here, we will be using PCA which provides fast and effective way to diagnose the population structure.
We will run the PCA analysis using SNPRelate Package and plot it to see the structure.
First we will create matrix and assing the row and colum names to the geno file. The we will create the gds formate file required by SNPRelate package and perform the PCA analysis.
We will add the population information to the PCA output file and plot it.

# Create geno matrix file and assign the row and column names from fam and
# map files
Geno1 <- as.matrix(Geno1)
rownames(Geno1) <- FAM$V2
sample <- row.names(Geno1)
length(sample)

[1] 413

colnames(Geno1) <- MAP1$V2
snp.id <- colnames(Geno1)
length(snp.id)

[1] 33701

# create gds formate file with marker and sample ids and save it as 44k.gds
snpgdsCreateGeno("44k.gds", genmat = Geno1, sample.id = sample, snp.id = snp.id, 
    snp.chromosome = MAP1$V1, snp.position = MAP1$V4, snpfirstdim = FALSE)
# Now open the 44k.gds file
geno_44k <- snpgdsOpen("44k.gds")
snpgdsSummary("44k.gds")

The file name: /Users/waseemhussain/Box/Postdoc/Teaching/Integrative Data Science for Plant Phenomics/R_codes/44k.gds 
The total number of samples: 413 
The total number of SNPs: 33701 
SNP genotypes are stored in SNP-major mode (Sample X SNP).

# Now perform the pca analysis and plot it
pca <- snpgdsPCA(geno_44k, snp.id = colnames(Geno1))

Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (non-autosomes or non-selection)
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 413 samples, 33,701 SNPs
    using 1 (CPU) core
PCA:    the sum of all selected genotypes (0,1,2) = 8315382
CPU capabilities: Double-Precision SSE2
Sun Mar  3 15:27:27 2019    (internal increment: 1812)

[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed in 1s
Sun Mar  3 15:27:28 2019    Begin (eigenvalues and eigenvectors)
Sun Mar  3 15:27:28 2019    Done.

pca <- data.frame(sample.id = row.names(Geno1), EV1 = pca$eigenvect[, 1], EV2 = pca$eigenvect[, 
    2], EV3 = pca$eigenvect[, 3], EV3 = pca$eigenvect[, 4], stringsAsFactors = FALSE)
# Plot the PCA
plot(pca$EV2, pca$EV1, xlab = "eigenvector 3", ylab = "eigenvector 4")

# Now let us add the population information to the plot. Here we will be
# using the population information from the PCA file available online
pca_1 <- read.csv("http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.germplasm.csv", 
    header = TRUE, skip = 1, stringsAsFactors = FALSE)  # 431 x 12
pca_2 <- pca_1[match(pca$sample.id, pca_1$NSFTV.ID), ]
table(pca_1$sample.id == pca_2$NSFTV.ID)

< table of extent 0 >

# Extract the population information and add the pca output file
pca_population <- cbind(pca_2$Sub.population, pca)
colnames(pca_population)[1] <- "population"
# Plot and add the population names
plot(pca_population$EV1, pca_population$EV2, xlab = "PC1", ylab = "PC2", col = c(1:6)[factor(pca_population$population)])
legend(x = "topright", legend = levels(factor(pca_population$population)), col = c(1:6), 
    pch = 1, cex = 0.6)

GWAS analysis

Description

GWAS analysis will be done in rrBLUP package.
we will prepare the genotypic file including markers, and map information
We will run the GWAS analysis in rrBLUP package
Correct for multiple testing
And finally we will look for significant markers and draw the Manhattan plot.

Run the GWAS analysis

# create the geno file for rrBLUP package GWAS analysis
geno_final <- data.frame(marker = MAP1[, 2], chrom = MAP1[, 1], pos = MAP1[, 
    4], t(Geno1 - 1), check.names = FALSE)  # W = \in{-1, 0, 1}
dim(Geno1)

[1]   413 33701

# create the pheno file
pheno_final <- data.frame(NSFTV_ID = rownames(y), y = y)
# Run the GWAS analysis
GWAS <- GWAS(pheno_final, geno_final, min.MAF = 0.05, P3D = TRUE, plot = FALSE)

[1] "GWAS for trait: y"
[1] "Variance components estimated. Testing markers."

Correct for Multiple Testing

We will be using Li and Ji (2005) method for multiple testing and choose significance threshold level
The reason we use Li an Ji (2005) is to make sure to account for correlated markers (LD between markers) and determine independent number of tests
As it requires lot of memomary, we will run the anlysis seperately for each chromosome and then finally sum them all
Below is the code for multiple tesing:

# Read the genotypic file and create a matrix for each chromosome
corr.matrix1 <- cor(Geno1[, 1:5888])
corr.matrix2 <- cor(Geno1[, 5889:9439])
corr.matrix3 <- cor(Geno1[, 9440:13464])
corr.matrix4 <- cor(Geno1[, 13465:16091])
corr.matrix5 <- cor(Geno1[, 16092:18701])
corr.matrix6 <- cor(Geno1[, 18702:21660])
corr.matrix7 <- cor(Geno1[, 21661:23587])
corr.matrix8 <- cor(Geno1[, 23588:25668])
corr.matrix9 <- cor(Geno1[, 25669:27501])
corr.matrix10 <- cor(Geno1[, 27502:29121])
corr.matrix11 <- cor(Geno1[, 29122:31752])
corr.matrix12 <- cor(Geno1[, 31753:33719])
# Now use the meff function from pacakge to get effective number of tests
# for each chromosome
meff_liji_1 <- meff(corr.matrix1, method = "liji")
meff_liji_2 <- meff(corr.matrix2, method = "liji")
meff_liji_3 <- meff(corr.matrix3, method = "liji")
meff_liji_4 <- meff(corr.matrix4, method = "liji")
meff_liji_5 <- meff(corr.matrix5, method = "liji")
meff_liji_6 <- meff(corr.matrix6, method = "liji")
meff_liji_7 <- meff(corr.matrix7, method = "liji")
meff_liji_8 <- meff(corr.matrix8, method = "liji")
meff_liji_9 <- meff(corr.matrix9, method = "liji")
meff_liji_10 <- meff(corr.matrix10, method = "liji")
meff_liji_11 <- meff(corr.matrix11, method = "liji")
meff_liji_12 <- meff(corr.matrix12, method = "liji")

# Now sum up all the effective tests to get effective number of independent
# tests
Meff <- sum(meff_liji_1, meff_liji_2, meff_liji_3, meff_liji_4, meff_liji_5, 
    meff_liji_6, meff_liji_7, meff_liji_8, meff_liji_9, meff_liji_10, meff_liji_11, 
    meff_liji_12)

Determine the significance threshold level

\[ \alpha_{p}= 1- (1- \alpha{e})^1/Meff\\ Meff= \text {is the effective number of independent tests}\\ \]

Meff = 3948
p_threshold = (1 - (1 - 0.05))^1/3948
p_threshold

[1] 1.266464e-05

Extracting significant markers

GWAS_1 <- GWAS %>% filter(y != "0")
# List of significant SNPs
GWAS_1 %>% filter(y < 1e-04)

      marker chrom      pos            y
1  id1004849     1  6219761 2.870307e-05
2  id1021674     1 34699187 4.652320e-05
3  id1022236     1 35386431 2.366215e-05
4  id2005536     2 12153311 3.525448e-05
5  id2005948     2 14158673 8.091814e-05
6 id10000932    10  3091111 1.454170e-05
7 id12002616    12  6227524 7.627332e-05
8 id12004093    12 10604192 9.459465e-05

Note the markers crossing the significance threshold level.

Draw the Manhattan plot

manhattan(x = GWAS_1, chr = "chrom", bp = "pos", p = "y", snp = "marker", col = c("blue4", 
    "orange3"), suggestiveline = -log10(1e-04), logp = TRUE)

=====================================END==========================