gwasplot provides fast and efficient tools for visualizing and annotating GWAS summary statistics. It offers functions to reformat data, create plots, and annotate top hits with genomic features.

It is designed to work with output from GWAS performed on WGS data with many rare variants, using duckdb to manipulate data.

For more information, visit the GitHub repository or the project website.

Roadmap

add LD into locuszoom plots, perhaps referencing this setup.
build out OpenTargets API functionality for annotations
add in helper functions to faciliate fine-mapping with other packages (e.g., SuSIE)

Installation

Install the development version of gwasplot from GitHub:

# Install remotes if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install gwasplot from GitHub
remotes::install_github("weinstockj/gwasplot")

Input formats

gwasplot expects output from either regenie or saige in csv or parquet format (parquet is generally recommended).

Regenie output looks like this:

dplyr::glimpse(df)
Rows: 5
Columns: 12
$ CHROM     <chr> "chr3", "chr3", "chr3", "chr3", "chr3"
$ POS       <int> 90000011, 90000261, 90000274, 90000324, 90000366
$ ID        <chr> "chr3-90000011-T-C", "chr3-90000261-GATT-G", "chr3-900002…"
$ ALLELE0   <chr> "T", "GATT", "T", "C", "T"
$ ALLELE1   <chr> "C", "G", "A", "A", "G"
$ A1FREQ    <dbl> 1.37775e-04, 2.91090e-04, 2.38259e-04, 1.62637e-04, 4.66158e…
$ N         <int> 482670, 482669, 482669, 482669, 482669
$ BETA      <dbl> -0.00157451, -0.06463220, -0.06452010, 0.02796820, -0.021063…
$ SE        <dbl> 0.0730887, 0.0479369, 0.0540183, 0.0646778, 0.1213880
$ CHISQ     <dbl> 0.000464077, 1.817850000, 1.426620000, 0.186990000, 0.030108…
$ LOG10P    <dbl> 0.00752913, 0.75063100, 0.63391900, 0.17689500, 0.06437000

gwastools will read in data from either regenie or saige and standarize the column names, and then create a table called summary_stats in a duckdb database.

Key Features

Data Quality Control and Filtering

gwasplot includes robust filtering capabilities to remove variants from problematic genomic regions that can lead to spurious associations:

Exclude Difficult Regions

The exclude_difficult_regions() function removes variants from regions with known mapping difficulties:

# Remove variants from multiple types of difficult regions
gwas_filtered <- exclude_difficult_regions(gwas_stats, 
                                         beds_exclude = c("hg19diff", "UCSC_unusual", "GRC_exclusions"))

Available region types include: - hg19diff: Regions with assembly differences between genome builds - UCSC_unusual: Unusual regions identified by UCSC (gaps, heterochromatin, etc.) - GRC_exclusions: Genome Reference Consortium exclusion regions - GIAB_difficult_regions: Genome in a Bottle difficult-to-map regions

Custom Variant Filtering

Filter variants using custom whitelists or blacklists:

# Keep only variants in a specific region or set
filtered_gwas <- filter_variants(gwas_stats, subset = "my_variants.parquet")

# Exclude specific variants
filtered_gwas <- filter_variants(gwas_stats, exclude = "bad_variants.parquet")

Gene Annotation

Annotate variants with nearest protein-coding genes:

# Add gene annotations
annotated_gwas <- find_nearest_gene(gwas_stats, threshold = 1e5)  # 100kb window

Additional specialized annotations: - CHIP genes: annotate_with_chip_genes() - flags variants in clonal hematopoiesis genes - Centromeres: annotate_with_centromere() - identifies variants in centromeric regions
- Immunoglobulin loci: annotate_with_immunoglobulin() - flags variants in Ig heavy/light chain regions

Visualization

Manhattan Plots

manhattan(gwas = gwas_stats, output_prefix = "my_manhattan")

QQ Plots

qqplot_save(gwas = gwas_stats, output_prefix = "my_qqplot")

LocusZoom Plots

Create detailed regional association plots with gene tracks:

# Plot a specific locus with gene structure
locuszoom(gwas_stats, locus_chr = "chr2", locus_start = 25000000, locus_end = 25500000)

Volcano Plots

For post-GWAS analysis (e.g., after empirical Bayes shrinkage):

# Automatically uses lfsr if available, otherwise p-values
volcano(gwas_data, phenotype_label = "My Phenotype")

Usage

Below are some sample usage examples:

Basic Workflow

Reformat GWAS Summary Statistics Use the reformat_summary_statistics function to read and reformat GWAS summary statistics from a parquet or CSV file:

library(gwasplot)

# Reformat summary statistics from a file
gwas_stats <- reformat_summary_statistics("path/to/your/file.parquet")
print(gwas_stats)

GWAS object
File path: ../concatenated_results.parquet
Detected format: regenie
Data names: CHROM, POS, ID, ALLELE0, ALLELE1, A1FREQ, N, BETA, SE, CHISQ, LOG10P, phenotype
Data preview:
# A tibble: 5 × 9
  CHROM      POS REF   ALT      AF_ALT     BETA  LOG10P PVALUE ID
  <chr>    <dbl> <chr> <chr>     <dbl>    <dbl>   <dbl>  <dbl> <chr>
1 chr3  90000011 T     C     0.000138  -0.00157 0.00753  0.983 chr3_90000011_T_C
2 chr3  90000261 GATT  G     0.000291  -0.0646  0.751    0.178 chr3_90000261_GA…
3 chr3  90000274 T     A     0.000238  -0.0645  0.634    0.232 chr3_90000274_T_A
4 chr3  90000324 C     A     0.000163   0.0280  0.177    0.665 chr3_90000324_C_A
5 chr3  90000366 T     G     0.0000466 -0.0211  0.0644   0.862 chr3_90000366_T_G

Complete Analysis Pipeline

# Load and reformat data
gwas_stats <- reformat_summary_statistics("path/to/gwas_results.parquet")

# Quality control: remove problematic regions for WGS GWAS
gwas_clean <- gwas_stats %>%
  exclude_difficult_regions(
    beds_exclude = c(
      "hg19diff", 
      "UCSC_unusual", 
      "GRC_exclusions"
      )
    )


manhattan(gwas_clean, output_prefix = "my_analysis")
qqplot(gwas_clean, output_prefix = "my_analysis")

# Get top hits and annotate with genes
top_hits <- gwas_clean %>%
  select_top_hits(threshold = 5e-8) %>%
  find_nearest_gene()

# Plot specific loci
locuszoom(gwas_clean, locus_chr = "chr2", locus_start = 25000000, locus_end = 25500000)

Contact

Email Josh Weinstock. See our other work here.

Overview