Bioinformatics. 2020 Dec 13;37(8):1178–1181. doi: 10.1093/bioinformatics/btaa784

PCAmatchR: a flexible R package for optimal case–control matching using weighted principal components

Derek W Brown 1,2, Timothy A Myers 3, Mitchell J Machiela 4
Editor: Russell Schwartz
PMCID: PMC8599751  PMID: 32926120

Abstract

Summary

A concern when conducting genome-wide association studies (GWAS) is the potential for population stratification, i.e. ancestry-based genetic differences between cases and controls, that if not properly accounted for, could lead to biased association results. We developed PCAmatchR as an open source R package for performing optimal case–control matching using principal component analysis (PCA) to aid in selecting controls that are well matched by ancestry to cases. PCAmatchR takes user supplied PCA outputs and selects matching controls for cases by utilizing a weighted Mahalanobis distance metric which weights each principal component by the percentage of genetic variation explained. Results from the 1000 Genomes Project data demonstrate both the functionality and performance of PCAmatchR for selecting matching controls for case populations as well as reducing inflation of association test statistics. PCAmatchR improves genomic similarity between matched cases and controls, which minimizes the effects of population stratification in GWAS analyses.

Availability and implementation

PCAmatchR is freely available for download on GitHub (https://github.com/machiela-lab/PCAmatchR) or through CRAN (https://CRAN.R-project.org/package=PCAmatchR).

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Genome-wide association studies (GWAS) have discovered thousands of germline susceptibility variants associated with disease risk. The selection of cases and well-matched controls is vital for obtaining unbiased association results (Hinds et al., 2004; Luca et al., 2008). An increasingly observed trend in GWAS is the practice of maximizing study resources by only genotyping cases and then borrowing controls from large sets of pre-genotyped, disease-free individuals (Hinds et al., 2004; Luca et al., 2008). When borrowing controls from previous studies, it is essential that cases be genotyped on the same arrays and undergo the same quality control and filtering steps, to lessen the possibilities of technical artifacts in genotype calling. In some scenarios, pooling of control genotyping data with case genotyping data can be an effective approach for performing genetic association studies as long as case–control imbalances are appropriately accounted for in the statistical analysis (Hayeck et al., 2015; Ma et al., 2013; Zhou et al., 2018). However, there are instances in which close genetic matching of cases to controls may be necessary (e.g. selecting subjects for costly molecular analyses).

Combining data from different sources could lead to confounding due to population stratification, i.e. genetic differences in cases and controls based on ancestry and not disease status (Epstein et al., 2012; Lacour et al., 2015), as allele frequencies and linkage disequilibrium patterns vary substantially by ancestry (Machiela et al., 2015). Many methods exist for removing population stratification bias from pooled genotype data (Byun et al., 2017; Epstein et al., 2012; Hinds et al., 2004; Lacour et al., 2015; Luca et al., 2008). A popular strategy is the utilization of principal component analysis (PCA) (Byun et al., 2017; Price et al., 2006). Briefly, PCA involves transforming large sets of predetermined ancestry-related genetic variants to a set of linearly uncorrelated principal components (PCs), the first of which explain the highest percentage of variation in the full genetic dataset (Price et al., 2006). Thus, instead of matching cases and controls on all available genotyped variants, close matching can be efficiently achieved using only the first few PCs.

Here, we describe the use of PCAmatchR, an open-source R package which allows for matching of controls to cases using PCA results (Brown et al., 2020). While case–control matching using derived PCs is not a novel concept (Machiela et al., 2018), our methodology involves weighting each PC by its respective percent variation explained. Using an improved weighted Mahalanobis distance metric, our approach selects optimized matches in which the PCs that explain the most variation within the genomic data are weighted most heavily. This approach produces a set of matched controls that are more genetically similar to cases and results in reduced genomic inflation in genetic association studies.

2 Materials and methods

We developed PCAmatchR to optimally match population-based controls to cases. PCAmatchR converts PCs into a Mahalanobis distance metric for selecting well-matched controls. Mahalanobis distance is a commonly used multivariate distance metric that creates a standardized distance between two samples, where a smaller Mahalanobis distance indicates greater sample similarity (Hu et al., 2013). This distance metric has been widely used for matching within propensity score analyses (Rubin, 1979; Zhao, 2004). The distance (D_ij) between the vectors of input PCs (X_i and X_j) for subjects i and j is given by:

$$D_{ij} = (X_i - X_j)' \, \Sigma^{-1} \, (X_i - X_j), \tag{1}$$

where Σ is the variance-covariance matrix of X in the pooled case and control sample population (Stuart, 2010). In the case of weighted matching, Equation (1) is extended to allow for weights in the distance metric. The weighted Mahalanobis distance metric is defined as:

$$D_{ij} = (X_i - X_j)' \, W \, \Sigma^{-1} \, (X_i - X_j), \tag{2}$$

where W is given by:

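$$W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}$$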

and each diagonal entry w_k (k = 1, ..., n) represents the percent variance in the data explained by the k-th of the n PCs (Hu et al., 2013; Krusińska, 1987). The percentage of variance explained by each PC is calculated by dividing the PC's corresponding eigenvalue by the sum of all eigenvalues generated in the PCA (Jolliffe et al., 2016), although other user-defined weights can also be applied. The standard Mahalanobis distance metric treats each element of X with equal importance and therefore underperforms in matching when the number of covariates is large (greater than 8) (Rubin, 1979; Stuart, 2010). The weighted Mahalanobis distance metric, by contrast, may produce more genetically similar matches because PCs are weighted by their relative importance (Krusińska, 1987). PCAmatchR utilizes the optmatch package, which performs bipartite matching using the RELAX-IV minimum cost flow solver (Hansen et al., 2006). Briefly, this optimal matching algorithm minimizes the total Mahalanobis distance over the whole matched set rather than one pair at a time, which removes the effects of possible ties (i.e. one control being the best match for multiple cases) from the final matched set (Hansen, 2007; Rosenbaum, 1989).
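To make Equations (1) and (2) concrete, the following is a minimal Python sketch of the weight calculation and the weighted distance. It is not code from PCAmatchR: the function names, the `eigen_sum` argument and the toy PC values are assumptions made for illustration, and the small Gauss-Jordan inverse stands in for a proper linear algebra routine. Setting every weight to 1 recovers the unweighted distance of Equation (1).

```python
def percent_variance_weights(eigenvalues, eigen_sum=None):
    """Share of variance per retained PC: eigenvalue / sum of ALL PCA eigenvalues.

    eigen_sum should be the total over every eigenvalue from the PCA, which can
    exceed the sum of the retained ones; it defaults to the retained sum.
    """
    total = sum(eigenvalues) if eigen_sum is None else eigen_sum
    return [ev / total for ev in eigenvalues]


def covariance(rows):
    """Sample variance-covariance matrix of the pooled case and control PCs."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fine for a handful of PCs)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def weighted_mahalanobis(x_i, x_j, weights, sigma_inv):
    """Equation (2): (x_i - x_j)' W Sigma^{-1} (x_i - x_j), W = diag(weights)."""
    d = [a - b for a, b in zip(x_i, x_j)]
    p = len(d)
    return sum(d[a] * weights[a] * sigma_inv[a][b] * d[b]
               for a in range(p) for b in range(p))


# Toy data (made up for illustration): two PCs for four pooled subjects.
pcs = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
weights = percent_variance_weights([3.0, 1.0])  # -> [0.75, 0.25]
sigma_inv = invert(covariance(pcs))
d_weighted = weighted_mahalanobis(pcs[0], pcs[1], weights, sigma_inv)
d_unweighted = weighted_mahalanobis(pcs[0], pcs[1], [1.0, 1.0], sigma_inv)
print(d_weighted, d_unweighted)
```

In this toy case the two subjects differ only on the first PC, so the weighted distance is the unweighted distance scaled by that PC's weight.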

PCAmatchR takes PCs and eigenvalues as input and directly outputs optimal case and control matches. The first step is to choose a subset of genotyped single nucleotide polymorphisms (SNPs) for use as input when calculating the PCs (Price et al., 2008; Tian et al., 2008; Yu et al., 2008). PCA can then be conducted in the usual way using any standard tool, e.g. PLINK --pca, on a combined dataset including both the case and control data. The user should retain the output matrix of derived PCs and the corresponding eigenvalues to perform the matching procedure. PCAmatchR has the flexibility both to match any number of controls to each case (e.g. 1:n matching) and to create exact matches based on user-defined covariates (e.g. sex), assuming sufficient control population size.
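The difference between optimal matching and a nearest-available (greedy) approach when one control is the best match for several cases can be seen in a small sketch. This is only an illustration: the distance values are made up, the function names are hypothetical, and the exhaustive search below replaces the minimum cost flow solver used by optmatch, so it is practical only for a handful of subjects.

```python
from itertools import permutations


def greedy_match(dist):
    """Each case in turn takes its nearest control that is still unused."""
    used, pairs = set(), []
    for case, row in enumerate(dist):
        control = min((c for c in range(len(row)) if c not in used),
                      key=lambda c: row[c])
        used.add(control)
        pairs.append((case, control))
    return pairs


def optimal_match(dist):
    """Exhaustive 1:1 matching that minimizes the total distance."""
    n_cases, n_controls = len(dist), len(dist[0])
    best = min(permutations(range(n_controls), n_cases),
               key=lambda perm: sum(dist[i][j] for i, j in enumerate(perm)))
    return list(enumerate(best))


def total_distance(dist, pairs):
    """Sum of case-control distances over a matched set."""
    return sum(dist[i][j] for i, j in pairs)


# Rows are cases, columns are controls (made-up distances).
# Control 0 is the closest control for both cases.
dist = [[1.0, 2.0, 9.0],
        [1.5, 8.0, 9.0]]

greedy = greedy_match(dist)    # case 0 takes control 0, forcing case 1 onto control 1
optimal = optimal_match(dist)  # case 0 gives up control 0 so that case 1 can use it
print(greedy, total_distance(dist, greedy))     # [(0, 0), (1, 1)] 9.0
print(optimal, total_distance(dist, optimal))   # [(0, 1), (1, 0)] 3.5
```

The greedy result depends on the order in which cases are processed, whereas the optimal result does not, which is why the tie has no effect on the final matched set.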

3 Implementation

To demonstrate the utility of PCAmatchR, we performed a hypothetical example of case–control matching using data from the 1000 Genomes Project (Auton et al., 2015; Sudmant et al., 2015) Phase 3 data release, which contains genotype data from 2504 individuals from 26 distinct populations (available at https://www.cog-genomics.org/plink/2.0/resources). Using a set of ancestry informative SNPs (Yu et al., 2008), we performed a PCA on all available genotyped participants using PLINK. The first 20 PCs explained 16% of the variability of the genomic data with the first PC explaining 9% (Supplementary Fig. S1a).

For a sample analysis, we selected all individuals from the CEU [Utah Residents (CEPH) with Northern and Western European Ancestry] population as cases (N = 99), whereas all remaining samples were used as our control population (N = 2405) (Supplementary Figs S1b and S2a). Cases were 1:1 matched to controls based on the weighted Mahalanobis distance metric using the first 20 PCs (Supplementary Fig. S2b and Supplementary Code). Using the plotting functionality within PCAmatchR, the connections between cases and matched controls are depicted in Figure 1, further providing a visual representation of the multivariate distance between matches. To show that the weighted matching procedure correctly identified individuals with higher genetic similarity, we calculated a relationship matrix, which estimates the SNP-based genetic relationship between samples, using PLINK --make-rel (Yang et al., 2011). The genetic relationships between matches were extracted and averaged across all matches to give an overall estimate of genetic similarity. For comparison, we also performed the matching procedure using the unweighted Mahalanobis distance metric. Overall, controls selected using the weighted Mahalanobis distance metric had higher average genomic similarity to the CEU cases than the controls selected using the unweighted metric (0.1216 versus 0.1172, Supplementary Fig. S3 and Supplementary Table S1). Finally, we performed GWAS analyses, based on the ancestry SNP panel variants, and observed that the matches selected using the weighted Mahalanobis distance metric produced a smaller lambda value than the matches selected using the unweighted metric (Supplementary Fig. S4 and Supplementary Table S1), indicating that the weighted matching procedure within PCAmatchR effectively removed more potential population stratification bias in the pooled sample than the unweighted distance-based matching.
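The two summary measures used in this comparison can be sketched in a few lines of Python. The sketch assumes the conventional definition of the genomic inflation factor (lambda) as the median observed 1-df chi-square association statistic divided by the median of the chi-square distribution with 1 degree of freedom; the text does not state the exact formula used, so this is an assumption. The function names and toy values are hypothetical.

```python
from statistics import mean, median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Lambda: median observed 1-df chi-square statistic over its null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def mean_pair_relationship(rel, pairs):
    """Average relationship-matrix entry over matched (case, control) pairs.

    rel is a square matrix indexed by sample position; pairs holds
    (case_index, control_index) tuples into that matrix.
    """
    return mean(rel[i][j] for i, j in pairs)


# Made-up relationship matrix for three samples and two matched pairs.
rel = [[1.0, 0.1, 0.3],
       [0.1, 1.0, 0.2],
       [0.3, 0.2, 1.0]]
print(mean_pair_relationship(rel, [(0, 2), (1, 2)]))

# Statistics whose median equals the null median give lambda = 1 (no inflation).
print(genomic_inflation([0.1, 0.4549364, 3.0]))  # 1.0
```

A lambda close to 1 indicates little inflation of the test statistics, and values above 1 are consistent with residual population stratification.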

Fig. 1.

Visualization of the multivariate distance between CEU population cases and matched controls using the plotting functionality within PCAmatchR. Connections are formed within matched sets.

For completeness, we replicated these analyses for each of the 26 available 1000 Genomes Project populations, selecting each individual population as a separate case sample. Across the populations, matches derived using the proposed weighted Mahalanobis distance metric outperformed the unweighted metric matches in terms of genomic similarity within 21 (81%) of the populations and in terms of lambda value within 23 (88%) of the populations (Supplementary Fig. S5 and Supplementary Table S1). The importance of matching was further demonstrated using comparisons to unmatched samples with randomly selected controls (size equal to the number of cases). Weighted Mahalanobis distance matching greatly reduced the effects of population stratification compared to both baseline levels as well as those achieved by current unweighted matching techniques (Supplementary Table S1).

Sensitivity analyses were performed to assess the effects of the number of PCs included when matching. Weighted matching was less sensitive to the number of included PCs than unweighted matching (Supplementary Table S2). When the number of PCs increased (>20), unweighted matching removed the effects of population stratification from analyses poorly. In practice, the number of PCs included for matching should be based on the percent variance explained by the eigenvalues.
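One possible way to act on this recommendation is to inspect the cumulative percent variance explained and keep the smallest number of PCs that reaches a chosen target. The sketch below is an assumption about how a user might do this, not a rule taken from PCAmatchR: the function names, the target percentage and the toy eigenvalues are all hypothetical.

```python
def cumulative_percent_variance(eigenvalues, eigen_sum):
    """Running percent of variance explained by the first k PCs, k = 1..n.

    eigen_sum is the sum of all eigenvalues generated in the PCA.
    """
    running, out = 0.0, []
    for ev in eigenvalues:
        running += 100.0 * ev / eigen_sum
        out.append(running)
    return out


def smallest_k_reaching(eigenvalues, eigen_sum, target_percent):
    """Smallest number of leading PCs whose cumulative share meets the target.

    Returns None when the retained PCs never reach the target.
    """
    cumulative = cumulative_percent_variance(eigenvalues, eigen_sum)
    for k, share in enumerate(cumulative, start=1):
        if share >= target_percent:
            return k
    return None


# Made-up eigenvalues for four retained PCs out of a total eigenvalue sum of 100.
eigenvalues = [9.0, 3.0, 2.0, 1.0]
print(cumulative_percent_variance(eigenvalues, 100.0))  # [9.0, 12.0, 14.0, 15.0]
print(smallest_k_reaching(eigenvalues, 100.0, 14.0))    # 3
print(smallest_k_reaching(eigenvalues, 100.0, 20.0))    # None
```

Because later PCs typically add little to the cumulative total, they also receive small weights in the weighted distance, which is consistent with weighted matching being less sensitive to how many PCs are included.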

4 Discussion

PCAmatchR is an efficient and easy-to-use open access tool to match cases and controls based on user defined PCs, which can aid in removing the effects of population stratification for association analyses as well as facilitate the selection of biological samples for subsequent costly assays (e.g. methylation arrays and whole-genome sequencing). When applied to available data from the 1000 Genomes Project, PCAmatchR demonstrated the ability of weighted PCs to increase genomic similarity between cases and controls and reduce genomic inflation.

Supplementary Material

btaa784_Supplementary_Data

Acknowledgements

The authors acknowledge Dr. Shu-Hong Lin and Oliva Lee for their thorough testing of PCAmatchR. The opinions expressed by the authors are their own and this material should not be interpreted as representing the official viewpoint of the U.S. Department of Health and Human Services, the National Institutes of Health or the National Cancer Institute.

Funding

This work was supported by the Intramural Research Program of the US National Cancer Institute.

Conflict of Interest: none declared.

Contributor Information

Derek W Brown, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20850, USA; Cancer Prevention Fellowship Program, Division of Cancer Prevention, National Cancer Institute, Rockville, MD 20850, USA.

Timothy A Myers, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20850, USA.

Mitchell J Machiela, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20850, USA.

References

  1. Auton A. et al.; 1000 Genomes Project Consortium. (2015) A global reference for human genetic variation. Nature, 526, 68–74.
  2. Brown D.W. et al. (2020) PCAmatchR: Match Cases to Controls Based on Genotype Principal Components. R package version 0.2.1. https://CRAN.R-project.org/package=PCAmatchR.
  3. Byun J. et al. (2017) Ancestry inference using principal component analysis and spatial analysis: a distance-based analysis to account for population substructure. BMC Genomics, 18, 789.
  4. Epstein M.P. et al. (2012) Stratification-score matching improves correction for confounding by population stratification in case-control association studies. Genet. Epidemiol., 36, 195–205.
  5. Hansen B.B. (2007) Optmatch: flexible, optimal matching for observational studies. New Funct. Multivar. Anal., 7, 18–24.
  6. Hansen B.B. et al. (2006) Optimal full matching and related designs via network flows. J. Comput. Graph. Stat., 15, 609–627.
  7. Hayeck T.J. et al. (2015) Mixed model with correction for case–control ascertainment increases association power. Am. J. Hum. Genet., 96, 720–730.
  8. Hinds D.A. et al. (2004) Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet., 74, 317–325.
  9. Hu H. et al. (2013) Fault diagnosis of analogue circuits with weighted Mahalanobis distance based on entropy theory. Int. J. Digit. Content Technol. Appl., 7, 182.
  10. Jolliffe I.T. et al. (2016) Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. Math. Phys. Eng. Sci., 374, 20150202.
  11. Krusińska E. (1987) A valuation of state of object based on weighted Mahalanobis distance. Pattern Recognit., 20, 413–418.
  12. Lacour A. et al. (2015) Novel genetic matching methods for handling population stratification in genome-wide association studies. BMC Bioinformatics, 16, 84.
  13. Luca D. et al. (2008) On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am. J. Hum. Genet., 82, 453–463.
  14. Ma C. et al. (2013) Recommended joint and meta-analysis strategies for case–control association testing of single low-count variants. Genet. Epidemiol., 37, 539–550.
  15. Machiela M.J. et al. (2018) Genome-wide association study identifies multiple new loci associated with Ewing sarcoma susceptibility. Nat. Commun., 9, 3184.
  16. Machiela M.J. et al. (2015) LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinf. Oxf. Engl., 31, 3555–3557.
  17. Price A.L. et al. (2008) Discerning the ancestry of European Americans in Genetic Association Studies. PLOS Genet., 4, e236.
  18. Price A.L. et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet., 38, 904–909.
  19. Rosenbaum P.R. (1989) Optimal matching for observational studies. J. Am. Stat. Assoc., 84, 1024–1032.
  20. Rubin D.B. (1979) Using multivariate matched sampling and regression adjustment to control bias in observational studies. J. Am. Stat. Assoc., 74, 318–328.
  21. Stuart E.A. (2010) Matching methods for causal inference: a review and a look forward. Stat. Sci. Rev. J. Inst. Math. Stat., 25, 1–21.
  22. Sudmant P.H. et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature, 526, 75–81.
  23. Tian C. et al. (2008) Analysis and application of European genetic substructure using 300 K SNP information. PLOS Genet., 4, e4.
  24. Yang J. et al. (2011) GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet., 88, 76–82.
  25. Yu K. et al. (2008) Population substructure and control selection in genome-wide association studies. PLoS One, 3, e2551.
  26. Zhao Z. (2004) Using matching to estimate treatment effects: data requirements, matching metrics, and Monte Carlo evidence. Rev. Econ. Stat., 86, 91–107.
  27. Zhou W. et al. (2018) Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet., 50, 1335–1341.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
