Inclusion and ethics
The Institutional Review Board (IRB) at the Center for Non-Communicable Diseases (IRB: 00007048, IORG0005843, FWAS00014490) approved the study. All participants gave written informed consent.
Summary
Details of methods are below. Briefly, 75 thousand (K) (n = 75,819) individuals were recruited and consented in Pakistan for whole exome sequencing, including a stroke case:control discovery cohort of 31 K (including n = 5135 cases and n = 26,602 controls) sequenced by the Regeneron Genetics Center and a follow-up cohort of 44 K (n = 44,082), including 30 K with self-report stroke case:control status used for replication and meta-analysis (n = 160 cases and n = 30,239 controls). ExWAS was conducted for stroke and 4 overlapping stroke subtypes (intracerebral hemorrhage, subcortical intracerebral hemorrhage, ischemic stroke, and partial anterior circulating infarct) in the discovery cohort and combined in a meta-analysis of stroke with the replication cohort using both single-variant and gene burden test models26. Population attributable fraction of stroke for p.Arg1231Cys was calculated based on the prevalence of the mutation among cases and odds ratio (OR) for risk of stroke in the discovery cohort using the standard definition27. Consented callbacks were conducted in n = 128 individuals within families of homozygotes for p.Arg1231Cys. For comparison and validation, analyses were conducted in UK Biobank data using publicly available datasets and methodologies28,29,30,31, including association analysis of p.Arg1231Cys NOTCH3 with stroke phenotypes in 380 K participants, and brain imaging phenotypes in 35 K participants (Fig. 1A).
Study populations
This study focused on two distinct cohorts, including 75 K individuals from the Pakistan Genomic Resource (PGR) and 380 K individuals (n = 380,537) from the United Kingdom Biobank (UKB) The PGR 75 K individuals were recruited and consented in Pakistan for whole exome sequencing (WES) (n = 75,819), including a stroke case:control discovery cohort of 31 K (n = 31,737; n = 5135 cases and n = 26,602 controls) sequenced by the Regeneron Genetics Center and a follow-up cohort of 44 K (n = 44,082), including 30 K with self-report stroke case:control status used for replication and meta-analysis (n = 30,399; n = 160 cases and n = 30,239 controls). The remaining n = 13,683 in the follow-up cohort had sequence data but not stroke case:control status known, including n = 6067 produced by WES and n = 7616 produced by whole genome sequencing (WGS) (Fig. 1A).
Pakistan Genomic Resource (PGR)
PGR is a growing biobank that aims to enroll 1 million participants across Pakistan and as of September 2023, ~250,000 participants across n = 48 clinical sites in n = 17 cities from all over Pakistan have been enrolled. Following the success of a case-control study design in genetic studies adopted by several international (e.g., Wellcome Trust case-control consortium) and regional studies (e.g., PROMIS), PGR is a national consortium of several case-control studies focused on 50 distinct phenotypes, including: stroke, myocardial infarction, angiographically confirmed coronary artery disease, heart failure, age-related macular degeneration, keratoconus, diabetic retinopathy, glaucoma, asthma, chronic obstructive pulmonary disease (COPD), non-alcoholic fatty liver disease (NAFLD), type-2 diabetes, chronic kidney disease, Alzheimer’s disease, Parkinson’s disease, dementia, progressive multiple sclerosis, autism, Huntington’s disease, hematological cancers, breast cancer, ovarian cancer, cancers of head and neck, esophageal cancer, lung cancer, gastric cancer, colorectal cancer, melanoma, cancers of the urinary tract, cervical cancer, prostate cancer, rheumatoid arthritis, systemic lupus erythematosus (SLE), psoriatic arthritis, ankylosing spondylitis, osteoarthritis, scleroderma, juvenile arthritis, systemic sclerosis, inflammatory myositis, alopecia areata, acne rosacea, primary Sjogren’s syndrome, sarcoidosis, idiopathic pulmonary cholangitis, idiopathic pulmonary fibrosis, vitiligo, longevity/healthy aging, and previously uncharacterized Mendelian disorder. For each of these phenotypes, screening is carried out at specialized clinical sites across Pakistan by trained research medical officers who review inclusion and exclusion criteria and approach eligible participants for recruitment into PGR. In a similar manner, for each of the phenotypes, controls are frequency-matched to cases on sex and age (in 5-year bands). Controls are being recruited in the following order of priority: (1) visitors of patients attending the out-patient department; (2) patients attending the out-patient department for routine non-phenotype related complaints, or (3) non-blood related visitors. Following informed consent, both cases and controls are enrolled. Research medical officers administer pre-piloted epidemiological questionnaires to participants that seek a total of >200 items of information in relation to: ethnicity (e.g, personal and paternal ethnicity, spoken language, place of birth and any known consanguinity); demographic characteristics; lifestyle factors (e.g., tobacco and alcohol consumption, dietary intake and physical activity); and personal and family history of disease; and medication usage. The Center for Non-Communicable Diseases (CNCD), Pakistan serves as the sponsor and the coordinating center of PGR.
Using standardized procedures and equipment, research officers obtain measurements of height, weight, waist and hip circumference, systolic and diastolic blood pressure, and heart rate. Waist circumference is assessed over the abdomen at the widest diameter between the costal margin and the iliac crest, and hip circumference is assessed at the level of the greater trochanters. Information extracted from questionnaires and physical measurements is entered by two different operators into the central database, which is securely held at CNCD, Karachi, Pakistan. Non-fasting blood samples (with the time since last meal recorded) are drawn by phlebotomists from each participant and centrifuged within 45 min of venipuncture. A total of 29 ml of blood is drawn from each participant in 2 × 6 ml serum tubes and 3 × 5 ml EDTA tubes. Hence, a total of five blood tubes are collected per participant, including serum, EDTA plasma and whole blood which are all stored in cryogenic vials. All samples are stored temporarily at each recruitment center at −20 °C. Serum, plasma and whole blood samples are transported daily to the central laboratory at CNCD where they are stored at −80 °C. Measurements of total cholesterol, high-density lipoprotein-cholesterol, triglycerides, AST, ALT, glucose, creatinine, and HbA1c (in a subset) are performed centrally using (Roche Diagnostics GmbH, USA) in all study participants.
Research technicians trained in accordance with standard operating procedures in laboratories at CNCD extracted DNA from leukocytes using a reference phenol-chloroform protocol. DNA concentrations are determined. The yield of DNA per participant is typically between 600 and 800 ng/μl in a total volume of about 500 μl. To minimize any systematic biases arising from plate- or batch-specific genotyping error and/or nonrandom missingness, stock plates are used to generate genotyping plates which contain a mixture of cases and controls along with negative and positive controls designed to address genotyping quality control (QC), plate identification and orientation.
PGR has received approval by the relevant research ethics committee of each of the institutions involved in participant recruitment, as well as centrally by the IRB board of the Center for Non-Communicable Diseases which is registered with the National Institutes of Health, USA. PGR has also been approved by the National Bioethics Committee, Islamabad Health Research Institute, National Institutes of Health of Pakistan.
Eligibility criteria was defined as described below. Ischemic stroke sub-types in the PGR cohort were defined using TOAST32 and Oxfordshire33 clinical criteria.
UK Biobank
The UK Biobank (UKB) cohort had detailed medical records and lifestyle data as described online in the UKB Showcase (https://biobank.ndph.ox.ac.uk/showcase/)28. Stroke case:control status was available in 380 K UK Biobank participants, of which 280 K also had smoking and hypertension status (referred to as UKB replication cohort). A sub-cohort of n = 35,977 UKB participants had brain MRI data which was produced and analyzed as described online (https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/brain_mri.pdf)29,30,31. MRI images for n = 19 p.Arg1231Cys carriers were re-analyzed in order to identify brain regions affected (referred to as UKB 35 K brain imaging cohort). The description of phenotypes and methods for normalizing the data, including rank-inverse normal transformation (RINT) are described online28.
Exome sample preparation, sequencing, and QC
Genomic DNA samples were transferred to the Regeneron Genetics Center from the CNCD and stored in an automated sample biobank at –80 °C before sample preparation. DNA libraries were created by enzymatically shearing DNA to a mean fragment size of 200 bp, and a common Y-shaped adapter was ligated to all DNA libraries. Unique, asymmetric 10 bp barcodes were added to the DNA fragment during library amplification to facilitate multiplexed exome capture and sequencing. Equal amounts of sample were pooled before overnight exome capture, with a slightly modified version of IDT’s xGenv1 probe library; all samples were captured on the same lot of oligonucleotides. The captured DNA was PCR amplified and quantified by quantitative PCR. The multiplexed samples were pooled and then sequenced using 75 bp paired-end reads with two 10 bp index reads on an Illumina NovaSeq 6000 platform on S4 flow cells. A total of n = 42,695 samples were made available for processing. We were unable to process n = 1948 samples, most of which failed QC during processing owing to low or no DNA being present.
A total of n = 40,747 samples were sequenced, of which n = 2943 (7%) did not pass one or more of our QC metrics and were subsequently excluded. Criteria for exclusion were as follows: disagreement between genetically determined and reported sex (n = 900); high rates of heterozygosity or contamination (VBID > 5%) (n = 709); low sequence coverage (less than 80% of targeted bases achieving 20× coverage) (n = 115); genetically identified sample duplicates (n = 1662 total samples); WES variants discordant with the genotyping chip (n = 43). The remaining n = 37,804 (37 K) samples were then used to compile a project-level VCF (PVCF) for downstream analysis using the GLnexus joint genotyping tool. This final dataset contained n = 7655,430 variants. Within this dataset of 37 K exomes, stroke case:control status was known for n = 31,737, referred to as the 31 K discovery cohort. The remaining n = 6067 were part of the 41 K follow-up cohort.
Exome sequencing in the replication 30 K cohort (n = 39,399) was conducted by the CNCD using a publicly available protocol34. Briefly, blood derived DNA samples, with 10 to100 ng concentration of initial genomic DNA, underwent hybridization and capture using Illumina Rapid Cature Exome Kit or Agilents SureSelect Human Exon v2. Samples were denatured and amplified HiSeq v3 cluster chemistry and HiSeq 2000 or 2500 flowcells based on the manufacturers protocol. Reads were aligned to the GRCh38 genome reference and variants were called using GATK v.30 folllowed by variant recaliberation to remove false positive variants.
The remaining n = 7616 exome samples of the 75 K PGR consisted of whole genome sequence (WGS) data that produced a VCF subsequently filtered to include only variants in protein coding sequence. WGS samples were sequenced and process as described in a publicly available protocol (https://www.nature.com/articles/s41586-021-03205-y). Briefly, 30x whole genome sequencing was performed using Illumina HiSeqX instruments. Reads were aligned to the GRCh38 reference using BWA-align and variants were called using the publicly available GotCloud pipeline (https://genome.sph.umich.edu/wiki/GotCloud), which includes QCing variants based on a support vector machine trained on specific site quality metrics.
Variant calling
The PGR discovery cohort WES data was reference-aligned using the OQFE protocol35, which uses BWA MEM to map all reads to the GRCh38 reference in an alt-aware manner, marks read duplicates and adds additional per-read tags. The OQFE protocol retains all reads and original quality scores such that the original FASTQ is completely recoverable from the resulting CRAM file. Single-sample variants were called using DeepVariant with custom exome parameters35, generating a gVCF for each input OQFE CRAM file. These gVCFs were aggregated and joint-genotyped using GLnexus (v.1.3.1). All constituent steps of this protocol were executed using open-source software. The PGR replication and follow-up cohort were analyzed using the publicly available GotCloud workflow (https://genome.sph.umich.edu/wiki/GotCloud).
Identification of low-quality variants from sequencing using machine learning
Similar to other recent large-scale sequencing efforts, we implemented a supervised machine-learning algorithm to discriminate between probable low-quality and high-quality variants36,37. In brief, we defined a set of positive control and negative control variants based on the following criteria: (1) concordance in genotype calls between array and exome-sequencing data; (2) transmitted singletons; (3) an external set of likely ‘high quality’ sites; and (4) an external set of likely ‘low quality’ sites. To define the external high-quality set, we first generated the intersection of variants that passed QC in both TOPMed Freeze 8 and GnomAD v.3.1 genomes. This set was additionally restricted to 1000 Genomes Phase 1 high-confidence SNPs from the 1000 Genomes Project38 and gold-standard insertions and deletions from the 1000 Genomes Project and a previous study39, both available through the GATK resource bundle (https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle). To define the external low-quality set, we intersected GnomAD v3.1 fail variants with TOPMed Freeze 8 Mendelian or duplicate discordant variants. Before model training, the control set of variants were binned by allele frequency and then randomly sampled such that an equal number of variants were retained in the positive and negative labels across each frequency bin. A support vector machine using a radial basis function kernel was then trained on up to n = 33 available site quality metrics, including, for example, the median value for allele balance in heterozygote calls and whether a variant was split from a multi-allelic site. We split the data into training (80%) and test (20%) sets. We performed a grid search with fivefold cross-validation on the training set to identify the hyperparameters that returned the highest accuracy during cross-validation, which were then applied to the test set to confirm accuracy. This approach identified a total of n = 931,823 WES variants as low-quality, resulting in a dataset of n = 6,723,607 variants.
Variant annotation
Variants were annotated as described in a publicly available pipeline38. In brief, variants were annotated using Ensembl variant effect predictor, with the most severe consequence for each variant chosen across all protein-coding transcripts. In addition, we derived canonical transcript annotations based on a combination of MANE, APPRIS and Ensembl canonical tags. MANE annotation was given the highest priority followed by APPRIS. When neither MANE nor APPRIS annotation tags were available for a gene, the canonical transcript definition of Ensembl was used. Gene regions were defined using Ensembl release 100. Variants annotated as stop gained, start lost, splice donor, splice acceptor, stop lost or frameshift, for which the allele of interest was not the ancestral allele, were considered predicted loss-of-function variants. Five annotation resources were utilized to assign deleteriousness to missense variants: SIFT, Polyphen2 HDIV, Polyphen2 HVAR, LRT, MutationTaster40,41,42,43, and LRT, obtained using dbNSFP44. Missense variants were considered ‘likely deleterious’ if predicted deleterious by all five algorithms, ‘possibly deleterious’ if predicted deleterious by at least one algorithm and ‘likely benign’ if not predicted deleterious by any algorithm.
Pakistan Genome Resource Statistical Analysis
ExWAS of SNPs with minor allele count > 5 in the 31 K PGR discovery cohort was conducted with n = 5 binary stroke phenotypes using REGENIE (v 3.1.1)26 with age, age2, sex, age*sex, exome batch, 10 genotyping array principal components (PCs), 10 common variant exome PCs and 10 rare variant exome PCs as covariates. The minimum of 1000 cases was selected based on a power calculation45 (n = 1000 cases; n = 25,000 controls; significance threshold = 0.005; prevalence = 0.0012; disease allele frequency = 0.005; genotype relative risk = 3.0; > 80% power). Follow-up analyses were conducted with the added covariates of hypertension and tobacco use, or as environmental factors in a gene-by-environment interaction test using REGENIE26. Gene burden analysis was conducted using REGENIE with separate masks for pLoF, pLoF + missense, pLoF + deleterious missense (as predicted by at least 1 of 5 algorithms), and pLoF + deleterious missense (as predicted by 5 of 5 algorithms). Analysis in the 61 K PGR meta-analysis cohort, including 31 K discovery and 30 K replication cohorts, was conducted using REGENIE.
PGR population genetic analysis
Using principal components (PCs)46 and Uniform Manifold Approximation and Projection (UMAP)47 based analyses PGR and UKB South Asian sub-populations were mapped to distinct groups or clusters. Specifically, we used the imputed genotypes to merge the PGR dataset with UKB and 1000 genome datasets. Imputed data was used to maximize the number of common variants between all three datasets. The Plink48 “–bmerge” option was used to merge datasets. A minimal QC was applied to the merged genotypes to exclude variants with MAF less than 5%, missing genotype rate greater than 10%, and Hardy Weinberg equilibrium P value less than 5 × 10−5. Variants mapping within the HLA region were excluded. Merged datasets were pruned for linkage disequilibrium (r2 > 0.25). A total of 20 PCAs were calculated in the merged data using the Plink “–pca” option. Calculated PCAs were imported to R and merged with reported ethnicities or country of birth information. The first 6 PCs calculated on the merged data were reduced to two dimensions using the UMAP package in R. The two eigenvectors of UMAP were calculated using an alpha value of 1.1 and beta value of 0.8. Two eigenvectors were plotted along with ethnicity and country of birth labels using the Plotly package in R. UKB self-reported ethnicities or country of birth was confirmed to be highly correlated with data obtained from UMAP.
Population attributable fraction
Estimation of the proportion of all strokes combined or hemorrhagic stroke in Pakistan population attributable to p.Arg1231Cys was calculated using the formula27,
$${{AF}}_{p}={P}_{c}{\times {AF}}_{e}={P}_{c}\left(1-\frac{1}{{OR}}\right)$$
where Pc is the prevalence of mutation among cases, AFe is the attributable fraction in the exposure, and OR was for risk of stroke (i.e., all stroke combined or hemorrhagic stroke) comparing mutant vs wild type of p.Arg1231Cys in the discovery cohort.
OR was obtained directly from Supplementary Table 2. The related 95% confidence intervals were constructed using bootstrap method with 10,000 resamples49, which was implemented with the command “Bootstrap” using Stata (College Station, Texas 77845 USA).
Recall by genotype
A subset of carriers of NOTCH3 p.Arg1231Cys were contacted by the Center of Non-Communicable Diseases in Karachi Pakistan under protocols approved by the IRB committee of the Center for Non-Communicable Diseases (NIH registered IRB 00007048). After obtaining consent from the proband and from the family members, questionnaires regarding past medical and family history were administered by trained research staff, in the local language. Physical measurements such as height, weight, hip and waist circumference were obtained in the standing position by using height and weight scales. Blood pressure and heart rate were recorded sitting by using OMRON healthcare M2 blood pressure monitors. Non-fasting blood samples were collected from each participant in EDTA and Gel Tubes. Serum and plasma were separated within 45 min of venipuncture. A random urine sample was also collected from each participant. The samples were stored temporarily in dry ice in the field and transported to the central laboratory based at CNCD and stored at −80 °C. Measurements for total-cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, VLDL, AST, ALT and creatinine were obtained from serum samples using enzymatic assays. HbA1c was measured using a turbidemetric assay in whole-blood samples (Roche Diagnostics). All measurements were done at a central laboratory at CNCD. Statistical analysis comparing across genotypes was conducted using the numpy library in Python 3.11.4.
PGR stroke case control definitions
Controls
Inclusion
No medical history of stroke, myocardial infarction (MI), coronary artery disease, heart failure (HF), valvular disease, or pacemaker
Cases
General Criteria
Stroke: Diagnosis of ‘Stroke’
Ischemic: Diagnosis of ‘Ischemic stroke’
Hemorrhagic: Diagnosis of ‘Hemorrhagic stroke’
Oxfordshire Criteria
Partial anterior circulation infarct (PACI): Partial anterior circulation infarcts (PACI) stroke sub-type
Posterior circulation infarction (POCI): Posterior circulation infarcts POCI stroke sub-type
Total anterior circulation infarct (TACI): Total anterior circulation infarcts TACI’ stroke sub-type
Lacunar infarct (LACI): Lacunar infarcts stroke sub-type
TOAST Criteria
Cardioembolism (CE): Cardioembolism ischemic stroke subtype
CE probable: CE criteria, in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded
Large artery atherosclerosis (LAA): Large artery atherosclerosis ischemic stroke subtype based on TOAST classification
LAA probable: LAA criteria, in addition in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded
Small artery atherosclerosis (SAA): Small artery atherosclerosis ischemic stroke subtype
SAA probable: SAA criteria, in addition in addition diagnosis is made if the clinical findings, neuroimaging data, and results of diagnostic studies are consistent with one subtype and other etiologies have been excluded
UK Biobank stroke case control definition
UK Biobank stroke case control definitions were based on ICD10 codes as follows.
Cases
Inclusion
Phe10_I63, ICD10 3D: Cerebral infarction
Phe10_I630, ICD10 4D: Cerebral infarction due to thrombosis of precerebral arteries
Phe10_I631, ICD10 4D: Cerebral infarction due to embolism of precerebral arteries
Phe10_I632, ICD10 4D: Cerebral infarction due to unspecified occlusion or stenosis of precerebral arteries
Phe10_I633, ICD10 4D: Cerebral infarction due to thrombosis of cerebral arteries
Phe10_I634, ICD10 4D: Cerebral infarction due to embolism of cerebral arteries
Phe10_I635, ICD10 4D: Cerebral infarction due to unspecified occlusion or stenosis of cerebral arteries
Phe10_I638, ICD10 4D: Other cerebral infarction
Phe10_I639, ICD10 4D: Cerebral infarction, unspecified
Self-reported: SR_1583_ischaemic_stroke
Primary and secondary cause of death using above ICD codes.
Exclusion
Controls
Inclusion
Negative for the above codes
Negative for Phe10_Z823, ICD10 4D: Family history of stroke
Exclusion
Phe10_G45, ICD10 3D: Transient cerebral ischemic attacks and related syndromes
Phe10_G458, ICD10 4D: Other transient cerebral ischemic attacks and related syndromes
Phe10_G459, ICD10 4D: Transient cerebral ischemic attack, unspecified
Custom burden tests in UKB 450 K
Burden tests aim to boost statistical power by aggregating association signal across multiple rare variants. Prior studies in human and animal models have debated the role of various variant classes on NOTCH3 function, CADASIL pathology and patient prognosis, including experiments designed to determine if the pathogenicity of CADASIL variants follows a loss of function (LoF) mechanism8,25,50. Using data from hundreds of missense and LoF variants in NOTCH3 observed in 450 K UKB participant exomes, burden tests were conducted to assess the impact of LoF and missense variants.
Ten distinct gene burden tests of association with stroke were conducted using REGENIE26, divided into three distinct groups. The Group I tests assessed the impact of Cys-altering variants, Group II tests assessed the impact of Ser-altering variants, and Group III tests assessed LoF and all missense variants. Group I and II consisted of four distinct tests, including (1) a test of all group variants in EGFr domains 1 to 34, (2) a test of all group variants in EGFr domains 1 to 6, (3) a test of all group variants in EGFr domains 7 to 34, and (4) a test of all group variants outside of EGFr domains. The difference between Group I and Group II was Group I variants are missense variants that either add or remove a Cysteine (Cys-altering), while Group II variants are missense variants that either add or remove a Serine (Ser-altering). While the role of Cys-altering variants in CADASIL is well known (15, 19, 20), Ser-altering variants were chosen based on being the most common class of variants among NOTCH3 variants in UKB 450 K exomes. Group III consisted of two tests, including (1) a test limited to LoF variants and (2) a test limited to LoF and missense variants.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.