The use of haplotype-specific transcripts improves sample annotation consistency

Background Exact sample annotation in expression microarray datasets is essential for any type of pharmacogenomics research. Results Candidate markers were explored through the application of Hartigans’ dip test statistics to a publically available human whole genome microarray dataset. The marker performance was tested on 188 serial samples from 53 donors and of variable tissue origin from five public microarray datasets. A qualified transcript marker panel consisting of three probe sets for human leukocyte antigens HLA-DQA1 (2 probe sets) and HLA-DRB4 identified sample donor identifier inconsistencies in six of the 188 test samples. About 3% of the test samples require root-cause analysis due to unresolvable inaccuracies. Conclusions The transcript marker panel consisting of HLA-DQA1 and HLA-DRB4 represents a robust, tissue-independent composite marker to assist control donor annotation concordance at the transcript level. Allele-selectivity of HLA genes renders them good candidates for “fingerprinting” with donor specific expression pattern.


Background
Clinical molecular research and biomarker development rely on a high level of data quality. Ensuring data quality extends beyond the establishment of reproducible technical processes involved in measurement of variables. Obtaining accurate clinical metadata is of utmost importance for meaningful clinical research, as they are necessary for finding clinical disease-treatment or diseasebiomarkers relationships [1]. Drawing conclusions based on incorrect metadata can have detrimental consequences in short-term or long-term patient care. Typical sample annotation errors may be due to sample mix-ups, database entry errors, or subjectivity, e.g. grading of a biopsy. In pharmacogenomics analyses, unrecognized annotation errors or sample mix-ups impact any supervised statistical analysis, such as certain steps during biomarker discovery and qualification, and patient stratification. Estimates of sample mix-ups or annotation errors in clinical datasets range up to 18% [2]. Several statistical approaches have been devised to address annotation uncertainty during classifier development [3][4][5]. These sophisticated and extremely valuable approaches are applied as part of a later phase the analytical process. To begin addressing annotation uncertainty already at the database level, we have recently reported on a transcript marker for gender annotation which can be applied to clinical datasets of whole genome microarrays immediately after the generation of data [6]. The so-called "REDKX" gender marker is based on heterosome genes with gender-dependent expression-characteristic. Application of this marker informs about the correctness of the gender annotation of a donor. In addition, control for inter-individual sample mix-up clinical datasets with multiple, e.g. longitudinal, sampling is an absolute must, and gender annotation QC is not sufficient to that goal, since sample mixups between individuals of the same gender would go undetected.
Of particular interest are genes with largely unchanged expression levels in samples of a donor, which would have a significantly different expression level in another donor. At best, those probe sets would display an interindividual bimodal "on/off"-expression characteristic. Bimodal distribution of (signal intensity) data is a deviation from the assumed normal distribution of measurements within a population, and can be recognized by the presence of two modes, each characterized by a peak. Bimodality can occur by differential expression, heterosome-specific expression, or, as observed e.g. in cancer biology, by genomic lesions in some donors but not in others [7]. Candidate genes with bimodal expression characteristics exhibit heterosome-specific and/or haplotypespecific expression. Some Affymetrix probe sets were designed in polymorphic mRNA sequences and are (unintentionally) haplotype specific. Messenger RNA (mRNA) derived from a donor with a certain haplotype or polymorphism does not hybridize with the same affinity to a locus of different haplotype or lacking the polymorphism, resulting in an extremely different expression level for the latter sample. The reason for the decreased or increased affinity is of technical nature: hybridization of a labelled mRNA fragment to a site with a single nucleotide polymorphism (SNP) causes nucleotide mispairing, leading to the creation of a virtual bubble at the site of the SNP. The result is a less stringent binding of the mRNA fragment to the probe target, and lower signal intensity. Hence, sites with SNPs could be used for indirect genotyping, or "fingerprinting" [8,9].
In order to systematically assess deviation from unimodality of individual probe sets we have utilized the principle of the dip statistic. The dip test had been proposed as a test statistic for unimodality [10]. It estimates the maximum difference between the empirical distribution function and the unimodal distribution function that minimizes that maximum difference. In order to assess the significance of the Dip Test statistic, we ran 1 × 10 6 simulations, where a sample of the same size as the dataset in question, was drawn from a normal distribution, and a Dip Test applied to each of the draws in order to generate 1 × 10 6 simulated Dip Test statistics. As a result, we were able to derive an empirical p-value for each probe set. Here we report on the application of the Hartigans' dip test to transcriptome wide clinical gene expression microarray datasets, in the search of probe sets which would help flag samples with potential donor-ID annotation mix-up.

Results
The transcript marker was trained on a publically available dataset of 47 Affymetrix HG-U133_Plus_2 microarrays from a study of systemic juvenile idiopathic arthritis (GSE7753, http://www.ncbi.nlm.nih.gov/geo) [11]. After normalization and intensity filtering (see Methods), 21,044 probe sets entered the Hartigans' dip test. To compute empirical p-values assessing the significance of an individual dip test statistic value, the dip test statistic was computed for a simulated dataset of 47 samples, the same sample number as the GSE7753 dataset, and 1 × 10 6 permutations per probe set. Empirically we filtered those probe sets with a p-value < 0.001. Another selection criterion was the minor allele frequency. The selected marker should have a minor allele frequency of less than 50%. Probe sets with the smallest p-values are listed in Table 1. The list is populated with many bimodally expressed genes located on one of the heterosomes, such as RPS4Y1, EIF1AY, DDX3Y, KDM5D, and XIST (the genes of the REDKX gender QC marker [6]), but also probe sets for genes located on autosomes, such as human leukocyte antigen (HLA)-genes HLA-DRB4 (209728_at), and HLA-DQA1 (203290_at). Further investigation revealed that the nucleotide sequence of probe set 203290_at aligns to a highly polymorphic region of the 3' untranslated region of HLA-DQA1, and is identical to the DQA1*0401 allele, but contains at least one probe with SNPs present in alleles *0101, *0102, *0103, *0201, *0301, and *0501 ( Table 2, left panel). The presence of polymorphic SNP sites in a probe set target region is a feature which may confer "fingerprinting"-quality. Inspection of the dip test results lead to the identification of another probe set for HLA-DQA1, 213831_at (p-value 0.006). The 213831_at probe set was further pursued, revealing sequence identity to DQA1-allele *0103 and single nucleotide polymorphisms in at least one probe for the other alleles mentioned above ( Table 2, right panel). Thus, both HLA-DQA1 probe sets 203290_at and 213831_at met an important criterion of candidate fingerprinting genes, allele-selectivity. The candidate marker panel was composed of three probe sets: 203290_at (HLA-DQA1*0401 allele), 209728_at (HLA-DRB4), 213831_at (HLA-DQA1*0103 allele). The dip test data for the candidate marker probe sets are shown in Figure 1. The simulated dip data were randomly distributed, as they fall onto the identity line in the quantile-quantile plot (Q-Q plot). The computed dip test data for GSE7753 largely followed the random distribution, but some probe sets including the candidate markers deviate from the unimodal distribution. Figure 2 illustrates the bimodal signal intensity distribution of the three probe sets in the dataset GSE7753, and indicates the cut-off value of 7 (log2 scale) which has been developed empirically after visual inspection of the data. The measured signal intensity per probe set is then flagged with a 1 if it surpasses the threshold or 0 if it does not. Samples from the same donor are expected to have the same score. Since the same score may apply to different donors, one has to assume a certain leakiness of this scoring system, as the same score may apply to different donors. To increase the power to detect donor annotation inconsistencies, we strongly recommend using the REDKX marker in addition to the candidate HLA-score proposed in the present study.
We have tested the HLA-score along with the REDKX gender QC marker in five publically available datasets of 188 samples from 53 donors. The normalized data for all samples can be accessed in Additional file 1: Table S1. Table 3 Table 4 summarizes the findings in 188 samples from 53 donors in five studies of four tissues types, whole blood, peripheral blood mononuclear cells (PBMC), and lung biopsy. Six samples were flagged for suspicious gender-and/or donor ID annotation, five of which in a single dataset. Three samples had incorrect gender annotation according the REDKX gender QC. Based on the HLA-score results, seven samples had potentially incorrect donor identifiers. The annotation of 5 of the six flagged samples could not be resolved given the information resources provided in the public domain. Three of those five samples had incorrect gender and donor ID The table shows probe sets with an empirical p-value < 0.001, sorted by descending dip test statistics. About 47% of the genes in the list are located on heterosomes, among those all genes of the gender marker "REDKX" ( [6], italicized for visualization purposes). Probe sets for HLA-genes, which were further considered during the marker validation process, are highlighted in bold. Table 2 Allele-specificity of HLA-DQA1 probe sets The nucleotide sequence of 203290_at perfectly matches the *0401 allele, and contains at least one mismatched probe for other alleles while 213831_at perfectly matches the *0103 allele, and contains at least one mismatched probe for other alleles. Perfect matches are indicated as an italicized bold 1. Allele sequences from [12].
annotation, two presented HLA-score data which could not be explained without further investigation, which was beyond the scope of the current project. In summary, based on the HLA-score and the REDKX gender QC, 3% of the samples are not usable for further analysis and would yield incorrect results.

Discussion
Notwithstanding technical precision, annotation precision is a major component contributing to high data quality. In an effort to standardize annotation, metadata databases have been developed which provide user guidance by implementation of controlled vocabulary [13]. However, Quantile-quantile plot of Hartigans' dip test statisticshuman error and subjectivity is still a common source of incorrectness in such databases [1]. Analysis errors and wrong conclusions with possibly detrimental consequences can be the result if annotation errors remain undetected [4]. Current strategies of dealing with annotation errors include statistical tests at a relatively late stage of the analysis. We hypothesized that transcript markers could aid in improving detection of annotation precision at an early stage, at best before the analysis.
Our approach was to investigate, design, develop and implement quality control tools to qualify microarray data in the context of the donor sample. As an initial step we have recently developed and qualified the socalled REDKX marker which is a transcript panel marker based on expression of genes located on heterosomes, indicating gender of sample donors in clinical microarray studies [6]. However, in studies with multiple samples from donors, correct gender annotation still leaves an uncertainty about the correct assignment of a sample to a donor, as samples with the same gender annotation may come from different subjects of that gender. Hence, we devised a second annotation transcript quality control marker, which would increase the detectability of samples with donor identification mislabels.
By applying the Hartigans' Dip Test statistic we identified probe sets with bimodal expression pattern. Of particular interest are probe sets which show allele-selectivity, such as probe sets for histocompatibility genes HLA-DQA1 and HLA-DRB4. Both genes are located on the p21 arm of chromosome 6, and could be in a linkage disequilibrium region. The markers do not represent expression quantitative trait locus (eQTL) genes, and are expressed independent of gender. HLA genes code for cell surface proteins which are expressed by antigen presenting cells and in the immune system serve the purpose of   Each sample of the dataset is labelled with a three-digit score (one "1" or "0" flag per probe set). The presence of intra-score differences elicits further follow-up investigation as to the nature and source of the difference. Intensity thresholds mark an "on" or "off"-status of the transcript expression, flagged as "1" or "0", respectively. The thresholds were empirically determined for each marker separately. In some instances, the score difference is due to mild threshold violation, in other instances it may be due to sample mix-ups. The score picks up those samples which are being flagged by the REDKX gender QC, and detects further samples with issues (labeled in bold). REDKX panel expression values are provided in Additional file 1: Table S1.
self-vs. nonself-discrimination [14]. Hence, HLA gene expression patterns represent highly specific "fingerprinting" of individual donors. In general, our results receive support in the independent findings of Joehanes et al. who have suggested that gene expression levels, including those of HLA genes, could serve as "fingerprinting" data in microarray datasets [15]. The authors came to this conclusion as part of their assessment of gene expression analysis from different blood-derived RNA sources. Here, we identify a candidate marker consisting of HLA genes in an unsupervised analysis and qualify it by training and testing its performance on a number of publically available datasets. HLA genes are essential contributors to susceptibility to risk or resistance to several autoimmune diseases such as multiple sclerosis, rheumatoid arthritis, and Type 1 Diabetes [16][17][18]. As sequencing of the HLA locus continues and more alleles are being identified, the number of allele-disease associations will grow [19][20][21][22]. For our exploratory marker analysis we applied expression data for three HLA probe sets. As a consequence of the extremely high multiplexing of parameters interrogated by a microarray it is not surprising that the behaviour of a single transcript in a large transcript population is not always 100% predictable and is the result of multiple factors such as assay and array performance. A three-digit sample identifier score yields 8 different score types, not sufficient to completely discriminate all donors in studies with 9 or more donors. However, as shown in the case of donor 32 in GSE67511, only the combination of the REDKX gender marker and the score has the potential to further reduce sample ID ambiguities by eliciting follow-up investigations should discrepancies between scores within a sample collection of a single donor arise. In the present analysis, we found that about 5% of the publically available samples used herein rewarded follow-up investigations, where half of those cases could not be resolved by interrogating the data alone. Despite our encouraging results from public datasets of 4 tissue types, we recommend the optimal intensity thresholds be developed related to each tissue of interest. The reason for this suggestion lies in the tissue dependent expression level variations as well as in technical dissimilarities, which may lead to deviations in global signal intensity, e.g. by different scaling settings.
Context-dependent biomarker qualification is driven by the application of the marker. The proposed genomic fitfor purpose biomarker approach for quality control of sample ID annotations in transcriptomics clinical datasets with multiple samples per donor could be applied immediately. As part of our approach to this end, we have implemented the haplotype-based quality control as part of our microarray quality control pipeline. Microarray data files are automatically reviewed by the software after they are produced, with the analysis results cached and made available through a web interface, where likely problems are highlighted for further investigation.
In case of detected mismatches to reported metadata, a root-cause analysis will be necessary to determine the reason for the error. If the cause cannot be determined in too many cases or is systematic, the decision could be not to use the entire dataset [1].
In conclusion, we have identified a set of transcripts, which, particularly in combination with the REDKX gender marker, is capable that can be used as a starting point to control for sample ID annotation errors in clinical datasets with multiple samples per donor. In publically available datasets we have identified about 3% of unresolvable annotation errors. Thus we recommend applying the marker broadly in transcriptomics studies and to follow-up with root cause analysis where necessary. Direct sequencing will provide further confirmation and possible expansion of the marker panel. The candidate markers were applied to datasets from the public domain (http://www.ncbi.nlm.nih.gov/geo). About 3% of samples have annotation issues which could not be resolved by visual inspection of the intensity data alone. The individual REDKX marker intensity values are not shown. Individual results are shown in Additional file 1: Table S1.