Intrinsic equivalence between developmental prosopagnosia questionnaires

The 20-item prosopagnosia index (PI20) is a self-report measure of face recognition ability, which is aimed to assess the risk for developmental prosopagnosia (DP), developed by Shah, Gaule, Sowden, Bird, and Cook (2015). Although they validated PI20 in several ways and it may serve as a quick and cost-effective measure for estimating DP risk (Livingston & Shah, in press; Shah et al., 2015), they did not formally evaluate its validity against a pre-existing alternative questionnaire (Kennerknecht et al., 2008) even though they criticized the weak relationship of the pre-existing questionnaire to actual behavioral face recognition performance. Thus, we administered the questionnaires to a large population (N = 855) and found a very strong correlation (r = 0.82 [95% confidence interval: 0.80, 0.84]), a principal component that accounted for more than 90% of the variance, and comparable reliability between the questionnaires. These results show unidimensionality and equivalence between the two questionnaires, or at least, a very strong common latent factor underlying them. The PI20 may not be greater than the pre-existing questionnaire; the two questionnaires measured essentially the same trait. The intrinsic equivalence between the questionnaires necessitates a revision of the view that the PI20 overcomes the weakness of the pre-existing questionnaire. Because both questionnaires contained unreliable items, we suggest, instead of using either questionnaire alone, that selection of a set of items with high reliability may offer a more robust approach to capture face recognition ability.


Introduction
Prosopagnosia is a neurodevelopmental condition that has deficits in recognizing human faces. Individuals with acquired prosopagnosia have lost their ability in recognizing faces after brain damage, whereas individuals with developmental prosopagnosia (DP) suffer from face recognition at any time during their development, in the absence of any brain injury. DP is characterized by selective deficits in recognizing faces, despite normal intelligence and intact visual object recognition ability (Brad Duchaine & Nakayama, 2005), and is estimated to affect 2% of the population (Kennerknecht, Grueter, et al., 2006;Kennerknecht, Plumpe, Edwards, & Raman, 2006).
It is evident that neurological injury is the primary cause of the inability to recognize faces in the case of acquired prosopagnosia, but what lies behind the DP is still unclear, because distinct anatomical anomalies are not apparent. Despite recent efforts to find neural signatures for DP (Garrido et al., 2009;Rosenthal et al., 2017;Rossion et al., 2003;Song, Zhu, Li, Wang, & Liu, 2015), quantitative, biological measure for estimating DP risk is not established. This absence of confirmed biological markers creates challenges in diagnosing DP by behavioral criteria alone (Barton & Corrow, 2016). It therefore forces us to rest on subjective report for being bad at faces when diagnosing (or more precisely, specifying or classifying) DP. In fact, almost all DP studies have recruited suspected DPs solely based on self-report complaining of face recognition difficulties or semi-structured interviews at best. Although some post-hoc behavioral validation may or may not be employed, there still is no established way to determine whether the suspected DPs represent simply the low-end, quantitatively extreme expression of face recognition ability or qualitatively distinct pathology (see Barton & Corrow, 2016).  Duchaine, Germine, & Nakayama, 2007), and famous face tests (Brad Duchaine & Nakayama, 2005;McNeil & Warrington, 1993). In addition to these behavioral tests, questionnaire-based tests are developed to serve as a quick and cost-effective measure for estimating DP risk. Kennerknecht, Ho, and Wong (2008) developed a 15-item DP questionnaire in Hong Kong population (hereafter, HK questionnaire) and estimated the prevalence of DP to be about 2%, which is in the same range as in Caucasians. However, the HK questionnaire has been criticized because it contains items irrelevant to face recognition and it has a weak relationship to actual behavioral performance (Palermo et al., 2017;Shah, Gaule, Sowden, Bird, & Cook, 2015). Recently, Shah et al. (2015) developed a new DP questionnaire, the 20-item prosopagnosia index (PI20), which was aimed to overcome the weakness of the pre-existing questionnaire, and showed that the PI20 correlated with behavioral face recognition performance. In addition, Shah and colleagues claimed that "adults have good insight into their face recognition difficulties" in general populations (Livingston & Shah, in press). However, the PI20 was not formally evaluated against the pre-existing Kennerknecht's questionnaire and thus the relationship between the questionnaires remains unclear. Given that items in the two questionnaires are so similar simply asking so how good (or bad) people are at recognizing individual faces, it is likely that the two questionnaire measures essentially the same traits. We examined this issue by administering the two questionnaires to a large population and performing a set of analyses including correlation analysis, hierarchical clustering, and a brute-force calculation/comparison of reliability coefficients. If the PI20 is a greater self-report instrument in identifying DP than the pre-existing questionnaire, the PI20 is expected to have a moderate relationship with the HK questionnaire such that items of the two questionnaires form different clusters and/or have higher scale reliability than the HK questionnaire.

Participants
Eight hundred and fifty-five young Japanese adults (427 female, 428 male; mean age: 20.9 ± 2.2 [±1 SD] years; range: 18-36 years) participated in the study. All had normal or corrected-to-normal vision and none reported a history of neurological or developmental disorders. The study procedure was conducted in accordance with the Declaration of Helsinki and approved by the Committee of Ethics, Waseda University, Japan (#2015-033). All participants provided written informed consent prior to participation.

Procedure
We asked participants to complete the questionnaires using an 8-inch touchscreen tablet PC. They were required to indicate the extent to which 36 items (15 from the pre-existing Hong Kong (HK) prosopagnosia questionnaire (Kennerknecht et al., 2008), and 20 from PI20 (Shah et al., 2015), and an additional item pertaining self-confidence in face recognition ability: "I am confident that I can recognize faces well compared to others") described their face recognition experiences. Responses were provided using a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The participants were instructed to complete the questionnaires at their own pace.

Data analysis
Because the HK prosopagnosia questionnaire developed by Kennerknecht et al. (2008) contains four dummy questions (HK#10, #11, #12, and #13) that are irrelevant with respect to face recognition, we excluded these items and calculated the total scores ranging 11 to 55, using the remaining 11 items (hereafter, 'HK11') [score range: 11-55]. PI20 scores were calculated using all 20 items and ranged from 20 to 100. As females have exhibited superior performance in behavioral face recognition studies [3], we examined sex differences between the questionnaire scores. In addition, we used polychoric correlation coefficients to infer latent Pearson correlations between individual items from the ordinal data. The polychoric correlation matrix was estimated using two-step approximation (Olsson, 1979).
Cronbach's α and Revelle & Zinbarg's omega total coefficients were calculated to assess the scale reliability of both HK11 and PI20. Omega total coefficients were estimated using a maximum likelihood procedure (Revelle & Zinbarg, 2009). CIs for the coefficients were estimated using a bootstrap procedure (10,000 replications) with a bias-corrected and accelerated approach (DiCiccio & Efron, 1996;Kelley & Pornprasertmanit, 2016). As it was possible that higher reliability coefficients merely reflected the higher number of items in the PI20, relative to that in the HK11 (Cortina, 1993), we performed a brute-force calculation of reliability coefficients for all 167,960 ( 20 C 11 ) possible combinations of PI20 items, 11 items at a time (i.e., subsets of the PI20 generated by choosing 11 of the 20 items), which allowed us to compare reliability coefficients between the questionnaires with a virtual match of the numbers of items. Table 1 shows descriptive statistics for the total HK11 and PI20 scores. Independent and PI20 (p = 0.7554) scores. These results indicate that females and males showed almost identical mean HK11 and PI20 scores and score distributions, suggesting that sex was not a significant factor.

Correlations between total scores
The results showed a very strong significant correlation between the total scores for the two questionnaires ( Figure 1 scores showed that the first principal component (PC1) accounted for 91.1% (using standardized scores) and 94.2% (using raw scores) of the total variance in scores.

Correlations between individual item scores
The correlation matrix (Figure 2) generally showed correlations between individual items across the two scales; however, some items were not correlated with other items to the extent that they would reduce the reliability or internal consistency of a single measure pertaining to a single construct (i.e., DP risk). In fact, hierarchical clustering using the unweighted pair group method with arithmetic mean showed that eight out of 36 items were distant from a cluster to which most items belonged (shaded areas in Figure 2, dendrogram). These eight items consisted of ( Table 2): the four items already known to be irrelevant with respect to DP (HK#10, HK#11, HK#12, and HK#13), and two items from the HK prosopagnosia questionnaire (HK#2 and HK#7), and two items from the PI20 (PI#3 and PI#13). Previous studies reported that five of the eight item-score differences were marginal (score difference [DP − control]: <1) between individuals with suspected DP and typically developed control individuals (0.45 for HK#10, −0.39 for HK#11, 0.11 for HK#12, −0.45 for HK#13, and 0.62 for PI#3) (Kennerknecht et al., 2008;Shah et al., 2015). However, it should be noted that the score difference exceeded 1 for the remaining three items (1.46 for HK#2, 1.12 for HK#7, and 1.16 for PI#13), suggesting that these three items could actually measure prosopagnosia traits that differ from those measured via the other 28 items.
The brute-force calculation of reliability coefficients showed that the coefficients for the eleven-item PI20 subsets were almost comparable (within 1 SD) to those for the

Discussion
Our results showed that the two representative 'DP questionnaires' are closely related to each other such that they are almost indistinguishable in terms of correlation analyses, PCA, hierarchical clustering, and item reliability. It is worth noting that a recent meta-analysis showed that test-retest reliabilities for instantaneously administered tests are about r = 0.8 (Calamia, Markon, & Tranel, 2013), which is comparable to our findings (r = 0.82). This may indicate that the correlation coefficient between the two questionnaires is sufficiently high to consider that the two questionnaires might measure essentially the same trait to the extent of reliability that solid neuropsychological tests can achieve. PCA also confirmed that the primary component accounted for more than 90% of the total variance. Shah et al. (2015) claimed that "the PI20 can play a valuable role in identifying DP" and it will lead to "reliable classification [of DP]"; however, the huge correlation and comparable reliability between the two questionnaires indicate that the PI20 is not at all greater than the pre-existing questionnaire, even though they criticized it because of its "weak relationship" to actual behavioral face recognition performance. Although further research is needed to evaluate/compare the PI20 and HK11 with behavioral face recognition performance, it may be hard to expect that, given the huge correlation, it is only the PI20 that has a strong relationship to face recognition performance whereas the HK11 has a weak relationship to it. In fact, recent studies suggest that people have poor insight into their face recognition abilities (Bowles et al., 2009;Palermo et al., 2017).
In addition, we found strong correlations between individual items across the two questionnaires. Instead of forming different clusters of their own, the bulk of the items across two questionnaires formed a single cluster, which is distant from the remaining eight items (Table 2) ability to mentally image an object [HK#7]). Although we did not administer other questionnaires, the little or no correlation between face-recognition-relevant and irrelevant items supports the view that the high inter-individual correlations may be specific to self-reported face recognition ability. In terms of scale reliability, we suggest researchers and clinicians to discard these eight items when assessing an individual's insight into their face recognition ability.
In conclusion, our results indicated that the two representative DP questionnaires (Kennerknecht et al., 2008;Shah et al., 2015) essentially measured the same face recognition traits. In addition, the huge correlation and robust principal component demonstrated a common latent factor between the two measures. This putative unidimensionality could be intrinsic to insight into one's face recognition ability, rather than behavioral face recognition performance per se (Palermo et al., 2017). Although the PI20 may serve as a measure for estimating DP risk and face recognition ability in general populations (Livingston & Shah, in press), our findings suggest that, contrary to the Shah et al.'s claims, its reliability and validity may be almost equivalent to that of the pre-existing questionnaire (Kennerknecht et al., 2008) (precisely, the reduced subset, HK11) at the individual-item level. Given the current state of DP, where neither objective diagnostic criteria nor biological markers have been established (Barton & Corrow, 2016;Susilo & Duchaine, 2013), it might be a good idea to focus on creating a reliable face recognition questionnaire (rather than a 'DP questionnaire') that can predict behavioral face recognition performance by discarding unreliable, face-recognition-irrelevant items (Table 2, but see Palermo et al., 2017;Zell & Krizan, 2014). Alternatively, more exploratory research not only using HK11 and PI20 together or a combination thereof, but also a range of other face processing measures could aid the extraction of latent prosopagnosia traits/dimensions and the development of a valid DP taxonomy.

Funding
This study was supported by grants from the Japan Society for the Promotion of Science (#26540061 to DM and #17H06344 to KW) and Core Research for Evolutional Science and Technology (CREST) at the Japan Science and Technology Agency (#JPMJCR14E4 to KW).

Authors' contributions
DM and KW designed the study and wrote the manuscript. DM collected and analyzed the data.

Declaration of Conflicting Interests
The authors have no conflicting interests to declare Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009  to which the bulk of the items belonged (see Figure 2). Items marked with asterisks (*) are reverse scored. HK#6 * I always recognize family members  Scatter plot with color-coded transparent density curves of total scores for the 20-item prosopagnosia index (x-axis) and 11 items from Hong Kong prosopagnosia questionnaire (y-axis). Each circle indicates individuals' data, and color represents sex (red = female scores; blue = male scores). The gray transparent line represents a linear orthogonal regression line (first principal component, PC1 axis), which accounts for more than 90% of the total variance in scores in PCA with a singular value decomposition.  Fig. 1