A psychometric evaluation of the 12-item EPQ-R neuroticism scale in 502,591 UK Biobank participants using item response theory (IRT)

Background Neuroticism has been described as a broad and pervasive personality dimension or ‘heterogeneous’ trait measuring components of mood instability such as worry, anxiety, irritability, moodiness, self-consciousness and sadness. Consistent with depression and anxiety-related disorders, increased neuroticism renders an individual vulnerable to other unipolar and bipolar mood disorders. However, the measurement of neuroticism remains a challenge. Our aim was to identify psychometrically efficient items and inform decisions on the inclusion of redundant items across the 12-item EPQ-R Neuroticism scale using Item Response Theory (IRT). Methods The 12-item binary EPQ-R Neuroticism scale was evaluated by estimating a two-parameter logistic (2-PL) IRT model on data from 502,591 UK Biobank participants aged 37 to 73 years (M = 56.53 years; SD = 8.05), 54% female. Models were estimated with listwise deletion (n = 401,648) and the statistical assumptions of the model were checked post-estimation. All analyses were conducted in STATA 16 SE on the Dementias Platform UK (DPUK) Data Portal. Results A plot of θ values (item information functions) showed that most items clustered around the mid-range, where discrimination values ranged from 1.34 to 2.28. Difficulty values for individual item θ scores ranged from -0.13 to 1.41. A Mokken analysis suggested a weak to medium level of monotonicity between the items; no items reached strong scalability (H = 0.35-0.47). Systematic item deletions and rescaling found that a 7-item scale is more efficient, with information (discrimination) ranging from 1.56 to 2.57 and a stronger range of scalability (H = 0.47-0.52). A 3-item scale is highly discriminatory but offers a narrow range of person ability (difficulty). A logistic regression differential item functioning (DIF) analysis exposed significant gender item bias functioning uniformly across all versions of the scale.
Conclusions Across 401,648 UK Biobank participants, the 12-item EPQ-R neuroticism scale exhibited psychometric inefficiency with poor discrimination at the extremes of the scale-range. High and low scores are relatively poorly represented and uninformative, suggesting that high neuroticism scores derived from the EPQ-R are a function of cumulative mid-range values. The scale also shows evidence of gender item bias, and future scale development should consider this along with item deletions.

Background
Neuroticism has been described as a broad and pervasive personality dimension with influences beyond its own limited definition (1). Operationally, it has been defined as a personality trait assessed by items referring to instances of worry, anxiety, irritability, moodiness, self-consciousness and sadness (2)(3)(4). The NEO-PI (Neuroticism-Extraversion-Openness Personality Inventory) operationalises neuroticism as a combination of individual behavioural traits which may also be measured as isolated components of mood state, e.g., anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability (1). Also defined as a 'heterogeneous' trait possessing significant overlap with depression and anxiety, neuroticism renders an individual vulnerable to other unipolar and bipolar mood disorders (3). Moreover, increased levels of neuroticism render an individual vulnerable to other neurotic disorders, psychological distress and 'emotional instability' (5). There is also consistent research suggesting a positive relationship between neuroticism and negative affect (6), notwithstanding neuroticism essentially existing as a dimension of negative affect (7).
Eysenck has further argued that neuroticism is a direct reaction to the autonomic nervous system (8, 9), a finding supported by studies in which increased neuroticism correlated with tolerance to a highly stressed environment, suggesting a habituation relationship with everyday stressors (10, 11) (12). Assessment outcomes of this scale were reported in the Manual for the Maudsley Personality Inventory (MPI), where gender differences were found across the psychiatric patients and soldiers from whom the data were derived (13). Later versions of the MPI were revised to remove gender-specific items although, to our knowledge, details of the method of their removal are not available. The revised neuroticism scale became a component of the Eysenck Personality Questionnaire (EPQ-R: 14) and thereby exists as a culmination of attempts to select the relevant items through multiple revisions of the MPI. Although the EPQ-R neuroticism scale is reported to have been developed through clinical judgement and multiple cluster and factor analyses, the reasons for acceptance or rejection of items were complex and unclear, and are not 'objectified' (13,15).
Factor analysis and correlations, whilst widely used for item deletion, are biased towards identifying closely associated items as informative and are opaque to individual item contribution and person ability. The process is commonly known as classical test theory (CTT), whereby a summated score is computed from individual item scores. The EPQ-R neuroticism scale has been found to lack items that identify respondents who would normally endorse items at the extreme ends of the trait continuum, e.g. high vs. low neuroticism (5). Furthermore, the scale retains gender-specific items, with females consistently scoring higher (14,16), a difference which has been reported cross-culturally (17) and across the age range (e.g., 18).
We investigated the psychometric efficiency of the 12-item EPQ-R neuroticism scale (hereafter 'EPQ-R'; 14) as a widely used measurement of neuroticism. We applied item response theory (IRT) to psychometrically evaluate the EPQ-R using data from UK Biobank (19), a large population study which assessed neuroticism at baseline. Our expectation was that the large sample size and balanced gender ratio (54% female) would provide valuable item-level information for assessing the informativeness of individual items and the overall psychometric reliability of the scale. This assessment may have important implications in clinical settings and for epidemiological research, where the scale is widely used.

Participants
The UK Biobank is a large population-based prospective cohort study of 502,665 participants. Invitations to participate in the UK Biobank study were sent to 9.2 million community-dwelling persons in the UK aged between 37 and 73 years who were registered with the UK National Health Service (NHS). A response rate of 5.5% was recorded. Ethical approval was granted to UK Biobank by the Research Ethics Committee (REC reference 11/NW/0382) (19). The analysis here was applied to the whole cohort of 502,591 participants after withdrawals were taken into account.

Procedure
Assessments took place at 22 centres across the UK where participants gave informed consent and undertook comprehensive mental health, cognitive, lifestyle, biomedical and physical assessments. The selection of mental health assessments was completed on a touchscreen computer, including the 12-item EPQ-R neuroticism scale (14), where participants were required to answer 'yes', 'no', 'I don't know' or 'I do not wish to answer' in response to the 12 questions: 'Does your mood often go up and down?'; 'Do you ever feel
The dependent variable is the dichotomous response (yes/no); the independent variables are the person's trait level, theta (θ), and item difficulty (Bi). The independent variables combine additively, and the item's difficulty is subtracted from θ. That is, the model describes the log of the ratio of the probability of success for a person on an item to the probability of failure, and a logistic function provides the probability of endorsing any item (i), which is assumed to be independent of the outcome of any other item after controlling for the person parameter (θ) and the item parameters. The 2-PL model includes two parameters to represent the item properties (difficulty and discrimination) in the exponential form of the logistic model.
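The 2-PL response function described above can be illustrated with a short sketch. This is not the Stata code used in the study; it assumes the plain logistic metric (no 1.7 scaling constant), and the function name `irf_2pl` is introduced here purely for illustration. The example parameters are the discrimination (2.28) and difficulty (0.21) reported in the Results for the item 'Does your mood often go up and down?'.

```python
import math


def irf_2pl(theta: float, a: float, b: float) -> float:
    """Probability of endorsing an item under a 2-PL model.

    theta: person trait level, a: item discrimination, b: item difficulty.
    P = 1 / (1 + exp(-a * (theta - b))), assuming the plain logistic metric.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Values reported in the Results for 'Does your mood often go up and down?'
A_MOOD, B_MOOD = 2.28, 0.21

for theta in (-1.0, B_MOOD, 1.0):
    prob = irf_2pl(theta, A_MOOD, B_MOOD)
    print(f"theta = {theta:+.2f}  P(yes) = {prob:.3f}")
```

When θ equals the item difficulty, the exponent is zero and the endorsement probability is 0.5, which is the definition of difficulty used in the following paragraph.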
For each item, an item response function (IRF) may be calculated which calibrates the responses of an individual against each item. A calibrated standardised score for trait severity θ is returned and may be plotted as an item characteristic curve (ICC) along a standardised scale with a mean of 0 (Figure 1a). From the ICC two parameters may be estimated. The first is the value of θ at which the likelihood of item endorsement is 0.5, interpreted as 'expressed trait severity'. The second is the slope of the curve at the point at which the likelihood of item endorsement is 0.5, interpreted as 'expressed item discrimination', i.e., the ability to discriminate between greater and lesser severity scores. The IRF may also be expressed as an item information function (IIF) which displays the relationship between severity and discrimination (Figure 1b). The apex of the curve for any IIF indicates the value of θ at which there is maximum discrimination. By convention, scales expressing a range of θ values are more informative than those with items clustering around a single value, and items with a discrimination score of >1.7 are considered informative, although lower values are considered contributory within context (20). Statistical assumptions underlying the IRT principles of scalability, unidimensionality and item-independence were examined. UK Biobank data for this analysis (application 15008) were uploaded onto the Dementias Platform UK (DPUK) Data Portal (21) and analysed using STATA SE 16.1 (22).

Results

IRT analysis
A 2-PL IRT model was estimated whereby difficulty and discrimination parameters were computed (Table 1). The discrimination (item-information) parameters across the scale range between 1.34 and 2.28. The item 'Does your mood often go up and down?' exhibits the highest level of discrimination at 2.28, suggesting that this 'mood' question carries the most information about the neurotic trait. In contrast, the item 'Are you an irritable person?' has the lowest discrimination (1.34), below the recommended level of 1.7 for items measuring trait values (20).
The items 'Are you a worrier?'; 'Do you suffer from nerves?'; 'Do you ever feel just miserable for no reason?'; 'Do you often feel fed-up?' and 'Would you call yourself tense or highly strung?' also have discrimination values above 1.7.
The difficulty parameter locates each item on the θ continuum: it is the value of θ at which a respondent has a 50% probability of endorsing the item. Figure 2 shows the item characteristic curves (ICCs) for each of the items, presenting both the steepness of the discrimination curve and the position of the difficulty value on the θ continuum. For example, for the item 'Does your mood often go up and down?', there is a 50% probability that someone with a θ of 0.21 (someone who does experience neurotic trait characteristics) would endorse this item; it is therefore considered characteristic of neuroticism, albeit at a low level. In contrast, for the item 'Are you a worrier?', there is a 50% chance of someone with a θ of -0.13 endorsing this item, that is, someone who does not experience neurotic trait characteristics.
Additional information on item discrimination is available by graphing the IIF curves (see Figure 3). The IIF curves display the relationship between difficulty (trait level) and discrimination (information); an important feature of this graph is the position on the θ continuum directly below the apex of each item curve. Items whose apex lies in the positive half of the θ continuum provide information about the neurotic trait when the trait characteristic is endorsed (present). For example, the item 'Do you often feel lonely?' has its apex in positive θ and is more likely to be endorsed by someone with a higher level of neuroticism (1.41) than the item 'Does your mood often go up and down?', which is also positioned in positive θ but has a lower difficulty value (0.21). Therefore, although the 'mood' item has the highest discrimination value (see above), it does not provide sufficient information about respondents who possess a high level (presence) of the trait (+1 to +4) or a low level (absence) of the trait (-1 to -4); instead, it provides the most information for respondents who possess an average (θ = 0) to minimal amount of the neuroticism trait (see Table 1). The item with the least discrimination is 'Are you an irritable person?': although its IIF curve apex is positioned over a positive θ (0.95), so that it may be endorsed by a respondent possessing some amount of the trait characteristic, its discrimination value is low (1.34).
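The shape of an IIF curve can be sketched from the two item parameters alone. The snippet below is an illustration rather than the study's Stata output: it assumes the standard 2-PL information formula I(θ) = a²P(θ)(1 − P(θ)) on the plain logistic metric, and the function name `item_information` is hypothetical. Under this formula the curve peaks at θ = b with height a²/4. The two parameter pairs are those reported in the text for the 'mood' and 'irritable' items.

```python
import math


def item_information(theta: float, a: float, b: float) -> float:
    """2-PL item information: a^2 * P * (1 - P), with P the endorsement probability."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


# (discrimination, difficulty) pairs reported in the text
ITEMS = {
    "mood (up and down)": (2.28, 0.21),
    "irritable person": (1.34, 0.95),
}

for name, (a, b) in ITEMS.items():
    at_apex = item_information(b, a, b)       # information at theta = b
    at_extreme = item_information(3.0, a, b)  # information far up the continuum
    print(f"{name}: apex at theta = {b}, "
          f"I(apex) = {at_apex:.3f}, I(theta = 3) = {at_extreme:.3f}")
```

The comparison of the apex value with the value at θ = 3 shows why a highly discriminating item can still say little about respondents at the extremes: information falls away quickly on either side of the item's difficulty.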
In summary, the overall pattern of item distribution across the θ continuum suggests that across the 12-

Reliability
In IRT, reliability may be calculated at multiple values of θ along the continuum rather than as a single reliability score as in CTT. Reliability is defined at different points of θ with the mean of θ fixed at 0 and the variance at 1, facilitating identification of the model and reliability for all points along the θ continuum, distinguishing respondents according to specific values of θ (23). For the 12-item scale, reliability is very good for respondents who possess an average or just above average amount of the trait (θ = 0: 0.87; θ = 1: 0.88). However, reliability then decreases (θ = 2: 0.76; θ = -1: 0.71), suggesting that the neurotic trait is most reliably measured at an average or slightly elevated level of neuroticism (θ = 0 or 1). Thereafter, reliability reduces so that the extreme ends of the continuum (θ = 3, 4, -2, -3 and -4) are no longer reliably measured (see Table 2).
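As a rough illustration of how reliability can vary along θ, the sketch below uses one common convention for a latent trait with variance fixed at 1: conditional reliability = 1 − 1/I(θ), where I(θ) is the test information (the sum of the 2-PL item information values). Whether this is the exact formula behind Table 2 is an assumption. The twelve item parameters are placeholders spread evenly across the reported ranges (discrimination 1.34 to 2.28, difficulty -0.13 to 1.41); they are not the Table 1 estimates, so the printed values will not reproduce the reported reliabilities.

```python
import math


def item_information(theta, a, b):
    """2-PL item information on the plain logistic metric."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def conditional_reliability(theta, items):
    """1 - 1/I(theta), assuming the latent variance is fixed at 1.

    Values below 0 indicate that the standard error of theta exceeds
    the latent standard deviation at that point.
    """
    test_information = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 - 1.0 / test_information


# Placeholder (hypothetical) parameters spanning the reported ranges
N_ITEMS = 12
ITEMS = [
    (1.34 + i * (2.28 - 1.34) / (N_ITEMS - 1),
     -0.13 + i * (1.41 + 0.13) / (N_ITEMS - 1))
    for i in range(N_ITEMS)
]

for theta in range(-4, 5):
    print(f"theta = {theta:+d}  reliability = "
          f"{conditional_reliability(theta, ITEMS):.2f}")
```

Even with placeholder parameters, the pattern matches the one described above: reliability is highest where the item difficulties cluster and falls away sharply towards both extremes.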

Statistical assumptions

Item independence
A correlation analysis assessed initial item independence. All items were significantly correlated (p < .001) but the majority of values were lower than 0.50, suggesting basic local item independence. A residual coefficient matrix, requested after estimation of a single-factor model, showed that no residual correlations exceeded 0.20 (24), again suggesting basic item independence.

Monotonicity
A Mokken analysis produced a Loevinger H coefficient (25) which measures the scalable quality of items, expressed as a probability measure, independent of a respondent's θ. These coefficients ranged between 0.35 and 0.47 (Table 3).
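For binary items, Loevinger's H can be computed directly from the response matrix as the ratio of the observed inter-item covariance to the maximum covariance attainable given the item endorsement rates. The sketch below is a minimal pure-Python illustration, not the Mokken routine used in the study; the function name `loevinger_h` and the toy data are hypothetical, and it assumes no item is endorsed by all or by none of the respondents.

```python
from itertools import combinations


def loevinger_h(data):
    """Return (item-level H list, scale H) for a list of 0/1 response rows."""
    n = len(data)
    k = len(data[0])
    p = [sum(row[i] for row in data) / n for i in range(k)]  # endorsement rates
    cov_sum = [0.0] * k
    max_sum = [0.0] * k
    total_cov = total_max = 0.0
    for i, j in combinations(range(k), 2):
        p_both = sum(row[i] * row[j] for row in data) / n
        cov = p_both - p[i] * p[j]
        # The joint endorsement rate cannot exceed the smaller marginal rate
        cov_max = min(p[i], p[j]) - p[i] * p[j]
        for item in (i, j):
            cov_sum[item] += cov
            max_sum[item] += cov_max
        total_cov += cov
        total_max += cov_max
    item_h = [c / m for c, m in zip(cov_sum, max_sum)]
    return item_h, total_cov / total_max


# Toy data: a perfect Guttman pattern, which yields H = 1 for every item
GUTTMAN = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(loevinger_h(GUTTMAN))
```

By the usual Mokken convention, values of 0.30 to 0.40 are read as weak, 0.40 to 0.50 as medium and 0.50 or above as strong scalability, which is the scale against which the coefficients in this section are judged.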

Unidimensionality
A principal component analysis (PCA) shows that a single major factor is responsible for 36% of the variance and a second factor for 11% of the variance; the difference is above the suggested 20%, indicating that a single major factor is being measured (26). A post-estimation measure of unidimensionality was also computed using a semi-partial correlation controlling for θ. This analysis provides the individual item variance contribution after adjusting for all the other variables including θ. It demonstrates the relationship between local independence and unidimensionality, reflecting a conservative assessment whereby the desired R² should ideally be zero or as close to zero as possible (27). Items ranged between 0.01 and 0.02, suggesting unidimensionality. To our knowledge, there is still no standardised cut-off criterion for assessing this value (i.e., how close to zero all items should be across a scale).

IRT revised analysis
To assess a revised scale, items were systematically removed from the scale according to discrimination value, with the lowest discriminating item removed first ('Are you an irritable person?', 1.34), whereafter an 11-item 2-PL IRT model was estimated with the remaining items and the process repeated, each time removing the lowest discriminating item below 1.7. In order of removal, the items systematically removed thereafter were: 'Do you often feel lonely?'; 'Are you often troubled by feelings of guilt?'; 'Do you worry too long after an embarrassing experience?' and 'Are your feelings easily hurt?', at which stage the 7 remaining items were retained, as most had discrimination values > 1.70 (estimated on 434,693 individuals).
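The remove-and-re-estimate loop described above can be sketched as follows. This is an outline under stated assumptions, not the authors' procedure in code: `fit_discriminations` is a hypothetical callback standing in for a real 2-PL estimation step (for example, a call out to an IRT package), and the stopping rule used here (stop once the weakest remaining item reaches the threshold, or a minimum scale length is hit) is a simplification, since the text notes that the 7-item scale was retained when most, not all, items exceeded 1.70.

```python
def prune_items(items, fit_discriminations, threshold=1.7, min_items=3):
    """Iteratively drop the lowest-discriminating item and re-estimate.

    items: list of item names.
    fit_discriminations: callable taking the current item list and returning
        a dict {item name: discrimination} from a freshly fitted model.
    Returns (kept items, list of (removed item, its discrimination)).
    """
    kept = list(items)
    removed = []
    while len(kept) > min_items:
        discriminations = fit_discriminations(kept)  # re-estimate each round
        weakest = min(kept, key=lambda name: discriminations[name])
        if discriminations[weakest] >= threshold:
            break  # every remaining item meets the threshold
        kept.remove(weakest)
        removed.append((weakest, discriminations[weakest]))
    return kept, removed


# Toy stand-in for model fitting: fixed, made-up discrimination values
TOY_VALUES = {"item_a": 1.34, "item_b": 1.50, "item_c": 1.80, "item_d": 2.28}


def toy_fit(current_items):
    return {name: TOY_VALUES[name] for name in current_items}


kept, removed = prune_items(list(TOY_VALUES), toy_fit, min_items=1)
print("kept:", kept)
print("removed:", removed)
```

In a real analysis the discrimination values change after every removal, which is why the model is refitted inside the loop rather than ranking the items once from the full-scale estimates.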
The item parameters for the 7-item scale are presented in Table 4. Statistical assumptions were computed on the revised scale of 7 items (Table 4) and importantly a Mokken analysis suggests improved scalability (monotonicity) compared to the full 12-item scale with two items reaching values >0.50 (Table 5).
Reliability across the scale is marginally improved compared to the full scale, suggesting redundancy of the removed items (Table 6). Acceptable metrics for unidimensionality and item independence were achieved for this revised scale. The ICC and IIF graphs for the revised 7-item scale are presented in Figures 4 and 5, where improved item information over the 12-item scale is evident.
Table 7). A Mokken analysis suggests that scalability is strong (H ≥ 0.50) across all items (Table 8); a semi-partial correlation analysis controlling for θ showed all values were 0.00.
Reliability is only good at θ = 0, suggesting this scale is only reliable for measuring those with an average trait level (Table 9). The ICC and IIF graphs suggest the three-item scale may present an efficient, alternative and highly informative scale; however, the scale is narrow in range and does not possess items measuring neurotic traits above or below average θ, at the extreme ends of the trait spectrum (Figure 6).

Differential-Item Functioning (DIF) Analysis
To investigate gender differences in item functioning, a logistic DIF analysis was conducted across all three versions of the scale with gender as the observed group. Uniform and nonuniform DIF tests assessed whether specific items favoured one group over the other (male vs. female) for all values of the latent trait (uniform) or only for selected values of the latent trait (nonuniform). The output of these analyses is presented in Table 10, where evidence of significant uniform DIF for gender was found across all three versions.
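The logic of a logistic regression DIF test can be sketched with three nested models for a single item: (1) item on the matching variable, (2) adding group, and (3) adding the matching variable by group interaction. The likelihood-ratio statistic for (2) vs (1) tests uniform DIF and for (3) vs (2) tests nonuniform DIF, each referred to a chi-square distribution with 1 degree of freedom (critical value 3.84 at the 5% level). The code below is a self-contained illustration on simulated data, not the Stata routine used in the study; all function names are hypothetical, and using the simulated trait itself as the matching variable is an assumption made for brevity (in practice a total score or estimated θ would be used).

```python
import math
import random


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def logit_loglik(X, y, iterations=25):
    """Fit a logistic regression by Newton-Raphson; return its log-likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            eta = sum(bj * xj for bj, xj in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            for r in range(k):
                grad[r] += (yi - p) * row[r]
                for c in range(k):
                    info[r][c] += p * (1.0 - p) * row[r] * row[c]
        beta = [bj + sj for bj, sj in zip(beta, solve(info, grad))]
    loglik = 0.0
    for row, yi in zip(X, y):
        eta = sum(bj * xj for bj, xj in zip(beta, row))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return loglik


def logistic_dif(item, match, group):
    """Likelihood-ratio statistics for uniform and nonuniform DIF (1 df each)."""
    ll_base = logit_loglik([[1.0, s] for s in match], item)
    ll_group = logit_loglik([[1.0, s, g] for s, g in zip(match, group)], item)
    ll_inter = logit_loglik([[1.0, s, g, s * g] for s, g in zip(match, group)], item)
    return {"uniform_lr": 2.0 * (ll_group - ll_base),
            "nonuniform_lr": 2.0 * (ll_inter - ll_group)}


# Simulated data with uniform DIF built in: group 1 is more likely to endorse
rng = random.Random(0)
n = 2000
theta = [rng.gauss(0.0, 1.0) for _ in range(n)]
group = [float(i % 2) for i in range(n)]
item = [1.0 if rng.random() < 1.0 / (1.0 + math.exp(-(1.5 * t - 0.2 + 0.8 * g)))
        else 0.0 for t, g in zip(theta, group)]

result = logistic_dif(item, theta, group)
print(result)  # compare each statistic with 3.84 (chi-square, 1 df, 5% level)
```

With a sample as large as UK Biobank, even small group effects will produce large test statistics, so in practice the size of the group coefficient is usually inspected alongside its significance.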

Discussion
In a large population cohort of 502,591 adults aged 37-73 years, limitations in the range and reliability of item trait characteristics were found across the 12-item EPQ-R neuroticism scale when a 2-PL IRT model was estimated listwise on 401,648 individuals. Our findings suggest that the 12-item scale is inefficient, with poor discrimination at the extreme ends of the scale-range, such that high and low scores are relatively poorly represented and uninformative. A reliability function analysis also suggests that there is poor reliability at the extremes of the scale score and that high neuroticism scores derived from the EPQ-R are a function of cumulative mid-range values. In a revised 7-item version of the scale, greater item-discrimination and reliability were found across the scale, suggesting that some items within the 12-item version are redundant. A further reduced 3-item version was investigated; although this scale possesses items of high discrimination and scalability, its range is very narrow and it lacks reliability beyond an average trait value.
A DIF analysis with gender as the grouping variable suggests the scale exhibits significant gender differential item functioning across all versions of the scale.
To our knowledge, this is the first study to conduct a comprehensive psychometric scale assessment applying IRT to the EPQ-R on such a large population. Furthermore, although the assumption values and parameter output of the 12-item IRT calibration were mostly acceptable according to established psychometric standards, an examination of individual items suggests that there were items of low discrimination and that the scale could benefit from revisions based on psychometric methodologies such as those presented here, as evidenced in the scale-revision analysis.
It is of fundamental importance that health measurement scales are reliable and valid measures of the construct of interest. Utilising psychometric methodologies to analyse psychosocial and health-related outcomes has important implications for assessing longitudinal change both in clinical settings and in epidemiological research. An IRT analysis provides item-level information and scale characteristics through the further computation of post-estimation assumptions, including the estimation of an individual θ latent metric predictive of individual θ scores on the fitted IRT model. This θ metric may then be used as a latent construct in assessing longitudinal change (28), which may be a more reliable measure than a single summated score (29). Furthermore, it is also suggested that using an IRT-derived θ in longitudinal studies, rather than the summated score, may be preferable for reducing overestimation of the repeated-measure variance and underestimation of the between-person variance (30).
A further advantage of utilising psychometric methodologies in an epidemiological context is that IRT extends the opportunity to utilise computer adaptive testing (CAT) for both scale development and efficient test delivery. During CAT administration, θ is automatically computed from the respondent's answers, and it is therefore not necessary to present the full range of items, as the response scale is adaptive to individual performance (trait level), the items underlying the trait and a stopping rule (31). The potential to reduce a scale so that only the most reliable and informative questions are presented to participants is essential in clinical settings and for epidemiological research. This is important to consider when working with individuals who are older or who have comorbid psychiatric disorders. Moreover, focused, reliable and user-friendly scales in a research setting increase user satisfaction, reduce participant burden and maintain long-term participant retention.
Participants who display or possess extreme trait characteristics are rare; however, scales should be able to measure them, and many are simply not adequately designed to do so (28). Moreover, previous research suggests that both the 12-item and 3-item EPQ-R neuroticism scales may have reduced power to discriminate between low and high scoring individuals (5); we found evidence of this in the 12-item scale. It is important in both clinical and research settings that scales are designed to measure across the trait spectrum, and this is possible if scales are developed using psychometric methodologies such as those described here and elsewhere (e.g., 32, 33).

Conclusions
The 12-item neuroticism EPQ-R scale lacks item reliability and neurotic trait-specific information at the extreme ends of the neurotic continuum when a 2-PL IRT model is estimated. A secondary analysis suggests that systematic item elimination and re-estimation of the 2-PL model produces a 7-item scale with higher levels of item information and reliability. This study suggests that the 12-item EPQ-R scale could benefit from item revisions and updating, including item deletions and validation of replacement items which take gender item bias into account. Strengths of this study were the large population cohort available for a comprehensive IRT analysis and the psychometric methodologies which were applied to the data.