The Effect of the Glossary on the Reliability and Performance of the Reading the Mind in the Eyes Test (RMET)

The Reading the Mind in the Eyes Test (RMET) is used to measure high-level Theory of Mind. The RMET consists of images of the regions surrounding the eyes and a glossary of terms that defines words associated with the gazes depicted in the images. People must identify the meaning associated with each gaze and can consult the glossary as they respond. The results indicate that typically developing adults perform better than adults with neurodevelopmental disorders. However, the evidence regarding the validity and reliability of the test is contradictory. This study evaluated the effect of the glossary on the performance, internal consistency, and temporal stability of the test. A total of 89 undergraduate students were randomly assigned to three conditions. The first group used the original glossary (Ori-G). The second group developed a self-generated glossary of gazes (Self-G). Finally, the third group developed a glossary that did not define gazes, but unrelated words instead (No-G). The test was administered before and after participants drew a randomly assigned image as a secondary task. The findings show that the number of correct answers was similar among the three conditions before and after the secondary task. However, the Self-G and No-G groups took less time to finish the test. The type of glossary affected the consistency and stability of the test. In our case, the Self-G condition made the responses faster, more consistent, and more stable. The results are discussed in terms of levels of processing and the detection of mental states based on gazes.

Baron-Cohen and collaborators created the Reading the Mind in the Eyes Test (RMET) due to the need for an appropriate test to detect variability in ToM ability among adults with typical development [2,6]. The instrument consists of 36 photographs showing the gaze of men and women expressing a feeling or thought. Each photograph has four possible answers that appear on the screen. People must choose the most appropriate one.

The RMET is considered a more advanced test because it assesses the complex emotional aspects that arise in social interactions; in addition, the subject must evaluate another person's point of view based on one aspect of their face, the regions that surround the eyes [2]. However, since intentions must be interpreted from what facial expressions reflect, this test could be considered to measure emotion recognition rather than mental state or intentionality [7].

The RMET has been able to capture differences in ToM capacity between men and women, with women scoring higher [2,6,8-10]. The test has also been able to find differences between clinical populations and control groups [9,11-13]. However, the RMET does not discriminate well between individuals with average Theory of Mind skills and those with high skills, as most items have poor discrimination capacity [14].

In some studies, the RMET appears reliable in terms of temporal stability [10,15-17]. However, reliability reviews have obtained varied results regarding internal consistency. The average internal consistency of the studies reviewed is 0.64 ± 0.12.
Taking into account all the calculation methods used (Cronbach's Alpha, Split-Half, KR-20, Ordinal Cronbach's Alpha, Ordinal Omega, Maximal Weighted Internal Consistency Reliability for the Unidimensional Model), the values fluctuate between a minimum of .37 and a maximum of .77 [8,10,14,17-29]. Müller and Gmünder [24] point out that tests with dichotomous scores usually have lower alpha coefficients than those using Likert scales. However, this does not explain why there is such variability in the internal consistency reported by prior studies.

The aforementioned studies have also reported differences in average RMET scores. The variability in scores could be explained by verbal IQ level, which contributes significantly to performance variation on the test [30,31]. Considering the importance of the subjects' level of verbal processing on test variability, we believe that the use of the glossary of terms deserves to be studied. Experimental manipulations in which participants carried out in-depth processing have enabled better performance in terms of accuracy and speed, that is, better retrieval of learned information with less material forgotten [32,33]. We wondered whether experimental manipulation of this glossary leads to differences in people's performance and in the internal consistency and temporal stability of the RMET. Therefore, we proposed generating two processing conditions that affect the participant's performance and test reliability (stability and internal consistency). To that end, we introduced variations in the glossary as follows: in one condition, the participant had to generate a list of synonyms or definitions for the words of the original glossary with their own terms, collected on the Internet, as a Self-Generated Glossary (Self-G).
In the second condition, the participant did not have a glossary of words related to the gazes (No-G), although they generated a glossary of neutral words that referred to different meanings. Finally, in the third condition, participants had to read and learn the definitions of the original glossary (Ori-G). We expected people's performance to be better and more consistent when they had to craft their own glossary of terms than when they read the glossary of the original test (Ori-G). Finally, we hypothesized that the Self-G and Ori-G conditions […]

We used the version of the test adapted to the Spanish language by Serrano and Allegri [34]. This test contains 37 images (one for practice and thirty-six for evaluation), which only show the regions that surround the eyes. For each of the images, participants see four possible answer options, from which they must choose only one. Participants also have a glossary containing the words that appear next to the images, so that they can consult it as many times as they want if they have doubts regarding the meaning of the words.

The experiment was divided into four phases. The first phase varied according to each condition and was 25 minutes long. In the Ori-G condition, participants reviewed the list of words and definitions included in the RMET. In the Self-G condition, participants read the original list of words, but they had to create at least one synonym for each word and one sentence for each synonym. Both groups were allowed to revisit the glossary during the experiment. In the No-G condition, participants read a list of words unrelated to the terms in the test, and they had to create at least one synonym for each word and one sentence for each synonym. They were not allowed to revisit the glossary during the experiment.

In phase two, participants answered the Reading the Mind in the Eyes Test, which had been set up on a computer with E-Prime 3.0 software. This program displayed the instructions and the sample image. The program then randomly showed the 36 images that constitute the RMET, recording the response and the total time the person used to respond to each image. The instructions were as follows: "In the next task, on the computer screen, a series of images will appear which correspond to the regions that surround the eyes of different people. For each image, select the word that best describes what the person in the picture thinks or feels by pressing the corresponding number on the numeric keypad. It may seem to you that more than one word is applicable to an image, but please choose only one word, the word that you consider the most appropriate. Before making your choice, make sure you have read all 4 words. You should try to perform this task as quickly as possible." For the Ori-G and Self-G conditions, the following statement appeared: "If you really don't know the meaning of a word, you can look it up in the glossary." Thus, participants in the Ori-G and Self-G conditions could use the glossary of words during the test. For the No-G condition, the following statement appeared: "If you really don't know the meaning of a word, you should try to guess which one might be right."

In phase three, participants performed a secondary task, which consisted of drawing a randomly assigned image (car, train, boat, plane, or bicycle). Participants had three minutes to complete the drawing. Finally, in phase four, the Reading the Mind in the Eyes Test was administered again.

We calculated the internal consistency for the entire test and all three conditions using Cronbach's Alpha and the Kuder-Richardson 20 (KR-20) formula, which is a special case of Cronbach's Alpha for dichotomous items. To assess the temporal stability of the test, we calculated the Spearman-Brown coefficient. Additionally, we evaluated the consistency between the test and retest with the intraclass correlation coefficient using the average of fixed raters (ICC3k). The ICC is a measure that evaluates the reproducibility of repeated measurements in the same population. Scores equal to or greater than .60 are considered acceptable for clinical use [35]. In addition to the ICC, the distribution of score differences was analyzed with Bland-Altman plots.
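To make the internal-consistency statistics concrete, the following is an illustrative sketch on simulated data (not the study's analysis code; the sample sizes merely mirror the design of 89 participants and 36 dichotomous items):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance per item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def kr20(items):
    """Kuder-Richardson 20: the special case of alpha for 0/1 items,
    written in its classical form with item variances p * (1 - p)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    p = items.mean(axis=0)                     # proportion passing each item
    total_var = items.sum(axis=1).var(ddof=0)  # population variance of totals
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / total_var)

# Hypothetical data: 89 simulated participants x 36 dichotomous items,
# with a latent "ability" inducing positive inter-item correlation
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, size=(89, 1))
scores = (rng.normal(0, 1, size=(89, 36)) < ability).astype(int)
print(f"alpha = {cronbach_alpha(scores):.2f}, KR-20 = {kr20(scores):.2f}")
```

For dichotomous data the two coefficients are algebraically identical (the degrees-of-freedom factors cancel in the ratio), which is consistent with the paper's observation that KR-20 showed no change in trends relative to Cronbach's Alpha.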

To assess whether there were differences between the three conditions and between the test and the retest, we performed a mixed ANOVA with a 3x2 design on the averages of correct answers and the averages of total test time.

The internal consistency of the test was low for the three conditions (Cronbach's α < .70) and for the total test (Cronbach's α test = .29, Cronbach's α retest = .51), both in the test and in the retest. The test under the Self-G condition obtained the greatest internal consistency, while the test under the Ori-G condition obtained the lowest, in both the test and the retest. KR-20 was also calculated, but no changes in trends were observed compared to Cronbach's Alpha.

The correlation between the test and the retest using the Spearman-Brown coefficient was significant and moderate (.60). When evaluating by condition, we found significant and moderate correlations in the Ori-G (.62) and Self-G (.63) conditions. In the No-G condition, the test-retest correlation was not significant. When evaluating the consistency between test and retest responses with the intraclass correlation coefficient (95% Confidence Interval, CI), a moderate level of consistency was observed throughout.

Additionally, to explore the test-retest reliability, Bland-Altman plots were created for the raw data (Fig 1, Panel A) and for the data transformed to logarithms, because these scores did not meet a normal distribution (Fig 1, Panel B). The entire sample was used in the analysis (n = 89). The average differences were .57 (SD = 3.74) for the raw data and -1.79 (SD = .15) for the log-transformed data. Only two of the 89 points were outside the upper limit of the confidence interval (95%), while the rest were within its limits. High variability in score differences was observed, but the plot with the logarithmic transformation showed that the differences in scores tended to decrease as the average test scores increased.

This study compared three glossary conditions (Original, Self-Generated, and No Glossary). In our study, internal consistency was low, both in the test and in the retest, for all three conditions. These results were similar to those obtained in other studies [17,20,24,28]. In our literature review, we also found broad variability in internal consistency. Comparatively, our study presented the lowest Cronbach's α value of the studies reviewed. Also, even though internal consistency increased in the retest phase, it remained low, considering that a Cronbach's α equal to or greater than .70 is considered adequate [36].
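The test-retest statistics reported above (the Spearman-Brown formula, ICC(3,k) with the average of fixed raters, and the Bland-Altman limits of agreement) can be sketched as follows; the data in the examples are purely illustrative, not the study's:

```python
import numpy as np

def spearman_brown(r, factor=2):
    """Spearman-Brown prophecy: reliability of a test lengthened by `factor`
    (factor=2 gives the classic split-half correction)."""
    return factor * r / (1 + (factor - 1) * r)

def icc3k(ratings):
    """ICC(3,k): two-way mixed model, average of k fixed raters
    (here k = 2: test and retest), from the ANOVA mean squares."""
    y = np.asarray(ratings, dtype=float)        # shape (n_subjects, k)
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * ((y.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_cols = n * ((y.mean(axis=0) - grand) ** 2).sum()    # between occasions
    ss_err = ((y - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

def bland_altman_limits(test, retest):
    """Bias (mean difference) and 95% limits of agreement."""
    d = np.asarray(retest, dtype=float) - np.asarray(test, dtype=float)
    bias = d.mean()
    half_width = 1.96 * d.std(ddof=1)
    return bias, bias - half_width, bias + half_width
```

With perfectly reproducible scores the residual mean square is zero, so `icc3k([[20, 20], [25, 25], [30, 30]])` returns 1.0; `spearman_brown(0.5)` returns 2/3.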

When exploring the internal consistency in each condition, we found that the Ori-G condition had lower internal consistency than the other two conditions: the condition in which participants created their own glossary (Self-G) and even the condition in which they did not have a glossary at all (No-G). This trend was observed both the first and second time the test was administered. As shown in Table 1, the Self-G condition showed the greatest internal consistency.

Like other authors, we found that the test is reliable in terms of temporal stability [10,15-17]. In our study, the interval between the test and retest was three minutes. Temporal stability did not change between the test administered under the Ori-G condition and the Self-G condition. However, the temporal stability of the No-G condition was low.

In short, even though both the Self-G and Ori-G conditions had better levels of temporal stability than the No-G condition, the Self-G condition had better levels of internal consistency in both the test and the retest than the Ori-G condition. Therefore, the reliability of the Self-G condition was better than that of the other two conditions in this study.

Preti, Vellante, and Petretto [29] proposed that low reliability reports are due to calculations that violate some assumptions, such as continuity, and would therefore show inadequate results for a scale with dichotomous responses, such as the RMET. This statement has been refuted by Chalmers [37], who pointed out errors regarding the calculation of reliability and the use of the Ordinal Alpha suggested by Preti, Vellante, and Petretto [29]. Chalmers [37] stated that if the response stimuli are not ordinal, as on a Likert scale, then the Ordinal Alpha is likely inappropriate and should not be used. While tests with dichotomous response items usually have low reliability compared to Likert-type response tests [24], this would not explain the high variability in internal consistency observed with the RMET in various studies, ours among them.

As for the performance of the participants in all three conditions, our average scores were similar to those usually found in the literature. From a sample of twelve studies, our total scores were located between the minimum (M = 22.8) and maximum (M = 28.4) ranges, and close to the average (M = 26.5, SD = 1.8) [2,8,14,19,38-45]. In addition, our study found no significant differences in test and retest scores between the three experimental conditions.
Considering that the type of glossary, as well as its presence or absence, did not affect the total number of correct answers, it is apparent that neither the original glossary nor the self-generated glossary improved the performance of the participants.

However, participants took less time in the retest than in the test, which can be considered a measure of how well they learned to respond. When exploring the differences by experimental condition, we found that those who built their own glossary and those who did not have a glossary took significantly less time to respond than those in the original glossary condition, both in the test and the retest. Since the number of correct answers was similar across conditions, the original glossary appears to have offered no advantage, considering that participants who used it took longer to complete the test, yet did no better at responding than the other two conditions.

Our interpretation of the findings is that creating a Self-Generated Glossary leads to a deeper level of processing in the participants [32,33]. This deeper processing does not improve participants' accuracy, but it improves their speed and the test's consistency and stability. This processing gives coherence and stability to the meanings and allows them to be recalled faster. The subjects who were exposed to the original glossary were also able to carry out some level of processing, since they reviewed and read the words and meanings in the glossary. However, their level of processing was not very deep because the meanings were learned passively and were not self-generated.

Taking into account the overall results, we suggest the use of the self-generated glossary as an alternative to the original glossary or the absence of a glossary, because it achieved greater internal consistency, greater temporal stability, and a similar number of correct responses compared to the other two conditions. In addition, participants took less time to answer the RMET than those using the original glossary. The absence of a glossary did not affect the number of correct answers, and participants in this condition took the same amount of time to finish the test as those using a self-generated glossary. The problem with this alternative is that, while the responses achieved a better level of consistency than in the original glossary condition, this procedure had less temporal stability.

One question we have is why our manipulation did not affect the number of correct answers. It is assumed that the RMET measures the ability of participants to match a word with a certain gaze, regulated by meaning-processing mechanisms. However, we propose that the RMET measures a group of abilities that operate independently from each other, related to matching certain gazes with emotions and mental states that the person has learned in the course of his or her social development. Thus, our experimental manipulation, which focused on semantic processing, would affect the temporal stability and consistency of responses, but would not improve the ability to associate a word with a specific gaze.

We recommend interpreting our comparisons among experimental conditions with caution, as the reliability obtained was low. While in some cases the reliability of the test was close to satisfactory, the high variability found across studies remains to be explained. In this regard, we propose using our method of asking participants to build their own glossary. We also recommend exploring the temporal stability of the test with our method using longer time intervals.

We consider the results obtained in our study paradoxical: in a strict sense, the test was not consistent even though it was stable. Since the test itself is unreliable, it does not make much sense to evaluate its validity. The scores of our participants, which were quite close to those of adults with typical development, showed no difference among the three experimental conditions, nor between the two instances in which the RMET was applied. We still cannot specify what the test measures. However, we know, based on the levels of stability and consistency, that the RMET is sensitive to the experimental manipulations that we carried out.

Finally, we believe it is important to consider that the lack of significant differences in average scores between glossary methods may be because the instrument is not good at detecting differences between individuals with average Theory of Mind ability and a high level of cognitive functioning (Black, 2019).

The Reading the Mind in the Eyes Test has shown a great ability to differentiate between people with typical development and people with neurocognitive problems. However, some authors have questioned the validity of the test, pointing out that it does not necessarily measure a high ability to detect mental states, mainly because its internal consistency has shown broad variability. Our findings indicate that the test has low levels of internal consistency and that the glossary is a potential source of variability. In our case, the higher level of processing we generated experimentally under the Self-Generated Glossary condition did not improve the scores, but it made the responses faster, more consistent, and more stable.