Non-Human Recognition of Orthography: How is it implemented and how does it differ from Human orthographic processing


Script is a hallmark of Human cultural evolution and is unmatched in the animal kingdom (1). Reading of script is considered exclusive to Humans, requiring explicit instruction and specialized cognitive skills to retrieve word representations and meaning (2)(3)(4)(5). Biological foundations of reading ability have been described, for instance, in the finding of a visual word form area in the Human left-ventral occipito-temporal cortex (1) or of a genetic risk for dyslexia (6). The description of the biological basis of reading and of reading-like behavior in Baboons (7) and Pigeons (8) raises the question of the evolutionary roots of reading ability. Both species learned to distinguish remembered orthographic stimuli (i.e., stimuli similar to written words) from novel letter combinations in orthographic decision tasks. While this suggests, at the surface, similar skills in non-human animals and Humans, how the task is solved across species, i.e., which cognitive processes are involved, remains an open question. Note that computer vision models, with drastically different architectures, can also solve similar tasks (9)(10)(11)(12), indicating that there are many ways to solve orthographic processing tasks. In the current study, we investigate the cognitive basis of reading(-like) behavior (i.e., orthographic decisions) in Humans, Baboons, and Pigeons and test whether phylogenetic proximity leads to more human-like cognitive processes in primates than in birds, utilizing a transparent computational model (13) that allows one to differentiate the individual neuro-cognitive representations underlying orthographic decisions.
Humans typically learn to read through grapheme-phoneme conversion only once robust phonological and semantic representations are already in place (14)(15)(16), thereby aligning their reading rate to that of speech in fluent readers (17).

Significance Statement
Imagine being able to read without ever learning the alphabet. Research has shown that baboons and pigeons can exhibit reading-like behavior, suggesting shared processes across the species involved. To increase our understanding of the similarities and differences between humans and animals in reading-like behavior, we use a computational model to uncover the underlying processes that enable humans, baboons, and pigeons to perform these tasks. We found that humans and baboons rely on similar processes, focusing on information related to letters and letter sequences. In contrast, pigeons rely more heavily on visual cues. This discovery sheds light on the evolution of processes underlying reading and reading-like behavior, indicating that the lower the evolutionary distance between species, the more similar the involved processes are.

On the neuronal level, efficient reading is characterized by the implementation of a lexical categorization process (in the left-ventral occipito-temporal cortex, Ref. (18,19); in proximity to the so-called visual word form area; Ref. (1)) that differentiates between word and non-word letter strings based on orthographic word characteristics (20). Behavioral tasks based on the distinction between words and non-words, like the lexical decision task, are easily solved by Humans as they can rely on a large body of orthographic knowledge and a lexicon of rich phonological and lexico-semantic representations (2)(3)(4)(5). Lexical decision tasks are standard repertoire for studying reading and visual word recognition (e.g., Ref. (21,22)). Hence, similar orthographic decision tasks (i.e., without the involvement of explicit semantic processing) have been implemented in Baboons (7), Pigeons (8), Humans (23), and computer vision models (9,10,12).
Here, we investigate the representations underlying orthographic decisions in Humans, Baboons, and Pigeons. This phenotyping approach (i) describes which neuro-cognitive operations are involved in making orthographic decisions, (ii) allows a comparison of the types of representation implemented across species, and (iii) across individuals (see Ref. (24)(25)(26) for examples of using computational models for cognitive or neuro-cognitive phenotyping). We implement this based on a transparent computational model, the Speechless Reader Model (13), that allows us to infer the representations used in orthographic decisions on the level of vision and orthography, as opposed to rather non-transparent computer vision models (9,10), in which the high model complexity only allows initial, but highly valuable, post-hoc investigations of the internal workings of the model (e.g., see Ref. (10)). Also, the Speechless Reader Model outperforms computer vision models when modeling the learning curve of Baboons (13). Hence, comparisons between species that vary in phylogenetic proximity allow us to investigate the evolutionary roots of reading-like behavior based on neuro-cognitive phenotypes.

The Speechless Reader Model (SLR)
The SLR is based on neuro-cognitive assumptions of visual and orthographic processing in Human visual word recognition (13,18,27) and the principles of predictive coding (28)(29)(30). The implementation without phonological and semantic representations serves the purpose of modeling animal orthographic decisions, as animals cannot access this information. For Humans, the SLR allows us to estimate how well orthographic decisions can be described based on visual-orthographic information alone. Therefore, the SLR enables us to assess the influence of phonological and semantic knowledge even when it is not needed for the task.
The core of the model comprises three types of prediction error representations, which integrate the sensory input (i.e., the visually presented word or non-word) with the knowledge stored in the lexicon of each participant. For example, the orthographic prediction error (oPE; Ref. (12,27)) integrates word knowledge and sensory input on the pixel level. Interestingly, the oPE, although implemented based on purely visual pixel-level information, is similar to orthographic word characteristics that describe the letter combinatorics of a word, like the Orthographic Levenshtein distance (Ref. (20); correlations range across languages from .25 to .4; Ref. (13,27)). Thus, this simple pixel-level operation produces more than a pure vision-related word characteristic, as more abstract information emerges.
In the current version of the SLR, the prediction error representations operate on three distinct levels (see Fig. 1): pixel (orthographic prediction error: oPE; Ref. (12,27)), letter (LPE), and letter-sequence (sPE; Ref. (13)). In other words, for an input letter string such as WIND, predictions from the lexicon convey the probability of a pixel, letter, or letter sequence at a specific position based on the previously learned letter strings in the lexicon (e.g., for the LPE: how probable is the letter "W" at the initial position given the stored items in the lexicon, similar to a conditional probability of the letter at a specific position given a lexicon). In the full model, prediction error values are calculated for all three levels and summed up to generate a word-likeness estimate for each encountered letter string. This estimate then forms the basis for the orthographic decision that determines the model output: is a letter string learned or unknown (see Eq. 1, 2, 3, 4, 5, and 6).
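To make this concrete, the following minimal Python sketch illustrates the letter-level computation (an LPE analogue of Eq. 1 and 2) on a toy lexicon. The function and variable names are ours and the lexicon is illustrative; this is not the published implementation.

```python
import numpy as np

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def letter_predictions(lexicon, word_len=4):
    """Eq. 1 analogue on the letter level: for each position, the relative
    frequency of each letter across the stored lexicon items."""
    pred = np.zeros((word_len, len(LETTERS)))
    for item in lexicon:
        for pos, ch in enumerate(item):
            pred[pos, LETTERS.index(ch)] += 1
    return pred / len(lexicon)

def letter_prediction_error(string, pred):
    """Eq. 2 analogue: sensory input is coded as 1 (maximally surprising),
    reduced by the prediction for the presented letter at each position."""
    errors = [1.0 - pred[pos, LETTERS.index(ch)] for pos, ch in enumerate(string)]
    return sum(errors)  # summed into one LPE value per letter string

lexicon = ["WIND", "WINE", "MIND", "FIND"]
pred = letter_predictions(lexicon)
print(letter_prediction_error("WIND", pred))   # low error: well predicted by the lexicon
print(letter_prediction_error("XQZP", pred))   # high error: letters never seen at these positions
```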
Here, we use the SLR to model three datasets: (i) Human data from a pseudoword learning task, in which participants learned letter strings without explicit semantic associations (23,31), (ii) Baboon data (7), and (iii) Pigeon data (8). We created an individualized lexicon comprising the letter strings learned successfully by each Human (N = 37), Baboon (N = 6), and Pigeon (N = 4). We then generated model simulations for each participant using each prediction error representation and each possible combination of these representations (resulting in N = 7 model variants; e.g., combining the oPE and the LPE only in one variant or using only the sPE in another variant). In addition, we used individual categorization thresholds (cf. blue line in Fig. 1 and Eq. 6). Then, we compared the model and participant performance for each participant (see Appendix Fig. S1-3), and the simplest model with the lowest mean squared error was considered the best-fitting model. For the interpretation, we focus on the inner workings of the models that show the highest similarity to the behavior of each participant, in other words, the neuro-cognitive phenotype for orthographic decisions.

Fig. 2. Model fit and neuro-cognitive phenotypes for decision behavior for each participant by species. (a,d,g) Correlation of behavioral accuracy (x) and model accuracy (y) of each participant for learned (green) and novel (yellow) stimuli. Mean model accuracy was extracted from the models that most adequately described the performance on an individual level. Lines represent linear regression lines with confidence intervals. (b,e,h) Behavioral results (mean accuracy per individual; dots and violin, separated for learned and novel words), including group mean (black dot) and the mean model simulation results of the individual winning models (black X) with bootstrapped 95% confidence intervals. Note that only the data on learned letter strings was available for Pigeons. (c,f,i) Proportion of model variants that best described Human, Baboon, and Pigeon orthographic decision behavior.

Results
Mean orthographic decision accuracy (see Figure 2) showed that Humans performed best, with an average accuracy higher than 90%, whereas Baboon accuracy at 75% correct and Pigeon accuracy at 70% correct were significantly lower (i.e., no overlap in confidence intervals). Although the model simulations mirror this general pattern of results, the confidence intervals of the means of behavior (dot) and simulations (x) overlap for Baboons and Pigeons but not for Humans (see Figure 2). Thus, on average, the best-fitting Speechless Reader models performed at a similar accuracy as Baboons and Pigeons but were less accurate than Humans. This finding indicates that the description of Human orthographic decision performance is potentially influenced by knowledge that is currently not implemented in the model. Nevertheless, our model simulations were highly correlated with behavior across all three species (see Table 1 and Figure 2a,d,g), indicating that the selected models adequately represent the behavior of the individual participants. The pie charts in Figure 2 illustrate which model variants (including different prediction error representations) showed the highest fit to the individual Human, Baboon, and Pigeon data, respectively. We found that for Humans, the best-fitting model, in all except one participant, included letter-sequence-level representations: for 86% of participants, the best-fitting model exclusively implemented the letter-sequence-level representation (sPE). For 8%, the best-fitting model combined letter-sequence-level and letter-level representations (L + sPE), and for 3%, the whole model with all three representations was most adequate (o + L + sPE). The behavior of one participant (3%) was most adequately represented by a model without the letter-sequence-level representation, i.e., with only the letter-level representation (LPE). Interestingly, the same model variants (i.e., including the same set of combinations of representations) exhibited the best fit for the Baboon data, albeit with different proportions. The behavior of only one of the six Baboons was best described by the sPE model, which was most common in Humans. Half of the Baboons (50%; N = 3) required the implementation of a combination of letter-sequence-level and letter-level representations (L + sPE), the second most common type of representation found in Humans. For Pigeons, the set of best-fitting models was different, with no overlap in the combinations of representations with the other species. All adequate models included pixel-level representations (oPE), in combination with letter-level representations (o + LPE) and letter-sequence-level representations (o + sPE) in one Pigeon each.
To measure the stability of the phenotyping, we estimated the split-half reliability (Table 2, upper section). We based the analysis on the agreement of the representations from the individual models fitted to one half of the data with the models fitted to the other half. We implemented random splitting and repeated the procedure one hundred times. Overall, reliability was high (agreement > 60% for 7 out of 9 representations), with the results for the letter and letter-sequence representations in Pigeons showing the lowest reliability. To assess if the estimation of the phenotypes relies heavily on the best-fitting model, we re-evaluated the three representations based on the mean squared error from all implemented models (see Table 2, lower section). The lowest mean squared error emerged when models for Humans included the letter-sequence-level representations. For Baboons, we find letter-sequence-level and letter-level representations showing the lowest errors. For Pigeons, the error is lowest for the models including the pixel-level prediction error. These analyses substantiate the main findings from the best-fitting models.
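As an illustration of this reliability procedure, the sketch below repeatedly splits a set of trials at random and counts how often the winning set of representations agrees between the two halves. The selection function is a hypothetical stand-in for the full fitting procedure, and the trial data are toy values.

```python
import numpy as np

def split_half_agreement(trials, select_representations, n_repeats=100, seed=0):
    """Randomly split the trials into two halves n_repeats times and count how
    often the winning set of prediction error representations agrees between the
    halves. select_representations stands in for the full selection procedure
    (fit all variants to the half, pick the lowest-MSE one)."""
    rng = np.random.default_rng(seed)
    agreement = 0
    for _ in range(n_repeats):
        order = rng.permutation(len(trials))
        half_a = [trials[i] for i in order[: len(trials) // 2]]
        half_b = [trials[i] for i in order[len(trials) // 2 :]]
        agreement += select_representations(half_a) == select_representations(half_b)
    return agreement / n_repeats

# Toy stand-in: pretend the winning variant depends on the majority response in a half.
toy_trials = [0, 1, 1, 0, 1, 0, 1, 1]
toy_select = lambda half: frozenset({"sPE"}) if sum(half) > len(half) / 2 else frozenset({"LPE"})
print(split_half_agreement(toy_trials, toy_select))
```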

Discussion
In an orthographic decision task (i.e., categorization of letter strings as learned or novel), we examined the neuro-cognitive representations in Humans, Baboons, and Pigeons. We could infer and compare potential representations implemented in orthographic decisions across species based on a transparent computational model, the Speechless Reader Model (13). Although previous investigations comparing orthographic decision behavior focused on the high similarity of behavioral performance across species (7,8), here we used the opportunity to investigate the differences between species and individual participants using neuro-cognitive phenotypes. Since we integrate results from Humans, Baboons, and Pigeons, we can compare the orthographic representations of species separated by 324 million years of independent evolution (32). Overall, we found evidence that all species use, at times, pixel-, letter-, and letter-sequence-level representations. Yet, the contributions of those representational levels to task performance varied between species. Humans predominantly relied on the implementation of letter-sequence-level representations, suggesting that they extract relevant information from letter strings based on combinations of at least two letters. To some extent, this is also true for Baboons. However, they used letter-level representations to a larger extent than Humans. We found only one Baboon that implemented a letter-sequence-level representation only, which was predominant in Humans (i.e., found in 86%). In contrast, pixel-level representations were dominant for Pigeons (i.e., used by all Pigeons). These findings indicate that all three species solved the task but employed partly different cognitive strategies to reach orthographic decisions. Humans, Baboons, and Pigeons can thus be ordered along a continuum of how they successfully integrate visual-orthographic information, moving from letter-sequence- via letter- to pixel-level representations. The perceptual demands of the analyzed orthographic decision studies are of significant importance, as they mirror the demands of local and global visual processing. Comparative studies involving Humans, non-human Primates, and Pigeons have revealed that perceptual specializations for global or local stimulus features impact various cognitive tasks (33)(34)(35)(36). The striking similarity of our findings to those on global/local processing suggests a potential overlap in the visuo-cognitive processes required for visual-orthographic integration. Humans have a strong propensity to group small elements into global configurations. Thus, they preferentially process a large letter instead of its constituent small letters (37), easily see the global shape in partially deleted line drawings (38), integrate distant patterns into illusory shapes (39), and exploit perceptual principles like similarity and proximity to see a global form (40). When using the same tasks to compare Baboons or other Monkeys with Humans, Baboons respond more slowly to global than to local targets (41), while Macaques selectively attend to fine details of compound stimuli (42). Detailed analyses of these and similar results make it likely that these differences occur at the attentional level, where perceptual grouping operations seem more attention-demanding for Monkeys, while they come easily to Humans (41). In addition to attentional differences, early visual cortical areas in Humans may be especially adept at encoding and integrating global visual information (43).
Pigeons are well-known for showing a prominent local precedence in various tasks (44)(45)(46). These findings are very likely due to their ecology as granivorous birds that have to discriminate small grains by minute featural differences against a cluttered background with their frontal visual field. The frontal field of Pigeons is represented in the tectofugal pathway, corresponding to the mammalian extrageniculocortical system (47,48). This system is far less prone to utilize global visual information than areas of the primate ventral cortical stream (36,49). Thus, the neural adaptation to a granivorous lifestyle drives local precedence in Pigeons and could be responsible for their preference for pixel-level orthographic processing.
The differential specialization within a global-to-local processing continuum is not fixed in any of the three species we discuss. All of them have the remarkable ability to alter their search strategy in task- or training-dependent manners (50)(51)(52). As we have observed in our analysis, this adaptability is a fascinating aspect of their cognitive strategies. While, on average, Humans, Baboons, and Pigeons are positioned on a continuum of letter-sequence-, letter-, and pixel-level representation, each species shows individual differences and is, in principle, capable of utilizing all three strategies.
The finding of a substantial reliance on letter-sequence representations in Human orthographic decisions aligns with the general notion that efficient readers implement large orthographic elements for word recognition and reading (i.e., letter combinations or words; Ref. (2,3,53)). In literacy acquisition, increased reading competence strongly decreases the influence of the number of letters on reading time (i.e., longer reading times for longer words; e.g., Ref. (15,54,55)) and increases the perceptual span (5,56,57). Still, when recognizing unknown letter strings (e.g., pseudowords), readers rely again on letter units (58,59). The findings of this study show that after training, Human readers implement orthographic decisions with non-words based on larger units combining multiple letters. However, the implemented model variants fall short of simulating Human behavior as accurately as the Baboon and Pigeon behavior.
A potential explanation is that Humans may rely on additional or different representations that we still need to implement in the model. Likely candidate representations could easily be added to the model (when respecting the nature of the representations as prediction errors) and implemented on the orthographic but also the lexico-semantic and phonological processing levels (e.g., phoneme-level prediction error representations). Here, Human participants were efficient readers with a working reading system in place. However, we only modeled data from a pseudoword learning task (i.e., unknown letter strings; Ref. (23)), so it is likely that phonology played a role, given compelling evidence for its influence on visual word recognition (e.g., when learning to read; Ref. (15,60); in adult readers; Ref. (18,(61)(62)(63); when reading problems occur; Ref. (14,64)). Another aspect is that, in contrast to more classical measures of orthography, the sequence-based prediction error implemented here respects the exact sequence, always starting with the first letter of the string. We chose this representation as the first hypothesis on the sequence level, motivated by the finding of a word length effect for pseudowords in adult reading (58,59). However, the letter-sequence-level implementation is only one of many potential representations that can characterize the orthographic structure of letter strings (e.g., representations based on bi- or tri-grams could have higher flexibility to model transposed-letter effects; e.g., see Ref. (20,53)). Future work on integrating further representations in the model will allow a systematic comparison of representations used by Humans and, simultaneously, offer new possibilities to describe differences between individuals and species in orthographic processing and beyond.

Conclusion.
Our analyses reveal that the ability of the three species to differentiate between words and non-words orthographically is based on partly different visuo-cognitive strategies. While Human readers mainly relied on letter-sequence representations, Baboons utilized letter- and letter-sequence-level strategies, and Pigeons heavily relied on pixel-level representations. These findings imply that pure "success testing" in comparative cognitive studies obscures the more profound differences in species-specific cognitive strategies. Instead, a "signature testing" approach is needed that reconstructs how a task is solved (65). Such an approach can uncover species-specific visuo-cognitive strategies and reveal the potential neuroevolutionary adaptations that drive these cognitive strategies.

Materials and Methods

Behavioral data. The data used for model simulations stems from Eisenhauer et al. (23) (Humans), Grainger et al. (7) (Baboons), and Scarf et al. (8) (Pigeons). Eisenhauer et al. (23) tested 37 Human participants on a pseudoword lexical decision task of 880 trials, including eight presentations of 60 to-be-learned pseudowords (i.e., two presentations in each of the four sessions). Grainger et al. (7) used operant conditioning setups to train six Baboons (Papio papio) in a word/non-word categorization task. Scarf et al. (8) trained four Pigeons (Columba livia) on a similar task. The limiting factor in this comparison is the Pigeon data, of which we could only obtain the responses to the learned letter strings from the last part of the experiment, while earlier responses and the responses to non-words were missing. To make the data comparable across species, we limited our primary analysis to the final 10,000 trials of each Baboon dataset and the last two sessions of the Human dataset, amounting to 440 trials per participant. Furthermore, the limitation to the learned items led to an adaptation of the lexical categorization process of the model (18), allowing the model to be adapted to individual learner thresholds for each participant (see below). For further information on participants and stimulus materials, please refer to the original publications.
Model implementation. Instead of automatically estimating the categorization boundary (blue line in Fig. 1) based on the overlap of word-likeness estimates from words and non-words, we implemented the categorization process based only on the word-likeness distribution of the learned letter strings (also described in Ref. (18)). To find an adequate categorization threshold, we calculated the model responses for all possible model variants and multiple possible threshold values (range .5-1; see Eq. 6) for each participant (see Supporting Information: Table S1 and Fig. S1, S2, S3). To fit the optimal boundary, we compared the model simulations to the behavior of each participant. For all of our model simulations, we investigate the behavior after initial training (i.e., when the performance of the animals is relatively stable). For all datasets, the items in the lexicon are determined based on the performance in the trials used in the analysis (all trials after the first 33,041 trials, i.e., determined by the available data for both Baboons and Pigeons, and data from sessions 3 and 4 of the Human dataset). We use the previously reported criterion for learned items (Accuracy > .71) to determine the participant-specific implementation of the items in the individual lexicons of the Speechless Reader. When the lexicon items are known, we can estimate the top-down predictions (gray lines in Figure 1) on the level of pixels, letters, and letter sequences (see Eq. 1, described in Ref. (12,13,27)):

$$\mathrm{Prediction}_{i,j} = \frac{1}{n}\sum_{n} \mathrm{StoredItems}_{i,j} \qquad [1]$$

$$\mathrm{PE}_{s} = \mathrm{SensoryInput}_{i,j} - \mathrm{Prediction}_{i,j} \qquad [2]$$

$$\mathrm{oPE}_{s} = \begin{cases} 1 & \text{if } \left(\mathrm{SensoryInput}_{i,j} - \mathrm{Prediction}_{i,j}\right) > .5 \\ 0 & \text{otherwise} \end{cases} \qquad [3]$$

When estimated on the pixel level, the StoredItems matrix consists of gray values defining the image of each item stored in the lexicon (i,j = 140, 40 each). For letters, the StoredItems matrix consists of all letters (for four-letter words: i,j = 4, 26 each); and for the letter-sequence level, of all possible letter sequences (e.g., for four-letter words: Letters 1 and 2; Letters 1, 2, and 3; i,j = 2, x, where x is the number of all possible letter combinations for the two- and three-letter sequences). For each pixel, letter, and letter sequence, a mean over the number of items in the lexicon (n) is calculated to represent the prediction. Note that letter-based representations were informative, as for the animal studies, the letter frequencies ranged from 2.6 to 5.3% at any of the five positions. Also, explorations with inverted sequence prediction errors (i.e., starting the sequence with the last letter of a word) showed much less accurate results, indicating that animals used sequence representations in the reading direction.
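A minimal Python sketch of the pixel-level computation in Eq. 1-3 could look as follows, assuming letter strings are rendered as 40 x 140 grayscale arrays; the random toy images only stand in for the actual stimuli, and the names are ours.

```python
import numpy as np

def pixel_prediction(stored_items):
    """Eq. 1: mean over the n lexicon images (here hypothetical 40 x 140
    grayscale arrays) gives a pixel-wise prediction."""
    return np.mean(stored_items, axis=0)

def ope(sensory_input, prediction, threshold=0.5):
    """Eq. 2 and 3: pixel-wise prediction error, binarized at 0.5 and
    summed into one oPE value per letter string."""
    pe = sensory_input - prediction            # Eq. 2
    binary_pe = (pe > threshold).astype(int)   # Eq. 3
    return binary_pe.sum()

# Toy example with random "images" standing in for rendered letter strings.
rng = np.random.default_rng(0)
lexicon_images = rng.random((10, 40, 140))   # 10 learned items
new_image = rng.random((40, 140))            # a novel letter string
prediction = pixel_prediction(lexicon_images)
print(ope(new_image, prediction))
```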
Next, we integrate the predictions with the sensory input by subtraction (see Eq. 2). The SensoryInput matrix has the same structure as the prediction on each level, so the subtraction can be implemented for each letter or letter sequence. Note that we represent the sensory input as 1 for the letter and letter-sequence level, indicating the highest possible error when the input is unpredictable. This value is reduced by the prediction value for the letter at the given position (e.g., in English, there is a high probability of s being the last letter, so we expect a low prediction error at this position) or for the relevant letter sequence (i.e., th is likely in English for the initial letter sequence). We further developed the formulation for the pixel level (oPE). We conjectured that a binary prediction error would be most appropriate when modeling word/non-word categorization behavior (see Eq. 3 and Ref. (12)). We found that a threshold at 0.5 is well suited to implement the binary prediction error representation. After integrating prediction and sensory input to generate the prediction error, all matrix values are summed up to get one value summarizing the overall prediction error on each prediction error level for each letter string.
We calculated the prediction errors on all three levels for each learned and novel letter string available in the datasets. In preparation for the integration of multiple prediction errors to estimate word-likeness, we normalized each set of prediction errors to a value between 0 and 1 (PE_norm; see Eq. 4):

$$\mathrm{PE}_{norm} = \frac{\mathrm{PE}_{s} - \min(\mathrm{PE}_{s})}{\max(\mathrm{PE}_{s}) - \min(\mathrm{PE}_{s})} \qquad [4]$$

After that, we can accumulate multiple prediction error values by summation and division (see Eq. 5). The resulting prediction error estimate is normalized in all model variants that integrate more than one prediction error representation.
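The normalization and accumulation steps (Eq. 4 and 5) can be sketched as below. Treating the accumulated estimate as the average of the normalized level-wise prediction errors is an assumption of this illustration, and the numeric inputs are toy values.

```python
import numpy as np

def normalize_pe(pe_values):
    """Eq. 4: min-max normalization of a set of prediction errors to [0, 1]."""
    pe = np.asarray(pe_values, dtype=float)
    return (pe - pe.min()) / (pe.max() - pe.min())

def accumulate_pe(*normalized_pe_sets):
    """Eq. 5 analogue: sum the normalized prediction errors of the selected
    levels and divide by the number of levels to obtain one word-likeness
    estimate per letter string (the exact weighting is our assumption)."""
    return sum(normalized_pe_sets) / len(normalized_pe_sets)

ope_norm = normalize_pe([12, 30, 22, 5])        # e.g., pixel level, four strings
lpe_norm = normalize_pe([0.8, 2.1, 1.5, 0.2])   # e.g., letter level, same strings
print(accumulate_pe(ope_norm, lpe_norm))
```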
In the final step, we implement the lexical categorization process. Initially, we implemented the boundary at the highest lexical categorization difficulty (i.e., where the distributions of prediction-error-based word-likeness of learned and novel letter strings overlapped the most; see Ref. (18)). Since only the learned letter strings were available from the Pigeon dataset, we conceived of an alternative approach that allows us to implement lexical categorization without including novel letter strings. We achieved this by considering multiple thresholds applied to the prediction errors of the learned letter strings (see Eq. 6). Since we could not automatically calculate the optimal threshold, which would have been the case in the original formulation (18), multiple thresholds in the range from 0.5 to 0.95 were applied to all model variants to achieve a binary value, which indicates whether a letter string is learned or not (see the blue line in Fig. 1):

$$\mathrm{Learned}(\mathrm{PE}) = \begin{cases} 1 & \text{if } \mathrm{PE}_{accumulated} > \mathrm{Threshold} \\ 0 & \text{otherwise} \end{cases}, \qquad \mathrm{Threshold} = .5\text{--}.95 \qquad [6]$$

Interestingly, this alternative approach resulted in better model fits than the original implementation used for Baboons in the past (cp. Ref. (13)). Thus, for the present analysis, we focused on the approach that can be implemented without assuming that non-words must be stored.
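A sketch of the threshold sweep follows, applying Eq. 6 as given in the text to a set of accumulated prediction error values and keeping the threshold with the lowest mean squared error against the observed responses; all inputs are toy values and the function names are ours.

```python
import numpy as np

def categorize(pe_accumulated, threshold):
    """Eq. 6 as given in the text: a letter string is flagged as learned (1)
    when its accumulated prediction-error-based estimate exceeds the threshold."""
    return (np.asarray(pe_accumulated) > threshold).astype(int)

def sweep_thresholds(pe_accumulated, behavior, thresholds=np.arange(0.5, 0.96, 0.05)):
    """Try multiple thresholds (range .5-.95) and keep the one whose binary model
    responses deviate least (mean squared error) from the observed responses."""
    best = None
    for t in thresholds:
        mse = np.mean((categorize(pe_accumulated, t) - np.asarray(behavior)) ** 2)
        if best is None or mse < best[1]:
            best = (t, mse)
    return best  # (best threshold, its MSE)

# Toy example: accumulated estimates for six strings and observed responses (1 = "learned").
print(sweep_thresholds([0.9, 0.8, 0.95, 0.4, 0.55, 0.7], [1, 1, 1, 0, 0, 0]))
```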
With this model, we simulated the behavior of each participant with each model variant (i.e., all prediction error combinations with all possible thresholds; N = 7 variants). To determine the most appropriate model, we compared the model and participant behavior for all trials based on the mean squared error (see Fig. S1-3a; see also mean model accuracy in Fig. S1-3b). We selected the threshold value that resulted in the lowest absolute difference between the data and the simulations from all model variants. (Note that a different lexicon is assumed for each participant, which is why simulations differed between participants.) The model variant with the lowest mean squared error was considered to include the prediction error representations most likely implemented to achieve the orthographic decision performance of the individual participant (see Figure 2c,f,i and Table S1). The model was implemented in Python; scripts and data will be made available when accepted for publication.

Data analysis. We tested simulation quality with linear regression models implemented in GNU R, investigating the similarity of the individual differences between model simulation and participant behavior. In addition, to investigate the group-level similarity, we provide descriptive statistics, including bootstrapped 95% confidence intervals. To examine the stability of the implemented prediction error representations, we ran one hundred split-half reliability estimations (e.g., see Ref. (26) for the importance of reliability estimates in the context of computational phenotyping), estimating the overlap between the assumed representations of the models fitted to either of the two halves. This was done for each species and each model variant.
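For illustration, the sketch below enumerates the seven model variants (all non-empty combinations of oPE, LPE, and sPE) and selects the winning variant by mean squared error; the simulated responses are made-up placeholders, not actual model output.

```python
from itertools import combinations

import numpy as np

LEVELS = ["oPE", "LPE", "sPE"]
# All non-empty combinations of the three prediction error levels: 7 model variants.
VARIANTS = [c for r in range(1, 4) for c in combinations(LEVELS, r)]

def select_winning_variant(variant_simulations, behavior):
    """Pick the variant whose simulated binary responses deviate least (mean
    squared error) from the participant's responses; variant_simulations maps a
    variant tuple to its responses (hypothetical inputs for this sketch)."""
    behavior = np.asarray(behavior)
    errors = {v: np.mean((np.asarray(sim) - behavior) ** 2)
              for v, sim in variant_simulations.items()}
    return min(errors, key=errors.get), errors

# Toy example with made-up simulations for two of the seven variants.
sims = {("sPE",): [1, 1, 0, 0], ("oPE", "LPE"): [1, 0, 0, 0]}
print(len(VARIANTS))  # 7
print(select_winning_variant(sims, behavior=[1, 1, 0, 0]))
```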

Fig. 1. Speechless Reader Model, including model input and three types of prediction error representations based on: visual-orthographic representations (oPE; Ref. (12,27)), letter-position representations (LPE), and letter-sequence representations (sPE; Ref. (13)). Prediction errors are calculated based on the learned letter strings stored in the lexicon; the resulting representations are summed into a word-likeness estimate (18). The output of the categorization process is binary, indicating if the input is considered learned or new.

Fig. S1. Model comparisons and model accuracy for all models and all Humans. (A) Shows the mean squared error between each model (i.e., all combinations of representations and all tested thresholds) and Human behavior. (B) Model accuracy from all tested models.

Fig. S2. Model comparisons and model accuracy for all models and all Baboons. (A) Shows the mean squared error between each model (i.e., all combinations of representations and all tested thresholds) and Baboon behavior. (B) Model accuracy from all tested models.

Table 1. Linear regression models investigating the association of model simulations and participant behavior.
Note. We tested the interaction of stimulus conditions and model simulation and found no significant interaction in Humans and Baboons.