A Computational Model of Normal and Impaired Lexical Decision: Graded Semantic Effects

Lexical decision is an important paradigm in studies of visual word recognition, yet the underlying mechanisms that support it are not well understood. While most models of visual word recognition focus on orthographic processing as the primary locus of lexical decision, a number of behavioural studies have suggested a flexible role for semantic processing regulated by the similarity of the nonword foil to real words. Here we developed a computational model that interactively combines visual-orthographic, phonological and semantic processing to perform lexical decisions. Importantly, the model was able to differentiate words from nonwords by dynamically integrating measures of polarity across the key processing layers. The model was more reliant on semantic information when nonword foils were pseudowords as opposed to consonant strings. Moreover, the model was able to capture a range of standard reading effects in lexical decision. Damage to the model also resulted in reading patterns observed in patients with pure alexia, phonological dyslexia, and semantic dementia, demonstrating for the first time that both normal and neurologically-impaired lexical decision can be addressed in a connectionist computational model of reading.


Lexical Decision
Lexical decision (LD) has been widely used to investigate the cognitive processes involved in visual word recognition. The task generally requires participants to make YES responses to words and NO responses to nonwords. Differences in measures of accuracy and response time are thought to illuminate the underlying processing of words. However, these measures can vary substantially depending on the lexicality of the nonword foils. For example, accurate and rapid responses can be observed when a word is tested in the context of consonant strings (i.e., orthographically illegal and unpronounceable, e.g., FJK), but these responses become slower and less accurate when the same word is tested in the context of pseudowords (i.e., orthographically legal and pronounceable, e.g., FET) or of pseudohomophones (i.e., orthographically legal and sounding like real words, e.g., FEA) (Evans, Lambon Ralph, & Woollams, 2012; James, 1975; Ratcliff, Gomez, & McKoon, 2004).

The interactive activation (IA) model could be considered as an orthographic lexicon in which each input word form has a corresponding word unit. Grainger and Jacobs's (1996) multiple read-out model (MROM) extended the IA model to perform the lexical decision task by adopting three activation criteria to determine the type of stimulus (i.e., word/nonword) and the speed of a response. A word response could be made either when the activation of the particular word unit reached a local criterion, M, or when the overall activity in the word layer reached a global criterion, Σ, before a temporal deadline, T. The RT was based on the earliest moment at which either criterion was met. Following the IA model, the resting-level activations of word units varied as a function of word frequency. In addition, Grainger and Jacobs (1996) assumed that the local criterion M should be fixed, while the global criterion Σ and the temporal deadline T would vary according to the lexical frequency status of the stimulus.
MROM was able to simulate several standard effects seen in lexical decision, including frequency effects, orthographic neighbourhood size effects, and their interactions (Grainger & Jacobs, 1996). Other models of visual word recognition, such as the dual-route cascaded (DRC) model (Coltheart et al., 2001) and the connectionist dual process (CDP+) model (Perry, Ziegler, & Zorzi, 2007), share similar decision mechanisms with the MROM model. However, one major problem with the MROM model and its variants is the use of the temporal deadline mechanism, which inevitably means that these models cannot generate variable nonword decision times when testing different types of nonword stimuli (Ratcliff, Gomez, & McKoon, 2004; Wagenmakers et al., 2008).
Note that, within this framework, some researchers (Blazely, Coltheart, & Casey, 2005; Borowsky & Besner, 1993) have proposed that the semantic system can be involved in lexical decision and is responsible for the effects of semantic priming and imageability in lexical decision (e.g., James, 1975; Meyer, Schvaneveldt, & Ruddy, 1975). The decision could be made either by monitoring the activation within the semantic system (Borowsky & Besner, 1993) or via feedback to the orthographic lexicon (Blazely, Coltheart, & Casey, 2005).
However, this proposal has yet to be implemented in any localist model, so it remains unclear what mechanisms would underpin decisions based on the semantic system and its interaction with the orthographic lexicon.

Models Based on Distributed Views
An alternative theory of visual word recognition argues that there is no mental lexicon storing word knowledge in the recognition system (Dilkina, McClelland, & Plaut, 2010; Plaut, 1997; Seidenberg & McClelland, 1989). On this view, the fundamental difference between words and nonwords lies in the nature of their underlying representations. The lexical decision can be made on the basis of the differential activations elicited by familiar words and unfamiliar nonwords. When a word is presented, strong activations are expected, because the mappings between the visual or orthographic representation of the word and its phonological and semantic representations have been learned. On the other hand, relatively weaker activations would be found for novel nonword representations. Both words and nonwords will activate visual, orthographic, phonological and semantic representations, and these different levels of representation interact with each other during lexical decision. A critical difference between this view and the lexical view is that semantic processing is considered important for lexical decision (Dilkina et al., 2010) in addition to orthographic and phonological processing (Plaut, 1997; Seidenberg & McClelland, 1989).
Within this framework, several computational models have been developed to simulate the processes of lexical decision (Dilkina et al., 2010; Harm & Seidenberg, 2004; Plaut, 1997; Seidenberg & McClelland, 1989). Plaut (1997) proposed that the measure of how strongly units were activated, called stress or polarity, could be used as a basis for making lexical decisions. Plaut developed a feedforward model, which consisted of orthographic, phonological and semantic components, and demonstrated that words tended to produce higher stress than nonwords at the semantic layer, allowing a discrimination rate of over 95%.
In addition, the network tended to produce higher semantic stress for pseudohomophones than for pseudowords. This might explain why pseudohomophones are more difficult to reject (Coltheart et al., 1977; Meyer et al., 1974; Milota et al., 1997; Patterson & Marcel, 1977; Rubenstein et al., 1971). Subsequently, Harm and Seidenberg (2004) developed a fully implemented reading model to explore how skilled readers generate the meanings of words from print by incorporating pre-existing knowledge of the interactive mappings between phonology and semantics with newly learned mappings from orthography to phonology and from orthography to semantics. Although lexical decision was not the main focus of their paper, they explored how the model could account for the pseudohomophone effect in lexical decision. They proposed that lexical decision could be made by determining the discrepancy between the input orthographic patterns and the orthographic patterns recreated from semantics. If the feedback information was consistent with the orthographic information, a word response was made; otherwise, a negative response was given. Dilkina, McClelland and Plaut (2010) also developed a connectionist single-system model of semantic and lexical processing to simulate lexical decision performance in SD patients. The model included four input/output layers: vision, orthography, phonology and action. There were also two hidden layers: an integrative semantic layer and an intermediate layer that facilitates the mapping between orthography and phonology. Lexical decisions were made in a similar way to Harm and Seidenberg (2004) by measuring the orthographic feedback, termed the orthographic echo, for the presented words and nonwords. It is worth noting that although this measure is recorded at the orthographic layer, the orthographic echo also depends on activations at the semantic layer.
Several researchers have challenged the distributed view of lexical decision by arguing that these models would not be able to account for some patients with semantic impairments who show normal lexical decision accuracy (Coltheart, 2004; Blazely, Coltheart, & Casey, 2005; Borowsky & Besner, 2006). That is, if the semantic system is important for lexical decision, as advocated by the distributed view, then lexical decision performance should be greatly affected when the system is damaged. To address this issue, Plaut and Booth (2006) developed a model that involved mappings from orthography to semantics and applied a range of damage to the model's semantic layer. The results showed that the model's lexical decision performance decreased only slightly with increasing semantic damage, whereas its semantic performance deteriorated substantially. As explained by Plaut and Booth, the model with semantic impairment was still able to distinguish words from nonwords because of the much stronger activations for words compared to nonwords, even though the activations of individual words were not easily distinguishable from each other.

Accumulated Information for Lexical Decision
Other models have emphasised the use of accumulated information for decision-making tasks (Busemeyer & Townsend, 1993; Norris, 2006, 2009; Ratcliff, Gomez, & McKoon, 2004; Usher & McClelland, 2001). In some accumulator models, decisions are made by diffusion processes that continuously accumulate information over time from a starting point toward response boundaries. This can be implemented with a single diffusion process, as in the diffusion model (Ratcliff et al., 2004), where positive evidence for one of the alternative responses counts as negative evidence for the other. Other accumulator models with two or more diffusion processes allow the evidence in an accumulator to decrease due to random noise (Busemeyer & Townsend, 1993) or inhibition between the processes (Usher & McClelland, 2001). Among these accumulator models, the diffusion model (Ratcliff et al., 2004) has been widely applied to account for behavioural data in lexical decision. To choose between words and nonwords, the speed of information accumulation in the diffusion model, called the drift rate, is determined by the lexical status of the stimulus: the more word-like the stimulus, the higher the drift rate. The diffusion model was able to account for a range of phenomena seen in lexical decision, including word frequency effects and the correct patterns of RT distributions for both words and various types of nonwords.
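To make the accumulation mechanism concrete, the following is a minimal sketch of a single-process diffusion trial; the function name, parameter values and noise settings are illustrative assumptions, not Ratcliff et al.'s (2004) fitted implementation.

```python
import random

def diffusion_trial(drift, boundary=1.0, noise=0.3, dt=0.01, start=0.0, rng=None):
    """One trial of a single diffusion process: evidence drifts from `start`
    toward +boundary ("word") or -boundary ("nonword"). A higher drift rate
    corresponds to a more word-like stimulus. All values are illustrative."""
    rng = rng or random.Random()
    x, t = start, 0.0
    while abs(x) < boundary:
        # Gaussian increment: mean = drift * dt, sd scales with sqrt(dt)
        x += drift * dt + rng.gauss(0.0, noise) * dt ** 0.5
        t += dt
    return ("word" if x >= boundary else "nonword"), t
```

With a strongly positive drift rate, most trials terminate at the word boundary, and the first-passage times form the simulated RT distribution.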
Another type of accumulator model, the Bayesian reader (Norris, 2006, 2009), is implemented on the basis of Bayesian probability. It assumes that readers continuously compute the probability of the stimulus being a word or a nonword as perceptual evidence accumulates. The word likelihood is the sum of the probabilities of all possible letter strings, whereas the nonword likelihood is simply one minus the word probability (Norris, 2009). The Bayesian reader performed reasonably well in lexical decision and was able to simulate the effects of word frequency and of the nature of the nonwords reported in Ratcliff et al.'s (2004) experiments. However, the model was unable to simulate the distribution of RTs for incorrect lexical decisions. It is also unclear how the model would perform when tested against nonwords with high orthographic neighbourhood sizes.

Summary
Evidence from behavioural, neuroimaging and patient studies suggests that orthographic, phonological and lexical-semantic processing are all involved in lexical decision, albeit to different extents depending on the properties of the word stimuli and the foil types. Human readers are able to flexibly use the available information from the recognition processes to support efficient lexical decision. Readers may rely on orthographic information if it is reliable and sufficient for decisions; otherwise, they may need to utilise phonological and semantic information (Plaut, 1997; Seidenberg & McClelland, 1989). While previous models of lexical decision have demonstrated how decisions can be made based on output generated from a single processing component, no model has yet shown how information from all the processing components of word recognition can be flexibly combined to support lexical decisions. We aimed to develop a fully implemented computational model of visual word recognition that interactively combines visual, orthographic, phonological and semantic processing based on the parallel distributed processing (PDP) framework (Chang, Furber, & Welbourne, 2012a; Welbourne & Lambon Ralph, 2007; Plaut, McClelland, Seidenberg, & Patterson, 1996; Seidenberg & McClelland, 1989) to account for a range of standard effects in visual lexical decision tasks, including frequency, consistency, orthographic neighbourhood size, and word length (number of letters) (Andrews, 1982; Balota et al., 2004; Cortese & Khanna, 2007; Seidenberg et al., 1984; Waters & Seidenberg, 1985). This is the first time that these ideas have been implemented computationally. In particular, we sought to use the model to investigate the graded semantic effects in the contexts of different foil types reported by Evans et al. (2012).
We also tested whether damage to the visual-orthographic, phonological and semantic layers of the model could produce impaired lexical decision performance similar to that observed in patients with the corresponding functional impairments: pure alexia, phonological dyslexia and semantic dementia.
Two simulations were run to investigate the model's performance in the following areas: (1) normal lexical decision, including effects of frequency, consistency, foil type, imageability, word length and orthographic neighbourhood size; (2) impaired lexical decision, including a comparison of how damage to the different processing areas maps onto the lexical decision performance of different patient groups. An additional simulation, reported in Appendix A, was conducted to investigate the role of control units in regulating the temporal dynamics of the model. All of the simulations used the same network architecture (Figure 1), which is fully described in Simulation 1.

Simulation 1
Simulation 1 was designed to develop a recurrent model of word reading based on the general triangle framework (Harm & Seidenberg, 2004; Plaut et al., 1996; Seidenberg & McClelland, 1989; Welbourne & Lambon Ralph, 2007; Welbourne, Woollams, Crisp, & Lambon Ralph, 2011), but crucially including a visual level of representation similar to that used by Chang et al. (2012a). In that study, the addition of a visual layer of processing allowed the model to capture effects related to word length (measured by the number of letters), which have previously been problematic for this type of model to address, in particular the length by lexicality effect (Weekes, 1997). However, that model did not include an interactive semantic processing layer; this is the first time that a full triangle model including visual, orthographic, phonological and semantic processing layers has been reported.

Network architecture
The model was a continuous recurrent network. The architecture of the model is shown in Figure 1. The model had two separate pathways for processing words from visual input: a phonological pathway and a semantic pathway. The word stimuli were presented to the visual layer of the model, which connects to the OH layer via a set of 80 hidden units. The OH layer used here is equivalent to the orthographic layer in the triangle model, except that the orthographic representations were learned through the course of training (as in human development), rather than being supplied as inputs. It connects to both phonological and semantic units via two more sets of hidden units. Both the phonological and semantic units were connected to their own set of clean-up units, which allowed for the development of phonological and semantic attractors in the model. In addition, there were three context units, which were used to provide contextual information for disambiguating homophones: while there is no way to distinguish the meanings of homophones in single word reading, during natural reading the context will almost always give sufficient cues. Phonological and semantic units were connected to each other via two hidden unit layers. All output layers in the model were given a fixed negative bias of -2 to encourage sparse representations. Perhaps the most unusual element of the architecture is the control units associated with each layer except the input and output layers. These units receive the same inputs as the layer they are connected to, and all their outgoing connections are inhibitory, allowing them to control the activation of all the units in a layer simultaneously. These units turn out to be critically important for training the present recurrent network, as they help to regularise the time dynamics in the network by preventing activations in the deeper layers from causing unwanted disturbance to processing in the earlier layers. The details are illustrated in Appendix A.
In the current model, the activation values of all units were updated concurrently and ramped up or down gradually over time. The dynamics of the network were implemented using an output integrator equation (Pineda, 1987), which can be written as:

dy_i/dt = -y_i + σ(x_i + b_i)    (1)

where y_i is the current output of unit i, x_i = Σ_j w_ij y_j is the summed output contribution from the other units j, σ is a logistic function, and b_i is a constant representing an external input bias. For computational simulations, continuous time is generally approximated using discrete time steps, in which the processing time is sampled and broken into a number of finite ticks. In the current reading simulation, the network was run for ten intervals of time, with three ticks per interval used to approximate continuous time in the network.
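The discrete-tick approximation of these dynamics can be sketched as a simple Euler update; the function and variable names are illustrative, and the network here is a toy fully recurrent layer rather than the full model.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def integrate_outputs(y, weights, bias, intervals=10, ticks_per_interval=3):
    """Euler approximation of dy_i/dt = -y_i + logistic(x_i + b_i), where
    x_i is the summed output contribution from the other units. Each time
    interval is broken into `ticks_per_interval` discrete ticks."""
    dt = 1.0 / ticks_per_interval
    n = len(y)
    y = list(y)
    for _ in range(intervals * ticks_per_interval):
        net = [sum(weights[i][j] * y[j] for j in range(n)) for i in range(n)]
        y = [y[i] + dt * (-y[i] + logistic(net[i] + bias[i])) for i in range(n)]
    return y
```

With zero weights and zero bias, each unit's output ramps gradually toward logistic(0) = 0.5 over the thirty ticks, illustrating how activations rise and fall smoothly rather than switching instantaneously.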

Visual representations
The training corpus consisted of 2,971 monosyllabic words. The visual representations used here were adapted from those used in Chang et al.'s (2012a) study. The network was directly fed with bitmap images of words in Arial 12-point lower-case font, represented in white against a black background. Each word was positioned with its vowel aligned on the central slot of the image (see Footnote 1). For words with a second vowel (e.g. boat), the second vowel was placed immediately to the right of the first vowel. Ten 16x16-pixel slots were used, giving 2,560 visual units.
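The slot-based, vowel-centred alignment can be sketched as follows; this is a symbolic stand-in under our own naming assumptions (the real model renders each slot as a 16x16 bitmap of the letter, not a character symbol).

```python
def align_word(word, n_slots=10, vowels="aeiou"):
    """Place a word's letters into slots with its first vowel on the
    central slot (here taken as n_slots // 2); any subsequent letters,
    including a second vowel, simply follow to the right."""
    first_vowel = next(i for i, c in enumerate(word) if c in vowels)
    centre = n_slots // 2
    slots = [None] * n_slots
    for i, c in enumerate(word):
        slots[centre - first_vowel + i] = c
    return slots
```

In the model proper, each of the ten slots is a 16x16-pixel bitmap, which is where the 10 * 16 * 16 = 2,560 visual units come from.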

Phonological representations
1 Note that this vowel-centred alignment is not necessary for good performance in the model. Chang et al. (2012a) showed that any reasonable approximation of the optimal viewing position (OVP), with a central position slightly to the left of centre, would produce similar results.
The scheme of phonological representations was the same as that used in the Plaut et al. (1996) model. Each word was parsed into onset, vowel and coda clusters of phonemes, with specific units used to represent each possible phoneme in each cluster (Table 1). This gives a total of 61 phoneme units.

Semantic representations
We utilised a set of representations previously developed by Chang, Furber, and Welbourne (2012b). These were developed by searching for a representational structure that met five key criteria that were considered desirable for use in large scale computational models: binary coding; sparse coding; a fixed number of critical semantic features in each vector; scalable vectors and most importantly preservation of the human-like semantic structure.
In the literature, several semantic representation schemes have been proposed, based on feature norms (Garrard, Lambon Ralph, Hodges, & Patterson, 2001; McRae, Cree, Seidenberg, & McNorgan, 2005), statistical co-occurrence analyses (Landauer, Foltz, & Laham, 1998; Lund & Burgess, 1996; Rohde, Gonnerman, & Plaut, 2006), artificially generated patterns (Dilkina et al., 2010; Plaut, 1997; Rogers, Lambon Ralph, Garrard, et al., 2004; Welbourne et al., 2011), or Miller, Beckwith, Fellbaum, Gross, and Miller's (1990) online WordNet database (Harm & Seidenberg, 2004). Chang et al. (2012b) considered that only schemes deriving latent semantic relations from co-occurrence matrices over text corpora could meet all the desirable criteria, and that the most promising candidate among these was the Correlated Occurrence Analogue to Lexical Semantics (COALS; Rohde et al., 2006). COALS uses singular value decomposition (SVD) on co-occurrence statistics to produce an arbitrary-length semantic vector for each word. In the original formulation, only the positive-valued parts of the vectors were used, which means that the information contained in the negative parts of the vectors could be lost: the fact that dogs have four legs would be captured, but the fact that they never fly would not, because dogs are likely to co-occur with legs but not with fly. Chang et al. therefore explored coding schemes that retained some of this negative information. Schemes that included negative features consistently produced category structures that more closely matched the human results, and the set of semantic vectors that most closely matched the human-derived structure used five positive and fifteen negative features. Hierarchical clustering analysis showed that, compared with the Garrard et al. (2001) data, most items were clustered into the same group as in the human categorisation. For example, living things were separated from nonliving things, and within the living things, items in the animal group (e.g. dog) were well separated from items in the fruit group (e.g. apple). Within the nonliving things, the broader category of tools (e.g. hammer) was well distinguished from the group of vehicles (e.g. train), although the boundary between tools and household items was less clear. More details can be found in Chang et al.'s (2012b) paper.
The semantic representations used in the present simulations were generated using this scheme, with five positive and fifteen negative features per vector.

Training procedure
The training was separated into two phases. In phase 1, the links between phonology and semantics were trained (shown in grey in Figure 1). This phase of training was intended to correspond to pre-literate language learning in children. In phase 2 the full reading model was trained.
In phase 1, the phonology-semantics model was subdivided into two parts: the production model, learning the mappings from semantics to phonology, and the comprehension model, learning the mappings from phonology to semantics (Figure 2). Both parts were trained on the full corpus of 2,971 monosyllabic words. The probability of each word being presented to the model was determined by its logarithmic frequency. Slightly different learning rates and weight decays were used to train the two parts: the production model was trained with a learning rate of 0.2 and a weight decay of 1E-7, while the comprehension model was trained with a learning rate of 0.05 and a weight decay of zero. The initial weights were randomly set to values between -0.1 and 0.1.
Each example was presented for six intervals of time, and each interval was divided into three ticks. In each presentation, the input pattern was clamped onto the appropriate units for the six intervals. For the last two intervals, the activations of the output units were compared with their targets. The error score was measured on the basis of divergence (cross-entropy; Plaut et al., 1996) between the target and the actual activation of the output units, and was used to calculate weight changes according to the back-propagation through time (BPTT) algorithm (Pearlmutter, 1989, 1995). No error was recorded if the output unit's activation and target were within 0.1 of each other. After this training, an additional 40,000 epochs of interleaved training on the mappings between phonology and semantics was applied in order to fine-tune the combined phonology-semantics model.

-------- Figure 2 Insert Here --------

In phase 2, the full reading model was trained using the BPTT algorithm with a learning rate of 0.1, a weight decay of 1E-8 and a momentum of 0.9. The visual representation of a word was presented at the input units for ten intervals (again, each interval of time was broken into three ticks). The task was to produce the correct phonological and semantic patterns. For the last two intervals, the output units were compared with their corresponding phonological or semantic targets and errors were computed. No error was computed when the output unit's activation and target were within 0.001 of each other. Again, logarithmic frequency was used to determine the probability with which each word was presented to the model. To preclude the possibility that the simulation results were an artifact of one particular set of initial weights, twenty models with different initial weights were trained.
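The divergence error with its zero-error radius can be sketched as below; the function name and the handling of boundary cases are our own assumptions.

```python
import math

def divergence_error(targets, outputs, tolerance=0.1, eps=1e-10):
    """Cross-entropy (divergence) error over binary targets, with a
    zero-error radius: units whose activation is within `tolerance` of
    the target contribute no error (the radius was 0.1 in phase 1 and
    0.001 in phase 2)."""
    total = 0.0
    for t, o in zip(targets, outputs):
        if abs(t - o) <= tolerance:
            continue  # within the zero-error radius
        o = min(max(o, eps), 1.0 - eps)  # guard against log(0)
        total += -(t * math.log(o) + (1.0 - t) * math.log(1.0 - o))
    return total
```

The zero-error radius means that near-correct units stop contributing gradient, which keeps training pressure focused on units that are still clearly wrong.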
When the improvement in the model's performance on phonology and semantics had slowed and reached an asymptote, the accuracy rate on regular nonword pronunciations was used, following Bishop (2006), to determine the end point of training.

Testing Procedures
The testing procedures for both training phases were precisely the same. The decoding procedure for semantics was based on the Euclidean distances between the activations of the semantic units and each of the semantic representations in the training corpus (Monaghan, Shillcock, & McDonald, 2004; Monaghan et al., 2017). The semantic representation closest to the activation of the semantic units was taken as the semantic output. If the output was the same as the target representation, it was counted as a correct response. The procedure for generating the phonological output was the same as that used in Plaut et al.'s (1996) study. For the vowel units, the most activated vowel unit was selected as the output. Onset and coda units were divided into groups of mutually exclusive units, and the most active unit above 0.5 was taken as the output for each group. If no unit was active above 0.5, the group did not contribute to the output. Finally, if any of the ks, ts or ps units was active along with its component phonemes, the order of the components was reversed.
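The nearest-neighbour semantic decoding step can be sketched as follows, assuming a dictionary mapping each corpus word to its semantic vector (names are illustrative):

```python
def decode_semantics(activations, corpus):
    """Return the corpus word whose semantic vector has the smallest
    Euclidean distance to the semantic-unit activations."""
    def dist(vector):
        return sum((a - v) ** 2 for a, v in zip(activations, vector)) ** 0.5
    return min(corpus, key=lambda word: dist(corpus[word]))
```

A response is then scored correct when the decoded word matches the target.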

Training performance
After two million presentations, phase 1 training was halted. The accuracy rates of the production and comprehension model were 99.97% and 99.43% respectively. Figure 3 shows the performance of the full word reading model throughout phase 2 training with and without control units. For the model trained with control units, both semantic and phonological units learned relatively quickly (phonology was faster than semantics), and reached near-perfect performance after 600,000 epochs. The average accuracy rates for the model to produce correct phonological and semantic patterns in the word reading task were 99.3% and 97.4% respectively. By contrast, when the model was trained without using control units, the phonological units never achieved an accuracy rate higher than 91%, and the semantic units were still at only 20% accuracy after one million epochs. As we shall see in Appendix A, this is because without control units the system cannot make the most efficient use of the embedded knowledge in the pre-trained connections between semantics and phonology.

Exploring Polarity Measure for Lexical Decision
Plaut (1997) demonstrated that parallel distributed models could perform the lexical decision task based on a measure of polarity in the semantic layer, which is essentially a test of how binary the representations are. The idea behind this is that during training the units are trained to represent target patterns consisting of binary values, so when unfamiliar items such as nonwords are presented, the units tend to remain closer to their initial states. To capture this phenomenon, Plaut (1997) used a formula to compute the index of unit binarisation, termed unit polarity:

y = x log2(x) + (1 - x) log2(1 - x) + 1

where x is the unit activation ranging from 0 to 1 and y is the polarity measure, which is 0 when the activation is at its uncommitted value of 0.5 and approaches 1 as the activation approaches 0 or 1.
However, Plaut's (1997) model only looked at polarity scores in the semantic layer, and it remains unclear if polarity scores based on visual, orthographic or phonological processing in the reading system would also be useful. Thus, in the present study, polarity scores across units in the H0 and OH (orthographic), phonological and semantic layers were integrated.
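A sketch of the polarity computation, using the entropy-based formulation from Plaut (1997), together with the per-layer average that could then be combined across layers; function names are our own.

```python
import math

def unit_polarity(x, eps=1e-10):
    """Entropy-based polarity of a single unit: 0 at activation 0.5,
    approaching 1 as the activation approaches 0 or 1 (Plaut, 1997)."""
    x = min(max(x, eps), 1.0 - eps)  # clip to avoid log(0)
    return x * math.log2(x) + (1.0 - x) * math.log2(1.0 - x) + 1.0

def layer_polarity(activations):
    """Average polarity over the units of one layer; layer averages can
    then be combined across orthographic, phonological and semantic layers."""
    return sum(unit_polarity(a) for a in activations) / len(activations)
```

A layer whose units have committed to near-binary states thus scores close to 1, while a layer left near its resting state scores close to 0.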

Testing stimuli
To investigate whether the individual polarity scores from different key processing layers and their combined scores could provide a reliable source for making lexical decisions, we computed the polarity values generated from a set of words and nonwords taken from Evans et al.'s (2012) study. They used three different types of nonword foils, controlled in their orthographic and phonological relationships to the real words; this allowed us to see the polarity differences between words and the different sets of nonword foils (consonant string, pseudoword and pseudohomophone). However, ten of the words were not in our training corpus, so these were removed from the test along with their matched nonword foils, leaving a total of 70 words along with three sets of matched nonword foils (consonant strings, pseudowords and pseudohomophones). Polarity was slightly larger for words and smallest for consonant strings, with pseudowords and pseudohomophones somewhere in between; all the differences were small. In the semantic layer (Figure 4 panel c), the main feature was that the polarity for words was higher than for all other stimuli, suggesting that this layer might make the most reliable contribution to lexical decision. However, it should be noted that these differences emerge relatively late compared with the difference between words and consonant strings in the orthographic layer.

Average polarities at different layers
Overall, these results suggest that different areas of the model may contribute differently to the lexical decision task, depending on whether a given layer provides sufficient information for deciding which foils can be rejected, and on when that information becomes available. Consonant strings can be distinguished from words early on in the visual-orthographic processing areas, whereas pseudowords and pseudohomophones may require input from semantic processing areas. By averaging the polarities across these processing layers, it might be possible to maximise the differences.
-------- Figure 4 Insert Here --------

The polarity difference for each nonword set was computed by subtracting the average nonword polarity from the average word polarity across the key processing layers, normalised by SD_i, where SD_i is the standard deviation of the nonword polarity for the items in that set. Figure 5 shows the average polarity differences of the key processing layers by nonword condition. One-sample t tests were conducted to examine whether the average polarity difference in each condition was greater than zero; all the differences were significantly different from zero (all ps < .05). A one-way repeated-measures ANOVA with nonword condition as a within-subject factor revealed a significant effect of nonword condition, F(2, 38) = 163.44, p < .001. A Tukey post-hoc test revealed that the polarity difference in the consonant string condition (1.33) was significantly larger than that in both the pseudoword condition (0.90), p < .001, and the pseudohomophone condition (0.82), p < .001. Moreover, the polarity difference in the pseudoword condition was significantly larger than that in the pseudohomophone condition, p < .05.
-------- Figure 5 Insert Here -------- Figure 6 illustrates the individual contributions of the orthographic, phonological and semantic layers to the polarity differences by showing the percentage that each layer contributed to the average differences in the three foil conditions. In the consonant string condition, orthography made the largest contribution (48%), while in the pseudoword and pseudohomophone conditions, semantics made the largest contribution (38% and 36%, respectively). T tests revealed that the contribution of the orthographic layer in the consonant string condition was significantly larger than in both the pseudoword and pseudohomophone conditions (both ps < .05). By contrast, semantic processing was more important in both the pseudoword and pseudohomophone conditions relative to the consonant string condition (both ps < .05).

Lexical Decision Criteria
The simulations so far have shown that differences in polarity scores at different layers in the network have the potential to form the basis for lexical decisions. But a crucial question is what criteria should be applied to this measure to allow lexical decisions to be made accurately and quickly? A straightforward approach might be to use a fixed threshold. If the item polarity becomes higher than the threshold, it is a word; otherwise, it is a nonword.
However, this is not ideal because the polarity measures are time-dependent: early in processing a relatively low threshold might be appropriate, but later a much higher threshold would be required. The single-threshold approach also suffers from the problem that nonword decisions can only be made after all processing ticks have elapsed (when it is known that the threshold will never be exceeded), whereas in reality nonword decisions can be made very quickly if the nonwords are not very word-like. To avoid these problems we adopted three separate dynamic criteria for word and nonword decisions: (1) word boundary: if at any tick the polarity score exceeded the average polarity score for nonwords by more than two standard deviations, a word decision was recorded; (2) nonword boundary: if at any tick the polarity score fell more than two standard deviations below the average polarity score for words, a nonword decision was recorded; (3) minimum activation: a decision could only be made once the polarity had reached a level of 0.8. The last criterion ensured that the model made decisions on the basis of reliable information.
The polarity for an item was computed by combining the measures of average polarity for that item in the orthographic, phonological, and semantic layers so that the layers made equal contributions. If at any tick either of the first two criteria was met and the polarity score was greater than 0.8, the lexical decision was assumed to have been made, and the response time was taken as the tick on which the decision was reached. When neither criterion had been met by the last tick, the response was based on whether the item's polarity at the final tick was closer to the average word or the average nonword polarity. Response latencies for those items were assigned according to their distance from the relevant average polarity: the closest 10% of items were assigned a latency of 31 ticks, the next 10% a latency of 32 ticks, and so on, with the furthest 10% assigned the slowest latency of 40 ticks.
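The three dynamic criteria can be sketched as a small decision loop. This is a minimal illustration, assuming that per-tick word and nonword polarity statistics (means and standard deviations) are available from a calibration set; the function name and all numbers except the 0.8 minimum and the two-standard-deviation boundaries are hypothetical.

```python
# Sketch of the three dynamic lexical decision criteria described above.

def lexical_decision(polarity_by_tick, word_mean, word_sd, nonword_mean, nonword_sd,
                     min_polarity=0.8):
    """Return ('word' | 'nonword' | 'none', tick_of_decision).

    polarity_by_tick: the item's combined polarity at each processing tick.
    word_mean/word_sd, nonword_mean/nonword_sd: per-tick statistics,
    indexed the same way as polarity_by_tick.
    """
    for t, p in enumerate(polarity_by_tick):
        if p < min_polarity:                          # criterion 3: minimum activation
            continue
        if p > nonword_mean[t] + 2 * nonword_sd[t]:   # criterion 1: word boundary
            return "word", t
        if p < word_mean[t] - 2 * word_sd[t]:         # criterion 2: nonword boundary
            return "nonword", t
    # Deadline reached: the caller falls back to proximity to the average
    # word/nonword polarity at the last tick, as described in the text.
    return "none", len(polarity_by_tick) - 1
```

On this sketch, an item whose polarity rises quickly above the nonword band is accepted early as a word, while an item trapped inside the word band's lower boundary is rejected early as a nonword.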

Inverse efficiency
It is worth noting that the cut-off lines for word and nonword decisions are arbitrary and could be varied to produce different speed-accuracy trade-offs. To control for these potential differences, we adopted inverse efficiency as our performance measure. Inverse efficiency is reaction time divided by accuracy, and it is relatively robust to different levels of speed-accuracy trade-off (Roberts, Lambon Ralph, & Woollams, 2010; Roder, Kusmierek, Spence, & Schicke, 2007). To illustrate its effectiveness, we tested the model on all 2,971 words in the training set against a set of nonwords consisting of the same number of monosyllabic pseudowords taken from the ARC nonword database (Rastle, Harrington, & Coltheart, 2002); nonword length ranged from three to seven letters. Cut-off lines of one, two and three standard deviations were tested. With a cut-off line of one standard deviation, only 0.2% of word responses failed to meet the decision criteria, but this increased to 27.2% with a cut-off line of two standard deviations and 69.1% with a cut-off line of three standard deviations. The response latencies of those items were assigned according to their distance from the average polarity. Figure 7 shows the distributions of accuracy, response time and inverse efficiency produced by the model for the three cut-off lines.
The use of different cut-off lines greatly influenced the distribution of response times while the distribution of inverse efficiency remained similar, indicating that inverse efficiency is relatively robust to the choice of cut-off line. In the following tests, we used the cut-off line of two standard deviations.
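The inverse efficiency measure itself is simple to state in code. The sketch below assumes the common formulation (mean RT on correct trials divided by proportion correct); the trial data are toy values.

```python
# Inverse efficiency: mean correct RT divided by proportion correct.
# Lower values indicate more efficient performance.

def inverse_efficiency(rts, correct):
    """rts: response times (e.g., in ticks); correct: matching booleans.

    Returns mean RT over correct trials divided by the proportion of
    correct trials, so slow OR inaccurate performance both raise the score.
    """
    assert len(rts) == len(correct) and any(correct)
    correct_rts = [rt for rt, ok in zip(rts, correct) if ok]
    mean_rt = sum(correct_rts) / len(correct_rts)
    accuracy = sum(correct) / len(correct)
    return mean_rt / accuracy
```

With perfect accuracy the score reduces to mean RT; errors inflate it, which is what makes the measure resistant to speed-accuracy trade-offs induced by different cut-off lines.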

Semantic influences on lexical decision
The key test for the model was whether it could produce a graded imageability effect in lexical decision depending on condition difficulty, as observed in Evans et al. (2012), where the imageability effect was larger when words were tested in the context of pseudohomophones than of pseudowords, and disappeared altogether in the context of consonant strings, as in Figure 8. Nevertheless, all of the semantic vectors for high imageability and low imageability words in the model had the same number of active features. This is very different from previous connectionist simulations (Harm & Seidenberg, 2004; Plaut & Shallice, 1993) that explicitly implemented semantic representations with more features for high than for low imageability words. Thus, the question arises as to how the imageability effect emerged in the present model.
Several studies have reported that words with a high number of shared semantic features generate faster responses than those with fewer shared features when the total number of features is held constant, in both lexical decision and semantic tasks (Grondin, Lupker, & McRae, 2009), suggesting that not all features matter equally. Semantic richness has been proposed to be multifaceted, with each construct exerting a distinctive influence on lexical processing (Yap, Tan, Pexman, & Hargreaves, 2011; Yap, Pexman, Welsby, Hargreaves, & Huff, 2012). Thus, imageability was likely operationalised in the model in terms of shared semantic features. If this is the case, we would expect high imageability words to have more shared features than low imageability words. We computed the representational distance matrices for both high and low imageability words, as in Figure 9. A pair-wise correlation analysis showed that the average cosine distance for high imageability words (M = 0.867) was significantly lower than for low imageability words (M = 0.903), t(1188) = 5.45, p < .001, confirming that high imageability words share more semantic features with other words than low imageability words do.
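The shared-feature logic behind this analysis can be sketched directly: with binary semantic vectors that all have the same number of active features, sets of words that overlap more in their features have smaller average pairwise cosine distances. The vectors below are toy examples, not the model's representations.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Each toy vector has exactly three active features; the first set shares
# more features across items (standing in for high imageability words).
high_overlap = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0], [1, 1, 0, 0, 1, 0]]
low_overlap = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1], [1, 0, 0, 0, 1, 1]]
```

Here the high-overlap set yields a lower mean pairwise distance than the low-overlap set, mirroring the direction of the reported high- versus low-imageability difference.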

Frequency by consistency effects on lexical decision
The effects of frequency and consistency have been well documented in word naming (Baron & Strawson, 1976; Jared, McRae, & Seidenberg, 1990; Plaut et al., 1996; Seidenberg & McClelland, 1989; Stanovich & Bauer, 1978; Taraban & McClelland, 1987). In lexical decision, however, a small or null effect of consistency is generally reported (Andrews, 1982; Seidenberg et al., 1984; Waters & Seidenberg, 1985). The effect is more likely to be observed when studies include orthographically atypical words (i.e., words with low bigram frequencies) (Parkin & Underwood, 1983; Waters & Seidenberg, 1985) or when megastudies include a wide range of words (Balota et al., 2004). Consistency effects in lexical decision have also been reported in patients with semantic dementia. These results suggest that the effects of spelling-sound inconsistency may not be easily observable when the lexical decision can be made on the basis of perceptual information available prior to access to phonology, in contrast to when words must be recognised slowly or the task requires phonological codes (Waters & Seidenberg, 1985). Here, we tested the model on four sets of stimuli taken from Andrews (1982): high-frequency consistent, low-frequency consistent, high-frequency inconsistent, and low-frequency inconsistent words. Each set consisted of 20 words, except that one high-frequency inconsistent word (live) was discarded because it was not in the training corpus. As there were no nonword foils included in Andrews (1982), the full results are reported in the supplementary material; they suggest that the consistency effect is not generally observable in factorial designs with a small sample of words. As we shall see in the next section, the effect could be captured by using a regression technique on a wider range of words.

Linear mixed-effect model analyses on lexical decision
We have examined imageability effects, and the effects of frequency, consistency and their interaction, using factorial analyses. To rule out the possibility that the observed effects were a consequence of the specific samples tested, it is useful to assess whether the model produces similar effects on a larger set of words. Thus, we conducted linear mixed-effects model (LMM) analyses on all the words in the training corpus, akin to the large-scale regression analyses of lexical decision conducted by Balota et al. (2004). We anticipated that the model should reproduce a range of reading effects in lexical decision including frequency, consistency, neighbourhood size, word length, and imageability, as well as the interactions of frequency by word length and of frequency by neighbourhood size (Andrews, 1992; Balota et al., 2004; Cortese & Khanna, 2007).
All 2,971 words in the training set were tested against the set of nonwords consisting of the same number of monosyllabic pseudowords from the ARC nonword database (Rastle, Harrington, & Coltheart, 2002). Inverse efficiency scores served as the dependent variable. Item number (1 to 2,971) and simulation (1 to 20) were included as random effects. A set of psycholinguistic variables was included as fixed effects: frequency (Freq), word length (WL), orthographic neighbourhood size (OrthN), consistency (Con) and imageability (Img). The frequency measure was the frequency of the model's exposure to each word. The spelling-to-sound consistency score was the proportion of words with the same rime that were pronounced in the same way as the target word, weighted by word frequency (Jared, 1997). Orthographic neighbourhood size was the number of words that could be created by changing one letter of the target word (Coltheart, 1977). Imageability scores were taken from the norms of Cortese et al. (2004). All error responses and outliers (inverse efficiency more than three standard deviations from the mean) were excluded. Only words with known values for all the predictors were used, leaving 48,804 observations for further analysis. The dependent variable was log-transformed because its distribution was skewed, and all variables were scaled prior to the LMM analyses. The correlations between the predictors and inverse efficiency can be found in Table 2. The LMM results showed that Freq, Con, WL, OrthN and Img all made significant contributions to predicting inverse efficiency (Table 3): words that were high in frequency, more imageable, and consistent in their spelling-to-sound mappings were processed more easily, as were words that were short and had many neighbours.
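Two of these predictors are easy to make concrete. The sketch below implements a frequency-weighted rime consistency score in the spirit of Jared (1997) and Coltheart's orthographic N; the tiny lexicon and the precomputed rime segmentation are hypothetical simplifications of the actual corpus computations.

```python
# Sketches of two predictors used in the LMM analyses.

def rime_consistency(target, lexicon):
    """Frequency-weighted rime consistency (after Jared, 1997).

    lexicon maps word -> (rime_spelling, rime_pronunciation, frequency).
    Returns the frequency-weighted proportion of same-rime words that are
    pronounced the same way as the target ('friends' over all same-rime words).
    """
    rime, pron, _ = lexicon[target]
    neighbours = [(p, f) for w, (r, p, f) in lexicon.items() if r == rime]
    friends = sum(f for p, f in neighbours if p == pron)
    total = sum(f for _, f in neighbours)
    return friends / total

def coltheart_n(target, words):
    """Orthographic N: words formed by changing exactly one letter (Coltheart, 1977)."""
    def one_letter_apart(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1
    return sum(one_letter_apart(target, w) for w in words if w != target)
```

For example, in a toy lexicon where "pint" is the lone enemy of the "-int" rime, the consistency of "mint" is the summed frequency of the friends divided by the summed frequency of all same-rime words.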
To explore the relationship between frequency and word length, all the predictors along with the interaction term were entered into the LMM. The interaction between frequency and word length was significant, Estimate = -.056, SE = 0.010, t = -5.74. A further analysis showed that the interaction between frequency and orthographic neighbourhood size was not significant (t = 0.51), though it became significant, Estimate = .025, SE = 0.012, t = 2.06, when the three-way interaction of frequency by neighbourhood size by word length was included, Estimate = .028, SE = 0.010, t = 2.74. In the model, therefore, the interaction between frequency and neighbourhood size was modulated by word length. The interaction pattern between frequency and orthographic neighbourhood size shown in Figure 10(b) is also similar to that reported by Balota et al. (2004, Figure 13, young adults) and is compatible with Andrews's (1989) finding that the effect of orthographic neighbourhood size was facilitatory for low-frequency words while null or inhibitory effects were observed for high-frequency words (Balota et al., 2004).
In summary, the main effects on the model's lexical decision performance are consistent with those reported in previous behavioural studies (Balota et al., 2004; Cortese & Khanna, 2007). For the interactions, we found that the effect of word length depended on word frequency (Balota et al., 2004), as did the effect of orthographic neighbourhood size (Andrews, 1989, 1992; Balota et al., 2004).

Predicting item-level variance in lexical decision
Predicting item-level variance in human latencies is one of the most challenging tests for any computational model of reading (Spieler & Balota, 1997). When several influential models of reading aloud, including DRC (Coltheart et al., 2001), CDP (Zorzi, Houghton, & Butterworth, 1998), the triangle model (Plaut et al., 1996), and CDP+, were tested on the 2,870 words in Spieler and Balota (1997), only 3% to 7% of the variance in word naming latencies could be accounted for (Coltheart et al., 2001). This is considerably lower than the variance in human latencies accounted for by the three most important lexical factors (frequency, orthographic neighbourhood size and word length), which is about 21.7%. While previous studies have focused on the ability of computational models to predict item variance in word naming latencies, to our knowledge very few studies (e.g., Kello, 2006) have reported item variance in computational models of lexical decision over a large set of words.
To address this gap, we tested the present model on its ability to account for item variance in human lexical decision latencies using data from the English Lexicon Project (ELP; Balota et al., 2007). Again, we tested all 2,971 words in the training set against the same number of pseudowords. We first computed the average lexical decision latency across simulation runs for each word, and then looked up its lexical decision latency (z-scored RT) in the ELP, where 2,864 of the words had recorded latencies.
Prior to analysis, outliers (greater than two standard deviations) were removed from the behavioural data, excluding 4.2% of words. The model's inverse efficiency scores were log-transformed and entered into a regression model as a predictor, with the behavioural z-scored RTs as the dependent variable. The inverse efficiency scores accounted for a significant portion of the variance in z-scored RTs, R² = 7.3%, p < .001.
For comparison, we conducted an additional regression analysis in which three lexical factors served as predictors: word frequency from the Hyperspace Analogue to Language corpus (HAL; Lund & Burgess, 1996), orthographic neighbourhood size, and word length. Together these variables predicted 42.64% of the variance in z-scored RTs (p < .001), substantially more than the variance predicted by the present model.
One possible explanation for this gap is that the frequency range used to train the model was substantially compressed by a logarithmic transformation applied to reduce training times; this compression may be ill-suited to capturing participants' item-level frequency effects.
To explore this possibility, we conducted an additional regression analysis with the log frequency used in the model, orthographic neighbourhood size and word length as predictors, and the z-scored RTs as the dependent variable. This analysis predicted 15.76% of the variance in z-scored RTs (p < .001), which is higher than the model's performance but much lower than that of the regression model with HAL frequency (42.64%). This suggests that the compressed frequency range used to train the model is at least one cause of the discrepancy.
Another issue concerns how much semantic processing contributes to the item variance accounted for by the model. While the semantic system is crucial for the model to simulate the graded semantic effect in lexical decision, it remains unknown how much added value the semantic layer contributes to accounting for item-level variance in human lexical decision data. To test the contribution of semantic processing, we removed all the polarity scores generated from the semantic layer, akin to a more focal lesion of semantic processing, and re-ran the regression analysis. The inverse efficiency scores still accounted for a significant portion of the variance in z-scored RTs, R² = 4.1%, p < .001, which is 3.2% lower than the full regression model. For comparison, we conducted two further regression analyses to test the contributions of phonological and orthographic processing: the variance accounted for by the regression model without phonology or without orthography was 0.3% or 1.4% lower than the full model, respectively. These findings demonstrate that semantic processing contributes substantially to the model's ability to account for item-level variance in human lexical decision data, with orthographic processing contributing the next most and phonological processing the least.
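These layer-removal comparisons amount to a delta-R² analysis: fit the same regression with and without one component of the predictor and compare the variance explained. A toy sketch using a univariate ordinary-least-squares R² (all data hypothetical) illustrates the logic.

```python
# Simple-regression R^2: variance in y explained by a single predictor x.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    # Residuals of the best-fit line y_hat = my + (sxy/sxx) * (x - mx).
    ss_res = sum((b - my - (sxy / sxx) * (a - mx)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical latencies and two versions of a model predictor: a "full"
# score and a "lesioned" score with one information source removed.
human_rts = [1, 2, 3, 4, 5]
full_predictor = [1.1, 1.9, 3.2, 3.8, 5.0]
lesioned_predictor = [2, 1, 3, 5, 4]
delta_r2 = r_squared(full_predictor, human_rts) - r_squared(lesioned_predictor, human_rts)
```

A positive delta_r2 indicates that the removed component carried unique predictive information, which is the form of the comparison reported above.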

Simulation 2
Having demonstrated that the model can simulate lexical decision by integrating information across visual-orthographic, phonological and semantic processing layers, the next important question is whether damage to those layers would result in general patterns of impaired lexical decision similar to those observed in patients with functionally corresponding reading deficits -namely pure alexia, phonological dyslexia and semantic dementia.

Pure Alexia
Pure alexia (PA) is a neuropsychological deficit generally caused by lesions in the left ventral occipitotemporal region (Damasio & Damasio, 1983). The hallmark feature of PA is abnormally strong length effects in word reading times, and it is thought by many to result from damage to visual processing (Arguin, Fiset, & Bub, 2002;Behrmann, Plaut, & Nelson, 1998;Fiset, Arguin, & McCabe, 2006;Roberts, et al., 2010). Despite the visual impairment in PA patients, some are still sensitive to lexical variables such as word frequency (Behrmann, et al., 1998;Johnson & Rayner, 2007;Montant & Behrmann, 2001), regularity (Friedman & Hadley, 1992), orthographic neighbourhood size (Arguin, et al., 2002;Fiset, et al., 2006;Montant & Behrmann, 2001), age of acquisition (Cushman & Johnson, 2011) and word imageability (Behrmann, et al., 1998) in word naming. In terms of lexical decision, some PA patients are able to perform the task above chance (Coslett & Saffran, 1989;Friedman & Hadley, 1992;Roberts, et al., 2010). Their lexical decision performance is modulated by the severity of the condition, word frequency, imageability and nonword type. According to the partial activation account (Behrmann, et al., 1998), some lexical-semantic processing could be activated by bottom-up visual stimuli, albeit to a lesser level relative to unimpaired reading (see the right-hemisphere account by Coslett and Saffran (1994) for a different interpretation of implicit recognition).

Phonological Dyslexia
Patients with phonological dyslexia (PD) are characterised by relatively impaired nonword reading in the context of better word reading accuracy (Beauvois & Derouesné, 1979). A prominent account holds that the central deficit in phonological dyslexia is a disturbance of phonological processing, because patients' reading performance is strongly correlated with their non-reading phonological deficits, and they exhibit the same qualitative performance characteristics on reading and non-reading tasks, including substantial lexicality and imageability effects (Denes, Cipolotti, & Semenza, 1987). Only when patients are tested with low-frequency words does their performance decline below typical levels (Dujardin et al., 2011).

Semantic Dementia
Semantic dementia (SD) is characterized by a progressive loss of conceptual knowledge.
Several studies have shown that SD patients' word naming performance is affected by psycholinguistic properties of words such as frequency, regularity, consistency and imageability (Jefferies et al., 2004; Patterson et al., 2006; Woollams, 2015). Many SD patients also show surface dyslexic reading patterns (Woollams et al., 2007), with a selective deficit in naming words with inconsistent spelling-to-sound mappings while nonword naming is relatively preserved, although isolated studies have reported SD patients without significant impairment in reading aloud (e.g., Blazely et al., 2005). In terms of lexical decision, a number of studies have shown that SD patients perform significantly more poorly than controls (Benedet, Patterson, Gomez-Pastor, & de la Rocha, 2006; Diesfeldt, 1992; Patterson et al., 2006; although see Coltheart, 2004; Blazely et al., 2005). Importantly, SD patients' performance depends on the nature of the stimuli. Diesfeldt (1992) reported that in a visual lexical decision task, a patient, BHJ, performed well when words were tested against consonant strings but had significant difficulty distinguishing words from more wordlike nonwords such as pseudowords and pseudohomophones. In a two-alternative forced-choice paradigm, patients were able to distinguish orthographically typical words from relatively atypical nonwords, but their performance was significantly impaired in the reverse condition.
Additionally, several studies have shown an enhanced imageability effect in SD patients (Jefferies et al., 2009; Hoffman & Lambon Ralph, 2011; Hoffman, Jones, & Lambon Ralph, 2013), while others have reported a reversal of the imageability effect (Bonner et al., 2009; Breedin et al., 1994; Yi et al., 2007). The discrepant findings could be due to the selection of stimuli and/or individual differences (Hoffman et al., 2013). When appropriate stimuli are used, most SD patients process high imageability words more efficiently than low imageability words, because high imageability words have more robust semantic representations. However, most enhanced imageability effects are observed when SD patients perform semantic tasks (e.g., synonym judgment and picture-word association). Such tasks require explicit access to semantic knowledge, whereas in lexical decision or word naming the semantic system may be involved without explicit semantic knowledge needing to be retrieved. A few studies have investigated SD patients' word naming and lexical decision performance, but the findings are inconclusive. For example, Breedin et al. (1994) reported that an SD patient, DM, responded more slowly to concrete than to abstract words in an auditory lexical decision task, a trend toward a reversed imageability effect (p < .06). In a case series of SD patients, Reilly, Grossman and McCawley (2006) reported a reversed imageability effect in mild SD patients' auditory lexical decision performance, but a null effect for more severe patients. Moreover, Pulvermüller et al. (2010) found a facilitatory imageability effect in SD patients' visual lexical decision performance, but the effect was potentially confounded with frequency because their low imageability words had significantly lower frequency than their high imageability words.
In a recent study, Woollams (2015) demonstrated a reversed imageability effect in SD patients' word naming, particularly for inconsistent words, suggesting that damage to the semantic system may render the relevant semantic knowledge inaccessible or unreliable, resulting in an over-reliance on orthographic and phonological knowledge.
In summary, to simulate patients' lexical decision data, we damaged the model by lesioning the functional locus of the corresponding processing layer. For PA the visual and orthographic layers were damaged, for PD the phonological layer was damaged, and for SD the semantic layer was lesioned. The damaged models were retrained for a period of time to mimic the recovery process following brain damage (Welbourne & Lambon Ralph, 2005; Welbourne et al., 2011). After retraining, we tested the damaged models on the two sets of data used in Simulation 1 so that we could compare the performance of the damaged models with the intact model. The tests included: (1) frequency and consistency effects from Andrews (1982); (2) imageability and foil effects from Evans et al. (2012).
In line with the patient data, we predicted that the PA model would show generally impaired lexical decision performance compared to the intact model but with some sensitivity to lexical-semantic properties of words. Importantly, the performance of the PA model should be strongly modulated by the foil conditions, because orthographic processing is more demanding for word-like nonwords than for consonant strings, though the difference between the pseudoword and pseudohomophone contexts could be small as both fit the orthographic structure of real words.
For the PD model, we predicted slightly impaired performance with frequency, consistency and imageability effects similar to those observed in the intact model, because phonological processing is relatively less crucial for lexical decision. For the SD model, we predicted strong consistency effects. Regarding imageability effects, the evidence in the literature is equivocal, but we expected that in the presence of word-like foils we might be able to detect a reversed imageability effect in line with Woollams (2015).

Method
The method for simulating the PA and PD patient types was similar, the only differences being the location of the damage and the amount of retraining required for the model to recover to a stable performance level. For PA damage, we randomly removed 90% of the links connecting to or from the HO layer, coupled with 90% of the links into or out of the connected control units; the network was then retrained for 300,000 epochs. For PD damage, we randomly removed 90% of the links into and out of the phonology layer, together with 90% of the links into or out of the connected control units; this time the network was retrained for 400,000 epochs. To capture some of the variation within patients, we analysed the average model performance across five consecutive time points toward the end of retraining: the PA model from 260,000 to 300,000 epochs and the PD model from 360,000 to 400,000 epochs, in steps of 10,000 epochs.
Semantic dementia is unlike the other two deficits in that it results from a progressive disorder (Hodges, Patterson, Oxbury, & Funnell, 1992). Following Welbourne and colleagues (2005, 2007, 2011), we simulated it by repeatedly interleaving very mild damage with retraining: we randomly removed 0.8% of the links into or out of the semantic layer, together with the links into or out of the connected control units, and then trained the network for one epoch. This process was repeated 400 times.
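The lesioning procedure can be sketched in a few lines. The weight dictionary, unit naming, and retrain callback below are illustrative stand-ins rather than the simulator's actual data structures; only the removal proportions (90% for the acute lesions, 0.8% per cycle for SD) come from the text.

```python
import random

def lesion(weights, layer_units, proportion, rng=random.Random(0)):
    """Remove `proportion` of the links touching any unit in `layer_units`.

    weights: dict mapping (source_unit, target_unit) -> weight.
    The fixed-seed default rng just makes the sketch reproducible.
    """
    touching = [k for k in weights if k[0] in layer_units or k[1] in layer_units]
    for key in rng.sample(touching, int(round(proportion * len(touching)))):
        del weights[key]
    return weights

def simulate_sd(weights, sem_units, retrain, cycles=400, proportion=0.008):
    """Progressive SD damage: very mild lesion, one epoch of retraining, repeated."""
    for _ in range(cycles):
        lesion(weights, sem_units, proportion)
        retrain(weights)  # stand-in for one epoch of retraining
    return weights
```

An acute PA- or PD-style lesion would be a single `lesion(weights, layer, 0.9)` call followed by a long block of retraining, whereas the SD loop interleaves tiny lesions with single retraining epochs.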

Results
All of the damaged models were tested in the same way. For the frequency by consistency test, a 2 × 2 ANOVA (frequency: high, low; consistency: high, low) was conducted on the inverse efficiency scores. Accuracy data are reported in Table 4 for completeness; however, given the speed-accuracy trade-off patterns that arise with arbitrary cut-off lines (Figure 7), we did not analyse the accuracy data and instead restricted the analyses to inverse efficiency. For the imageability by foil test, a 2 × 3 ANOVA (imageability: high, low; foil type: consonant string, pseudoword, pseudohomophone) was conducted. The performance of the PA, PD and SD models, along with the intact model, on the two lexical decision tasks is summarised in Tables 4 and 5.
As indicated in Table 4, the performance of the PA model was generally impaired. Table 5 shows that the PA model produced significant frequency effects, whereas the consistency effect and its interaction with frequency were not statistically reliable. For the imageability by foil test, the main effect of imageability was not significant but, more importantly, the main effect of foil type was significant, and there was a significant interaction between imageability and foil type. For the PD model, there were significant main effects of frequency and consistency but the interaction was not significant. The PD model produced higher average inverse efficiency for high consistency words than for low consistency words. Further analyses revealed that this reversal of the consistency effect was caused by semantic processing: re-running the analyses without the semantic contribution resulted in a facilitatory consistency effect, F(1, 19) = 8.4, p < .01, ηp² = 0.31, with high consistency words processed more efficiently than low consistency words. For the imageability by foil test, the effect of foil was significant, whereas the effect of imageability approached significance (p = .081). Importantly, there was a significant interaction. An analysis of simple effects revealed that the imageability effect was not significant in the consonant string condition (p = .36), while it approached significance in both the pseudoword condition, F(1, 19) = 4.31, p = .052, ηp² = 0.19, and the pseudohomophone condition, F(1, 19) = 3.71, p = .069, ηp² = 0.16, with high imageability words processed more efficiently than low imageability words. These results suggest some involvement of semantics in the word-like foil conditions.
With regard to the SD model, the lexical decision performance was generally poor.
However, the SD model produced significant effects of frequency, consistency, and their interaction in lexical decision. In particular, the model's performance was strongly affected by spelling-to-sound mappings, because the damaged semantic system could not effectively support recognition of inconsistent words. For the imageability by foil test, although the average inverse efficiency for high imageability words (50.5) was numerically larger than for low imageability words (46.1), the effect did not reach significance, p = .16. There was also a significant effect of foil condition, whereas its interaction with imageability was not significant.

General Discussion
The primary aim of this paper was to investigate the processing underlying lexical decision tasks by developing a large-scale recurrent model containing visual, orthographic, phonological, and semantic components. Simulation 1 described how the model was developed by combining oral language and reading training. The model was then used to make lexical decisions by combining information across different processing components, and to explore what contribution each component made to the decisions in different contexts. In particular, we were interested in the role of semantics in lexical decision. Based on the measure of polarities at the core processing layers (orthography, phonology and semantics), the model was able to produce differential semantic effects corresponding to the different nonword foils. The model also captured a range of reading effects in lexical decision, including frequency, consistency, word length, orthographic neighbourhood size, and imageability. When information from selected layers was damaged, the model produced patterns similar to those observed in patients who have difficulty accessing that particular type of information. These results demonstrate the ability of the model to account for both normal and impaired lexical decision.

Comparison with Other Recurrent Models
In the literature, several large-scale connectionist models of single word reading have addressed the challenge of modelling temporal dynamics in a recurrent framework (Chang, Plaut, et al., 1996; Monaghan et al., 2017). The reading model developed by Harm and Seidenberg (2004) probably provides the most complete simulation of human reading behaviours. Their model demonstrates the division of labour between the phonological and semantic processes, illustrating the cooperative and competitive nature of the reading system.
When compared with Harm and Seidenberg's (2004) model, the present model can be considered an extension of the same framework. The model here was developed according to the same connectionist principles and is also a large-scale continuous recurrent model with embedded pre-existing phonological and semantic knowledge. However, the present model differs from the previous work in four important respects: (1) The model includes an additional visual processing component and starts processing from vision to orthography before splitting to both phonology and semantics. The orthographic component can be considered a layer of hidden units (i.e., the OH layer in the model) whose representations are allowed to develop through the acquisition of reading skills, as in humans. This implementation is supported by evidence from our previous work (Chang, et al., 2012a), which demonstrated that parallel models can account for the interaction between word length and lexicality observed in reading aloud (Weekes, 1997), provided that orthographic representations are not pre-defined and can be learned over the course of training. This additional visual processing component also allows the model to simulate the behavioural patterns observed in patients with visual-orthographic deficits.
(2) The semantic representations are generated differently and include negative as well as positive components. In Harm and Seidenberg's (2004) study, the semantic features were generated using WordNet (Miller, et al., 1990). (3) It would be reasonable to assume that their training issues were largely caused by the pre-trained weights used for the simulation of pre-existing phonological and semantic knowledge.
Adding direct connections from orthography to phonology and semantics is likely to alleviate this problem, as it effectively reduces the network depth. Compared to their model, the present model has a deeper structure (i.e., two additional layers for visual processing), which substantially hinders learning without control units. Using local control units for each layer, with inputs that mirror the inputs of the layer they control, is a very effective way of dealing with this problem.
(4) The current investigation explored the nature of 'normal' lexical decision and also lexical decision across three types of acquired dyslexia. Both sources of empirical data have strongly influenced theories and models of reading behaviour. Accordingly, it is interesting and important to test whether implemented models of lexical processing can capture both sources of behavioural data.

Normal Lexical Decision
The most significant contribution of this paper is that the model extends the domain of connectionist models by explicitly tackling a range of lexical decision phenomena, including effects of foil type, frequency, imageability, consistency, word length and orthographic neighbourhood size. There is considerable evidence showing that distinguishing words from consonant strings is generally fast and accurate (Evans, et al., 2012; James, 1975; Ratcliff, et al., 2004; Shulman & Davison, 1977). It is likely that this decision can be made mostly on the basis of visual and/or orthographic information (e.g., Grainger & Jacobs, 1996), with deeper processing required only when the decision becomes difficult (Plaut, 1997; Seidenberg & McClelland, 1989). This is supported by the result shown in Figure 6: in the context of consonant strings the orthographic processing layer contributed the most to the polarity differences, while in the context of more word-like nonwords the semantic layer was critical.
The critical test for the present paper was to see whether the model could account for the differential semantic effects in lexical decision when words were tested with different types of nonwords. The extent of the involvement of semantics in lexical decision is controversial (Balota & Chumbley, 1984; Chumbley & Balota, 1984; Coltheart, et al., 2001; Dilkina, et al., 2010; Plaut, 1997). This is mostly because semantic information was initially thought to become available only after lexical access had been completed, and because lexical access was assumed to be the only process involved in the lexical decision task (Becker, 1980; Collins & Loftus, 1975; Forster, 1976; Morton, 1969).
This means that there would be no semantic influences on lexical decision. This view has greatly influenced subsequent theoretical and computational models of lexical decision, which focus on orthographic processing (Coltheart, et al., 1977; Coltheart, et al., 2001). However, evidence from behavioural (e.g., Balota & Chumbley, 1984; Chumbley & Balota, 1984; Balota et al., 2004; Cortese & Khanna, 2007), neuroimaging (e.g., Binder et al., 2003; Hauk et al., 2006; Woollams et al., 2011) and patient studies (e.g., Patterson et al., 2006) has suggested that semantic processing is involved in lexical decision, albeit to different extents depending on the properties of the word stimuli and the foil types (e.g., Degroot, 1989; James, 1975; Joordens & Becker, 1997; Evans et al., 2012). In particular, Evans et al. (2012) showed that the semantic effect was stronger with pseudohomophones than with pseudowords, and stronger with pseudowords than with consonant strings. This pattern was replicated by the model. The simulation results demonstrated that there was a null effect for consonant strings and that the imageability effect was larger for pseudohomophones than for pseudowords, albeit the difference between pseudohomophones and pseudowords was not as strong as that observed in Evans et al.'s data. Additionally, as shown in Figure 8, the magnitude of semantic effects increased as nonwords became more wordlike. In fact, the graded semantic effects are predicted directly by the time dynamics of the polarity difference analyses. Of particular note, panels (a) and (c) of Figure 4 show polarity differences between words and the three types of nonword foils in the orthographic and semantic layers. On average, the polarity differences in the semantic layer were larger than those in the orthographic layer; however, the reliable differences in the semantic layer emerged relatively late compared with those in the orthographic layer.
These results indicate the importance of semantic access at a later point in time for difficult decision conditions, whereas for easier conditions decisions could be made very quickly on the basis of orthographic information, alone or together with phonological information. Collectively, the findings are consistent with a distributed view of lexical decision, which proposes that both orthography and semantics are important for lexical decision (Dilkina, et al., 2010; Seidenberg & McClelland, 1989; Plaut, 1997). This contrasts with the localist view, which argues for little or no involvement of semantics in lexical decision (Coltheart, et al., 1977; Coltheart, et al., 2001).
In addition to the exploration of the graded semantic effects in lexical decision, linear mixed-effects model analyses of model performance demonstrated that the model was able to account for a range of standard reading effects in lexical decision, including frequency, consistency, word length, orthographic neighbourhood size, and imageability. The results also demonstrated that both word length and orthographic neighbourhood size influenced lexical decision performance, particularly for low frequency words (Andrews, 1989; Balota et al., 2004). These findings are in accordance with the key behavioural effects of psycholinguistic factors influencing lexical decision revealed in behavioural studies using large-scale regression analyses (Balota et al., 2004; Cortese & Khanna, 2007).

Impaired Lexical Decision
Different types of acquired dyslexia, including pure alexia (PA), phonological dyslexia (PD), and semantic dementia (SD), were simulated by damaging the corresponding functional component in the model, followed by a period of retraining. The results demonstrated that the damaged models were able to reproduce the impaired lexical decision performance of these patient groups.
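A common way to implement such damage in connectionist models is to remove a proportion of the connections into the targeted component (setting their weights to zero and freezing them) before retraining. The following sketch illustrates that lesioning step only; the lesion proportion and the use of a flat weight list are simplifying assumptions, not the paper's exact procedure.

```python
# Hedged sketch of the lesioning step used in simulations of
# acquired dyslexia: zero out a random proportion of connection
# weights and record which were removed (to keep them frozen).

import random

def lesion_weights(weights, proportion, rng):
    """Return (lesioned_weights, removed_mask): each connection is
    removed (set to 0.0) with probability `proportion`."""
    lesioned, removed = [], []
    for w in weights:
        if rng.random() < proportion:
            lesioned.append(0.0)   # connection destroyed by the lesion
            removed.append(True)
        else:
            lesioned.append(w)     # connection spared
            removed.append(False)
    return lesioned, removed

rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(1000)]
damaged, mask = lesion_weights(weights, proportion=0.3, rng=rng)
print(sum(mask) / len(mask))  # roughly 0.3 of connections removed
```

Retraining would then proceed as usual while keeping the removed connections at zero, which is how the damaged models can partially recover function.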
Pure alexic patients are characterised by a general visual-orthographic impairment.
Consequently, their performance on visually based tasks would be disrupted. Compared to the intact model, the PA simulations showed substantially impaired lexical decision performance. Performance was strongly affected by the foil conditions, in that the PA model could better discriminate words from consonant strings than from word-like nonwords, presumably due to the differential demands on orthographic processing. In addition, although explicit word recognition may be difficult, some PA patients have shown sensitivity to frequency, imageability, and nonword type in lexical decision (Coslett & Saffran, 1989; Friedman & Hadley, 1992; Roberts, et al., 2010). According to the partial activation account (Behrmann, et al., 1998), PA patients' performance may be supported by a partially activated lexical-semantic system. This is consistent with the PA simulations, which demonstrated a reliable frequency effect. The PA model also showed some degree of imageability effect, particularly when semantic information was critically demanded in the most word-like (i.e., pseudohomophone) condition. Note that the imageability effect in PA patients' reading is not universally observed (Howard, 1991; Patterson & Kay, 1982; Roberts et al., 2010). One factor affecting PA patients' recognition performance, which is not the focus here, is the severity of the reading deficit. For instance, Roberts et al. (2010) demonstrated that the sensitivity of patients with pure alexia to word imageability was regulated by reading severity.
In either very mild or very severe cases, PA patients would not show the imageability effects. This is because for mild PA patients orthographic information could still support effective lexical decision, while for severe PA patients orthographic activation may not be able to spread to the semantic layer. Collectively, evidence from the current simulation results and from case-series studies seems to suggest that imageability effects in PA patients are moderated by both foil condition and severity. Future studies could vary the levels of visual-orthographic damage to investigate PA patients' lexical decision in the contexts of different foil types.
As for the PD model, lexical decision performance was relatively unaffected compared to the other damaged models. The simulation results are congruent with previous studies of phonological dyslexia (Cuetos, et al., 1996; Denes, et al., 1987; Dujardin, et al., 2011), which report that patients have little difficulty in making lexical decisions, suggesting that phonology has a minimal role in lexical decision. One divergent result is that in the PD model inconsistent words were processed more efficiently than consistent words. Within the connectionist framework of reading, inconsistent words are mainly processed via the semantic pathway from orthography to semantics, while consistent words are mainly processed via the phonological pathway from orthography to phonology (Plaut et al., 1996).
When the phonological system is extensively damaged, inconsistent words may be less affected and show a processing advantage. This explanation is supported by our additional analysis, which demonstrated that a facilitatory consistency effect can be obtained with the removal of the semantic contribution. However, as PD patients are not generally examined for the consistency effect in lexical decision, whether PD patients with more severe and extensive lesions would show a similar pattern to the simulation result warrants further investigation.
The simulations of semantic dementia in lexical decision showed strong frequency and consistency effects. In particular, the model had great difficulty in processing low-frequency inconsistent words. This result is compatible with the behavioural data observed in SD patients (Jefferies et al., 2004; Patterson et al., 2006), indicating the importance of semantic processing in inconsistent word reading. Moreover, SD patients perform relatively well at rejecting consonant strings compared to word-like nonwords (Diesfeldt, 1992). A similar pattern was also observed in the SD model, in which decisions on words in the context of consonant strings were more efficient than those on words in the contexts of pseudowords and pseudohomophones, as a consequence of preserved orthographic and phonological information but unreliable semantic information. Regarding the imageability effect in the SD model's lexical decision, the model produced a null effect with numerically lower processing efficiency for high imageability than for low imageability words, which is consistent with Reilly et al.'s (2006) findings for moderate to severe SD patients in the auditory lexical decision task. Previous research has shown an enhanced imageability effect for SD patients in semantic tasks (Jefferies et al., 2009; Hoffman & Lambon Ralph, 2011; Hoffman, Jones, & Lambon Ralph, 2013) but a null or reversed effect in lexical decision and word naming tasks (Breedin et al., 1994; Reilly et al., 2006; Woollams, 2015; but see Pulvermüller et al., 2010). The divergence seems to reflect a task difference concerning whether or not explicit semantic knowledge is required. In lexical decision, decisions can be made on the integrated information of orthography, phonology and semantics. When the information from the semantic system is damaged and becomes unreliable, greater reliance on orthographic and phonological information might be expected.
A reversed imageability effect might be interpreted as reflecting greater use of the mappings from orthography to phonology for low imageability words (Woollams, 2015). The severity of semantic dementia could also be a determining factor (Reilly et al., 2006). However, reaching a consensus would require more systematic investigation of large-scale case series of SD patients.

Comparison with Other Models of Lexical Decision
There are several existing lexical processing models, such as the multiple read-out model (MROM) (Grainger & Jacobs, 1996), the dual-route cascaded (DRC) model (Coltheart, et al., 2001) and the connectionist dual process (CDP+) model (Perry, et al., 2007), that all share the same lexical processing and decision mechanisms. These models can simulate several effects in lexical decision, including the frequency effect, the neighbourhood size effect and strategic influences on lexical decision through flexible adjustment of decision criteria. Their results are almost all based on orthographic processing, with little attention to other processing components, in particular the semantic system. Thus, how these models could implement a localist view of semantics in lexical decision - presumably either by implementing feedback connections from semantics to their orthographic lexicon (Coltheart, et al., 2001), or by adjusting the decision criteria corresponding to semantic processing to account for semantic influences on lexical decision - remains unclear and has not been implemented. Moreover, the MROM model and its variants adopt the interactive activation (IA) model as their visual processing component. Hence, they are inevitably limited to dealing with a particular small set of words, as they lack a learning mechanism. In contrast, the visual input in the present model is much more flexible and could support learning of large-scale word sets.
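The MROM decision rule described earlier (a word response when any word unit exceeds the local criterion M, or summed lexical activity exceeds the global criterion Σ, before the temporal deadline T) can be sketched as follows. The activation trajectories and criterion values below are invented for illustration; only the read-out logic follows Grainger and Jacobs (1996).

```python
# Hedged sketch of the multiple read-out model's decision rule
# (Grainger & Jacobs, 1996): "word" if any word unit reaches the
# local criterion M, or summed activity reaches the global criterion
# sigma, before the deadline T; otherwise "nonword" at the deadline.

def mrom_decision(activation_history, m_criterion, sigma_criterion,
                  deadline):
    """activation_history: per-tick lists of word-unit activations."""
    for tick, activations in enumerate(activation_history[:deadline], 1):
        if max(activations) >= m_criterion:
            return "word", tick   # local read-out: a unique unit wins
        if sum(activations) >= sigma_criterion:
            return "word", tick   # global read-out: summed activity
    return "nonword", deadline    # deadline reached: reject

# A word steadily drives its own unit past M ...
word_trace = [[0.2, 0.1], [0.5, 0.2], [0.9, 0.2]]
# ... a nonword only weakly activates several neighbours.
nonword_trace = [[0.1, 0.1], [0.2, 0.2], [0.2, 0.3]]
print(mrom_decision(word_trace, m_criterion=0.8, sigma_criterion=1.5, deadline=3))
print(mrom_decision(nonword_trace, m_criterion=0.8, sigma_criterion=1.5, deadline=3))
```

The contrast with the present model is visible in the code: the read-out consults only word-unit (orthographic) activity, with no semantic term anywhere in the decision.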
The present model is compatible with the idea of using accumulated information for lexical decision (Busemeyer & Townsend, 1993; Norris, 2006; Ratcliff et al., 2004; Usher & McClelland, 2001). In fact, the polarity of the units is a measure of the quality of information produced by the model in response to the stimulus. According to Ratcliff et al. (2004), the decision is made when the accumulation of information over time reaches the appropriate boundaries, with the rate of accumulation varying as a function of the quality of the stimulus information. A more comprehensive training regime might involve enlarging the vocabulary and using natural rather than log frequency to determine the probability of presenting each word. While these changes would allow the model to be exposed to a broader range of frequencies, they would also considerably increase the training time. The present version of the model represents a sensible compromise between these two considerations.
Finally, the measure of polarity used in Plaut's (1997) study and in the present model, in the sense defined in equation (3), can be thought of as representing the task-related activity of units in the model. This is potentially interesting because it might also be used as a proxy for the Blood Oxygen Level Dependent (BOLD) signal in neuroimaging studies. Higher polarities incur a higher processing cost, which might map onto the data from fMRI studies. Most existing reading models have been developed to simulate behavioural data (Coltheart, et al., 2001; Harm & Seidenberg, 2004; Plaut, et al., 1996; Seidenberg & McClelland, 1989) and some have been used to account for electrophysiological data (Cheyette & Plaut, 2017; Laszlo & Plaut, 2012; Rabovsky & McRae, 2014); however, relatively little research has been done to account for data from neuroimaging studies of printed word processing. The analyses of polarity differences at each layer in the model could potentially reveal the relative contribution of these layers to lexical decision tasks. These differences might be expected to relate to the nature of differential processing in the brain regions that support reading. For example, the model was more reliant on semantic information when the nonword foils were pseudohomophones rather than consonant strings.
These results resemble the differential brain activation seen in the left anterior temporal lobe, which has been associated with semantic processing, with lexical decision tasks, and with reading words with atypical pronunciations. Exploring this would require further investigations beyond the scope of the current study.

Limitations and Future Directions
The present model has demonstrated a range of important phenomena in both normal and impaired lexical decision. It should be acknowledged, however, that this study has some limitations. First, the model's decision criteria, based on the average word and nonword polarity scores, were static rather than dynamic. One potential issue in relation to human readers' lexical decisions is that the average word polarity score could be considered an expectation for words that participants have already built, at least to some extent, through their experience with words prior to the experiment; however, participants could not have built such an expectation for nonwords (i.e., the average nonword polarity score) before actually encountering them. It might therefore be anticipated that the decision criteria used by participants in the earlier trials of an experiment differ slightly from those used in later trials, where the criteria may gradually stabilise. In practice, behavioural experiments often include practice trials that might alleviate this issue and help build up stable criteria rapidly. In addition, a growing number of studies have demonstrated effects of cross-trial sequence on human lexical decision performance (e.g., Balota, Aschenbrenner, & Yap, 2016), where stimulus degradation and lexicality on the previous trial affect responses to the current stimulus, providing evidence for trial-by-trial adjustments to decision making. However, the underlying mechanism remains to be understood. Within the current modelling framework, it would be possible to investigate this issue by dynamically adjusting the current stimulus polarity scores with reference to those of the previous trial, or by implementing a more flexible decision mechanism such as the leaky competing accumulator (Usher & McClelland, 2001).
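As a concrete illustration of the leaky competing accumulator mechanism mentioned above, the sketch below implements a two-accumulator (word/nonword) race with leak and mutual inhibition, following the general form of Usher and McClelland (2001). All parameter values and evidence inputs are illustrative assumptions, and noise is omitted for clarity.

```python
# Hedged sketch of a leaky competing accumulator (LCA) decision
# mechanism: two accumulators integrate evidence, leak over time,
# and inhibit each other until one crosses a response threshold.

def lca_decision(evidence_word, evidence_nonword, leak=0.1,
                 inhibition=0.2, threshold=1.0, dt=0.1, max_steps=1000):
    x_w = x_n = 0.0
    for step in range(1, max_steps + 1):
        new_w = x_w + dt * (evidence_word - leak * x_w - inhibition * x_n)
        new_n = x_n + dt * (evidence_nonword - leak * x_n - inhibition * x_w)
        # Activations are bounded below at zero, as in the LCA.
        x_w, x_n = max(0.0, new_w), max(0.0, new_n)
        if x_w >= threshold:
            return "word", step
        if x_n >= threshold:
            return "nonword", step
    return "no decision", max_steps

print(lca_decision(0.5, 0.1))  # strong word evidence -> "word"
print(lca_decision(0.1, 0.5))  # strong nonword evidence -> "nonword"
```

Within the present framework, the evidence inputs could be derived from the polarity scores themselves, which would make the decision criteria dynamic rather than static.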
Another consideration is that the simulations of the different types of acquired dyslexia only considered the most typical patients' lexical decision patterns. Clearly, there are wide variations in reading patterns within each type of dyslexia (Behrmann, et al., 1998; Crisp & Lambon Ralph, 2006; Roberts et al., 2010; Woollams, et al., 2007). These variations could, for instance, result from the severity of the reading deficit and from premorbid individual differences in reading (e.g., Dilkina, McClelland, & Plaut, 2008). Numerous studies have shown that individuals differ in reading experience and vocabulary knowledge, with consequent variation in the reading effects shown by skilled readers (Adelman et al., 2012; Andrews & Hersch, 2010; Davies et al., 2017). Individual differences in the degree of semantic reliance during exception word reading could account for the different reading patterns observed in patients with semantic dementia (Woollams et al., 2016). Considering both the severity of reading disorders and premorbid individual differences in simulations of acquired dyslexia would be an interesting topic for further investigation.

Conclusion
A large-scale computational model of the human visual word recognition system was developed to explore the underlying processing mechanisms in lexical decision. We demonstrated that the model could perform lexical decision tasks based on the measure of polarity combined across processing layers within the reading system. Importantly, the model was able to account for the graded semantic influences on lexical decision corresponding to various types of foils, providing evidence for semantic access in lexical decision - both in typical and neurologically-impaired reading.

Appendix
There are two main advantages to embedding pre-trained weights in a recurrent network. First, it can potentially shorten the training time because a large network can generally be disassembled into several small feedforward networks. These small networks can be trained separately and the weights obtained from each part can be embedded into the original large network for global training. Second, the pre-trained weights can be used in an ecologically valid manner to simulate pre-existing knowledge within the network before it starts to learn a new skill - as in Harm and Seidenberg's (2004) simulations. The control technique can be implemented by using a control unit along with a strong negative weight connected to every unit it needs to control. The control unit will be free to learn to regulate the activity of the units it controls in whatever fashion reduces errors, but we anticipate that for control units connected to other units with pre-trained weights it will learn to exert a strong inhibition early in training, which will gradually decrease as the non pre-trained weights start to participate in the task. The results in Simulation 1 (Figure 3) clearly demonstrate the performance benefits of using control units, in particular for semantics. To examine how they achieved this, we computed the inhibitory effects of the control units on the H1, H2, H3 and H4 layers (see Figure 1).
Figure A2 shows the average inhibition effects on these layers as a function of time ticks. For the units that control the deeper layers with pre-trained weights (H3 and H4), the pattern is as expected, with a high initial level of inhibition that decreases towards zero over time. This contrasts with H2, which starts with very low levels of inhibition and maintains this throughout processing, presumably reflecting the importance of allowing information to progress from the visual layers to the phonological layers as quickly as possible. One might have expected the same pattern for H1, which controls the information flow along the semantic route, but here we see high initial levels of inhibition that decline only slightly throughout the whole course of processing. This suggests that activation of the semantic system is achieved mainly through the indirect route (via phonology). Overall, these analyses