Using large language models to study human memory for meaningful narratives

One of the most impressive achievements of the AI revolution is the development of large language models that can generate meaningful text and respond to instructions in plain English with no additional training necessary. Here we show that language models can be used as a scientific instrument for studying human memory for meaningful material. We developed a pipeline for designing large-scale memory experiments and analyzing the resulting data. We performed online memory experiments with a large number of participants and collected recognition and recall data for narratives of different lengths. We found that both recall and recognition performance scale linearly with narrative length. Furthermore, to investigate the role of narrative comprehension in memory, we repeated these experiments using scrambled versions of the presented stories. We found that even though recall performance declined significantly, recognition remained largely unaffected. Interestingly, recalls in this condition seem to follow the original narrative order rather than the scrambled presentation order, pointing to a contextual reconstruction of the story in memory.


Introduction
In the classical paradigm for studying human memory, participants are presented with randomly assembled lists of words and then perform memory tasks such as recognition and recall (see review in [Kahana, 2020]). A wealth of results has been obtained in these studies. For instance, it has been found that words at the end and the beginning of the list have a higher chance of being recalled (recency and primacy effects, respectively), and there is a tendency to recall words close to each other in the list (contiguity, [Kahana, 1996]). Moreover, it was found that as the presented lists grow in length, even though the average number of recalled words (R) increases, a progressively smaller fraction of the words is recalled [Murdock Jr, 1962]. Several authors have addressed the mathematical form of the dependence of R on list length and found that this dependence is best described by power-law relations, R ∼ L^α, with exponents α generally below one [Murray et al., 1976]. It is well known that recall also depends on multiple experimental factors, such as the presentation rate of the words, the age of the participants, etc. However, in recent work, some of the authors discovered that if recall performance is analyzed as a function of the number of remembered words (M), rather than the number of presented words, the relation becomes universal and is described by the analytical form R = √(3πM/2) [Naim et al., 2020]. Moreover, this relation follows from a simple deterministic model in which words are retrieved one by one according to a random symmetric matrix of 'similarities' reflecting their long-term encoding in memory, until the process enters a cycle and no more words can be recalled. The number of remembered words M can itself be predicted by a retrograde interference model that assumes that each new word erases some of the previously presented words according to the 'valence' or 'importance' of each word [Georgiou et al., 2021, 2023].
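As an illustration, the universal recall relation is straightforward to evaluate numerically. The helper below is a sketch (not the authors' code) of the prediction R = √(3πM/2) from Naim et al. [2020]:

```python
import math

def predicted_recall(m_remembered: float) -> float:
    """Predicted number of recalled words R for M remembered words,
    following the universal relation R = sqrt(3*pi*M/2) [Naim et al., 2020]."""
    return math.sqrt(3 * math.pi * m_remembered / 2)

# Remembering 32 words predicts roughly 12 recalled words:
print(round(predicted_recall(32), 1))  # → 12.3
```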
While it is remarkable that human memory for random material can be described by universal mathematical relations, it is of course much more important and exciting to try to understand how people remember more natural, meaningful information. After the pioneering work of Bartlett [1932], many studies considered recalls of narratives. As opposed to random lists, narratives convey meaning, and hence have structure on multiple levels which influences recall, as confirmed in many previous publications (see Section 3). The first challenge in understanding narrative recall is the fact that people tend not to recall the narrative verbatim. Rather, they remember what the narrative is about and retell it in their own words [Gomulicki, 1956; Fillenbaum, 1966; Sachs, 1967]. Counting correctly recalled words is therefore not a good recall score; a better score, used in many studies and also adopted in our work, is a count of recalled 'ideas', or 'clauses' (see e.g. [Bransford and Johnson, 1972]). Using this method, however, requires human-level understanding of narratives and recalls, making large amounts of data difficult to collect and extremely time-consuming to analyze. In our study, we develop a way to overcome this and other difficulties by using large language models (LLMs) to assist in the analysis and design of experiments, as described later. In particular, we use LLMs to generate new narratives of a particular type and length, and to score human recalls obtained in multiple experiments performed over the internet. In addition to recall, we also performed recognition experiments (in which people are asked to indicate whether a specific clause was in the presented story or not) in order to estimate how many clauses people remember after reading the narrative. To this end, we use LLMs to generate plausible lures, i.e. novel clauses that could have potentially appeared in the narrative.
Inspired by our previous results, we wanted to understand how recognition and recall performance scale with narrative length and what the relation between them is. To this end, we performed a large number of experiments over the internet using the Prolific platform (www.prolific.com). We also compared the recall and recognition performance of original narratives with their scrambled versions in order to elucidate the effects of comprehension on different aspects of memory. Since there are different types of narratives that could potentially be more or less difficult for people to remember and recall, we decided to focus on one particular type of narrative first studied in the famous paper by Labov and Waletzky [1966] that established the field of narratology, namely the oral retelling of personal experience, told by real people, and variants thereof generated by LLMs (see later). While collected in a research setting, these spoken recollections of dramatic personal episodes are close to the natural way people share their experiences in real life and are therefore of special interest for studying human memory.

LLM-assisted recall and recognition experiments
For the purpose of this study, we have chosen several narratives of different lengths from Labov [2013] and Labov and Waletzky [1966]. As part of the analysis in these publications, narratives were segmented into an ordered set of clauses, which are "the smallest unit of linguistic expression which defines the functions of narrative" [Labov and Waletzky, 1966]. In other words, they are the smallest meaningful pieces which still serve some function in communicating a narrative.
Since these are spontaneous narratives spoken in local dialect, they are characterized by a number of features which are awkward to transcribe (pauses, repetition, gestures) as well as non-standard (and sometimes outdated) English vernacular. These factors complicate comprehension when participants have to read narratives on a computer screen. We therefore instructed LLMs to generate new narratives modeled on the original ones, i.e. exhibiting a similar type of event sequence and the same overall length in terms of the total number of clauses. In particular, the LLM-generated narratives inherited the segmentation from the original story, i.e. the number of clauses was the same and the information contained in the corresponding clauses had a similar role in their respective narrative (see Methods Section 5 for details of narrative generation, and some examples in Appendix A.1). Eight narratives were selected for subsequent memory experiments, ranging from 18 to 130 clauses in length. We presented each narrative to a large number of participants (∼ 200), who then performed either recall or recognition tasks. In the subsequent analysis, we treated clauses as the basic units that together communicate the meaningful information contained in the narrative. In particular, we quantified each individual recall by identifying which of the clauses in the narrative were recalled, judging a clause recalled if the information it contains is present in the recall. We simplify the analysis by considering each clause as being either recalled or not. This scoring of recalls is traditionally performed by human evaluators and is very time-consuming. We therefore prompted an LLM to determine which of the clauses of the original narrative were recalled and in which order. Here we utilized the remarkable ability of modern LLMs to respond to instructions, provided as prompts written in standard English (as opposed to a programming language), to perform novel tasks without any additional training (known as zero-shot prompting or 'in-context' learning [Brown et al., 2020]; see Appendix A for more details).
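In practice, the LLM's scoring reply must be machine-parsed. The sketch below assumes a hypothetical reply format in which the model lists recalled clause numbers in recall order (the actual prompts and output conventions are given in Appendix A); it extracts the ordered indices, dropping repeats and out-of-range values:

```python
import re

def parse_recalled_clauses(reply: str, n_clauses: int) -> list[int]:
    """Extract an ordered list of recalled clause numbers (1-based) from an
    LLM scoring reply such as 'Recalled clauses: 3, 1, 7'. The reply format
    is hypothetical; indices outside 1..n_clauses and repeats are discarded."""
    indices, seen = [], set()
    for token in re.findall(r"\d+", reply):
        i = int(token)
        if 1 <= i <= n_clauses and i not in seen:
            seen.add(i)
            indices.append(i)
    return indices

print(parse_recalled_clauses("Recalled clauses: 3, 1, 7, 7, 99", n_clauses=18))
# → [3, 1, 7]
```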
To test the ability of the LLM to adequately score human recall (with appropriate prompting as described in the Methods Section 5 and Appendix A), we performed an additional set of recall experiments with a specially LLM-generated narrative and compared the LLM-performed recall scoring to that conducted manually by the authors (see Methods Section 5 for details). To this end, we calculated the fraction of participants who recalled each particular clause, i.e. the clause's recall probability (P_rec), as judged by the LLM and by the authors. For nearly all of our analysis, the LLM we used was OpenAI's GPT-4 (see Appendix B.2 for a comparison between different LLMs). As shown in Figure 1, GPT-4 scoring of recalls results in recall probabilities close to those obtained by human evaluations for a great majority of the clauses. Moreover, the variability of scoring is comparable between GPT-4 and human evaluators (compare Figures 1B and 1C). Interestingly, the LLM correlates more strongly with the mean human scoring (r = 0.94) than with any individual scoring (r = 0.92, 0.90, 0.90) (see Table 1 in Appendix B; c.f. Michelmann et al. [2023]).
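The comparison between LLM and human scoring reduces to simple array operations over binary scoring matrices. A minimal sketch, with toy data in place of our experimental matrices:

```python
import numpy as np

def recall_probabilities(scores: np.ndarray) -> np.ndarray:
    """Per-clause recall probability P_rec from a binary participants-by-clauses
    scoring matrix (1 = clause judged recalled)."""
    return scores.mean(axis=0)

# Toy example: 2 participants x 3 clauses, scored once by the LLM and once
# by a human evaluator (illustrative values only).
llm = np.array([[1, 0, 1],
                [1, 1, 0]])
human = np.array([[1, 0, 1],
                  [1, 0, 0]])
r = np.corrcoef(recall_probabilities(llm), recall_probabilities(human))[0, 1]
print(round(r, 2))  # → 0.87
```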
One prominent feature seen in Figure 1A is the wide range of recall probabilities across clauses. In other words, while some clauses are recalled by most of the participants, others are not. Such a wide distribution of P_rec's across the clauses was observed in recalls of all the narratives and contrasts sharply with the corresponding results for random lists of words, where P_rec's are rather uniform except at the beginning and the end of the list (see Appendix E). This wide distribution of P_rec's is apparently due to the fact that not all clauses have similar importance for communicating the narrative. Indeed, if we select the clauses with high enough P_rec, we usually get a good summary of the narrative (see Appendix F for examples).
We also performed recognition experiments in order to estimate the average number of clauses that participants remember after presentation. As we explain below, this analysis requires a large number of plausible lures, i.e. novel clauses that could have plausibly been in the narrative and hence cannot be easily distinguished from the true clauses using context and style. Generating these lures is highly nontrivial as it requires an understanding of the narrative. This makes manually generating lures very challenging and time-consuming, which is why we utilized LLMs for this purpose (see Appendix A.2 for prompts and example output). Using the LLM, for each story we obtained the same number of lures as true clauses, and sampled 10 clauses uniformly at random from this entire pool of 2L clauses for the testing phase. Participants then saw one clause at a time and were asked whether the clause was in the presented narrative or not. We checked that presenting several clauses for recognition does not result in a systematic drift in performance, i.e. no output interference was detected in our experiments (Criss et al. [2011]; see Appendix D). We then estimated the number of clauses retained in memory after presentation of the narrative (M) from the fraction of 'hits', i.e. correct recognitions of the true clauses (P_h), and the fraction of 'false alarms', i.e. reports of lures as true clauses (P_f). In particular, we assume that if the participant remembers a given clause, they always recognize it as being part of the narrative; otherwise, they still give a positive answer with probability P_f. The total probability of a correct recognition is then given by P_h = M/L + (1 − M/L) P_f. This equation emphasizes the importance of using lures in recognition experiments, since without lures we would have an uncontrolled tendency of participants to indicate any presented clause as a true one irrespective of whether or not they remember it.
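Under the stated assumption (remembered clauses are always recognized, while forgotten ones are endorsed with probability P_f), the hit rate satisfies P_h = M/L + (1 − M/L)P_f, which inverts to M = L(P_h − P_f)/(1 − P_f). A minimal sketch of this estimate:

```python
def estimate_remembered(p_hit: float, p_fa: float, n_clauses: int) -> float:
    """Estimate the number of remembered clauses M from the hit rate P_h,
    the false-alarm rate P_f, and the narrative length L in clauses,
    by inverting P_h = M/L + (1 - M/L) * P_f."""
    return n_clauses * (p_hit - p_fa) / (1 - p_fa)

# E.g., 80% hits and 20% false alarms on a 50-clause narrative:
print(round(estimate_remembered(0.8, 0.2, 50), 1))  # → 37.5
```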

Scaling of Recall and Memory
Having each narrative seen by roughly 200 participants, half performing recall and the other half recognition, we were able to determine the average number of remembered clauses (M) and recalled clauses (R). As expected, both M and R grow with the length of the presented narrative, measured by the number of clauses in the narrative and denoted by L (see Fig. 2A,B). Moreover, both M and R appear to grow linearly with L over the range of narrative lengths we explored, and hence when we plot R vs. M, we also obtain an approximately linear relationship (see Fig. 2C). This scaling behaviour is very different from what we observed with random lists of words, which show a characteristic square-root scaling; i.e., unsurprisingly, recall of meaningful material is better than recall of random material of the same size, even after discounting for better memorization. One of the factors that apparently leads to better recall of narratives is the temporal ordering of recall. When people recall narratives, recall mostly proceeds in the forward direction (see Fig. 3A), probably reflecting the natural order of events in the narrative that cannot be inverted without affecting its coherence. This contrasts with the case of random lists, where recall proceeds in both directions with similar probability (see Fig. 3B), which, according to a model proposed in [Naim et al., 2020], results in the process entering a cycle that prevents many words from being recalled.

[Figure 2 caption, panel C: Average number of recalled clauses vs. number of remembered clauses from the same story. As expected from panels A and B, the number of retrieved clauses in scrambled narratives is substantially smaller than in intact narratives for the same number of remembered clauses. For comparison, we present the theoretical performance for random lists of words, which describes the data well [Naim et al., 2020]. More clauses are recalled from intact narratives than words from random lists. Surprisingly, retrieval of scrambled stories is significantly worse than for random lists, suggesting an active suppression of items in service of generating a coherent recall (participants were implicitly instructed to recall a story).]
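The linear scaling reported in Figure 2 can be checked with an ordinary least-squares fit. The sketch below uses made-up (L, R) pairs, not our measurements, purely to illustrate the analysis:

```python
import numpy as np

# Hypothetical narrative lengths (in clauses) and mean numbers of recalled
# clauses; illustrative values only, not the experimental data.
lengths = np.array([18, 19, 32, 54, 130])
mean_recall = np.array([9.0, 9.6, 16.1, 27.2, 64.8])

# Degree-1 polynomial fit gives the slope and intercept of R vs. L.
slope, intercept = np.polyfit(lengths, mean_recall, deg=1)
print(f"R ≈ {slope:.2f} * L + {intercept:.2f}")
```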

Meaning and memory
As we mentioned above, people's recall is strongly influenced by narrative comprehension, such that the clauses most important for communicating the summary of the narrative are the ones recalled by most of the participants. We found, however, that recognition is not so strongly affected by meaning. This can be observed by dividing all the clauses used in our experiments evenly into consecutive bins according to their recall probabilities and calculating the average recognition performance for all the clauses in each bin. Surprisingly, there is very little increase of recognition with recall probability across the clauses, such that the clauses with the highest and lowest average P_rec differ in their P_h by less than 0.15 (see Fig. 4; c.f. [Thorndyke and Yekovich, 1980; Yekovich and Thorndyke, 1981]).
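The binning analysis behind this comparison amounts to grouping clauses by P_rec and averaging P_h within each bin. A sketch with toy arrays, using equal-width bins for simplicity:

```python
import numpy as np

def binned_recognition(p_rec, p_h, n_bins=15):
    """Average recognition rate P_h for clauses grouped into n_bins equal-width
    bins of recall probability P_rec. Empty bins yield NaN."""
    p_rec, p_h = np.asarray(p_rec), np.asarray(p_h)
    # Map each P_rec value to a bin index in 0..n_bins-1 (P_rec = 1.0 goes
    # into the last bin).
    idx = np.minimum((p_rec * n_bins).astype(int), n_bins - 1)
    return np.array([p_h[idx == b].mean() if np.any(idx == b) else np.nan
                     for b in range(n_bins)])

# Toy data: two low-P_rec clauses and two high-P_rec clauses.
means = binned_recognition([0.05, 0.10, 0.90, 0.95], [0.7, 0.8, 0.85, 0.95], n_bins=2)
print(means)
```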
To further elucidate the role of meaning in memory, we repeated our experiments with another group of participants after randomly scrambling the order of clauses, thus making comprehension much more difficult if not impossible. We found that, unsurprisingly, recall of scrambled narratives is much poorer than that of the original ones (Fig. 2B,C). Recognition performance for scrambled narratives, however, is practically the same (Fig. 2A). This result indicates that memory encoding of clauses is not significantly affected by the structure and meaning of the narrative. Interestingly, the order in which people tend to recall clauses from a scrambled narrative corresponds much better to the order of these clauses in the original narrative than to the presented, scrambled one (see Fig. 3C,D), indicating that even in this situation people try to comprehend the meaning of the narrative rather than processing the input as a random list of unconnected clauses. This might explain why recall of scrambled narratives appears to be worse than recall of random word lists of the same size.
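The correspondence between recall order and the original versus scrambled clause order can be quantified by, for instance, the fraction of adjacent recall transitions that move forward with respect to a given reference order. The following is a simple stand-in for the analysis behind Fig. 3C,D, with hypothetical sequences:

```python
def forward_fraction(recall_seq, reference_order):
    """Fraction of adjacent transitions in recall_seq that move 'forward'
    with respect to reference_order (a list of clause ids in reference order)."""
    pos = {clause: i for i, clause in enumerate(reference_order)}
    steps = [pos[b] - pos[a] for a, b in zip(recall_seq, recall_seq[1:])]
    return sum(s > 0 for s in steps) / len(steps)

original = [0, 1, 2, 3, 4]      # original clause order
scrambled = [3, 0, 4, 1, 2]     # presentation order in the scrambled condition
recall = [0, 1, 3, 4]           # a hypothetical recall

print(forward_fraction(recall, original))                # → 1.0
print(round(forward_fraction(recall, scrambled), 2))     # → 0.67
```

A recall that tracks the original order scores near 1.0 against it while scoring lower against the scrambled presentation order, mirroring the reconstruction effect described above.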

Previous Work
The experimental study of memory for narratives can be traced back to the highly influential descriptive work of Bartlett [1932]. This and follow-up work introduced the idea that the encoding of memory for narratives is a process of abstraction [Gomulicki, 1956], and that subsequent recall is in large part a generative process driven by a participant's prior knowledge and biases. This line of thought was formalized much later in theories of narrative structure involving schemas, scripts, frames, and story grammars [Alba and Hasher, 1983; Rumelhart, 1975]. By now, there is ample support for an abstracting process in memory encoding, and for the existence of schematic structures which guide recall [Baldassano et al., 2018].

[Figure 3 caption: (A) Recall of coherent stories largely preserves presentation order. (B) Recall of random word lists does not preserve presentation order. (C) As with random lists, the recall of a scrambled story does not preserve presentation order; rather, it appears to reconstruct the original order of the story: there is a tendency for recalled clauses to follow the order of the original unscrambled narrative. The participants evidently construct a mental representation of the scrambled narrative that is close to its original form, so recall reflects the original sequence of clauses rather than the input sequence.]
A parallel line of research into narrative structure originated in the field of sociolinguistics with Labov and Waletzky [1966]. In carrying out linguistic fieldwork to analyze spoken dialects of English, the authors found that personal narratives of emotionally charged events tended to elicit the best examples of local dialect. Furthermore, the resulting narratives, which were produced spontaneously and by non-expert storytellers, tended to be very regular in their structural properties. Of particular significance to us was the observation that stories were typically told in the order in which they were experienced by the speaker, which Labov encapsulates in his "egocentric principle" and "no flashback" constraint [Labov, 2013]. This lends some support to the strong iconicity assumption [Zwaan, 1996], which states that readers (or listeners) will interpret the order in which events are reported as reflecting the actual chronological order in which they occurred.
However, serial order alone likely cannot explain the rich structure we observe in narrative recall. Indeed, temporal order provides just a single axis along which events in a narrative are organized. In addition to simple serial order, events or clauses can have causal relations (e.g. A causes B) [Trabasso and van den Broek, 1985; Trabasso and Sperry, 1985; Lee and Chen, 2022], inferential relations (e.g. A implies B), and superordinate or constituency relations (e.g. A consists of B) [Black and Bower, 1979]. These relations conspire to give narratives hierarchical structure [Kintsch, 1998]. Story grammars provide one natural mechanism whereby this hierarchical structure arises, and they served as early inspiration for studying structure dependence in narrative recall and recognition.

[Figure 4 caption: Clauses from all the narratives used in this study were divided evenly into 15 bins according to their P_rec, and the average P_h for the clauses in each bin was computed and plotted against the center of the corresponding bin.]
In this direction, Yekovich and Thorndyke [1981] and Thorndyke and Yekovich [1980] performed recall and recognition experiments to test how encoding and processing depend on hierarchical structure. Their recognition experiments included only old, paraphrase, and false statements in the test phase. The false statements tested were appropriate (invoking the appropriate agents or actions), but inconsistent with or contradictory to the story. Similar to our results, they observed that recognition is uncorrelated with recall. Furthermore, whereas recall was argued to be sensitive to the structure of a narrative, recognition appeared to have no such sensitivity.
In another experiment seeking to isolate the structure sensitivity of narrative processing, Kintsch et al. [1977] tested the processing of scrambled narratives, in which paragraphs of a text were presented to participants in random order. Participants were able to produce coherent summaries of such scrambled text, which were indistinguishable from summaries produced by participants reading the coherent text. It is possible that scrambling larger units (paragraphs in this case, versus clauses in our experiments) produced an overall more comprehensible text, making descrambling easier. Nevertheless, we still observe descrambling, though of lower quality (see Figure 3C).
Scaling laws for memory have been observed for random unstructured lists (pictures, words, etc.) [Standing, 1973; Murray et al., 1976; Naim et al., 2020]. Two important takeaways from these works are the following: memory typically exhibits power-law scaling with list length, and the retrieval process appears to be universal. Surprisingly, there is very little work which considers the scaling of narrative memory with story length, let alone attempts to quantify it. The only work we are aware of is [Glenn, 1978], which measured average recall as a function of story lengths of up to 83 "informational units", which include clauses as well as noun and adjective phrases. The experimental design was motivated largely by questions about story grammars, and therefore the episodic structure of the narratives (in the story grammar sense developed by the authors) was kept constant while descriptive phrases and clauses (the so-called 'informational units') were added to increase the length of the story. Therefore, the added statements are arguably a kind of filler, not contributing any significant additional meaning or structure to the underlying narrative. This might account for the seemingly sublinear scaling of their mean recall with story length, compared to our linear results (Figure 2). There are other significant differences, including the test population (24 second-grade schoolchildren), the stimulus input format (spoken narratives), and the recall format (spoken recall).
The choice of "informational units" in [Glenn, 1978] also differs from our choice of clauses, and is similar to segmentations that use propositional analysis [Kintsch and van Dijk, 1978] or pausal breaks [Johnson, 1970]. An important feature of these different units is that they provide a more fine-grained segmentation of a narrative: a clause can consist of many propositions or pausal breaks, but not the other way around. However, these smaller units would not have a narrative function in the Labovian sense.

Discussion
In this contribution, we describe a new way to study human memory for meaningful narratives with the help of large language models. Together with the use of internet platforms for performing experiments, this technology enables a qualitative leap in the amount of data one can collect and analyze. In particular, we describe the prompts that we used to make GPT-4 generate new narratives of a particular type and size, score multiple human recalls of these narratives by identifying which clauses were recalled, and generate plausible lures for recognition experiments. Having a large amount of data is important for memory research because, as opposed to e.g. sensory processing, which is believed to be largely uniform across people [Read, 2015], the way people remember and recall meaningful material is highly variable. Hence, only through statistical analysis can some general patterns be uncovered. In particular, we considered how recognition and recall performance scales for narratives of increasing length. We found that approximately the same fraction of narrative clauses is recognized and recalled as narratives become longer, in the range of 20 to 130 clauses that we considered for this analysis. We expect that as narratives become longer still, this trend will not persist, because people will start summarizing larger and larger chunks of the narrative into single sentences in their recalls. In this case, deciding whether individual narrative clauses in a chunk were recalled or not is ambiguous. In fact, we observed that summarizing started to appear in the recalls of some of the participants for the longest narrative in our pool, which resulted in the average number of clauses in a recall being significantly smaller than the average number of narrative clauses deemed to be recalled (see Appendix C). In the extreme case, a short summary of the entire story can be scored as having recalled nothing, since no particular clause was recalled, when in fact there was a non-trivial recall, but at a higher level of abstraction or organization. This in itself does not mean that studying recall of clauses is incorrect or uninformative, but rather that it must have some regime of validity. It also illustrates the need to develop techniques which are sensitive to this hierarchical structure of recall, by automatically detecting higher-level units of meaning such as events and episodes.
We investigated the role of meaning in narrative memory by presenting participants with the same clauses as in the original narrative but in a scrambled order. These 'narratives' are much more difficult, if not impossible, to make sense of, and indeed their recall was very poor. However, recognition of individual clauses was practically as good as for the original narrative. This surprising observation indicates that the encoding of clauses in memory is not very sensitive to the overall structure of the narrative, and that only in the process of recall does the meaning of the narrative play a major role. This finding is consistent with the observations of [Thorndyke and Yekovich, 1980; Yekovich and Thorndyke, 1981]. It also provides a striking confirmation of the generally held wisdom that while recall is structure-sensitive, recognition need not be [Kintsch, 1998]. It is still possible, however, that these trends will change when longer narratives are considered. This will have to be investigated further.
Another interesting observation concerns the fact that as the narrative becomes longer, the range of recall probabilities for different clauses remains very wide, e.g. there are always some clauses that are recalled by most of the participants and others that are almost never recalled. In contrast, the probability of recalling words from a random list decreases with the length of the list, with the exception of the last few words (due to the recency effect) [Murdock Jr, 1960]. It would be an interesting theoretical direction for future studies to try to identify the factors that could predict how well a given clause will be recalled in a given narrative.
We focused on first-person spoken stories. These were personal accounts of important events, shared naturally and informally. As such, they lacked the refinement of crafted stories, which may employ tricks to improve memorability. It would be interesting to see how the scaling of memory is affected by such expertly told or literary stories. Evidently there is an impact, considering that many stories in the oral tradition endure over very long timescales [Rubin, 1995; Nunn, 2018]. A striking example of this in a more controlled lab setting had participants trained to construct a narrative in which they embedded a random list of words [Bower and Clark, 1969]. This work found that employing such a mnemonic improved recall to nearly perfect for up to 12 consecutively learned lists.
While practice and rehearsal are necessary for preserving narratives in oral tradition, our results suggest that narratives are intrinsically more memorable. We find that memory performance for narratives encountered only once, as measured by the scaling relations in Figure 2, is robustly superior to performance on unstructured lists.
The research conducted for this report relied crucially on a set of LLM input "prompts", i.e. instructions written in standard English, given to the LLM for carrying out various tasks. Roughly speaking, these read as if they were instructions given to a human research assistant. Quite remarkably, the LLM completes the input string to provide a correct output without any additional training, a phenomenon known as 'in-context' learning [Brown et al., 2020]. Since this phenomenon is still not fully understood, we had to resort to a good amount of trial-and-error in designing and fine-tuning the prompts used in our analysis. We provide all of the prompts used in our experiments in Appendix A. The specific model we utilize in most of the paper is OpenAI's GPT-4. However, while we believe the capabilities necessary to carry out our experiments are not limited to this model, it is an open question whether the prompts we use can be transferred to different models.
In summary, using LLMs in conjunction with internet platforms for performing experiments is a powerful new tool that could bring significant new advances in understanding human memory.

Methods
With the aim of conducting a large-scale study of memory for natural continuous material (personal narratives in this case), we required an automated procedure that would facilitate measuring human recall performance, since manual scoring of recalls is very labor-intensive and thus limits the ability to analyze large datasets. We achieved this through the use of large language models (LLMs), and we assessed the reliability of our pipeline by comparing it to human scoring performed by the authors. Our dataset was generated by conducting recall and recognition experiments online, recruiting participants through a crowd-sourcing platform. All segments of this study are detailed below.

Stimulus Set -Narrative pool
Nearly all of the stimuli we use are generated by LLMs and are based on first-person oral narratives taken from socio-linguistic interviews [Labov and Waletzky, 1966; Labov, 2013]. The oral narratives are segmented into clauses in these references, and these are used as templates for the LLM narrative generation. The LLM output is a narrative of equal length (in number of clauses), with very similar narrative-syntactic structure, but involving different subject matter. Two stories were generated from each template for lengths L = 18, 32, and 54. Two additional narratives were taken directly from Refs. [Labov and Waletzky, 1966; Labov, 2013], one with L = 19 ('boyscout', Story 10), and the other with L = 130, which was minimally edited to remove punctuation due to speech breaks in order to increase readability ('stein', Story 9). More details of the narrative generation by LLMs can be found in Appendix A.1, along with a sample narrative template in Argument 1 and examples of generated outputs in Completion 1. All narratives used as stimuli can be found in Appendix G.
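Narrative generation of this kind can be scripted against any chat-style LLM API. The sketch below only assembles a generation prompt from a clause-segmented template; the wording is illustrative, not our actual prompt (which is given in Appendix A.1), and the API call itself is omitted:

```python
def build_generation_prompt(template_clauses: list[str]) -> str:
    """Assemble an illustrative prompt asking an LLM to produce a new narrative
    with the same clause-by-clause structure as the template story."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(template_clauses))
    return (
        f"Below is a first-person story segmented into {len(template_clauses)} "
        "numbered clauses. Write a new first-person story about a different "
        "subject with exactly the same number of clauses, where each clause "
        "plays the same narrative role as the corresponding template clause.\n\n"
        + numbered
    )

# Hypothetical two-clause template for illustration:
prompt = build_generation_prompt(["I was at summer camp,",
                                  "and we got lost on a hike."])
print(prompt.splitlines()[0])
```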
For the purpose of evaluating the reliability of LLM recall scoring and its similarity to human scoring, we generated and segmented a different narrative based on the 'boyscout' story (Story 10, L = 19). This stage began before the rollout of gpt-4, and for this reason we document the evaluation process separately in Appendix I. The narrative generation step in this part produced variable-length narratives in prose, which we subsequently segmented using GPT-3. As a result, this narrative generation procedure did not preserve the number of clauses of the story it was based on ('boyscout').

Experimental Design
Participants were recruited online through the 'Prolific' platform (www.prolific.com), and experiments were conducted in the participant's browser. Only candidates who indicated English as their first language were allowed to participate. Participants were initially greeted with a short description of the experiment and an informed consent form. The trial was initiated by a button press. After a three-second countdown, a narrative was presented as rolling text (marquee) in black font in the middle of a white screen. All narratives were animated at constant speed, with a total duration calculated as the character span of the story divided by 250 (resulting in a character moving at a speed of approximately 25 characters/s); the font size was set to 40px. Once the marquee for the narrative had traversed the screen (all characters shown and disappeared), the testing phase was triggered automatically. This marquee-style presentation was chosen because it allowed for comfortable reading while fixing the presentation duration for all participants and simultaneously preventing revisiting of already read material.
In the free recall experiments, the testing phase consisted of a textbox and a prompt to recall the story as closely as possible to the original. Once participants finished typing their recall of the narrative, they submitted their response with a button press and the experiment was concluded.
For the testing phase of the recognition experiment, 10 queries were presented sequentially. In each query, the participant was shown a single clause at random, either from the just-presented narrative (old) or a lure (new). They were tasked to indicate whether they remembered seeing the presented clause, pressing "Yes" if they thought the clause appeared in the narrative and "No" otherwise. We did not observe any obvious signatures of output interference [Criss et al., 2011] (see Appendix D) and therefore used all queries in the subsequent analysis. Lures were generated by asking the LLM to take a given narrative segmentation and insert novel clauses between the existing clauses. This ensures that the lures are distinct from the true clauses but still fit within the overall context of the story; for instance, it avoids lures that might mention "dolphins" when the story is about boy scouts. The prompt used to generate lures and an example completion by gpt-4 are given in Appendix A.2.
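As an illustration of the numbering scheme, the generated lures (keyed 1.5, 2.5, and so on) can be merged with the original clauses for display, as in the combined presentation of Completion 2. A minimal sketch (the function name and data layout are our own, not from the experiment code):

```python
def interleave_lures(clauses, lures):
    """Merge original clauses (positions 1, 2, ...) with lures keyed by
    half-integer positions (1.5, 2.5, ...) into a single display order."""
    items = {float(i + 1): c for i, c in enumerate(clauses)}
    items.update(lures)  # lures slot between the existing clauses
    return [items[k] for k in sorted(items)]

# hypothetical two-clause story with one generated lure
original = ["I was in the boy scouts", "we was doing the 50-yard dash"]
lures = {1.5: "the sun was blazing that day"}
merged = interleave_lures(original, lures)
# → original clause, lure, original clause
```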

Analysis
Analysis was conducted using custom Python scripts. For recognition memory, in order to estimate the total encoded memory M from Equation (1), we used the population- and test-trial-averaged hit rate (true positive probability P_h) and false alarm rate (false positive probability P_f). Standard errors were computed using statistical bootstrap [Efron and Tibshirani, 1994].
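The bootstrap procedure for the standard error can be sketched as follows; `bootstrap_se` and the sample data are hypothetical names and values, and the estimate of M itself (Equation (1)) is not reproduced here:

```python
import random

def bootstrap_se(per_participant_stats, n_boot=1000, seed=0):
    """Standard error of a population-averaged statistic via bootstrap:
    resample participants with replacement, take the mean of each resample,
    and return the standard deviation of those means."""
    rng = random.Random(seed)
    n = len(per_participant_stats)
    means = []
    for _ in range(n_boot):
        sample = [per_participant_stats[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / n_boot
    var = sum((m - mu) ** 2 for m in means) / (n_boot - 1)
    return var ** 0.5

# hypothetical per-participant hit rates from a recognition experiment
hit_rates = [0.9, 0.7, 0.8, 0.6, 0.85, 0.75]
se = bootstrap_se(hit_rates)
```

The same routine applies unchanged to false alarm rates or any other per-participant statistic.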
Recall scoring was done using the OpenAI model gpt-4-0613 (a GPT-4 snapshot that receives no updates), based on the clause segmentation of the narratives. For each participant's recall, gpt-4-0613 was instructed to loop through each clause of the given narrative (as presented) and examine whether the information that the clause provided was present in some form in the participant's recall, quoting the corresponding passage. The numbers of all clauses evaluated as recalled were given at the end of the output in the form of a list. The full prompt we use for scoring recalls is given in Prompt 3; it takes three arguments: the narrative stimulus in prose (e.g. Argument 2), the numbered clause segmentation of the narrative stimulus (e.g. Argument 3), and the participant's recall (e.g. Argument 4). A sample completion is provided in Completion 3.
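Since the scored clause numbers arrive at the end of the completion as a bracketed list, they can be extracted with a simple parse. A sketch under the assumption that the final bracketed list in the completion is the answer (the helper name is ours):

```python
import re

def parse_recalled_clauses(completion: str) -> list[int]:
    """Extract the final bracketed list of clause numbers, e.g. '[2, 5, 7]',
    from an LLM scoring completion."""
    matches = re.findall(r"\[([\d,\s]*)\]", completion)
    if not matches:
        return []
    # take the last bracketed group, per the prompt's instruction to put
    # the final answer at the end of the response
    return [int(x) for x in matches[-1].split(",") if x.strip()]

example = "Piece 2 appears: '...'. Piece 5 appears: '...'.\nFinal answer: [2, 5, 7]"
parse_recalled_clauses(example)  # → [2, 5, 7]
```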
Separately, to evaluate the similarity of recall scoring between humans and LLMs, three authors manually scored 30 recalls of Argument 5 using the same procedure, evaluating whether each clause was present in the recall or not.

Random List of Nouns Experiment
We performed an additional experiment with a list of 32 nouns (see Appendix H) randomly selected from the pool of nouns used in [Naim et al., 2020]. The experimental protocol was exactly the same as in that work, with a presentation speed of 1.5 s per word. 105 participants were recruited through the Prolific online platform, and each participant provided informed consent prior to the beginning of a trial.

Appendix A. Prompts and Completions
A significant part of successfully using LLMs is the effective design of inputs or "prompts", which amount to instructions, written in standard English, for carrying out a particular task.
More precisely, the input to an LLM is a string, which we interchangeably refer to as the prompt or the context. The output of the LLM is also a string. This is also referred to as a "completion", since LLMs are trained to complete text fragments given to them as inputs.
Despite the accumulated wisdom on prompt engineering, we still resorted to a good amount of trial and error and manual tuning in designing the prompts used in our analysis. Since these prompts are tantamount to algorithms written for LLMs, we present them below in pseudocode boxes. In the following sections, we exhaustively detail the prompts used, along with examples of LLM outputs (i.e. completions). The specific model we utilize in most of the paper is OpenAI's GPT-4. For the data analysis, in the interest of reproducibility, we opted to use the deprecated model gpt-4-0613, which does not receive updates. In Appendix B.1, we show how scoring using the latest gpt-4 model can change over time as a result of OpenAI's regular model updates.

A.1 Narrative Generation
We started with a template narrative, selected from the collection of oral narratives in [Labov, 2013], and instructed the LLM to produce variations of the story that changed the surface form but kept the overall structure (e.g. the number of clauses). For these stories, we used the segmentation given by Labov [2013], wherein the narratives are segmented word-for-word into linguistic clauses. Here is a sample of the prompt, with arguments shown enclosed in brackets, like {this}.

Prompt 1 Narrative Generation
This is a true personal narrative about a single event in someone's life. It has exactly {N} clauses: {template narrative} Generate a new personal narrative that is unique and about something completely different. Try to keep the overall narrative structure of the personal narrative above, but change as much of the subject matter and action as possible. Do not just use the narrative and replace key persons, places and things. Make it completely new. This new narrative must also contain exactly {N} clauses.
As an example, we used the story of Jacob Schissel from [Labov, 2013] as a template narrative, which has N = 18.

A.2 Lure Generation

Prompt 2 Lure Generation
{segmentation} The items above all fit together to tell a story. Add more items of roughly the same length, numbered 1.5, 2.5, and so on, interleaving the existing items, elaborating on the story, and without repetition. These new items should introduce completely new plot elements, but still make sense in the context of the rest of the story. Add as many items as possible.
The following is an example completion using the template narrative segmentation (Argument 1). For purposes of presentation, we show the lures alongside the original clauses from the narrative: the original clauses appear in gray, and the completion itself consists only of the text shown in black.
Completion 2 from Prompt 2

A.3 Recall Scoring
In order to score recall, we utilize zero-shot prompting with OpenAI's deprecated model gpt-4-0613. We chose this particular model for the benefit of reproducibility, since we have found that results from the following analysis can drift over time as OpenAI updates its models (see Appendix B.1).
The recall scoring Prompt 3, shown below, was a block of text, represented as a Python string, which took as arguments the original narrative, the numbered segmentation of the narrative, and a single participant's recall of the narrative. The prompt was constructed to identify which clauses from the narrative are present in the recall. A consequence of this scoring procedure is that a single clause in the recall can encode multiple clauses from the original narrative. The model gpt-4 appeared to have no trouble identifying such compressed clauses, and frequently identified multiple original clauses within a single clause of the participant's recall. An example of this can be seen in Completion 3, where the same passage in the recall ("even the scoutmaster looked on") is counted as recalling both clauses 15 and 16 from the original narrative (Argument 2).
Below, we provide the scoring prompt, as well as some example arguments, and the resulting LLM completion.

Prompt 3 Recall Scoring
It can be broken down into the following independent pieces of information: {segmentation}
Here is an alternative version of the original text where some of the above pieces of information may be missing: {recall} For each of the numbered information pieces of the list above, evaluate whether the information of each piece is given in the alternative version of the story, stating the number and showing the corresponding passage from the alternative story it is given in. After, write all the numbers of the pieces that are given in the alternative version of the story in a set of brackets at the end of the response.
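For illustration, the three-argument prompt can be assembled as a Python string along these lines (a sketch with abridged wording; the exact text is given in Prompt 3, and the function name is ours):

```python
def build_scoring_prompt(narrative: str, segmentation: str, recall: str) -> str:
    """Assemble the three-argument recall-scoring prompt.
    Wording abridged; see Prompt 3 for the full instructions."""
    return (
        f"{narrative}\n\n"
        "It can be broken down into the following independent pieces of "
        f"information: {segmentation}\n\n"
        "Here is an alternative version of the original text where some of "
        f"the above pieces of information may be missing: {recall}\n"
        "For each of the numbered information pieces of the list above, "
        "evaluate whether the information of each piece is given in the "
        "alternative version of the story..."
    )
```

The resulting string is then sent as the prompt to gpt-4-0613, one participant recall at a time.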

Argument 2 Example narrative
Yeah, I was in the boy scouts at the time. And we was doing the 50-yard dash, racing, but we was at the pier, marked off, and so we was doing the 50-yard dash. There was about 8 or 9 of us, you know, going down, coming back. And going down the third time, I caught cramps and I started yelling "Help!", but the fellows didn't believe me, you know. They thought I was just trying to catch up, because I was going on or slowing down. So all of them kept going. They leave me. And so I started going down. Scoutmaster was up there. He was watching me. But he didn't pay me no attention either. And for no reason at all there was another guy, who had just walked up that minute... He just jumped over and grabbed me.

A.4 Ordered Recall Scoring
In order to extract the order in which the original clauses appeared in a participant's recall, we added an extra step to the recall scoring discussed above. For a given narrative, segmentation, and recall, we concatenated Prompt 3 with its Completion 3, and appended the following additional prompt:

Prompt 4 Ordered Recall Scoring
Now repeat the alternative version of the narrative with the number of the independent piece of information inserted next to the location in which it appears in the alternative version. Then, list the numbers separately in the order in which they appear in the alternative story immediately above. The final list of numbers should be enclosed in parentheses.
Therefore, the total input for ordered recall scoring is Prompt 3 + Completion 3 + Prompt 4. The output for this example was the following:

Completion 4 from Prompt 3 + Completion 3 + Prompt 4
A boy (14) was with the scoutmaster and he was going up and down and on the third time of going up the pier he had cramps (7) and shouted 'help' (8) but no-one believed him (9), even the scoutmaster looked on (15, 16) by until this guy came out of nowhere (17) and grabbed the boy and saved him (19). (14, 7, 8, 9, 15, 16, 17, 19)
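The final parenthesized list in such a completion can be recovered programmatically. A sketch (the helper name is ours) that requires at least two comma-separated numbers, so that inline markers like "(19)" are not mistaken for the answer:

```python
import re

def parse_recall_order(completion: str) -> list[int]:
    """Extract the final parenthesized list of clause numbers, e.g.
    '(14,7,8,9,15,16,17,19)', from an ordered-recall-scoring completion."""
    # match only parenthesized groups of two or more comma-separated numbers
    matches = re.findall(r"\(\s*\d+(?:\s*,\s*\d+)+\s*\)", completion)
    if not matches:
        return []
    return [int(x) for x in re.findall(r"\d+", matches[-1])]

text = "...saved him (19). (14,7,8,9,15,16,17,19)"
parse_recall_order(text)  # → [14, 7, 8, 9, 15, 16, 17, 19]
```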

Appendix B. Reliability of LLM scoring
Here we report additional details on the reliability of LLM narrative scoring.
In scoring the recalls, the human scorers (the authors of this paper) did not develop an overly sophisticated scoring protocol. The directions were intuitive, essentially boiling down to "did this clause from the original story appear in the recalled story?". Despite the ambiguity of this procedure, inter-scorer correlation was very high.

B.2 Comparison between different LLMs
We compared different LLMs available through the OpenAI API. In order of increasing number of learned parameters (model size), these were GPT-3 (text-davinci-003), GPT-3.5 (or ChatGPT, gpt-3.5-turbo), and GPT-4 (gpt-4-0613). The same scoring Prompt 3 was used for all of these models. The results are visualized in Figure 5. We find that GPT-4 is qualitatively better than the smaller models (compare the correlations between human and LLM scoring in Figure 5B). Furthermore, increasing size does not seem to be sufficient to increase performance, as illustrated by the severe drop in performance of GPT-3.5. As seen in Figure 5(A,B), GPT-3.5 appears to score more generously than the other models, resulting in a systematic upward bias of nearly all recall probabilities. It is possible that this reflects the sensitivity of these models to prompts, and that with appropriate tuning of the scoring prompt, the performance of GPT-3.5 could be improved.
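The correlations reported in Figure 5B are standard Pearson r-values between per-clause scores; a minimal sketch with made-up data (the function name and values are ours):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# hypothetical per-clause recall probabilities from two scorers
human = [0.9, 0.4, 0.7, 0.2, 0.8]
llm = [0.85, 0.5, 0.65, 0.25, 0.9]
r = pearson_r(human, llm)
```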

Appendix C. Compressed Recall in Longer Stories
Each recall can be split into a number of clauses. We denote the mean number of clauses used in recalls by C. Figure 6 compares C to the mean number of recalled clauses, defined previously and denoted by R. If every clause used by the participant recalled a single clause from the original story, the points would lie on the straight line C = R. For shorter stories, we find in fact that C > R: people appear to use more clauses than necessary. However, for the longest story (L = 130), there is a marked compression, in which participants on average use 75% as many clauses as they are scored to have recalled.
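The compression measure is simply the ratio of the two means; a sketch with hypothetical counts (the function name and numbers are ours, chosen to illustrate the 75% figure):

```python
def compression_ratio(clauses_used, clauses_recalled):
    """Ratio C/R: mean number of clauses written in recalls vs. mean number
    of clauses scored as recalled. Values below 1 indicate compression."""
    C = sum(clauses_used) / len(clauses_used)
    R = sum(clauses_recalled) / len(clauses_recalled)
    return C / R

# hypothetical per-trial counts for a long story
ratio = compression_ratio([15, 12, 18], [20, 16, 24])  # → 0.75
```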
We used Prompt 5 with GPT-4 at T = 0 to segment recalls. An example of this applied to Argument 4 is given in Completion 5.

Prompt 5 Segmentation Prompt
Provide a word-for-word segmentation of the following narrative into linguistic clauses, numbered in order of appearance in the narrative: {narrative}

Appendix D. Output Interference
With hit rate P_h and false alarm rate P_f, the discrimination measure is

d′ = z(P_h) − z(P_f),

where z denotes the inverse of the standard normal cumulative distribution function. We plot d′ over the course of the recognition trials in Figure 7 separately for coherent stories and scrambled stories.
Output interference is characterized by a d ′ which decreases with trial number, indicating a diminishing ability to discriminate new vs old items.
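Assuming the standard signal-detection definition d′ = z(P_h) − z(P_f), the measure can be computed as follows (the function name is ours):

```python
from statistics import NormalDist

def d_prime(p_hit, p_fa):
    """Signal-detection discriminability: d' = z(P_h) - z(P_f),
    where z is the inverse standard normal CDF."""
    z = NormalDist().inv_cdf
    return z(p_hit) - z(p_fa)

d_prime(0.8, 0.2)  # → ≈ 1.68, since z(0.8) ≈ 0.8416 and z(0.2) ≈ -0.8416
```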

Appendix E. Recall Probability Curves and Distributions
Here we plot the recall probability per clause as a function of the serial position of presentation; we refer to these as recall probability curves. We compare recall probability curves for narratives to data obtained for random word lists. Figure 8A shows the recall probability curve for coherent (blue) and scrambled (red) stories, alongside the recall probability curve for free recall of a random word list (black dashed). Importantly, for the scrambled recall, the serial position corresponds to the position of presentation. Two main points can be gleaned from these figures. First, whereas for random lists there is a pronounced primacy effect, we see no such structure in either coherent or scrambled stories.
In fact, highly recalled clauses appear at all positions in the story. Another observation is that the distribution of P_rec for clauses in a narrative covers a much broader range than P_rec for words in a list. This can be seen by looking at the complementary cumulative distribution function F(p) = P(P_rec > p), i.e. the probability of finding P_rec > p. As shown in Figure 9A, F(p) drops to zero around p ≈ 0.6 for random lists, while showing a more gradual decay for coherent narratives. In contrast, Figure 9B shows that recall is significantly impaired for scrambled stories, with F(p) showing a more precipitous fall to zero, similar to random word lists.
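The empirical F(p) is straightforward to compute from per-clause recall probabilities; a sketch with made-up values (the function name and data are ours):

```python
def ccdf(recall_probs, p):
    """Empirical F(p) = P(P_rec > p): the fraction of items whose
    recall probability exceeds p."""
    return sum(1 for q in recall_probs if q > p) / len(recall_probs)

# hypothetical per-clause recall probabilities
probs = [0.1, 0.3, 0.5, 0.7, 0.9]
ccdf(probs, 0.6)  # → 0.4, since two of the five clauses exceed 0.6
```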
However, the recall probability curves for scrambled stories reveal a more subtle effect. While the overall distribution of recall probabilities may be similar to (if not worse than) that of random lists, there are still spikes in P_rec throughout the bulk of the narrative, suggesting that these texts are not processed as if they were a random list of clauses. In Figure 10, we replot the recall of only the scrambled stories from Figure 8 according to the serial position of the clause in the coherent story. Below this, we show the correlation between the recall probability of a clause appearing in the coherent story versus the scrambled version. While low, the correlations are positive and statistically significant, suggesting that participants are able to identify memorable clauses even in the scrambled scenario, and use this in attempting to construct a coherent recall. This active process of selection might explain why the average recall of clauses for scrambled narratives appears worse than recall of words in random lists (i.e. why the scrambled narrative data (orange triangles) lie below the square-root scaling (dashed gray) in Figure 2).

Figure 2 :
Figure 2: Human performance in recall and recognition experiments for narratives of different lengths. (A): Estimated number of remembered clauses (M) plotted as a function of the number of clauses in the narrative (L), measured in the recognition experiment. Surprisingly, M has similar values for intact and scrambled narratives. (B): Average number of recalled clauses (R) for narratives of different lengths. In contrast to M, R drops substantially for scrambled narratives. (C): Average number of recalled clauses vs. number of remembered clauses from the same story. As expected from panels (A) and (B), the number of retrieved clauses for scrambled narratives is substantially smaller than for intact narratives with the same number of remembered clauses. For comparison, we present the theoretical performance for random lists of words, which describes the data well [Naim et al., 2020]. It is clear that more clauses are recalled in intact narratives than words in lists of random words. Surprisingly, retrieval of scrambled stories is significantly worse than that of random lists, suggesting an active suppression of items in service of generating a coherent recall (participants were implicitly instructed to recall a story).


Figure 3 :
Figure 3: Recall order. The color-coded order of clauses or words for different conditions is shown in all panels. Recalled clauses or words are stacked vertically (with the first recalled clause at the bottom of a column and the last recalled clause at the top). The height of a column represents the total number of clauses or words recalled in a given trial. In panels A, B, and D, the color code represents the serial position of presentation of clauses or words, from early (red) to late (blue). Panel (C) is the only exception, in which the color code reflects the serial position of clauses in the original (intact) story. (A) shows that recall of coherent stories largely preserves presentation order. (B) Recall of random word lists does not preserve presentation order. (C) As with random lists, the recall of a scrambled story does not preserve presentation order, but rather appears to reconstruct the original order of the story, as seen from the color gradients in panel (C). In other words, random words and scrambled stories are recalled in an order unrelated to their presentation order, but people perform some unscrambling of the scrambled stories: there is a tendency for recalled clauses to follow the order of the original, unscrambled narrative. The participants evidently construct a mental representation of the scrambled narrative that is close to its original form, so that recall reflects not the input sequence but the original sequence of the clauses.

Figure 4 :
Figure 4: Recognition vs. recall performance across different clauses. Clauses from all narratives used in this study were divided evenly into 15 bins according to their P_rec, and the average P_h for the clauses in each bin was computed and plotted against the center of the corresponding bin.

Figure 5 :
Figure 5: Reliability of different LLMs for scoring narratives. We observed that performance on human recall scoring was not monotonic in the size of the LLM. (A) shows P_rec for each clause as computed by LLMs of three different sizes, in order of increasing size: GPT-3 (OpenAI's API model text-davinci-003), GPT-3.5 (or ChatGPT, model gpt-3.5-turbo), and GPT-4 (gpt-4-0613). GPT-3.5 scores appear systematically biased higher than the mean human scores, whereas GPT-3 roughly follows the human trend. (B) shows correlations between these models and mean human performance. (C) shows inter-LLM correlations.

Figure 6 :
Figure 6: Compression of recalls. The vertical axis shows the mean number of clauses in a population of recalls, determined using the segmentation prompt given in this section. The horizontal axis gives R, the mean number of recalled clauses.

Figure 7 :
Figure 7: The discrimination measure shows good memory for true clauses and does not change significantly as the recognition trial progresses.

Figure 8 :
Figure 8: Recall probability per item for narratives and random lists of words. (A) shows the recall probability per clause as a function of the serial position of presentation for a coherent narrative (blue) and the scrambled version of the same narrative (red). As a comparison, we plot the recall probability per word as a function of the serial position of presentation for a random word list (black dashed). While the recall probability for the list shows a marked primacy effect, with a general suppression of recall in the bulk of the list, the recall of coherent narratives shows many spikes in P_rec throughout the narrative. And while the scrambled stories have overall much lower recall probabilities, they still show large peaks within the bulk of the presentation. (B) and (C) compare coherent and scrambled narratives of increasing length.

Figure 9 :
Figure 9: Cumulative distribution of recall probabilities. (A) As compared to random lists (purple dashed), coherent narratives have a slower decline in the cumulative distribution function, indicating a larger number of clauses with a high recall probability. (B) shows that this behavior is reversed for scrambled stories, where the recall probability of clauses appears to uniformly drop, in some cases well below the corresponding distribution for recall of a random list.

Figure 10 :
Figure 10: Descrambling of narratives in recall. (A-C) show recall probability as a function of the serial position in the coherent story. Note that for the scrambled P_rec (red curves), this is not the presentation order, but rather the original order of the clauses. Plotting the curves in this way, we observe that the spikes in recall probability appear to coincide between coherent and scrambled recall. Below, we show correlations between P_rec for a given clause in the coherent and scrambled presentations.

Table 1 :
A comparison between gpt-4, the mean human score (h̄), and individual human scorers (h_1, h_2, h_3). Each entry in the table gives the correlation coefficient (r-value) between the corresponding row and column labels. The models offered through the OpenAI API are known to undergo regular updates. We found that within the span of two months, the correlation coefficient with mean human performance rose by 0.05. We compare gpt-4 called via OpenAI's API on 05/23/23, the same model gpt-4 called less than two months later on 07/03/23, and the deprecated model gpt-4-0613 (called after 06/13/23). Fortunately, the frozen model gpt-4-0613, which we use throughout the paper, is essentially as good as the most recent version.

Table 2 :
A comparison of gpt-4 scoring compiled at different times of the year. Evidently, the model improves over time in approximating the mean human scoring, with the most recent model achieving a correlation closest to one. The entries of the table show correlation coefficients (r-values) between the corresponding row and column variables.