Abstract
Artificial intelligence that accumulates knowledge and presents it to users has the potential to become a vital ally for humanity. It would be advantageous for such artificial intelligence to continually accumulate knowledge while operating, rather than needing to be retrained each time new knowledge is acquired. In this study, we developed an appendable memory system that allows artificial intelligence to acquire new knowledge after deployment. Artificial intelligence that can share previously acquired knowledge with users is perceived as a more human-like entity. Some dialogue agents developed to date can partially utilize memories of past conversations; such memories are occasionally referred to as long-term memory. However, the capacity of existing dialogue agents to continuously accumulate knowledge and share it with users remains insufficient: given their naive implementations, their memory capacity is ultimately limited. Methods are therefore needed to realize a system capable of storing any amount of information within a finite medium. Furthermore, when dialogue agents are developed by repeatedly training on human conversation data, it is doubtful whether the resulting artificial intelligence can effectively utilize the information it acquired in the past. In this study, we demonstrate this impossibility. In contrast, this research proposes a method to repeatedly store various pieces of information within a single finite vector and retrieve them, utilizing a neural network with a structure similar to a recurrent neural network and an encoder–decoder network, which we named the memorizer–recaller. The system we aim to build is one capable of generating external memory. Generally, the parameters of a trained neural network are static; in other words, a conventionally trained neural network predicts values by referring to static memory. In contrast, we aspire to develop a system that can generate and utilize dynamic memory information.
This study has produced only very preliminary results, but it is fundamentally different from the traditional learning methods of neural networks and represents a foundational step toward building artificial intelligence that can store any information within finite-sized data and freely utilize it in the future.
1 Introduction
The grand challenge of artificial intelligence research is to construct strong artificial intelligence, or artificial general intelligence. The journey toward realizing such artificial intelligence is long, and we will likely need to solve many problems along the way. The study of endowing artificial intelligence with consciousness, that is, the study of machine consciousness, is one such problem [1], and the construction of entities that possess physicality in the real world is another. Furthermore, the ability to learn perpetually, without a defined learning process like that found in the current machine learning field, may also be necessary. Such artificial intelligence would be capable of continuing to learn even after being deployed, and constructing this kind of artificial intelligence is one of the grand challenges in the field of dialogue agent research [2].
Artificial intelligence that can continue to learn after deployment would need at least the following abilities in order to acquire new knowledge and use it according to the situation:
The ability to retain the information that has been supplied so far as some kind of memory data.
The ability to add the newly supplied information to that memory data without compromising the information that has already been memorized.
The ability to freely extract necessary data from the memory data.
Generative artificial intelligence, particularly large language models (LLMs), sparked a boom in 2022. The origin of this trend, ChatGPT, a dialogue agent developed by OpenAI, is capable of holding a conversation based on a certain amount (8192 tokens) of past conversation memory. Since the information input into ChatGPT is used in subsequent conversations, it would not be an exaggeration to say that ChatGPT continues to learn while operating as artificial intelligence. Likewise, BlenderBot 2, developed by Meta, and its successor BlenderBot 3 are dialogue agents that can summarize knowledge from past conversations, hold that knowledge, and converse based on it. BlenderBot 2 has a component called “long-term memory” in its structure, which holds information summarizing past conversations; in response to user input, it uses the information stored in this long-term memory. However, as with ChatGPT, the information remembered is finite due to the naive implementation. Such LLMs are not designed to compress and retain past information as some form of memory, nor to decompress such compressed information and restore anything from it; they are only neural networks comprised of static parameters, trained for the purpose of generating and processing natural language.
In human conversation, people sometimes forget past conversations they have had. If we are looking for literal human-likeness in artificial intelligence, artificial intelligence with finite memory may partially meet that demand. However, people do not completely forget the content of a conversation from before a certain time while remembering everything after it, so dialogue agents with finite memory may not be human-like in that sense either. Furthermore, we should consider whether we should demand human-likeness, that is, imperfection, in certain abilities of artificial intelligence, such as memory. As mentioned earlier, the grand challenge of artificial intelligence research is to build strong artificial intelligence or artificial general intelligence, but should all the abilities of such an intelligence be human-like? Should we wish for artificial intelligence to possess imperfect memory? In the paper where the term “strong artificial intelligence” was first used [3], strong artificial intelligence is described as an entity with “spirit.” This spirit is human nature itself. We believe that for a strong artificial intelligence to be human-like, it should simply possess this spirit, the essence of human-likeness; its other abilities need not be human-like. Conversely, we believe that artificial intelligence can even be “machine-like” in some of its other abilities. Artificial intelligence that, unlike a human, does not lose memories, and that, like a machine, can provide knowledge freely without forgetting any memorized information, would be a good friend to humanity. In fact, we do not demand that aibo, the dog-shaped robot developed by SONY, should die, do we? The aforementioned BlenderBot 2 has long-term memory, but that memory is not yet machine-like. Machine-like artificial intelligence should be able to memorize any information.
Here, we call such memory that can hold any information “permanent memory.” In this study, toward realizing such permanent memory, we develop an “appendable memory” system.
In this study, we contemplate the possibility of constructing such a permanent memory, or appendable memory, through existing machine learning methods. Prior to that, we consider whether artificial intelligence can be trained to “refer to” past information at all using these machine learning methods. There is a suspicion that methods attempting to extract information from a medium such as long-term memory, permanent memory, or appendable memory may not be suited to the learning methods of neural networks. When constructing a dialogue agent, we repeatedly train a model with pairs of question sentences and their corresponding response sentences; the same is true for LLMs such as the Generative Pretrained Transformer. If a model that accepts a question sentence and long-term memory as input and returns a response sentence is trained repeatedly on conversation pair data, the model will likely stop paying attention to the memory as learning progresses. This is because a trained model becomes optimized to the training data and can return an appropriate response to a question sentence without referring to the information in memory. Using information that can exist only in external memory, such as long-term memory, to generate a response might be nothing more than a fantasy within the current framework of machine learning.
To clarify this, we conduct an experiment involving encoding and decoding operations performed by two models, which we call the “memorizer” and the “recaller.” From here on, we refer to the encoding operation as the “memorization” operation and the decoding operation as the “recall” operation. In the memorization operation, we generate several vectors, each combining a key vector consisting of random numbers of a certain length with a value vector consisting of an integer from 0 to 9, and repeatedly feed this information to the memorizer to generate a vector that is expected to contain all the input information. We call this vector, generated by the memorization operation, a “memory vector.” In the subsequent recall process, we use one of the key vectors from the memorization process together with the memory vector as inputs to the recaller, and train it to output the value corresponding to that key. During the test phase, we feed the trained memorizer multiple key–value vectors different from those used during training, and observe whether the recaller can retrieve the correct value corresponding to a key, using only the key vector and the memory vector as input. The memory vector in this experiment corresponds to the previously mentioned long-term memory, permanent memory, or appendable memory. If the recaller can effectively utilize the memory vector, it should be able to output the value corresponding to the key. However, this is unlikely to work.
This is because a recaller that has undergone sufficient training on a finite training dataset is optimized to that dataset: it merely memorizes the input and teacher vectors, namely combinations of key–value vectors composed of random numbers, and becomes an artificial intelligence that simply outputs the value corresponding to a key without referring to the memory vector.
In this study, we conduct an experiment that demonstrates the limitations of existing learning methods. We also propose a learning method designed to prevent the model from overfitting to patterns inherent in the input dataset. Ultimately, we advocate for a system that can encode multiple pieces of information independently into a memory vector and freely restore information from it, using the proposed learning method. We believe that a neural network trained by conventional learning methods stores, within its parameters, the memory information necessary to return appropriate outputs for given inputs. We describe such a neural network as possessing “static memory.” In contrast, the system we aim to construct can generate “dynamic memory,” which we call appendable memory: a type of memory that can change dynamically without retraining the neural network or changing its parameters.
2 Experiments and Discussion
2.1 Network Architecture
What we aim to achieve in this study is to train the memorizer–recaller network so that the memorizer can store N pairs of key and value vectors in a memory vector, and the recaller can then output the corresponding value given the memory vector and one key vector, as shown in Figure 1. In other words, we perform the task of restoring complete information from a key using the memory vector. Using a dialogue agent as an example, this operation can be explained as shown in Figure 2. The combination of key and value corresponds to a statement by a user, and the memory vector represents the accumulation of these statements stored in the artificial intelligence. The key represents some piece of information, and the value is an explanation of that information. The role of the recaller is to recall the value corresponding to a key based on its memory. An artificial intelligence capable of this behavior could accumulate knowledge from users, or even from web searches, and grow without changing or updating the parameters of its neural network after deployment.
The operations performed by the memorizer–recaller network could be achieved with a key–value store if the key and value were a simple combination and there were no limit to memory capacity. Here, however, the size of the memory vector is fixed, while the size of the data to be memorized is an arbitrary N. Moreover, if we assume natural language as the input, the key and value do not always appear in that order; instead of a tandem key–value structure, the value may be located in the middle of the key vector. A simple key–value store cannot handle such data. Furthermore, the operation we want the memorizer to perform resembles what a neural network equipped with an attention mechanism (ANN) [4] or a recurrent neural network (RNN) can do: both can receive sequential input and produce output while considering past context. However, the memorizer differs from usual ANNs or RNNs in two respects: there is no teacher data for the memory vector during the learning of the model, and ANNs and RNNs can overwrite their context information but cannot append information to it.
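For contrast, when capacity is unbounded the same operations reduce to an ordinary key–value store, as a few lines of Python make plain (illustration only; the keys here are hashable tuples rather than the random vectors used in the experiments):

```python
# With unbounded memory, the memorize/recall task is trivially solved
# by an ordinary key-value store; the difficulty in this study comes
# entirely from the fixed-size memory vector.
store = {}

def memorize(key, value):
    store[key] = value        # capacity grows with every item

def recall(key):
    return store[key]         # exact retrieval, no forgetting

for i in range(1000):
    memorize(("key", i), i % 10)
```

Because the store grows without bound, it never has to trade old entries for new ones, which is exactly the property a fixed-size memory vector cannot have.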
In this study, we used two neural networks; here, we show the formulas of the models. In this paper, the left arrow (←) is a symbol that assigns the value on its right to the variable on its left. The first model, the memorizer, is represented by the following formulas:

p ← σ(w_p f(k_t, v_t) + b_p)
q ← σ(w_q m_{t−1} + b_q)
m_t ← p + q

where w is a weight parameter, b is a bias parameter, and σ is the activation function; we used LeakyReLU [5]. In addition, f is a function that concatenates two vectors, t is an integer from 1 to N, k_t is the t-th key, v_t is the t-th value, and m_t is the t-th memory vector. In the memorizer, the (t − 1)-th memory vector is used to generate the t-th memory vector; the memorizer has a recurrent loop in its structure and is thus a kind of RNN. The variable p in the final formula is derived from the input data, while q is derived from the (t − 1)-th memory vector; the memory vector therefore includes both previous memory information and newly input information. Because we calculate the element-wise summation of p and q, these vectors must have the same dimensions. During the memorization phase, we perform this computation for all N data to finally obtain m_N. The second model, the recaller, is represented by the following formula:

v̂ ← φ(w_r f(k_i, m_N) + b_r)

where φ is the activation function; because the recaller is used to classify and predict values from 0 to 9, we used the softmax function. In addition, k_i is any one of the N keys used as input to the memorizer. In the recall phase, the recaller attempts to output v̂, which is expected to be the value corresponding to k_i.
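The data flow described above can be sketched in NumPy as follows. The layer sizes follow Section 2.2.1, but the weights are untrained random values and the function names (`memorizer_step`, `recaller`) are ours, so this only illustrates the shapes and the recurrence, not the trained behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
KEY_DIM, MEM_DIM, OUT_DIM = 16, 256, 10

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Untrained parameters, for illustration only.
w_p = rng.normal(0, 0.1, (MEM_DIM, KEY_DIM + 1)); b_p = np.zeros(MEM_DIM)
w_q = rng.normal(0, 0.1, (MEM_DIM, MEM_DIM));     b_q = np.zeros(MEM_DIM)
w_r = rng.normal(0, 0.1, (OUT_DIM, KEY_DIM + MEM_DIM)); b_r = np.zeros(OUT_DIM)

def memorizer_step(k_t, v_t, m_prev):
    # p is derived from the input pair, q from the previous memory vector;
    # their element-wise sum is the new memory vector m_t.
    p = leaky_relu(w_p @ np.concatenate([k_t, v_t]) + b_p)
    q = leaky_relu(w_q @ m_prev + b_q)
    return p + q

def recaller(k_i, m_n):
    # Softmax output: a probability distribution over the values 0-9.
    return softmax(w_r @ np.concatenate([k_i, m_n]) + b_r)

# Memorization phase: fold N key-value pairs into one memory vector.
m = rng.uniform(-1, 1, MEM_DIM)                  # m_0
keys = [rng.uniform(0, 9, KEY_DIM) for _ in range(5)]
for k in keys:
    m = memorizer_step(k, rng.integers(0, 10, 1).astype(float), m)

probs = recaller(keys[0], m)
```

Note that the memory vector keeps the same fixed dimension (256) no matter how many pairs are folded into it, which is the property the study depends on.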
Figure 3 illustrates the neural networks. In the figure, the structure on the left is the memorizer, and the one on the right is the recaller. The memorizer accepts pairs of key and value vectors as input and outputs the memory vector. By repeating this calculation for all N input values, it ultimately generates a memory vector that is expected to contain all the input information. The recaller accepts a single key and the memory vector generated by the memorizer as input, and outputs the value corresponding to the key. What we want to ascertain with the memorizer–recaller network is whether a trained neural network can store and retrieve multiple pieces of information, as shown in Figure 1, without relearning. Since we do not pursue raw performance, we did not include any elaborate mechanisms commonly used to improve the performance of neural networks, and designed the system as a very simple combination of layers, as shown in the figure.
2.2 Learning Process of the Models
2.2.1 Learning in a Standard Manner
Next, we trained the model. First, we explain the hyperparameters used for training. The size of the intermediate layer was set to 256 for both the memorizer and the recaller. Similarly, the dimension of the memory vector, the output of the memorizer, was set to 256. Since the recaller is a predictor that outputs integers from 0 to 9, the dimension of its output vector was set to 10. The dimension of the key vector was set to 16, and each of its elements is a floating-point number randomly generated from a continuous uniform distribution with a minimum value of 0 and a maximum value of 9. The value vector is a one-dimensional vector whose only element is an integer randomly generated from a discrete uniform distribution with a minimum value of 0 and a maximum value of 9. The number of data input to the memorizer is N, which is varied across experiments. For training, 1024 combinations of these N input data were generated and used in the batch learning method. During the learning of the memorizer, the value of t is incremented from 1 to N, the memory vector m_t is calculated at each step, and finally m_N is obtained. For m_0, random numbers generated from a uniform distribution with a minimum of −1 and a maximum of 1 were used. Since the softmax function is used in the output layer, the cross-entropy error function was used as the cost function. Parameters were updated after m_N and each of the N keys were input to the recaller to output predicted values and the error between those values and the teacher data was calculated. The cost function L is therefore computed as follows:

L(w, b) = Σ_{i=1}^{N} l(R(k_i, m_N), v_i), with m_N = M(k_1, v_1, k_2, v_2, …, k_N, v_N)

where M represents the memorizer, R represents the recaller, l denotes the cross-entropy error function, w signifies all weight parameters of the memorizer–recaller network, and b stands for all bias parameters of the memorizer–recaller network.
As a parameter optimization method, Adam [6] was used.
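The cost function can be sketched as follows. Stub softmax outputs stand in for R(k_i, m_N), since all N predictions share the same final memory vector; the numbers are made up for illustration and the function name `cross_entropy` is ours:

```python
import numpy as np

def cross_entropy(probs, target):
    # l(probs, target): negative log-likelihood of the correct class.
    return -np.log(probs[target])

rng = np.random.default_rng(1)
N = 8

# Stub predictions standing in for R(k_i, m_N): each row is a softmax
# output over the 10 classes for one of the N keys.
logits = rng.normal(size=(N, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

targets = rng.integers(0, 10, N)     # the teacher values v_1 .. v_N

# L(w, b): sum of the per-key errors, all computed against the same m_N.
loss = sum(cross_entropy(probs[i], targets[i]) for i in range(N))
```

Summing over all N keys forces the gradient to account for every stored pair at once, rather than one pair at a time.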
The results of the training are shown in Table 1. The training dataset and the validation dataset were randomly generated; in other words, there is no discernible rule or pattern in either dataset that the machine learning model might discover. Therefore, to derive the correct answer in the validation phase, the recaller must refer to the memory vector generated in the memorization phase. Training was terminated when the accuracy in the training phase reached 0.8 or more. Table 1 displays the number of epochs required to reach this point and the accuracy on the validation dataset. As a result of the training, the accuracy in the training phase reached 0.8, but the accuracy in the validation phase remained low, and we were not able to build a good model.
The reason a good model could not be built is that the model overfit the training dataset and adapted excessively to it. In other words, the model learned the characteristics of a training dataset consisting of random elements, so it could not adapt to the validation dataset, which has no correlation with, and shares no characteristics with, the training dataset. Although the recaller should ideally derive answers by referring to the memory vector, a condensed medium of the information input to the memorizer, it merely attempts to output something based on the patterns of the randomly generated vectors. In a learning task like this one, where the model is asked to predict output values corresponding to input values, it appears impossible to instill the function of referring to the memory vector, because as learning iterations accumulate it becomes sufficient for the model to output values corresponding to the input values directly. We think the parameters of a neural network trained by the standard learning method are themselves a kind of memory, one that memorizes the relationship between keys and values. As mentioned above, we call this static memory. Once a system holds perfect static memory overfitted to the training dataset, it no longer needs to refer to external memory such as the memory vector.
2.2.2 Learning with Random Value
What we want to implement in the memorizer–recaller network proposed in this study is a “way” of organizing and remembering input values, and a “way” of searching memory for necessary information and recalling it. With the general machine learning training methods used in the previous section, a model learns the nature of the data and tries to capture the patterns inherent in it. An LLM that has intensively learned text can be considered to have discovered some rules of language generation. In contrast, what we want to construct in this study is a system that can memorize and recall information regardless of the scale of the training data or the domain from which it originates. The model we want to realize is not a system that finds patterns inherent in data, but a kind of mnemonic system: a model that learns how to memorize information.
In this study, to prevent the model from recognizing patterns inherent in the data and to make it learn how to behave instead, we removed patterns from the training data. Specifically, we randomized k_i, the input to the memorizer and recaller, and v_i, the input to the memorizer and the teacher data for the recaller. During the learning process, each of the 16 elements of k_i was randomly generated from a continuous uniform distribution over 0 to 9 for each epoch, and the single element of the value vector was generated from a discrete uniform distribution with a minimum value of 0 and a maximum value of 9. The point of using randomly generated data in the learning process is that random data has no tendency. We are trying to make the memorizer–recaller network learn not to read the data and capture its tendencies, but to learn how to remember data and how to retrieve information from memory.
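The randomization scheme above can be sketched as a routine that draws an entirely fresh batch each epoch, so no key–value pair is ever presented twice. Batch size and distributions follow Section 2.2.1; the function name `fresh_batch` is ours:

```python
import numpy as np

rng = np.random.default_rng(42)

def fresh_batch(batch_size=1024, n=8, key_dim=16):
    # Keys: 16 floats each, drawn from a continuous uniform
    # distribution on [0, 9) anew every epoch.
    keys = rng.uniform(0, 9, (batch_size, n, key_dim))
    # Values: integers drawn from a discrete uniform distribution on 0..9.
    values = rng.integers(0, 10, (batch_size, n))
    return keys, values

# Each epoch sees entirely new data, so the model cannot memorize
# patterns in a fixed dataset; it can only learn how to memorize.
k1, v1 = fresh_batch()
k2, v2 = fresh_batch()
```

Because the data distribution itself carries no signal, any above-chance validation accuracy must come from the memorize-and-recall procedure rather than from fitting the dataset.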
The results of learning with this method are shown in Table 2. In this learning process, we ended the learning when the validation accuracy reached 0.8. Generally, the accuracy on a validation dataset is lower than that on the training dataset, and when it is not, the training settings are usually suspect. However, since both the training and validation datasets in this method consist of random values, accuracy on the validation dataset improves in step with normal progress on the training dataset, so this behavior is normal.
We conducted training while varying the number of data N to be memorized from 2 to 9. Up to N = 8, accuracy reached 0.8 on both the training and validation datasets. In other words, the trained memorizer–recaller was able to derive correct answers on a validation dataset after learning from a training dataset, where both datasets were randomly generated and contained no patterns. On the other hand, when N was 9, accuracy did not reach 0.8 despite training for 500,000 epochs, meaning the learning process did not succeed. In other words, the memorizer–recaller constructed under these learning conditions can memorize at most 8 pieces of information.
Next, as a demonstration of the memorizer–recaller’s performance, we sequentially input the following 8 sets of key and value data, in this order, into the memorizer to generate a memory vector. Afterwards, we tested whether the values could be retrieved using each key together with the memory vector and the recaller. The integers shown in bold represent the values.
8.75, 7.90, 4.59, 0.500, 4.06, 0.180, 3.98, 8.82, 3.24, 4.33, 6.20, 7.92, 8.26, 1.95, 5.09, 7.79, 7
3.46, 2.68, 0.510, 2.45, 4.30, 7.31, 4.32, 3.54, 7.52, 3.04, 5.83, 3.31, 8.61, 1.26, 7.83, 4.26, 5
3.48, 8.12, 4.05, 5.52, 8.12, 0.890, 8.73, 5.88, 1.54, 3.22, 6.76, 5.47, 2.93, 0.350, 5.71, 8.63, 3
5.88, 5.72, 8.96, 5.24, 3.73, 4.27, 5.61, 3.04, 6.07, 2.85, 7.01, 8.55, 5.96, 0.120, 5.61, 6.06, 3
2.82, 8.69, 5.30, 5.94, 4.80, 2.07, 3.55, 5.57, 4.27, 4.23, 6.44, 2.59, 3.45, 6.74, 7.91, 0.930, 5
7.21, 4.68, 6.11, 6.49, 5.24, 4.84, 6.83, 0.950, 4.26, 1.68, 6.63, 1.95, 1.22, 2.92, 1.35, 2.00, 0
4.58, 8.25, 8.29, 0.750, 2.50, 0.0800, 7.58, 5.82, 7.57, 2.38, 3.58, 4.98, 1.48, 3.33, 1.32, 5.13, 9
6.33, 2.60, 3.90, 6.80, 3.56, 8.06, 5.75, 8.02, 6.12, 4.04, 8.81, 1.05, 6.90, 3.71, 6.08, 2.25, 3
The results of this test, with the recaller’s output written after the colon, are as follows. In this case, the recaller retrieved the correct value for every key.
3.46, 2.68, 0.510, 2.45, 4.30, 7.31, 4.32, 3.54, 7.52, 3.04, 5.83, 3.31, 8.61, 1.26, 7.83, 4.26, 5 : 5
7.21, 4.68, 6.11, 6.49, 5.24, 4.84, 6.83, 0.950, 4.26, 1.68, 6.63, 1.95, 1.22, 2.92, 1.35, 2.00, 0 : 0
3.48, 8.12, 4.05, 5.52, 8.12, 0.890, 8.73, 5.88, 1.54, 3.22, 6.76, 5.47, 2.93, 0.350, 5.71, 8.63, 3 : 3
5.88, 5.72, 8.96, 5.24, 3.73, 4.27, 5.61, 3.04, 6.07, 2.85, 7.01, 8.55, 5.96, 0.120, 5.61, 6.06, 3 : 3
8.75, 7.90, 4.59, 0.500, 4.06, 0.180, 3.98, 8.82, 3.24, 4.33, 6.20, 7.92, 8.26, 1.95, 5.09, 7.79, 7 : 7
4.58, 8.25, 8.29, 0.750, 2.50, 0.0800, 7.58, 5.82, 7.57, 2.38, 3.58, 4.98, 1.48, 3.33, 1.32, 5.13, 9 : 9
6.33, 2.60, 3.90, 6.80, 3.56, 8.06, 5.75, 8.02, 6.12, 4.04, 8.81, 1.05, 6.90, 3.71, 6.08, 2.25, 3 : 3
2.82, 8.69, 5.30, 5.94, 4.80, 2.07, 3.55, 5.57, 4.27, 4.23, 6.44, 2.59, 3.45, 6.74, 7.91, 0.930, 5 : 5
Furthermore, the mean accuracy over 1024 repetitions of this test was 0.916. This indicates that, even in testing, the memorizer–recaller network performed as intended: it can memorize multiple pieces of information in a single memory vector and retrieve information from it.
Next, we checked how the mean accuracy changes when the memorizer–recaller built for N = 8 processes various amounts of data, again computing mean accuracy over 1024 repetitions of the test. The results, shown in Table 3, demonstrate an accuracy exceeding 0.9 up to 8 items, the number used for training, but performance drops dramatically beyond that. Since this is a 10-class classification problem, random guessing yields an accuracy of 0.1; when the number of data to be memorized was set to 256, the accuracy became nearly the same as random guessing.
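A small harness of the kind used for this capacity test can be sketched as follows. The function names are ours, and a random-guess baseline stands in for the trained model to illustrate the 0.1 chance-level accuracy mentioned in the text:

```python
import numpy as np

rng = np.random.default_rng(7)

def evaluate(model, n_items, trials=1024, key_dim=16):
    # Mean recall accuracy over `trials` randomly generated episodes.
    # `model(keys, values)` must return one predicted value per key.
    correct = 0
    for _ in range(trials):
        keys = rng.uniform(0, 9, (n_items, key_dim))
        values = rng.integers(0, 10, n_items)
        preds = model(keys, values)
        correct += int(np.sum(preds == values))
    return correct / (trials * n_items)

def random_guess(keys, values):
    # 10-class chance baseline: accuracy should be about 0.1.
    return rng.integers(0, 10, len(values))

acc = evaluate(random_guess, n_items=256)
```

Plugging a trained memorizer–recaller into `model` and sweeping `n_items` would reproduce the shape of Table 3.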
There was a clear pattern in how the memorizer–recaller network retrieved correct answers: the recaller could produce the correct output for the last 8 pieces of information memorized by the memorizer, and could not recall data provided to the memorizer earlier. For instance, when the input consisted of 16 items, the memorizer–recaller showed an accuracy of 0.510, which appears to greatly exceed the 0.1 accuracy of random prediction. However, this accuracy was not uniform across the memorized data: the value of 0.510 was the average of roughly 0.9 accuracy on the last 8 items input to the memorizer and roughly 0.1 accuracy on the first 8 items.
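This recency effect becomes visible when accuracy is scored per input position rather than in aggregate. Here a synthetic predictor that remembers only the last 8 items stands in for the trained model; the names and the predictor are ours, for illustration of the scoring only:

```python
import numpy as np

rng = np.random.default_rng(3)

def positional_accuracy(trials=1024, n_items=16, window=8):
    # Synthetic predictor: exact on the last `window` items (those
    # still "in memory"), chance-level (1/10) on the earlier ones.
    hits = np.zeros(n_items)
    for _ in range(trials):
        values = rng.integers(0, 10, n_items)
        preds = values.copy()
        preds[:-window] = rng.integers(0, 10, n_items - window)
        hits += (preds == values)
    return hits / trials          # accuracy at each input position

acc = positional_accuracy()
early = acc[:8].mean()            # positions forgotten by the model
late = acc[8:].mean()             # positions still in memory
overall = acc.mean()              # the aggregate number hides the split
```

The aggregate is roughly the mean of the two regimes, mirroring how 0.510 in the text averages about 0.9 on the last 8 items with about 0.1 on the first 8.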
With the new learning method developed in this study, it was possible to build a trained model that can remember multiple pieces of information in a single vector and retrieve them, albeit only 8 items. The memorizer–recaller network is similar to an encoder–decoder network such as the transformer in terms of neural network structure, and the computational processing in the memorizer is an extension of ANNs or RNNs. However, by effectively utilizing random data in the learning process and never allowing the same input data to be learned twice, we achieved the goal of making the artificial intelligence remember a procedure for solving problems, rather than learning patterns embedded in data. As a result, the memorizer–recaller network can continue learning after the learning process, which distinguishes it from existing artificial intelligence.
On the other hand, the memory capacity is extremely small, and the memory vectors generated by the memorizer are far from what could be called permanent memory. Improving this will require a breakthrough, not just ad hoc measures such as increasing the size of the network parameters.
One feature that differentiates the recaller from previous neural network models is that it restores information from memory vectors, which have properties distinct from the input values. As discussed in the previous section, neural network models may be unable to effectively utilize an external memory because they adapt to the training dataset. Artificial intelligences like ChatGPT and BlenderBot 2 can appear to refer to previously learned memories and information, but this is fundamentally different from what the recaller does. ChatGPT and BlenderBot 2 combine prior information with queries and use them as inputs to a static model to generate responses; in the end, such LLMs are nothing more than artificial intelligence capable of processing natural language. No information is restored from dynamic memory vectors in a format different from natural language. A further distinction, this time of the memorizer, lies in its ability to append new information to memory vectors: this is the key feature of the memorizer–recaller network, which possesses dynamic external memory in the form of memory vectors. Considering that ChatGPT and BlenderBot 2 incorporate past input and output values, or their summaries, along with current input values, it might be appropriate to regard these past pieces of information as external memory: in the case of ChatGPT, this external memory takes the form of a long vector with a certain number of tokens, while BlenderBot 2 achieves it through several summary sentences. The external memory of the memorizer, in contrast, is a finite-sized vector to which new information can be appended. We see future potential in this appendable memory vector.
While the amount of information that can be remembered presumably depends on the size of the memory vector and the precision of its floating-point numbers, we aim to retain even more information through some form of breakthrough.
2.3 Developing Sorting Algorithm
Next, we examined whether a sorting algorithm could be generated by applying the memorizer–recaller network. To perform sorting, a model must remember the input information. Since the memorizer–recaller is a system that can memorize input information in a memory vector, we conducted this experiment expecting that it could generate a sorting algorithm.
The structure of the network used for this experiment is shown in Figure 4. It is almost identical to the structure in Figure 3, but the input to the recaller is changed from a key to a query. This query is an integer from 0 to N−1: zero prompts the recaller to output the smallest of the N input numbers, one prompts it to output the second smallest, and N−1 prompts it to output the largest. Arranging these outputs yields the sorted input values. For example, we used the following data as input to the memorizer, for the case N = 5. The floating-point numbers are the keys, which are what we want to sort, and the bold characters are the values. The keys were randomly generated from a uniform distribution with minimum 0 and maximum N.
3.35, A
1.05, B
0.640, C
1.58, D
1.82, E
For these input data, the recaller outputs the following. The queries to the recaller are 0, 1, 2, 3, and 4, from top to bottom.
C
B
D
E
A
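The query semantics described above can be stated as a reference behavior independently of the neural network: query q should return the value whose key is the (q+1)-th smallest. The sketch below implements that reference behavior directly (the trained recaller approximates it); `reference_recall` is a hypothetical name, not part of the paper's code.

```python
# Reference behaviour of the sorting queries (the trained recaller approximates this):
# query q (0 .. N-1) returns the value whose key is the (q+1)-th smallest.

def reference_recall(pairs, query):
    """Return the value associated with the (query+1)-th smallest key."""
    ranked = sorted(pairs, key=lambda kv: kv[0])
    return ranked[query][1]

# The worked example from the text (N = 5):
pairs = [(3.35, "A"), (1.05, "B"), (0.640, "C"), (1.58, "D"), (1.82, "E")]
print([reference_recall(pairs, q) for q in range(5)])  # C, B, D, E, A
```

Running this on the example input reproduces the output sequence C, B, D, E, A shown above.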
Training was conducted in the same way as described earlier and was stopped when the validation accuracy reached 0.95. By incrementally increasing N from 2, we checked whether training could complete normally. As a result, training completed normally within 500,000 epochs up to N = 8, consistent with the results in the previous section. Although this value of N can vary slightly depending on the stopping condition, debating whether it is 7, 8, or 9 is not meaningful; rather, it indicates roughly how much memory the memorizer–recaller network can retain under the current parameter size and architecture design. As part of the memorizer, we use Formula 2.4. In preliminary experiments, modifying this formula allowed the learning process for generating the sorting algorithm to proceed better. With this kind of ingenuity, we may be able to slightly increase N. However, the system we want to develop should not generate memory vectors tailored to a size N known in advance; hence, N should not be included in the arguments of the memorizer. To improve performance, some other mechanism appears necessary.
Next, with N = 8, the constructed memorizer–recaller network was asked to solve sorting problems. An output was considered correct only when the order of the elements output by the recaller perfectly matched the sorted order of the input elements. Over 1024 test trials, the network sorted accurately in 893, so the accuracy of this sorting algorithm was 0.872. The sorting algorithm generated in this study cannot provide a perfect answer and can only sort up to 8 numbers, but better results might be achieved by adjusting the network architecture of the memorizer–recaller, optimizing the parameter size, or making the learning termination condition more stringent.
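The all-or-nothing scoring used here can be written as a short evaluation loop. The sketch below is an assumed reconstruction of the protocol, not the paper's code: keys are drawn uniformly from [0, N), and a trial counts as correct only if every position of the predicted ordering matches the true one. For illustration, the model under test is Python's built-in `sorted`, which of course scores 1.0; the trained network reached 0.872 under the same criterion.

```python
import random

def exact_match_accuracy(model_sort, n_trials=1024, n=8, seed=0):
    """Score a sorter: a trial is correct only if every position matches."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        keys = [rng.uniform(0, n) for _ in range(n)]  # keys ~ Uniform(0, N)
        predicted = model_sort(keys)                  # model under test
        if predicted == sorted(keys):                 # all-or-nothing criterion
            correct += 1
    return correct / n_trials

# A perfect sorter scores 1.0 under this criterion.
print(exact_match_accuracy(sorted))
```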
As expected, it was thus possible to generate a sorting algorithm using the memorizer–recaller network. Among the various sorting algorithms, the worst-case time complexity of comparison sorting, the most standard approach, is O(N log N) for N input numbers. In contrast, the computational complexity of the sorting algorithm generated with the memorizer–recaller is always O(N): one memorizer step per input element and one recaller step per query. This experimental result reinforces the evidence that the memorizer–recaller network can hold input information in the memory vector, and it demonstrates the network's applicability to actual problems.
2.4 Towards an Advanced Memory System in Artificial Intelligence
First, we summarize the terms used so far and our interpretation of them. We believe the long-term memory possessed by several LLMs is a form of external memory. We also consider both the appendable memory developed in this research and our eventual goal, permanent memory, to be external memory. Long-term memory stores information in the form of natural language, which LLMs can process; however, its format imposes constraints on the length and quantity of the stored information. In contrast, the external memory generated by the memorizer is what we call appendable memory: a finite-sized memory vector to which information can be appended. Moreover, the recaller can refer to this compressed memory and restore information from it. While the recaller uses these memory vectors to produce information, LLMs receive lengthy inputs containing past memory and output natural language; we consider that such mechanisms do not perform restoration from memory in their operation.
In this study, we aimed to construct a system in which artificial intelligence can acquire knowledge even after the model has been trained, and we created the memorizer–recaller network. Furthermore, we attempted to construct a permanent memory system in which artificial intelligence can retain all sorts of information in the world. The memorizer–recaller system constructed in this study succeeded in holding separately inputted information in the memory vector of the memorizer and retrieving it with the recaller. The memory vectors generated by the memorizer are external memories that can be saved and utilized later. Additionally, this external memory is dynamic: new information can be appended to it. This is where our approach differs from the artificial intelligence developed thus far. Furthermore, the recaller is capable of restoring information from these memory vectors. We also succeeded in constructing a sorting algorithm using this network, demonstrating that the memorizer can correctly encode input information into the memory vector and that the recaller can appropriately use that information.
The learning method proposed in this study randomizes the training dataset for each epoch, introducing randomness into the learning process. This prevents the model from learning patterns inherent in any fixed training dataset. While reinforcement learning can make models acquire procedures rather than patterns in data, this study did not use such a mechanism; to our knowledge, achieving this through random numbers in the learning method is new. Random numbers are also used in diffusion models [7] and generative adversarial networks (GANs). This study was inspired by those works, but since diffusion models and GANs still learn patterns in their training data, our approach is considered different from them.
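The per-epoch randomization can be sketched as follows. This is an assumed reconstruction, with hypothetical names (`make_epoch_dataset` is not from the paper): a fresh random key–value dataset is drawn every epoch, so there is no fixed training set whose patterns the model could memorize; it can only succeed by learning the memorization procedure itself.

```python
import random

def make_epoch_dataset(n_pairs, n_classes, rng):
    """Draw a fresh random key-value dataset; a new one is generated every
    epoch, so no fixed pattern exists in the training data to be memorized."""
    keys = [rng.random() for _ in range(n_pairs)]
    values = [rng.randrange(n_classes) for _ in range(n_pairs)]
    return list(zip(keys, values))

rng = random.Random(42)
epoch1 = make_epoch_dataset(5, 10, rng)  # data for epoch 1
epoch2 = make_epoch_dataset(5, 10, rng)  # data for epoch 2
print(epoch1 != epoch2)                  # the datasets differ between epochs
```

Because every epoch presents unseen pairs, the only strategy that reduces the loss is to store the inputs in the memory vector and read them back, rather than to fit regularities of a static dataset.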
On the other hand, although the memorizer–recaller network was developed to allow artificial intelligence to hold an arbitrary amount of information, the amount the network could actually memorize was only up to 8 items. Based on preliminary experiments, this number can likely be somewhat improved by increasing the parameter size of the network, but that is not a fundamental solution; some kind of breakthrough is considered necessary. The insufficient memory of the memorizer–recaller network may be related to catastrophic forgetting [8]. Catastrophic forgetting originally refers to what occurs when a model learns one task and then another. Since the information input to the memorizer in this study does not involve separate tasks, it does not seem related to this phenomenon at first glance. However, as the experimental results show, the memorizer–recaller network completely forgets information before a certain point, so some kind of forgetting is undoubtedly occurring. Several methods have been proposed to prevent catastrophic forgetting, including Elastic Weight Consolidation [9] and Deep Generative Replay [10]. We were unable to find a method that can be directly applied to the memorizer–recaller network, but with some ingenuity, these existing methods may become applicable. We plan to examine this in the future.
3 Conclusion
In this study, we set out to construct a system in which artificial intelligence can acquire knowledge even after the model has been trained, resulting in the creation of the memorizer–recaller network, which can generate appendable memory. The memorizer–recaller system developed in this study successfully held separately inputted information in the memory vector of the memorizer and retrieved it using the recaller. Additionally, the learning method proposed in this research randomizes the training dataset for each epoch, preventing the model from learning patterns inherent in the training dataset. This approach sets our work apart from previously developed artificial intelligence systems.
However, despite the original goal of the memorizer–recaller network to hold an arbitrary amount of information, the capacity of the memory vector was limited to storing only up to 8 pieces of information. Addressing this issue will require a breakthrough. Although the results of this study are preliminary, the nature of appendable memory differs from that of existing memory mechanisms, and the method for training the memorizer–recaller is fundamentally different from previous neural network learning methods and is considered groundbreaking. In the future, we will continue development with the expectation that it will serve as a foundational method for building artificial intelligence capable of storing any information in finite-sized data and utilizing it freely.
Funding
This work was supported in part by the Top Global University Project from the Ministry of Education, Culture, Sports, Science, and Technology of Japan (MEXT).
Footnotes
We have added a subsection to the Experiments and Discussion section.