ProSelfLC: Progressive Self Label Correction Towards A Low-Temperature Entropy State

There is a family of label modification approaches including self and non-self label correction (LC), and output regularisation. They are widely used for training robust deep neural networks (DNNs), but have not been mathematically and thoroughly analysed together. We study them and discover three key issues: (1) We are more interested in adopting Self LC as it leverages its own knowledge and requires no auxiliary models. However, it is unclear how to adaptively trust a learner as the training proceeds. (2) Some methods penalise while the others reward low-entropy (i.e., high-confidence) predictions, prompting us to ask which one is better. (3) Using the standard training setting, a learned model becomes less confident when severe noise exists. Self LC using high-entropy knowledge would generate high-entropy targets. To resolve the issue (1), inspired by a well-accepted finding, i.e., deep neural networks learn meaningful patterns before fitting noise, we propose a novel end-to-end method named ProSelfLC, which is designed according to the learning time and prediction entropy. Concretely, for any data point, we progressively and adaptively trust its predicted probability distribution versus its annotated one if a network has been trained for a relatively long time and the prediction is of low entropy. For the issue (2), the effectiveness of ProSelfLC defends entropy minimisation. By ProSelfLC, we empirically prove that it is more effective to redefine a semantic low-entropy state and optimise the learner toward it. To address the issue (3), we decrease the entropy of self knowledge using a low temperature before exploiting it to correct labels, so that the revised labels redefine low-entropy target probability distributions. We demonstrate the effectiveness of ProSelfLC through extensive experiments in both clean and noisy settings, and on both image and protein datasets.


INTRODUCTION
Label modification is a supervision improvement strategy for model optimisation. It redefines the target probability distribution of a data point by combining a one-hot distribution, which is the target when no label modification is applied, with another distribution, which could be either predicted or predefined. The existing target (label) modification algorithms can be roughly categorised into two types: (1) Output regularisation (OR), including label smoothing (LS) [53,66] and confidence penalty (CP) [57]. OR penalises overconfident predictions to regularise deep neural networks; (2) Label correction (LC). LC can not only correct the semantic classes of noisy probability distributions, but also regularise the trained models by adding the similarity structure information over training classes to one-hot probability distributions, so that the learning targets are aware of the similarity hierarchy over training data. LC can be further divided into two subclasses: Non-self LC and Self LC. The former requires extra learners, hence the name "Non-self". Accordingly, Self LC means that a model bootstraps itself during training. A widely adopted representative of Non-self LC is knowledge distillation (KD). KD supervises a model using the predictions of other model(s), usually named teacher(s) [27]. Self LC methods include Pseudo-Label [37], bootstrapping (Boot-hard and Boot-soft) [59], Joint Optimisation (Joint-hard and Joint-soft) [67], and Tf-KD self [90]. We display an overview in Fig. 1, with detailed mathematical analysis in Section 2 and Table 1.
Firstly, we are interested in adopting Self LC in practice for three reasons: (1) OR methods naively penalise confident outputs without leveraging easily accessible knowledge from other learner(s) or itself (Fig. 1a); (2) Non-self LC requires auxiliary models to generate accurate predictions (Fig. 1b). (3) Self LC leverages a model's self knowledge and does not need extra learner(s). But we note a core question which is not well answered: How much should we trust a learner to leverage its knowledge to revise labels as training proceeds?
As illustrated in Fig. 1b, in Self LC, we have two labels for any data point: a predefined one-hot q and a predicted p (a.k.a. self knowledge). The learning target is redefined to be (1 − ε)q + εp, where ε defines the trust score of p. In existing methods, ε is fixed without considering the fact that a model's knowledge could improve as the training proceeds. Taking bootstrapping [59] as an example, ε is fixed throughout the training. Joint Optimisation, in contrast, trains a model stage-wise. Concretely, it fully trusts predicted probability distributions when a stage ends and uses them as the targets of the next stage, i.e., ε = 1. Tf-KD self trains a model in two stages: ε = 0 in the first one, while ε is tuned for the second one. We remark that stage-wise training requires significant human intervention, e.g., deciding the duration of each stage and tuning ε for the next stage, and is therefore time-consuming in practice. To improve Self LC, we propose a novel method named Progressive Self Label Correction (ProSelfLC), which is end-to-end trainable and needs negligible extra cost. Most importantly, ProSelfLC modifies the target progressively and adaptively as training goes on. Two design inspirations of ProSelfLC are: (1) If a model learns from scratch, its predictions are unreliable in the early phase, so that human annotations have to be relied on for supervision even though they could be noisy; (2) As time progresses, the model learns semantically meaningful patterns before fitting noise, even when severe label noise exists [4]. Therefore, we can leverage a model's accurate and confident knowledge to revise pre-annotated labels, so that the model does not fit noise.
Fig. 1: Target modification includes OR (LS and CP), and LC (Self LC and Non-self LC). Assume there are three training classes. q is the one-hot target. u is a uniform probability distribution and p denotes a predicted one. ε ∈ [0, 1] is the coefficient. (a) OR includes LS [66] and CP [57]. LS softens a target by adding a uniform label distribution. CP changes the probability 1 to a smaller value 1 − ε in the one-hot target. The double-ended arrow means factual equivalence, because an output is definitely non-negative after a softmax layer. (b) LC contains Self LC [37,59,67,90] and Non-self LC [27]. The parameter ε defines how much a predicted label distribution is trusted.
Secondly, note that OR methods penalise low entropy while LC rewards it, intuitively leading to the second vital question: Should we penalise a low-entropy status or reward it?
Entropy minimisation is the most widely used principle in unsupervised and semi-supervised machine learning scenarios [18,19,24,36,63]. In standard supervised classification, minimising categorical cross entropy (CCE) also optimises a model towards a minimum-entropy state defined by one-hot labels. However, in large-scale machine learning, where noisy data generally exists, confidence penalty has recently become popular for reducing the fitting of noise. In contrast, we empirically show that it is more effective to reward a semantically meaningful low-entropy state redefined by ProSelfLC. By showing the effectiveness of ProSelfLC, we defend entropy minimisation against the recent confidence penalty practices [15,53,57,66].
Thirdly, we disclose a common phenomenon which hinders a model from learning confidently towards a correct low-entropy target state. By reporting the confidence metrics in Fig. 2 and Fig. 3, we reveal this phenomenon: using the standard CCE loss, when the training data contains severe noise, a deep model has much lower confidence than its accuracy before it fits noise. For LC, if the predictions are high-entropy, the modified targets will be high-entropy too. Therefore, we develop an Annealed Temperature (AT) as a plug-in module to reduce the entropy of self knowledge. Empirically (see Table 9), for the Self LC methods including Boot-soft and ProSelfLC, with an AT plugged in, we are able to exploit the low-temperature (i.e., low-entropy) self knowledge to redefine a corrected low-entropy target state. Consequently, a model can learn confidently and generalise well.
We summarise our main contributions as follows:
• We provide a theoretical study on common target modification methods through entropy and KL divergence [34]. Accordingly, we reveal their drawbacks and propose ProSelfLC as a solution. ProSelfLC can: (1) enhance the similarity structure information over training classes; (2) correct the semantic classes of noisy label distributions. ProSelfLC is the first method to progressively and adaptively trust low-temperature self knowledge.
• We uncover a finding which complements the recent findings [4,20,52,92]: when higher label noise exists, deep models are significantly less confident of learning semantically meaningful patterns before fitting noise. Correspondingly, we propose to decrease the entropy of self knowledge using an AT and learn towards a revised low-temperature entropy state.
• Our extensive experiments: (1) defend the entropy minimisation principle; (2) demonstrate ProSelfLC's effectiveness in clean and noisy settings of two very diverse data domains, i.e., image and protein datasets. This demonstrates the general applicability of our method.
network z consists of an embedding network $f(\cdot): \mathbb{R}^d \to \mathbb{R}^k$ and a linear classifier $g(\cdot): \mathbb{R}^k \to \mathbb{R}^c$, i.e., $z_i = z(x_i) = g(f(x_i))$. For brevity of analysis, whenever there is no confusion, we take a data point (x_i, y_i) and omit its index so that it is denoted by (x, y). The linear classifier is usually the last fully-connected layer. Its output is named the logit vector $z \in \mathbb{R}^c$. We produce its classification probabilities p by normalising the logits using a softmax function: $p(j|x) = \exp(z_j) / \sum_{m=1}^{c} \exp(z_m)$, where p(j|x) is the probability of x belonging to class j. Its corresponding ground-truth is usually denoted by a one-hot representation q: q(j|x) = 1 if j = y; q(j|x) = 0 otherwise. Our definition of knowledge confidence is:
Definition 1 (Knowledge Confidence). A model's knowledge with respect to x is defined by p. The knowledge confidence measures how certain p is, and is mathematically defined by how distant p is from a uniform distribution $u \in \mathbb{R}^c$ with $u_j = 1/c$ for all j.
We can calculate the knowledge confidence using two formulations, conf_top(p) and conf_all(p). conf_top(p) is widely adopted in [20,35,52] to measure the miscalibration degree between confidence and accuracy, while conf_all(p) is our proposed confidence metric. "top" indicates that only the top probability is used, while "all" denotes that all probabilities are considered. Both metrics are agnostic to the semantic class and accuracy. Generally, they have a strong positive correlation and are thus interchangeable in practice.
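The quantities in this paragraph can be illustrated with a small pure-Python sketch. The softmax follows the standard definition; the two confidence functions are assumptions made for illustration only, since their formulas are not given in the text above: `conf_top` is taken to be the largest class probability and `conf_all` to be one minus the entropy of p normalised by the entropy of the uniform distribution u.

```python
import math


def softmax(logits):
    """Normalise a logit vector z into class probabilities p."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy H(p) in nats, treating 0 * log 0 as 0."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0)


def conf_top(p):
    """Assumed form: the largest class probability."""
    return max(p)


def conf_all(p):
    """Assumed form: 1 - H(p) / H(u), with u uniform over the c classes."""
    return 1.0 - entropy(p) / math.log(len(p))


uniform = softmax([0.0, 0.0, 0.0])  # a maximally uncertain prediction
peaked = softmax([5.0, 0.0, 0.0])   # a confident prediction
```

Under these assumed forms, both metrics are lowest for the uniform prediction and grow as the prediction moves away from u, regardless of which class is predicted.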

2.1 Semantic class and similarity structure in p
$q \in \mathbb{R}^c$ provides semantic information about the probabilities of x belonging to different training classes. We could also interpret q(j|x) as the similarity between x and the j-th class. Consequently, q should not be exactly one-hot, and is proposed to be corrected during training, so that it can define a more structured target probability distribution. For better clarity, we present two definitions:
Definition 2 (Semantic Class). Given a target label distribution $\tilde{q}(x) \in \mathbb{R}^c$, the semantic class is defined by $\arg\max_j \tilde{q}(j|x)$, i.e., the class whose probability is the largest.
Definition 3 (Similarity Structure). In $\tilde{q}(x)$, x has c probabilities of being predicted to c classes. The similarity structure of x versus c classes is defined by these probabilities and their differences.

Revisit CCE, LS, CP and LC
Standard CCE. For any input (x, y), the minimisation objective of standard CCE is $L_{\mathrm{CCE}}(q, p) = H(q, p) = E_q(-\log p)$, where H(·, ·) represents the cross entropy. $E_q(-\log p)$ denotes the expectation of the negative log-likelihood, and q serves as the probability mass function.
Label smoothing. In LS [27,66], we soften one-hot targets by adding u: $\tilde{q}_{\mathrm{LS}} = (1 - \epsilon)q + \epsilon u$. As a result, $L_{\mathrm{CCE+LS}}(q, p; \epsilon) = H(\tilde{q}_{\mathrm{LS}}, p) = E_{\tilde{q}_{\mathrm{LS}}}(-\log p)$.
Confidence penalty. CP [57] penalises highly confident, i.e., low-entropy, predictions.
Label correction. As illustrated in Fig. 1, LC is a family of algorithms, where the one-hot q is modified to a convex combination of itself and a predicted distribution: $\tilde{q}_{\mathrm{LC}} = (1 - \epsilon)q + \epsilon p$. We remark that if ε is large and p is confident in predicting a different class, i.e., $\arg\max_j p(j|x) \neq \arg\max_j q(j|x)$, then $\tilde{q}_{\mathrm{LC}}$ defines a different semantic class from q.
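As a worked illustration of the targets above, the following sketch builds the LS and LC targets as convex combinations with the one-hot q. The three-class distributions and the ε values are hypothetical numbers chosen only to show when the semantic class changes.

```python
def mix(a, b, eps):
    """Convex combination (1 - eps) * a + eps * b of two distributions."""
    return [(1.0 - eps) * ai + eps * bi for ai, bi in zip(a, b)]


def argmax(v):
    """Index of the largest entry, i.e. the semantic class of a target."""
    return max(range(len(v)), key=lambda j: v[j])


q = [1.0, 0.0, 0.0]          # one-hot annotation: class 0
u = [1.0 / 3.0] * 3          # uniform distribution
p = [0.1, 0.8, 0.1]          # hypothetical prediction, confident in class 1

q_ls = mix(q, u, 0.3)        # label smoothing target
q_lc_small = mix(q, p, 0.3)  # label correction with a small trust score
q_lc_large = mix(q, p, 0.7)  # label correction with a large trust score
```

With these numbers, LS keeps class 0 and assigns identical probabilities to the two other classes, whereas LC with ε = 0.7 moves the semantic class to the predicted class 1.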

Theory on CCE, LS, CP and LC
Proposition 1. Compared with the standard CCE, the learning targets are modified in LS, CP and LC.
Proposition 2. When a target probability distribution $\tilde{q}$ is fixed, minimising the cross entropy $H(\tilde{q}, p)$ is equivalent to minimising the KL divergence [34] of $\tilde{q}$ from p, i.e., the relative entropy of $\tilde{q}$ with respect to p.
Proof. Let $D_{\mathrm{KL}}(\cdot\|\cdot)$ denote the KL divergence. We have $H(\tilde{q}, p) = D_{\mathrm{KL}}(\tilde{q}\|p) + H(\tilde{q}, \tilde{q})$. As $\tilde{q}$ is fixed, $H(\tilde{q}, \tilde{q})$ is a constant, so that we can leave it out of loss minimisation.
Proposition 3. Some KD methods, which aim to minimise the KL divergence between the predictions of a teacher and a student, belong to the family of label correction.
Proof. In general, a loss function of such methods can be defined to be $L_{\mathrm{KD}}(q, p_t, p) = (1 - \epsilon)H(q, p) + \epsilon D_{\mathrm{KL}}(p_t\|p)$ [90]. We have $D_{\mathrm{KL}}(p_t\|p) = H(p_t, p) - H(p_t, p_t)$, and $p_t$ is from a teacher and fixed when training a student. We can therefore omit $H(p_t, p_t)$: $L_{\mathrm{KD}}(q, p_t, p) = (1 - \epsilon)H(q, p) + \epsilon H(p_t, p) = H((1 - \epsilon)q + \epsilon p_t, p)$. Consistent with LC in Eq. (6), $L_{\mathrm{KD}}(q, p_t, p)$ revises a label using the knowledge $p_t$.
Proposition 4. Compared with CCE, LS and CP penalise entropy minimisation while LC rewards it.
Proposition 5. In CCE, LS and CP, a data point x has the same semantic class. In addition, x has an identical probability of belonging to other classes except for its semantic class.
The proofs of Propositions 4 and 5 are presented in Appendix A. Only LC exploits informative knowledge and has the ability to correct labels, while LS and CP only relax the hard targets. We summarise CCE, LS, CP and LC in Table 1. Constant terms are ignored for concision.
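The two identities used in the proofs of the cross-entropy/KL equivalence and of the KD result can be checked numerically. The sketch below uses hypothetical three-class distributions and a hypothetical ε = 0.4; it is a sanity check, not part of the original derivation.

```python
import math


def cross_entropy(a, b):
    """H(a, b) = -sum_j a_j * log b_j, treating 0 * log(.) as 0."""
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b) if ai > 0.0)


def kl(a, b):
    """D_KL(a || b) = sum_j a_j * log(a_j / b_j)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


q = [1.0, 0.0, 0.0]    # one-hot annotation
p_t = [0.6, 0.3, 0.1]  # hypothetical teacher prediction (fixed)
p = [0.5, 0.4, 0.1]    # hypothetical student prediction
eps = 0.4

# Corrected target (1 - eps) * q + eps * p_t.
target = [(1.0 - eps) * qi + eps * ti for qi, ti in zip(q, p_t)]

# Cross entropy equals KL divergence plus the (constant) entropy of the target.
ce = cross_entropy(target, p)
kl_plus_const = kl(target, p) + cross_entropy(target, target)

# The KD loss differs from the LC loss only by the constant eps * H(p_t, p_t).
kd_loss = (1.0 - eps) * cross_entropy(q, p) + eps * kl(p_t, p)
lc_loss = cross_entropy(target, p)
constant = eps * cross_entropy(p_t, p_t)
```

Because cross entropy is linear in its first argument, `kd_loss + constant` coincides with `lc_loss`, which is why the KD objective can be read as label correction with the teacher's prediction.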

LABEL CORRECTION
In the standard CCE, the semantic class is considered while the similarity hierarchy over all classes is ignored. This is mainly due to the difficulty of annotating the similarity structure for every x, especially when c is large [84]. Recent progress demonstrates that there are some effective approaches to define the similarity structure over samples without annotation: (1) In KD, an auxiliary teacher model provides a student model the similarity hierarchy information [27,53]; (2) In Self LC, e.g., Boot-soft, a model can bootstrap itself by exploiting the knowledge it has learned so far. We focus on the end-to-end Self LC and improve it in this work.
In Self LC, ε indicates how much a predicted label distribution is trusted. In ProSelfLC, we propose to set it adaptively according to the learning time and prediction entropy. For any x, we summarise its equations of loss L, label and trust below.
Loss: $L = H(\tilde{q}_{\mathrm{ProSelfLC}}, p)$;
Label: $\tilde{q}_{\mathrm{ProSelfLC}} = (1 - \mathrm{trust}(t, p))q + \mathrm{trust}(t, p)p$; (9)
Self trust, ProSelfLC: $\mathrm{trust}(t, p) = g(t) \times l(p)$, with $g(t) = h(t/\Gamma - \Theta, B)$ and l(p) ∈ {1, conf_top(p), conf_all(p)}.
t and Γ are the iteration counter and the number of total iterations, respectively. $h(\eta, B) = 1/(1 + \exp(-\eta \times B))$ defines a sigmoid curve. Here, η = t/Γ − Θ, where Θ ∈ [0, 1]. Θ decides the inflection point. B and Γ are task-dependent and can be chosen according to a validation set in practice. We show three options to compute the local trust l(p), being either a constant or knowledge confidence-dependent. Γ and Θ are highly correlated; therefore, we fix Θ = 0.5 and only tune Γ, following the standard practice. For brevity, we refer to trust(t, p) as ProSelfLC and make them interchangeable when there is no confusion.
Table 1: In the upper block, we display their learning targets and loss calculations using the equal and interchangeable cross entropy and KL divergence. In the bottom block, we present their properties from the viewpoints of entropy minimisation, semantic class and structure.
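A minimal sketch of the trust computation, assuming the global trust is the logistic curve h evaluated at η = t/Γ − Θ and that the confidence-dependent local trust is one minus the normalised entropy of p (this form of the local trust is an assumption for illustration). The values Γ = 1000 and B = 16 and the prediction p are hypothetical.

```python
import math


def global_trust(t, total_iters, B, theta=0.5):
    """g(t): logistic curve h(eta, B) with eta = t / Gamma - Theta."""
    eta = t / total_iters - theta
    return 1.0 / (1.0 + math.exp(-eta * B))


def local_trust(p):
    """l(p): assumed confidence form, 1 - H(p) / H(u)."""
    h = -sum(pj * math.log(pj) for pj in p if pj > 0.0)
    return 1.0 - h / math.log(len(p))


def proselflc_target(q, p, t, total_iters, B, theta=0.5):
    """Return the corrected target (1 - trust) * q + trust * p and the trust."""
    trust = global_trust(t, total_iters, B, theta) * local_trust(p)
    target = [(1.0 - trust) * qj + trust * pj for qj, pj in zip(q, p)]
    return target, trust


q = [1.0, 0.0, 0.0]     # annotated class 0 (possibly a noisy label)
p = [0.02, 0.96, 0.02]  # hypothetical confident prediction of class 1

early_target, early_trust = proselflc_target(q, p, t=100, total_iters=1000, B=16)
late_target, late_trust = proselflc_target(q, p, t=1000, total_iters=1000, B=16)
```

Early in training the trust is close to zero, so the target stays near the annotation and only its similarity structure changes slightly. Late in training the trust exceeds 0.5 for this confident prediction, so the semantic class of the target switches to the predicted class.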

The design inspirations of self trust: ProSelfLC
Global trust. g(t) denotes how much we trust a learner overall. g(t) grows as t rises and is independent of data points, thus being global. B adjusts the exponentiation's base and the growth speed of g(t). Theoretically and practically, g(t) could take many other forms. For the sigmoid function h, without loss of generality, we use a logistic function. In practice, there are many alternatives, e.g., generalised logistic functions, hyperbolic tangent functions, and smoothstep functions. We leave the exploration of these alternatives to future work.
The design of g(t) is inspired by the human learning process. In the earlier learning phase, i.e., t < Γ × Θ, g(t) < 0.5 ⇒ ProSelfLC < 0.5, ∀p, so that the predefined supervision dominates and ProSelfLC only modifies the similarity structure slightly. This is because, when a learner has not seen the training data enough times, its knowledge p with respect to x is less reliable. In the later training phase, i.e., t > Γ × Θ, we have g(t) > 0.5. Γ × Θ represents the global trust inflection time.
Local trust. l(p) represents how much we trust p. If l(p) = 1, all predictions are treated equally. When l(p) = conf_top(p) or conf_all(p), a more confident prediction has a higher local trust. l(p) is designed to regularise the later learning phase. If p is of higher entropy, l(p) is lower, hence ProSelfLC is smaller. If p is highly confident, we trust it and ProSelfLC is large.
We will empirically discuss g(t) using different B and the three l(p) options in Section 5.5, where conf_top(p) and conf_all(p) are found to work better than their constant counterpart.

Cases analysis
Due to the potential memorisation in the earlier phase (though less likely to be severe), we may get undesired confidently wrong predictions for noisy labels, but their trust scores are small as g(t) is small. We conduct the cases analysis of ProSelfLC in Table 2 and summarise its core tactics as follows: (1) Correct the similarity structure for every data point in all cases, thanks to exploiting the growing self knowledge of a learner as its training proceeds.
(2) Revise the semantic class when t is large enough and p is confidently inconsistent with q. When both conditions are met, as highlighted in Table 2, we have ProSelfLC > 0.5 and $\arg\max_j p(j|x) \neq \arg\max_j q(j|x)$. Therefore, p redefines the semantic class.
We emphasise that ProSelfLC also becomes robust against lengthy exposure to the noisy data, which is empirically demonstrated in Fig. 3, Fig. 4, Fig. 5, and Table 4.

Generic coarse signed calibration error
We first revisit the Expected Calibration Error (ECE) using multiple bins and then present our definition of the Generic coarse Signed Calibration Error (GSCE). Formally, according to [8], with respect to any data point (x, y), the network z is perfectly calibrated if, for every class j, the predicted probability p(j|x) equals the true probability that x belongs to class j given that prediction. Recently, a weaker but more practical condition [20,35,52], where only the most likely class is considered, has been adopted and is named argmax or top-label calibration (Eq. (14)). To approximately compute Eq. (14), an ECE estimator with multiple bins is proposed [20,35,52]. It has three steps: (1) the samples are bucketed into m bins (i.e., groups) $G_1, G_2, \ldots, G_m$ based on conf_top(p); (2) for each group, the absolute error between the confidence mean $\mathrm{conf}(G_i)$ and the accuracy $\mathrm{accu}(G_i)$ is computed; (3) we calculate the ECE as the expected error over bins. Let $|G_i|$ denote the number of samples in $G_i$ and ⟦·⟧ the Iverson bracket. We summarise: $\mathrm{accu}(G_i) = \frac{1}{|G_i|} \sum_{(x,y) \in G_i} [\![\arg\max_j p(j|x) = y]\!]$, $\mathrm{conf}(G_i) = \frac{1}{|G_i|} \sum_{(x,y) \in G_i} \mathrm{conf}_{\mathrm{top}}(p)$, and $\mathrm{ECE} = \sum_{i=1}^{m} \frac{|G_i|}{\sum_{k=1}^{m} |G_k|} \, |\mathrm{accu}(G_i) - \mathrm{conf}(G_i)|$.
The ECE with multiple bins has two main inconveniences as a tool for measuring miscalibration: (1) it depends on the number of bins and the distribution of confidences; (2) if an ECE is large, it is unclear whether a model is over-confident or under-confident. Therefore, we propose a signed, simpler and faster-to-compute alternative to capture the degree of miscalibration, named GSCE, which is the mean confidence minus the accuracy, both computed over all samples as a single bin. GSCE uses a single bin, thus being a coarser metric than Eq. (17). However, it is more generic because conf(·) could be conf_all(·) and other variants in addition to conf_top(·) in Eq. (17). A positive GSCE represents an over-confident miscalibration, while a negative GSCE denotes an under-confident one.
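The three ECE steps and the single-bin GSCE can be sketched as follows. Equal-width bins on [0, 1] and the sign convention "mean confidence minus accuracy" are assumptions consistent with the description above (positive means over-confident); the confidence values and correctness flags are hypothetical.

```python
def ece(confidences, correct, num_bins=10):
    """Multi-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * num_bins), num_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((c, ok))
    total = 0.0
    for group in bins:
        if not group:
            continue
        conf_mean = sum(c for c, _ in group) / len(group)
        accu = sum(1 for _, ok in group if ok) / len(group)
        total += len(group) / n * abs(accu - conf_mean)
    return total


def gsce(confidences, correct):
    """Single-bin signed error: mean confidence minus accuracy."""
    n = len(confidences)
    return sum(confidences) / n - sum(1 for ok in correct if ok) / n


# Hypothetical under-confident model: 75% accurate, mean confidence 0.6.
under_conf = [0.55, 0.6, 0.65, 0.6]
under_ok = [True, True, True, False]

# Hypothetical over-confident model: 50% accurate, mean confidence 0.9.
over_conf = [0.9, 0.95, 0.9, 0.85]
over_ok = [True, False, False, True]
```

Unlike the ECE, the GSCE keeps the sign, so the first model yields a negative value and the second a positive one. By the triangle inequality, the multi-bin ECE is never smaller than the absolute GSCE computed from the same confidences.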

Miscalibration analysis: a model has much lower confidence than accuracy before fitting noise
There exist three vital findings about the learning behaviours of deep networks: (1) deep models easily fit random noise [92]; (2) deep networks learn simple semantic patterns before fitting noise [4]; (3) modern deep neural networks tend to be over-confident [20,52]. In this section, we disclose a new notable one: when higher label noise exists, deep models are significantly less confident of learning semantically meaningful patterns before fitting noise.
This discovery is illustrated in Fig. 2 and Fig. 3, with details below. In Fig. 2, we have some counter-intuitive observations: for both small (ShuffleNetV2) and large (ResNet18) networks, regardless of using GSCE top or GSCE all as the miscalibration metric, the model is over-confident on the noisy training data while under-confident on the clean training data, and has a small miscalibration on the test data. However, it is well known that a model trained using the CCE generalises poorly when noise exists [4,92].
In Fig. 3, the learning track over iterations helps us comprehend the observations in Fig. 2. We observe: (1) the model is highly miscalibrated and has much higher accuracy than its confidence across three sets before fitting noise (i.e., before the accuracy of the noisy subset starts to drop); (2) the miscalibration becomes more dramatic when r changes from 20% to 40%; (3) when the model starts to fit noise at around 21k iterations, the miscalibration hits a negative peak.
[Figure caption: ResNet18 on CIFAR-100 using the standard CCE. The symmetric noise rate is r. For a more transparent and stratified analysis, we store models of different iterations and report the results of three sets, i.e., test set, clean and noisy subsets of the training data. All results are multiplied by 100 before plotting.]

Integrate an AT into ProSelfLC
Inspired by the analysis in subsection 4.2, we apply a low temperature T to decrease the entropy of self knowledge without affecting its accuracy. By exploiting it, we are able to define a revised low-entropy target state. Mathematically, instead of using p to correct labels, we use its temperature-scaled counterpart $p_T(j|x) = \exp(z_j/T) / \sum_{m=1}^{c} \exp(z_m/T)$. An annealed temperature (denoted by AT, 0 < T < 1) consistently works better, as demonstrated by our extensive experiments, e.g., Fig. 4. For a comprehensive analysis, we also study AT integrated with other target modification approaches in Section 5.5. Interestingly, AT boosts the low-entropy rewarding algorithms (i.e., Boot-soft and ProSelfLC) but does not help the low-entropy penalising method (i.e., CP).
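To illustrate why a low temperature lowers the entropy of self knowledge without affecting its accuracy, the sketch below applies a temperature-scaled softmax, assuming the usual form in which the logits are divided by T before normalisation. The logits and the value T = 0.5 are hypothetical.

```python
import math


def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by the temperature T (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0.0)


logits = [2.0, 1.0, 0.5]
p_standard = softmax_with_temperature(logits, T=1.0)
p_low_temp = softmax_with_temperature(logits, T=0.5)
```

Dividing by T < 1 stretches the gaps between logits, so the distribution becomes sharper while the ranking of classes, and hence the predicted class, stays the same.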

EXPERIMENTS
In deep learning, due to the stochastic batch-wise training scheme, small implementation differences (e.g., random accelerators like cudnn and different frameworks like Caffe [28], Tensorflow [1] and PyTorch [55]) may lead to a large gap in final performance. Therefore, for a fairer comparison, we re-implement CCE, LS and CP using PyTorch. To allow for deterministic behaviour in all experiments, we follow the PyTorch reproducibility guidelines to fix all randomness sources, e.g., using identical seeds in all experiments, avoiding nondeterministic algorithms, and setting the CUDA environment variable CUBLAS_WORKSPACE_CONFIG=:4096:8. Regarding Self LC methods, we re-implement Boot-soft [59], where ε is fixed throughout training. We do not re-implement stage-wise Self LC and KD methods, e.g., Joint Optimisation and Tf-KD self respectively, because time-consuming tuning is required. In addition, our ProSelfLC can also be treated as an iteration-wise Self LC method.
By default, in clean and synthetic noisy cases, we train on 80% of the training data (corrupted in synthetic noisy cases) and use 20% trusted training data as a validation set to search all hyperparameters, e.g., Γ, ε, B, T and the settings of an optimiser. Note that Γ and an optimiser's settings are searched first and then fixed for all methods. Finally, we retrain a model on the entire training data (corrupted in synthetic noisy cases) and report its accuracy on the test data to fairly compare with prior results. For real-world label noise, the datasets used have a separate clean validation set. Here, we use the clean validation set only for validating hyperparameters, rather than for training a network's learnable parameters as done in [26,29,43,60,68,69,87,100].
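The randomness-fixing steps mentioned above might look roughly like the sketch below. It is an illustration under assumptions rather than the authors' setup: the helper name `fix_randomness` and the seed value are hypothetical, and the PyTorch calls are only made when `torch` is importable.

```python
import os
import random


def fix_randomness(seed=123):
    """Fix common randomness sources; returns True if PyTorch was configured."""
    # The cuBLAS workspace setting must be in place before CUDA work starts.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)  # Python's own random number generator
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed; only stdlib sources are fixed
    torch.manual_seed(seed)                   # identical seed for every run
    torch.backends.cudnn.benchmark = False    # disable cuDNN auto-tuning
    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    return True


fix_randomness(0)
first_draw = random.random()
fix_randomness(0)
second_draw = random.random()  # same value as first_draw
```

Calling the helper once at the start of every experiment, with the same seed, is what makes runs of different methods comparable.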

Compare with baselines on clean CIFAR-100
Dataset and training details. CIFAR-100 [33] has 20 coarse categories, each containing 5 fine classes. There are 500 and 100 images per class in the training and testing sets, respectively. The image size is 32×32. We apply simple data augmentation [25], i.e., we pad 4 pixels on every side of the image, and then randomly crop it with a size of 32×32. Finally, this crop is horizontally flipped with a probability of 0.5. We train the widely used ShuffleNetV2 [48] and ResNet-18 [25]. SGD is used with its settings as: (a) a learning rate of 0.2; (b) a momentum of 0.9; (c) a batch size of 128 and 39k training iterations, i.e., 100 epochs. We divide the learning rate by 10 at 20k and 30k iterations. The weight decay is 2e-3 for ResNet-18 and 1e-3 for ShuffleNetV2, because ResNet-18 has a larger fitting capacity.

Table 3: Test accuracy (%) on CIFAR-100 clean test set in the clean setting. We put the intermediately obtained best accuracy in the bracket. A lower drop from an intermediate best accuracy to the final one can be interpreted as a learner's higher robustness against a long time being exposed to the training data. We bold the best results. The criterion of "best" is the highest intermediate accuracy followed by the highest final accuracy. Columns: Method, ShuffleNetV2, ResNet-18.
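The schedule arithmetic in the training details can be checked with a short sketch. It assumes that the learning rate drops exactly at the stated iterations and that the training set has 50,000 images (500 per class for 100 fine classes); the function names are hypothetical.

```python
import math


def learning_rate(iteration, base_lr=0.2, milestones=(20_000, 30_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone passed."""
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr


def iterations_per_epoch(num_train=50_000, batch_size=128):
    """Number of mini-batches needed to see every training image once."""
    return math.ceil(num_train / batch_size)


total_iterations = 100 * iterations_per_epoch()  # 100 epochs
```

With a batch size of 128, one epoch takes 391 iterations, so 100 epochs correspond to 39,100 iterations, in line with the roughly 39k iterations stated above.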
Result analysis. We check the methods' sensitivity to hyper-parameters, which is generally more informative than their sensitivity to random seeds. We therefore report the mean and standard deviation over multiple hyper-parameter settings rather than over random seeds. Consequently, CCE has one run, while for LS, CP and Boot-soft, we run several different ε and T. Analogously, for ProSelfLC, we run several B and T. [20,16,12,8]. The hyper-parameter space size is the same for each method except for CCE and LS. We report the mean and standard deviation of the top three results in Table 3. Compared with the baselines, ProSelfLC performs the best for both networks. First, we do not observe a large difference among those methods in the clean setting. Therefore, we focus on noisy scenarios hereafter. Second, the sensitivity to hyper-parameters is also small. Therefore, we do not report the standard deviation in Table 4. Instead, we further discuss hyper-parameters in Section 5.5.

Compare with the state-of-the-art methods on noisy CIFAR-100
Generating noisy train labels. (1) Symmetric label noise: the original label of an image is uniformly changed to one of the other classes with a probability of r. (2) Asymmetric label noise: we follow [73] to generate asymmetric label noise. Within each coarse class, we randomly select two fine classes A and B. Then we flip r × 100% of the labels of A to B, and r × 100% of the labels of B to A. We remark that the overall label noise rate is smaller than r, because only two of the five fine classes in each coarse class are affected.
Competitors. We compare with the results reported recently in SL [73] and Topo [75]. Forward is a loss correction approach that uses a noise-transition matrix [56]. GCE denotes generalised cross entropy [99] and SL is symmetric cross entropy [73]. They are robust losses designed for handling label noise. Regarding the other robust loss functions, including focal loss (FL) [45], NLNL [32], and normalised losses [49,70,71], their results, as reported in active passive loss (APL) [49], are much worse than ours even though a deeper ResNet-34 is used there. Therefore, we do not compare with them in the table. Similarly, although TVD [96] uses ResNet-18, its reported results are much lower and are not compared in the table. The other recent approaches are Forgetting [4], Decoupling [51], MentorNet [29], Co-teaching [22], Co-teaching+ [89], IterNLD [72], RoG [38], PENCIL [88] and TopoFilter [75]. Tf-KD reg [90], SSKD [83] and Li's LC [44] are three Self LC methods.
Results analysis. Training details are the same as in Section 5.1. For all methods, we report their final results when training terminates. Therefore, we test the robustness of a model against not only label noise, but also a long time being exposed to the noise.
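The two synthetic noise types described under "Generating noisy train labels" can be sketched as follows. Several details are assumptions made for illustration: each label is flipped independently with probability r (rather than flipping an exact fraction), and fine classes 5k to 5k+4 are taken to form coarse class k, which is a hypothetical grouping and not the real CIFAR-100 hierarchy.

```python
import random


def symmetric_noise(labels, num_classes, r, rng):
    """With probability r, replace a label by a uniformly chosen other class."""
    noisy = []
    for y in labels:
        if rng.random() < r:
            noisy.append(rng.choice([c for c in range(num_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy


def asymmetric_noise(labels, coarse_to_fine, r, rng):
    """Per coarse class, pick fine classes A and B and swap their labels w.p. r."""
    swap = {}
    for fine_classes in coarse_to_fine:
        a, b = rng.sample(fine_classes, 2)
        swap[a], swap[b] = b, a
    return [swap[y] if y in swap and rng.random() < r else y for y in labels]


def noise_rate(clean, noisy):
    """Fraction of labels that differ from the clean ones."""
    return sum(1 for c, n in zip(clean, noisy) if c != n) / len(clean)


rng = random.Random(0)
labels = [c for c in range(100) for _ in range(50)]  # 50 samples per class
coarse_to_fine = [list(range(5 * k, 5 * k + 5)) for k in range(20)]

sym = symmetric_noise(labels, 100, 0.4, rng)
asym = asymmetric_noise(labels, coarse_to_fine, 0.4, rng)
sym_rate = noise_rate(labels, sym)
asym_rate = noise_rate(labels, asym)
```

Since only two of the five fine classes in each coarse class are affected, the expected overall asymmetric noise rate under these assumptions is 0.4 × r, which is why it is smaller than r.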
In Table 4, we observe that: (1) ProSelfLC outperforms all the baselines, which is significant in most cases; (2) By default, we use AT and better baseline results are obtained. Despite that, our ProSelfLC further improves the standard Self LC (i.e., Boot-soft) and is the best of all. In addition, we visualise and comprehend the dynamic learning statistics of ProSelfLC versus baselines in Fig. 5, which clarifies why ProSelfLC works better. According to Table 5, ProSelfLC is superior to other recent Self LC methods.
Table 6: Test accuracy (%) on the real-world noisy dataset Clothing1M, which contains asymmetric noise [88] and instance-dependent noise [6,58]. We note that some approaches use the full training set while others sample a label-balanced subset. As the two practices affect the performance a lot, we group the results into two columns for a clear and fair comparison. For the sampled noisy training data, each label has 18976 images, leading to about 260k images in total. * indicates online label-balanced sampling for each mini-batch. The first two blocks present the results from ICLR 2022 papers [78] and [30], respectively. The fourth block contains the results from [10]. Results of the third block are from multiple recent papers, as noted in the second column. When one method is reported in different papers and has different results, we keep the highest one only.
Revising the semantic class and similarity structure. In Fig. 5b and Fig. 5c, we show dynamic statistics of different approaches on fitting wrong labels and correcting them, respectively. ProSelfLC is much better than its counterparts. High semantic class correction means that the learned similarity structure revises the semantic class and similarity hierarchy corrupted by the noise.
To redefine and reward a low-temperature entropy state. On the one hand, LS and CP work well, being consistent with prior claims. In Fig. 5d and Fig. 5e, the entropies of both clean and noisy subsets are much higher in LS and CP; correspondingly, their generalisation is the best except for ProSelfLC in Fig. 5f. On the other hand, ProSelfLC has the lowest entropy while performing the best, which shows that a learner's confidence does not necessarily weaken its generalisation performance. Instead, a model needs to be cautious about what to be confident in. According to Fig. 5b and Fig. 5c, ProSelfLC has the lowest wrong fitting and the highest semantic class correction, which indicates that the learned model reaches a low-entropy target state redefined by corrected labels.
[...] but uses a batch size of 256. UniCon [31] integrates Autoaugment [14], contrastive learning and semi-supervised training, thus having slightly better results than ours. The other methods have been introduced above.
Experimental details. On both datasets, the ResNet-50 is pretrained on ImageNet and publicly available in PyTorch [55]. For Clothing1M, we follow the recent settings in [10,98,101] and use a small batch size of 32. The other training details are similar to Section 5.1 with small changes: we start with a learning rate of 0.01 and use a weight decay of 0.02. They are chosen according to the separate clean validation set. For Food-101N, we follow the same settings as the recent work [98]. The batch size is 128 and we train for 72k iterations. We report the mean and standard deviation of three random trials as in [10,98].
Results analysis. In Table 6 of Clothing1M results, for both label-imbalanced data and label-balanced data, ProSelfLC has the highest accuracy, which demonstrates its effectiveness against real-world asymmetric and instance-dependent label noise. In Table 7, the results of Food-101N confirm again that ProSelfLC is superior to existing algorithms.
We remark that Clothing1M and Food-101N contain fine-grained categories, thus being challenging. ProSelfLC obtains the state-of-the-art performance on both.

Train robust transformers on noisy protein classification datasets
We follow the recent ProtTrans to do experiments on protein classification [16]. Training deep models to predict a protein's properties is challenging as the length of amino acid sequences varies from several tens to multiple thousands [2,16,61]. Some recent approaches crop amino acid sequences to decrease training time and GPU memory consumption [2,16,61]. In this work, we find that cropping input sequences adds noise to model training as some proteins have important functional regions interspersed across the protein length [2]. We empirically demonstrate this by training models on the cropped proteins. In addition to cropping noise, we further design high-noise experiments by including unlabelled proteins. Conceptually, it is semi-supervised learning. We bridge semi-supervised learning and label-noise learning by assigning random labels to those unlabelled proteins, so that we establish the synthetic dataset to validate ProSelfLC for training robust protein transformers against label noise.
Datasets. The DeepLoc train set used in [16] contains 6,622 proteins that are annotated to be membrane (i.e., they are found on the membrane), water-soluble (i.e., they are from the lumen of the organelle), or unknown (missing information about where they are found) [2]. Specifically, there are 1,518 transmembrane proteins, 2,227 water-soluble proteins and 2,877 proteins with unknown labels. There are 1,842 proteins in total and 1,087 proteins with known labels in the test set. We present the cropping noise and synthetic symmetric label noise as follows:
• Cropping noise. We train on proteins with known labels. Their length ranges from 40 to 13,100, with a median of 434. Therefore, we truncate proteins longer than 434 at the end so that all amino acid sequences have a length no longer than 434. The goal of this setting is to validate whether ProSelfLC could be robust to cropping noise if there is a practical need to crop sequences for speeding up training and reducing GPU memory requirement.
• Cropping noise+Label noise. First, we keep the cropping noise as we crop proteins to decrease training time and reduce GPU memory requirement. Second, we add symmetric label noise by assigning uniform random labels to 2,877 unlabelled proteins. Cropping noise together with label noise makes the noise level high. The objective is to evaluate whether ProSelfLC could be robust to severe noise when it is expensive to remove it in practice.
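The two noise settings above can be illustrated with a minimal sketch. It is not the authors' data pipeline: the helper names (`crop`, `build_noisy_train_set`), the class names and the toy sequences are hypothetical, and only the truncation length of 434 and the uniform random labelling come from the text.

```python
import random

MAX_LEN = 434  # median sequence length reported in the text


def crop(sequence, max_len=MAX_LEN):
    """Truncate at the end so the sequence has at most max_len residues."""
    return sequence[:max_len]


def build_noisy_train_set(labelled, unlabelled,
                          classes=("membrane", "soluble"), seed=0):
    """Cropping noise + label noise (hypothetical helper).

    labelled:   list of (sequence, label) pairs with known labels
    unlabelled: list of sequences whose labels are unknown
    Unlabelled sequences receive a label drawn uniformly at random.
    """
    rng = random.Random(seed)
    data = [(crop(seq), label) for seq, label in labelled]
    data += [(crop(seq), rng.choice(classes)) for seq in unlabelled]
    return data


# Toy sequences, not real proteins.
labelled = [("M" * 40, "membrane"), ("A" * 1000, "soluble")]
unlabelled = ["G" * 13100, "K" * 300]
train_set = build_noisy_train_set(labelled, unlabelled)
```

With only two classes, roughly half of the randomly assigned labels are expected to agree with the unknown true class, so the effective noise rate is lower than the fraction of randomly labelled proteins.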
Network and training details. We train sequence transformers to classify a single amino acid sequence (without using homology information at all) as either transmembrane or water-soluble. The transformer network is a subnet of ProtBert-BFD [16], a protein language model pretrained on the BFD-100 dataset [64,65]. According to [16], to use a larger batch size, ProtBert-BFD is first trained on sequences with a maximum length of 512, then tuned on sequences with a maximum length of 2k. We name this subnet ProtBert-H16-D6, where D6 denotes that its depth is 6, i.e., a stack of 6 hidden transformer layers, and H16 means that in each transformer layer, the number of transformer blocks (a.k.a. heads) is 16. The released model has a depth of 30, so that our subnet is 5 times shallower.
Table 9: Results of target modification methods without/with annealed temperature (AT) for curating the target-state entropy. We train models on CIFAR-100 whose training labels contain symmetric noise. We do not select the intermediate best models and report the generalisation accuracy and confidence metrics on the clean test set when training terminates. There are two confidence measurements: conf_all and conf_top. The highest accuracy or confidence of each row is bolded.
[Figure caption, beginning missing] ... with symmetric label noise. A subfigure's caption describes the used network and noise rate r. We report four metrics (%) and one coloured bar per metric along the vertical axis. For fitting the noisy train subset, a lower value is better. For the other three metrics, a higher value is better. According to the finding of AT's effectiveness for boosting Boot-soft and ProSelfLC in Table 9, we add AT on top of every self trust scheme. Therefore, we observe all schemes have competitive results and their performance gaps become smaller. During training, a learner is not told whether a label is noisy or not. We use the final model when training ends.
We choose to train this subnet mainly because it requires a small-memory GPU and is faster to train. In addition, it benefits little from pretraining, which indicates that our algorithm can be applied to train sequence transformers from scratch. Finally, we will release this subnet and make it convenient to reproduce our results with a 16GB GPU machine. We use a batch size of 32 and the SGD optimiser. The weight decay is 0.0001. For cropping noise, a starting learning rate of 0.02 is used. When the noise rate is high, i.e., cropping noise+label noise, we use a smaller starting learning rate of 0.01. We train for 40 epochs in total. We stress that the generic hyperparameters are coarsely searched by visualising the statistical training curves, i.e., without brute-force search and without confidently fitting all training data, as noise exists. Besides, we report the metrics of the final model when training terminates rather than selecting the best intermediate model, which leaves our reported metrics less biased.
Results analysis and discussion. We present the accuracy and confidence metrics of ProSelfLC and baseline algorithms in Table 8. ProSelfLC's fitting of the noisy train set is much lower than that of CCE, which indicates that ProSelfLC does not overfit the training set. This observation holds for both noise types. Furthermore, ProSelfLC learns and generalises better and more confidently compared with other widely used baselines. Our experiments demonstrate that the misleading effect of cropping noise can be alleviated by ProSelfLC, as ProSelfLC's performance (92.4%) is even slightly better than the DeepLoc ensemble model (92.3%) [2] and the large transformer model (91.0%) [16]. With the high label noise added, though 2,877 out of 6,622 proteins have random labels, the generalisation performance of ProSelfLC decreases little, from 92.4% to 92.2%, which confirms that ProSelfLC can be a robust solution when it is expensive to remove severe noise in practice.

Ablation studies
Normal-temperature entropy state versus low-temperature entropy state. For CP, Boot-soft and our ProSelfLC, the target state's entropy can be adjusted by the temperature. We denote annealed temperature by AT. We study the normal-temperature state (i.e., without AT) versus the low-temperature state (i.e., with AT) and display their results in Table 9. Generic training parameters are the same for all methods. We observe: (1) Compared with the baseline CCE, the confidence-penalty approaches (LS and CP) indeed learn better and lower-confidence models, which is consistent with the motivations of proposing them; (2) However, confidence-reward algorithms (Boot-soft and ProSelfLC) can perform better. ProSelfLC with AT generalises the best with the highest confidence, i.e., the lowest entropy; (3) Across networks and noise rates, CP is less sensitive to target-state entropy while Boot-soft is the most sensitive. This demonstrates that target-state entropy is also very important for standard label correction (i.e., Boot-soft). For ProSelfLC, AT improves the performance consistently and reaches a curated low-temperature entropy state.
In Fig. 6, on both networks and both noise rates, the performance is generally more sensitive to T when B is large. This confirms the intuition that if we trust a learner itself at a faster speed, the confidence adjustment of this learner's predictions becomes more crucial. When the noise rate increases, better results can be obtained by using a relatively smaller T to optimise the model towards a low-temperature entropy state.
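To make the effect of the temperature concrete, the following minimal sketch compares the entropy of a softmax prediction at a normal temperature (T = 1) and at a low temperature (T = 0.5). It assumes that the temperature divides the logits before the softmax and uses a fixed T rather than an annealed one; the logits are illustrative values, not taken from any experiment.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


logits = [2.0, 1.0, 0.5, 0.1]          # illustrative values only
p_normal = softmax(logits, temperature=1.0)
p_low = softmax(logits, temperature=0.5)

print("entropy at T=1.0:", entropy(p_normal))
print("entropy at T=0.5:", entropy(p_low))
```

A temperature below 1 sharpens the distribution without changing the predicted class, so a label corrected with the sharpened prediction defines a lower-entropy target.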
Self trust schemes. We study the differences of four self trust schemes described in Section 3. When the trust in a learner is held constant during training, ProSelfLC degrades to Boot-soft. According to Fig. 7, we observe that (1) compared with "constant" (i.e., Boot-soft), g(t) is better in three metrics, at the cost of fitting the clean train subset considerably less; (2) compared with "constant" and g(t), g(t) * conf_all and g(t) * conf_top are better at balancing fitting and generalisation. By default, in all other experiments, we use g(t) * conf_all due to its slightly better results. We further discuss post-training model calibration [20] in Appendix B, and the changes of entropy and ProSelfLC during training in Appendix C.
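The four schemes can be sketched as follows. The exact definitions are given in Section 3 and are not reproduced here, so this sketch makes its own assumptions: g(t) is taken to be a logistic function of training progress with slope B, conf_all is taken to be 1 - H(p)/H(u), and conf_top is taken to be the largest predicted probability. The function names, the value of B and the example vectors are hypothetical.

```python
import math


def g(t, total_iters, B=16.0):
    """Global trust: assumed logistic in training progress t / total_iters."""
    return 1.0 / (1.0 + math.exp(-B * (t / total_iters - 0.5)))


def conf_all(p):
    """Assumed definition: 1 - H(p) / H(u), using the whole distribution."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return 1.0 - h / math.log(len(p))


def conf_top(p):
    """Assumed definition: the largest predicted probability."""
    return max(p)


def trust(scheme, t, total_iters, p, constant=0.5):
    """Trust score for one example under one of the four schemes."""
    if scheme == "constant":
        return constant
    if scheme == "g(t)":
        return g(t, total_iters)
    if scheme == "g(t)*conf_all":
        return g(t, total_iters) * conf_all(p)
    if scheme == "g(t)*conf_top":
        return g(t, total_iters) * conf_top(p)
    raise ValueError(f"unknown scheme: {scheme}")


def corrected_target(q, p, eps):
    """Convex combination of the annotated label q and the prediction p."""
    return [(1.0 - eps) * qi + eps * pi for qi, pi in zip(q, p)]


q = [0.0, 1.0, 0.0, 0.0]   # annotated one-hot label
p = [0.7, 0.1, 0.1, 0.1]   # prediction that disagrees with the label
for scheme in ("constant", "g(t)", "g(t)*conf_all", "g(t)*conf_top"):
    eps = trust(scheme, t=900, total_iters=1000, p=p)
    print(scheme, [round(v, 3) for v in corrected_target(q, p, eps)])
```

Under these assumptions, the two schemes that multiply by a confidence term trust a late-stage prediction less when it is uncertain, whereas g(t) alone trusts every late-stage prediction almost fully.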

Label noise and semi-supervised learning
The target modification algorithms are effective strategies for model optimisation in the scenarios of label noise and semi-supervised learning, which are closely related. In the setting of semi-supervised learning, we are given partially annotated training data. Therefore, its key is to reliably "fill missing labels" and continue to learn based on them. Interestingly, when the missing labels are not perfectly filled, which is usually the case, the challenge of semi-supervised training changes to label noise. For a further comparison, in semi-supervised learning, the given annotations are generally clean and reliable, so that the label noise only exists in the unannotated set. In contrast, in the setting of label noise, we are not given any lead about whether an example is trusted or not, which makes it even more challenging.

LC and knowledge distillation (KD)
In Section 2.3, we have mathematically derived that some KD methods [9,27,44,83,90] also modify labels. Therefore, LC and KD are interchangeable in those cases. We use the term LC rather than KD mainly for two reasons: (1) LC is more descriptive; (2) the scope of KD has become much larger than label modification. For example, when two models are trained, the consistency between their predictions of a data point is rewarded in [5,97], and a large distance between their feature maps is penalised in [62]. Recently, multiple networks are trained for KD [17]. Regarding self KD, the intraclass samples are constrained to have consistent probability distributions [85,91]. In another self KD [94], the deepest classifier provides knowledge to supervise the shallower classifiers. In a recent self KD method [90], Tf-KD_self applies two-stage training. In this work, we focus on improving the end-to-end self LC. Therefore, some self KD methods [85,91,94], which maximise the consistency of different classifiers or intraclass samples' predictions, do not modify labels and are less relevant for comparison. When it comes to the two-stage self LC method [90], in our view, it can be an add-on, i.e., an enhancement plugin. Therefore, exploiting ProSelfLC to improve non-self and stagewise LC approaches is an interesting area for future work.

Sample selection using the small-loss criterion
Recently, a popular family of algorithms has proposed to learn from small-loss samples when severe label noise exists [21,22,29,86,89]. Their underlying assumptions are that small-loss examples are clean and that learning from clean data only mitigates fitting noise. Generally, there are two key issues which significantly affect their performance in practice: (1) the selection schedule; (2) the proportion of selected small-loss data.
Fig. 8: The change of cross entropy losses during training ResNet18 on CIFAR-100 with symmetric label noise. Legend: clean subset (CCE models), clean subset (ProSelfLC models), noisy subset (CCE models), noisy subset (ProSelfLC models). Panel (b): 40% of training data is noisy. For a stratified analysis, we snapshot the model every 500 iterations and report the average losses of clean and noisy parts of the training data. For plotting, we first divide a loss by the maximum loss at training then multiply it by 100. For two noise rates, though the loss of the clean data decreases steadily for both CCE and ProSelfLC, the loss of the noisy data decreases only at a later phase when using CCE, while it increases throughout training when using ProSelfLC.
For example, to address the first issue, MentorNet [29] learns a data selection curriculum while Co-teaching [22] gradually selects fewer clean samples as training proceeds. The given design reason of Co-teaching [22] is that a deep network starts to memorize the noisy data in the later training phase. To address the second issue, S2E [86] proposes an automated machine learning method to control the selection process so that a higher proportion of clean instances is selected and better performance is obtained.
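The two issues, i.e., the schedule and the proportion, can be illustrated with a minimal small-loss selection sketch. It assumes a linear schedule in the spirit of Co-teaching, in which the kept fraction shrinks from 1 to 1 minus an estimated noise rate over a number of warm-up epochs; the function names, the warm-up length and the loss values are hypothetical.

```python
def keep_rate(epoch, noise_rate, warmup_epochs=10):
    """Fraction of a batch to keep; shrinks linearly from 1 to 1 - noise_rate."""
    return 1.0 - noise_rate * min(epoch / warmup_epochs, 1.0)


def select_small_loss(losses, rate):
    """Indices of the samples with the smallest losses, in ascending index order."""
    n_keep = max(1, int(round(rate * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


batch_losses = [0.1, 2.0, 0.3, 1.5, 0.2]   # illustrative per-sample losses
early = select_small_loss(batch_losses, keep_rate(epoch=0, noise_rate=0.4))
late = select_small_loss(batch_losses, keep_rate(epoch=10, noise_rate=0.4))
print("kept at epoch 0:", early)
print("kept at epoch 10:", late)
```

Both the noise-rate estimate and the warm-up length must be chosen in such a scheme, which is the dependence on schedule and proportion discussed above.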
To clearly understand why ProSelfLC outperforms the recent small-loss sample selection methods, as compared in Tables 4 and 6, we display the change of cross entropy loss as training proceeds in Fig. 8. We have the following observations:
• When using CCE, the loss of the noisy data decreases significantly in the later phase. This confirms the importance of the selection schedule of dropping very few samples at the early stage while leaving out more at the later stage. If using CCE with small-loss data selection, the corrupted-label examples will not be selected to train the model.
• When using ProSelfLC, the loss of the noisy data even increases steadily. The model does not fit noisy labels and keeps improving knowledge at both clean and noisy training subsets, as also demonstrated in Fig. 4.
In summary, first, ProSelfLC learns from all data while sample selection methods [21,22,29,86,89] only learn from small-loss data. Second, the proportion of selected data matters [86], and small-loss instances are more likely to be correct but not certain to be [21]. These are the reasons why ProSelfLC is superior.

CONCLUSION
Theoretically, we comprehensively study multiple label modification techniques from the viewpoints of entropy and KL divergence. Methodologically, we propose ProSelfLC as an advanced self LC approach. ProSelfLC is the first approach to trust low-temperature self knowledge progressively and adaptively. Extensive experiments prove its superiority over existing methods in clean and noisy scenarios of two diverse domains, i.e., image and protein datasets.
In terms of new insightful findings, we disclose and illustrate that deep neural networks become less confident of learning semantic patterns before fitting noise when the label noise rises, which complements the findings in [4,20,52,92]. In addition, ProSelfLC promotes entropy minimisation, which is in marked contrast to the recent practices of confidence penalty [15,57,66]. The effectiveness of ProSelfLC defends the entropy minimisation principle.
Xinshao Wang is a senior researcher of Zenith Ai and a visiting scholar of University of Oxford. He was a postdoctoral researcher at the Department of Engineering Science, University of Oxford after finishing his PhD at the Queen's University Belfast, UK. Xinshao Wang is working on core deep learning techniques with applications to visual recognition, disease prediction based on electronic health records, and protein engineering. Concretely, he has been working on the following topics: (1) Deep metric learning: to learn discriminative and robust representations for downstream tasks, e.g., object retrieval and clustering; (2) Robust deep learning: robust learning and inference under adverse conditions, e.g., noisy labels, missing labels (semi-supervised learning), out-of-distribution training examples, sample imbalance, etc; (3) Computer vision: video/set-based person re-identification; image/video classification/retrieval/clustering; (4) AI healthcare: electrocardiogram classification; (5) ML-assisted gene and protein engineering.
Sankha Subhra Mukherjee is a hands-on research leader who thinks deeply about the hardest problems in machine learning and delivers results on which innovative businesses have been created. Dr Mukherjee is a deep learning expert. His doctoral research at Heriot-Watt developed new deep neural network techniques which led to the cofounding of a high growth tech start-up, landmark publications, and patents. He had been the founding EVP of Research and later CSO for a high growth start-up leading breakthroughs in machine learning, recruiting and supervising a team of 20 world-class researchers. In his current role he is a co-founder and Chief Scientific Officer, leading a team of world class researchers to deliver breakthroughs in ML and AI driven cell engineering and synthetic biology.
In 2018, the CHI Lab opened its second site, in Suzhou (China), with support from the Chinese government. In 2019, the Wellcome Trust's first "Flagship Centre" was announced, which joins CHI Lab to the Oxford University Clinical Research Unit in Vietnam, focused on AI for healthcare in resource-constrained settings. He is a Grand Challenge awardee from the UK Engineering and Physical Sciences Research Council, which is an EPSRC Fellowship that provides long-term strategic support for nine "future leaders in healthcare." He was joint winner of the inaugural "Vice-Chancellor's Innovation Prize", which identifies the best interdisciplinary research across the entirety of the University of Oxford.

David
where we have $H(q, q) = 0$ because $q$ is a one-hot distribution.
where $H(p, u) = H(u, u) = \text{constant}$. Analogously, LC in Eq. (6) can also be rewritten. In LS and CP, the terms $+D_{KL}(u \| p)$ and $+D_{KL}(p \| u)$, respectively, pull $p$ towards $u$, while in LC, the term $-D_{KL}(p \| u)$ pushes $p$ away from $u$.
Proposition 5. In CCE, LS and CP, a data point $x$ has the same semantic class. In addition, $x$ has an identical probability of belonging to other classes except for its semantic class.
Proof. In LS, the target is $\tilde{q}_{LS} = (1 - \epsilon) q + \epsilon u$. For any $0 \le \epsilon < 1$, the semantic class is not changed, because $1 - \epsilon + \epsilon \cdot \frac{1}{c} > \epsilon \cdot \frac{1}{c}$. In addition, $j_1 \neq y, j_2 \neq y \Rightarrow \tilde{q}_{LS}(j_1|x) = \tilde{q}_{LS}(j_2|x) = \frac{\epsilon}{c}$.
In CP, $\tilde{q}_{CP} = (1 - \epsilon) q - \epsilon p$. In terms of label definition, CP is against intuition because the zero-value positions in $q$ are filled with negative values in $\tilde{q}_{CP}$. A probability cannot be smaller than zero. So we rephrase $\tilde{q}_{CP}(y|x) = (1 - \epsilon) - \epsilon \cdot p(y|x)$, and $\forall j \neq y$, $\tilde{q}_{CP}(j|x) = 0$ by replacing negative values with zeros, as illustrated in Fig. 1a.
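A small numerical check of the two targets in the proof may help. The sketch below assumes that u is the uniform distribution over c classes and uses illustrative values for q, p and the weight (written `eps`); the function names are hypothetical.

```python
def ls_target(q, eps):
    """Label smoothing target: (1 - eps) * q + eps * u, with u uniform."""
    c = len(q)
    return [(1.0 - eps) * qi + eps / c for qi in q]


def cp_target(q, p, eps):
    """Confidence penalty target: (1 - eps) * q - eps * p, negatives set to 0."""
    raw = [(1.0 - eps) * qi - eps * pi for qi, pi in zip(q, p)]
    return [max(0.0, r) for r in raw]


q = [0.0, 1.0, 0.0, 0.0]   # one-hot label, semantic class y = 1
p = [0.1, 0.6, 0.2, 0.1]   # illustrative prediction
eps = 0.2

q_ls = ls_target(q, eps)      # non-semantic classes all receive eps / c
q_cp = cp_target(q, p, eps)   # non-semantic classes are clipped to zero
print("LS target:", q_ls)
print("CP target:", q_cp)
```

In both targets the largest entry stays at the annotated class and all other classes share one value, which is the statement of Proposition 5 for LS and CP.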

APPENDIX B DISCUSSIONS ON WRONGLY CONFIDENT PREDICTIONS AND MODEL CALIBRATION
1. It is likely that some highly confident predictions are wrong. Will ProSelfLC suffer from an amplification of those errors?
First of all, ProSelfLC alleviates this issue considerably and makes a model confident in correct predictions, according to Fig. 5e together with Fig. 5b and Fig. 5c. Fig. 5e shows the confidence of predictions, the majority of which are correct according to Fig. 5b and Fig. 5c. In Fig. 5b, ProSelfLC fits noisy labels the least, i.e., around 12%, so that the correction rate of noisy labels is about 88% in Fig. 5c. Nonetheless, ProSelfLC is not perfect: a few noisy labels are memorised with high confidence.
2. How about the results of model calibration using a computational evaluation metric: Expected Calibration Error (ECE) [20,54]?
Following the practice of [20], on the CIFAR-100 test set, we report the ECE (%, #bins=10) of ProSelfLC versus CCE, as a complement to Fig. 5. For comparison, CCE's results are shown in the corresponding brackets. We try several confidence metrics (CMs), including probability, entropy, and their temperature-scaled variants using a parameter T. Though the ECE metric is sensitive to CM and T, ProSelfLC's ECEs are smaller than CCE's.
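For reference, ECE with equal-width bins can be computed as in the following minimal sketch. It assumes that the confidence metric is the top predicted probability and that bins are half-open intervals of width 1/#bins; the function name and the toy inputs are hypothetical, and the sketch does not reproduce the reported numbers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (bin size / n) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); conf = 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


confidences = [0.95, 0.95, 0.55, 0.55]   # toy top-class probabilities
correct = [True, True, True, False]      # whether each prediction is right
ece = expected_calibration_error(confidences, correct)
print("ECE (%):", 100 * ece)
```

Other confidence metrics, such as an entropy-based one or temperature-scaled variants, can be evaluated by passing a different list of confidences in [0, 1].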

APPENDIX C THE CHANGES OF ENTROPY STATISTICS AND ProSelfLC AT TRAINING
In Fig. 9, we visualise how the entropies of noisy and clean subsets change at training.
Fig. 9: The changes of entropy statistics and ProSelfLC at training. We store a model every 1000 iterations to monitor the learning process. For data-dependent metrics, after training, we split the corrupted training data into clean and noisy subsets according to the information about how the training data is corrupted before training. Finally, we report the mean results of each subset.