Zero-determinant strategies under observation errors in repeated games

Zero-determinant (ZD) strategies are a class of strategies in the repeated prisoner's dilemma (RPD) game discovered by Press and Dyson. A ZD strategy enforces a linear relationship between the payoffs of a focal player and the opponent regardless of the opponent's strategy. Discounting and observation errors are important generalizations of the RPD game because they better capture real-life interactions, which are often noisy; however, neither was considered in the original discovery of ZD strategies, and preceding studies have treated each of them only in isolation. Here, we analytically study the strategies that enforce linear payoff relationships in the RPD game with both a discount factor and observation errors. We first show that the payoffs of the two players can still be represented in determinant form, as shown by Press and Dyson, even with these two factors. We then search for all possible strategies that enforce linear payoff relationships and find that ZD strategies and unconditional strategies are the only strategy sets that satisfy the condition. We also show that neither Extortion nor Generous strategies, which are subsets of ZD strategies, exist when there are errors. Finally, we numerically derive the threshold values above which the subsets of ZD strategies exist. These results contribute to a deeper understanding of ZD strategies in society.


I. INTRODUCTION
Cooperation is a basis for building sustainable societies. In a one-shot interaction, cooperation among individuals is suppressed because cooperation is costly to the actor while defection is not. This cooperation-defection relationship is well captured by the prisoner's dilemma (PD) game from game theory. In the one-shot PD game, defection is the only Nash equilibrium. When the game is repeated, the situation drastically changes, which is modeled by the repeated prisoner's dilemma (RPD) game [1]. In the RPD game, cooperation can be rewarded by the opponent in the future, so cooperation becomes a possible equilibrium. This mechanism is called direct reciprocity [2][3][4] and makes it possible for players to mutually cooperate in the RPD game.
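The one-shot dominance of defection can be checked mechanically. The sketch below uses the payoff values (T, R, P, S) = (1.5, 1, 0, -0.5) adopted later in the numerical section; the dictionary layout and variable names are ours.

```python
# One-shot PD: T (temptation) > R (reward) > P (punishment) > S (sucker).
T, R, P, S = 1.5, 1.0, 0.0, -0.5

# payoff[my_action][opponent_action] for actions 'C' (cooperate), 'D' (defect)
payoff = {
    "C": {"C": R, "D": S},
    "D": {"C": T, "D": P},
}

# Whatever the opponent does, defection yields the strictly higher payoff,
# so (D, D) is the unique Nash equilibrium of the one-shot game.
for opponent in ("C", "D"):
    best = max(("C", "D"), key=lambda a: payoff[a][opponent])
    assert best == "D"
```

The same dominance argument fails once the game is repeated, which is what makes direct reciprocity possible.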
Evolutionary game theory (EGT) [5] studies how cooperation evolves in the RPD game. Among the various cooperative strategies tested in evolutionary games, generous tit-for-tat [6] and win-stay lose-shift [7,8] were robust against many kinds of evolutionary opponents under noisy conditions. EGT can identify strong strategies against various opponents in evolving populations. One missing piece was: what is a strong strategy against a direct opponent that may use any kind of strategy? In 2012, Press and Dyson answered this question from a different point of view. Using linear algebraic manipulations, they found a novel class of strategies containing such ultimate strategies, called zero-determinant (ZD) strategies [9]. ZD strategies impose a linear relationship between the payoffs of a focal player and his opponent regardless of the strategy that the opponent implements. One subclass of ZD strategies is the Extortioner, which never loses a one-to-one competition in the RPD game against any opponent.
In those ZD studies, no errors were assumed. However, errors (or noise) are unavoidable in human interactions, and they may lead to the collapse of cooperation through their negative effects. Thus, the effect of errors has received much attention in the RPD game [43][44][45][46][47][48][49][50][51]. However, only a few studies have considered the effect of errors on ZD strategies [52,53]. There are typically two types of errors: perception errors [45] and implementation errors [46]. Hao et al. [52] and Mamiya and Ichinose [53] considered the former type, where players may misunderstand their opponent's action because they can only rely on private monitoring [43,47] instead of observing the opponent's action directly. Those studies showed that ZD strategies can exist even when such observation errors are incorporated, but they did not consider a discount factor. It is natural to assume that future payoffs are discounted. Thus, some studies have focused on a discount factor for ZD strategies [30,31,[54][55][56] and mathematically found the minimum discount factor above which ZD strategies can exist [55].
In this study, we search for ZD strategies in situations where observation errors and a discount factor are both incorporated. We also search for other possible strategies, beyond ZD strategies, that enforce a linear payoff relationship between the two players. By formalizing the determinants of the expected payoffs in the RPD game, we mathematically find that ZD strategies [9] and unconditional strategies [14,55] are the only two types that enforce a linear payoff relationship. We numerically show the threshold values above which the subsets of ZD strategies exist in the game.

II.1. RPD with private monitoring
We consider the symmetric two-person repeated prisoner's dilemma (RPD) game with private monitoring based on the literature [47,52]. Each player i ∈ {X, Y} chooses an action a_i ∈ {C, D} in each round, where C and D denote cooperation and defection, respectively. After the two players act, player i observes his own action a_i and a private signal ω_i ∈ {g, b} about the opponent's action, where g and b denote good and bad, respectively. In perfect monitoring, when the opponent takes action C (D), the focal player always observes the signal g (b). In private monitoring, this is not always true. σ(ω|a) is the probability that a signal profile ω = (ω_X, ω_Y) is realized when the action profile is a = (a_X, a_Y) [47]. Let ϵ be the probability that an error occurs to one particular player but not to the other, and ξ the probability that an error occurs to both players. Then, the probability that an error occurs to neither player is 1 − 2ϵ − ξ. For example, when both players cooperate, σ((g, g)|(C, C)) = 1 − 2ϵ − ξ, σ((g, b)|(C, C)) = σ((b, g)|(C, C)) = ϵ, and σ((b, b)|(C, C)) = ξ. In each round, player i's realized payoff u_i(a_i, ω_i) is determined by his own action a_i and signal ω_i, such that u_i(C, g) = R, u_i(C, b) = S, u_i(D, g) = T, and u_i(D, b) = P. Note that the payoffs depend on the signals in private monitoring. Hence, his expected payoff given the action profile a is

f_i(a) = Σ_ω σ(ω|a) u_i(a_i, ω_i).   (1)

The expected payoff is determined only by the action profile a, regardless of the signal profile ω. Thus, the expected payoff matrix has entries R_E, S_E, T_E, and P_E. Writing µ = 1 − ϵ − ξ and η = ϵ + ξ for the probabilities that a player's own signal is correct and incorrect, respectively, Eq. (1) yields

R_E = µR + ηS, S_E = µS + ηR, T_E = µT + ηP, P_E = µP + ηT,

respectively. We assume that

T_E > R_E > P_E > S_E and 2R_E > T_E + S_E,

which dictate the RPD condition with observation errors.
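The expected stage payoffs above can be computed directly. A minimal sketch (the function name and the example error rates are ours; the payoff values are those used in the numerical section):

```python
def expected_payoffs(T, R, P, S, eps, xi):
    """Expected stage payoffs (R_E, S_E, T_E, P_E) under observation errors.

    mu = 1 - eps - xi is the probability that a player's private signal is
    correct; eta = eps + xi is the probability that it is flipped (an error
    hits that player alone, or both players)."""
    mu, eta = 1.0 - eps - xi, eps + xi
    R_E = mu * R + eta * S   # both cooperate, but C may be misread as b
    S_E = mu * S + eta * R   # focal cooperates, opponent defects
    T_E = mu * T + eta * P   # focal defects, opponent cooperates
    P_E = mu * P + eta * T   # both defect, but D may be misread as g
    return R_E, S_E, T_E, P_E

R_E, S_E, T_E, P_E = expected_payoffs(1.5, 1.0, 0.0, -0.5, eps=0.05, xi=0.02)
# RPD condition with observation errors
assert T_E > R_E > P_E > S_E and 2 * R_E > T_E + S_E
```

For these error rates the entries are (R_E, S_E, T_E, P_E) = (0.895, -0.395, 1.395, 0.105), so the dilemma structure survives small observation errors.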
In this paper, we introduce a discount factor into the RPD game with private monitoring. The game is played repeatedly over an infinite time horizon, but the payoff is discounted over rounds. Player i's discounted payoff for the action profiles a(t), t ∈ {0, 1, ..., ∞}, is δ^t f_i(a(t)), where δ is the discount factor and t is the round. This game can also be interpreted as a repeated game with a finite but undetermined time horizon. Finally, the average discounted payoff of player i is

s_i = (1 − δ) Σ_{t=0}^{∞} δ^t f_i(a(t)).   (5)
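The role of the prefactor (1 − δ) can be seen with a one-line check: it turns the discounted sum into a per-round average, so a constant stage payoff f yields an average discounted payoff of exactly f (the truncation length below is ours).

```python
# (1 - delta) * sum_t delta^t = 1, so a constant stage payoff f gives
# an average discounted payoff of exactly f (up to series truncation).
delta, f = 0.9, 2.0
s = (1 - delta) * sum(delta ** t * f for t in range(5000))
assert abs(s - f) < 1e-9
```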

II.2. Determinant form of expected payoffs in the RPD game
Here, we proceed to show that Eq. (5) can be represented in determinant form even for repeated games with observation errors and a discount factor, as Press and Dyson did for the repeated game without errors and without discounting [9]. The action profiles a(t) in Eq. (5) must be specified to calculate s_i. Those profiles are determined once the strategies of the two players are given. Consider players that adopt memory-one strategies, which use only the outcome of the last round to decide the action submitted in the current round. A memory-one strategy is specified by a 5-tuple; X's strategy is given by

p = (p_0; p_1, p_2, p_3, p_4),   (6)

where 0 ≤ p_j ≤ 1, j ∈ {0, 1, 2, 3, 4}. The subscripts 1, 2, 3, and 4 of p correspond to the previous outcomes Cg, Cb, Dg, and Db, respectively. In Eq. (6), p_1 is the conditional probability that X cooperates when X cooperated and observed signal g in the last round, p_2 is the conditional probability that X cooperates when X cooperated and observed signal b, p_3 is the conditional probability that X cooperates when X defected and observed signal g, and p_4 is the conditional probability that X cooperates when X defected and observed signal b. Finally, p_0 is the probability that X cooperates in the first round. Similarly, Y's strategy is specified by

q = (q_0; q_1, q_2, q_3, q_4),   (7)

where 0 ≤ q_j ≤ 1, j ∈ {0, 1, 2, 3, 4}. Define v(t) = (v_1(t), v_2(t), v_3(t), v_4(t)) as the stochastic state of the two players in round t, where the subscripts 1, 2, 3, and 4 of v correspond to the stochastic states (C,C), (C,D), (D,C), and (D,D), respectively. v_1(t) is the probability that both players cooperate in round t, v_2(t) is the probability that X cooperates and Y defects in round t, and so forth. Then, the expected payoff to player X in round t is given by v(t)·S_X, where S_X^T = (R_E, S_E, T_E, P_E). The expected per-round payoff to player X in the repeated game is given by

s_X = (1 − δ) Σ_{t=0}^{∞} δ^t v(t)·S_X,   (8)

where 0 < δ < 1.
The initial stochastic state is given by

v(0) = (p_0 q_0, p_0(1 − q_0), (1 − p_0)q_0, (1 − p_0)(1 − q_0)).   (10)

The state transition matrix M of these repeated games with observation errors is constructed from p, q, and the signal distribution σ(ω|a); its (i, j) entry is the probability of moving from the i-th to the j-th action profile in one round (Eq. (11)). By substituting Eq. (11) into Eq. (8), we obtain

s_X = (1 − δ) v(0)^T (I − δM)^{−1} S_X,   (12)

where I is the 4 × 4 identity matrix. Then, let

v^T = (1 − δ) v(0)^T (I − δM)^{−1}   (13)

be the mean distribution of v(t). Additionally, we define M_0 as the matrix each of whose rows is v(0)^T (Eq. (14)). Because v_1 + v_2 + v_3 + v_4 = 1 (Appendix A), the following holds (Appendix B):

v^T M_0 = v(0)^T.   (15)

Equation (13) can be rearranged as v^T(δM − I) = −(1 − δ)v(0)^T (Eq. (16)), and combining Eq. (16) with Eq. (15) yields

v^T M′ = 0, where M′ = δM − I + (1 − δ)M_0.   (17)

With Eq. (17), we immediately obtain a formula for the dot product of an arbitrary vector f^T = (f_1, f_2, f_3, f_4) with the vector u formed from the cofactors of the fourth column of M′, as a consequence of Press and Dyson's formalism, which can be represented in determinant form:

u · f = D(p, q, f),   (18)

where D(p, q, f) is the determinant of M′ with its fourth column replaced by f, and µ = 1 − ϵ − ξ and η = ϵ + ξ appear in its entries. Furthermore, u should be normalized by u · 1, where 1^T = (1, 1, 1, 1), so that its components sum to one; the normalized vector is the mean distribution v. Then, we obtain the dot product of an arbitrary vector f with the mean distribution v. Replacing the last column of D(p, q, f) with player X's and Y's expected payoff vectors, respectively, we obtain their per-round expected payoffs:

s_X = D(p, q, S_X)/D(p, q, 1),   (19)
s_Y = D(p, q, S_Y)/D(p, q, 1),   (20)

where S_Y^T = (R_E, T_E, S_E, P_E). When we set δ = 1, Eq. (18) corresponds to Eq. (2) of [52]. By using Eq. (18), we can calculate the players' per-round expected payoffs in determinant form for 0 < δ ≤ 1; δ = 1 is the case where future payoffs are not discounted.
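The pipeline from the two memory-one strategies to the mean distribution can be sketched numerically. The construction below is our reading of the model in Sec. II.1: both signals are correct with probability 1 − 2ϵ − ξ, exactly one is flipped with probability ϵ each, both with probability ξ; each player reacts to his own last action and private signal; and the mean distribution is obtained by matrix inversion. All function names and the example strategies are ours.

```python
import numpy as np

def transition_matrix(p, q, eps, xi):
    """One-round transition matrix over action profiles (CC, CD, DC, DD).

    p and q are the memory-one parts (p1..p4, q1..q4) of the strategies,
    indexed by (own last action, own signal): Cg, Cb, Dg, Db."""
    sig = [((0, 0), 1 - 2 * eps - xi), ((1, 0), eps),
           ((0, 1), eps), ((1, 1), xi)]          # (flip X?, flip Y?), prob
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]    # (aX, aY); 0 = C, 1 = D
    M = np.zeros((4, 4))
    for i, (aX, aY) in enumerate(states):
        for (fX, fY), pr in sig:
            cX = p[2 * aX + (aY ^ fX)]   # X's cooperation prob next round
            cY = q[2 * aY + (aX ^ fY)]   # Y's cooperation prob next round
            M[i] += pr * np.array([cX * cY, cX * (1 - cY),
                                   (1 - cX) * cY, (1 - cX) * (1 - cY)])
    return M

def mean_distribution(p0, q0, p, q, delta, eps, xi):
    """v^T = (1 - delta) v(0)^T (I - delta*M)^(-1)."""
    v0 = np.array([p0 * q0, p0 * (1 - q0), (1 - p0) * q0, (1 - p0) * (1 - q0)])
    M = transition_matrix(p, q, eps, xi)
    return (1 - delta) * v0 @ np.linalg.inv(np.eye(4) - delta * M)

v = mean_distribution(0.5, 0.6, [0.9, 0.2, 0.4, 0.1], [0.8, 0.3, 0.6, 0.2],
                      delta=0.9, eps=0.05, xi=0.02)
assert abs(v.sum() - 1.0) < 1e-12   # components of v sum to one (Appendix A)
```

Dotting v with the expected payoff vectors S_X and S_Y then gives the per-round payoffs of Eqs. (19) and (20) without forming any determinant; the determinant form is what makes the linear-relationship analysis tractable.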

III.1. All strategies that enforce linear payoff relationships
Since we are interested in the payoff relationship between the two players, we linearly combine the payoffs represented by Eqs. (19) and (20). The linear combination of s_X and s_Y can also be represented in determinant form:

αs_X + βs_Y + γ = D(p, q, αS_X + βS_Y + γ1)/D(p, q, 1),   (21)

where α, β, and γ are arbitrary constants. The numerator on the right side of Eq. (21) is D(p, q, αS_X + βS_Y + γ1) (Eq. (22)). If Eq. (22) is zero, the relationship between the two players' payoffs becomes linear:

αs_X + βs_Y + γ = 0.   (23)

Thus, we search for all solutions of D(p, q, αS_X + βS_Y + γ1) = 0. Press and Dyson [9] (without errors) and Hao et al. [52] (with observation errors) considered the case in which the second and fourth columns of the determinant take the same value, which makes the determinant zero. Mamiya and Ichinose [53] searched, among all possibilities, for every case that makes the determinant zero with observation errors. Here, we extend Mamiya and Ichinose [53] to the case with both observation errors and a discount factor.
As a result, we found that, in the RPD game even with observation errors (imperfect monitoring) and a discount factor, the only strategies that impose a linear relationship between the two players' payoffs are either strategies satisfying the ZD condition (Eq. (25)) or strategies with p_0 = p_1 = p_2 = p_3 = p_4, i.e., strategies that cooperate with the same fixed probability in every round. The former are ZD strategies and the latter are unconditional strategies, respectively.
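The unconditional half of this claim can be checked numerically: fix X to cooperate with the same probability in every round, sample random opponents, and all resulting payoff pairs (s_X, s_Y) fall on a single line. The sketch below is self-contained (it rebuilds the payoff machinery of Sec. II; names and parameter values are ours).

```python
import numpy as np

def avg_payoffs(p0, q0, p, q, delta, eps, xi, T=1.5, R=1.0, P=0.0, S=-0.5):
    """Per-round expected payoffs (s_X, s_Y) in the RPD game with
    observation errors and discounting (our reading of the model)."""
    mu, eta = 1 - eps - xi, eps + xi
    SX = np.array([mu*R + eta*S, mu*S + eta*R, mu*T + eta*P, mu*P + eta*T])
    SY = SX[[0, 2, 1, 3]]                 # swap the roles of the two players
    sig = [((0, 0), 1 - 2*eps - xi), ((1, 0), eps), ((0, 1), eps), ((1, 1), xi)]
    M = np.zeros((4, 4))
    for i, (aX, aY) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        for (fX, fY), pr in sig:
            cX = p[2*aX + (aY ^ fX)]      # own last action, possibly flipped signal
            cY = q[2*aY + (aX ^ fY)]
            M[i] += pr * np.array([cX*cY, cX*(1-cY), (1-cX)*cY, (1-cX)*(1-cY)])
    v0 = np.array([p0*q0, p0*(1-q0), (1-p0)*q0, (1-p0)*(1-q0)])
    v = (1 - delta) * v0 @ np.linalg.inv(np.eye(4) - delta * M)
    return float(v @ SX), float(v @ SY)

# Unconditional X: cooperate with probability 0.3 in every round.
rng = np.random.default_rng(0)
pts = []
for _ in range(200):
    q0, qv = rng.random(), rng.random(4)
    pts.append(avg_payoffs(0.3, q0, [0.3]*4, list(qv), 0.9, 0.05, 0.02))
pts = np.array(pts)

# Collinearity: the cross product of difference vectors vanishes for all points.
d = pts - pts[0]
cross = d[:, 0] * d[1, 1] - d[:, 1] * d[1, 0]
assert np.max(np.abs(cross)) < 1e-10
```

Intuitively, an unconditional X plays an independent coin flip each round, so both stage payoffs are affine in Y's cooperation probability with round-independent coefficients; averaging over rounds preserves that single affine relation.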

III.2. Extortion and Generous no longer exist when there are errors
Extortion [9] and Generous [23] strategies are well-known subsets of ZD strategies with important characteristics. Extortion never loses to any opponent in a one-to-one competition in terms of expected payoffs. Moreover, it ultimately imposes ALLC (always cooperate) on an opponent who tries to improve his own payoff [9]. On the other hand, Generous strategies always obtain lower payoffs than the opponent except under mutual cooperation; hence, Generous strategies are known as cooperative ZD strategies. Generous strategies are weak in a one-to-one competition. However, in a large evolving population, cooperative groups are more successful than groups of Extortioners. Thus, evolution leads from Extortion to Generous strategies.
In Eq. (25), we substitute α = ϕ, β = −ϕχ, and γ = ϕ(χ − 1)κ [55], so that Eq. (23) becomes

s_X − κ = χ(s_Y − κ),   (26)

where κ = P_E with 1 ≤ χ < ∞ represents Extortion, while κ = R_E with 1 ≤ χ < ∞ represents Generous strategies. The parameter χ gives the correlation between the two players' payoffs; thus, we call χ the correlation factor. The parameter κ corresponds to the payoff that a ZD strategy would obtain against itself; we thus call κ the baseline payoff, as in Ref. [14]. These two prominent strategy sets exist when there are no errors. Here, by contrast, we prove that Extortion and Generous strategies no longer exist when there are errors. The detailed calculation is provided in Appendix D.
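The substitution can be verified mechanically: for any ϕ ≠ 0, solving αs_X + βs_Y + γ = 0 with these coefficients reproduces s_X − κ = χ(s_Y − κ). A quick stdlib-only check over random parameter values:

```python
import random

# alpha = phi, beta = -phi*chi, gamma = phi*(chi - 1)*kappa substituted
# into alpha*s_X + beta*s_Y + gamma = 0 must give s_X - kappa = chi*(s_Y - kappa).
rng = random.Random(1)
for _ in range(1000):
    phi, chi, kappa, s_Y = (rng.uniform(0.1, 5.0) for _ in range(4))
    alpha, beta, gamma = phi, -phi * chi, phi * (chi - 1) * kappa
    s_X = -(beta * s_Y + gamma) / alpha   # solve the linear relation for s_X
    assert abs((s_X - kappa) - chi * (s_Y - kappa)) < 1e-9
```

The algebra is one line: ϕs_X − ϕχs_Y + ϕ(χ − 1)κ = 0 implies s_X = χs_Y − (χ − 1)κ, i.e., s_X − κ = χ(s_Y − κ).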
However, Hao et al. [52] and Mamiya and Ichinose [53] found that there exist ZD strategies which partially retain the characteristics of Extortion even when there are errors. In its original meaning, Extortion has two properties: (1) its payoff increase is always higher than any opponent's, due to χ > 1 (payoff control), and (2) it is never outperformed by anyone (payoff dominance). When there are errors, the second property is lost although the first remains. Intuitively, this is because errors introduce uncertainty into the payoffs and consequently degrade the accuracy of player X's payoff control. To keep the payoff control ability, the payoff dominance must be sacrificed when there are errors. Hao et al. called the ZD strategies that retain only the first property contingent extortion [52].

III.3. Existence of subsets of ZD strategies
Since observation errors and a discount factor are considered, in general, the ranges in which ZD strategies can exist are narrowed. Ichinose and Masuda mathematically showed the minimum threshold values above which Equalizer (another subclass of ZD strategies), Extortion, and Generous strategies can exist [55]. Here, we numerically address threshold values where subsets of ZD strategies can exist.

Minimum discount factor for Equalizer
Equalizer strategies are a subclass of ZD strategies that can fix the expected payoff of the opponent regardless of the opponent's strategy [9]. We first show the minimum discount factor δ_c for Equalizer when observation errors ϵ and ξ are given. Equalizer fixes the opponent's payoff no matter what strategy the opponent takes, which means that s_Y = −γ/β (Eq. (27)); this is obtained by substituting α = 0 into Eq. (23). Note that χ → ∞ in Eq. (27) corresponds to Equalizer [52]. We substitute α = 0 into Eq. (25) to obtain the Equalizer conditions (Eq. (29)). Solving Eq. (29) yields β, γ, p_2, and p_3. By substituting β and γ into Eq. (28), player Y's payoff is fixed at the value given by Eq. (30), which is independent of the opponent's strategy q.
Equalizer must satisfy the condition 0 ≤ p_i ≤ 1 in Eq. (29). The existence of Equalizer strategies also depends on δ, ϵ, and ξ. We numerically find the minimum discount factor δ_c and the conditions on (ϵ, ξ) under which Equalizer exists; δ ≥ δ_c is the condition on δ under which Equalizer strategies exist. Figure 1A shows δ_c when ϵ + ξ is given. We set (T, R, P, S) = (1.5, 1, 0, −0.5) and excluded the case ϵ + ξ > 1/3 because T_E > R_E > P_E > S_E is not satisfied in that situation. Note that the effects of ϵ and ξ are identical because they enter Eq. (29) only through η = ϵ + ξ and µ = 1 − ϵ − ξ. When there was no error (ϵ + ξ = 0), δ_c was about 0.33. When the errors were ϵ + ξ = 0.1 and 0.2, δ_c was about 0.52 and 0.93, respectively. As a result, we found that δ_c for Equalizer becomes larger as the error increases. Figure 1B shows the range of payoffs that Equalizer can enforce on the opponent (s_Y) for the corresponding ϵ + ξ. If there is no error (ϵ + ξ = 0), Equalizer can enforce any payoff between P_E = 0 and R_E = 1, as shown by Ichinose and Masuda [55]. However, as the error rates grow, the range of payoffs that Equalizer can enforce shrinks. When the errors were ϵ + ξ = 0.1 and 0.2, the possible payoff ranges were 0.21 ≤ s_Y ≤ 0.79 and 0.47 ≤ s_Y ≤ 0.53, respectively.

Minimum correlation factor for ZD strategies with 1 ≤ χ < ∞
We numerically calculated the minimum correlation factor χ_c for subsets of ZD strategies with 1 ≤ χ < ∞ to exist (Fig. 2). Each curve corresponds to a value of δ, as shown in the legend. The area enclosed by each curve and the vertical axis is the region of χ that ZD strategies can utilize when ϵ + ξ is fixed. As the error ϵ + ξ becomes larger and the discount factor δ becomes smaller, the minimum correlation factor χ_c becomes larger.

III.4. Numerical examples of representative ZD and unconditional strategies under errors in repeated games
We numerically demonstrate that ZD and unconditional strategies can impose a linear relationship between the two players' payoffs while other strategies cannot in the RPD game under errors. We take contingent Extortion and Equalizer as representatives of ZD strategies, ALLD (always defect) as a representative of unconditional strategies, and Win-Stay-Lose-Shift (WSLS), which in general is neither a ZD nor an unconditional strategy. Figure 3 shows the relationship between the two players' expected payoffs per game with the payoff vector (T, R, P, S) = (1.5, 1, 0, −0.5). The gray quadrangle in each panel represents the feasible set of payoffs. We fixed one particular strategy for player X (vertical axis) and randomly generated 1,000 strategies satisfying 0 ≤ q_0, q_1, q_2, q_3, q_4 ≤ 1 for player Y (horizontal axis). Thus, each black dot represents the payoff relationship between the two players. In addition, blue and red mark particular cases for player Y: red is the case that player Y is ALLD and blue is the case that player Y is ALLC. We set δ = 1 for the top panels and δ = 0.9 for the bottom panels; ξ = 0 is fixed and ϵ is varied over 0, 0.1, and 0.2. We adopted p_0 and κ so that χ_c was minimized. Figures 3A and E show the case of WSLS vs. the 1000 + 2 strategies. As WSLS is in general neither a ZD nor an unconditional strategy, the payoff relationships are not linear, irrespective of errors and the discount factor. Figures 3B and F show the case of contingent Extortion vs. the 1000 + 2 strategies. If there are no errors, Extortion is unbeatable against any opponent, as shown by the black dots. For instance, when δ = 1 and ϵ + ξ = 0, the Extortioner p = (0.86, 0.77, 0.09, 0), which passes over (P_E, P_E), can impose a linear payoff relationship on the opponent with the slope χ = 15 [black dots in Fig. 3B]. Note that δ = 1 corresponds to the game without discounting. In that case, the expected payoff of a game does not depend on p_0 because all terms containing p_0 and q_0 in Eq. (18) vanish; in other words, the value of p_0 is arbitrary. Even for δ = 0.9 and ϵ + ξ = 0, the Extortioner p = (0.955556, 0.855556, 0.1, 0; 0), which passes over (P_E, P_E), can impose a linear payoff relationship on the opponent with the slope χ = 15 [black dots in Fig. 3F].
However, as proved in Sec. III.2, when there are errors, only contingent Extortion can exist. Thus, there is a region near (P_E, P_E) in which the expected payoff of contingent Extortion is lower than the opponent's payoff [see the yellow-green and cyan dots in Figs. 3B and F], even though its payoff increase is still larger than the opponent's, due to χ > 1, when the opponent tries to increase his payoff.

IV. DISCUSSION
We considered both a discount factor and observation errors in the RPD game and analytically studied the strategies that enforce linear payoff relationships in this setting. First, we derived the determinant form of the two players' expected payoffs even though a discount factor and observation errors are incorporated. Then, we searched for all possible strategies that enforce linear payoff relationships in the RPD game. As a result, we found that ZD strategies and unconditional strategies are the only strategy sets that enforce such a relationship on the opponent. Then, we proved that Extortion and Generous strategies no longer exist when there are errors. Finally, we numerically showed the minimum discount factors for Equalizer (χ → ∞) and the minimum correlation factors for the other subsets of ZD strategies (1 ≤ χ < ∞) above which those ZD strategies exist.
We showed that ZD strategies can still exist even when the discount factor δ deviates from 1. In real life, we remember interacting with other people, but the interaction concludes after a certain time. People sometimes change whom they interact with because they move to other places. Young animals spend most of their time with their parents; after they grow up, they interact with new companions. Some interactions, such as mating, last only a short time. Our results demonstrate that strategies which unilaterally control the opponent's payoff exist even in such limited repeated interactions.
We also showed that ZD strategies can still exist even when there are observation errors to some extent. Such noise often happens in real interactions between individuals. Thus, if ZD strategies were to exist only without noise, the applicability of ZD strategies in real problems would be quite limited. Our results are important because we expanded the applicability of ZD strategies to those problems.
Our results are limited to the two-player RPD games. Other studies have focused on n-player games [19,26,27,56]. It is worth investigating games including observation errors and a discount factor for n-player games. On the other hand, regarding memory, our study only used memory-1 strategies. A recent study revealed the role of longer memories for the evolution of cooperation, which is another direction to investigate [51].
When spatial structures are included, a different role of Extortioners has been identified [16][17][18][21]. Extortioners are neutral with respect to ALLD if there are no errors; thus, Extortioners can neutrally invade a sea of ALLD players in a spatial structure. On the other hand, the best response to Extortioners is ALLC. Once ALLC arises, clusters of ALLC players outperform those of Extortioners, and cooperation is promoted. In this way, Extortion has been demonstrated to act as a catalyst for cooperation. Another interesting question is how observation errors and a discount factor affect the evolution of cooperation in a spatial setting.
We considered a model of private monitoring in direct reciprocity, where signals are interpreted personally. Signals may play an even more important role in indirect reciprocity than in direct reciprocity because they are shared by many players as a social norm that affects their behavior. As far as we know, no one has found ZD strategies in indirect reciprocity, and it is not even known whether similar techniques can be applied to that problem. Thus, this is a potentially challenging and exciting direction for future research.
Game theory has been applied to practical problems. For instance, it has been used in problems of selfish routing and traffic congestion, where imposing a tax on selfish terminals or drivers is one possible way to improve the efficiency of the total flow. As we showed, ZD strategies can induce cooperative behavior in other players. Thus, incorporating ZD strategies instead of imposing taxes in those problems may be an effective way to improve efficiency. Our study opens new research directions for applying ZD strategies in various fields.
Appendix A: Sum of the elements of the mean distribution

We show that the sum of the elements of the mean distribution v is equal to one. Expanding the matrix inverse as a power series,

v^T = (1 − δ) Σ_{t=0}^{∞} δ^t v(0)^T M^t,

which is another form of Eq. (13). Because the sum of every row of the transition matrix M is equal to one, M1 = 1, and hence v(0)^T M^t 1 = v(0)^T 1 = 1 for every t. Therefore, v^T 1 = (1 − δ) Σ_{t=0}^{∞} δ^t = 1.

Appendix B: Equality of v(0)^T and v^T M_0

We show that v(0)^T and v^T M_0 are equal, where v and M_0 are defined by Eqs. (13) and (14), respectively. Calculating the matrix multiplication v^T M_0, its j-th component is Σ_i v_i (M_0)_{ij} = (v_1 + v_2 + v_3 + v_4) v_j(0), because every row of M_0 equals v(0)^T. Since v_1 + v_2 + v_3 + v_4 = 1 (Appendix A), the following holds:

v^T M_0 = v(0)^T.

Appendix C: Strategies that enforce D(p, q, αS_X + βS_Y + γ1) = 0.
To search for all possible strategies that make D(p, q, αS_X + βS_Y + γ1) = 0, we express Eq. (24) in component form (Eq. (C1)). Factoring q out of Eq. (C1), we obtain Eq. (C2). Here, we search for strategies that satisfy D(p, q, αS_X + βS_Y + γ1) = 0 irrespective of Y's strategy q, meaning that Eq. (C2) must hold for every q. Therefore, the coefficient of each element of q in Eq. (C2) must equal zero; that is, the following conditions are necessary: If there exist real numbers s, t, u, v, α, β, and γ such that Eqs.