Comparison of insect and human cytochrome b561 proteins: Insights into candidate ferric reductases in insects

Cytochrome b561 (cytb561) proteins comprise a family of transmembrane oxidoreductases that transfer single electrons across a membrane. Most eukaryotic species, including insects, possess multiple cytb561 homologs. To learn more about this protein family in insects, we carried out a bioinformatics-based investigation of cytb561 family members from nine species representing eight insect orders. We performed a phylogenetic analysis to classify insect cytb561 ortholog groups. We then conducted sequence analyses and analyzed protein models to predict structural elements that may impact the biological functions and localization of these proteins, with a focus on possible ferric reductase activity. Our study revealed three ortholog groups, designated CG1275, Nemy, and CG8399, and a fourth group of less-conserved genes. We found that CG1275 and Nemy proteins are similar to a human ferric reductase, duodenal cytochrome b561 (Dcytb), and have many conserved amino acid residues that function in substrate binding in Dcytb. Notably, CG1275 and Nemy proteins contain a conserved histidine and other residues that play a role in ferric ion reduction by Dcytb. Nemy proteins were distinguished by a novel cysteine-rich cytoplasmic loop sequence. CG8399 orthologs are similar to a putative ferric reductase in humans, stromal cell-derived receptor 2. Like other members of the CYBDOM class of cytb561 proteins, these proteins contain reeler, DOMON, and cytb561 domains. Drosophila melanogaster CG8399 is the only insect cytb561 with known ferric reductase activity. Our investigation of the DOMON domain in CG8399 proteins revealed a probable heme-binding site and a possible site for ferric reduction. The fourth group includes a subgroup of proteins with a conserved “KXXXXKXH” non-cytoplasmic loop motif that may be a substrate binding site and is present in a potential ferric reductase, human tumor suppressor cytochrome b561. This study provides a foundation for future investigations of the biological functions of cytb561 genes in insects.

of these proteins' ferric reductase activity is not clear; for example, although CGcytb can reduce ferric-57 chelates in vitro, it is thought to function in chromaffin granules by reducing MDHA to produce 58 ascorbate, which is then used by dopamine β-hydroxylase (DBH) in the synthesis of norepinephrine 59 [16,17]. It seems likely that at least some cytb561s are multifunctional proteins, and that substrate 60 specificity may be influenced by the substrate availability where a particular cytb561 is located. 61

109
Identification of insect cytb561 protein sequences 110 For this study, we selected D. melanogaster as our model insect species, along with eight 111 additional insect species of interest: Anopheles gambiae (mosquito), Acyrthosiphon pisum (aphid), Apis 112 mellifera (bee), Ctenocephalides felis (flea), Papilio xuthus (butterfly), Pediculus humanus corporis 113 (louse), Tribolium castaneum (beetle), and Zootermopsis nevadensis (termite). These species were 114 chosen for their phylogenetic diversity and their association with high-quality genomic data. 115 We identified D. melanogaster cytb561 protein sequences using the FlyBase protein domain 116 search tool with the query "Cytochrome b561" [33]. Of the nine resulting hits, eight possessed the four 117 conserved heme-coordinating histidine residues characteristic of cytb561 family members: CG1275, 118 Nemy, CG8399, CG10165, CG13077, CG13078, CG10337, and CG3592. The remaining sequence, CG6479, 119 lacked these conserved histidines and was eliminated from our dataset. 120 To identify cytb561 sequences in the insect species of interest, we performed species-specific 121 BLAST-P searches using the National Center for Biotechnology Information (NCBI) non-redundant 122 database [34]. Eight D. melanogaster cytb561 amino acid sequences were used as queries: NP_728727.1 123 (CG1275), NP_725208.1 (Nemy), NP_611079.2 (CG8399), NP_609982.1 (CG10165), NP_609990.1 124 (CG13077), NP_609989.1 (CG13078), NP_609986.1 (CG10337), and NP_570039.1 (CG3592). Additional 125 insect cytb561 sequences were identified using the NCBI Conserved Domain Database (CDD) [35]. The 126 NCBI CDD search engine was populated with a combination of "insect species" and either "cyt_b561" or 127 "cytochrome_b_N," and a search was executed. 128 Isoforms and identical hits were removed from the dataset, resulting in one unique sequence 129 for each gene. Of the 46 non-D. melanogaster sequences discovered, 36 resulted from BLAST-P searches 130 and ten resulted from NCBI CDD queries. A sequence from T. castaneum (XP_008195477.1) was edited 131 using RNA-seq data to remove an apparently misplaced intron, resulting in the addition of 53 C-terminal 132 amino acids and a complete cytb651 core domain. The NCBI accession number and source of each 133 sequence is listed in S1 Table.  134 Three sequences with unique properties were observed: XP_314065.2 (A. gambiae), 135 XP_001950579.2 (A. pisum), and XP_001949276.1 (A. pisum). These three sequences were used as 136 separate queries to conduct BLAST-P searches against the NCBI non-redundant database for the class 137 Insecta to determine whether other insect species have proteins with similar sequences. 138 Defining homologous regions of protein sequences 139 Prior to performing phylogenetic and sequence analyses, we identified the homologous region 140 of each insect cytb561 protein. We used the HHpred server (Max Planck Institute Bioinformatics Toolkit 141 webserver), which is a tool for protein remote homology detection that generates a multiple sequence 142 alignment, predicts secondary structure, then creates a profile hidden Markov model that is compared 143 with a target database [36]. We evaluated each insect and outgroup sequence using the HHpred 144 PDB_mmCIF70_3_Mar server set to default parameters, including local alignment. This server 145 references the Protein Data Bank (PDB) of known protein structures for possible templates, then 146 generates results based on the probability percentage that the query shares a homologous region with 147 the template. An estimated probability percentage of > 95% indicates that homology is nearly certain 148 [36]. All insect and outgroup sequences were determined to share the highest probability percentage of 149 homology (> 99%) with regions of Dcytb (CYBRD1, NP_079119.3, PDB 5ZLG). The predicted homologous 150 region was not identical for all insect sequences; therefore, we identified the region that all insect 151 sequences have in common and defined this as the common homologous region, which corresponds to 152 Dcytb A44-T179. We used common homologous sequences for all analyses unless otherwise described. 153 The common homologous region of each sequence is listed in S1 Table.  154 To compare insect cytb561s with the six known human cytb561 proteins, we used HHpred and 155 the same methodology described above to define the common homologous regions of the following 156 where we applied a custom color scheme and obtained consensus sequence logos and conservation 166 scores [37,38]. Accession numbers for insect sequences used in alignments are in S1 Full-length insect cytb561 protein sequences were submitted to the SignalP-5.0 webserver for 173 detection of predicted signal peptides, and they were analyzed by the DEEPLOC-1.0 webserver for 174 prediction of subcellular protein localization [40,41]. Sequences were also investigated by eye to identify 175 putative late-endosomal/lysosomal targeting signals in the C-terminal portion of the protein [42]. 176

Phylogenetic analysis
177 For our phylogenetic analysis, we selected an outgroup species, Trichoplax adhaerens, based on 178 two qualifying criteria: T. adhaerens belongs to an ancient animal lineage near the base of the metazoan 179 tree, and its cytb561 genes are homologous to insect cytb561 genes [43]. To identify an outgroup 180 sequence, we evaluated the cytb561 genes from T. adhaerens and selected the sequence with the 181 lowest identity to the D. melanogaster cytb561 sequences. The phylogenetic analysis was conducted 182 using the MEGA version X software [44]. First, all common homologous insect cytb561 sequences and 183 the outgroup cytb561 sequence were aligned in CLUSTAL Omega using the "Pearson/FASTA" output 184 parameter; the resulting alignment was uploaded to MEGA X, and a rooted phylogenetic tree was 185 created using the maximum likelihood (ML) method. To determine the optimal model for ML tree 186 construction, we ran the "Find Best DNA/Protein Models (ML)" tool in MEGA X, which resulted in 187 recommended parameters: the LG with freqs. ( Hemiptera (A. pisum), Hymenoptera (A. mellifera), Lepidoptera (P. xuthus), Psocodea (P. humanus 230 corporis), and Siphonaptera (C. felis) (S1 Table). All investigated species have between three (P. 231 humanus corporis) and nine (P. xuthus) cytb561 genes (S1 Table). 232 Using the HHpred PDB server for protein remote homology detection, we defined a homologous 233 region (core domain) shared by all of the insect cytb561s [36] (S1 Table). The ~130 amino acid long core 234 domain sequences were then used to generate a multiple sequence alignment (S1 Fig). From this 235 multiple sequence alignment, we generated a phylogenetic tree and found strong support for four 236 cytb561 groups: three orthologous groups and a fourth group that lacks one-to-one orthologs (Fig 1). 237 For simplicity, we will refer to all insect sequences that cluster in the three orthologous groups 238 according to the representative D. melanogaster protein name: CG1275, Nemy, or CG8399. The 239 remaining 25 sequences will be referred to as Group 4.  orthologs in most or all species) were identified: CG1275, Nemy, and CG8399. The remaining genes 247 formed a fourth group that has no one-to-one orthologs but includes many species-specific gene

254
The CG1275 and Nemy groups are closely related, as evidenced by bootstrap support of a 255 CG1275-Nemy cluster and the short branch lengths for the CG1275 and Nemy groups (Fig 1). CG1275 256 proteins were identified for eight of the nine insect species (Fig 1; S1 Table). We did not identify a 257 CG1275 ortholog in the lepidopteran insect P. xuthus, and we were unable to find a CG1275 sequence in 258 any other species from the order Lepidoptera (butterflies and moths). Of the species analyzed, A. pisum 259 was the only insect with a CG1275 gene expansion (Fig 1; S1 Table). We identified Nemy proteins in 260 eight of the species (Fig 1; S1 Table), but we did not find a Nemy sequence from A. mellifera or any other 261 species in the order Hymenoptera (bees, wasps, ants, and similar insects). A previously published 262 phylogenetic analysis of eukaryotic cytb561s suggested that CG1275 and Nemy from D. melanogaster 263 and A. gambiae may be orthologous to human Dcytb, CGcytb, and Lcytb, but a lack of statistical support 264 for the clade makes the phylogenetic relationship of CG1275 and Nemy with Dcytb, CGcytb, and Lcytb 265 uncertain [19]. 266 CG8399 267 CG8399 is the only orthologous group with sequences from all nine insect species. This cluster 268 includes ten proteins, nine of which contain multiple domains -reeler, DOMON(s), and cytb561 -269 consistent with some members of the CYBDOM protein family (Fig 1; S1   Group 4 genes were more likely than other insect cytb561 genes to undergo species-specific expansions 278 (Fig 1). We did not identify any Group 4 genes in P. humanus corporis or A. pisum (Fig 1). 279

Sequence identity between human and insect cytb561s
280 Human cytb561s have been more widely studied than insect cytb561s, and human Dcytb is one 281 of only two cytb561 structures that have been solved [18]. We were interested in determining the 282 sequence similarities between the 54 insect cyb561s and the six human cytb561s (Dcytb, CGcytb, Lcytb, 283 SDR2, TScytb, and CYB561D1) to learn which of the better-characterized human cytb561s is most similar 284 toeach of the insect cytb561s. For this analysis, we identified the common homologous region of each 285 insect and human cytb561 sequence, then determined their pairwise identities. CG1275

Analysis of insect cytb561 domains 305
Locations of conserved amino acid residues 306 We used amino acid sequence alignments to identify conserved amino acid residues in insect 307 cytb561 domains. To facilitate comparisons of residues from diverse sequences, we denote residues as 308 the number of positions they are upstream (-) or downstream (+) of one of the four conserved heme-309 coordinating histidine residues; for example, the residue immediately upstream of H1 is referred to as 310 H1-1. To identify amino acid residues that are conserved in all four insect cytb561 groups, we aligned 311 the ~130 amino acid core domain sequences of all 54 insect species and found that only the heme-312 coordinating histidines, H1-H4, were completely conserved (S1 Fig). Note that for the CG1275, Nemy, 313 and Group 4 proteins, H1-H4 are in TM2-TM5, whereas in the CG8399 proteins, H1-H4 are in TM1-TM4. 314 In addition to the four positions containing conserved histidines, 11 positions have highly-conserved 315 residues (with Jalview conservation scores of 8 or 9), including seven with small residues (Gly, Ala, and 316 Ser), and four with larger hydrophobic residues (S1 Fig). Five of the seven small residue positions were 317 previously noted [3], including two that are present in 106 eukaryotic cytb561s (H1+7, H4+4) [19]. To 318 identify amino acid conservation within specific insect cytb561 groups, we separated the core domain 319 alignment into five group-specific alignments and then used Jalview to generate group-specific 320 consensus logos (Fig 2) [38]. These consensus logos reveal patterns of conservation within groups and 321 allow residues at specific locations to be compared across groups. histidine. If the corresponding residue in Dcytb was predicted to contact a heme or ascorbate, that 328 position is indicated by a symbol: non-cytoplasmic heme (square), non-cytoplasmic ascorbate (upside-329 down triangle), cytoplasmic heme (red diamond), or cytoplasmic ascorbate (red triangle). The Jalview 330 consensus rows show the most common residue at that position listed below the logo; a + is used where 331 there are two or more residues with the highest percentage and a -is used to show gaps. 332 To analyze the conserved amino acids in the context of the protein structure, we used ChimeraX 333 to overlay a protein alignment of the CG1275, Nemy or CG8399 groups with its representative D. 334 melanogaster AlphaFold protein model (Fig 3). For Group 4B proteins, we selected the D. melanogaster 335 CG13077 AlphaFold model (Fig 3). Since D. melanogaster does not have a Group 4A protein, we used the 336 A. gambiae Group 4A (XP_320673.4) AlphaFold model (Fig 3). The degree of conservation is shown by 337 gradient coloring. By analyzing conservation information from the structural models as well as the 338 group-specific consensus logos, we determined that the insect cytb561 proteins, with the exception of 339 the Group 4B proteins, have relatively high group sequence conservation in the regions near non-340 cytoplasmic heme-coordinating H1 and H3 (Figs 2 and 3). Additionally, CG8399 proteins have relatively 341 high conservation in the regions preceding the cytoplasmic heme-coordinating H2 and H4 (Figs 2 and 3). 342 This is the first large scale analysis of cytb561 proteins since X-ray crystallography studies 358 identified heme-and ascorbate-binding residues in human Dcytb and A. thaliana AtCytb561-B, and a 359 metal ion-coordinating residue in Dcytb [13,18]. We wanted to leverage this structural information to 360 predict whether conserved residues in insect cytb561s might participate in heme, ascorbate, and/or 361 iron-binding. Amino acid sequence alignments and structural models were used to compare insect 362 cytb561 proteins with Dcytb and Atcyb-B. Initial comparisons were restricted to CG1275 and Nemy 363 proteins because of their high similarity to Dcytb and AtCytb561-B; once key residues were identified in 364 CG1275 and Nemy, we used sequence alignments to identify the corresponding residues in the CG8399 365 and Group 4 proteins. References to specific residue numbers are based on Dcytb unless otherwise 366 noted. 367 We first created an alignment of Dcytb, AtCytb561-B, CG1275, and Nemy sequences (Fig 4). 368 From this alignment, which extends beyond the core domain, we identified 23 completely conserved 369 residues and an additional 23 residues that are highly conserved across Dcytb, AtCytb561-B, CG1275, 370 and Nemy sequences (Fig 4). To identify any previously unspecified Dcytb residues that are likely to be 371 important for heme or ascorbate binding, we used ChimeraX to determine predicted protein contacts 372 with each heme and ascorbate group in the Dcytb ascorbate-bound structure (PDB 5ZLG) (Fig 4; S3 373 Table). (Note that Dcytb was selected for this analysis rather than AtCytb561-B for multiple reasons: 374 insects are more closely related to humans than plants, Dcytb is a ferric reductase, and the Dcytb crystal the four strictly conserved histidines are labeled with their number rather than *, and ^ indicates highly 386 conserved with a Jalview conservation score of 8 or 9). Predicted Dcytb heme-contacting residues are 387 labeled with a square (non-cytoplasmic heme) or diamond (cytoplasmic heme), and ascorbate-388 contacting residues are labeled with an upside-down triangle (non-cytoplasmic ascorbate) or triangle 389 (cytoplasmic ascorbate). A blue vertical line indicates the position of 51 amino acids found only in A. 390 pisum XP_001949276.1, which were omitted from the figure. Coloring by amino acid residue: basic 391 residues in blue (Arg) or cyan (Lys); acidic residues (Asp and Glu) in dark pink; amide residues (Asn and 392 Gln) in light pink; hydroxylic residues (Ser and Thr) in pale purple; aromatic residues in light peach (Phe 393 and Tyr) or dark peach (Trp); sulfur-containing residues in dark green (Cys) or light green (Met); His in 394 orange; Pro in yellow; Ile, Leu, and Val in pale yellow; and Ala and Gly in white. 395 Next, we compared structural models of D. melanogaster CG1275 and Nemy with the solved 396 structure of Dcytb. To investigate, we overlayed AlphaFold models for D. melanogaster CG1275 and 397 Nemy with the ascorbate-bound structure of human Dcytb. The AlphaFold models have very high 398 confidence for the majority of residues in the six transmembrane helices and their connecting loops (S4 399 Table), and the alpha carbons of the two insect structures aligned well over most of the Dcytb sequence 400 47 residues that make at least one predicted contact with a heme group (S3 Table), including 32 423 residues that are completely or well conserved in all CG1275 and Nemy sequences, as well as in 424 AtCytb561-B (Fig 4). Many of these 32 residues are predicted to make contacts with a heme group in D. 425 melanogaster CG1275 and Nemy protein models overlayed with Dcytb (S3 Table). In particular, the H1-426 H4 side chains are positioned very similarly to the heme-coordinating H1-H4 side chains in the Dcytb 427 structure (Figs 5B and C). Amino acid residues that were predicted to interact with a heme group or 428 participate in heme positioning, as well as their degree of conservation in the four insect cytb561 429 groups, are described below. arginine at H1+20 is completely conserved in all CG1275, Nemy, and CG8399 sequences (Fig 2) and likely 435 interacts with the cytoplasmic heme A-propionate (Fig 6A). The residue at H1+20 tends to be a serine or 436 threonine in Group 4A proteins (Fig 2), and 22 out of 25 Group 4B sequences also have a polar residue at 437 H1+20 (S1 Fig). Serine is a much smaller residue than arginine, but in Dcytb the gamma carbon of Arg70 438 is only ~3.2 Å from the heme A propionate; therefore, a serine or threonine residue could potentially 439 form a hydrogen bond with a heme carboxylate group if properly positioned. Outside of the Dcytb core 440 domain, at the end of helix 6, Lys225 forms two potential hydrogen bonds with the cytoplasmic heme D 441 propionate (Fig 6A; S3 Table). This helix 6 lysine is conserved in CG1275 sequences and is a basic residue 442 in 5 out of 7 Nemy sequences, but it is replaced with a proline in D. melanogaster and A. gambiae Nemy. We found that two of the highly-conserved small residue positions in insect cytb561s are 470 predicted to form cytoplasmic heme contacts in Dcytb: H3+18 (Gly138) and H4+4 (Gly163) (S3 Table). A 471 glycine at H4+4 is conserved across all insect groups, in addition to more diverse cytb561s [3,19], while a 472 glycine at H3+18 is conserved in all insect groups except CG8399, where an alanine substitution occurs 473 (Fig 2). It is likely that small residues at H3+18 and H4+4 are important for proper cytoplasmic heme 474 positioning within the protein's central cavity in diverse cytb561s (Fig 6A). 475 Non-cytoplasmic heme 476 We performed a similar analysis of non-cytoplasmic heme positioning. In Dcytb and AtCytb561-477 B, two residues form hydrogen bonds with the non-cytoplasmic heme A-propionate group: a serine at 478 H3-2 (Ser118) and a glutamate at H4+21 (Glu180) (Fig 6B)[13,18]. These residues are well conserved in 479 CG1275 and Nemy sequences and form predicted contacts with the heme group in the D. melanogaster 480 CG1275 and Nemy models (Fig 4; S3 Table). As mentioned above, it is notable that the most highly-481 conserved regions among CG1275 and Nemy proteins are the sequences surrounding H1 and H3, which 482 coordinate the non-cytoplasmic heme group (Fig 2). While CG8399 and Group 4 proteins share less 483 conservation in these regions, a serine or threonine at H3-2 is well conserved across Group 4 proteins, 484 and CG8399 sequences most often contain an asparagine (Fig 2; S1 Fig); these polar residues could 485 potentially hydrogen bond with the non-cytoplasmic heme A-propionate (H4+21 is slightly beyond the 486 conserved core cytb561 domain and was therefore omitted from our analysis of CG8399 and Group 4 487 proteins). 488 In Dcytb and AtCytb561-B, an asparagine at H3-5 forms a hydrogen bond with the non-489 cytoplasmic heme D-propionate ( Fig 6B) [13,18]. This position is highly conserved in CG1275 and Nemy 490 proteins (Fig 4); two Nemy sequences have a histidine, which also could form a potential interaction 491 with a heme propionate. All Group 4A and most Group 4B proteins also have a histidine at H3-5 (Fig 2;  492   S1 Fig). In contrast to the other groups, there is no clear conservation at H3-5 in CG8399 proteins, which 493 is part of the core domain non-cytoplasmic loop (Fig 2; S1 Fig). Interestingly, in Dcytb the histidine at H2+22 is conserved but was reported to coordinate with a non-497 cytoplasmic zinc ion (Fig 6B, also discussed in more detail below) [18]. All but two CG1275 and Nemy 498 sequences have a histidine at H2+22, whereas this position is most commonly a lysine in Group 4 499 proteins or a hydrophobic residue in CG8399 proteins ( Fig. 2; S1 Fig). Dcytb has a methionine (Met116) 500 at H3-4 whose amide nitrogen is also positioned to hydrogen bond with the non-cytoplasmic heme D-501 propionate, similarly to Phe114 in AtCytb561-B (Fig 6B). In CG1275 and Nemy sequences, a methionine, 502 phenylalanine, or leucine is found at H3-4 and a phenylalanine is also common in Group 4 proteins (18 503 out of 25) (Figs 2 and 4). In contrast, CG8399s have no notable conservation at this position (S1 Fig). The 504 hydrophobic side chain at H3-4 in Dcytb (Met116) appears to anchor the non-cytoplasmic loop between 505 helices 3 and 4 in the membrane, and its backbone carbonyl is predicted to form a hydrogen bond with 506 the sidechain of the tryptophan (Trp122) at H3+2, providing added stability to the loop region between 507 helices 3 and 4 (Fig 6B). A tryptophan at H3+2 is entirely conserved in CG1275 and Nemy proteins (Fig 4), 508 suggesting a similar positioning of the non-cytoplasmic heme and loop between helices 3 and 4. The 509 tryptophan at H3+2 is not conserved in CG8399 or in most Group 4 proteins (Fig 2). 510 Glycine residues at positions H3+4 (Gly124) and H4+18 (Gly177) are predicted to make non-511 cytoplasmic heme contacts in the Dcytb, CG1275, and Nemy structures (Figs 4 and 6B; S3 Table) These 512 residues were previously identified as highly conserved across diverse cytb561s [3]. The H3+4 glycine is 513 conserved in all CG1275, Nemy, and CG8399 sequences, and is highly conserved in Group 4 proteins (21 514 out of 25) (Fig 2). A glycine at H4+18 is conserved in all CG1275 and Nemy sequences and most Group 4 515 proteins (19 of 25) also have a glycine at H4+18, whereas CG8399s have an alanine substitution (Fig 2;  the ascorbate ring [13,18]. We located these Dcytb and AtCytb561-B residues in relation to the 529 conserved heme-coordinating histidines and examined our insect sequences for conserved residues, 530 especially basic and aromatic residues, at these locations. We also performed ChimeraX contact analysis 531 with the Dcytb structure and overlayed CG1275 and Nemy structures to identify other amino acids 532 proximate to ascorbate to identify similarities and differences in how ascorbate might bind. 533 Potential cytoplasmic ascorbate binding residues 534 The Dcytb and AtCytb561-B crystal structures clearly identified three basic residues that bind 535 cytoplasmic ascorbate: H2-7 (Lys79), H2-3 (Lys83), and H4-7 (Arg152) [13,18]. Additionally, ChimeraX 536 analysis predicted that Dcytb has ascorbate contacts at H3+22 (Phe142) and H4-3 (Met156) (Fig 7A; S3 537 Table). These five residues are also predicted to have cytoplasmic ascorbate contacts in D. melanogaster 538 CG1275 (S3 Table), and most CG1275 proteins have these conserved residues except for some variation 539 at H4-3 (Fig 4). In Nemy sequences, a lysine at H2-3, an arginine at H4-7, and a phenylalanine at H3+22 540 are entirely conserved, whereas either a valine or threonine is present at H4-3 (Fig 4); D. melanogaster 541 Nemy is predicted to have ascorbate contacts at these four positions (S3 Table). More variation is seen 542 at Nemy H2-7, which is most often an arginine (4 out of 8) and in one sequence is a lysine (Fig 4). 543 However, H2-7 is not a basic residue in the D. melanogaster, A. gambiae, and A. pisum Nemy sequences 544 (Fig 4). The D. melanogaster Nemy model suggests a shift in the placement of the lysine at H2-3 (Fig 7A), 545 which may compensate for the lack of a basic residue at H2-7. Additionally, the Nemy model predicts 546 that a serine (Ser205) at H4-6 hydrogen bonds with ascorbate, suggesting a unique functional role for 547 this serine in D. melanogaster Nemy and in A. gambiae Nemy, which also has a serine at this position 548 ( Fig 7A). Of the five residues proximate to cytoplasmic ascorbate in Dcytb and AtCytb561-B (basic 563 residues at H2-7, H2-3, and H4-7, aromatic at H3+22, bulky hydrophobic at H4-3), only an arginine at H4-564 7 (Arg509) is completely conserved across CG8399 sequences (amino acid numbering in this section 565 corresponds with D. melanogaster CG8399) (Fig 2). Most CG8399 sequences also have a conserved 566 lysine at H2-7, although D. melanogaster CG8399 has a threonine (Thr443) (S1 Fig). To visualize 567 conservation on the cytoplasmic side of CG8399 proteins, we overlayed the CG8399 consensus 568 sequence alignment onto the D. melanogaster CG8399 AlphaFold model (Fig 8). Of the remaining 569 positions discussed in the interaction of cytoplasmic ascorbate with Dcytb, AtCytb561-B, CG1275, and 570 Nemy proteins (H2-3, H3+22, and H4-3), CG8399 sequences have conserved residues with differing 571 properties. At H2-3, CG8399s have a phenylalanine (Phe447) rather than the lysine found in Dcytb, 572 AtCytb561-B, CG1275, and Nemy proteins, and a conserved tryptophan (Trp446) precedes H2-3 at H2-4 573 (Fig 8). Instead of a bulky hydrophobic residue at H3+22, a proline (Pro502) is present in CG8399s; that 574 proline is preceded by a completely conserved CG8399 arginine (Arg501) at H3+21 (Fig 8). Instead of a 575 bulky hydrophobic residue at H4-3, CG8399 proteins have an asparagine (Asn513). We noted additional 576 highly-conserved positions in this region among CG8399 proteins, including a second proline at H3+24 577 (Pro504) and a basic residue at H1+23 (Lys432), which is either a lysine or arginine in all the CG8399 578 sequences. H1+23 is an arginine in most CG1275 and Nemy sequences although not in Dcytb (Trp73) or 579 AtCytb561-B (Gln74) (Fig 4). While side-chain placement cannot be determined with high confidence, it 580 is notable how highly conserved this region is across multi-domain CG8399 proteins, including conserved 581 residues suitable for binding ascorbate: basic residues at H1+23, H3+21, and H4-7, and aromatics at H2-582 4 and H2-3 (Fig 8; S5 Fig). we identified as having a potential hydrophobic contact with cytoplasmic ascorbate in Dcytb, CG1275, 600 and Nemy (Fig 7A; S3 Table). It is possible that the basic residues at these highly-conserved In Dcytb the only residue found to directly interact with non-cytoplasmic ascorbate was Phe184 616 (H4+25) via a van der Waals interaction, although several residues were discussed as contributing to the 617 environment of the ascorbate-binding pocket [18]; in AtCytb561-B the same H4+25 (Phe182) interaction 618 was reported, and several residues were also suggested to form possible hydrogen bonds, including 619 H2+22 (His106) and H3-3 (Tyr115) [13]. Our analysis revealed that three Dcytb residues have predicted 620 contacts with non-cytoplasmic ascorbate: H1-3 (Phe47), H2+22 (His108), and H4+25 (Phe184) (Fig 7B; S3 621 Table). A phenylalanine at H1-3 is also predicted to make ascorbate contacts in D. melanogaster CG1275 622 and Nemy (S3 Table); a phenylalanine at this position is conserved across CG1275 and Nemy sequences, 623 and is centrally located -near the zinc ion, the non-cytoplasmic ascorbate, and the non-cytoplasmic 624 heme -in Dcytb, CG1275, and Nemy structures (Fig 7B). In Dcytb, Phe47 was the only residue that had 625 an observed change in electron density upon zinc ion and ascorbate binding, indicating a change in 626 position [18]. Most Group 4A proteins have a phenylalanine at H1-3; however, there is less conservation 627 at this position in CG8399 and Group 4B proteins (S1 Fig).  628 As mentioned above, the phenylalanine (Phe184) at H4+25 was determined to form a van der 629 Waals interaction with the non-cytoplasmic ascorbate in Dcytb and AtCytb561-B [13,18]. An aromatic 630 amino acid, most often phenylalanine, is typical at H4+25 across CG1275 and Nemy sequences with a 631 small number of exceptions; the most dramatically different is a glutamate (Glu236) in D. melanogaster 632 Nemy (Fig 4). In the absence of ascorbate, our analysis revealed that Glu236 and a lysine at H3-7 633 (Lys161) are predicted to form a hydrogen bond in the D. melanogaster Nemy model that would run 634 straight through the region occupied by the non-cytoplasmic ascorbate ring (Fig 7B). The D. 635 melanogaster Nemy model also has extensive contacts predicted between the non-cytoplasmic 636 ascorbate and Lys161 (23, including one hydrogen bond) and Glu236 (29) (S3 Table). These results 637 suggest that D. melanogaster Nemy could bind non-cytoplasmic ascorbate though a combination of 638 compensatory substitutions in lieu of an aromatic residue at H4+25. H4+25 is slightly outside the 639 conserved core domain and therefore was not included in our analyses of CG8399 and Group 4 640 sequences. 641 In Dcytb, the histidine (His108) at H2+22 was determined to be part of an ascorbate-metal-642 binding complex; His108 was found to directly coordinate with a redox inactive zinc ion, which directly 643 interacted with the non-cytoplasmic ascorbate [18]. The histidine at H2+22 is conserved across CG1275 644 and Nemy sequences and is predicted to be similarly positioned in D. melanogaster CG1275 and Nemy 645 models (Fig 4; Fig 7B). In contrast, a histidine at H2+22 is not conserved in other insect cytb561s (Fig 2). 646 The residue at H2+22 is nonpolar in CG8399 sequences, whereas many Group 4 proteins have a lysine at 647 this position, which could contribute to ascorbate binding but would not be expected to coordinate a 648 metal cation (Fig 2; S1 Fig). 649 His108 was also found to be critical for Dcytb ferric reductase activity, as were residues at H2+21 650 (Asn107), H3-3 (Tyr117), and H4+25 (Phe184, discussed above) [18]. Like Dcytb, CG1275 and Nemy 651 proteins have a tyrosine at H3-3 in addition to the histidine at H2+22 and phenylalanine at H4+25, and 652 they tend to have a conservatively-substituted serine at the H2+21 position (Fig 2). These conserved 653 residues support the hypothesis that CG1275 and Nemy proteins have ferric reductase activity. In 654 contrast, conservation of these residues in the other insect cytb561 proteins is limited to a glutamate 655 that is entirely conserved at H2+21 in multi-domain CG8399 proteins (discussed further below) (Fig 2). 656 Some Group 4A proteins may have a different type of non-cytoplasmic ascorbate-binding site. A lysine-657 containing "KXXXXKXH" motif was observed in the non-cytoplasmic loop of the core domain in six of the 658 seven Group 4A proteins (excluding one of the two C. felis sequences) (Fig 2; S1 Fig). Group 4A includes 659 the proteins identified as more like human TScytb than any D. melanogaster cytb561 (Table 1; S2 Table). 660 We aligned these six Group 4A proteins with human TScytb (which can reduce ferric ions in nanodiscs 661 [14,54]) and TScytb-like proteins of diverse animal species and found that all sequences have this 662 "KXXXXKXH" motif (Fig 9A), which contributes to the high degree of conservation across TScytb proteins 663 on the non-cytoplasmic side (Fig 9B). The location of the "KXXXXKXH" motif in the non-cytoplasmic loop 664 between helices 3 and 4 (where substrate binding occurs in Dyctb), as well as its strict conservation in 665 TScytb-like proteins of diverse species, suggests an important functional role, possibly related to 666 ascorbate binding (Fig 9C). This motif was not identified in other human or insect cytb561s. A completely 667 conserved tryptophan at H4+23 could also potentially contribute to a non-cytoplasmic binding site, and 668 the histidine at H3-5 (within the "KXXXXKXH" motif) could potentially coordinate a ferric ion complexed 669 with ascorbate, similar to the substrate binding site of Dytcb, providing a site for ferric reduction in 670 TScytb-like proteins (Fig 9C). Glu541, as well as Glu471 and Lys539 (Fig 10B). 759 Compared to other Nemys, the A. pisum sequence has approximately 150 additional N-terminal 807 residues preceding helix 1 and approximately 50 additional residues in the non-cytoplasmic loop 808 connecting helix 1 and helix 2, which immediately precedes the core cytb561 domain (S4 Fig). We found 809 proteins with similar extended regions in eight additional species of the Aphididae family, but not in 810 other insects, suggesting that these regions may serve a unique function within aphids (S4 Fig; S5 Table). 811 The extended N-terminal region preceding helix 1 is acidic and includes relatively high percentages of 812 aspartate, glycine, serine, and threonine (S4 Fig). 813 Of the investigated CG8399s, only the A. pisum sequence possesses two DOMON domains. We 814 identified 888 CYBDOM hits in the class Insecta, of which only 19 were found to have two DOMON 815 domains: nine from other species of aphids, eight from other insects from the order Hemiptera, and two 816 from insects in the order Coleoptera (S6 Table). It appears that multiple DOMONs are uncommon in 817 insect cytb561s; however, they have a relatively high occurrence in Hemipterans. Both A. pisum DOMON 818 domains contain the conserved methionine and histidine that are assumed to coordinate a heme group 819 (S3 Fig); however, it is unknown how these two DOMON domains are spatially arranged. 820 While all nine investigated insect species possess a multidomain CG8399, A. gambiae (mosquito) 821 is the only species with a second, single domain CG8399-like protein (XP_314065.2); this sequence lacks 822 both reeler and DOMON domains. Further investigation revealed similar single-domain (cytb561 only) 823 CG8399-like proteins in ten additional mosquito species, as well as two species of fungus gnats, which 824 are also in the order Diptera (S7 Table). The single-domain CG8399-like protein is more similar to the 825 cytb561 domain in multidomain CG8399 proteins than to the cytb561 domain in other groups of 826 cytb561s (Table 1; S2 Table), including its core domain spanning TM1-TM4, rather than TM2-TM5. 827

Cyb561 subcellular localization
828 Subcellular localization is important for protein function in an organism. We submitted each of 829 the 54 full-length insect cytb561s to the SignalP-5.0 webserver for detection of predicted signal 830 peptides, and to the DEEPLOC-1.0 webserver for prediction of subcellular protein localization [40]. Only 831 the multidomain CG8399 proteins had a predicted signal peptide, and they are the only sequences with 832 predicted plasma membrane localization (S8 Table). This result is consistent with the presence of D.  Table). The subcellular localization predictions for Group 4 proteins were not particularly robust, and 836 the predicted location varied depending on the specific Group 4 protein (S8 Table). 837 An alignment of the Dcytb, Nemy, and CG1275 carboyxl-termini revealed that only the CG1275 838 proteins have a "EDXXLL" motif near TM6 that is consistent with a dileucine-based late-839 endosomal/lysosomal targeting signal "(D/E)XXXL(L/I)" (Fig 11) [42]. A. mellifera and T. castaneum 840 CG1275 proteins lack one conserved leucine, and one of the four A. pisum sequences (XP_003246890.1) 841 lacks both leucines, yet these three sequences are still predicted to localize to the lysosome (Fig 11; S8 842 Table). and an additional helical region is in black. The EDXXLL motif is boxed in blue in the consensus logo. 847

848
A major goal of this study was to predict which of the insect cytb561 orthologous groups may 849 function as ferric reductases and, thus, have a possible role in iron uptake by insect cells. Our analyses 850 suggest that none of the four groups of insect cytb561 proteins can be excluded as ferric reductase 851 candidates. 852 Based on sequence similarity, predicted structural similarity, and conservation of key amino acid 853 residues, we conclude that CG1275 and Nemy proteins are strikingly similar to the well-established 854 ferric reductase Dcytb, as well as the closely-related human proteins Lcytb and CGcytb. These similarities 855 extend to the conservation of a metal-binding histidine and other amino acid residues that are required 856 for ferric reductase activity by Dcytb. To date, no functions of CG1275 proteins have been discovered; 857 however, if CG1275 proteins have ferric reductase activity, the constitutive, nearly ubiquitous 858 expression of D. melanogaster CG1275 [64,65] would be consistent with a role in iron uptake by many 859 cell types. We found that CG1275 proteins have a C-terminal dileucine motif that directs proteins to a 860 late-endosomal/lysosomal location, suggesting that CG1275 proteins could function similarly to the non-861 homologous STEAP3 protein, which reduces Fe 3+ to Fe 2+ prior to uptake across endosomal and lysosomal 862 membranes [27,28,66]. Among mammalian cytb561s, a C-terminal dileucine motif is present only in 863 Lcytb proteins [7]. Interestingly, Lcytb appears to function as a lysosomal ferric reductase in Burkitt 864 lymphoma cells, filling the role that STEAP3 plays in some other cell types [67]. Although Nemy proteins 865 have many similarities with Dcytb and Lyctb, their one known function is unrelated to iron homeostasis 866 [4,30]; however, Nemy proteins could be multifunctional enzymes that also play a role in iron uptake. 867 D. melanogaster CG8399 and the mammalian ortholog SDR2 have in vitro ferric reductase activity 868 [6,15,31], but the in vivo functions of CG8399 and SDR2 proteins are not known. Based on a previous 869 biochemical study [63] and our subcellular localization predictions, CG8399 proteins localize to the 870 plasma membrane, with their reeler and DOMON domains in the extracellular compartment. Their 871 presence in the plasma membrane may enable reduction of extracellular ferric ions prior to iron uptake. 872 Our analyses of the predicted structures of the CG8399 DOMON and cytb561 domains suggest that the 873 two domains may function together in ferric reduction, with the interdomain transfer of an electron 874 from cytoplasmic ascorbate to a heme in the extracellular DOMON domain and then to a ferric ion that 875 is perhaps coordinated nearby by multiple aspartate residues. Because D. melanogaster CG8399 876 expression is limited to a subset of tissues [64,65], its hypothetical role in iron uptake would be limited 877 to specific cell types. 878 We have the least amount of evidence to support the hypothesis that the diverse Group 4 879 proteins function as ferric reductases, and a lack of one-to-one orthologs indicates that these proteins 880 may have non-conserved physiological functions. The expression profiles of all but one of the D. 881 melanogaster Group 4 genes is moderately to highly restricted [64,65], excluding a role in iron uptake by 882 most cell types. The Group 4 proteins have low similarity to the well-studied mammalian cytb561s, 883 lacking even conserved residues that bind ascorbate in the cytoplasm. However, multiple basic residues 884 and a large hydrophobic residue at the cytoplasmic surface of most Group 4 proteins may enable the 885 Group 4 proteins to use ascorbate as an electron donor. On the non-cytoplasmic side, a subset of Group 886 4 proteins contain a "KXXXXKXH" motif that is also present in TScytb, a protein with an unknown 887 biochemical function in vivo, but ferric reductase activity in vitro [14]. We propose that the "KXXXXKXH" 888 motif may bind ascorbate and a ferric ion in a hypothetical ferric reduction site. 889 890 891