RESPONSE LETTER TO THE COMMENTS OF THE RECOMMENDER AND REVIEWERS
Resubmission
MS Title: An evaluation of pool-sequencing transcriptome-based exon capture for population genomics in non-model species

Exon capture, coupled with high-throughput sequencing technologies, represents a cost-effective technical solution to answer specific evolutionary biology questions by focusing on areas of the genome under selection. Transcriptome-based capture, which allows exon capture for non-model species, is particularly used in phylogenomics. In the case of population genomics studies, however, it remains poorly developed because the cost of sequencing a large number of indexed individuals across multiple populations of one species is prohibitively high. In this study, we evaluate the possibility of combining transcriptome-based capture and pool-seq (before extraction) as a cost-effective, generic and robust approach to estimate the population variant allelic frequencies of any species. We designed capture probes for ~5 Mb of randomly chosen de novo transcripts of the Asian ladybird (5,717 transcripts). From a pool of 36 non-indexed individuals, ~300,000 bi-allelic SNPs were called. We found that capture efficacy was high, and that pool-seq was as effective and accurate as individual-seq in detecting variants and estimating allele frequencies. We also propose and evaluate an approach to simplify the processing of read data, which consists of mapping reads directly to targeted transcript sequences in order to obtain coding variants. This approach is effective and does not affect the estimation of SNPs’ allele frequencies, except for a small bias near some exon ends. We demonstrate that this approach can also be used to efficiently predict a posteriori intron-exon boundaries of targeted de novo transcripts, thus making it possible to eliminate genotyping biases around exon ends.

"random" when dealing with our final set of targets, which was not entirely randomly chosen. We have corrected the text accordingly (e.g. Lines 27 and 162). Additionally, we explain in the method section why we performed this step : "We have chosen to work on an exome subset, particularly with the future goal of working simultaneously on a large number of population samples".

Minor points:
-Although this is not the main point of the study, would it be possible to give more details about the de novo transcript annotation (initial numbers, method for reconstruction, sequenced tissues/stages…)?
We have not been clear enough in the manuscript on this point: we did not actually produce and annotate these data ourselves. We only searched for putative peptide-coding sequences (CDS) and calculated the values for the filters used (GC%, N, size) for each of them. Information about the de novo transcripts of H. axyridis is available in other studies, such as Vilcinskas et al. (2013; https://doi.org/10.1098/rspb.2012.2113) and Vogel et al. (2017; https://doi.org/10.1016/j.dci.2016. Consequently, we have not added any further information to the manuscript, but we have modified the text to make the origin of these data more obvious to the reader.
-line 443: "the allele frequency estimates obtained with the two mapping methods were highly correlated both for the pool (r=0.998; Fig. 2C) and for the individuals (r=0.998)." It seems that the correlation of AF between the 2 mapping strategies (CDS vs genome) is slightly different for lower AF values (<0.2), with the mapping onto CDS slightly overestimating AF as compared to mapping onto genome (Fig 2C). Would it be interesting to do the correlations by bins/intervals of AFs?
Figure 3C (formerly 2C) involves a very large number of points (174,307), and the difference mentioned here is mainly a visual artefact due to a few points. Indeed, when focusing on frequency below 0.2 (154,322 SNPs)
-One section of the discussion seems to have been duplicated.
This was indeed a mistake, and we have now removed one of the paragraphs.
-The references are presented twice.
This has been corrected.
Additional requirements of the managing board:
As indicated in the 'How does it work?' section and in the code of conduct, please make sure that:
-Data are available to readers, either in the text or through an open data repository such as Zenodo (free), Dryad or some other institutional repository. Data must be reusable, thus metadata or accompanying text must carefully describe the data.
-Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts, codes) are available to readers in the text, as appendices, or through an open data repository, such as Zenodo, Dryad or some other institutional repository. The scripts or codes must be carefully described so that they can be reused.
-Details on experimental procedures are available to readers in the text or as appendices.
-Authors have no financial conflict of interest relating to the article. The article must contain a "Conflict of interest disclosure" paragraph before the reference section containing this sentence: "The authors of this preprint declare that they have no financial conflict of interest with the content of this article." If appropriate, this disclosure may be completed by a sentence indicating that some of the authors are PCI recommenders: "XXX is one of the PCI XXX recommenders."
We have carefully followed the "How does it work?" section and code of conduct of PCI Genomics.

Overall, this paper contains a lot of work and an interesting method to find intron-exon boundaries.
In general, it was for me rather difficult to follow the reasoning behind some steps, especially those sections related to the SNPs and the filters used at various steps in the materials and methods.
Furthermore, I do not entirely understand what the exact aim is of the paper: exome sequencing or "random subset of the exome" sequencing or sequencing of orthologous targets (because a lot of references are made to phylogenetic studies, even the first sentence immediately refers to that), … To me, the main idea seemed to be the random subset targeted sequencing of coding regions. If that is indeed the general idea and if it thus is not to be used for phylogenetics, neither for sequencing the entire exome, I would more focus the writing of the paper on that: why do you want to sequence a subset, what is the rationale for that, for what can it be used. If the general idea is that sequencing a pool has a limited effect on allelic frequencies, this should also be quantified in more detail.
Overall, I do like the general concept of the paper, however, I think it can be refined more.
See below, in the "in more details" section, for detailed answers about the points raised here.

if results would be reported as XX% (number/total number). Maybe a nice figure detailing the subsets would also be nice and make it easier to follow.
We have (i) reported as many results as meaningful in the XX% format throughout the text, in order to make the manuscript easier to read (e.g. Lines 244, 270, 341), and (ii) added an illustration (Figure 1) summarizing the analyses corresponding to the main objectives of the study, which we believe greatly helps the general understanding of our work. We have.

Variant calling was not performed on only 20 individuals. It was performed on at least 20 individuals (the maximum number of individuals being 23). We did not retain positions that were genotyped in fewer than 20 individuals, in order to obtain precise estimates of allele frequencies. We clarified this in the text (lines 239-240).
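The filter described in this response can be illustrated with a minimal sketch. It assumes diploid genotypes coded as the number of alternate alleles per individual (0, 1 or 2) and `None` for a missing genotype; the function name, the input layout and the example sites are hypothetical and are not taken from our pipeline.

```python
from typing import Optional, Sequence


def alt_allele_frequency(
    genotypes: Sequence[Optional[int]], min_called: int = 20
) -> Optional[float]:
    """Return the alternate-allele frequency at one position.

    genotypes: one entry per individual, the count of alternate alleles
    (0, 1 or 2, assuming diploids) or None when the genotype is missing.
    Positions genotyped in fewer than `min_called` individuals are
    discarded (None is returned).
    """
    called = [g for g in genotypes if g is not None]
    if len(called) < min_called:
        return None
    # Each diploid individual contributes two allele copies.
    return sum(called) / (2 * len(called))


# Hypothetical site with 23 individuals, 2 of them not genotyped: retained.
site_kept = [1] * 10 + [0] * 11 + [None] * 2
# Hypothetical site genotyped in only 19 of 23 individuals: discarded.
site_dropped = [1] * 19 + [None] * 4

print(alt_allele_frequency(site_kept))
print(alt_allele_frequency(site_dropped))
```

With this threshold, every retained frequency estimate rests on at least 40 allele copies.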
In general, I am confused by the usage of the word "exome". Is the goal to ultimately use this technique to sequence the entire exome for a large number of individuals in pool OR to sequence a random subset of the genome to obtain frequency estimates? If the goal is to sequence a random subset, it is also more OK to use the 5717 CDS instead of the 5736 CDS (see earlier remark). The third remark (in the section of biases) still remains at that moment however. In general however, I do have the feeling that you put the subset and the exome at the same level, as also stated in the discussion (line 500) and that biases the results.
We agree that we have not been careful enough in the text with regard to the use of the term "exome".
The bias towards the end: you state in the abstract and line 606 that there is a bigger bias towards the end when it comes to estimating allele frequencies. This is to be clarified more. From the paper, I get the feeling you mean that this refers to more variants that are not in HWE. Bias in allelic frequency to me refers in this case to more widely diverging estimates of the true allele frequency, not to deviations from HWE. At line 610, there is also a statement of 296,736 SNPs. This is the only time I found this number in the manuscript. Where does it come from?
We do not refer to a bias in the estimation of allele frequencies, but to the prediction of false SNPs. We have shown that direct mapping onto CDS can produce false SNPs near the IEBs (due to the sequence similarity between the beginning of the intron and the beginning of the next exon). Figure S6 illustrates this with an example and Figure S2B
Line 147: why are probes that match other species omitted? This also implies that you have to have access to reference genomes of closely related species? If this is the case, best to mention this as a limitation.
We did not omit "probes that match other species". As explained in the manuscript (lines 151-154), we only omitted probes with more than one close match in the H. axyridis de novo transcriptome or in the draft genome of A. glabripennis.
Our method does not require the use of reference genomes of closely related species. However, we believe that the use of the genomes of outgroup species improves the quality of the probe design. This is not a limitation because using easily available genomic data of model species is actually achievable in the probe design process of any non-model species. In our case, it is important to emphasize that Tribolium castaneum and Anoplophora glabripennis are not "related species" of H.
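The probe filter described in this response can be sketched as follows. This is only an illustration: the definition of a "close match" is the one given in the manuscript (lines 151-154) and is not reproduced here, the match counts are assumed to have been computed beforehand, and all names and values are hypothetical.

```python
from typing import Dict, List


def keep_specific_probes(
    transcriptome_matches: Dict[str, int],
    genome_matches: Dict[str, int],
    max_matches: int = 1,
) -> List[str]:
    """Keep probes with at most `max_matches` close matches in each reference.

    transcriptome_matches: probe id -> number of close matches in the
    de novo transcriptome (assumed precomputed).
    genome_matches: probe id -> number of close matches in the outgroup
    draft genome (assumed precomputed; absent probes count as 0).
    """
    return [
        probe
        for probe, n_transcriptome in transcriptome_matches.items()
        if n_transcriptome <= max_matches
        and genome_matches.get(probe, 0) <= max_matches
    ]


# Hypothetical counts: probe_B has two close matches in the transcriptome,
# probe_C has three in the outgroup genome, so only probe_A is retained.
transcriptome = {"probe_A": 1, "probe_B": 2, "probe_C": 1}
genome = {"probe_A": 1, "probe_B": 1, "probe_C": 3}

print(keep_specific_probes(transcriptome, genome))
```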

Line 238-241: why was the Hardy-Weinberg equilibrium test performed? If it is used as a proxy for genotyping error, I do not entirely agree with the concept… It has been shown that HWE testing will not achieve this. Some references:
Leal, S. M. Detection of genotyping errors and pseudo-SNPs via deviations from Hardy

I would recommend the paper after some revisions. My comments are listed below.

General comments:
The authors performed a lot of different steps to generate the data used for the benchmark.
We added an illustration to graphically summarize the analysis workflow (Figure 1).

Specific comments:
line 52: 'even in after', the "in" should be removed.
This has been corrected (line 52).
There are only 86 positions out of 300,036 (0.03%) that are polymorphic only in individuals. This is why they were not visible in Figure 2 (formerly 1). We decided to remove those SNPs from Figure 2, and mention them in the caption.
The word "private" refers here to SNPs identified in a single mapping approach. We added this clarification to the text (Lines 459-461).
line 459: IEB would be better written in full because it seems to be the first time the acronym is used in the Results part.
This has been modified (line 473).
line 462: IEB would be better written in full in the title.

This has been modified (line 477).
line 548-574: The two paragraphs say the same thing. It is, I think, a mistake. One should be chosen.
This was indeed a mistake, and we consequently removed one of the paragraphs.

Comment on IEB_finder:
I was able to run the first step, i.e. Step 1: collect_CDS_infos.pl, but for the second step (Mapping genomic reads on CDS sequences), there is no 'genomicReads.fq' file to test the tool.