SArKS: de novo discovery of gene expression regulatory motifs and domains by suffix array kernel smoothing

Motivation We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score—fold-change, test-statistic, p-value—comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies nonparametric kernel smoothing to uncover promoter motifs that correlate with elevated differential expression scores. SArKS detects motifs by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motifs can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. Results We applied SArKS to published gene expression data representing distinct neocortical neuron classes in M. musculus and interneuron developmental states in H. sapiens. When benchmarked against several existing algorithms for correlative motif discovery using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. Availability https://github.com/denniscwylie/sarks. Contact denniswylie@austin.utexas.edu. Supplementary information appended to document.


Introduction
Discrete sequences-of tones, of symbols, or of molecular building blocks-can provide clues to other characteristics of the entities from which they are derived: a phrase in a bird's song can reveal which species it belongs to, the use of an idiomatic expression can pinpoint a speaker's geographic origin, and a specific short string of nucleotide residues can illuminate the function of a DNA domain. In these examples, insights are gleaned from informative motifs-short subsequences that match some frequently recurring discernible pattern.

SArKS
Of particular interest are DNA regions modulating differential gene expression. Identification of motifs within such regions can help to uncover DNA sequences with autonomous and predictable function (stereotyped expression regulation). While prior efforts have focused on short motifs that participate in regulating eukaryotic gene expression [Walhout, 2006], the details of how and which motifs are needed for expression specificity remain poorly understood.
We present a broadly-applicable algorithm for identifying DNA regulatory regions that support differential gene expression. Our strategy is predicated on the following suppositions: (a) gene expression regulatory regimes involve the binding of transcription factors (TFs) to sites on non-coding DNA in the vicinity of a transcription start site (TSS) [Maston et al., 2006, Nguyen andD'haeseleer, 2006]; (b) TFs act combinatorially to attract and repel transcription machinery [Walhout, 2006]; (c) the same TF binding site may appear multiple times within a stretch of DNA, interspersed with other binding sites [Gotea et al., 2010]; and (d) there is more than one solution: different genes, even those co-expressed within a single cell, may rely on different regulatory mechanisms [Badis et al., 2009]. In accord with these suppositions, we aim to identify TF binding sites associated with enriched transcripts and scrutinize their arrangement for significant patterns that can then be evaluated experimentally.
Many different motif identification methods have been described. Consensus-based methods such as Weeder [Pavesi et al., 2001, Pavesi et al., 2004 focus on fixed-length motifs that occur repeatedly (allowing for small numbers of mismatches) in sequences of interest and can be efficiently implemented using suffix trees [Sagot, 1998, Marsan and Sagot, 2000, Pavesi et al., 2001. Alternately, profile-based methods such as MEME [Bailey and Elkan, 1995, Bailey et al., 2006, Bailey et al., 2009] build a probabilistic motif profile to be compared to a background model in order to classify subsequences as either matching the motif or not.
In contrast, discriminative motif analysis [Sinha, 2003] identifies motifs specifically differentiating one set of sequences (e.g., promoter regions for genes with a given expression pattern) from another (e.g., a set of reference promoter regions). A number of approaches have been applied to this problem [Segal et al., 2002, Segal and Sharan, 2005, Redhead and Bailey, 2007,Fauteux et al., 2008,Valen et al., 2009,Huggins et al., 2011,Yao et al., 2014; one popular recent example, DREME , employs Fisher's exact test to compare counts of motif matches in the target/positive sequences with counts in the background/negative sequences (followed by further refinement of motif profiles for satisfactory candidate motifs). An alternative discriminative strategy employing similar hypergeometric enrichment calculations but coupling them with a zero-or-one-occurrence-per-sequence (ZOOPS) scoring approach is implemented in HOMER [Heinz et al., 2010].
Discriminative approaches incorporate gene-specific information beyond promoter sequence into the motif discovery process-e.g., whether or not a gene shows elevated expression following an experimental manipulation-but these methods implicitly assume a binary representation (e.g., elevated vs. not elevated) for such covariates. Given that the information used to establish the contrasting gene sets is often obtained in the form of continuous measures of differential expression (t-statistics, f -statistics, etc.), with some genes exhibiting extremely divergent expression patterns across conditions while others show more modest differences, it is also useful to consider "correlative motif discovery," which seeks motifs whose presence signals a trend towards higher or lower values of such a continuous measure. While less SArKS heavily investigated than discriminative motif discovery, a few correlative algorithms have been proposed, including MOTIF REGRESSOR [Conlon et al., 2003], which first applies the (non-correlative) MDScan algorithm [Liu et al., 2002] to identify motifs in a subset of high-scoring sequences, then filters the motif set based on the predictive value of regression models based on the selected motifs. Another correlative algorithm, FIRE [Elemento et al., 2007], iteratively optimizes the mutual information between sequence scores and occurrences of candidate motifs, starting from a set of the most informative 7-mer "seed" motifs. Both of these algorithms may be seen as applying correlational information (regression or mutual information, respectively) as a filter to select and refine a set of candidate motifs-MDScan output for MOTIF REGRESSOR or the set of all 7-mers for FIRE-initialized in a non-correlative manner.
In contrast, we aimed to develop an algorithm for correlative motif discovery not requiring a prespecified set of seed motifs to minimize the possibility of missing informative motifs due to suboptimal seeding. Correlative motif discovery requires a rule for assigning numbers to motif patterns, which may then be correlated with sequence scores to rank and select motifs. The generation of a seed set of motif candidates paves the way for quantification by simply counting occurrences of the seed motifs within each of the sequenches w b . However, as the number of potential motifs of length k grows exponentially with k, seeding in this manner generally requires either keeping the initial candidate motifs short (as FIRE does) or by using other (non-correlative) filters to prescreen candidate motifs (as done by MOTIF REGRESSOR).
The combinatorial explosion of possible motifs of length k also implies that, given a fixed set of sequences w b to be analyzed, only a fraction of possible length-k motifs will actually be observed in any w b for k suitably large. For example, given 1000 sequences each of length 1000, at most one million k-mers of any length k can be found. This suggests focusing only on substrings of the w b instead of on the universe of possible patterns which could be present in the w b . More specifically, we consider suffixes-substrings formed by deleting the beginning of a string-of the w b , as these have two critical advantages. First, the number of suffixes of a set of sequences w b grows linearly with (and is actually equal to) the combined length (i.e., b |w b |), and second, lexicographical ordering allows us to assign each suffix a numerical value: the location of the suffix in the sorted list of all suffixes. Just as words sharing a common prefix are found close together in a dictionary, suffixes starting with a shared k-mer are assigned similar numeric values. SArKS develops this idea into a correlative motif discovery algorithm, with a natural extension for identification of longer multi-motif domains (MMDs) spanning tens to hundreds of bases, as described in Section 2.
We present the application of SArKS to real RNA-seq data, using nonparametric permutation testing to compute significance thresholds and estimating false positive rates. We demonstrate that SArKS outperforms several existing algorithms at identifying correlative motifs in a pair of cross-validation testing scenarios. Further analysis of the top motif patterns thus identified reveals among them many particular variations on sequence elements known to have regulatory activity [Mathelier et al., 2015, Elbarbary et al., 2016. By implementing a correlational approach to motif discovery, SArKS thus provides a step forward in taking full advantage of the differential expression information offered by RNA-sequencing experiments in the context of motif discovery.
Symbolic notation is described as it is introduced in the text, but is also systematically described in the glossary provided in Section S2.1 for convenience.
Given n sequences w b (also referred to as words) with associated scores y b , the essential steps of the motif discovery algorithm we propose, illustrated in Fig 1 and more rigorously described in Section 2.1, consist of: 1. concatenating all the sequences w b into one supersequence x (see Eq (1) below); 2. constructing the suffix array [s i ] of this supersequence (Eq (4)), where i indexes all suffixes of x sorted into lexicographic order; 3. mapping the suffix positions i back to the sequences w b i from which the beginnings of the associated suffixes are derived (Eq (5)); and finally 4. for each i, applying kernel smoothing to locally regress the sequence scores y b j on suffix positions j lexically near i (Eq (6)).
We thus encode the motif pattern corresponding to the first few characters of the suffix of x beginning at character s i with the numerical suffix array index value i. Because i gives the position of a suffix in the lexicographically sorted list of suffixes of the concatenated supersequence x, multiple occurrences of a highly conserved motif-even if they derive from different sequences w b -will be consolidated into a run i, i + 1, . . . , j of consecutive index values. Averaging together runs of j − i consecutive scores by kernel smoothing using a kernel of width j − i thus offers a way to compare the scores y b i , y b i+1 , . . . , y b j to the overall score distribution.

Concatenation
Concatenate all words w b (each assumed to end in the line-terminator character $ lexically prior to all other characters) to form word Thus x[l b , l b+1 ) = w b ; that is, the substring of the concatenated string starting at position l b (inclusive) and ending immediately before position l b+1 (exclusive) is the sequence w b (in this paper the first character of a string w is denoted w[0], the second w[1], etc.).

Suffix sorting
Lexically sort suffixes 2  10  AY  4  0  1  3  8  AY  3  3  1  4  0  D  0  0  2  5  5  DRNAY  3  3  1  6 4 N 2 0 7 /3 7 1 NA 1 4 7 /3 8 7 NAY 3 3 10 / 3  9  6  RNAY  3  3  2  10  11  Y  4  0  2  11  9  Y  3  3  Table of all suffixes of x (part of each suffix following first end-of-sequence character shown in light gray), along with index b of input sequence w b each suffix derived from and score y b associated with w b . (C) Sorted suffix table indicating suffix array index i, suffix array value si, suffix (sequence following first end-of-sequence character has been removed), sequence of origin bi, associated score y b i , and smoothed scoreŷi generated using smoothing window of size 3 (kernel half-width κ = 1). (D) Smoothed scoresŷi plotted against suffix array index i, indicating peak at i = 8 corresponding to suffix NAY of input sequence DRNAY. Note that prefix NA of this suffix is longest substring common to the two input sequences w1 and w3 with scores y b > 0.

Motif selection
SArKS into ordered set S = {x s 0 , x s 1 , · · · , x s ln−1 } (4) thereby defining suffix array [s i ] mapping index i of suffix in S to suffix position s in x (in our software we rely on the Skew algorithm [Kärkkäinen and Sanders, 2003] modified to use a difference cover of 7 and implemented in SeqAn [Döring et al., 2008] to efficiently compute the suffix array).

Block marking
mapping index i of suffix in S to block b containing suffix position s i . The block array then tells us that the character x[s i ] at position s i in the concatenated string x is derived from

Kernel smoothing
Calculate locally weighted averagesŷ where the kernel K ij acts as a weighting factor for the contribution of the score y b j to the smoothing window centered at sorted suffix index i. Loosely speaking, K ij is used to measure how similar (the beginning of) the suffix x[s j , |x|) is to be considered to (the beginning of) the suffix x[s i , |x|) in the calculation of a representative scoreŷ i averaged over suffixes similar to x[s i , |x|). As the suffixes have been sorted into lexicographic order, the magnitude of the difference i − j provides some information regarding this similarity: the key idea of the kernel smoothing approach described here is that Eq (6) with K ij defined to be a function of |i − j| may therefore offer a computationally tractable approach for identifying similar substrings (prefixes of suffixes) which tend to occur preferentially in high scoring words w b . In this work we use a uniform kernel which allows Eq (6) to be computed easily in terms of cumulative sums: The kernel half-width κ appearing in Eq (7) is an important adjustable parameter in the SArKS methodology controlling the degree of smoothing applied. Increasing κ smooths over more, and hence generally more diverse, suffixes, potentially increasing statistical power at the expense of the resolution of the detected motifs. We recommend investigating a range of values of this parameter as is illustrated in Section 3.

k selection
Set lengthk i for k-mer associated with suffix array index i by locally averaging the length of suffix sequence identity: where k max functions both to increase computational efficiency and to makek i more robust in the presence of a small number of long identical substrings. All results presented here based on k max = 12: This value was selected as k max ≈ log 4 |x| where x is the longest concatenated sequence string considered in Section 3.2, so that for k > k max there are more distinct k-mers than there are positions for such k-mers to occur in all of the sequences w b composing x. Eq (9) is similar to Eq (6) except that: (a) Eq (9) smooths the length (capped at k max ) of the longest prefix on which the suffixes x[s i , |x|) and x[s j , |x|) agree instead of smoothing the score y b j as in Eq (6); and (b) Eq (9) omits the central term i = j as it trivially compares the suffix beginning at s i to itself and is thus uninformative.

Motif selection
A straightforward approach to identifying correlative motifs using SArKS would then be to set a score threshold θ and take motifs to be k-mers prefixing the suffixes starting at the positions s i in the concatenated string x. This is the essence of the method we propose, though with a couple of additional filters designed to pinpoint the optimal locations s i at which initiate motifs and, in Section S2.2, to remove likely false positive positions.
Defining the negative spatial shift operator η(i) which yields the unique suffix array index corresponding to the spatial position immediately prior to s i , so that s η(i) = s i − 1, as well as the positive shift operator ρ(i) similarly defined by the condition s ρ(i) = s i + 1, we start with a preliminary filtered suffix array index set I consisting of those i for which (1) the smoothed scoreŷ i ≥ θ and (2)ŷ i is not less than the smoothed scores of the spatial positions in x immediately adjacent to s i (i.e., s i must be the loci of a peak in plot ofŷ i versus spatial position s i ): from which we obtain the associated set M of k-mers beginning at the positions s i in x by where k i is the nearest integer tok i . Strategies for setting the filtering threshold θ based on the permutation testing method are described in Section S2.5. In the next section we recommend one additional filter be incorporated into the definition of the index set I, replacing Eq (10) by Eq (12).

Limit intra-sequence repeats
Short tandem repeats in the genome [Ellegren, 2004] can generate smoothing windows for the underlying repeated unit dominated by a relatively small number of distinct sequences, as is discussed in more detail in Section S2.2. As the scores for these motifs smooth over

Spatial smoothing
SArKS fewer distinct sequence scores, the associated smoothed scores are less precise than those for most potential motifs, leading to increased false positive rates. Section S2.2 introduces the Gini impurity score g i for suffix array index i measuring the "effective sequence count" contributing to the smoothing window centered at i, while Section S2.5 demonstrates that g i predicts the variance of the smoothed score for suffix array index i under the null hypothesis of independence between sequence and score. Thus we modify equation (10) to remove likely false positive i values characterized by low Gini impurities g i : screening out positions i for which the repeated occurrence of a few high-scoring words in the window centered at i leads toŷ i ≥ θ.

Spatial smoothing
Existing motif discovery approaches often take into account the tendency of regulatory motifs to cluster together [Wasserman and Sandelin, 2004]. Our algorithm builds on this observation, indentifying candidate regulatory regions through the application of a second round of kernel-smoothing over suffix positions s i within words: where we here use uniform kernels of the form L (λ) (generally with width λ = κ) to search for regions of length λ with elevated densities of high-scoring motifs. Note thatŷ s i defined by Eq (13) is indexed not by suffix array index i but by suffix array value s i giving the spatial position s i in the concatenated word x.
To use such spatial smoothing for motif selection/filtering, it is necessary to introduce a second threshold θ spatial , as the doubly-smoothed scoresŷ s i will generally be somewhat less dispersed than will be the singly-smoothedŷ i . The threshold θ spatial can be used to define an index set I spatial similar to the manner in which I is defined by Eq (12), as is detailed in Section S2.4. Section S2.4 additionally defines the set J spatial of suffix array indices i corresponding to the starting positions of MMDs. It then details the procedure adopted by SArKS to merge spatially contiguous motifs colocalized within the same MMD, yielding the set I spatial of suffix array indices i and merged motif lengthsk s i required to obtain the merged motif set M spatial analogous to equation (11).

Permutation testing
The significance of the observed correlation between the occurrence of the motifs uncovered by SArKS and the sequence scores y b can be evaluated by examining results obtained when the sequences w b and the scores y b are independent of each other. To this end, the word scores y b are subjected to permutation π to define If the permutation π is randomly selected independently of both the sequences w b and the scores y b , any true relationships between sequences and scores will be disrupted. Section S2.5 and Section S2.6 develop the strategy used by SArKS to set thresholds θ (and/or θ spatial ) for each particular combination of parameters κ, λ, g min so as to control the overall false positive rate for the desired range of parameter sets.

Illustration using simulated data
To illustrate SArKS, we first applied it to a simple simulated toy data set in which 30 random sequences w b were generated with each letter w b [s] drawn independently from a Unif {A,C,G,T} distribution. We then embedded the k-mer motif CATACTGAGA (k = 10) in the last 10 sequences (i.e., those Scores were assigned to the sequences according to whether the motif had been embedded: The kernel half-width κ = 4 was used to obtain smoothing windows of approximately the same size as the number of motif-positive sequences, 2κ + 1 ≈ |{b | y b = 1}|. As this number will not be known in advance when applying SArKS to real data, we recommend testing a range of κ values as done in Section 3.2 below, but here it serves well for a simple demonstration. Fig 2 plotsŷ i as obtained from Eq (6) applying the method of Section 2.1. The highest peaks in the plot correspond to the positions of various substrings of the embedded motif, and correspond to the set M of k-mers defined by the x[s i , s i + k i ) column of Table 1.
Section S2.5 illustrates the utility of setting a minimum Gini impurity g min during motif selection: 190 out of 1000 random permutations generated at least one position i (π) for whicĥ y (π) i (π) = 1 ≥ θ (for this toy model θ was simply taken to have the maximum possible value of 1), but only 20 of these permutations yield any results if g min = 1−(1+γ)(1−median i g i ) = 0.8506 (following Eq (S5) with γ = 0.1) is applied. Based on these results, we can derive a 95% CI of (1.2%, 3.1%) for family-wise error rate (FWER).

DNA motifs associated with neuron subtype-specific expression
We also used SArKS to analyze RNA-seq gene expression data from distinct mouse neocortical neuron subtypes [Mo et al., 2015]. Here, pooled genetically defined cell classes were subjected to a variety of sequencing methods to investigate transcriptomic and epigenetic variation, including cell class-specific ATAC-seq to determine what genomic regions are accessible to transcriptional machinery. Combined with the RNA-seq differential expression results, this   Illustration of motif selection process from Section 2.1 applied to simulated data set using window half-width κ = 4. All positions for which the sequence smoothed scoreŷi ≥ θ = 1 are shown; the table is sorted in descending order on the estimated motif lengthki. Number of distinct transcripts remaining after sequential application of filters described in Section S2.7.

SArKS
information is useful for filtering the set of genes included in SArKS analysis, as discussed in Section S2.7.
Our goal was to identify potential regulatory motifs associated with transcripts enriched in parvalbumin (PV) GABAergic neurons. Because we focused on putative regulatory regions in the vicinity of actively used gene transcription start sites, we quantified expression at the transcript level, filtered transcripts, and determined differential expression as described in Section S2.7. Table 2 summarizes the results of the various transcript filters: 6,326 distinct transcripts, each representing a unique gene, were retained for analysis. For this data set, we conducted three separate SArKS analyses, two focusing upstream (5') of the TSSs for the transcripts of interest and the other downstream (3').
We examined upstream sequences w b for each of the 6,326 remaining transcript species from -3,000 base pairs to the TSS (the transcription start sites of 85.5% of mouse genes annotated in Ensembl GRCm38 [Aken et al., 2016] are separated by more than 3 kb; median separation is 23 kb; mean separation is 54 kb). We also performed separate analyses of sequences w b extending from the TSS to +1000 base pairs after the TSS.
We tested a range of half-window sizes κ ∈ {250, 500, 1000, 2500} with the maximum value of 2500 selected to produce a smoothed window of size 2κ + 1 = 5001 similar to the number of sequences analyzed (6326). For each half-window size κ, we applied two minimum Gini impurity values g min set according to Eq (S5) with first γ = 0.1 and then γ = 0.2. Also for each value of κ, we examined three separate values of the spatial window size λ ∈ {0, 10, 100}. These three values were selected to investigate the performance of SArKS using no spatial smoothing (λ = 0), using a window λ = 10 of the typical length scale of eukaryotic transcription factor binding sites [Stewart et al., 2012], and finally using a window λ = 100 to target the low end of the enhancer length distribution [Loots, 2008].

Benchmark comparisons to existing algorithms
In order to contextualize the performance of SArKS, we conducted a cross-validation benchmarking study comparing the results of SArKS to those produced by four existing methods. Two of these methods, FIRE [Elemento et al., 2007] and MOTIF REGRESSOR [Conlon et al., 2003], were selected for comparison as they provide alternative approaches to correlative motif discovery as we have defined it here. The remaining two algorithms, DREME  and HOMER [Heinz et al., 2010], are popular discriminative methods which we have run by discretizing our score data with promoter sequences b considered 'positive' sequences if the score y b ≥ 2, 'negative' otherwise.
The set of 6,326 transcripts analyzed from the Mo 2015 data set was split into 5 disjoint subsets V 1 , V 2 , . . . , V 5 (the name V f intended to suggest the f th validation set) of approximately equal size (|V 1 | = 1, 266, while |V f | = 1, 265 for f > 1). For each of the algorithms evaluated and for both of the promoter sequence ranges investigated, 5 separate motif identification analyses were conducted by successively including only the pairs ∈ V f }, thus holding out those genes b ∈ V f for validation. Existing algorithms were run at their default parameter settings (defined either within the source code or in associated documentation; exact specifications are given in Section S2.8), with the exception that MOTIF REGRESSOR was run searching only for motifs positively correlating with score to enable more direct comparison of its output with that of the other algorithms (none of which look for anticorrrelated motifs, though all could do so simply by reversing the sign of the score).
In order to assess the degree of similarity in the motif sets identified by these methods, we used tomtom ( [Gupta et al., 2007]) to compare the pooled motif sets identified by each algorithm. Fig 3 shows the overlap between motifs sets by algorithm using this method of motif comparison: SArKS identifies similar motifs for the majority of motifs identified by each of the other algorithms, as well as many additional ones.
Given our interest in correlative motif discovery, the Pearson correlation between the count of occurrences of a given motif in sequence w b with the score y b across the sequencescore pairs (w b , y b ) provides a natural metric for assessing motif discovery. estimated Pearson correlation coefficient values for each motif identified (by each algorithm) evaluated using the held-out validation set {(w b , y b ) | b ∈ V f } appropriate for the fold f in which the motif was discovered. The count for a given motif in sequence w b was assessed using fimo ( [Grant et al., 2011]) for DREME, HOMER, and MOTIF REGRESSOR-all of which represent motifs as position-weight matrices-and using a simple regular expression search for FIRE (which returns regular expression representations of motifs) and for SArKS k-mers. In all cases, motif counts were based on motif occurrences on either the forward or reverse strand. This is the default behavior for motif scanning with fimo; for regular expression searches, this was implemented by searching for occurrences of either a motif or its reverse complement in the forward strand. As Fig 3 and Fig 4 demonstrate, the number of motifs identified by different algorithms is highly variable: DREME, FIRE HOMER, MOTIF REGRESSOR < SArKS. The interpretation of the number of motifs is, however, complicated by two factors: (1) the occurrence rate of individual motifs in the relevant biological sequences (promoters, etc.) may differ substantially and (2) some motifs may be very similar in sequence and/or genomic location (e.g., which promoters contain them).
The first of these complications is illustrated in Fig 4 by facetting horizontally on motif occurrence rate (count per analyzed sequence): one visible trend here is that the HOMERidentified motifs tend to occur less frequently than do the MOTIF REGRESSOR motifs, indicating that HOMER tends to define motifs with more granularity than does MOTIF REGRESSOR. SArKS-identified motifs are spread across a wide range of per-sequence occurrence rates in this plot.  The second complication-the similarities among individual motifs identified by an algorithm-may be addressed by noting that correlative motif discovery can also be viewed as a form of feature extraction. In this vein, we can assess the performance of motif discovery algorithms by using the selected motifs as predictors to build a regression model for associated sequence score y b based on the motif counts in the sequence w b . Fig 5 plots the validation set-estimated Pearson correlation coefficients of the predictions made by building a linear ridge regression model (with the L2 regularization parameter selected by generalized cross-validation [Golub et al., 1979]) with the sequence scores for each cross-validation fold by algorithm and sequence range. This approach collapses the variation in quantity and quality of individual motifs down to variation of a single quantity-the regression model predictions-thereby facilitating a head-to-head comparison of motif discovery algorithms bypassing both of the complications discussed above. (Note that because of the similarity of some identified motifs, which manifests as collinearity of predictors in a regression context, regularization is a key component of such a regression modeling approach.) SArKS yields better results than the other algorithms for both the upstream and especially downstream sequence regions (Fig 5). Nonetheless, all five algorithms generally perform better when searching the downstream region. Using the downstream sequences as {w b }, the four existing algorithms yield similar results with perhaps a slight edge to HOMER. With the upstream sequences instead provided as {w b }, there is more separation between the existing algorithms, with the performance of the algorithms roughly in order of how many motifs they report (DREME, FIRE < HOMER < MOTIF REGRESSOR < SArKS). The superior performance of motif discovery algorithms on downstream sequences over upstream ones coincides with the existence downstream of a particular family of very similar motifs-Esrrb, Essra, Esrrg from the JASPAR database-which is identified by each of the algorithms in the some form. In particular, variations on this motif dominate the downstream SArKS motif results, as is discussed further in Section 3.2.2.
An alternative approach to assessing motif discovery efforts is to compare the motifs identified algorithmically to databases of known TF-binding motifs, such as JASPAR [Mathelier et al., 2015]. This is explored further in Section S3.2.1.

Analysis of SArKS results
We also investigated the results of SArKS applied directly to the full set of 6,326 genes remaining after application of the filters detailed in Table 2 (i.e., not restricted to a particular cross-validation training set). Just as we did for the cross-validation approach described in Section 3.2.1, we investigated all combinations (κ, λ) ∈ {250, 500, 1000, 2500} × {0, 10, 100} for the smoothing half-width κ and spatial length λ. We again computed two values of g min for each value of κ following Eq (S5) both with γ = 0.1 and γ = 0.2. The values of g min , along with the fraction of suffix array index values i for which g i ≥ g min , are listed in Table 3. The analyses performed using the stricter g min values obtained using γ = 0.1 yielded larger k-mer motif sets: 3,393 total distinct k-mers using γ = 0.1 versus only 1,232 γ = 0.2 for the upstream sequence set; 380 distinct k-mers using γ = 0.1 versus just 180 for the downstream sequence set. More than 98% of the k-mers discovered using γ = 0.2 were also identified using γ = 0.1 (for both sequence ranges: 1,208 of the 1,232 upstream; 179 of the 180 downstream). By demonstrating that more restrictive values of γ can yield larger motif sets that include almost all of the motifs obtained using more permissive γ values, these results highlight the importance of the Gini impurity filter in focusing SArKS on potential motifs that appear within sufficiently many distinct sequences w b to achieve reasonable statistical confidence.
We assessed the statistical significance of these SArKS results following the method of Section S2.6 with thresholds θ and θ spatial set by Eq (S24) and Eq (S25) using z = 4. Upstream sequence analysis considering 250 independent random permutations resulted in 12 (4.8%) for which any of the parameter sets (κ, g min , λ) yielded a nonempty set of identified motifs. These upstream sequence results correspond to a 95% family-wise error rate confidence interval (FWER CI) of (2.5%, 8.2%). For the downstream sequence analysis, 250 independent permutations yielded eight (3.2%) instances of nonempty motif sets, from which we estimate a 95% FWER CI of (1.4%, 6.2%).
The role of the parameter z in Eqs (S24)-(S25) in balancing FWER against sensitivity can be seen in the analyses presented here by considering the consequence of increasing z: at z = 5, only two of the 250 upstream permutations (0.8%) resulted in nonempty motif k-mer sets (95% FWER CI (0.097%, 2.9%)) while four of the 250 downstream permutations (1.6%) yielded any motif results (95% FWER CI (0.44%, 4.0%)). The cost of this decreased false positive rate to sensitivity is apparent in that only 49% of the upstream and 50% of the downstream motif k-mers identified using z = 4 are still discovered using z = 5. Here we are willing to accept the FWER values associated with z = 4 (point estimates 4.8% and 3.2% for upstream and downstream analyses, respectively) in order to maximize sensitivity by casting a reasonably wide net.
The highestŷ i obtained in any of the SArKS analyses conducted-detected in the downstream sequence analysis using κ = 250, λ = 0, and γ = 1.1 in the downstream region analysis-corresponded to the k-mer TGACCTTG. This k-mer is very similar to a number of JASPAR TF-binding motifs. The strongest matches are to the binding motifs of Esrrb (q = 0.00078), Esrra (q = 0.00078), and Esrrg (q = 0.00301). In benchmarking SArKS against four existing motif-discovery algorithms, we noted that all of these algorithms identified motifs similar to each of these JASPAR motifs (tomtom q ≤ 0.1). In the case of SArKS specifically, we find that in fact a large fraction of the motifs associated with identified peaks inŷ i exhibit significant similarity to one of the JASPAR motifs Esrrb, Esrra, or Esrrg, as is illustrated in Fig 6B. This particular set of motifs may also help to explain the overall stronger performance of all of the motif discovery algorithms using the downstream sequences relative to the upstream sequences in the regression comparison (Fig 5), as was briefly alluded to in Section 3.2.1.
Esrrb/Essra/Esrrg binding motifs were identified by SArKS analysis of the upstream sequences, but they did not account for either the highest scoresŷ i nor did they correspond to a large fraction of the overall k-mer motif sets discovered-see Fig 6A. For five of the 12 distinct combinations of smoothing half-window κ and spatial window λ investigated using SArKS, the k-mer CCACCTGC was identified at the positions s i with maximal values ofŷ i (the k-mers GCACACCTT, TGGAACTCACT, CCTGGAAC, and CAGCCTGG (identified using two distinct parameter combinations at the same suffix index i) were associated the highest y i values using the remaining seven parameter combinations). The octamer CCACCTGC contains the canonical core recognition E-box sequence CANNTG (more specifically, the E12-box variant CACCTG [Bouard et al., 2016]). Comparison of CCACCTGC with known motifs from the JASPAR database [Mathelier et al., 2015] using tomtom finds some evidence of similarity to TF-binding motifs for SNAI2, MAX, SCRT2, SCRT1, TCF3, MNT, Id2, MAX::MYC, TCF4, and FIGLA (q-values of 0.14 for each), though no similarities were significant at q ≤ 0.1.
As CCACCTGC was identified in analyses with spatial window length λ ranging up to 100, we performed a multiple sequence alignment using muscle [Edgar, 2004] of the 100-mers x[s i − 50, s i + 50) for these positions s i (Fig 7A); three of the five 100-mers thus aligned were very similar (Levenshtein distance ≤ 7) to the 99-mer consensus sequence constructed.  Figure 6: Contributions of top motifs to peak composition. (A) Log-scaled histograms of peaks i ∈ I (or I spatial when spatial smoothing is employed) identified in upstream analysis for which corresponding k-mer motifs: (1) are prefixed with CACCTGC or CCACCTGC (indicated in red) or are suffixed by the reverse complement sequences GCAGGTG or GCAGGTGG (purple); (2) are otherwise spatially located within a blast hit to the B1 SINE sequence (gold); (3) exhibited significant tomtom similarity (q ≤ 0.1) to one of the JASPAR motifs Esrra, Esrrb, or Esrrg (blue); or (4) did not satisfy any of the above criteria (gray). Horizontal panels: half-window κ values used in analysis; vertical panels: spatial smoothing length λ. (B) Log-scaled histograms of peaks identified in downstream analysis; color coding is as in (A) except that black replaces gray. C?CACCTGC and its reverse complement do not occur in downstream peak set. Furthermore, the consensus sequence also contains CCTGGAAC and CCAGGCTG (reverse complement of CAGCCTGG).

DNA motifs associated with neuron subtype-specific expression
Interestingly, a screen of known repeated elements in the mouse genome for a consensus sequence uncovered a 93.9% identical base pair stretch of the B1 short interspersed element (SINE) sequence (SINEBase [Vassetzky and Kramerov, 2013]). The B1 SINE family consists of retrotransposon-derived sequences that appear throughout the mouse genome, especially upstream and within introns of genes implicated in DNA remodeling and expression regulation [Tsirigos and Rigoutsos, 2009]. Additional observations have been made further suggesting that SINE sequences function as transcriptional enhancers [Ichiyanagi, 2013, Elbarbary et al., 2016, Ge, 2017. Fig 6A indicates the number of SArKS-identified peaks that fall within blast hits between the upstream sequences w b and the B1 SINE consensus sequence (minimum 90% sequence identity and 70 base alignment length) in gold, red, or purple. Those within-B1 peaks which corresponded to the flagged k-mers CCACCTGC or CACCTGC are coded as red, while those corresponding the reverse complements GCAGGTGG or GCAGGTG are coded purple; all remaining within-B1 peaks are indicated by gold. Fig 7B provides a more detailed view of these peak counts by splitting them out by position to which the corresponding k-mers align to the B1 consensus. Whether or not each base of each k-mer is a match or mismatch to the B1 consensus is also indicated in Fig 7B; it is apparent that, while essentially the entirety of the B1 consensus is represented by identified k-mer motifs, there is more variation away from the consensus towards the left end and at a couple of isolated positions further in than along most of the length of B1.
The k-mer CCACCTGC itself is not quite a perfect match to the annotated B1, containing a single base substitution away from the octamer CCGCCTGC whose reverse complement GCAGGCGG is found at positions 49-56 of the SINEBase B1 sequence. This substitution is responsible for the peak at position 54 in the mismatch counts in Fig 7B-one of the few positions at which there are more mismatches than matches. It is also worth noting that this G to A substitution creates the above noted E-box sequence CANNTG, and that the unmodified octamer CCGCCTGC does not match any JASPAR motifs at q ≤ 0.5, unlike the modified sequence.
One of the remaining top upstream k-mer motifs mentioned above, GCACACCTT, similarly matches the nonamer GCACGCCTT spanning positions 15-23 of the SINEBase B1 sequence, but with a single G to A substitution. The modified nonamer GCACACCTT identified by SArKS shows significant similarity (tied q values of 0.038) to the JASPAR motifs TBX21, EOMES, TBX15, TBX1, and TBX2, while the unmodified B1 nonamer GCACGCCTT again shows no similarity to any JASPAR motifs at q ≤ 0.5. The last of the top upstream k-mer motifs, TGGAACTCACT, is a perfect match to the reverse complement of positions 89-99 of the SINEBase B1 sequence.
Finally, in order to illustrate how SArKS can be used to help select candidate regulatory regions for promoting specific expression patterns, we again constructed a ridge regression model based on the counts of SArKS-identified k-mer motifs. We applied the same modeling strategy as described in Section 3.2.1, again using generalized cross-validation [Golub et al., 1979] to determine the L2 regularization parameter, to the promoter regions defined by the 3,000 base pairs immediately preceding the TSSs of each of the 6,326 distinct analyzed transcripts. Each distinct transcript was then assigned a score by resubstitution into the resulting regression fit. Table 4 shows a sequence of filters in which these regression scores **!*****!!*!*!!* * !**!!*!!****!!***! ************!!*!!!!! ***!!*!!!*************!**!!!*!*!***** *!!*  Figure 7: SArKS-discovered motifs within B1 SINE elements. (A) Multiple sequence alignment (muscle) of 100-mers surrounding top CCACCTGC motif peaks with reverse-complement of B1 consensus sequence. Associated genes are indicated to the left. Gray highlighting: ≥ 50% agreement in the multiple sequence alignment. (B) Number of upstream motif k-mer peaks in B1 regions that align to each position within the B1 sequence. Gray bars: number of peak k-mers derived from upstream sequence regions for which a blast hit (percent identity ≥ 90%, alignment length ≥ 70) to B1 was found and for which an alignment of the k-mer to B1 aligned a matching base at the position in question. Red bars: number of k-mers within B1 blast hits which align aginst B1 with a mismatched base at the position in question. Above each bar is a label indicating the B1 consensus base at that position. Note that the lack of a gray bar at position 89 results from the lack of consensus base for B1 at this position (marked by N above the red bar), so that all k-mers that align against this position must produce a mismatch. The consensus base labels are drawn darker and the bars are marked with an asterisk at positions (19 and 54) where two of the top SArKS peaks exhibit changes compared to the B1 consensus sequence. Table 4: SArKS-based regression modeling assists in selecting candidate upstream regions for promoting PV-specific expression.

Distinct Transcripts Count
All analyzed 6,326 + Protein coding 5,017 + High expression in PV 1,595 + Low expression in non-PV 196 + PV : non-PV log-ratio ≥ 1 92 + Top 5% SArKS regression score 13 + Top 5% t-statistic score y b 11 Number of distinct transcripts remaining after sequential application of described filters. Annotation of transcripts as protein coding or otherwise taken from Ensembl GRCm38 [Aken et al., 2016]. Expression levels were considered high in PV samples if the average within-PV value of log 2 (TPM + 1) ≥ log 2 (10 + 1), while expression levels were considered low in non-PV samples if the average non-PV log 2 (TPM + 1) < log 2 (10 + 1).
Log-ratios were calculated as the difference of the PV-averaged-and non-PV-averaged-log 2 (TPM + 1) values, so that a log-ratio of one represents at least a two-fold increase in expression levels. SArKS regression scores were calculated using a ridge regression model built using counts of all k-mer motifs identified by SArKS applied to 3kb upstream promoter regions.

Conclusions
We introduce SArKS as a method for de novo correlative motif discovery and present examples of its application. This method fully exploits data sets produced by modern quantitative techniques, such as RNA-seq, by avoiding the dichotomization-and consequent loss of information [Fedorov et al., 2009]-of sequence scores into discrete groups as required by discriminative motif discovery algorithms. SArKS has also been designed to minimize the reliance on specification of specific background sequence models, instead using nonparametric permutation methods [Ernst, 2004] to set significance thresholds for motif identification and estimate false positive rates. SArKS can also smooth over spatial location of motifs to identify multimotif domains (MMDs), which can then provide top-down feedback to enhance understanding of the motifs they contain. We have benchmarked SArKS against several existing discriminative and correlative motif discovery algorithms using a previously published RNA-seq data set: SArKS identifies particularly large bodies of motifs in these comparisons and SArKS-identified motif sets provide for more predictive feature sets in a regression modeling approach evaluated by cross-validation than do the motif sets generated by existing algorithms. Finally, by more closely examining the SArKS analysis results applied to this data set, we show that among the MMDs identified are many variations of the B1 short interspersed element (SINE) sequence in mouse, which has been previously implicated in gene regulation [Ichiyanagi, 2013, Elbarbary et al., 2016, Ge, 2017.

S2.2 Limit intra-sequence repeats supplementary
One complicating factor in the strategy described in Section 2.1 is the presence of tandem repeats (common in eukaryotic DNA [Ellegren, 2004]): if the substring x[s i , s i + rm) (assumed to derive wholly from the single word w bi ) consists of r 1 repeats of the same m-mer, then it is likely that the sorted suffix array index positions j and k implicitly defined by s j = s i + am and s k = s i + bm for small a, b ≥ 0 will be close by, since, assuming without loss of generality that showing that the suffixes of x beginning at positions (s i + am) and (s i + bm) agree on their first (r − b)m characters. Since all of the positions s i + am for small a must come from the same word block b i they must have the same associated score y bi . If this score y bi is particularly high, this phenomenon may lead to windows of highŷ j values centered on j satisfying s j = s i + am which result from a very small number of different repeat-containing words (perhaps as few as one if the number of repeats is high enough within a single high-scoring word). We thus here develop a natural method for filtering the peak index set I to selectively remove suffix array index values i for which the smoothing window is dominated by a few heavily repeated words w b . The distribution of weighted word frequencies contributing to the window centered at position i of the suffix array table across the full word set W may for these purposes be summarized by the associated Gini impurity (often used in fitting classification and regression trees [Breiman et al., 1984]): which provides a measure ranging from 0 to 2κ 2κ+1 of the degree of uniqueness of the words contributing to the calculation ofŷ i .
As a concrete example, if all of the weighted frequencies word frequencies f (i) b = 1 q are the same for a set of exactly q words w b appearing in the smoothing window centered on i, g i = 1 − 1 q . This suggests an intuitive interpretation of (1 − g i ) as the multiplicative inverse of the "effective word count" contributing to the smoothing window around i.
Section S2.5 further demonstrates that (1 − g i ) is also approximately proportional to the variation of the smoothed scoresŷ i that would be expected if there were no association between the sequences w b and the scores y b (see Eq (S21) below). This proportionality suggests a simple method for selection of a g min value at which most suffix array indices i will be retained while filtering out only those most likely to yield false positive results under permutation testing: As shown in Eq (S21), setting g min to satisfy Eq (S5) removes suffix indices i for which the variance of the permuted smoothed scores is greater than (1+γ) times the median value. Thus any value γ > 0 will retain the majority of positions i for further consideration. We have used γ = 0.1 or γ = 0.2 for all of the examples in the present work, retaining positions for which the permuted score variance is less than 110% or 120%, respectively, of the median.

SArKS
x = u 0 * C A TACTGAGA * u 1 * C ATACTG AGA * u 2 Figure S1: Example k-mers to be removed or extended. Two identified k-mers (u=CATACTGAGA on the left and v=ATACTG on the right) are indicated by the dark gray highlighting, with two additional separately identified k-mers that are part of u indicated within the nested black boxes. The two nested k-mers contained within the boxes inside of u will be removed from the discovered k-mer set by the method of Section S2.3.1, while k-mer v = ATACTG will be extended by the method of Section S2.3.2 to include the characters highlighted in light gray, replacing v with CATACTGAGA. The four k-mers indicated in this figure correspond to positions si ∈ {3959, 3960, 3961, 4232} from Section 3.1.
Requiring g i ≥ g min results in redefining the peak index set I to screening out positions i for which the repeated occurrence of a few high-scoring words in the window centered at i leads toŷ i ≥ θ.

S2.3 Removing and extending k-mers supplementary
The presence of a k-mer x[s i , s i + k) associated with a high smoothed scoreŷ i can also result in high smoothed scoresŷ j when s j = s i + m if the substring (k − m)-mers x[s i + m, s i + k) also also preferentially found in higher-scoring sequences, as pictured in Fig S1. The following two steps may be added to the algorithm described in Section 2.1 in order to reduce the reporting of such substrings when they are present only as part of the full k-mer:

S2.3.1 Removing nested k-mers
Cases in which both k-mer x[s i , s i + k) and its sub-(k − m 1 − m 2 )-mer x[s i + m 1 , s i + k − m 2 ) (with m 1 > 0, m 2 ≥ 0) are individually identified can be resolved to report only the longer k-mer by removing any index i ∈ I (defined by Eq (S6)) if there exists j ∈ I such that the k j -mer interval starting at s j includes all of the k i -mer interval starting at s i , thus retaining only: This can be done efficiently using an interval tree.

S2.3.2 Extending substring k-mers
Fig S1 depicts not only two cases of nested k-mers which may be removed from the reported motif set by the method of Section S2.3.1 (contained within black boxes) but also an additional shorter k-mer ATACTG, highlighted in dark gray, which is derived from a distinct occurrence of the same longer k-mer (CATACTGAGA in this case). Because this distinct occurrence of the longer k-mer was not itself initially identified, the method of Section S2.3.1 does not remove the shorter substring k-mer from the motif set. However, such substring k-mers may be extended to the longer k-mer occurrence by the following method: for each i ∈ I , define the duplet resolving any ties in the arg max in favor of maximal z 0 . Eq (S8) picks out the largest superinterval s i − z 0 , s i + k i + z 1 containing the interval s i , s i + k i such that the extended

S2.4 Spatial smoothing supplementary
SArKS k i + z 0 i + z 1 i -mer x s i − z 0 , s i + k i + z 1 is equal to one of the already identified k-mers x s j , s j + k j j ∈ I . (In the example of Fig S1, z 0 i = 1 and z 1 i = 3, corresponding to the light gray highlighted characters surrounding the substring k-mer). Then defines our motif set after removal of nested motifs.

S2.4 Spatial smoothing supplementary
SArKS indentifies candidate multi-motif domains (MMDs) through the application of a second round of kernel-smoothing over suffix positions s i within words: where we here use uniform kernels of the form (generally with width λ = κ) to search for regions of length λ with elevated densities of high-scoring motifs. Note thatŷ si defined by Eq (S10) is indexed not by suffix array index i but by suffix array value s i giving the spatial position s i in the concatenated word x.
To use such spatial smoothing for motif selection/filtering, it is necessary to introduce a second threshold θ spatial , as the doubly-smoothed scoresŷ si will generally be somewhat less dispersed than will be the singly-smoothedŷ i . The threshold θ spatial can be used to define an index set I spatial similar to the manner in which I is defined by Eq (S6), but the task is more complex when we replace the single spatial position s i by a spatial window [s i , s i + λ).
Recalling the definitions of the negative/positive spatial shift operators η(i)/ρ(i) which yield the unique suffix array indexes corresponding to the spatial position immediately before/after s i , so that s η(i) = s i − 1 and s ρ(i) = s i + 1, first define: J spatial represents the set of suffix array indices i corresponding to the left endpoints s i of spatial windows [s i , s i + λ) passing the filters for score threshold θ spatial and minimum Gini impurity g min , and for which the sequence-smoothed scoreŷ i is at least as high as the spatially adjacent scoresŷ η(i) to the left andŷ ρ(i) to the right. Defining the left-directed distance δ j from the suffix with sorted suffix array index j to the set J spatial by δ j = min i∈J spatial we define in turn the set of sorted suffix array indices I spatial marking the starting positions of selected k-mers by: Eq (S14) identifies suffix array indices i: (1) whose spatial positions s i fall within a spatial window [s j , s j + λ) for some j ∈ J spatial , (2) whose sequence-smoothed scoreŷ i ≥ θ spatial , and (3) for which the position s i − 1 spatially to the left is either (3A) not in one of the spatial windows specified by J spatial or (3B) has associated sequence-smoothed scoreŷ η(i) < θ spatial . This final criterion is

S2.5 Permutation testing supplementary
SArKS included because we want to merge adjacent k-mers whose leftmost positions fall within the same selected spatial window. This merging process is implemented by calculating for each index i ∈ I spatial the lengtĥ of the merged k-mer starting at i. Eq (S15) setsk si by selecting the right endpointk j + s j of the k-mer beginning at s j to maximize the merged lengthk j + s j − s i over all choices of s j for which every position s m between s i and s j has an acceptable sequence-smoothed scoreŷ m ≥ θ spatial . It is then straightforward to obtain the motif set

S2.5 Permutation testing supplementary
The significance of the observed correlation between the occurrence of the motifs uncovered by SArKS and the sequence scores y b can be evaluated by examining results obtained when the sequences w b and the scores y b are independent of each other. To this end, the word scores y b are subjected to permutation π to define y If the permutation π is randomly selected independently of both the sequences w b and the scores y b , any true relationships between sequences and scores will be disrupted. This suggests a simple method for assessing the significance of motifs discovered using a given set of parameters (kernel half-width κ, θ, g min , etc.): generate R random permutations π r and for each permutation calculate scoresŷ (πr) i using Eq (6) (and alsoŷ (πr) si using Eq (S10) if spatial smoothing is employed) with y b replaced by y π b . In this manner one can estimate the distribution of scores under a null model in which there is no association between the sequences of the various words w b and the scores y π b .
This method of significance testing also provides the motivation for the form of Eq (S4) in Section S2.2. To demonstrate this, let Π be a random variable representing a random permutation and note that the random variables y Π(b) satisfy while, assuming that the number of words n = |W | is large enough that we may approximate Eq (S19) then tells us that where the Gini impurity g i is defined by Eq (S4). Thus smaller values of g i imply higher variance of the window-smoothed scores obtained under random permutation Π (with mean unchanged). This increased variance will lead to the requirement of larger cutoff values θ for reporting motifs discovered in the unpermuted data with a given degree of confidence unless positions i with g i < g min are filtered out as described in Section S2.2.

S2.6 Permutation testing for multiple parameter combinations supplementary
Multiple combinations κ (α) , λ (α) , g (α) min of the values of SArKS parameters may be explored (with α here indexing the set of desired combinations).
For any permutation π, letŷ (α,π) max is the highest filtered sequence-smoothed score obtained after permuting by π, whilê y (α,π) max is similarly the highest filtered spatially-smoothed score). Then we suggest a simple method for setting thresholds θ (α) and θ (α) spatial based on a set of randomly generated permutations {π r }: with higher values of z trading reduced sensitivity for lower false positive rates (in the examples analyzed in Section 3.2 we take z = 4). For the sake of simplicity we have generally used only one of these two thresholds for any particular combination of parameters α, setting θ , if spatial smoothing is not employed). In order to characterize the false positive rate associated with the entire set of analyses across all of the parameter settings employed while controlling for multiple hypothesis testing, a family-wise error rate (FWER) resulting from these thresholds can then be estimated by generating an independent set of R permutations {π r } and counting the number of permutations π r for which a nonempty set of k-mer motifs is identified using any of the parameter sets κ (α) , λ (α) , g (α) min . That is, writing (where I (α,π r ) and I (α,π r ) spatial are defined respectively by Eq (S6) and Eq (S14) using the thresholds θ (α) and θ (α) spatial determined using the original permutation set {π r }) we can infer confidence intervals by noting that the random variable E instantiated in e satisfies E ∼ Binom(R , ) under the permutation test null hypothesis. We can thus derive confidence intervals (CIs) for the FWER (in the weak sense, as the permutation test represents a complete null hypothesis with no true positives [Farcomeni, 2008]) by applying the Clopper-Pearson method for estimation of binomial CIs.
Because the position of the first exon can help pinpoint the TSS-and hence the DNA region containing the putative promoter-we reanalyzed the GSE63137 RNA-seq data using kallisto [Bray et al., 2015] to quantify and normalize transcript level expression against Ensembl mouse cDNA reference GRCm38. Transcript species were filtered by mean expression to focus on those for S2.7 RNA-seq expression analysis supplementary SArKS which reliable expression estimates could be made, retaining only transcripts for which at least 100 pseudocounts were obtained when summed across all samples and whose mean normalized expression met or exceeded the median of the transcript mean normalized expression levels. We also filtered out those transcripts which showed low variance across the full sample set, retaining only those for which the variance of normalized expression values met or exceeded the median such value across all transcript species [Bourgon et al., 2010]. In order to simplify downstream analysis, only the isoform with highest mean expression level across all samples was retained for each detected gene. Finally, as previously analyzed epigenetic information on chromatin accessibility was available from the same study [Mo et al., 2015], only transcripts for which the transcription start sites were located within ATAC-seq peaks (i.e., were accessible) for all examined neuron classes were retained for analysis. Imposing this condition minimizes the likelihood that epigenetic factors, rather than regulatory sequence characteristics, underlie the variations in gene expression between cell classes.
Differential gene expression was then assessed on normalized expression values via standard Student's t-test comparing data for PV neurons to data for excitatory and VIP neurons, with the resulting t-statistic providing a rough estimate of the gene's enrichment in PV neurons (score y b for transcript b). One potential issue with the use of such t-statistics with small sample numbers-here, two samples associated with each neuron type-is that especially low within-group standard deviation estimates can result in very large magnitude t-statistics for a few genes. For example, the average estimated within-group standard deviation of the 76 genes (out of 6,326 surviving the filters described above; see Table 2) for which |t b | > 10 (with |t b | ranging up to a maximum value of 49.6) was less than 30% of the average within-group standard deviation of the full set of 6,326 analyzed genes. Every one of the 76 genes with such high magnitude t-statistics had a within-group standard deviation estimate below the median value for the full gene set.
This phenomenon has previously led to the application of empirical Bayes methods [Smyth, 2004] using moderated t-statistics in place of standard t-statistics for calculating differential expression p-values. As we are here instead interested in using the t-statistics to derive word scores y b , for which no particular distributional assumptions are required, we have adopted a simpler approach to prevent the few very large magnitude t-statistics from unduly influencing motif discovery by applying a ceiling of 10 on the magnitude of y b :

S3.1 Illustration using simulated data supplementary
To demonstrate that these results are not a quirk of a single simulation, we repeated the process of generating 30 random sequences, embedding the motif CATACTGAGA into the last 10 of them, and then applying suffix array kernel smoothing to the sequence scores 1000 times. In 971 iterations, the maximum valueŷ calculated using the unpermuted sequence scores exceeded the maximum valuê obtained using one set of randomly permuted sequence scores per iteration. The full distribution of the differencesŷ max −ŷ (π) max is shown in the motif (+) column of Table S1. We also examined the results of SArKS applied to simulated data in which no motif was present to find; for this purpose, we repeated an amended version of the simulation process with the sole modification of omitting the motif embeddings again 1000 times. The distribution ofŷ max −ŷ (π) max for these no-motif simulations is presented in the motif (-) column of Table S1. In this case,ŷ max exceededŷ (π) max in only 239 of the simulations, whileŷ (π) max exceededŷ max in 282 simulations, with equality between the two holding in the remaining 479 iterations. The symmetry of the distribution ofŷ max −ŷ (π) max around 0 in the motif (-) case is to be expected since the scores y b are independent of the sequences w b whether permuted or not if no motifs are included. By contrast, the strong asymmetry of the distribution ofŷ max −ŷ (π) max when the motif is present demonstrates the ability of the permutation approach to differentiate a true signal from background noise.
S3.2 DNA motifs associated with neuron subtype-specific expression supplementary

S3.2.1 Benchmark comparisons to existing algorithms
An alternative approach to assessing motif discovery efforts is to compare the motifs identified algorithmically to databases of known TF-binding motifs, such as JASPAR [Mathelier et al., 2015].
In the context of our analysis of the Mo 2015 RNA-seq data, we can also check whether or not mRNA encoding TFs whose binding sites are similar to discovered motifs are present among the parvalbumin neuron transcripts. For this purpose, we employed a filter similar to that used in Section S2.7, requiring there to be at least one distinct mRNA transcript for the TF gene with at least 100 pseudocounts summed across the six neuron samples and for which the mean TPM-normalized estimated expression level in the parvalbumin samples ≥ the median value of such means for all Ensembl-annotated transcripts. Fig S2A shows the fraction of JASPAR TF-binding sites for which similar motifs were discovered by each algorithm for JASPAR TFs that passed the RNA-seq expression filter on the vertical axis. The horizontal axis of Fig S2A indicates instead the fraction of JASPAR TFs with Mo 2015 mRNA levels below the filter requirements (motif similarity was determined using tomtom [Gupta et al., 2007] with a q-value cutoff of 0.1). Fig S2A is similar to a receiver-operating characteristic plot in which the motif discovery algorithms are regarded as classifying JASPAR motifs as positive when they show sufficient similarity to any of the discovered motifs; the distance of a point above the diagonal thus indicates the degree to which an algorithm preferentially identifies binding motifs for TFs showing high RNA-seq expression levels. SArKS identifies motifs similar to a larger fraction of JASPAR than do the other algorithms while maintaining a preference for motifs for highly expressed TFs. Fig S2B shows the degrees of overlap in the sets of JASPAR motifs with similarities among the motifs identified by the motif discovery algorithms. HOMER identified motifs similar to the bulk-49 out of 58 total-of the JASPAR motifs for which SArKS did not find any similar pattern but at least one of the remaining algorithms did. Of these 49 JASPAR motifs, 48 were not identified as similar to any motifs discovered by any of the algorithms other than HOMER.

S4.1 Gapped motif detection
While lexical sorting of suffixes assembles occurrences of the same k-mer into a block of adjacent index positions i, gapped motifs such as u = u 0 * u gap * u 1 (S30) in which there is significant variability in the characters appearing within the internal substring u gap will be scattered into distinct subblocks dispersed within the larger superblock corresponding to their common prefix u 0 . This dispersion can dilute the apparent correlationŷ i between motif and score by mixing non-matching suffixes in with those corresponding to u within the range of the smoothing kernel.
While the technique described in Section S2.4 ameliorates this problem, it does not specifically focus on the important situation where a head motif u 0 is always followed by the same tail motif u 1 after the variable region u gap . Such gapped motifs might be discovered using SArKS by first applying a relatively relaxed threshold θ (which may on its own admit many false positives) and then examining the tail sequences u gap * u 1 * · · · following it for evidence of an enriched sequence u 1 , removing candidate head sequences for which no such corresponding tails can be found. In this way, the ability of SArKS to detect motifs with particularly variable internal positions may be improved.

S4.2 Synergistic motif detection
Because the regulation of eukaryotic gene expression may involve interactions among multiple motifs [Wunderlich and Mirny, 2009], it is of interest to discover motifs that work together synergistically to promote cell-type-specific gene expression. To achieve this objective, we can extend the kernel smoothing methods introduced here by, for example, utilizing multiple index kernel regression models such asŷ where the 4-index kernel K ijkl might be chosen to satisfy constraints along the lines of That is, i and k must correspond to suffixes with sufficiently similar prefixes (as must j and l), while i and j must come from the same word (as must k and l). Fig S3 depicts four suffixes with indices i, j, k, l which would contribute to the smoothed scoreŷ ij . Intuitively, if the beginning of the suffixes starting at s i and s j were two motifs which work together synergistically-and thereby lead to elevated scores y bi for the sequence b i = b j which they are found in-we might expect to see elevated scores y b k for sequences b k = b l containing the similar motifs prefixing the suffixes starting at s k and s l .

S4.3 Improved selection of input sequences
While we have investigated sequences near the TSS, it is well known that regulatory elements may lie quite far from the genes they affect [Sanyal et al., 2012]. It would be advantageous to develop more sophisticated approaches for (1) the identification of genomic regions most likely to contain regulatory elements and (2) the linking of potential regulatory element-dense-regions to genes whose expression they govern. Such approaches would likely benefit from consideration of information on

S4.4 Other applications of SArKS
SArKS · · · x si x si+1 x si+2 x si+3 · · · x sj x sj +1 x sj +2 · · · · · · x s k x s k +1 x s k +2 x s k +3 · · · x s l x s l +1 x s l +2 · · · w bi = w bj : w b k = w b l : Figure S3: Synergistic motif visualization. Visualization of a quartet (i, j, k, l) satisfying constraints (S32): the suffix x[si, |x|) originating from sequence w b i shares a common prefix, or motif, with the suffix x[s k , |x|) originating from sequence w b k ; a second pair of suffixes x[sj, |x|) and x[s l , |x|) sharing a (different) motif in common are also found in the sequences w b j = w b i and w b l = w b k , respectively. Each box contains a single character (or base, for DNA sequences), with motifs indicated by connecting two series of boxes composing matching substrings in different w b i . evolutionary conservation [Xie et al., 2005] and epigenetic modification [Long et al., 2013] near genes of interest.

S4.4 Other applications of SArKS
While we have tested SArKS as a method for identifying candidate cell-type specific regulatory motifs, it could also be applied to sequence motifs associated with state dependent changes in activated neurons of a single class as well as to differential gene expression in cancer and in specimens that have been exposed to varying physical or chemical stimuli. We also anticipate uses far afield from analysis of biological sequences, including motif discovery in time series data [Fu, 2011], or, by considering node or edge sequences produced by random walks, analysis of complex network structure [Masoudi-Nejad et al., 2012].