atSNP Search Help

Motif libraries provide nucleotide patterns that transcription factors recognize well from a set of transcription factor binding sites. The binding profiles are represented with position weight matrices (PWMs). PWMs are used for probabilistic description of the patterns in genomic sequences. The PWM is visualized as a sequence logo, where letter heights are scaled based on their information content.

  • JASPAR 2014 Core Vertebrata
  • JASPAR is the largest open-access database of matrix-based nucleotide profiles describing the binding preferences of transcription factors from multiple species (Mathelier et al., 2013). JASPAR 2014 Core Vertebrata provides position frequency matrices for 205 non-redundant profiles and their summaries. The position frequency matrix is converted into the position weight matrix, which carries the same information.

    Example

    The JASPAR library contains a position frequency matrix for a transcription factor RUNX1 and its sequence logo as below. Its summary is presented here.

    Figure 1 Sequence logo of MA0002.2 motif for transcription factor RUNX1
    A
    C
    G
    T
    287 234 123 57 0 87 0 17 10 131 500
    49648510720 75 1270 42 400463158
    696 467149 7 1872 70 19871848 251 81 289
    521 814 656 1936 53 171613 93 133913251053
    PFM of MA0002.2 motif for transcription factor RUNX1

  • ENCODE Library
  • The ENCODE library features systematic discovery and characterization of regulatory motifs in transcription factor binding experiments (Kheradpour et al., 2014). These PWMs are available here. This collection contains known motifs from the literature (TRANSFAC, JASPAR Homo Sapiens, and UniPROBE), in addition to the de novo motifs from hundreds of transcription factor binding experiments performed by ENCODE. These motifs were discovered using established tools such as MDscan , MEME, Trawler, and AlignACE . All known and discovered motifs are grouped as in the table here. For known ENCODE motifs, the naming convention indicates the source which initially reported the motif. For de novo motifs, the names contain information about the experiments that were used to discover the motif. This information includes the name of the lab and the names of any motif discovery tools used.

    Example

    The ENCODE library provides 22 known motifs and 6 discovered motifs for transcription factor GATA, which are available here. Here, we present one known GATA motif (GATA_known12) and one discovered GATA motif (GATA_disc1). GATA_known12 is from JASPAR Homo Sapiens and GATA_disc1 is obtained from ENCODE Snyder Lab GATA1 ChiP-seq TF experiments with the MDscan tool.

    Figure 2 Sequence logo of GATA_known12 motif
    >GATA_known12 GATA2_2 GATA2_jaspar_MA0036.1
    A
    C
    G
    T
    0.2450.000 0.981 0.0000.472
    0.2450.094 0.000 0.0000.132
    0.3400.906 0.019 0.0000.283
    0.1700.000 0.000 1.0000.113
    PWM of GATA_known12 motif
    Figure 3 Sequence logo of GATA_disc1 motif
    >GATA_disc1 GATA1_K562b_encode-Snyder_seq_hsa_r1:MDscan#1#Intergenic
    A
    C
    G
    T
    0.3230.1380.6670.0001.000 0.0000.8780.9740.0000.243
    0.1850.4760.1010.0000.000 0.0000.0000.0260.0530.249
    0.3600.3230.0001.0000.000 0.0000.0000.0000.9470.508
    0.1320.0630.2330.0000.000 1.0000.1220.0000.0000.000
    PWM of GATA_disc1 motif

    References
    • Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen, C.-y., Chou, A., Ienasescu, H., Lim, J., Shyr, C., Tan, G., Zhou, M., Lenhard, B., Sandelin, A. and Wasserman, W. W. “JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles.” Nucleic Acids Research, 2013
    • Kheradpour, Pouya, and Manolis Kellis. "Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments." Nucleic Acids Research 42.5 (2014): 2976-2987.
    • Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. (2003) TRANSFAC(R): transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
    • Sandelin,A., Alkema,W., Engstrm,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91–D94.
    • Berger,M.F., Philippakis,A.A., Qureshi,A.M., He,F.S., Estep,P.W. and Bulyk,M.L. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol., 24, 1429–1435.
    • Badis,G., Berger,M.F., Philippakis,A.A., Talukder,S., Gehrke,A.R., Jaeger,S.A., Chan,E.T., Metzler,G., Vedenko,A., Chen,X. et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science, 324, 1720–1723.
    • Berger,M.F., Badis,G., Gehrke,A.R., Talukder,S., Philippakis,A.A., Pea-Castillo,L., Alleyne,T.M., Mnaimneh,S., Botvinnik,O.B., Chan,E.T. et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133, 1266–1276.
    • Jolma,A., Yan,J., Whitington,T., Toivonen,J., Nitta,K.R., Rastas,P., Morgunova,E., Enge,M., Taipale,M., Wei,G. et al. (2013) DNA-binding specificities of human transcription factors. Cell, 152, 327–339.
    • Hughes,J.D., Estep,P.W., Tavazoie,S. and Church,G.M. (2000) Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
    • Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatinimmunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839.
    • Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Int. Syst. Mol. Biol., 2, 28–36.
    • Pavesi,G., Mauri,G. and Pesole,G. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, S207–S214.
    • Ettwiller,L., Paten,B., Ramialison,M., Birney,E. and Wittbrodt,J. (2007) Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nat. Methods, 4, 563–565.

OVERVIEW AND PROCEDURE

OVERVIEW
Figure 4 atSNP flowchart

atSNP takes as input SNP information/genomic subsequences and a PWM for a transcription factor. atSNP uses a first order Markov model and an importance sampling technique to test the statistical significance of the best match between the PWM and the subsequences overlapping the SNP position with both the reference and the SNP alleles. More imporantly, it further tests the impact of SNPs on the matches to the PWM. The testing results are summarized with a table and a graphical output named composite logo plot. To learn more about the outputs, visit "Interpreting search output” and “Reading composite logo plot”.

Table 1 Comparison of atSNP with existing methods detecting impact of SNPs on transcription factor binding

Table 1 provides a comparison of functionality between atSNP's statistical method and alternative methods. atSNP provides the most comprehensive information on identifying the impact of SNPs on transcription factor binding. Only atSNP and is-rSNP provide statistical quantification of the impact of SNPs on the binding. Furthermore, atSNP addresses the scalability issue with an efficient computation enabled by an importance sampling technique. atSNP visualizes the best matches with both the reference and the SNP alleles and the change in those matches created by the SNP. atSNP Search results are obtained from the R package atSNP, that implements this procedure.

PROCEDURE
  • Step 1: affinity score
  • Using a PWM as a set of parameters for a product multinomial model, atSNP computes log likelihoods for all subsequences overlapping the SNP location that have the same length as the motif. The best matches are those with the largest log likelihood with both the SNP and reference alleles. The scores of the best matches are used for statistical testing.

    Example

    We evaluate impact of a SNP, rs117959046, on the binding of the transcription factor GATA2, with GATA_known12 motif:

    Figure 5 Sequence logo of GATA_known12 motif
    A
    C
    G
    T
    0.2450.0000.9810.0000.472
    0.2450.0940.0000.0000.132
    0.3400.9060.0190.0000.283
    0.1700.0000.0001.0000.113

    The motif length is 5 and sequences of length 9 centered at the SNP position are extracted from both the reference allele and the SNP variation of the reference allele:

    Reference allele: AAGACAAGC
    SNP allele: AAGATAAGC

    Next, we compute log likelihoods of all subsequences of length 5 spanning the SNP position and find the best match, which is associated with the largest log likelihood. The second subsequence on the forward strand (AAGACACG) has the largest log likelihood with the reference allele. (Table 2). The second subsequence on the forward strand also has the largest log likelihood with the SNP allele (Table 3).

    Reference Allele Forward strand score Reverse strand score
    AAGACAAGC -53.45 -49.31
    AAGACAAGC -25.30 -50.10
    AAGACAAGC -70.91 -30.79
    AAGACAAGC -28.08 -72.66
    AAGACAAGC -49.50 -27.73
    Table 2 Log likelihoods of all subsequences with the reference allele
    SNP Allele Forward strand score Reverse strand score
    AAGATAAGC -53.60 -49.64
    AAGATAAGC -2.28 -73.03
    AAGATAAGC -70.91 -26.84
    AAGATAAGC -48.74 -72.66
    AAGATAAGC -49.87 -27.22
    Figure 6 Log likelihoods of all subsequences with the SNP allele

  • Step 2: p-value for the affinity score
  • The p-value for the affinity score with the reference allele (p-value Reference) is the probability that a random sequence has an affinity score larger than or equal to the best match score with the reference allele. atSNP simulates random sequences using a first order Markov model. It estimates the p-value as the frequency of randomly getting as large or larger affinity scores than the best match score with the reference allele. Similarly, the p-value for the affinity score with the SNP allele (p-value SNP) is estimated as the frequency of getting random scores as large or larger than the best match score with the SNP allele.

    Example (Continued)

    (Continued From Step 1) The log likelihood score for the reference allele is -25.30, and the log likelihood score for the SNP allele is -2.28. The p-value for the reference allele is the probability that a random sequence has an affinity score larger than or equal to -25.30, which is our target region (Figure 7). For such a small p-value, naive Monte-Carlo generates a small number of scores in the target region (Figure 8), and thus requires a large number of simulations. Alternatively, in atSNP, the importance sampling generates many more scores in the target region (Figure 9). The p-value for the reference allele is then estimated by the weighted frequency as p-value Reference = 0.499. The importance sampling algorithm significantly reduces the required number of simulations. Similarly, the p-value for the SNP allele, p-value SNP, is estimated as 0.009.

    Figure 7 Distribution of scores
    from random motif sequences.
    Figure 8 Empirical distribution
    of simulated random scores with the naive
    Monte Carlo method.
    Figure 9 Empirical distribution of simulated
    random scores with importance sampling.

  • Step 3: p-value for the affinity change between alleles
  • atSNP tests the significance of log rank ratio: log(p-value Reference)-log(p-value SNP). atSNP computes the p-value (p-value SNP Impact) based on a Monte-Carlo simulation that uses both the reference and SNP alleles together. This is done using the importance sampling technique as described in Step 2.

    Example (Continued)

    The log rank ratio for the impact of SNP, rs117959046, is 4.02. With the importance sampling, the p-value for the log rank ratio is estimated as p-value rank = 0.005. In atSNP Search, this p-value is referred to as p-value SNP Impact.

    References
    • Zuo, Chandler, Shin, Sunyoung, and Keleş, Sündüz. atSNP: transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics (2015): 31, 3353-3355
    • Andersen, Malin C., Engstrom, Par G., Lithwick, Stuart, Arenillas, David, Eriksson, Per, Lenhard, Boris, Wasserman, Wyeth W., and Odeberga, Jacob. In silico detection of sequence variations modifying transcriptional regulation. PLoS Computational Biology 4.1 (2008): e5.
    • Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. FIMO: scanning for occurrences of a given motif. Bioinformatics 27.7 (2011): 1017-1018.
    • Macintyre, Geoff, Bailey, James, Haviv, Izhak, and Kowalczyk, Adam. is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics 26.18 (2010): i524-i530.
    • Riva, Alberto. Large-scale computational identification of regulatory SNPs with rSNP-MAPPER. BMC genomics 13.4 (2012): S7.
    • Thomas-Chollier, Morgane, Hufton, Andrew, Heinig, Matthias, O'keeffe, Sean, Masri, Nassim E., Roider, Helge G., Manke, Thomas, and Vingron, Martin. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature protocols 6.12 (2011): 1860-1869.

atSNP search output is grouped into four general categories: SNP information, motif information, atSNP scores, and atSNP p-values. Details for these categories are summarized in the following tables:

SNP information
RS ID (ReferenceSNP ID number) Identification tag assigned by NCBI to a group of SNPs that map to an identical location1 A link to the DBSNP page for any SNP shown in the data is available by clicking on the SNPid.
Coordinate Formatted as chromosome:position. One of the human genome (assembly GRch38/hg38) chromosomes, e.g. Chr1-Chr22, ChrX, ChrY, ChrM. The position is the 1-based location of the SNP relative to the first base at the 5’ end of the forward strand of the reference sequence 2. A link to data around the genomic coordinate in the UCSC genome browser is available by clicking on the coordinate.
Reference and SNP Alleles Formatted as Reference allele/SNP allele. e.g.: T/G.
Reference and SNP Strands The reference and SNP strand information is shown on the composite logo plot. It indicates whether the displayed sequences are in the forward or reverse direction.
  1. http://www.ncbi.nlm.nih.gov/books/NBK44417/#Content.what_is_a_reference_snp_or__rs_i
  2. https://bioconductor.org/packages/release/data/annotation/manuals/SNPlocs.Hsapiens.dbSNP144.GRCh38/man/SNPlocs.Hsapiens.dbSNP144.GRCh38.pdf
    1. Motif information
      Motif Motif name, e.g., MA0002.2 For JASPAR motifs, a link to the motif's entry in JASPAR database by clicking on the motif.
      Transcription Factor Transcription factor name, e.g., RUNX1 When available, a link to the FACTORBOOK page for the transcription factor is provided by clicking on the transcription factor.

      atSNP Log Likelihoods (evaluated wrt PWM)
      Rank Ratio (SNP Impact) Log of ratio of p-value Reference to p-value SNP. This is the main statistic that the atSNP p-values are based on.
      Reference The log likelihood score of the best match (highest scoring subsequence) with the reference allele.
      SNP The log likelihood score of the best match (highest scoring subsequence) with the SNP allele.
      Ratio The difference between Log Likelihood Reference and Log Likelihood SNP.
      Enhance Odds Once atSNP identifies the best matching subsequence to the PWM with the reference allele, it evaluates the likelihood of that subsequence with the reference allele and the SNP allele. This quantity is the log of the ratio of these two likelihoods.
      Reduce Odds Once atSNP identifies the best matching subsequence to the PWM with the SNP allele, it evaluates the likelihood of that subsequence with the reference allele and the SNP allele. This quantity is the log of the ratio of these two likelihoods.

      atSNP p-values
      p-value SNP Impact A significant p-value, e.g. 0.0001, statistically supports the potential gain or loss of function of the genomic region with the SNP in terms of transcription factor binding.
      p-value Reference p-value for score with the reference allele (Log Likelihood Reference). A significant p-value indicates that the match to the PWM with the reference allele is statistically supported.
      p-value SNP p-value for scores with the SNP allele (Log Likelihood SNP). A significant p-value indicates that the match to the PWM with the SNP allele is statistically supported.
      *p-value Difference p-value for the difference in scores with the reference and the SNP alleles (Log Likelihood Ratio). This is essentially the result from a likelihood ratio test. While we report this for completeness, the final atSNP results are based on p-value SNP impact which quantifies the significace of the Log Rank Ratio.
      *p-value Condition Ref Conditional p-value for scores on the reference allele based on Log Enhance Odds.
      *p-value Condition SNP Conditional p-value for scores on the SNP allele based on Log Reduce Odds.
      * While these p-values are reported for completeness, they are not used for querying.

Example

We evaluate the impact of the SNP rs117959046 on the binding of the transcription factor GATA2, using the GATA_known12 motif:

Figure 10 Sequence logo of GATA_known12 motif
A
C
G
T
0.2450.0000.9810.0000.472
0.2450.0940.0000.0000.132
0.3400.9060.0190.0000.283
0.1700.0000.0001.0000.113

SNP Information
rsID rs117959046
Coordinate ch11:134339671
Reference/SNP allele C/T
Motif Information
Motif GATA2_2
Transcription FactorGATA

atSNP Results
Log Likelihood p-value
SNP Impact (Rank Ratio)4.02 0.00484
Reference -25.3 0.499
SNP -2.27 0.00898
Ratio -23.0 0.0207
Enhance Odds 23.0 0.499
Reduce Odds -23.0 0.00184

Figure 11 Composite sequence logo for this result

The SNP “rs117959046” is a genetic mutation from C to T in the forward strand at location Chromosome 1: 134,339,671. This creates a highly likely binding site for the transcription factor GATA2 as supported by the p-value SNP Impact of 0.00484. Furthermore, the p-value for the best match between the GATA2 motif and the DNA sequence with the SNP allele is 0.00898, whereas the p-value for the best match between the motif and the DNA sequence with the reference allele is 0.499. These p-values indicate that this genomic location, spanning the SNP position, does not harbor a GATA2 binding site with the reference allele, but the SNP creates a likely GATA2 binding site. Overall, atSNP statistically supports that the SNP leads to generation of an in silico predicted GATA2 binding site.

atSNP provides a composite logo plot to visualize the impacts of SNPs on transcription factor binding. These plots reveal whether the SNP might lead to gain of function or loss of function.

Examples
Figure 12 GATA2 motif scan for rs117959046

The SNP, rs117959046, is highly likely to create a binding site for the transcription factor GATA2 on the genome. The composite logo plot illustrates this with the comparison between the best matches of GATA2_2 PWM to the DNA sequence overlapping the SNP position with the reference and SNP alleles. Among subsequences of length 5 overlapping the SNP position with the reference allele, “AGACA” on the forward strand has the largest log likelihood score. This subsequence is referred to as the “best match with the reference allele”. This subsequence is depicted on the second row with the corresponding PWM on the first row. Similarly, among subsequences of length 5 overlapping the SNP position with the SNP allele, “AGATA” on the forward strand is referred to as the “best match with the SNP allele”. It is depicted on the third row with the corresponding PWM on the fourth row.

Figure 13 RUNX1 motif scan for rs199894601

The SNP, rs199894601, is highly likely to create a binding site for the transcription factor RUNX1 on the genome. The best match position on the genome with the SNP allele is different from that with the reference allele. Specifically, the best matches occur on different strands and genomic coordinates (bp). The composite logo plot illustrates this with the comparison between the best matches of RUNX1 PWM from JASPAR to the DNA sequence overlapping the SNP location with the reference and SNP alleles. The beginning of the best match with the SNP allele is 3 bases before the SNP position in the 5' to 3' direction on the reverse strand. Among subsequences of length 11 overlapping the SNP position with the SNP allele, this subsequence has the largest log likelihood score. This subsequence is referred to as the “best match with the SNP allele” and is depicted on the third row with the corresponding PWM on the fourth row. Similarly, the “best match with the reference allele” is depicted on the second row with the corresponding PWM on the first row. The beginning of the best match with the reference allele is one base before the SNP position in the 5' to 3' direction on the forward strand.

We report response time for the five search types under various combinations of query parameters and significance levels. We performed two experimental runs under one combination of each search type at a time and recorded both response times. Overall, both query type and size of query results determine the response time.

SNPid List p-value
# of SNP IDs 0.05 0.01
15 13.7-19.6 sec 14.7-32.1 sec
50 9.2-40.1 sec 3.5-27.0 sec
100 29.7-32.5 sec 13.6-14.4 sec
500 API timed out API timed out
SNPid Window p-value
Window Size 0.05 0.01
100 2.7-9.6 sec 2.7-3.3 sec
1K 2.8-22.0 sec 3.3-5.2 se
10K 40.5-53.4 sec 12.3-24.1 sec
100K API timed out API timed out
Genomic Location p-value
Window Size 0.05 0.01
1K 3.9-12.5 sec 3.1-5.2 sec
10K 22.8-51.1 sec 7.9-32.3 sec
50K 39.4-55.9 sec 36.1-37.4 sec
100K API timed out API timed out
Gene p-value
Window Size 0.05 0.01
100 4.9-11.5 sec 2.2-2.7 sec
1K 9.5-29.3 sec 2.6-7.3 sec
5K 46.7-61.2 sec 24.5-33.2 sec
10K API timed out API timed out
Transcription Factor p-value
0.05 0.0005
JASPAR ZNF263 2.3-4.7 sec 2.5-2.6 sec
CTCF 3.8-5.1 sec 2.6-3.9 sec
ENCODE AFP 2.7-3.3 sec 1.7-5.4 sec
GATA 40.3-47.7 sec 27.9-43.6 sec
Swapping P-Value Cutoff Operators

The operator for the SNP and reference p-value cutoffs can be swapped between ≤ and >. To reverse the operator for one of these cutoffs, click on the button with the operator on it beside the corresponding p-value.

Adjusting the Sort Order

The sort control indicates the priority of sort fields and the direction of each sort field. If a field has top sort priority, then that field will be the one that results are sorted by. Any results which have the same value for the first sort field will be ordered by the values of the fields with the second sort priority and so on.
To reverse the sort order for one sort field from ascending to descending or vice-versa, click the icon by that search field. To advance or demote a sort field's priority, select it (by clicking on it) and click the up or down arrow.

Filtering by Motif Degeneracy

This control is based on computing information content (IC) of each individual position within a PWM, then taking the median of the information content across all positions. These median IC values are then percentiled (25%, 50%, and 75%) to classify each PWM into the following four classes.

Median Information Content Motif Degeneracy Label
[0, 0.70) Very High
[0.7, 1.1) High
[1.1, 1.52) Moderate
[1.52, 2] Low

Low information content indicates a very degenerate motif. While this does not have implications in terms of actual functionality, when possible, following up less degenerate motifs is more preferable.

Limitations on Numeric Queries

Queries are numerically limited in the following ways:

  • SNPid
    Up to 1,000 SNPids may be entered. Plain text files or direct input are both acceptable.
  • SNPid Window
    The window size is limited to 125 million bases.
  • Genomic Location
    The maximum region size is 250 million bases. The region size is calculated by subtracting the start location from the end location on the selected chromosome.
  • Gene
    The region around the gene is limited to 125 million bases.

atSNP Search utilizes a scale-out systems architecture to store and retrieve approximately thirty seven billion records. Elasticsearch, a NoSQL database, utilizes a distributed scale-out system design for large w orkloads. It accomplishes this task by distributing requests for searches amongst the scaled computing resources. A n API handles communication between the search page and the Elasticsearch data store.