The Transcription Element Listening System

 


© 2007 Steve Cole & the UC Regents

Background:  The TELiS database contains information on the prevalence of transcription factor binding motifs (TFBMs) in the promoters of all human, mouse, and rat genes. 

  • Promoters are strings of 300, 600, or 1200 nucleotides upstream of each gene's transcription start site (TSS) as indicated in the NCBI RefSeq database.  TELiS currently contains data on 34,622 human genes, 24,384 mouse genes, and 21,053 rat genes.

  • TFBMs are defined by 108 position-specific weight matrices from the JASPAR 2 database or 192 matrices representing all vertebrate transcription factors in the TRANSFAC database.  Binding motifs are detected by the MatInspector algorithm.

Use:  TELiS was originally developed to map transcription control networks.  In conjunction with the PromoterStats statistical tool, it can also identify transcription factors driving gene expression dynamics in microarray studies. 

  • TELiS Data Retrieval allows researchers to download raw data on TFBM prevalence for their own analyses.   Data are delivered as tab-delimited (.td) text files suitable for Excel and other spreadsheets.

 

Answers to some frequently asked questions:

 

Q:  What does the database contain?

A:  TELiS contains integer counts indicating how often each specific TFBM is detected in each promoter: a P x T matrix of P promoters scanned for T TFBMs.  TELiS is actually a family of such matrices, with each matrix containing data from one scan conducted at a specific stringency (MatInspector mat_sim values of .80, .90, or .95) over a specified promoter size (300 or 600 bases upstream of the TSS, or a region from -1000 to +200 bases).  These scans are conducted by the Java application PromoterScan using NCBI RefSeq nucleotide sequences obtained during the fourth quarter of 2003.  TFBM definitions come from anonymous FTP releases of TRANSFAC v3.2 and JASPAR 2.
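As a rough illustration of the "family of matrices" idea, the sketch below stores one count matrix per scan, keyed by promoter size and stringency.  The gene symbols, motif names, counts, and the tfbm_count helper are all invented for illustration; the actual TELiS database schema is not described here.

```python
# Hypothetical in-memory layout: one count matrix per (promoter size, stringency) scan.
# All gene symbols, motif names, and counts are invented for illustration.
from typing import Dict, List, Tuple

ScanKey = Tuple[int, float]          # (promoter size in bases, mat_sim stringency)
CountMatrix = Dict[str, List[int]]   # gene symbol -> counts, one entry per TFBM

tfbm_names = ["MOTIF_A", "MOTIF_B", "MOTIF_C"]   # T = 3 columns

scans: Dict[ScanKey, CountMatrix] = {
    (600, 0.90): {"GENE1": [2, 0, 1], "GENE2": [0, 0, 3]},   # P = 2 rows
    (300, 0.90): {"GENE1": [1, 0, 0], "GENE2": [0, 0, 2]},
}

def tfbm_count(size: int, stringency: float, gene: str, tfbm: str) -> int:
    """Return how often `tfbm` was detected in the promoter of `gene`."""
    matrix = scans[(size, stringency)]
    return matrix[gene.upper()][tfbm_names.index(tfbm)]

print(tfbm_count(600, 0.90, "gene2", "MOTIF_C"))
```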

 

Q:  How are genes with alternative transcription start sites (TSS) handled?

A:  Multiple results for a single gene are averaged and rounded to the nearest integer to produce a single record. 
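A minimal sketch of that averaging step is shown below.  The function name is hypothetical, and the handling of exact ties (such as 1.5) is an assumption, since the rounding rule for ties is not stated here.

```python
import math

def merge_tss_counts(counts_per_tss):
    """Average per-TSS motif counts and round to the nearest integer.

    Ties (x.5) are rounded up here; that tie rule is an assumption.
    """
    mean = sum(counts_per_tss) / len(counts_per_tss)
    return math.floor(mean + 0.5)

print(merge_tss_counts([2, 3, 5]))  # mean 3.33 rounds to 3
print(merge_tss_counts([1, 2]))     # mean 1.5 rounds to 2 under round-half-up
```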

 

Q:  How do I search for a gene?

A:  Genes are identified by HGNC Gene Symbols, with lists of Gene Symbols separated by tabs, spaces, or line-breaks (carriage returns). 

 

Q:  Why can't I find a gene in TELiS?

A:  This is usually due to a formatting problem or a failure to use HGNC Gene Symbols.  Capitalization is ignored, and names should be on separate lines or separated by spaces or tabs.  Do NOT include dividers such as slashes (/), commas (,), or colons (:).  Be sure to select the appropriate species for your gene names: do not submit mouse gene names while specifying a human microarray.  It is also possible that the NCBI RefSeq database did not contain your gene in Fall 2003, or that its Gene Symbol has changed status from "-pending".
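To check a gene list before submitting it, a small script along the following lines may help.  It is only a sketch of the formatting rules above, not the parser TELiS itself uses; the function name is hypothetical, and converting symbols to upper case is just one way of ignoring capitalization.

```python
FORBIDDEN = set("/,:")  # dividers that should not appear in a gene list

def parse_gene_list(text):
    """Split pasted text on spaces, tabs, or line breaks and flag bad tokens."""
    symbols, rejected = [], []
    for token in text.split():          # split() handles any run of whitespace
        if FORBIDDEN & set(token):
            rejected.append(token)      # e.g. "JUN,EGR1" or "IL6/TNF"
        else:
            symbols.append(token.upper())
    return symbols, rejected

symbols, rejected = parse_gene_list("il6\tTNF\nFos JUN,EGR1")
print(symbols)   # ['IL6', 'TNF', 'FOS']
print(rejected)  # ['JUN,EGR1']
```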

 

Q:  What is a "TELiS differential expression analysis?"

A:  Differential expression analysis seeks to identify the transcription factors driving observed changes in gene expression.  Given a set of differentially expressed genes (defined by microarray, SAGE, etc.), the PromoterStats statistical tool determines which TFBMs are over-represented in the promoters of those genes.  This provides inferences about which transcription factors are active.  If the upstream signaling pathways that control transcription factor activity are known, differential expression analyses can also be used to indirectly monitor signal transduction dynamics and extracellular stimuli.  See some examples.

To identify the effect of an experimental manipulation, consider a Transcriptional Shift Analysis.

 

Q:  What is a "transcriptional shift analysis?"

A:  A transcriptional shift analysis is a 2-Group Differential Expression Analysis, used to identify the effect of an experimental manipulation while controlling for background influences (e.g., cell type-specific biases in gene expression).  Comparison of differentially expressed genes with the entire genome (or all genes on a microarray) picks up both the effects of experimental differences and biases due to the specific cell type studied (e.g., transcription factors that determine cell fate/differentiation).  To isolate the effects of experimental conditions, compare a list of genes up-regulated in one cell type with a list of genes down-regulated in the same cell type.  This holds the cell type constant and focuses the statistical analysis on promoter characteristics that show a shifting prevalence as a function of the experimental manipulation.

2-group differential expression analyses are available for TRANSFAC and JASPAR databases.
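As a sketch of what a two-list comparison can look like, the function below runs a two-sample z-test on motif incidence (the fraction of promoters containing the motif at least once) in an up-regulated versus a down-regulated list.  The pooled standard error is one common convention and is an assumption here, as are the function name and the example counts.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sample z-test for a difference in TFBM incidence between two lists.

    hits_* = number of promoters containing the motif at least once.
    Returns (z, two-sided p-value).  Uses a pooled standard error (assumed),
    so it is undefined when the pooled incidence is exactly 0 or 1.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value

# Invented example: motif present in 45 of 100 up-regulated genes
# and in 25 of 100 down-regulated genes.
z, p = two_proportion_z(hits_a=45, n_a=100, hits_b=25, n_b=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```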

 

Q:  What stringency and promoter size should I use?

A:  Development studies show that default parameters (600 bases/.90 stringency) work well under a wide variety of circumstances.  The signal-to-noise ratio can be improved by reducing the promoter size to 300 bases or increasing scan stringency to .95.   Both decrease spurious background detections. 

HOWEVER, high stringency analyses may fail to detect TFBMs that are actually present.  If you increase scan stringency to .95, consider also increasing promoter size to 1200 to reduce the likelihood of null results. 

A good general strategy is to start with 600/.90, which provides a good balance between sensitivity and specificity.  To increase specificity, first try decreasing promoter size (300/.90).  Then try increasing stringency and promoter size together (1200/.95).  For maximal specificity, decrease promoter size again (to 600, then 300) while maintaining high stringency.  Remember that many valid results will disappear as stringency increases.

Low stringency analyses (.80) are useful because some TFBM matrices are already excessively stringent or poorly defined. 

 

Q:  How is statistical significance assessed?

A:  Two ways. 

1.)  Frequency analyses compare the average number of TFBMs detected in promoters of differentially expressed genes with the average number in all assayed genes (the "sampling frame"), or to the average number in a second gene list (in a 2-Group Differential Expression Analysis).  These comparisons are carried out using a z-test (comparing a gene list to a sampling frame) or a 2-sample t-test (comparing 2 different gene lists). 

2.)  Incidence analyses determine whether a TFBM is present in a greater fraction of differentially expressed genes than in the sampling frame as a whole (or another gene list).  This is a binary analysis (TFBM is present vs. not in each promoter), executed as an exact binomial test (comparing a gene list to a sampling frame) or a 2-sample z-test (comparing 2 different gene lists).  With > 1000 genes, incidence analyses switch to the normal approximation to the binomial.
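The two single-list tests can be sketched as follows.  This is an illustration under stated assumptions, not the PromoterStats code: the sampling-frame mean and SD are treated as known population values, the binomial test is shown one-sided (over-representation only), and all names and numbers are invented.

```python
import math
from statistics import mean

def frequency_z_test(list_counts, frame_mean, frame_sd):
    """z-test: mean motif count in a gene list vs. the sampling frame.

    Returns (z, two-sided p-value).
    """
    n = len(list_counts)
    z = (mean(list_counts) - frame_mean) / (frame_sd / math.sqrt(n))
    return z, math.erfc(abs(z) / math.sqrt(2))

def incidence_binomial_test(hits, n, frame_rate):
    """Exact binomial upper tail: P(X >= hits) for X ~ Binomial(n, frame_rate)."""
    return sum(math.comb(n, k) * frame_rate**k * (1 - frame_rate)**(n - k)
               for k in range(hits, n + 1))

# Invented motif counts for 8 differentially expressed genes.
counts = [1, 0, 2, 1, 3, 0, 1, 2]
z, p_freq = frequency_z_test(counts, frame_mean=0.5, frame_sd=1.0)

hits = sum(1 for c in counts if c > 0)   # promoters with the motif present
p_inc = incidence_binomial_test(hits, len(counts), frame_rate=0.3)

print(f"frequency: z = {z:.3f}, p = {p_freq:.4f}")
print(f"incidence: {hits}/{len(counts)} present, p = {p_inc:.4f}")
```

Note that the frequency test uses the full counts while the incidence test reduces each promoter to present/absent, which is why the two can disagree.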

TFBMs are ranked according to their p-value in frequency analyses.  BLUE motifs are significantly over-represented, RED motifs are significantly under-represented, and GRAY motifs do not differ significantly from population norms.  Top 100 results are listed.
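The ranking and color coding can be pictured with the sketch below.  The significance cutoff (alpha = .01), the sign convention for the difference, and the function name are assumptions made for illustration.

```python
def classify(results, alpha=0.01, top_n=100):
    """Rank motifs by p-value and label the direction of each difference.

    `results` maps motif name -> (p_value, difference); a positive difference
    means the motif is more prevalent in the gene list than in the reference.
    """
    ranked = sorted(results.items(), key=lambda item: item[1][0])[:top_n]
    labeled = []
    for name, (p_value, diff) in ranked:
        if p_value >= alpha:
            color = "GRAY"      # not significantly different
        elif diff > 0:
            color = "BLUE"      # significantly over-represented
        else:
            color = "RED"       # significantly under-represented
        labeled.append((name, p_value, color))
    return labeled

# Invented results for three motifs.
demo = {"MOTIF_A": (0.001, 0.4), "MOTIF_B": (0.2, 0.1), "MOTIF_C": (0.005, -0.3)}
for row in classify(demo):
    print(row)
```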

 

Q:  Why 2 statistical tests?

A:  They measure different things and have different strengths and vulnerabilities.  Incidence analyses are somewhat more conservative, especially in small sample sizes.  Be most confident when both tests are statistically significant.

 

Q:  What does it mean for a TFBM to be under-represented?

A:  RED motifs are significantly less prevalent in promoters of the analyzed genes than in the sampling frame as a whole.  It is not clear what this means biologically, but it could reflect an inhibitory effect - genes can change expression only if this transcription factor has no opportunity to "veto" the change. 

 

Q:  How many genes should I submit?

A:  100 or more is best.  Analytic sensitivity drops significantly for samples of fewer than 20 genes, and Frequency analysis p-values lose precision.  Incidence analysis p-values remain accurate for any sample size.

 

Q:  Which genes are driving my differential expression results?

A:  Use TELiS Data Retrieval to download raw data for your gene list and load it into a spreadsheet such as Excel.   Examine the column containing data for the differentially represented TFBM to determine which genes contain that motif.
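If you prefer a script to a spreadsheet, the same lookup could be done roughly as follows.  The column header "Gene", the motif names, and the file contents are invented; check the header row of your own download and adjust accordingly.

```python
import csv
import io

# Stand-in for a downloaded tab-delimited file; all names and counts are invented.
raw = "Gene\tMOTIF_A\tMOTIF_B\nGENE1\t2\t0\nGENE2\t0\t1\nGENE3\t1\t3\n"

def genes_with_motif(handle, motif):
    """Return the genes whose promoter contains `motif` at least once."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Gene"] for row in reader if int(row[motif]) > 0]

print(genes_with_motif(io.StringIO(raw), "MOTIF_A"))  # ['GENE1', 'GENE3']
```

For a real file, replace io.StringIO(raw) with open("yourfile.td", newline="").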

 

Q:  Why not analyze differential representation relative to the entire genome?

A:  Genes found in microarrays and other sampling frames are not representative of the entire genome.  TFBM prevalence in microarray-assessed genes can differ by 2-fold or more from their genome-wide prevalence.  The sampling frame defines the set of transcripts that could possibly be observed to change in an experiment, so it represents the appropriate reference population. 

It could be argued that the most appropriate sampling frame is the subset of genes found to be present in a particular experiment (rather than the entire set assayed by a microarray).  A Custom Sampling Frame Analysis allows you to paste a list of "present" genes as the sampling frame and test whether a subset of differentially expressed genes is representative of that population.  Be cautious of custom sampling frames, though, because development studies have shown that biases in microarray "present" calls can reduce the signal-to-noise ratio for detecting transcription factor activity.

 

Q:  What if my sampling frame is not available from TELiS?

A:  Run your analysis relative to the entire genome, but treat the results as provisional until an appropriate sampling frame is defined.  To request a new sampling frame for human, mouse, or rat genes, email coles@ucla.edu with 1.) a list of all genes in your sampling frame (as HGNC Gene Symbols) and 2.) a brief title for your sampling frame (< 20 characters).

 

Q:  In frequency analyses, why not use a Poisson test?

A:  TFBM frequency data do not follow a Poisson distribution, so that test would produce inaccurate p-values.  The variance of TFBM frequency data often exceeds the mean frequency by 2-fold or more, whereas the Poisson distribution assumes the mean and variance are equal.  We recommend using the default z-test instead, but a Poisson-based analysis is available.
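One quick way to see this kind of overdispersion in your own data is the variance-to-mean ratio, which is about 1 for Poisson-distributed counts.  The counts below are invented for illustration.

```python
from statistics import mean, pvariance

def dispersion_index(counts):
    """Variance-to-mean ratio; approximately 1 for Poisson-distributed counts."""
    return pvariance(counts) / mean(counts)

# Invented motif counts across 10 promoters: mostly zeros with a few clusters.
counts = [0, 0, 0, 1, 0, 5, 0, 0, 4, 0]
print(dispersion_index(counts))   # well above 1, so a Poisson model fits poorly
```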

 

Q:  What is the risk of a false positive result?

A:  The p-value for each statistical test gives the risk of a false positive error for that particular TFBM (e.g., p < .01).   TELiS differential expression analyses survey hundreds of individual TFBMs, so the probability of at least one false positive error in the entire set of results is greater than the p-value for any single test.   False positive risks are often analyzed in terms of a "false discovery rate" (FDR) -- the fraction of significant results that are likely due to chance alone.   FDRs depend upon several factors, including the number of genes analyzed, the number of TFBMs surveyed, the characteristics of the promoter scan (stringency and promoter size), the stringency of the statistical analysis (p < .01 vs. p < .00001), and the number of truly significant results present in the data.  
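To get a feel for how quickly the risk of at least one false positive grows with the number of TFBMs tested, the sketch below uses the textbook formula 1 - (1 - alpha)^m.  It assumes the m tests are independent, which is a simplification because many binding motifs resemble one another.

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

# 108 and 192 are the JASPAR and TRANSFAC matrix counts cited above.
for n_tests in (1, 108, 192):
    print(n_tests, round(familywise_error(0.01, n_tests), 3))
```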

TELiS differential expression analyses provide two FDR estimates.   At the top of the output, a Multiple testing note gives the estimated FDR when statistical results are declared significant at p < .01.   At the end of the output is an FDR threshold table which provides specific significance levels (p-values) that control the FDR at 10%, 20%, 30%, or 40%.   Thresholds are derived from the change in FDR over a range of p-values between .03 and .0001. The FDR is calculated for each p-value by comparing the frequency of significant results observed in your data with the incidence of significant results in 10,000 randomly sampled gene lists of similar size and scan characteristics.   The p-value generating a specified FDR is then estimated by regression.
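The general shape of that calculation can be sketched as follows.  This is not the PromoterStats implementation: the FDR formula (chance rate divided by observed rate), the choice of a straight-line fit of FDR on log10(p), and all numbers are assumptions made for illustration.

```python
import math

def estimate_fdr(observed_fraction, null_fraction):
    """FDR at one cutoff: expected chance hits divided by observed hits."""
    if observed_fraction == 0:
        return float("nan")     # no significant results, FDR undefined
    return min(1.0, null_fraction / observed_fraction)

def threshold_for_fdr(p_values, fdrs, target):
    """Fit FDR = a + b*log10(p) by least squares and solve for the target FDR."""
    xs = [math.log10(p) for p in p_values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(fdrs) / len(fdrs)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, fdrs))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return 10 ** ((target - intercept) / slope)

# Invented numbers: fraction of TFBMs called significant at each cutoff.
cutoffs  = [0.03, 0.01, 0.001, 0.0001]
observed = [0.10, 0.05, 0.02, 0.01]      # in the submitted gene list
null     = [0.03, 0.01, 0.001, 0.0001]   # averaged over random gene lists

fdrs = [estimate_fdr(o, n) for o, n in zip(observed, null)]
p_at_10 = threshold_for_fdr(cutoffs, fdrs, target=0.10)
print(fdrs)
print(p_at_10)   # p-value cutoff estimated to hold the FDR near 10%
```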

 

Q:  Do TELiS differential expression analyses actually work?

A:  See some examples

 

Q:  Who made TELiS?

A:  Weihong Yan (wyan@chem.ucla.edu) generated the promoter bank, and Steve Cole (coles@ucla.edu) did scans and statistics. 

 

Q:  How is it implemented?

A:  The TELiS database is powered by MySQL.  Data are generated by the Java application PromoterScan (defining prevalence matrices) and analyzed by the Java servlet PromoterStats (detecting differential representation) running under Apache Tomcat.

 

Q:  Why can't I save my results?

A:  Most likely because you are using the Mozilla Firefox browser.  Other browsers such as Explorer, Navigator, and Safari let you save the current browser content using the "File / Save as..." feature.  Firefox's "Save page as..." function does not save the current content; instead, it tries to read the content from the website again (it saves a link instead of content).  Unfortunately, for security reasons there is no persistent link to your TELiS results.  Firefox loyalists can use this work-around:  From the "View" menu, select "Page source".  When the HTML text pops up, use "Edit / Select all", copy the content, and paste it into a new empty text file (e.g., using Windows Notepad).  Save the text file with an ".html" extension (NOT ".txt") and you will have a permanent copy of the TELiS results page that can be reopened in Firefox.  Alternatively, use another web browser for TELiS analyses.

 

References: 

Cole S, Yan W, Galic Z, Arevalo J, Zack J.  Expression-based monitoring of transcription factor activity: The TELiS database.  Bioinformatics. 2005;21(6):803-810.

 

Quandt K, Frech K, Karas H, Wingender E, Werner T.  MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data.  Nucleic Acids Res. 1995;23(23):4878-4884.

 

 

 
