Methods


I. Normalization
II. Tile scoring
III. Feature identification


I. Normalization

For each array in an experimental set, the relative contributions of the test and reference signals are compared. Ideally, if nucleic acid probes have equal concentration in the test and reference samples, the signals of the two dyes should be approximately equal (i.e., the ratio of the two signals should be close to one for probes hybridizing to an equal degree in both fluorescence channels). In practice, the signals can be rather different due to the different chemical properties of the dyes and nonspecific or incomplete hybridization to the array. Normalization compensates for these effects, depending on the method used, by either applying a scale factor to equalize signals from probes with unchanged concentration or by imposing the same empirical distribution of signal intensities. We collected standard statistical methods described in the literature, implemented them, and made them conveniently available for tiling array data analysis.

Mean/median normalization. Normalization by mean or median, the so-called ‘constant majority’ methods, is based on the assumption that the majority of genes do not change their expression level in response to the experimental perturbation. It is carried out by subtracting the mean or median of the base-two logarithm of the ratio of test to reference signal intensities from the log-ratio value of each tile on a single array. This procedure centers the log-ratio distribution at zero. The mean is the maximum likelihood estimator of the center of the log-ratio distribution if it is normal, while the median is more robust to outliers. In practice, however, because the mean and the median of the log-ratios from the probes on each array are often very close to each other, these two methods usually give very similar results. The advantages of these two methods include the ease of their implementation and their robustness to violations of the assumption: they remain applicable even when up to 50% of probes have altered concentrations.
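As a concrete illustration, the centering step amounts to a few lines of code. The following sketch (the function and variable names are ours, not Tilescope's) assumes the raw test and reference intensities of one array are held in NumPy arrays:

    # Mean/median normalization: center the log2-ratio distribution at zero.
    import numpy as np

    def center_log_ratios(test, ref, use_median=False):
        """Return log2(test/ref) values centered by their mean or median."""
        log_ratio = np.log2(test) - np.log2(ref)
        center = np.median(log_ratio) if use_median else np.mean(log_ratio)
        return log_ratio - center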

Loess normalization. Loess normalization normalizes array data between channels and simultaneously removes intensity-specific artifacts in the log-ratio measurements. Like normalization by mean or median, loess normalization is performed on an array-by-array basis. For each array, Tilescope first uniformly samples 50,000 log-ratio values from the original data and then performs locally weighted regression on the sampled data. The dependency of the log-ratio on the intensity is removed by subtracting the log-ratio predicted by the loess regression from the actual log-ratio, and the new test and reference log-intensities after normalization are recovered from the residuals. The main disadvantage of loess normalization is that locally weighted regression is computationally intensive, hence the need to use sampled data instead of the original, much larger data set. Since loess normalization is carried out array by array, it remains expensive to use even after data sampling.
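The sampling-and-regression step might look like the sketch below; the use of the lowess smoother from statsmodels, the smoothing fraction, and the interpolation back to all tiles are our assumptions, not necessarily Tilescope's actual choices:

    # Intensity-dependent (MA-style) loess normalization of one array.
    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def loess_normalize(test, ref, n_sample=50_000, frac=0.3, seed=0):
        """Remove the intensity-dependent trend from one array's log-ratios."""
        m = np.log2(test) - np.log2(ref)          # log-ratio
        a = 0.5 * (np.log2(test) + np.log2(ref))  # mean log-intensity
        rng = np.random.default_rng(seed)
        idx = rng.choice(m.size, size=min(n_sample, m.size), replace=False)
        # Fit on the sample; lowess returns (a, fitted m) sorted by a.
        fit = lowess(m[idx], a[idx], frac=frac)
        trend = np.interp(a, fit[:, 0], fit[:, 1])  # predict at all tiles
        m_new = m - trend                           # residual log-ratios
        # Recover the normalized log-intensities from (a, m_new).
        return a + m_new / 2, a - m_new / 2         # log2 test', log2 ref'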

Quantile normalization. Unlike the normalization methods discussed above, quantile normalization [18] not only normalizes data between channels and across arrays simultaneously but also removes the dependency of the log-ratio on the intensity in one step. It imposes the same empirical distribution of intensities on each channel of every array. To achieve this, Tilescope first creates an n×2p (log) intensity matrix M, where n is the number of tiles on an array and p is the number of arrays in an experimental data set, and then sorts each column of M separately to give Ms. Afterwards, it takes the mean across rows of Ms and creates Ms', a matrix of the same dimension as M in which all values in each row are equal to that row's mean in Ms. Finally, Tilescope produces the quantile-normalized (log) intensity matrix Mn by rearranging each column of Ms' to have the same ordering as the corresponding column of M. Quantile normalization is fast and has been demonstrated to outperform other normalization methods. Thus it is the default normalization method used by Tilescope.
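In code, quantile normalization reduces to sorting, averaging, and re-ranking. A minimal sketch of our own, assuming M holds log intensities in a NumPy array with one column per channel of each array:

    # Quantile normalization: give every column the same empirical distribution.
    import numpy as np

    def quantile_normalize(M):
        """Return Mn, with each column of M mapped onto the mean distribution."""
        order = np.argsort(M, axis=0)                # column-wise sort order
        row_means = np.sort(M, axis=0).mean(axis=1)  # row means of sorted Ms
        Mn = np.empty_like(M, dtype=float)
        for j in range(M.shape[1]):
            # Place each row mean back at the rank it occupies in column j.
            Mn[order[:, j], j] = row_means
        return Mn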


II. Tile scoring

For each tile, given its neighboring tiles across replicates, Tilescope calculates the pseudo-median log-ratio value as its signal. The pseudo-median (a.k.a. the Hodges-Lehmann estimator) of the log-ratio is a nonparametric estimator of the difference between the logged intensities of the test sample and those of the reference sample. It is calculated for each tile using a sliding window. The tiles from all arrays in a sliding window are first collected into a tile set, and the pseudo-median is calculated for this window as S = median[(log-ratio_i + log-ratio_j)/2] over all (i, j) pairs of tiles in the tile set. As a nonparametric estimator, the pseudo-median is less susceptible to distributional abnormalities (such as skewness, unusual kurtosis, and outliers).
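A minimal sketch of the pseudo-median computation for one window's tile set follows; whether the (i, i) self-pairs are included varies between definitions of the Hodges-Lehmann estimator, so their inclusion here is our assumption:

    # Pseudo-median (Hodges-Lehmann estimator) of a window's log-ratios.
    import numpy as np
    from itertools import combinations_with_replacement

    def pseudo_median(log_ratios):
        """Median of the pairwise (Walsh) averages of the log-ratios."""
        walsh = [(x + y) / 2
                 for x, y in combinations_with_replacement(log_ratios, 2)]
        return np.median(walsh)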

Due to the small sample size in each sliding window, whether the intensity distribution is normal in a given window cannot be reliably assessed. Without making a normality assumption about the intensity distribution, Tilescope uses the nonparametric Wilcoxon signed-rank test to compare the test signal intensities with the reference ones and quantifies, within each sliding window, the significance with which the former consistently deviate from the latter. It tests the null hypothesis that the median of the probability distribution of the differences between the logged intensities from the test sample and those from the reference sample is zero.
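With SciPy, the per-window test might be expressed as follows, a sketch assuming the logged test and reference intensities of the window's tiles are given as paired arrays:

    # Wilcoxon signed-rank test on the paired logged intensities of one window.
    from scipy.stats import wilcoxon

    def window_p_value(log_test, log_ref):
        """P-value for the null that the median paired difference is zero."""
        statistic, p_value = wilcoxon(log_test, log_ref)
        return p_value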

At the scoring step, Tilescope generates two tile maps, the signal map and the P-value map. Two values are calculated for each tile position: the pseudo-median of log-ratios, a measure of the fold enrichment of the hybridization signal in the test sample over the reference at this genomic location, and the P-value, the probability that the null hypothesis (that the local intensities of the test and reference samples are the same) is true.


III. Feature identification

Given the tile map annotated with pseudo-medians and P-values, Tilescope filters out tiles that fall below user-specified thresholds. Retained tiles are used to identify either differentially expressed genes or TFBSs. Currently Tilescope users can choose one of three methods to identify such features.

Max-gap and min-run. Based on the observation that a tile is usually too short to constitute a feature alone, the first method, modified from the scoring scheme used in Cawley et al. and Emanuelsson et al., groups together qualified tiles that are close to each other along the genomic sequence into ‘proto-features’ and then discards any proto-features that are too short. To use this method, a user needs to specify the maximum genomic distance (‘max-gap’) below which two adjacent qualified tiles can be joined and the minimum length (‘min-run’) of a proto-feature for it to be qualified as a feature.
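The grouping logic can be sketched as follows, assuming the qualified tiles arrive as (start, end) genomic intervals sorted by start (the function name and interval representation are our assumptions):

    # Max-gap/min-run: join nearby qualified tiles, drop short proto-features.
    def max_gap_min_run(tiles, max_gap, min_run):
        """Return features from sorted (start, end) tiles of one chromosome."""
        features = []
        cur_start, cur_end = tiles[0]
        for start, end in tiles[1:]:
            if start - cur_end <= max_gap:   # close enough: extend proto-feature
                cur_end = max(cur_end, end)
            else:                            # gap too large: close it out
                if cur_end - cur_start >= min_run:
                    features.append((cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_end - cur_start >= min_run:   # flush the last proto-feature
            features.append((cur_start, cur_end))
        return features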

Iterative peak identification. The second method, which we have recently developed and implemented as part of the pipeline, does not group qualified tiles into features. Instead, it identifies local signal ‘peaks’ in an iterative fashion. This method was developed to generate lists of non-overlapping features of a uniform genomic size.

Starting from the signal map generated in the tile-scoring step, in which window-smoothing integrates the data from multiple replicate arrays, this method first identifies the tile (‘point source’) that corresponds to the peak in the signal map with the global maximum signal that also meets a predefined P-value threshold. A feature of a predefined genomic size is then created centered at the genomic position of the peak. We choose a feature size comparable to the average size of the fragmented ChIP DNA (typically ~1 kb). The feature is assigned a signal measurement from the associated peak.

All tiles within a predefined distance from the located ‘peak’ are then removed from the signal map data. Typically the distance is the same size as the selected features, though it can be larger. This is to ensure that apparent ‘secondary peaks’ in the signal maps that are really part of the same feature are not separately identified. The procedure is then iterated to find the next maximum ‘peak’ in the remaining signal map data. The iteration generates a list of features ranked by ‘peak’ signals and terminates when the identified ‘peak’ signal is below a specified signal enrichment threshold.
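The iteration can be sketched as below; `pos` and `signal` are assumed to be parallel NumPy arrays of tile positions and window-smoothed signals that already pass the P-value filter (the names and details are ours):

    # Iterative peak identification: rank non-overlapping fixed-size features.
    import numpy as np

    def iterative_peaks(pos, signal, feature_size, min_signal, exclusion=None):
        """Return (start, end, signal) features ranked by peak signal."""
        exclusion = feature_size if exclusion is None else exclusion
        pos, signal = pos.copy(), signal.copy()
        features = []
        while signal.size and signal.max() >= min_signal:
            i = np.argmax(signal)                 # current global maximum
            center = pos[i]
            features.append((center - feature_size // 2,
                             center + feature_size // 2, signal[i]))
            # Drop all tiles within the exclusion distance of this peak so
            # secondary peaks of the same feature are not re-identified.
            keep = np.abs(pos - center) > exclusion
            pos, signal = pos[keep], signal[keep]
        return features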

Hidden Markov model. The third method uses a supervised scoring framework based on hidden Markov models to predict and score features in the genome tiled on the microarray. Our method differs from previous HMM-based studies by specifically taking validated biological knowledge (e.g. experimental validation, gene annotation, etc.) into consideration and systematically incorporating it to score different types of array assays within the same framework.

For identification of transcriptionally active regions (TARs)/transcribed fragments (transfrags) in transcriptional tiling array data, a four-state (TAR, non-TAR, and two intermediate transition states) HMM is constructed using the knowledge of gene annotation (information from the probes that fall into annotated gene regions). For ChIP-chip data, a two-state (TFBS and non-TFBS) HMM is constructed by using the knowledge of inner regions of genes to estimate the signal emission distribution g(t) of the non-TFBS state, and by subtracting g(t) from the overall emission distribution h(t) to estimate the emission distribution f(t) of the TFBS state. In a more general scenario, our framework first selects a medium-sized set of sub-regions using an appropriate analysis method (e.g., the MaxEntropy sampling scheme) and then uses the knowledge in these sub-regions as the training set to build the model for accurate analysis.

Further scoring of the initial analysis results can also be done by computing the posterior probability of each probe being active. These scores indicate the confidence in every single probe-level prediction and can be used to refine the previous HMM analysis results. For instance, the identified active probes can be ranked according to the overall confidence levels in their regions, and a threshold confidence level may either be set manually or learned automatically to refine the original results.
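For concreteness, the posterior computation for a two-state (non-TFBS/TFBS) HMM with Gaussian emissions can be sketched with the standard forward-backward recursions; the Gaussian emission model and all parameter shapes below are illustrative assumptions, not Tilescope's trained model:

    # Forward-backward posteriors for a two-state Gaussian-emission HMM.
    import numpy as np
    from scipy.stats import norm

    def posterior_tfbs(signal, trans, means, sds, start):
        """Return P(state = TFBS | all probe signals) for each probe."""
        n = signal.size
        # Emission likelihoods under each state's (assumed) Gaussian model.
        emis = np.stack([norm.pdf(signal, means[s], sds[s]) for s in (0, 1)],
                        axis=1)
        fwd = np.zeros((n, 2))
        bwd = np.zeros((n, 2))
        fwd[0] = start * emis[0]
        fwd[0] /= fwd[0].sum()            # scale each step to avoid underflow
        for t in range(1, n):
            fwd[t] = emis[t] * (fwd[t - 1] @ trans)
            fwd[t] /= fwd[t].sum()
        bwd[-1] = 1.0
        for t in range(n - 2, -1, -1):
            bwd[t] = trans @ (emis[t + 1] * bwd[t + 1])
            bwd[t] /= bwd[t].sum()
        post = fwd * bwd
        post /= post.sum(axis=1, keepdims=True)
        return post[:, 1]                 # column 1 = TFBS state

Probes can then be ranked by these posteriors, and a confidence threshold applied manually or learned automatically, as described above.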

