
Methods
I. Normalization
II. Tile scoring
III. Feature identification


I. Normalization
For each array in an experimental set, the relative contributions of the test and reference signals are compared. Ideally, if nucleic acid probes have equal concentration in the test and reference samples, the signals of the two dyes should be approximately equal (i.e., the ratio of the two signals should be close to one for probes hybridizing to an equal degree in both fluorescence channels). In practice, the signals can differ considerably owing to the different chemical properties of the dyes and to nonspecific or incomplete hybridization to the array. Normalization compensates for these effects by, depending on the method, either applying a scale factor to equalize signals from probes with unchanged concentration or imposing the same empirical distribution of signal intensities. We collected standard statistical methods described in the literature and implemented them so that they are conveniently available for tiling array data analysis.
Mean/median normalization. Normalization by mean or median, the so-called 'constant majority' methods, is based on the assumption that the majority of genes do not change their expression level in response to the experimental perturbation. It is carried out by subtracting the mean or median of the base-two logarithm of the ratio of test to reference signal intensities from the log-ratio value of each tile on a single array. This procedure centers the log-ratio distribution at zero. The mean is the maximum likelihood estimator of the central location of normally distributed data, while the median is more robust to outliers. In practice, however, because the mean and the median of the log-intensities from the probes on each array are often very close to each other, the two methods usually give very similar results. Their advantages include the ease of implementation and their robustness to violation of the assumption: they remain applicable even when up to 50% of probes have altered concentrations.
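As an illustration, median centering of a single array's log-ratios can be sketched in a few lines of NumPy (the function name and data here are hypothetical, not Tilescope's actual code):

```python
import numpy as np

def median_center(test, reference):
    """Center the base-two log-ratios of one array at zero.

    `test` and `reference` are per-tile raw intensity vectors for a
    single array (illustrative names).
    """
    log_ratio = np.log2(test) - np.log2(reference)
    return log_ratio - np.median(log_ratio)

# Example: a uniform 1.7x dye bias is removed by centering.
rng = np.random.default_rng(0)
ref = rng.uniform(100, 1000, size=5000)
tst = ref * 1.7
centered = median_center(tst, ref)
```

After centering, the median of the log-ratio distribution is zero by construction; substituting `np.mean` for `np.median` gives the mean variant.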
Loess normalization. Loess normalization normalizes array data between channels and simultaneously removes intensity-specific artifacts in the log-ratio measurements. Like normalization by mean or median, loess normalization is performed on an array-by-array basis. For each array, Tilescope first uniformly samples 50,000 log-ratio values from the original data and then performs locally weighted regression on the sampled data. The dependency of the log-ratio on the intensity is removed by subtracting the log-ratio predicted by the loess regression from the actual log-ratio, and the new test and reference log-intensities after normalization are recovered from the residuals. The main disadvantage of loess normalization is that the locally weighted regression is computationally intensive, which necessitates using sampled data instead of the original, much larger data set. Because loess normalization is carried out array by array, it remains expensive to use even after data sampling.
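The idea of subtracting a locally fitted trend can be sketched with a minimal tricube-weighted local linear regression; this is a simplified stand-in for a full loess implementation, not the pipeline's own code, and all names are illustrative:

```python
import numpy as np

def simple_loess(x, y, frac=0.3):
    """Tricube-weighted local linear regression (a minimal loess sketch).

    For each point, fit a weighted line to the `frac` nearest
    neighbours and predict at that point.
    """
    n = len(x)
    k = max(2, int(frac * n))
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(xs - xs[i])
        idx = np.argsort(d)[:k]                 # k nearest neighbours
        h = d[idx].max() or 1.0                 # local bandwidth
        w = (1 - (d[idx] / h) ** 3) ** 3        # tricube weights
        sw = np.sqrt(w)
        A = np.column_stack([np.ones(k), xs[idx]]) * sw[:, None]
        coef, *_ = np.linalg.lstsq(A, ys[idx] * sw, rcond=None)
        fitted[i] = coef[0] + coef[1] * xs[i]
    out = np.empty(n)
    out[order] = fitted
    return out

# Removing the intensity trend: residual log-ratios after the fit.
x = np.linspace(0.0, 1.0, 50)      # e.g. average log-intensity per tile
y = 2.0 * x + 1.0                  # a purely linear intensity trend
residuals = y - simple_loess(x, y)
```

With a purely linear trend the residuals vanish; on real data the residuals are the intensity-corrected log-ratios described above.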
Quantile normalization. Unlike the normalization methods discussed above, quantile normalization [18] not only normalizes data between channels and across arrays simultaneously but also removes the dependency of the log-ratio on the intensity in one step. It imposes the same empirical distribution of intensities on each channel of every array. To achieve this, Tilescope first creates an n×2p (log) intensity matrix M, where n is the number of tiles on an array and p is the number of arrays in an experimental data set, and then sorts each column of M separately to give M_{s}. Next, it takes the mean across each row of M_{s} and creates M_{s}', a matrix of the same dimensions as M in which all values in each row equal the corresponding row mean of M_{s}. Finally, Tilescope produces the quantile-normalized (log) intensity matrix M_{n} by rearranging each column of M_{s}' to have the same ordering as the corresponding column of M. Quantile normalization is fast and has been demonstrated to outperform other normalization methods, so it is the default normalization method used by Tilescope.
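The three matrix steps above translate directly into a few lines of NumPy (a sketch; ties within a column are broken by position rather than averaged):

```python
import numpy as np

def quantile_normalize(M):
    """Quantile-normalize the columns of an n x 2p intensity matrix.

    Each column (one channel of one array) is forced to share the
    empirical distribution given by the row means of the sorted
    columns, while preserving each column's original ordering.
    """
    ranks = np.argsort(np.argsort(M, axis=0), axis=0)  # rank of each entry in its column
    row_means = np.sort(M, axis=0).mean(axis=1)        # row means of M_s
    return row_means[ranks]                            # M_n

M = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Mn = quantile_normalize(M)
# Every column of Mn now contains exactly the same set of values.
```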
II. Tile scoring
For each tile, given its neighboring tiles across replicates, Tilescope calculates the pseudomedian log-ratio value as its signal. The pseudomedian (also known as the Hodges-Lehmann estimator) of the log-ratio is a nonparametric estimator of the difference between the logged intensities of the test sample and those of the reference sample. It is calculated for each tile using a sliding window. The tiles from all arrays in a sliding window are first collected into a tile set, and the pseudomedian is calculated for that window as S = median[(log-ratio_{i} + log-ratio_{j})/2] over all (i, j) pairs of tiles in the tile set. As a nonparametric estimator, the pseudomedian is less susceptible to distributional abnormalities (such as skewness, unusual kurtosis, and outliers).
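The window statistic S can be sketched directly from its definition (a hypothetical helper; pairs are taken with replacement, i.e. the Walsh averages):

```python
import numpy as np
from itertools import combinations_with_replacement

def pseudomedian(log_ratios):
    """Hodges-Lehmann estimator: the median of all pairwise (Walsh) averages."""
    walsh = [(a + b) / 2.0
             for a, b in combinations_with_replacement(log_ratios, 2)]
    return float(np.median(walsh))

# A window contaminated by one outlier: the pseudomedian stays near the bulk.
window = [0.9, 1.0, 1.1, 1.2, 8.0]
s = pseudomedian(window)
```

Here the estimate stays close to the uncontaminated values, illustrating the robustness to outliers noted above.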
Because of the small sample size in each sliding window, it cannot be reliably assessed whether the intensity distribution in a given window is normal. Without assuming normality of the intensity distribution, Tilescope therefore uses the nonparametric Wilcoxon signed-rank test to compare the test with the reference signal intensities, quantifying the significance with which the former consistently deviates from the latter in each sliding window. The test's null hypothesis is that the median of the probability distribution of the differences between the logarithms of the intensities from the test sample and those from the reference sample is zero.
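For illustration, SciPy's implementation of the signed-rank test can serve as a stand-in for the pipeline's internal one (synthetic data; assumes SciPy is available):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical window: 30 paired log2 intensities, with the test sample
# uniformly enriched by 0.5 on the log2 scale (~1.4-fold).
rng = np.random.default_rng(1)
ref_log = rng.normal(8.0, 0.2, size=30)
test_log = ref_log + 0.5
stat, p = wilcoxon(test_log, ref_log)
```

A consistent shift of the test intensities above the reference yields a small P-value, which is exactly the per-window quantity Tilescope records in its P-value map.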
At the scoring step, Tilescope generates two tile maps: the signal map and the P-value map. Two values are calculated for each tile position: the pseudomedian of log-ratios, a measure of the fold enrichment of the hybridization signal in the test sample over the reference at this genomic location, and the P-value, the probability that the null hypothesis (that the local intensities of the test and reference samples are the same) is true.
III. Feature identification
Given the tile map annotated with pseudomedians and P-values, Tilescope filters away tiles that fall below user-specified thresholds. The retained tiles are used to identify either differentially expressed genes or TFBSs. Currently, Tilescope users can choose one of three methods to identify such features.
Maxgap and minrun. Based on the observation that a single tile is usually too short to constitute a feature on its own, the first method, modified from the scoring scheme used by Cawley et al. and Emanuelsson et al., groups qualified tiles that lie close to each other along the genomic sequence into 'proto-features' and then discards any proto-feature that is too short. To use this method, a user specifies the maximum genomic distance ('maxgap') below which two adjacent qualified tiles can be joined and the minimum length ('minrun') a proto-feature must reach to qualify as a feature.
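A minimal sketch of the maxgap/minrun grouping (interval coordinates and threshold values are illustrative):

```python
def call_features(tiles, maxgap, minrun):
    """Group qualified tiles into proto-features and drop short ones.

    `tiles` is a list of (start, end) genomic intervals for tiles that
    passed the signal and P-value thresholds, sorted by start. Adjacent
    tiles are joined when the gap between them is at most `maxgap`;
    a proto-feature is kept only if its length is at least `minrun`.
    """
    features = []
    cur_start, cur_end = tiles[0]
    for start, end in tiles[1:]:
        if start - cur_end <= maxgap:
            cur_end = max(cur_end, end)        # extend the current proto-feature
        else:
            features.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    features.append((cur_start, cur_end))
    return [(s, e) for s, e in features if e - s >= minrun]

tiles = [(100, 150), (160, 210), (500, 550),
         (1000, 1050), (1060, 1110), (1120, 1170)]
feats = call_features(tiles, maxgap=30, minrun=120)
```

With these thresholds, only the three-tile run starting at 1000 is long enough to survive the minrun filter.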
Iterative peak identification. The second method, which we recently developed and implemented as part of the pipeline, does not group qualified tiles into features. Instead, it identifies local signal 'peaks' in an iterative fashion. This method was developed to generate lists of non-overlapping features of a uniform genomic size.
Taking the signal map generated in the tile-scoring step, where window smoothing integrates the data from multiple replicate arrays, this method first identifies the tile (the 'point source') corresponding to the peak with the global maximum signal that also meets a predefined P-value threshold. A feature of a predefined genomic size is then created, centered at the genomic position of the peak. We choose a feature size comparable to the average size of the fragmented ChIP DNA (typically ~1 kb). The feature is assigned a signal measurement from the associated peak.
All tiles within a predefined distance of the located 'peak' are then removed from the signal map data. Typically this distance equals the selected feature size, though it can be larger. This ensures that apparent 'secondary peaks' in the signal map that are really part of the same feature are not identified separately. The procedure is then iterated to find the next maximum 'peak' in the remaining signal map data. The iteration generates a list of features ranked by 'peak' signal and terminates when the identified 'peak' signal falls below a specified signal enrichment threshold.
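The iterative procedure can be sketched as follows (a simplified NumPy version with hypothetical names; here the masking distance equals the feature size):

```python
import numpy as np

def iterative_peaks(positions, signals, feature_size, min_signal):
    """Iteratively pick the global-maximum tile, emit a fixed-size
    feature centred on it, and mask all tiles within `feature_size`
    of the peak before searching again.

    `positions` are tile genomic coordinates; `signals` are the
    corresponding pseudomedian log-ratios from the signal map.
    """
    positions = np.asarray(positions, dtype=float)
    signals = np.asarray(signals, dtype=float).copy()
    features = []
    while True:
        i = int(np.argmax(signals))
        if signals[i] < min_signal:
            break                              # enrichment threshold reached
        centre = positions[i]
        features.append((centre - feature_size / 2,
                         centre + feature_size / 2, signals[i]))
        # Mask tiles near the peak so secondary peaks of the same
        # feature are not re-reported.
        signals[np.abs(positions - centre) <= feature_size] = -np.inf
    return features

positions = np.arange(0, 3000, 100)
signals = np.zeros(30)
signals[10] = 5.0   # primary peak at position 1000
signals[11] = 4.0   # secondary peak 100 bp away, part of the same feature
signals[25] = 3.0   # independent peak at position 2500
feats = iterative_peaks(positions, signals, feature_size=1000, min_signal=1.0)
```

The secondary peak at 1100 is absorbed into the first feature's masking zone, so only two ranked features are reported.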
Hidden Markov model. The third method uses a supervised scoring framework based on hidden Markov models (HMMs) to predict and score features in the genome tiled on the microarray. Our method differs from previous HMM-based studies by specifically taking validated biological knowledge (e.g., experimental validation and gene annotation) into consideration and systematically incorporating it to score different types of array assays within the same framework.
For the identification of transcriptionally active regions (TARs)/transcribed fragments (transfrags) in transcriptional tiling array data, a four-state (TAR, non-TAR, and two intermediate transition states) HMM is constructed using knowledge of gene annotation (information from the probes that fall into annotated gene regions). For ChIP-chip data, a two-state (TFBS and non-TFBS) HMM is constructed by using knowledge of the inner regions of genes to estimate the signal emission distribution g(t) of the non-TFBS state, and by subtracting g(t) from the overall emission distribution h(t) to estimate the emission distribution f(t) of the TFBS state. In a more general, ideal scenario, our framework first selects a medium-sized set of subregions using appropriate analysis methods (e.g., the MaxEntropy sampling scheme) and then uses the knowledge in these subregions as a training set to build the model for accurate analysis.
Further scoring of the initial analysis results can also be done by computing the posterior probability of each probe being active. These scores indicate the confidence in every single probe-level prediction and can be used to refine the HMM's previous analysis results. For instance, the identified active probes can be ranked according to the overall confidence levels of their regions, and a threshold confidence level may either be set manually or learned automatically to refine the original results.
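As a sketch of the decoding step such an HMM performs, the following is a generic two-state Viterbi decoder in NumPy; it is not the exact model described above, and the emission and transition parameters are illustrative:

```python
import numpy as np

def viterbi(obs_loglik, log_trans, log_init):
    """Viterbi decoding for a small HMM (e.g. TFBS / non-TFBS states).

    `obs_loglik[t, s]` is the log emission likelihood of probe t under
    state s; `log_trans[s, s2]` and `log_init[s]` are log transition
    and initial probabilities. Returns the most probable state path.
    """
    T, S = obs_loglik.shape
    dp = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_init + obs_loglik[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans   # cand[s, s2]: score of s -> s2
        back[t] = np.argmax(cand, axis=0)
        dp[t] = cand[back[t], np.arange(S)] + obs_loglik[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(dp[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Toy data: state 0 (non-TFBS) emits ~N(0,1), state 1 (TFBS) emits ~N(5,1).
x = np.array([0.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0])
obs = np.column_stack([-0.5 * x**2, -0.5 * (x - 5.0) ** 2])  # log-likelihoods up to a constant
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_init = np.log(np.array([0.5, 0.5]))
path = viterbi(obs, log_trans, log_init)
```

The decoder recovers the contiguous run of enriched probes as a single TFBS-state segment; probe-level posterior confidences, as described above, would come from forward-backward rather than Viterbi decoding.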


