About ORIO

ORIO (Online Resource for Integrative Omics) is an analysis platform for data from next generation sequencing (NGS). ORIO enables rapid analysis and integration of NGS data sets. ORIO was designed based on three central observations:

  1. Diverse biological phenomena may be represented by discrete positions in genomic space. Think protein binding sites for transcription factor regulation or transcription start sites for transcription initiation.
  2. Despite a wide diversity of NGS experiment and data types, analysis of NGS data often involves consideration and manipulation of genomic read coverage.
  3. Visual inspection remains a critical component of analysis.

The bulk of analysis is performed using the ORIO analysis package. An ORIO analysis run consists of two steps. First, the intersections between a feature list of genomic coordinates and a number of NGS data sets are found. Second, the NGS data sets are correlated based on these intersection values. The output of these steps may be dynamically visualized using ORIO-web.

ORIO has been published in Lavender et al. 2017. To cite in your publications:

Lavender CA, Shapiro AJ, Burkholder AB, Bennett BD, Adelman K, Fargo DC. ORIO (Online Resource for Integrative Omics): a web-based platform for rapid integration of next generation sequencing data. Nucleic Acids Res. 2017 Jun 2; 45 (10): 5678-5690. doi: 10.1093/nar/gkx270.

Data intersection

The intersection of a feature list is iteratively found for each NGS dataset in an analysis. This intersection describes the overlap of read coverage from the NGS data across genomic windows anchored on feature list positions.

ORIO focuses its analysis on a user-selected list of genomic coordinates called a feature list. This feature list may be uploaded as a BED file, or the user may select from genomic feature lists hosted by ORIO. Analysis is performed considering genomic windows about each feature. Dimensions of the windows may be adjusted using the ‘bin start,’ ‘bin number,’ and ‘bin size’ parameters when setting up an analysis.
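
As a rough illustration of how the three window parameters could define a set of bins, here is a minimal Python sketch. It assumes that ‘bin start’ is the offset of the first bin relative to the feature position and that bins are contiguous and non-overlapping; the function name `window_bins` is hypothetical and is not part of ORIO.

```python
def window_bins(position, bin_start, bin_number, bin_size):
    """Return half-open (start, end) intervals, one per bin of a window.

    Assumption: bin_start is an offset from the feature position, and
    bins are laid out back to back with no gaps or overlaps.
    """
    bins = []
    for i in range(bin_number):
        start = position + bin_start + i * bin_size
        bins.append((start, start + bin_size))
    return bins


# A window of 4 bins, 50 bp each, starting 100 bp before position 1000.
bins = window_bins(position=1000, bin_start=-100, bin_number=4, bin_size=50)
```

Under these assumptions the window spans 900 to 1100, centered on the feature.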

ORIO iteratively finds the intersection of selected NGS datasets with the genomic feature list. The reads intersecting with each feature window are found for each dataset. Datasets may be uploaded as read coverage bigWig files. If stranded data is being considered, two separate bigWig files corresponding to forward and reverse strands may be used. Alternatively, the user may select from hosted datasets taken from the first production run of ENCODE.

ORIO is able to find data intersections considering strand information. If strand information is included in the associated BED file, read coverage will be found respecting the strand of each feature: areas downstream of a feature will be given higher values while areas upstream will be given lower values. If the NGS data is stranded (i.e. forward and reverse strand bigWigs are available), then only coverage on the same strand of a stranded feature will be considered.

The product of the data intersection is a two-dimensional matrix, where each row corresponds to a genomic feature and each column corresponds to a bin of the genomic window. The user can download these matrices through the ‘Download zip’ button on an analysis page, which provides access to all data pertinent to an analysis. Matrices generated in the data intersection step are then used in the correlative analysis step.
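
The shape of this matrix can be sketched in a few lines of Python. The example below is illustrative only: a dictionary of per-base coverage stands in for a bigWig file, the function name `intersection_matrix` is hypothetical, and mirroring the window for minus-strand features (so that columns always run from upstream to downstream) is an assumption about how strand is respected. Stranded NGS data is not handled here.

```python
def intersection_matrix(features, coverage, bin_start, bin_number, bin_size):
    """Build a features x bins matrix of summed read coverage.

    features: list of (position, strand) tuples, strand is '+' or '-'.
    coverage: dict mapping genomic position -> coverage (toy stand-in
              for a bigWig file; missing positions count as 0).
    """
    matrix = []
    for position, strand in features:
        row = []
        for i in range(bin_number):
            offset = bin_start + i * bin_size
            if strand == "-":
                # Assumption: mirror the window about the feature so that
                # increasing column index still means further downstream.
                positions = range(position - offset - bin_size + 1,
                                  position - offset + 1)
            else:
                positions = range(position + offset,
                                  position + offset + bin_size)
            row.append(sum(coverage.get(p, 0) for p in positions))
        matrix.append(row)
    return matrix


coverage = {8: 1, 9: 1, 10: 5, 11: 5, 12: 2}
features = [(10, "+"), (10, "-")]
matrix = intersection_matrix(features, coverage,
                             bin_start=-2, bin_number=2, bin_size=2)
```

Each row of `matrix` corresponds to one feature and each column to one bin, matching the layout described above.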

Correlative analysis

Using matrices generated in the data intersection step, ORIO then performs correlative analysis based on compiled read coverage values. NGS datasets and genomic features are grouped by hierarchical clustering and k-means clustering, respectively. Associations discovered through clustering can implicate important coordination of biological functions.

For each NGS dataset, there is a matrix of coverage values for each genomic feature in an analysis. For each dataset pair, the Spearman correlation value is found considering coverage values at each feature; the coverage value used is the sum of coverage across all bins in a genomic window. Hierarchical clustering is performed considering Spearman rho values as the pairwise distance metric.
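
The pairwise correlation step can be sketched as follows. This is a minimal pure-Python illustration, not the ORIO implementation: the helper names are hypothetical, the rank function is quadratic in the number of features for the sake of brevity, and the hierarchical clustering itself is omitted. The sketch also assumes no dataset has identical totals for every feature, since the correlation is undefined in that case.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def dataset_correlations(matrices):
    """matrices: dict of dataset name -> features x bins coverage matrix."""
    # One value per feature: coverage summed across all bins of the window.
    totals = {name: [sum(row) for row in m] for name, m in matrices.items()}
    names = sorted(totals)
    return {(a, b): spearman(totals[a], totals[b])
            for a in names for b in names if a < b}


matrices = {
    "A": [[1, 2], [3, 4], [5, 6]],
    "B": [[2, 2], [4, 4], [9, 9]],
    "C": [[9, 0], [4, 1], [1, 0]],
}
rho = dataset_correlations(matrices)
```

In this toy example, datasets A and B rank the three features identically, while C ranks them in the opposite order.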

To cluster genomic features, the total read coverage in a genomic window for each NGS dataset is concatenated to give a one-dimensional data vector for each feature in an analysis. These vectors are normalized by the variance in each dataset. For each pair of features, the Euclidean distance is found considering these normalized data vectors. k-means clustering is performed observing these distances iteratively with k-values from 2 to 10. Clustering values for each k are saved for future display.
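
A minimal sketch of the feature clustering step is given below, using only the Python standard library. Several details are assumptions rather than facts about ORIO: "normalized by the variance" is read here as dividing each dataset's totals by their population variance, centroids are initialized by seeded random sampling, and the function names are hypothetical. The toy example has only six features, so k is capped at the number of features.

```python
import random


def feature_vectors(matrices, names):
    """One vector per feature: variance-normalized total coverage per dataset."""
    columns = []
    for name in names:
        totals = [sum(row) for row in matrices[name]]
        mean = sum(totals) / len(totals)
        var = sum((v - mean) ** 2 for v in totals) / len(totals)
        # Assumption: normalization divides by the dataset's variance.
        columns.append([v / var if var else 0.0 for v in totals])
    return [list(vec) for vec in zip(*columns)]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    centroids = random.Random(seed).sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


matrices = {
    "X": [[1], [2], [3], [100], [101], [102]],
    "Y": [[5], [6], [7], [200], [201], [202]],
}
vectors = feature_vectors(matrices, ["X", "Y"])
# Cluster for each k and keep every result, as described above.
clusterings = {k: kmeans(vectors, k)
               for k in range(2, min(10, len(vectors)) + 1)}
```

With k = 2, the first three features and the last three features fall into separate clusters.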

Though read coverage is informative for many genomics experiments, in some NGS experiments specialized analytical techniques must be applied to read coverage in order to generate useful data metrics. Also, many non-NGS approaches are relevant for genomics analysis. Acknowledging this, ORIO allows the user to provide a single data value for each genomic feature to be used in correlative analysis of independent NGS datasets. We call this data set the sort vector. A sort vector may be provided at the onset of analysis in the form of a two-column tab-delimited text file where the first column contains feature names and the second contains data values.
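
A file in this format could be read with a few lines of Python, as sketched below. The sketch assumes the file has no header row and that every data value parses as a number; neither detail is stated above, and the function name `read_sort_vector` is hypothetical.

```python
import csv
import io


def read_sort_vector(handle):
    """Parse a two-column tab-delimited file into {feature name: value}."""
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
        name, value = row
        values[name] = float(value)
    return values


# In-memory stand-in for an uploaded file; feature names are made up.
example = io.StringIO("featureA\t1.5\nfeatureB\t-0.25\n")
sort_vector = read_sort_vector(example)
```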

If a sort vector is used, hierarchical clustering is performed focused on the sort vector. Read coverage values for each NGS dataset are correlated with data values in the sort vector by Spearman test. These correlation values are found for read coverages in each genomic window bin. For each dataset, correlation values for each bin are concatenated into a one-dimensional vector. For each dataset pair, the Euclidean distance between these data vectors is found, and the Euclidean distance is used as the distance metric in hierarchical clustering. k-means clustering is performed in the same way in analyses with and without a sort vector.
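
The sort vector procedure can be sketched as follows. This is an illustration under assumptions, not the ORIO code: the helper names are hypothetical, the sort vector is assumed to be listed in the same feature order as the matrix rows, and a compact Spearman helper is defined inline so that the block runs on its own.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def bin_correlations(matrix, sort_values):
    """Correlate each bin (column) of a dataset with the sort vector."""
    return [spearman(list(column), sort_values) for column in zip(*matrix)]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


sort_values = [1, 2, 3, 4]                     # one value per feature
dataset_a = [[1, 8], [2, 6], [3, 4], [4, 2]]   # features x bins
dataset_b = [[2, 1], [4, 2], [6, 3], [8, 4]]

vector_a = bin_correlations(dataset_a, sort_values)
vector_b = bin_correlations(dataset_b, sort_values)
distance = euclidean(vector_a, vector_b)
```

Here the two datasets agree with the sort vector in the first bin but disagree in the second, so their correlation vectors differ in one component only.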

Correlative analysis results are stored for access and display by the web application ORIO-web.

Data management and display of results

ORIO-web is a web application designed to maintain and organize data for analysis by ORIO. ORIO-web also provides dynamic visualization of ORIO results. Together ORIO and ORIO-web allow for fast, flexible, and informative integration of whole-genome data with an intuitive web interface.

Account management

The ORIO-web landing page asks a user to generate an account associated with an email address. All data and analyses managed by ORIO-web are associated with a user account. Most data is privately associated with a user account; however, ORIO-web does allow individual analyses to be designated as public, allowing for rapid sharing of results by URL address.

Data management

ORIO-web manages inputs for the ORIO analysis package. Feature lists, NGS data sets, and sort vectors are associated with a given user account.

Data management controls are found by clicking on the 'Manage data' link button. On the 'Data management' page, headers designate the 'Feature lists', 'Sort vectors', and 'User dataset' sections. Data may be deleted or modified by clicking on entries under each header, or new entries may be created by clicking on 'Create new' buttons.

When creating new entries, each data type requires a name, an associated genome assembly, and a correctly formatted data set. Feature lists may be specified as stranded; if so, strand must be specified for each entry in the associated BED file in the sixth column. Sort vectors must be associated with an existing feature list, and that feature list must be specified upon creation.
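
Before uploading a stranded feature list, the sixth column can be checked with a short script such as the one below. This is a hypothetical pre-upload check, not the validation ORIO-web itself performs; it assumes tab-delimited BED lines and treats only "+" and "-" as valid strands for a stranded list.

```python
def unstranded_lines(lines):
    """Return 1-based line numbers lacking '+' or '-' in the sixth column."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and common BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] not in ("+", "-"):
            bad.append(number)
    return bad


bed_lines = [
    "chr1\t100\t200\tfeature1\t0\t+",
    "chr1\t300\t400\tfeature2\t0\t-",
    "chr2\t50\t60\tfeature3\t0\t.",   # strand given as '.', not usable
    "chr2\t70\t80",                   # only three columns
]
bad = unstranded_lines(bed_lines)
```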

NGS data sets are uploaded to the tool as read coverage bigWig files. Given the large size of these files, we require these files to be hosted by the user and be publicly accessible by HTTP download. When creating a data set entry, the user must provide a valid URL for HTTP access.

Analysis management

Completed and pending analyses are presented on the ORIO-web dashboard.

Analysis visualization

ORIO-web provides an intuitive interface for investigating analysis results. The visualization interface may be accessed for a completed analysis by selecting that analysis on the dashboard and clicking 'View visualization' on the analysis page. The results of an ORIO analysis may also be downloaded as a zip file by selecting 'Download zip' from the 'Actions' drop-down on an analysis page.

Dataset clustering, without a sort vector.

Dataset clustering, with a sort vector.

Feature clustering.