Yes. The data files you upload for analysis, as well as any analysis results, are not downloaded or examined in any way by
the administrators unless required for system maintenance and troubleshooting. All files will be deleted automatically after
72 hours, and no archives or backups are kept unless you have registered an account and saved the analysis.
You are advised to download your results immediately after performing an analysis.
NetworkAnalyst accepts data from 17 species, in the following formats:
List(s) of genes or proteins: one or more lists of gene or protein IDs with optional expression profiles (i.e. fold changes).
Each gene should be in a row. Please refer to our example data for more details.
Single gene expression data table: a data table containing expression values (e.g. gene/probe intensities from microarray or
counts from RNA-seq), saved as a tab-delimited text file (.txt) with rows for features (genes/probes) and
columns for samples. The tab-delimited file can be generated from any spreadsheet program. More details are
provided in the following sections.
Multiple gene expression tables: gene expression data from multiple studies collected under similar conditions can be
integrated in a meta-analysis.
Network file: users can upload network files generated with other software to perform network visualization in
NetworkAnalyst. More details on network file formats are provided in the corresponding questions below.
Short-read RNA-Seq fastq files: users can upload single-end or paired-end RNA-Seq fastq files and perform quality checking,
trimming, and mapping using a well-established Galaxy pipeline. Please note that because this task cannot be
completed in real time, (free) registration is required: users need to provide a valid email address in order to retrieve the results later.
There is a 50MB limit for the uploaded data. For gene expression profiles with 20 000 genes, this
corresponds to about 300 samples. Note - since DESeq2 requires high computational resources, there is a
50 sample limit for this option.
It is critical to properly label your data so that they can be recognized and compared.
The following common IDs are supported:
Gene ID: Entrez ID, Ensembl Gene ID, GenBank Accession ID, RefSeq ID, Ensembl Transcript ID, and official Gene Symbol
Probe ID (for human, mouse and rat only): popular microarray platforms from Affymetrix, Agilent, and Illumina.
The gene expression data also should contain sample names in the first line. Each sample name should be unique.
The class labels of experimental conditions should be in a new line beginning with "#CLASS".
Multiple class labels can be indicated by adding a colon and its name (for example, "#CLASS:cancer_type" and "#CLASS:stage").
For meta-analysis, the same set of labels must be used for ALL datasets.
Here is a good tutorial
on how to generate tab delimited text files from the Excel Spreadsheet program. When you open your data using
any text editor (for example, WordPad), it should look like the following:
Sample name, one class label (one missing value)
#NAME Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
#CLASS case case case case control control control control
Gene1 -3.06 -2.25 -1.15 -6.64 0.4 1.08 1.22 1.02
Gene2 -1.36 -0.67 -0.17 -0.97 -2.32 -5.06 0.28 1.32
Gene3 1.61 -0.27 0.71 -0.62 0.14 0.11 0.98
Gene4 0.93 1.29 -0.23 -0.74 -2 -1.25 1.07 1.27
Sample name, two class labels (cancer and sex)
#NAME Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
#CLASS:CANCER case case case case control control control control
#CLASS:SEX F F M M F M F M
Gene1 -3.06 -2.25 -1.15 -6.64 0.4 1.08 1.22 1.02
Gene2 -1.36 -0.67 -0.17 -0.97 -2.32 -5.06 0.28 1.32
Gene3 1.61 -0.27 0.71 -0.62 0.14 0.11 0.98
Gene4 0.93 1.29 -0.23 -0.74 -2 -1.25 1.07 1.27
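The layout above can be read with a few lines of Python. This is only a minimal sketch of a reader for this layout, not the parser used by NetworkAnalyst; the function name `parse_expression_table` is hypothetical, and treating an empty cell as a missing value is an assumption.

```python
import csv
import io


def parse_expression_table(text):
    """Parse a tab-delimited table with #NAME and #CLASS header rows."""
    samples, classes, values = [], {}, {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue
        key, cells = row[0].strip(), row[1:]
        if key == "#NAME":
            samples = cells
        elif key.startswith("#CLASS"):
            # "#CLASS" or "#CLASS:label"; an unnamed label is stored as "CLASS"
            label = key.partition(":")[2] or "CLASS"
            classes[label] = cells
        else:
            # Assumption: an empty cell marks a missing value
            values[key] = [float(c) if c.strip() else None for c in cells]
    if len(set(samples)) != len(samples):
        raise ValueError("sample names must be unique")
    return samples, classes, values


example = (
    "#NAME\tS1\tS2\tS3\tS4\n"
    "#CLASS:CANCER\tcase\tcase\tcontrol\tcontrol\n"
    "#CLASS:SEX\tF\tM\tF\tM\n"
    "Gene1\t-3.06\t-2.25\t0.4\t1.08\n"
    "Gene3\t1.61\t\t0.14\t0.11\n"
)
samples, classes, values = parse_expression_table(example)
```

Running a check like this before upload can catch duplicated sample names or class rows whose length does not match the number of samples.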
Keep the ID Type as "Not Specified". You can still perform statistical analysis (differential expression, meta-analysis, volcano plot, heatmap, etc.);
Use the microarray annotation file to annotate probes to one of the common gene IDs that are supported (Entrez, RefSeq, Ensembl, etc.).
It is possible to add support for other model organisms/platforms based on user requests. Feel free to send us your suggestions. Note, this could
take a while depending on the available time.
NetworkAnalyst supports four different types of files (.sif, .txt (edge list), .graphml and .json).
Please click on the following links to see example files supported:
Registering on NetworkAnalyst allows you to save up to 10 projects that will be stored in the system for 10 months.
You will be able to reload the work state of previous projects to resume previous analysis.
Microarray data provides probe-level expression measurements and RNA-seq data provides exon-level
or transcript-level (i.e. different isoforms of the same gene) expression measurements.
However, current functional annotations are mainly assigned at the gene or protein level. Therefore, when
multiple probes or transcripts are mapped to the same gene, they need to be summarized into a
single value for that gene. At the Gene Annotation step, users can choose to use
the averages or medians of multiple probe intensities (microarray), or
sums of counts from multiple transcripts (RNA-seq) to perform gene-level summarization.
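As a rough illustration of gene-level summarization, the sketch below collapses several probe or transcript rows into one row per gene. The function `summarize_to_genes` and the probe-to-gene mapping are hypothetical and are not part of NetworkAnalyst.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_to_genes(rows, probe_to_gene, how="mean"):
    """Collapse probe/transcript rows into one row per gene.

    rows: {probe_id: [value per sample]}
    how:  "mean" or "median" (microarray intensities), "sum" (RNA-seq counts)
    """
    reducers = {"mean": mean, "median": median, "sum": sum}
    reduce_column = reducers[how]
    grouped = defaultdict(list)
    for probe, profile in rows.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # unmapped probes are dropped
            grouped[gene].append(profile)
    # zip(*profiles) walks the samples column by column
    return {
        gene: [reduce_column(column) for column in zip(*profiles)]
        for gene, profiles in grouped.items()
    }


rows = {"p1": [1.0, 2.0], "p2": [3.0, 4.0], "p3": [5.0, 6.0]}
mapping = {"p1": "GeneA", "p2": "GeneA", "p3": "GeneB"}
by_mean = summarize_to_genes(rows, mapping, how="mean")
by_sum = summarize_to_genes(rows, mapping, how="sum")
```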
Low variance filter: removes genes whose expression values barely change across samples and thus have very
low variance. Genes are ranked by their variance from low to high, and you can exclude a certain percentile of genes
with the lowest variance by adjusting the "Variance filter" slider. The above referenced study has suggested that up to
50% of genes can be removed based on their variance with improved results.
Low abundance filter: removes genes with very low abundance, which are not measured reliably and may not be biologically important.
You can exclude genes below a certain threshold by adjusting the "Low abundance" slider. The above referenced study
has suggested that 10% of genes can be removed based on their abundance with improved results.
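The two filters can be sketched as follows. This is an illustration only: the function name, the default percentages, and the use of mean expression as the abundance measure are assumptions, not the NetworkAnalyst implementation.

```python
from statistics import mean, pvariance


def filter_genes(values, variance_pct=15, abundance_pct=5):
    """Drop the lowest-variance and lowest-abundance genes.

    values: {gene: [expression per sample]}
    The percentages are hypothetical defaults, chosen only for illustration.
    """

    def keep_after_dropping_lowest(genes, score, pct):
        ranked = sorted(genes, key=score)          # low to high
        n_drop = int(len(ranked) * pct / 100)
        return set(ranked[n_drop:])

    genes = list(values)
    keep_var = keep_after_dropping_lowest(
        genes, lambda g: pvariance(values[g]), variance_pct
    )
    keep_abd = keep_after_dropping_lowest(
        genes, lambda g: mean(values[g]), abundance_pct
    )
    return {g: values[g] for g in genes if g in keep_var and g in keep_abd}


values = {
    "flat": [5, 5, 5, 5],      # no variance
    "low": [0, 1, 0, 1],       # very low abundance
    "a": [2, 8, 3, 9],
    "b": [4, 10, 1, 12],
}
kept = filter_genes(values, variance_pct=25, abundance_pct=25)
```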
Normalizing the data accounts for systematic technical sources of variation so that
biologically-driven changes in gene expression can be better detected between samples.
A gene expression normalization method should be chosen unless the data has already been normalized,
in which case the user should select "None".
All of the normalization methods available on NetworkAnalyst are well-established and have been used in many
previous studies. They are based on slightly different assumptions about the underlying distributions, but should
produce relatively similar results. If you are concerned about significant differences between normalization methods,
you can try out more than one and visualize the results using the provided plots.
Tip: if you are not sure whether the data is already log transformed, you can easily check by visualizing the data (e.g. with a boxplot). For microarray
data, log transformed data values are usually less than 16. For RNA-seq data with 1 million reads,
log2(1,000,000) is less than 20. Therefore, if all data values are below 20, it is reasonable to
assume that the data has already been log transformed.
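The rule of thumb in this tip can be written as a one-line check. The function name is hypothetical, and the heuristic can be wrong for unusual data (for example, raw counts from a very shallow sequencing run).

```python
def looks_log_transformed(rows, ceiling=20):
    """Heuristic from the tip above: every observed value is below the ceiling."""
    return all(v < ceiling for row in rows for v in row if v is not None)


log_like = looks_log_transformed([[3.2, 15.9], [0.4, None]])
raw_like = looks_log_transformed([[120, 5000], [3, 48]])
```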
Potential outlier samples can be identified from PCA plots. The potential outlier will distinguish itself as the one
located far away from the major clusters formed by the remaining samples. To deal with outliers, the first thing is to check
if the sample was measured properly. In many cases, outliers are the result of operational errors during the analytical
process. If those values cannot be corrected, the sample should be removed from the input data and the analysis re-started.
Limma is a popular method for differential analysis that was first developed for microarray differential analysis. It addresses
the problem of low sample sizes typical to whole-transcriptome studies by using the whole expression profile to make more
stable estimates of gene expression variance. EdgeR and DESeq2 were both developed to analyze RNA-seq data. All three methods
are well-established and should give similar results. Please note:
EdgeR and DESeq2 are only designed for RNA-seq data and will be disabled for microarray data.
Due to the high computational resources required, DESeq2 will be disabled when your dataset contains over 50 samples.
In differential expression analysis, you should first determine whether any of the metadata encode blocking factors,
then decide on how to classify individual samples into groups, and finally decide which groups of samples should be
compared to each other using statistical tests. Let's assume that none of your metadata are blocking factors (more on that
later) and try to understand how selecting primary and secondary factors creates different groups of samples. Consider the
"Estrogen" example data, generated in a study that measured gene expression at multiple time points in breast cancer cells
in which the estrogen receptor (ER) was either present or absent. Here, the metadata are "ER" and "TIME". As the figure below
shows, selecting "ER" as the primary factor divides the data into two groups because "ER" has two different levels ('present'
and 'absent'). Selecting "TIME" as the secondary factor results in four groups because the two primary groups are split based on
the two time points. If there were three time points, each primary group would be split into three groups, resulting in six
groups overall.
The defined groups can now be compared to find genes that are differentially expressed between them (more details on this in later
sections). In some experimental designs,
we aren't interested in finding the genes that are differentially expressed between the groups defined by the secondary factor
because it is a blocking factor. Examples of blocking factors are subject IDs when multiple samples were taken from the same
subject (e.g. paired samples, multiple tissue types), or batches of samples that were measured at different times or in different locations.
If you indicate that your secondary factor is a blocking factor, NetworkAnalyst will conduct comparisons within the groups that it defines,
which typically improves the accuracy of the overall result.
This means you do not have enough samples to perform the analysis you specified, usually when combining two metadata
in an independent two-factor analysis (no blocking factors). In this case, the total number of groups will be the
product of the number of levels in each metadata factor (i.e. if the primary metadata contains 3 levels, and the secondary
metadata contains 4, the total number of groups will be 3 * 4 = 12). We recommend a minimum of 3 samples per group,
therefore at least 36 samples are required in order to perform a 3 x 4 two-factor analysis.
In this case, you should focus on a single primary metadata and leave the secondary metadata as
"Not available", and perform differential analysis with regard to individual metadata. You can then
choose the other metadata as the primary metadata and perform the analysis again. If there are no or very few
significant genes identified, it is most likely that incorporating the secondary metadata into the analysis will not
affect the result.
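The group arithmetic above can be checked before running a two-factor analysis. This is a small sketch with hypothetical function names; the minimum of 3 samples per group is the recommendation stated above.

```python
from collections import Counter
from math import prod


def groups_and_minimum_samples(levels_per_factor, min_per_group=3):
    """Number of groups is the product of the factor levels."""
    n_groups = prod(levels_per_factor)
    return n_groups, n_groups * min_per_group


def undersized_groups(primary, secondary, min_per_group=3):
    """Return observed (primary, secondary) groups with too few samples.

    Combinations with no samples at all never appear in the counts,
    so they are not reported here.
    """
    counts = Counter(zip(primary, secondary))
    return {group: n for group, n in counts.items() if n < min_per_group}


n_groups, n_required = groups_and_minimum_samples([3, 4])
too_small = undersized_groups(["a", "a", "a", "b"], ["x", "x", "x", "x"])
```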
A pair-wise comparison tests for genes that are differentially expressed between any pair of groups. For example,
take three groups A, B, and C. The "all pairwise" comparison will contrast A-B, A-C, and B-C. A time-series comparison
will only contrast consecutive pairs of groups, so in our example only A-B and B-C. Time-series are commonly used
when gene expression was measured at multiple time points, or after treatments with varying concentrations/durations.
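The difference between the two comparison types is easy to see by listing the contrasts. The helper names below are hypothetical; the groups are assumed to be given in their natural (for example, chronological) order.

```python
from itertools import combinations


def all_pairwise(groups):
    """Every pair of groups."""
    return list(combinations(groups, 2))


def time_series(groups):
    """Only consecutive pairs of groups."""
    return list(zip(groups, groups[1:]))


pairwise_contrasts = all_pairwise(["A", "B", "C"])
series_contrasts = time_series(["A", "B", "C"])
```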
A nested comparison allows you to determine which genes respond differently to a treatment condition,
respective to some other metadata. For example, consider the experimental design described in the section on multiple
metadata where cells with and without an ER were measured at 10hrs and at 48 hrs. To find the genes that respond
differently over time in the ER vs. noER cells, you would perform a nested comparison. First, compare ER10-ER48
to find the genes that are differentially expressed in cells with an ER (ERgenes). Next, compare noER10-noER48 to find
the genes differentially expressed in cells with no ER (noERgenes). Finally, to find the genes that respond differently
over time in ER vs. noER cells, compare ERgenes-noERgenes.
Selecting "Interaction only" will return significant results from only the ERgenes-noERgenes contrast. Otherwise
the full model is returned, which is the combination of significant genes from the ER10-ER48, noER10-noER48, and
the ERgenes-noERgenes contrasts.
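For a single gene, the three contrasts of the nested comparison reduce to simple differences of group means. The sketch below only illustrates that arithmetic with made-up numbers; it does not perform any statistical testing, and the function name is hypothetical.

```python
from statistics import mean


def nested_contrasts(er10, er48, noer10, noer48):
    """Return (ER10-ER48, noER10-noER48, interaction) for one gene.

    Each argument is a list of (log-scale) expression values for that group.
    """
    er_change = mean(er10) - mean(er48)
    noer_change = mean(noer10) - mean(noer48)
    # The interaction is the difference between the two time effects
    return er_change, noer_change, er_change - noer_change


er_change, noer_change, interaction = nested_contrasts(
    er10=[1, 1], er48=[3, 3], noer10=[1, 1], noer48=[1.5, 1.5]
)
```

In this made-up example the gene changes over time in both cell types, but the change is larger with an ER, so the interaction term is non-zero.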
PCA (principal component analysis) and t-SNE (t-distributed stochastic neighbor embedding) are both popular dimension
reduction techniques. In PCA, each principal component is the linear combination of the input variables (genes) that explains the
greatest amount of variance in the data, after accounting for previously computed principal components.
Unlike PCA, t-SNE is a stochastic method that estimates non-linear relationships between samples.
This means that each run of t-SNE will generate slightly different results.
The interactive PCA visualization summarizes all the data into the first three principal components (PCs). Each data
point in the Scores Plot represents a sample. Samples that are close together are more similar to each other. The colors
of these data points are based on the factor labels. Users can change the colors according to either of the two factor labels.
Each data point in the Loadings Plot represents a feature. When scores and loadings plots are viewed from the identical
perspective, the direction of separation on the scores plot can be explained by the corresponding features on the same
directions - i.e. features on the two ends of the direction contribute more to the pattern of separation.
NetworkAnalyst supports enrichment analysis with gene sets from the Gene Ontology, PANTHER, KEGG, Reactome, and MSigDB
databases. Note - not all gene set libraries are available for all species.
The GO:BP, GO:MF, and GO:CC gene sets include the complete set of Gene Ontology terms (> 45 000) for the
biological process, molecular function, and cellular component categories. The PANTHER:BP, PANTHER:MF, and PANTHER:CC
are reduced sets of GO terms ("GO slims") that have been manually chosen based on
the PANTHER protein classification system. Briefly, the PANTHER project has created > 15 000 phylogenetic trees that encode
the evolutionary relationships within protein families. Subsets of GO terms were chosen that best reflect the function
gain or loss along the branches of the PANTHER trees for each of the BP, MF, and CC categories. In general, GO slims can simplify
the interpretation of enrichment analysis results because they reduce the number of highly similar GO terms.
The KEGG and Reactome gene sets are networks of molecular interactions that represent biological pathways and processes.
Reactome pathways are created through a process similar to scientific peer review, where different experts create
and review the pathway organization, and all interactions contain references to the primary literature. KEGG pathways are
also based on molecular interactions in the primary literature, but are accompanied by an extensive ortholog mapping
that allows KEGG pathways to be rapidly extended to additional species based on genome sequence homology.
The Motif gene sets are based on shared upstream regulatory motifs (short nucleotide or amino acid pattern) that can function
as potential transcription factor binding sites (source: MSigDB, set C3:TFT).
ORA is a statistical technique to identify gene sets or pathways that have a significant overlap with the selected genes
of interest. In NetworkAnalyst, Hypergeometric tests are used to compute the p-values. The gene sets are described
in the above FAQ on gene set libraries.
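A hypergeometric ORA p-value can be computed from four counts: the size of the gene universe, the size of the gene set, the number of selected genes, and their overlap. The sketch below uses only the Python standard library; the function names and the choice of universe are assumptions, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the gene set."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def ora_pvalue(selected, gene_set, universe):
    """Over-representation p-value for one gene set."""
    universe = set(universe)
    selected = set(selected) & universe
    gene_set = set(gene_set) & universe
    overlap = len(selected & gene_set)
    return hypergeom_upper_tail(overlap, len(gene_set), len(selected), len(universe))


universe = [f"g{i}" for i in range(10)]
p = ora_pvalue(selected=["g0", "g1", "g2"], gene_set=["g0", "g1", "g2"], universe=universe)
```

Here all three selected genes fall in a three-gene set out of ten genes, which has probability 1/120 under random selection.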
GSEA is a statistical method that determines whether a predefined gene set (GO, KEGG, etc) demonstrates statistically
significant difference between two groups. Taking as input a list of ranked genes and a gene set, it looks at whether the
genes from the gene set are randomly distributed in the ranked list or significantly enriched in the top and bottom
extremes of the ranked list. In the following schema, the gene set A is significantly enriched, while gene set B
represents a case where the genes are more randomly distributed. In contrast to ORA, GSEA can
detect weakly coordinated changes of gene expression in sets of functionally related genes because it is not limited
by the issue of losing information when setting a threshold.
Ranking the list of genes to be analyzed by GSEA is a critical step that can greatly influence the result. Many ranking
metrics are present in the literature and there is no consensus on which is best to use. NetworkAnalyst offers four different
methods. Rank based on DE method used and Fold change are the most intuitive, ranking genes according to their
p-values and fold changes with respect to the primary metadata factor. Moderated Welch's t-test (MWT) and signal-to-noise
ratio (S2N) are two other metrics that have been found to perform well with a low computational load. MWT is a version of
the t-test that allows for unequal variance between groups, and S2N is the difference between the mean expression of the two
phenotype groups divided by the sum of their standard deviations.
Please note that the above gene ranking methods are not applicable to meta-analysis. Instead, the genes are ranked based on the summary statistic obtained
from the previous meta-analysis (combining p-values, combining effect sizes, or direct merging). Results obtained from vote counting cannot be used to perform GSEA.
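For a single-dataset analysis, the S2N metric and an ordinary Welch's t statistic can be sketched as below. Note that this is the plain Welch statistic, not the moderated version (MWT), and that the function names and the toy data are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance


def signal_to_noise(a, b):
    """Difference of group means divided by the sum of group standard deviations."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def welch_t(a, b):
    """Ordinary (unmoderated) Welch's t statistic for unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expression, metric):
    """expression: {gene: (values in group A, values in group B)}; best score first."""
    scores = {gene: metric(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


s2n = signal_to_noise([1, 2, 3], [4, 5, 6])
t = welch_t([1, 2, 3], [4, 5, 6])
ranked = rank_genes(
    {"down": ([1, 2, 3], [4, 5, 6]), "up": ([4, 5, 6], [1, 2, 3])},
    signal_to_noise,
)
```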
A recent publication compared different ranking metrics using 28 benchmark datasets and scored each one based on their sensitivity
and false positive rate, summarized in the table below for the four metrics supported by
NetworkAnalyst. While all metrics are widely accepted, you should choose based on how important the sensitivity/false positive
rate is to your analysis. For more details on how the sensitivity and false positive rate were determined,
refer to the original
publication.
The enrichment score is the main output of GSEA. It reflects the degree to which the genes in the gene set are over-represented
at the extremes of the ranked list (most up- or down-regulated). It is the maximum deviation from zero encountered during
the random walk that goes through the ranked list.
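The random walk can be sketched as a running sum that steps up at every gene in the gene set and down at every other gene. The version below is the simple unweighted form, in which every hit counts equally; standard GSEA usually weights hits by the ranking metric, so this is an illustration of the idea rather than the exact statistic. The function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (maximum deviation from zero)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = list("abcdefgh")              # most up-regulated first
es_top = enrichment_score(ranked, {"a", "b"})
es_bottom = enrichment_score(ranked, {"g", "h"})
es_spread = enrichment_score(ranked, {"a", "h"})
```

A set concentrated at the top of the list gets a score near +1, a set at the bottom gets a score near -1, and a set spread across the list gets a score of smaller magnitude.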
GSEA requires an entire profile of gene expression values, and so it is only available after data processing and differential
analysis of uploaded gene expression table(s) in the GSEA Enrichment Network and GSEA Heatmap Clustering tools.
ORA is more flexible since it only requires a list of genes of interest. In addition to the stand-alone ORA Enrichment
Network and ORA Heatmap Clustering tools, ORA can be performed on subsets of genes identified in volcano plots,
network modules, sections of Venn diagrams and chord diagrams, and the focus view of any heatmap. The GSEA and ORA
enrichment networks are described in more detail in the following FAQ section.
Yes, after you have performed functional enrichment analysis, the significant gene sets will be displayed in
a table. By double clicking on a gene set name, all members will be displayed on the focus view
(heatmap analysis), as highlighted node(s) within the current network (network analysis/enrichment network), as highlighted
points in the volcano plot, or as highlighted chords in the chord diagram.
Enrichment networks are a good way of visualizing the output from enrichment analysis when there are many significant results.
Enriched gene sets are displayed in network form, where gene sets with overlapping genes are connected by edges. This groups
functionally similar gene sets together, which can be easier to interpret than a list of enriched gene sets in tabular form.
Enrichment networks are particularly useful for nested gene sets, such as in the Gene Ontology.
Gene sets are represented by the nodes that are automatically generated in the default view. The nodes are coloured according
to their enrichment score (GSEA) or p-value (ORA) from the results table. The size of the node corresponds to the number
of genes from that gene set that are on the analyzed gene list. The smaller nodes correspond to individual genes, and they
are coloured according to their fold change. More details on how to manipulate the appearance of the network can be found
in the "Network Visualization" section.
Gene set nodes are considered "meta-nodes" because double-clicking them reveals smaller nodes that correspond to the
individual genes belonging to that gene set from the analyzed gene list. There will be an edge between the individual gene
and any enriched gene set that they are a part of, so you can easily see which genes are shared between sets. The
hierarchical organization of meta-nodes allows users to customize the level of detail represented by an enrichment network.
A bipartite network displays nodes for all gene sets and individual genes. The same network could be generated from the default
enrichment network view by double-clicking each gene set node. Bipartite networks are appropriate when there are a smaller
number of enriched gene sets.
There are two options for determining whether an edge is drawn between two gene sets. The overlap coefficient (OC)
is calculated as the overlap of two gene sets divided by the size of the smaller set. The Jaccard index (JI)
is the overlap of two gene sets divided by the size of their union. The JI is more applicable when gene sets have a
relatively similar size, such as KEGG pathways or PANTHER GO slims. The OC is better at detecting parent-child relationships
within hierarchically organized gene sets, such as the full Gene Ontology.
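Both measures are short set operations. In the sketch below, the function names and the small parent/child example are illustrative assumptions; the cut-off above which an edge is drawn is not shown.

```python
def overlap_coefficient(a, b):
    """Overlap divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))


def jaccard_index(a, b):
    """Overlap divided by the size of the union."""
    return len(a & b) / len(a | b)


parent = {"g1", "g2", "g3", "g4"}   # a broad term
child = {"g1", "g2"}                # a nested, more specific term
oc = overlap_coefficient(parent, child)
ji = jaccard_index(parent, child)
```

The child set is fully contained in the parent, so the OC is 1.0 while the JI is only 0.5, which shows why the OC is better suited to parent-child relationships.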
To increase the size of a gene set node, double-click on the name of the gene set in the results table on the right hand
side. Each time the gene set name is double-clicked, the size will increase. Since the appearance of labels depends on the
size of the node, this is a way to add labels to specific nodes in the network.
If there are a subset of enriched gene sets that are of particular interest, you can visualize them separately from the
rest of the network by extracting them. Select the gene sets from the "Result Table" panel and click the "Extract" button
at the top left corner. If you want to see the detailed connections between a few gene sets (shared individual genes), this
can be an effective way of simplifying the network so that these details are easier to visualize and interpret.
The goal of network construction is to generate a clear visualization of the biological context of the genes of interest.
This includes capturing relevant biological pathways and molecules (TFs, drugs, chemicals) that interact with the gene list,
as well as important connections between them. For a small gene list (< 100), this is accomplished by adding a
substantial number of interacting nodes from the underlying databases. For a large gene list (> 1000), there are likely enough
biological interactions within the uploaded list and so network construction is focused on pruning nodes and edges
to reduce complexity to more effectively interpret the most critical connections.
The networks are generated by first mapping the significant genes/proteins to the selected
underlying database. A search algorithm is then performed to identify proteins that directly interact with
the uploaded genes/proteins ("seeds"). The seeds and their interaction partners are returned to
build the subnetworks.
This approach will typically return one giant subnetwork ("continent") with multiple smaller ones ("islands").
Most subsequent analysis is performed on the continent. Note that networks with fewer than 3 nodes will be excluded.
To perform integration, select more than one database on the "Network Selection" page. There are two integration options:
Union: create a multi-modal network that is the union of the first-order networks of the selected databases.
Intersection: identify the shared portion of multiple first-order networks. A useful use case is integrating tissue-specific co-expression with generic PPI.
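The two options correspond to set union and set intersection over edges. The sketch below treats edges as undirected and uses placeholder node names; both choices are assumptions made for illustration.

```python
def as_edge_set(edges):
    """Store each undirected edge as a frozenset so (A, B) equals (B, A)."""
    return {frozenset(edge) for edge in edges}


# Placeholder first-order networks from two hypothetical databases
ppi = [("G1", "G2"), ("G1", "G3"), ("G4", "G5")]
coexpression = [("G2", "G1"), ("G4", "G6")]

union = as_edge_set(ppi) | as_edge_set(coexpression)
intersection = as_edge_set(ppi) & as_edge_set(coexpression)
```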
The visualization is actually limited by the performance of users' computers and screen resolutions.
Too many nodes will make the network too dense to visualize and the computer slow to respond.
We recommend limiting the total number of nodes to between 200 and 2000 for the best experience.
For very large networks, please make sure you have a decent computer equipped with a modern browser
(we recommend the latest Google Chrome).
When there are too few seed genes, the resulting network will be too simplified to identify themes in the biological context
of your genes of interest. There are two main solutions here:
Expand your search of the underlying database by using a Second-order Network;
Increase the input genes by using smaller fold change and/or larger p value cutoffs.
When there are a large number of significant genes or seed proteins, the resulting networks will be
too large and complex for effective visualization and interpretation. There are four possible solutions here:
Reduce the networks to direct connections between seed proteins using a Zero-order Network;
Trim the networks to keep only seeds and their connecting nodes using Minimum Network
or Steiner Forest Network;
Filter the networks using the Degree Filter or Betweenness Filter;
Reduce the input genes by using larger fold change and/or smaller p value cutoffs.
The above approaches aim to reduce the network size and complexity, and to retain the most relevant
information for downstream functional analysis.
The "order" refers to the type of relationships that will be used to extract nodes from the underlying database. The
default is a first-order network, which returns all seed genes and all nodes directly connected to them
in the database. A second-order network increases the size because it returns all seed genes and all nodes that are within
two connections in the database. Drawing a comparison to social networks, first-order networks return seed genes and their
"friends", while second-order networks return the seed genes, their "friends", and the "friends of their friends".
A zero-order network can reduce the number of seed genes because it retains only genes that are connected to each other
within the underlying database. This can help simplify your gene list to highlight the biological theme of the database of
interest (i.e. protein-protein interactions, TF-gene interactions etc).
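The three network orders can be sketched on a plain edge list. This is one possible reading of the definitions above, with a hypothetical function name and placeholder node names, not the NetworkAnalyst search algorithm itself.

```python
def subnetwork(edges, seeds, order=1):
    """Extract a zero-, first- or second-order network from an edge list.

    order 0: only edges between two seeds
    order 1: edges touching a seed (seeds and their "friends")
    order 2: edges touching a seed or a friend ("friends of friends")
    """
    seeds = set(seeds)
    if order == 0:
        return {(u, v) for u, v in edges if u in seeds and v in seeds}
    core = set(seeds)
    for _ in range(order - 1):
        # grow the core by one step of neighbours
        core |= {node for u, v in edges if u in core or v in core for node in (u, v)}
    return {(u, v) for u, v in edges if u in core or v in core}


edges = [("S1", "S2"), ("S1", "X"), ("X", "Y"), ("Y", "Z")]
seeds = {"S1", "S2"}
zero = subnetwork(edges, seeds, order=0)
first = subnetwork(edges, seeds, order=1)
second = subnetwork(edges, seeds, order=2)
```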
Both the Minimum Network and Steiner Forest Network tools aim to construct a minimally connected network that contains
all of the seed genes. This means that the only added nodes are ones that connect previously disconnected networks of seed genes.
The difference between the minimum network and the Steiner forest network is the way in which the approximate solution is
computed. For the minimum network, NetworkAnalyst implements an approximate approach based on shortest paths: we compute
pair-wise shortest paths between all seed nodes, and remove the nodes that are not on the shortest paths. For the Steiner
forest network, NetworkAnalyst implements a fast heuristic prize-collecting Steiner forest algorithm.
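The shortest-path idea behind the minimum network can be sketched with a breadth-first search. This simplified version keeps one shortest path per pair of seeds on an unweighted graph, which is an assumption; the actual implementation may differ, and the Steiner forest heuristic is not shown.

```python
from collections import defaultdict, deque
from itertools import combinations


def shortest_path(adj, src, dst):
    """Breadth-first search; returns one shortest path or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in sorted(adj[node]):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


def minimum_network(edges, seeds):
    """Keep seeds plus the nodes on shortest paths between pairs of seeds."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = set(seeds)
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(adj, a, b)
        if path:
            keep.update(path)
    return {(u, v) for u, v in edges if u in keep and v in keep}


edges = [("S1", "X"), ("X", "S2"), ("S1", "Y"), ("Y", "Z"), ("Z", "S2"), ("X", "W")]
trimmed = minimum_network(edges, seeds={"S1", "S2"})
```

In this toy graph the seeds are joined through X, so the longer route through Y and Z and the dangling node W are removed.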
The degree and betweenness filters allow you to reduce the size of the network based on its connectivity alone (see later
FAQ sections for explanations of "degree" and "betweenness"). The key takeaway is that the degree filter tends to retain
hub genes (genes with many connections to other genes), and the betweenness filter tends to retain genes that connect
dense clusters of genes.
Yes, there are two main ways to exclude specific nodes from the network. A list of nodes can be uploaded using the Batch
Exclusion tool and the network will be re-computed without these nodes. Alternatively, you can delete nodes manually
using the Delete button at the top of the Node Table. See the Network Visualization FAQ section for
more details on deleting nodes manually.
Protein-protein interactions (PPI) include many types of relationship between proteins, including physical associations as
parts of molecular complexes, information-transfer associations in signaling pathways, and computationally predicted
functional associations based on shared membership in densely connected network modules. A PPI network summarizes these
types of interactions in graphical form.
The IMEx Interactome PPI data come from InnateDB,
a database aimed at facilitating systems-level analysis of the mammalian innate immune system by annotating the
relationships between biological pathways and molecules related to the innate immune system. All interactions are
manually curated from the literature according to the International Molecular Exchange Consortium (IMEx) standards.
The STRING Interactome integrates PPI data from many sources, using both direct (physical associations
from experimental data) and indirect (functional associations based on computational predictions) evidence, for over 2000
species. The key distinguishing factor of the STRING project is that it assigns a confidence score to each interaction,
with interactions supported by more evidence scoring higher. The "Confidence score cut-off" can be adjusted to prevent
PPIs below the specified value from being added to your network. Checking the "Require experimental evidence" box will
exclude PPIs that are supported by computational predictions only.
The data were downloaded from the STRING database (version 10).
The Rolland Interactome PPI data are a collection of human binary PPIs from the literature in 7 public databases.
Binary PPIs refer to direct physical interactions between proteins. To produce the Rolland Interactome, 33 000 binary
human PPIs were collected from the literature. Of these, the 11 045 with multiple supporting studies were retained.
Using tissue-specific PPI networks gives the option of focusing on tissue-specific processes and phenotypes. The tissue-specific PPI
data is from DifferentialNet and was produced by integrating experimental binary PPI data with RNA-sequencing profiles
from different tissues, collected by the Genotype-Tissue Expression consortium. Each PPI was given a score for each tissue
that indicates whether the corresponding genes were similarly expressed across many tissues, or significantly dysregulated
only in the tissue of interest.
The filter can be adjusted to change how unique the PPI should be to the tissue of interest. A lower score will filter out more
PPIs, so the resulting PPIs will be highly unique to the selected tissue. See the latest
DifferentialNet publication
for more details on the scoring metric.
The Gene-miRNA Interactions rely on the TarBase database, which is a collection of experimentally supported
miRNA targets. This means that miRNAs are returned that interact with the uploaded seed genes. The
TF-gene Interactions have three different database options (see next FAQ for more details), all of which return
genes that function as transcription factors for the uploaded genes of interest. Finally, the
TF-miRNA Coregulatory Network draws from RegNetwork, which contains TF-TF, TF-gene, TF-miRNA,
miRNA-TF, and miRNA-gene binding interactions for human and mouse.
For all three of these network types, the returned nodes (miRNAs or TFs) only have connections to the uploaded seed
genes, not to each other. This gives these networks a characteristic appearance where the seed genes have connections
to many regulatory elements, while the regulatory molecules are only connected to a few seed genes. miRNA nodes are
represented as squares instead of circles.
The ENCODE TF-gene interactions are inferred from ENCODE ChIP-seq data using the BETA algorithm. BETA integrates
factor binding and differential expression analysis to predict whether a TF has an activating or a repressing effect,
to infer the gene targets, and to identify the binding motif. JASPAR uses a collection of position frequency
matrices to predict transcription factor binding sites on the DNA. ChEA collected ChIP-X (includes
ChIP-chip, ChIP-seq, ChIP-PET, and DamID) data from the literature to describe the binding of TFs to target genes in
mammalian species.
The Protein-drug Interactions come from DrugBank, a database that
combines bioinformatics and cheminformatics data on drugs and drug targets. The Protein-chemical Interactions come
from the Comparative Toxicogenomics Database, which contains curated
interactions between chemicals and genes from the literature. The Gene-disease Associations come from
DisGeNET, which integrates data from expert curated repositories,
GWAS studies, multiple species, and the literature. As in the gene regulatory networks, the added nodes are connected
to seed genes but not to each other, and are represented as squares instead of as circles.
Gene co-expression networks are constructed by measuring the similarity (e.g. correlation) between pairs of gene expression
profiles across many conditions. Two genes are connected to each other in the network if they tend to respond similarly (consistently
up or down regulated together) to perturbations. Some PPI databases include gene co-expression data, along with other types of evidence,
to define interactions. Since co-expression networks can be computed from expression data alone, it is easier to generate separate ones
for many different tissues and even cell types compared to PPI networks.
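As a rough illustration of the idea, the sketch below connects two genes when the Pearson correlation of their expression profiles exceeds a cut-off. The gene names, the toy profiles, and the 0.9 threshold are assumptions made for this example; the source does not state which similarity measure or cut-off is actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with correlation >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression values across five conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises together with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
}
edges = coexpression_edges(profiles)
print(edges)
```

Only geneA and geneB are linked here, because they are the only pair that moves up and down together across the conditions.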
A basic assumption is that changes in nodes that occupy key positions within a network will have a greater impact on
the overall network structure than changes in relatively isolated positions. In graph theory,
measures of centrality are used to identify the most important nodes. NetworkAnalyst provides two well-established
node centrality measures - degree and betweenness. The degree of a node is the number of connections
it has to other nodes. Nodes with a high degree act as hubs within the network. The betweenness of a node is the
number of paths that pass through it when considering the pairwise shortest paths between all nodes in the network.
A node that occurs between two dense clusters will have a high betweenness, even if it has a low degree. Note, you
can sort the node table based on either degree or betweenness values by double clicking the corresponding
column header.
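The two measures can be sketched on a toy graph as follows. This is an illustrative implementation using Brandes' algorithm for unweighted, undirected graphs, not NetworkAnalyst's own code; the node names and the example network are made up.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Number of connections of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness: shortest paths between other node pairs passing through each node."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Two triangles (a-b-c and e-f-g) joined through the bridge node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
adj = build_adjacency(edges)
deg = degree(adj)
bc = betweenness(adj)
print(deg["d"], bc["d"])  # d has only two connections but the highest betweenness
```

In this example node d has a degree of 2, yet every shortest path between the two triangles runs through it, which gives it the highest betweenness in the network.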
Modules are tightly clustered subnetworks with more internal connections than expected randomly
in the whole network. They are considered to be relatively independent components
in a graph. Members within a module are likely to work collectively to perform a biological function.
The biological functions of a module can be explored using enrichment analysis.
NetworkAnalyst currently offers three different approaches for module detection - the Walktrap, InfoMap, and Label Propagation
algorithms. The general idea behind the Walktrap algorithm is that if you perform short random walks on a
graph, the walks are likely to stay within a group of nodes that are highly connected to each other,
because only a few edges lead outside of the group. The Walktrap algorithm runs many short random walks and
uses the results to detect small modules, and then merges these smaller modules in a bottom-up manner. The InfoMap
Algorithm is also based on random walks, which it uses to minimize the hierarchical map equation for different partitions
of the network into modules. The Label Propagation Algorithm works by randomly assigning a unique label to every node.
On each iteration, each node's label is updated to the label held by the largest number of its neighbours. The algorithm converges when
each node has the same label as the majority of its neighbours.
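The Label Propagation idea is simple enough to sketch in a few lines. The version below is a minimal illustration, not the implementation used by NetworkAnalyst: it uses each node's own name as its initial unique label, visits nodes in random order, breaks ties at random, and ignores edge weights. The `seed` and `max_iter` parameters are assumptions added to make the example reproducible and guaranteed to terminate.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_iter=100):
    """Assign a module label to every node of an undirected graph."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}  # start with a unique label per node
    nodes = sorted(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)  # visit nodes in a random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[w] for w in adj[v])
            top = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == top)
            if labels[v] in candidates:
                continue  # already carries a most frequent neighbour label
            labels[v] = rng.choice(candidates)
            changed = True
        if not changed:
            break  # converged: no label changed during a full pass
    return labels


# Two separate triangles; labels can only spread along edges.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
labels = label_propagation(adj)
print(labels)
```

Because labels only travel along edges, nodes in the two disconnected triangles can never end up in the same module.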
NetworkAnalyst also integrates the gene expression values as edge weights during module searches. Weights are
calculated as the square of the mean absolute log fold changes of the two adjacent nodes. Larger weights mean
closer connections during random walks. To avoid zero-weight errors at run time, non-seed proteins are
given a pseudo-expression value equal to 1/10 of the minimal absolute log fold change
of the seed proteins. By giving larger weights to seed proteins, the program encourages detecting modules
containing more seed proteins (shorter distances).
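The weighting rule described above can be written out as a short sketch. The function name, the gene symbols, and the log fold change values are hypothetical, and the code assumes that every seed protein has a non-zero log fold change.

```python
def edge_weights(edges, seed_logfc):
    """Weight each edge by the squared mean of the absolute log fold changes of its two nodes."""
    # Non-seed proteins receive 1/10 of the smallest absolute seed log fold change.
    pseudo = min(abs(v) for v in seed_logfc.values()) / 10

    def expression(node):
        return abs(seed_logfc[node]) if node in seed_logfc else pseudo

    return {(u, v): ((expression(u) + expression(v)) / 2) ** 2 for u, v in edges}


# Hypothetical seed proteins with log fold changes; EGFR and AKT1 are non-seed nodes.
seed_logfc = {"TP53": 2.0, "MYC": -1.0}
edges = [("TP53", "MYC"), ("TP53", "EGFR"), ("EGFR", "AKT1")]
weights = edge_weights(edges, seed_logfc)
for edge, weight in weights.items():
    print(edge, weight)
```

The edge between the two seed proteins gets the largest weight, the seed to non-seed edge an intermediate one, and the edge between two non-seed proteins a small but non-zero weight.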
The p-value of a module is based solely on network connectivity, and gives some indication of how
significant the connections within a defined module are. Let's call the edges within a module "internal"
and the edges connecting the nodes of a module with the rest of the graph "external". The null hypothesis
of the test is that there is no difference between the number of "internal" and "external" connections to
a given node in the module. The p-value of a given module is calculated using a Wilcoxon rank-sum test of
the "internal" and "external" degrees. Users should also consider whether the modules are 'active' under the
experimental conditions, by taking into account the number of seed proteins, their average fold changes,
as well as the enriched functions displayed in the Module Explorer table.
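To make the test concrete, the sketch below counts the internal and external degree of each module member and compares the two samples with a rank-sum test. Several details are assumptions made for illustration: the test is two-sided, it uses the normal approximation without tie correction (which is crude for a module this small), and the toy network is invented. The source does not specify these choices.

```python
from math import erfc, sqrt


def internal_external_degrees(adj, module):
    """For each module member, count edges staying inside vs leaving the module."""
    members = sorted(module)
    inside = set(module)
    internal = [sum(1 for w in adj[v] if w in inside) for v in members]
    external = [sum(1 for w in adj[v] if w not in inside) for v in members]
    return internal, external


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted([(val, 0) for val in x] + [(val, 1) for val in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    return erfc(abs(z) / sqrt(2))


# Toy network: a, b, c, d form a fully connected module; only a links outside it.
adj = {
    "a": {"b", "c", "d", "x"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c"},
    "x": {"a", "y"}, "y": {"x"},
}
internal, external = internal_external_degrees(adj, {"a", "b", "c", "d"})
p_value = rank_sum_p(internal, external)
print(internal, external, p_value)
```

A small p-value indicates that the module members have clearly more connections inside the module than to the rest of the graph.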
Yes, you can test enriched gene sets or pathways for only your query genes.
To do so, first select the check-box in the top left of the Node Explorer toolbar. This will highlight all of your
seed genes. Next, go to the Function Explorer toolbar and change the query to "Highlighted nodes". Select the
gene set library of interest and click "Submit".
Yes. Users can perform enrichment tests on currently highlighted nodes in the network.
Module highlight (automatic): first perform module detection, then click on a module;
Module highlight (manual): set Scope to "including dependents", double click a node in
the network to highlight the node together with its direct neighbours, and repeat the process to select more nodes;
Node highlight (automatic): use Hub Highlighting or Data Highlighting to select nodes
based on degree or betweenness values;
Node highlight (manual): select nodes from the node table on the left or
by double clicking on a node (Single Mode).
After you have selected the nodes or modules, click the Perform Enrichment Analysis
button. The result table will be displayed in the panel below. Note, enrichment analyses are
performed on ALL currently highlighted nodes. To ensure only your current selections
are being used, first Reset the network, then perform highlighting/selections before performing
the enrichment analysis.
The enrichment analysis tests whether there is a significant overlap between the selected genes/proteins and the
user selected library of pre-defined gene sets/pathways (ORA). NetworkAnalyst's network viewer uses
hypergeometric tests to compute the enrichment p-values.
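For readers unfamiliar with the hypergeometric test, a minimal sketch follows. The p-value is the probability of seeing at least the observed overlap when the selected genes are drawn at random from a gene universe. The function name and the example numbers are made up, and how the universe is defined is an assumption of this example.

```python
from math import comb


def hypergeom_enrichment_p(universe, set_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn from `universe` genes,
    of which `set_size` belong to the gene set of interest."""
    total = comb(universe, selected)
    upper = min(set_size, selected)
    hits = sum(
        comb(set_size, k) * comb(universe - set_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total


# Hypothetical numbers: 20 genes in the universe, a pathway with 5 genes,
# 4 highlighted genes of which 3 fall in the pathway.
p_value = hypergeom_enrichment_p(universe=20, set_size=5, selected=4, overlap=3)
print(p_value)
```

In practice the p-values from many gene sets are usually adjusted for multiple testing before interpretation.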
In the default network generated by NetworkAnalyst, the size of each node is based on its degree value,
with larger nodes for larger degree values. The color of each node reflects its betweenness centrality value.
When the user switches to Expression View, the colors will be based on expression values (if available).
Yes, to view your query genes or proteins, use the color palette in the top-left corner
of the network viewer to set a highlight color. From the "Display Options" panel on the
top right, click "Highlight". Select "Upregulated nodes" or "Downregulated nodes",
then click the Submit button. You may also want to increase their node sizes by using the
Size function under Node Options. Nodes will be labeled automatically when
their size increases above a certain level.
Please use the Download option and choose "SVG Format" to save the current network view (tested using Chrome and Firefox; there is a known issue with Safari).
SVG is a vector-based graphic format, so you can then export it as a static image (e.g. PNG) at any resolution
using a suitable graphics tool, for example Adobe Illustrator or the free tool Inkscape.
Note, it is best to save the SVG with a white background, as the default background color in Inkscape is white.
If your SVG was saved with a black background, open it in Inkscape and set the background color to black (hex code: #222222) using the Document Properties menu.
Yes. To switch background color, click the pull-down menu next to Background on the toolbar at the top
of the screen. From the dropdown menu list, select either White, Black or Custom. Selecting Custom opens a
dialog in which you can choose the color you want.
You can change the color and size of a node. The shape cannot be changed in the current implementation. To change
the node color, choose the color using the Color Palette and then double-click the node you want to change.
The node color will be changed to your specification. You can also change the whole color spectrum of the network.
Click the Node dropdown menu located on the top toolbar and click on the Color option. A pop-up dialog
will appear in which you are free to choose among the selection of color spectrums. To change the node size, you
can keep double-clicking it to increase its size. You can also use the Node Size functions to increase or decrease
the node size.
Yes. First use the Scope option on the top menu bar and make sure that the option including dependents
is selected. Then drag a central node to a new position, and all nodes connected to this one will be moved as well.
If you also want to adjust the position of other nodes, switch the Scope to "Current node", and then drag these
nodes individually to a new position.
Nodes will be automatically labeled when their sizes reach a certain threshold. Therefore,
you can simply increase node size to label any node. To label a single node, right click the node of interest
and click on the "Add Label" option in the context menu. The size of the node will increase so that the label appears.
To label all highlighted nodes, use the "Node" tab in the Display Options panel on the top right, select
"Highlighted nodes" and "Increase ++", then keep clicking the "Submit" button to increase the size until labels show up.
To label all nodes in the network, perform the same steps as above, but choose "All nodes" instead of "Highlighted
nodes".
Yes, it is possible to hide them. Click on the Nodes dropdown list located on the top menu and select the label option.
Click on the display tab and select the "Hide" option.
Yes. You can delete nodes and their associated edges from the current network. First you need to select the nodes from
the Node Table in the left pane. Then click the Delete button at the top of the node table. A confirmation
dialog will appear asking if you really want to delete these nodes. Note, this action will trigger network re-arrangement,
especially if hub nodes are removed. In addition, other nodes that are no longer connected to the larger subnetwork after
node deletion will also be removed during re-arrangement.
There are two basic steps in the network highlighting - setting the highlight color and selecting the nodes to highlight.
Use the Color Palette to set the color for the next selection. You also need to choose the scope for node selection:
Current node: highlight only the selected node;
Including-dependents: highlight the selected node and its direct neighbours.
Now, double click on nodes to make your selections. Note, you can repeat the steps above with different colors and
scopes to create different effects.
Yes. To do this, first select or highlight a section of the network, then click the
Extract
icon on the left tool bar in the network view window. Note, the operation is computationally
expensive, so you will have to wait about 20 seconds for the extracted network to return.
The returned network will be named "moduleX" and is available in the "Network Explorer"
panel on the top-left of the page for future reference.
To view the current network in 3D, click on the 3D button in the toolbar at the top left corner of the
network viewer. To view the network in VR, please make sure that you are in 3D view and click on the VR button located
in the same toolbar. Make sure that you have a VR device connected to your computer.
Meta-analysis is a type of statistical technique used to integrate multiple independent datasets that have been collected
under similar experimental conditions, in order to obtain more robust biomarkers. By combining multiple data
sets, the approach can increase statistical power (more samples) and reduce potential bias.
A key concept in meta-analysis is that it is generally not advisable to directly combine different independent datasets (i.e.
merge them into a single large table) and analyze them as a single unit. This is due to potential batch effects associated
with each dataset, which can completely overwhelm the biological effects. This issue has been well-studied in microarray
experiments generated from different platforms.
Instead, meta-analysis is usually computed based on summary statistics (p values, effect sizes, etc.) to identify robust
biomarkers. The meta-analysis module in NetworkAnalyst was developed to support these approaches, at both the individual
gene and the gene set levels. Heatmaps and other visualization tools can be used to explore patterns across different studies.
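As an illustration of combining summary statistics rather than raw data, the sketch below applies Fisher's method to the p-values obtained for one gene in several studies. Fisher's method is used here only as a well-known example of p-value combination; the function name and the input values are made up, and all p-values are assumed to be greater than zero.

```python
from math import exp, log


def fisher_combined_p(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-squared distribution with
    2k degrees of freedom under the null hypothesis, where k is the number of studies.
    """
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # X / 2
    # Closed-form chi-squared survival function for an even number of degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Hypothetical p-values for the same gene in three independent studies.
combined = fisher_combined_p([0.04, 0.10, 0.03])
print(combined)
```

Here three individually modest p-values yield a smaller combined p-value, which reflects the gain in statistical power from integrating studies.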
Here are some basic rules for data collection for meta-analysis in NetworkAnalyst:
The data sets should have been collected under comparable experimental conditions, and/or the underlying experiments
share the same hypothesis or have the same mechanistic underpinnings;
Only two-group comparisons are supported at the moment (e.g. control vs. treatment);
These datasets must share the same type of IDs so that the majority of the features overlap;
It is best to keep all data on the same scale or range (i.e. either all raw or all normalized in the same way). It is
generally preferred to compare datasets on a log scale.
A common reason for the integrity check to fail is that the meta-data classes and comparisons are not consistent across the
different datasets. To check if this is the problem, click View under "Data Summary" for each dataset. Make sure
that the "Set order of comparison" is the same, and that the spelling and capitalization are consistent. If there are small
differences, these can be corrected in the "Annotation" section - click Annotate next to the dataset you want to
modify and edit the "Group labels".
To normalize the gene expression values across datasets, we recommend using the ComBat algorithm. On the quality
check page, after the datasets have been uploaded, you can first visualize the PCA clustering of samples from different datasets. If
obvious batch effects are observed, select the checkbox located below the summary table to perform ComBat.
The method uses an empirical Bayes approach for adjusting batch effects in microarray and RNA-seq expression data.
The algorithm can be summarized in three main steps:
Genes are standardized to have similar overall mean and variance;
Information is pooled across genes from a batch to estimate batch effects
(increased level of expression, high variability, etc.);
The estimated batch effects are used to normalize the data to make them more comparable to each other.
Q-Q plots are graphical tools to determine whether the assumption that the data came from a specific parametric
distribution is plausible. They are generated by calculating evenly spaced percentiles from the sample data, and
from the distribution of interest, and plotting the theoretical and actual percentiles against each other in a scatter plot.
If the points fall on the diagonal straight line, it is reasonable to assume that the data came from that type of
distribution.
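The construction of a Q-Q plot can be sketched as follows. The example compares a sample against a standard normal distribution purely as a stand-in for the distribution of interest; the function name, the plotting positions (i + 0.5) / n, and the sample values are assumptions made for illustration.

```python
from statistics import NormalDist


def qq_points(sample, dist=NormalDist()):
    """Return (theoretical quantile, sample quantile) pairs for a Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    # Evenly spaced probabilities, kept strictly between 0 and 1.
    probs = [(i + 0.5) / n for i in range(n)]
    theoretical = [dist.inv_cdf(p) for p in probs]
    return list(zip(theoretical, ordered))


# Hypothetical sample values.
sample = [2.1, -0.3, 0.8, -1.5, 0.2]
points = qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

Plotting these pairs as a scatter plot and comparing them with the diagonal gives the Q-Q plot described above.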
The fixed effects model (FEM) assumes a chi-squared distribution, and NetworkAnalyst supports a Q-Q plot to check the validity of this assumption.
The data will rarely fall perfectly on the straight line, even when they are randomly sampled from a known
distribution, so you should look for large deviations such as the Q-Q plot below. In this case, since there is a significant
deviation, the random effects model (REM) may be more appropriate since it does not assume any parametric distribution.
The number of differentially expressed genes can be reduced by making the significance threshold more stringent. This can
be done both when statistics are combined for the meta-analysis itself and earlier, at the "Data upload" step. At that step, the
p-value cut-off can be adjusted for the differential analysis of each data set by clicking Analyze under the
"DE Analysis" step.
Yes, it is possible to perform enrichment analysis (ORA) on parts of these diagrams. For Venn diagrams, portions of the diagram
will be highlighted after you click them. Multiple sections can be highlighted at once, and the union of the genes from
the highlighted sections is listed under "Gene List View" on the bottom left. The "Enrichment Analysis" tool performs ORA
on this list of genes. For chord diagrams, start by selecting a specific colour that corresponds to the significant
results from either an individual dataset or the meta-analysis. The options under "Enrichment Analysis" will change
depending on the section of the chord diagram that's been selected. You can perform ORA on either the genes unique to the
selected dataset, or on the genes belonging to the intersection of the selected dataset with another dataset.
Venn diagrams are limited to 4 or fewer gene lists, and while chord diagrams have no limitation on the number of lists, there must
be fewer than 2000 genes. This limitation can be overcome by uploading the genes of interest from each dataset you are comparing as
multiple gene lists in the "Gene List Input" from the NetworkAnalyst home page and visualizing them with the "Heatmap View". This
heatmap colours genes according to the number of lists they are a part of. A gene is represented by a grey square if it is not a
member of that list. There is currently no limit to the number of genes or lists that can be visualized in this way.