Please click on the links below for the help topics

Download User Guide

What is HAPPI?

The Human Annotated and Predicted Protein Interaction (HAPPI) Database is a free, open-access, and comprehensive database collection of computer annotated human protein-protein interactions from public data sources and computational predictions.
The database was developed by exhaustively integrating publicly available human protein interaction data from the BioGRID, I2D, IntNetDB, HPRD, and STRING databases into a data warehouse powered by our Oracle 11g relational database server. Various types of sequence, structure, pathway, and literature annotation data from established bioinformatics resources such as NCBI, PubMed, UniProt, HUGO, EBI, and PDB were also integrated into the data warehouse. Our long-term goal is to develop a new type of protein interaction database resource for biomedical scientists who are interested in evaluating biologically significant protein interactions, developing disease pathway models, and identifying disease drug targets or diagnostic biomarkers.

Why HAPPI?

In the current release of the HAPPI database, users can examine protein interactions in many ways.

  • The database is very comprehensive, containing 2,922,202 non-redundant reliable human protein interaction pairs among 32,125 human proteins (identified by UniProt ID) as of the November 2010 release. In other major public protein interaction databases, the reported number of human interaction pairs ranges from 10,000 to 45,000.
  • Each interaction in HAPPI is assigned a confidence score between 0 and 1 and a corresponding confidence rank of 1, 2, 3, 4, or 5 stars (see the Interaction Scoring Method and Interaction Ranking Method sections for details). This scoring framework gives users a unified way to choose an appropriate reliability level (and therefore a subset) of interactions for their own study. While the "total protein interactions" represented in the HAPPI database at all star ranks exceed 1.2 million entries, the HAPPI database defines "reliable interactions" stringently: an interaction must have a score of 0.75 or above (4 or 5 stars) to be counted as reliable.
  • Each interaction in HAPPI is computationally annotated with bioinformatics data, including biological pathway, gene function, protein family, protein structure, and sequence features. Bringing this information together in one place enhances users' ability to examine the biological validity and significance of specific interactions and their relevance to disease-specific studies.

How to Cite this Work?

For both in-depth technical details and citation of this work, please refer to the following article:

Jake Y. Chen, SudhaRani Mamidipalli, and Tianxiao Huan (2009) HAPPI: an Online Database of Comprehensive Human Annotated and Predicted Protein Interactions, BMC Genomics, Vol. 10, Suppl 1: S16 (free access here)

System Architecture

The HAPPI database is developed as a classical 3-tier data-driven web application.

  • In the database tier, we host the HAPPI data warehouse on a collection of schemas powered by the Oracle 10g Release 2 relational database server. All the protein interaction data and related annotation data are imported from public sources, loaded into our database, and managed locally within our data warehouse. We used relational database views extensively to hide the complexity of database queries.
  • In the application logic tier, we used predominantly PHP as an extension to the Apache web server to manage data transformation logic (e.g., conversion of confidence scores to confidence ranks on the fly), database connections, and data post-processing. To help the PDB structure viewer in the presentation tier display protein structures correctly, we also used PHP to generate temporary protein structure files on the web server for the viewer applet to use.
  • In the presentation tier, we used HTML to display textual and hyperlinked data in a grid and Java applets to render dynamic database content in a visual exploratory environment. The PDB structure viewer is JMOL, an open-source Java applet. The sequence alignment viewer is SafMap, a Java applet developed internally.

Interaction Data Source

The majority of the human protein interaction data are extracted and integrated from the HPRD, BioGRID, IntNetDB, STRING, and I2D databases. In particular, we adopted the data source naming standard from the I2D database (for a listing of all possible data source values, read here). For data integrated from HPRD and BioGRID, we directly use the database names as the data source names. In summary, the data sources are:

Database   Description
BioGRID    Database of Protein and Gene Interactions
HPRD       Human protein interactions found in the HPRD database
IntNetDB   Integrated protein interaction network database
(misc.)    Additional recent experimental high-throughput human protein interaction data found in published studies, including JonesErbB1, Pawson, StelzlLow, StelzlMedium, StelzlHigh, VidalHuman_core, and VidalHuman_non_core
STRING     Human protein interactions found in the STRING database
(other)    Human protein interactions computationally derived and described in the I2D database. The various codes indicate the types of source interaction data used to derive human interaction data. These include: CORE_1, CORE_2, NON_CORE, LITERATURE, SCAFFOLD, INTEROLOG, and CE_DATA from C. elegans; low, medium, high, and Krogran_Core from yeast; AfCS, Suzuki, RikenDIP, RikenLit, and RikenBIND from mouse; FlyHigh, FlyLow, and FlyCellCycle from D. melanogaster; and WranaHigh, WranaMedium, and WranaLow from LUMIER.

Interaction Scoring Method

We developed a unified scoring model to assess the reliability of human protein-protein interactions integrated from public protein interaction databases. First, independent scoring systems were developed for each protein interaction database to be integrated (after consulting with our collaborating biomedical scientists), primarily based on heuristic scoring of experimental or computational protocol categories. Each interaction pair derived by a specific experimental or computational method from a given source is assigned a heuristic confidence score Si, which estimates how reliable or trustworthy the interaction data from that method and source are. The more trustworthy the experimental or computational protocol that generated the interaction data, the higher the confidence score Si.

Score      Data source
0.80       Curated human protein interactions found in HPRD, BIND, and MINT
0.75       High-throughput human protein interaction experimental data
0.70       Human protein interactions in I2D predicted from mouse and rat
0.65       High-quality human protein interactions in I2D predicted from Drosophila
0.60       Medium-quality human protein interactions in I2D predicted from various mouse, rat, and Drosophila projects
0.50       Human protein interaction data inferred from medium-to-high quality worm and yeast data; high-quality text mining results from STRING
0.40       Human protein interaction data inferred from low-quality worm or curated/high-quality yeast data (including those from MIPS yeast); medium-quality text mining results imported primarily from the STRING database
0.05-0.35  Human protein interaction data inferred from non-interaction data sources (indirect association evidence); low-to-medium-quality text mining results imported primarily from the STRING database

Then, we used a score combination formula (listed below) to combine the individual confidence scores into a final h-score for each interaction that is derived from multiple experimental and computational methods or from different data sources:

[Formula image: score combination formula for the h-score]

In the above formula, N represents the total count of different data sources and conditions for which an independent assessment of the protein interaction reliability score Si exists.
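
The combination formula itself appears only as an image on the original page and is not reproduced in this text. As a minimal sketch, assuming the individual scores Si are treated as independent evidence and combined with a noisy-OR rule, h = 1 - prod(1 - Si), the combination could look like the following Python. Both the rule and the function name `combine_scores` are assumptions made for illustration, not a confirmed statement of the HAPPI formula.

```python
from math import prod


def combine_scores(scores):
    """Combine per-source confidence scores S_i into a single h-score.

    ASSUMPTION: noisy-OR combination, h = 1 - prod(1 - S_i).
    This is an illustrative choice, not necessarily the HAPPI formula.
    """
    if not scores:
        raise ValueError("at least one score S_i is required")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    # Each additional independent source can only raise the combined score.
    return 1.0 - prod(1.0 - s for s in scores)


# One interaction supported by a curated source (0.80) and by
# high-quality text mining (0.50), using values from the table above.
h = combine_scores([0.80, 0.50])
print(f"h-score = {h:.2f}")
```

Under this assumed rule, N is simply `len(scores)`, and an interaction reported by a single source keeps that source's score unchanged.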

Interaction Ranking Method

We used a ranking method that works in principle by clustering the distribution of h-scores for all interactions managed in the HAPPI database. The h-scores range from 0 to 1. Based on the properly binned distribution of combined scores (see figure below), a 5-star ranking model was developed, in which each cutoff threshold is placed at a boundary where the bins fall off or rise significantly, as measured by the "sum of counted interaction" value on the y-axis.

[Figure: binned distribution of h-scores]

Based on the above binned h-score distribution, the star rating for each interaction in the HAPPI database is defined by the following h-score thresholds:


Rank    h-score range           Interpretation
star 1  h-score < 0.25          noisy and unsupported interactions
star 2  0.25 <= h-score < 0.45  very-low-confidence interactions
star 3  0.45 <= h-score < 0.75  low-confidence interactions
star 4  0.75 <= h-score < 0.90  medium-confidence interactions
star 5  0.90 <= h-score <= 1    high-confidence interactions

Note that when reporting HAPPI database statistics, we only use interactions with h-score >= 0.75 (ranked at 4 or 5 stars). We do not include plausible interactions with h-score < 0.75 (ranked at 1, 2, or 3 stars) in the statistics, although we do allow querying and retrieval of interactions ranked at 3 stars and above. Most of the interactions ranked at 3 stars and below are derived from the STRING database or from text mining methods, which often rely on the co-occurrence of gene/protein names above a certain frequency in the same text. It remains uncertain how many of the interactions in this subset are real, so they are excluded from the HAPPI database statistics.
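
The thresholds above translate directly into a small lookup. The following Python is a minimal sketch of that conversion, not the site's actual PHP code; the function names `h_score_to_stars` and `is_reliable` are hypothetical.

```python
def h_score_to_stars(h):
    """Map an h-score in [0, 1] to a 1-5 star rank using the cutoffs above."""
    if not 0.0 <= h <= 1.0:
        raise ValueError(f"h-score out of range [0, 1]: {h}")
    if h < 0.25:
        return 1  # noisy and unsupported
    if h < 0.45:
        return 2  # very low confidence
    if h < 0.75:
        return 3  # low confidence
    if h < 0.90:
        return 4  # medium confidence
    return 5      # high confidence


def is_reliable(h):
    """True for interactions counted in the reported statistics (4 or 5 stars)."""
    return h_score_to_stars(h) >= 4


for h in (0.10, 0.25, 0.74, 0.75, 0.90, 1.0):
    print(h, h_score_to_stars(h), is_reliable(h))
```

Each lower bound is inclusive, so an h-score of exactly 0.75 is ranked at 4 stars and counts as reliable.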

Quality of HAPPI in comparison to other major public human protein interaction data sources

High-quality conserved gene co-expression profiles are used to assess protein interaction quality for the HAPPI database. Many protein interaction data sets have been cross-validated with human gene co-expression profiles. While interacting proteins may share highly similar gene expression profiles, it has sometimes been suggested that the expected correlation between protein interactions and gene expression is weaker in human than in model organisms. Tirosh and Barkai found that, for developing a confidence measure for interacting proteins, co-expression of the orthologs of interacting partners is a more reliable way to verify protein interactions when comprehensive expression profiles are difficult to compile across all conditions and the interactions may be transient [41]. This rests on the assumption that an evolutionarily conserved co-expression relationship is a reliable predictor of how true protein interactions have evolved and been conserved functionally. It is therefore more sensitive overall than using information purely from the organism itself, e.g., simple co-expression, cellular co-localization, and similarity in Gene Ontology functional annotations. In a similar study, Bhardwaj and Lu also verified that reliable predictions of interactions from heterogeneous data sources can be strengthened by evolutionarily preserved gene co-expression measurements [42]. Therefore, we chose to apply evolutionarily conserved co-expression pairs to the assessment and comparison of PPI data quality across different sources.

We evaluated the quality of a sufficiently large PPI data set based on the degree of overlap between protein interactions and evolutionarily conserved co-expressed genes found in the MetaGene data set, which consists of 22,163 evolutionarily conserved co-expression gene pairs based on the analysis of over 3,182 published DNA microarray experiments by Stuart et al. [43]. MetaGene is a comprehensive compilation of evolutionarily conserved gene co-expression pairs from a diverse set of DNA microarray experiments obtained from four different organisms: 1,202 DNA microarrays from humans, 979 from worms, 155 from flies, and 643 from yeast. The relative quality of each PPI database, including HAPPI, I2D, IntNetDB, ProNet, UniHI, and HPRD, was estimated as the percentage of overlap between protein interactions in the PPI database of interest and MetaGene pairs. The upper bound of such overlap is given by counting the unified set of PPIs from all six human PPI databases that can be mapped to MetaGene pairs: 6,297 in all, or 6,145 PPIs from the largest connected component of the network. The lower bound of such overlap is given by creating a random reference set of 37,000 PPI pairs, comparable in size to the HPRD database, and computing the mean overlap between a random sub-sample (size = 1,000) of these PPIs and MetaGene pairs, repeated 1,000 times.
Therefore, to assess the quality of HAPPI datasets at different quality ranking levels, e.g., between 1-star and 5-star, we calculated the overlap between PPIs at a given quality ranking level from HAPPI and MetaGene pairs.
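
The resampling procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code used for HAPPI: the function names are hypothetical, pairs are treated as undirected, and the data in the example are synthetic placeholders rather than real HAPPI or MetaGene content.

```python
import random


def canonical(pair):
    """Order-independent key for an undirected pair (A, B) == (B, A)."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def overlap_distribution(ppi_pairs, reference_pairs, sample_size=1000,
                         repeats=1000, seed=0):
    """Return one overlap count per random sub-sample of ppi_pairs."""
    rng = random.Random(seed)
    reference = {canonical(p) for p in reference_pairs}
    pool = [canonical(p) for p in ppi_pairs]
    counts = []
    for _ in range(repeats):
        sample = rng.sample(pool, sample_size)  # sampling without replacement
        counts.append(sum(1 for p in sample if p in reference))
    return counts


# Synthetic toy data standing in for a PPI database and the MetaGene pairs.
genes = [f"G{i}" for i in range(200)]
all_pairs = [(genes[i], genes[j]) for i in range(200) for j in range(i + 1, 200)]
toy_rng = random.Random(1)
toy_ppis = toy_rng.sample(all_pairs, 5000)
toy_reference = toy_rng.sample(all_pairs, 2000)

counts = overlap_distribution(toy_ppis, toy_reference, sample_size=1000, repeats=100)
print(f"mean overlap per 1,000 sampled pairs: {sum(counts) / len(counts):.1f}")
```

Running the same function on the subset of interactions at each star rank would give one overlap distribution per rank, which is the comparison described in the paragraph above.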

[Figures: MetaGene overlap frequency distributions for HAPPI and other databases, and for HAPPI by star rank]

Degree of overlap between randomly selected protein interaction pairs in selected protein interaction databases and MetaGene pairs. We randomly selected 1,000 protein-protein interactions and counted the number of protein interaction pairs that overlapped with conserved co-expression pairs in the MetaGene database. This randomization and MetaGene overlap counting was repeated 1,000 times for each protein interaction database, and the resulting distributions are shown as profiles on the graph. The x-axis is normalized so that overlap with all 6,145 possible MetaGene pairs equals 100%. Panels A and B show a comparison of the distributions of MetaGene overlap counts for randomized samples of the HAPPI-1 and HAPPI-2 database subsets.

A comparison of protein network degree distributions between HAPPI-1 and HAPPI-2. The two data sets showed similar intercepts and R2 values for the linear fit to the node degree distributions plotted on log-log scales, with HAPPI-2 having a slightly flatter slope. This indicates a trend for the updated HAPPI-2 to have 'network hubs' with a higher degree of connectivity than HAPPI-1 as the data coverage expands.
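
For readers who want to reproduce this kind of comparison on their own interaction lists, the following Python sketch computes a node degree distribution and fits a straight line to it on log-log scales. It assumes an ordinary least-squares fit on base-10 logarithms, which may differ from the fitting procedure used for the figure; the function names and the toy edge list are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree k: number of nodes with degree k} for an undirected network."""
    degree = Counter()
    # Collapse duplicate and reversed pairs before counting.
    for a, b in set(tuple(sorted(e)) for e in edges):
        if a == b:
            continue  # ignore self-interactions
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_fit(dist):
    """Least-squares line through (log10 k, log10 count): slope, intercept, R^2."""
    if len(dist) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    xs = [math.log10(k) for k in dist]
    ys = [math.log10(c) for c in dist.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r2


# Toy network: one hub "H" plus a few low-degree nodes (synthetic data).
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("A", "B"), ("B", "A")]
dist = degree_distribution(toy_edges)
slope, intercept, r2 = loglog_fit(dist)
print(dict(dist), f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
```

A flatter (less negative) slope means that high-degree nodes are relatively more common, which is the "network hub" trend noted in the caption.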

Web Database Site Map

User Guide

HAPPI provides a web-based query interface for searching the interactions in the HAPPI database. Users can then download the interactions or save them for further analysis.

  1. Search protein interaction using a protein's UniProtID or Gene Name
    Enter the query protein's UniProtID/Gene Name, and press "Enter". You can enter multiple IDs delimited by commas or semicolons.
    Examples: 1) TNF 2) BRCA1_HUMAN; FOXA1_HUMAN 3) brca1,tnf,atm,pcna

    To search for interactions using other IDs, use the Advanced Search link.

  2. Search protein interaction using different IDs
    First select the ID type (examples of IDs will be provided). Then either enter the ID list (multiple IDs should be delimited by commas or semicolons) or upload a file with each ID on a separate line.

  3. Browse protein interactions involving the query protein and its interacting partners
    After a query protein is submitted, a list of proteins that interact with the query protein is retrieved into a data grid, along with each protein's description, source of interaction, and a star rating on a scale of 1 to 5 (see the Interaction Ranking Method section above for details). The interacting proteins are sorted by rank, with those ranked at 5 stars listed first, then 4 stars, and so on. The user may click on a protein's UniProt ID to navigate to the "Protein Summary of Facts" page (see item 5 below), or click on the relationship symbol to the left of each interacting protein to navigate to the "Protein Interaction Detail" page (see item 4 below). The relationship symbols, although currently implemented only as bi-directional ("<=>"), are reserved to also be uni-directional ("=>", the query protein recruits the partner protein in the interaction; or "<=", the partner protein recruits the query protein in the interaction).
  4. Browse protein interaction detailed reports
    In the “Protein Interaction Pair Detailed Report” page, users can further examine the biological relationship evidence that may exist between interacting proteins. For example, in the “Protein Interaction List” page, users can find that INS_HUMAN (insulin precursor protein) and INSR_HUMAN (insulin receptor precursor protein) interact with each other with high confidence (a 5-star ranking). In this page, aided by protein annotation information shown side-by-side for both INS_HUMAN and INSR_HUMAN, users can further conclude that this interaction fits the “peptide ligand - receptor” binding model, and that the controlled binding of the two proteins plays essential roles in several shared pathways, including the insulin signaling pathway, type II diabetes, and the DRPLA disease process.

    Various kinds of annotation information for the interacting proteins are presented within the same page. In the current release, the annotation information includes: protein descriptions from the UniProt database; corresponding gene symbols from the NCBI GENE database (with hyperlinks to the original gene entry at NCBI); top literature abstract co-citation references, where the names or synonyms of both genes/proteins are mentioned in the same abstract (with hyperlinks to the PubMed abstract); top molecular pathways in the KEGG database associated with each of the proteins (with hyperlinks to the KEGG pathways); representative PDB structures for both proteins, displayed immediately next to each other whenever available; and gene/mRNA sequence feature alignments and annotations.

    In particular, the side-by-side display of the two proteins' PDB structures, along with their mRNA/gene sequence feature and alignment maps, is considered highly innovative. To activate the structure exploration window, first left-click on the JMOL Java applet, and then right-click to reveal the structure exploration menu. In the current release, it is also possible to navigate directly to the corresponding PDB record pages through the hyperlink, or to explore a zoomed-in version of the JMOL structure viewer in a standalone explorer window. To examine what each aligned line segment represents, first click on the SafMap feature alignment Java applet, then move the mouse over the line segment of interest to reveal its comment box.



  5. Browse protein quick facts
    In this page, users are provided with a glimpse of a protein's name, gene name, description, cross-references to sequence identifiers, and so on. Whenever an external database identifier is available, a hypertext link to the original database web page is provided where possible.


  6. Annotate Interactions
    In this page, registered users can add new annotations to an interaction, rank it, and assign a directionality to it.
  7. Create and Save list of selected protein interactions
    You must be a registered user to create lists and save interactions.
    A customized list of interactions can be created by selecting interactions from the search results and pressing the 'Save to List' button. Users can then name the list, and the interactions will be saved in that list.
    Lists and their interactions can be viewed from the user's 'My page' once logged in.
  8. Browse new protein interactions
    There are several ways to browse new protein interactions:
    • HAPPI home page
    • HAPPI Advanced Search page
    • Go to an interacting protein's "Protein Quick Facts" annotation and click on the protein entry name
    • Once logged in, select a list name (if you have created any lists) to view the interactions in that list, then select proteins to search for further interactions.
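
For users who prepare query lists programmatically before pasting or uploading them (steps 1 and 2 above), the following Python sketch shows one way to normalize a list of IDs delimited by commas, semicolons, or line breaks. It is an illustration only, not the site's own parsing code; the function name `parse_id_list` is hypothetical, and the case-insensitive removal of duplicates is an assumption.

```python
import re


def parse_id_list(text):
    """Split a query string on commas, semicolons, or newlines into clean IDs."""
    seen = set()
    ids = []
    for token in re.split(r"[,;\n]", text):
        token = token.strip()
        # ASSUMPTION: IDs that differ only in letter case are duplicates.
        if token and token.upper() not in seen:
            seen.add(token.upper())
            ids.append(token)
    return ids


print(parse_id_list("BRCA1_HUMAN; FOXA1_HUMAN"))
print(parse_id_list("brca1,tnf,atm,pcna"))
print(parse_id_list("TNF\nBRCA1\ntnf\n"))
```

The cleaned list can be joined back with `", ".join(ids)` for the search box, or written one ID per line for file upload.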