A worldwide e-Infrastructure for NMR and structural biology

The tutorial describes how to model the complex of the glucose-specific enzyme IIA (E2A) and the histidine-containing phosphocarrier protein (HPr). The use of the server with various kinds of experimental NMR data (chemical shift perturbations, intermolecular NOEs and RDCs) or bioinformatic interface prediction (WHISCY, ProMate, PIER) is demonstrated. You will be guided through typical analysis steps to assess the results and will compare the effect of various data combinations on the accuracy of the generated models. A related tutorial in the form of a Flash movie can viewed here. Additional information can be found in the online HADDOCK manual.

Sjoerd de Vries (haddock.support_at_gmail.com) & Alexandre Bonvin (a.m.j.j.bonvin_at_uu.nl)

## Relevant web sites

In this tutorial you will perform protein-protein docking using the HADDOCK server, either the Utrecht local server:

or the eNMR GRID-enabled server:

The HADDOCK server requires registration (free for all academic users). In addition, the GRID-enabled server requires a valid personal certificate. For information on how to obtain such a certificate please refer to:

WeNMR registration

Normally, HADDOCK docks 1000 structures in the rigid body minimization (it0) and refines the top 200 in the semi-flexible refinement in torsion angle space (it1) followed by explicit solvent refinement (water). The Expert and Guru interfaces allow the number of structures to be modified. However, for the purpose of this tutorial, the number of structures should be scaled down to 250 (it0), 50 (it1), 50 (water) in order to limit the computational time requirements.

By default, HADDOCK discards for each docking trial 50 % of the restraints at random. The rationale is that, in case of unreliable data (e.g. bioinformatics prediction containing usually a significant amount of false positives), incorrect restraints will removed for some of the docking trials and that, hopefully, the resulting models will be favorable in terms of energy.

## Required software for local analysis

Information on how to obtain these other related and interesting programs can be found at:

## The E2A-HPR HADDOCK server tutorial

In this tutorial, we will dock the E. coli phosphoenolpyruvate-sugar phosphotransferase system between glucose-specific enzyme IIA (E2A) and histidine-containing phosphocarrier protein (HPr). Several structures of E2A and HPr are available in the free form (PDB entries 1F3G and 1HDN will be used here). The docking will be driven by experimental NMR data (chemical shift perturbations, intermolecular NOEs and RDCs) and various bioinformatic interface prediction programs (WHISCY, ProMate, PIER). The complex structure has been solved experimentally by NMR (PDB entry 1GGR).

Note that the example uses the NMR solution structure of HPr (entry 1HDN) to allow the docking from an ensemble of conformations. Because of that, the definition of active and passive residues might differ from the one in the original HADDOCK paper since the filtering is based on the average accessibilities.

This tutorial consists of 8 cases in which E2A-HPr docking runs are performed with different combination of NMR data or bioinformatics predictions.
• case01: Chemical shift perturbation (CSP) only
• case02: CSP + residual dipolar couplings (RDCs)
• case03: CSP + diffusion anistropy (DANI)
• case04: NOEs + H bonds
• case05: NOEs + H bonds + RDCs
• case06: Bioinformation predictions

#### Required input data:

Unpack this file in a working directory (preferably in your home directory for things to work smoothly) (under linux) with as command:

This will create a directory called haddock-protein-protein in which all input data can be found, together with results from server runs. The latter should allow you to check your results and/or proceed with the analysis without having to wait for the results.

#### The runs:

The HADDOCK server offers multiple interfaces, with an increasing number of options that can be changed. Depending on your data, you will need one of the following three interfaces.
• case01,06,07,08: the HADDOCK Easy interface
• case04,05: the HADDOCK Expert interface
• case02,03: the HADDOCK Guru interface

#### PDB files

The unbound (free-form) PDB files are

for E2A (the phosphorylated form) and

for HPr (NMR ensemble).

PDB files are specified in the "Structure definition" section of the menu's "First molecule" and "Second molecule". "Where is the structure provided" must be set on "I am submitting it". The chain must be set on "All", the file must be uploaded in "PDB structure to submit", and the molecule type must be set on "Protein".

#### Experimental data

All experimental data are present in:

Chemical shift perturbations: When dealing with chemical shift perturbations, it is necessary to map the shifted residues onto the protein surface and to filter them according to surface accessibility. Surface accessibility has been calculated with NACCESS: you can see their values in ~/haddock-protein-protein/e2a_1F3G.rsa and hpr_rsa_ave.lis (average accessibilities for the HPr ensemble). The nmr-data directory contains two Rasmol script files containing the filtered active and passive residues (e2a_rasmol_active-passive.script and hpr_rasmol_active-passive.script).

You can view the residues in Rasmol with the command "rasmol <pdb file> -script <script file>". Red and yellow are residues with a significant chemical shift. Of those, red residues are solvent-accessible (more than 50 % accessible) and therefore designated as active residues. Green are the solvent-accessible residues surrounding the active residues and therefore designated as passive. Orange are surrounding residues that are not solvent-exposed. Note that the accessibility cut-off is not a hard one; lower cut-offs can be used, especially for potentially functional residues (e.g. Gln, Tyr)

Rasmol script files are text files that can be displayed with cat or more or some text editor. The active (red) residues and passive (green) residues should be entered under "Restraints definition" in "First molecule" and "Second molecule".

Residual dipolar couplings: In HADDOCK, RDCs can be used either as SANI or as VEAN restraints. Here, we will be using them as SANI. Unfold Residual Dipolar Couplings => Residual Dipolar Couplings 1 by clicking on the double arrows. Set the type of RDCs at "SANI". Look in README_SANI to retrieve the axial (D) and rhombic (R) values. Upload e2a-hpr_rdcs_work.tbl as RDC file.

Diffusion anisotropy: Unfold Relaxation Anisotropy Restraints => Anisotropy Restraints 1 by clicking on the double arrows. Set the type of anisotropy at "DANI". Look in README_DANI to retrieve the correlation time (tc), anisotropy (D), proton frequency (wh), nitrogen frequency (wn) and R values. Upload e2a-hpr_dani.tbl as anisotropy file.

Important: Uncheck the option "delete non-polar hydrogens". HADDOCK uses a united atom force field and by default all non polar hydrogen atoms are deleted to speed up the calculations. If NOE restraints are included, all hydrogen atoms should be kept.

Hydrogen bonds: In the "Dihedral and hydrogen bonds restraints" menu, upload e2a-hpr_inter_hbonds.tbl as hydrogen bonds restraints.

Interface predictions: Bioinformatic interface predictions can be obtained from several programs and web servers. WHISCY (http://www.nmr.chem.uu.nl/whiscy), ProMate (http://bioinfo.weizmann.ac.il/promate) and PIER (http://abagyan.ucsd.edu/PIER/) are easy-to-use web servers that return their results quickly.

You can obtain predictions yourself at the respective web sites by uploading the proteins with hydrogens removed. The PDB files are in ~/haddock-protein-protein/predictions. Sequence alignments for WHISCY in PHYLIP format are also present.

However, for the docking, you should use the pre-generated predictions in the ~/haddock-protein-protein/predictions directory. For each predictor, there are Rasmol script files containing predictions for E2A and HPr (e2a_<predictor>.rasmol and ipr_<predictor>.rasmol, 20 predictions per predictor per protein).

The same considerations apply for interface predictions as for chemical shifts and they should be entered into HADDOCK in the same way. The pre-generated predictions have already been filtered for surface accessibility.

#### Job submission

As a final step, enter your username and password and press Submit. If everything went correctly, HADDOCK will provide you with a link to the results, as well as a parameter file. This parameter file can be used for reference, and parameter files can be submitted directly to the server as well. To check if you entered your parameters correctly, download your parameters as haddockparam.web and type:

where XX is a number corresponding to the case your docking, e.g. 01, 02, etc.

There should be no differences except user name and run name.

The server will send you email notifications. If the server indicates that an error message has been sent by e-mail, please check your email.

## Performing a docking based on NMR chemical shift titration data (case01)

For this tutorial, we will only perform one docking run based on NMR chemical shift titration data. If time allows, you are welcome to try other combination of data and/or interface predictions.

### STEP1 – setup tutorial

In order to run the HADDOCK tutorial please enter first the following commands in a terminal window (the tar archive should be present in your home directory):

cd
tcsh
source setup.csh

Note: every time you open a new terminal window you will have to repeat the following commands:

tcsh
source setup.csh

### STEP2 – NMR chemical shift perturbations

We are going to perform a docking run based on NMR chemical shift perturbations. These have been obtained for both E2A and HPR. The perturbed residues (those showing significant displacements in the NMR titration experiments) are:

E2A: 38,40,45,46,69,71,78,80,94,96,141

HPR: 15,16,17,20,48,49,51,52,54,56

Since experiments often do not detect all residues within an interface, for HADDOCK we typically increase the definition of the interface residues by selecting all solvent accessible neighbors of perturbed residues. You visualize those residues on the structure with rasmol scripts provided in the ~/haddock-protein-protein/nmr-data directory.

To view the perturbed residues for E2A type the following command:

./ana_scripts/rasmol.$ARCH -script nmr-data/e2a_rasmol_active-passive.script e2a_1F3G.pdb To view the perturbed residues for HPR type the following command: cd haddock-protein-protein ./ana_scripts/rasmol.$ARCH -script nmr-data/hpr_rasmol_active-passive.script  hpr-ensemble.pdb

The red residues are the one with significant NMR chemical shift perturbations, the green ones are their solvent accessible neighbors.

### STEP3 – Input data to the HADDOCK server

For this case we will be using the “easy” level of the HADDOCK server. For this, go to:

and click on the “easy" interface.

Before entering any data, make sure all menus are open: the arrows on the right side of the red bars should be pointing down.

#### Enter the data for E2A:

• Where is the structure provided? -> I am submitting it
• Which chain must be used? -> all
• PDB structure to be used? -> input here e2aP_1F3G.pdb which can be found in the haddock-protein-protein directory
• Active residues? -> input here the list of residues for E2A (see STEP2)
• Passive residues? -> leave blank
• Automatically define passive residues? -> click on the box to turn on this option
• What kind of molecule are you docking? -> protein

#### Enter the data for HPR:

• Where is the structure provided? -> I am submitting it
• Which chain must be used? -> all
• PDB structure to be used? -> input here hpr-ensemble.pdb which can be found in the haddock-protein-protein directory. This PDB file contains an ensemble of NMR structures
• Active residues? -> input here the list of residues for HPR (see STEP2)
• Passive residues? -> leave blank
• Automatically define passive residues? -> click on the box to turn on this option
• What kind of molecule are you docking? -> protein

Your docking run has now been submitted. You first see the following window:

Here you can download a haddock parameter file that contains all parameters and data you input. It can be used to resubmit a run via the “file upload interface”.

You are also provided with a link to where you results will appear (if you officially register to use the server you will also get an email with this link and be notified once the docking has completed).  By clicking on this link you can see the status of your run. The page will automatically refresh and the results will appear in it once the run is finished.

After completion the results will be presented to you on the web page with statistics per cluster, order from the best ranking cluster:

### STEP4 – Analysis

See below the Analysis section.

## Analysing the docking results

Since we are calculating a reduced number of structures, the HADDOCK runs should take no more than 30-60 minutes. Once the run is finished, the 50 top structures will be clustered and the energies for each cluster are reported. The clusters are numbered according to their size (cluster 1 is the largest cluster) but they are sorted according to HADDOCK score (the most favorable cluster is listed first).

Note: If, for any reason, your HADDOCK run fails, you can open ~/haddock-protein-protein/runs/caseXX/index.html with your browser for cluster statistics. The top four structures from each cluster will be in the same directory.

#### Visual inspection and comparison

Visual inspection of HADDOCK results is an important part of the analysis. You can download the top four structures for each cluster and open them in Rasmol, Pymol, or the embedded Jmol viewer. If you have only one cluster, look at all four structures. If you have multiple clusters, look at the top structure of each cluster. Look for favorable contacts (hydrophobic contacts, electrostatic contacts, aromatic stacking) and unfavorable ones (electrostatic repulsion, buried charges). Do the structures look reasonable? Would you agree with the way they are ranked by HADDOCK? Also pick a few of your active residues/NOEs to see if they are satisfied.

We will now compare our models of E2A-HPr to the minimized average NMR structure that is deposited in the PDB (1GGR). Download the top structure of the top cluster and save it for example as "cluster1_1.pdb".

We are going to superimpose the structure onto the reference NMR structure using the profit program. On the command line, type the following commands:

~/haddock-protein-protein/ana_scripts/profit.$ARCH ~/haddock-protein-protein/ana_scripts/e2a-hpr_1GGR.pdb cluster1_1.pdb atom CA,C,O,N zone A* fit write cluster1_1-fit.pdb quit Then, we are joining the structures into one single PDB for viewing: ~/haddock-protein-protein/ana_scripts/joinpdb -o cluster1_1-compare.pdb \ ~/haddock-protein-protein/ana_scripts/e2a-hpr_1GGR.pdb cluster1_1-fit.pdb You can now visualize the structure by opening "cluster1_1-compare.pdb" with Rasmol or Pymol. This PDB contains now two models: model 1 is 1GGR while model 2 is our top structure. The two models have been fitted on the backbone of E2A (chain A), so you can see the difference by comparing their positions of HPr (chain B). Do they look the same? Repeat this for the other clusters. Note: In pymol, to visualize all structures turn on under the "Movie" menu "visualize all states", or alternatively load all PDBs as separate files (i.e. not the combined PBD). #### Biological insight There is one additional piece of information that we did not use so far. This complex is a phosphate-transfer complex, meaning that a phosphate group is transferred from a histidine of E2A to a histidine of HPR. The docking models should make sense in the light of this new information. This means that two histidines should come close in space at the interface. Check the various clusters, for which cluster is this the case? Is this the best ranking(scoring) cluster? Note: In pymol you can visualize the histidines by typing the following command in the graphic window: show cartoon select resn HIS or resn NEP show spheres, sele (NEP is the name of a histidine amino acid carrying a phosphate group on the N epsilon) ### Quantitative comparisons We will compare the various energies of each structure (scoring) with the quantitative resemblance of each model to the reference structure (assessment). We will look at the following statistics: ### Scoring • HADDOCK score, a weighted sum of the following four terms: • Electrostatic energy (weight 0.2) • Van der Waals energy (weight 1.0) • Desolvation energy (weight 1.0) • Restraints violation energies (distance, SANI) (weight 0.1) ### Comparison to the reference structure • Ligand root mean square deviation (l-RMSD), calculated by fitting the model and the reference on the backbone atoms of the largest protein (chain A, E2A) and then calculating the RMSD on the backbone atoms of the smaller protein (chain B, HPr). • Interface root mean square deviation (i-RMSD), calculated by fitting on the backbone atoms of all residues in the interface (less than 10 Angstroms from the partner protein) and calculating the RMSD over the same atoms. • Fraction of native contacts, calculated by counting, on a residue level, all contacts between the two proteins in the reference structure, and assessing how many are retrieved. First we need to download the entire HADDOCK run. A link to a zipped archive (tgz file) is provided on the results page. Download the archive and extract it with the command: tar xvfz filename.tgz (the filename is the one you gave for your run) Where filename.tgz is the file name of the archive. Note: For this tutorial you can also simply to choose to use the file provided with the tutorial data in ~/haddock-protein-protein/runs/case01/case01.tgz. Extract it with the following command: tar xvfz ~/haddock-protein-protein/runs/case01/case01.tgz If you have to setup the tutorial, please do it now (see STEP1 above) Then, run the analysis by typing in the directory where your run resides: ~/haddock-protein-protein/ana_scripts/run_all.csh case01 (if you provided a name for your run, "case01" will be called differently). The analysis will take several minutes to run. Once it is finished, statistics will have been computed for all stages of HADDOCK: • the rigid body docking (it0) • the flexible refinement (it1) • and the water refinement (water). Their respective directories are • case01/structures/it0 • case01/structures/it1 • and case01/structures/it1/water The various energies, as well as the i-RMSD, are given in structures_haddock-sorted.dat. The structures are sorted according to HADDOCK score. These files are useful to generate various plots such as for example intermolecular energy versus rmsd from the lowest energy structure to visualize the clusters. The graphic program Xmgr/Grace 4 can be use for this purpose. To facilitate things, we are providing a csh script that will generate a xmgrace file called ene_rmsd.xmgr by specifying an input file and the column numbers to be plotted: cd case01/structures/it1/water ~/haddock-protein-protein/ana_scripts/make_ene-rmsd_graph.csh 3 2 structures_haddock-sorted.stat Columns 2 and 3 correspond to the HADDOCK score and the i-RMSD from 1GGR. You can then visualize it with (provided xmgrace is installed): xmgrace ene_rmsd.xmgr In water, the same statistics are given on a per-cluster basis in clusters_haddock-sorted.dat Finally, the ligand RMSD and the fraction of native contacts are provided for each iteration in l-RMSD.dat and file.nam_fnat. Again, the structures are sorted according to HADDOCK score. ### Using these files and tools, try to answer the following questions: • In it0, do you observe a correlation between HADDOCK score and i-RMSD? Note that the top 50 structures are selected to proceed to the next stage. Do they contain the best structures? If not, is there a different energy/score that does a better job? • In the Critical Assessment of PRedicted Interactions (CAPRI) docking experiment, docking solutions are ranked according to the following criteria:  Fnat criterion RMSD criterion i-RMSD l-RMSD Acceptable (*) fnat >= 0.1 i-RMSD <= 4 l-RMSD <= 10 Good (**) fnat >= 0.3 i-RMSD <= 2 l-RMSD <= 5 High (***) fnat >= 0.5 i-RMSD <= 1 l-RMSD <= 1 According to the RMSD criterion only, what is the highest numbers of stars in any it0 structure? What about the top 50 it0 structures? Compare this with it1 and water, is there an improvement in RMSD? Does this affect the number of stars? Note that complex_1 in it1 and complex_1w in water correspond to the top structure in it0. • Compare again the top 50 it0 structures with it1 and water, but now also consider the fnat values. Is there an improvement? Does this affect the number of stars? • Structures in water are clustered within 7.5 Angstroms. To see the how the structures are divided among the clusters, look in analysis/cluster7.5.out. The numbers refer to the rank of the structures in HADDOCK score. Is the best energy structure (number 1) part of a cluster? If your data is very specific, 7.5 Angstroms may be too loose, causing all your structures to end up in one cluster. In that case, you can re-cluster your solutions with the following commands: cd case01/structures/it1/water/analysis ~haddock-protein-protein/ana_scripts/cluster_struc.$ARCH complex_rmsd.disp X 4

where X is the clustering cutoff in Angstroms. Once you have determined the optimal cutoff, you can redo the cluster analysis with the following commands:

cd ..
\rm file.nam_clus* file.list_clus* file.cns_clus* clust*

• Looking at the clusters in water, is the best (lowest i-RMSD) cluster ranked first? If not, is there a different energy/score that does rank the clusters correctly?
The first line in file.nam_clust1 is the best structure of cluster 1 (the largest cluster). Check also its l-RMSD and fnat values, how many stars does it have? If it is a borderline case, also check structures 2-4 of the cluster. Do this for all clusters.
• Compare run A (with random removal of restraints) and run B (with random removal of restraints disabled). Does it make a difference for your data?

• Finally, compare the results of difference cases. How do the different kinds of data compare?

You can do one (or more) of the following options:
• Repeat your docking, but turn off the random removal of ambiguous restraints. This can be done is the distance restraints menu of the expert interface. It makes most sense to try this with either chemical shift perturbation data or bioinformatics interface predictions. Can you see the difference?
• Repeat your docking with the non-phosphorylated form of E2A (~/haddock-protein-protein/e2a_1F3G.pdb). Phosphorylation introduces at least two negative charges that have a profound effect on electrostatics. Can you see the difference?
• Determine a consensus of different interface predictors for E2A-HPr. In addition to the three prediction servers used here, you could try cons-PPISP (http://pipe.scs.fsu.edu/ppisp.html, SPPIDER (http://sppider.cchmc.org), ISIS (http://www.rostlab.org/services/isis/index.php), and/or PINUP (http://sparks.informatics.iupui.edu/PINUP/).
• You can also run HADDOCK on your own protein-protein or protein-DNA complex, using your own experimental data or predicted interface residues.

## Cite WeNMR

Usage of the WeNMR portals should be acknowledged in any publication:

"The FP7 WeNMR (project# 261572) and H2020 West-Life (project# 675858) European e-Infrastructure projects are acknowledged for the use of their web portals, which make use of the EGI infrastructure and DIRAC4EGI service with the dedicated support of CESNET-MetaCloud, INFN-PADOVA, NCG-INGRID-PT, RAL-LCG2, TW-NCHC, SURFsara and NIKHEF, and the additional support of the national GRID Initiatives of Belgium, France, Italy, Germany, the Netherlands, Poland, Portugal, Spain, UK, South Africa, Malaysia, Taiwan and the US Open Science Grid."

And the following article describing the WeNMR portals should be cited:
Wassenaar et al. (2012). WeNMR: Structural Biology on the Grid.J. Grid. Comp., 10:743-767.

## EGI-approved

The WeNMR Virtual Research Community has been the first to be officially recognized by the EGI.

## European Union

WeNMR is an e-Infrastructure project funded under the 7th framework of the EU. Contract no. 261572