Analysis

This section describes the configurations for various analysis-related scripts.

Complex alignment

This config file is used to determine how a predicted protein-ligand complex structure is optimally aligned to a corresponding ground-truth protein-ligand complex.

analysis/complex_alignment.yaml
method: neuralplexer # the method for which to align predictions - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`)
vina_binding_site_method: diffdock # the method to use for Vina binding site prediction - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`)
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_dir: ${resolve_method_output_dir:${method},${dataset},${vina_binding_site_method},${ensemble_ranking_method},${repeat_index}} # the output directory to which to save the relaxed predictions
rank_to_align: 1 # the pose rank to align
aligned_filename_postfix: "_aligned" # the postfix to append to each aligned complex filename
force_process: false # whether to force processing of all complexes, even if they have already been processed
repeat_index: 1 # the repeat index which was used for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

Inference analysis (PoseBusters, Astex, and DockGen)

This config file is used to determine how to score a predicted protein-ligand complex from the PoseBusters Benchmark, Astex Diverse, or DockGen datasets.

analysis/inference_analysis.yaml
full_report: true # whether to generate a full PoseBusters report (i.e. with all metrics) or a summary report (i.e. with only the most important metrics)
method: diffdock # the method for which to score predictions - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`, `vina`, `ensemble`)
vina_binding_site_method: diffdock # the method to use for Vina binding site prediction - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`, `p2rank`)
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
input_csv_path: ${resolve_method_input_csv_path:${method},${dataset}} # the input CSV filepath with which to run inference
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test_rmsd_filtered.txt # the path to the (ESMFold RMSD-filtered) DockGen test set IDs file
output_dir: ${resolve_method_output_dir:${method},${dataset},${vina_binding_site_method},${ensemble_ranking_method},${repeat_index}} # the output directory to which to save the relaxed predictions
repeat_index: 1 # the repeat index which was used for inference
pocket_only_baseline: false # whether to analyze the pocket-only baseline
relax_protein: false # whether to relax the protein - NOTE: currently periodically yields unpredictable protein-ligand separation

Inference analysis (CASP)

This config file is used to determine how to score a predicted protein-ligand complex from the CASP15 dataset.

analysis/inference_analysis_casp.yaml
full_report: true # whether to generate a full PoseBusters report (i.e. with all metrics) or a summary report (i.e. with only the most important metrics)
python_exec_path: ${oc.env:HOME}/mambaforge/envs/casp15_ligand_scoring/bin/python3 # the Python executable to use
scoring_script_path: ${oc.env:PROJECT_ROOT}/posebench/analysis/casp15_ligand_scoring/score_predictions.py # the path to the script to use for scoring CASP predictions
method: diffdock # the method for which to score predictions - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`, `vina`, `ensemble`, `tulip`)
vina_binding_site_method: diffdock # the method to use for Vina binding site prediction - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `rfaa`)
dataset: casp15 # the dataset to use - NOTE: must be one of (`casp15`)
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
predictions_dir: ${oc.env:PROJECT_ROOT}/data/test_cases/${dataset}/top_${method}_ensemble_predictions_${repeat_index} # the directory containing the predictions to analyze
dataset_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
targets: null # the optional list of target names for which to analyze predictions; if `null`, then all targets in the dataset will be analyzed
fault_tolerant: true # whether to continue processing targets if an error occurs during processing; note that targets H1171v1-2, H1172v1-4, and T1158v4 fail validation (sequence mismatch) and were run with the `fault_tolerant=true` argument during CASP15
skip_existing: true # whether to skip processing targets for which output already exists
score_relaxed_structures: true # whether to score relaxed structures in addition to the original (unrelaxed) structures
repeat_index: 1 # the run index to use for scoring predictions
no_pretraining: false # whether to score a model without pretraining