Model¶

This section describes the configurations for various method-related scripts.

Method inference¶

These configurations are used to specify how inference is performed with each method.

DiffDock inference¶

model/diffdock_inference.yaml¶

cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
python_exec_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/DiffDock/bin/python3 # the Python executable to use
diffdock_exec_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock # the DiffDock directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_csv_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_${dataset}_inputs.csv # the input CSV filepath with which to run inference
inference_config_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/default_inference_args.yaml # the inference configuration file to use
output_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_${dataset}_output_${repeat_index} # the output directory to which to save the inference results
model_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/workdir/v1.1/score_model # the directory in which the trained model is saved
confidence_model_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/workdir/v1.1/confidence_model # the directory in which the trained confidence model is saved
inference_steps: 20 # the maximum number of inference (reverse diffusion) steps to run
samples_per_complex: 5 # the number of samples to generate per complex
batch_size: 5 # the batch size to use for inference
actual_steps: 19 # the actual number of inference steps to run (i.e., after how many steps to halt the reverse diffusion process)
no_final_step_noise: true # whether to disable the final inference step's noise from being added
repeat_index: 1 # the repeat index to use for inference
skip_existing: true # whether to skip inference for existing output directories
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference
v1_baseline: false # whether to run the v1 baseline

FABind inference¶

model/fabind_inference.yaml¶

cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
python_exec_path: ${oc.env:PROJECT_ROOT}/forks/FABind/FABind/bin/python3 # the Python executable to use
fabind_exec_dir: ${oc.env:PROJECT_ROOT}/forks/FABind/fabind # the FABind directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_csv_path: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_inputs.csv # the input CSV filepath with which to run inference
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein-ligand complex directory to recursively parse
num_threads: 1 # the number of threads to use for inference
save_mols_dir: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_temp_files/mol # a temporary directory in which to save the intermediate RDKit molecules
save_pt_dir: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_temp_files # a temporary directory in which to save the intermediate PyTorch tensors
ckpt_path: ${oc.env:PROJECT_ROOT}/forks/FABind/ckpt/best_model.bin # the checkpoint path to use for inference
output_dir: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_output_${repeat_index} # the output directory to which to save the inference results
repeat_index: 1 # the repeat index to use for inference
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

DynamicBind inference¶

model/dynamicbind_inference.yaml¶

cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
python_exec_path: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/DynamicBind/bin/python3 # the Python executable to use
dynamicbind_exec_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind # the DynamicBind directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein-ligand complex directory to recursively parse for protein inputs
input_ligand_csv_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/dynamicbind_${dataset}_inputs # the input CSV directory with which to run inference
samples_per_complex: 5 # the number of samples to generate per complex
savings_per_complex: 1 # the (top-N) number of sample visualizations to save per complex
inference_steps: 20 # the number of inference steps to run for each complex
batch_size: 5 # the batch size to use for inference
cache_path: ${oc.env:PROJECT_ROOT}/data/dynamicbind_cache/cache # the cache directory to use for storing intermediate data files
header: ${dataset} # name of the results directory to create
num_workers: 1 # the number of workers to use for native relaxation during inference
skip_existing: true # whether to skip existing predictions
repeat_index: 1 # the repeat index to use for inference
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

NeuralPLexer inference¶

model/neuralplexer_inference.yaml¶

python_exec_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/NeuralPLexer/bin/python3 # the Python executable to use
neuralplexer_exec_dir: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer # the NeuralPLexer directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_csv_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_${dataset}_inputs.csv # the input CSV filepath to which parsed input data has been written
skip_existing: true # whether to skip existing predictions
task: batched_structure_sampling # the task to run - NOTE: must be one of (`single_sample_trajectory`, `batched_structure_sampling`, `structure_prediction_benchmarking`, `pdbbind_benchmarking`, `binding_site_recovery_benchmarking`)
sample_id: 0 # the sample ID to use for inference
template_id: 0 # the template ID to use for inference
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
model_checkpoint: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/neuralplexermodels_downstream_datasets_predictions/models/complex_structure_prediction.ckpt # the model checkpoint to use for inference
out_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_${dataset}_outputs_${repeat_index} # the output directory to which to write the predictions
n_samples: 5 # the number of conformations to generate per complex
chunk_size: 5 # the number of conformations to generate in parallel per complex
num_steps: 40 # the number of steps to take in the sampling process
latent_model: null # which type of latent model to use - NOTE: must be one of (`null`)
sampler: langevin_simulated_annealing # the sampler to use - NOTE: must be one of (`DDIM`, `VPSDE`, `simulated_annealing_simple`, `langevin_simulated_annealing`)
start_time: "1.0" # the start time at which to start sampling - NOTE: must be a string representation of a float
max_chain_encoding_k: -1 # the maximum chain encoding `k` to use
exact_prior: false # whether to use the exact prior
discard_ligand: false # whether to discard the ligand
discard_sdf_coords: true # whether to discard the SDF coordinates
detect_covalent: false # whether to detect covalent bonds
use_template: true # whether to use the input template protein structure
separate_pdb: true # whether to separate the predicted protein structures into dedicated PDB files
rank_outputs_by_confidence: true # whether to rank the output conformations, by default, by ligand confidence (if available) and by protein confidence otherwise
frozen_prot: false # whether to freeze the protein structure's geometry updates
plddt_ranking_type: ligand # the type of plDDT ranking to apply to generated samples - NOTE: must be one of (`protein`, `ligand`, `protein_ligand`)
no_ilcl: false # whether to score the NeuralPLexer weights trained without an inter-ligand clash loss (ILCL)
csv_path: null # the CSV filepath from which to parse benchmarking input data
repeat_index: 1 # the repeat index to use for inference
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

FlowDock inference¶

model/flowdock_inference.yaml¶

python_exec_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/FlowDock/bin/python3 # the Python executable to use
flowdock_exec_dir: ${oc.env:PROJECT_ROOT}/forks/FlowDock # the FlowDock directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_csv_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_${dataset}_inputs.csv # the input CSV filepath to which parsed input data has been written
skip_existing: true # whether to skip existing predictions
sampling_task: batched_structure_sampling # the task to run - NOTE: must be one of (`batched_structure_sampling`)
sample_id: 0 # the sample ID to use for inference
input_receptor: null # NOTE: must be either a protein sequence string (with chains separated by `|`) or a path to a PDB file (from which protein chain sequences will be parsed)
input_ligand: null # NOTE: must be either a ligand SMILES string (with chains/fragments separated by `|`) or a path to a ligand SDF file (from which ligand SMILES will be parsed)
input_template: null # path to a protein PDB file to use as a starting protein template for sampling (with an ESMFold prior model)
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
model_checkpoint: ${oc.env:PROJECT_ROOT}/forks/FlowDock/checkpoints/esmfold_prior_paper_weights.ckpt # the model checkpoint to use for inference
out_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_${dataset}_outputs_${repeat_index} # the output directory to which to write the predictions
n_samples: 5 # the number of conformations to generate per complex
chunk_size: 5 # the number of conformations to generate in parallel per complex
num_steps: 40 # the number of steps to take in the sampling process
latent_model: null # which type of latent model to use - NOTE: must be one of (`null`)
sampler: VDODE # sampling algorithm to use - NOTE: must be one of (`ODE`, `VDODE`)
sampler_eta: 1.0 # the variance diminishing factor for the `VDODE` sampler - NOTE: offers a trade-off between exploration (1.0) and exploitation (> 1.0)
start_time: "1.0" # the start time at which to start sampling - NOTE: must be a string representation of a float
max_chain_encoding_k: -1 # the maximum chain encoding `k` to use
exact_prior: false # whether to use the exact prior
prior_type: esmfold # the type of prior to use for sampling - NOTE: must be one of (`gaussian`, `harmonic`, `esmfold`)
discard_ligand: false # whether to discard the ligand
discard_sdf_coords: true # whether to discard the SDF coordinates
detect_covalent: false # whether to detect covalent bonds
use_template: true # whether to use the input template protein structure
separate_pdb: true # whether to separate the predicted protein structures into dedicated PDB files
rank_outputs_by_confidence: true # whether to rank the output conformations, by default, by ligand confidence (if available) and by protein confidence otherwise
plddt_ranking_type: ligand # the type of plDDT ranking to apply to generated samples - NOTE: must be one of (`protein`, `ligand`, `protein_ligand`)
visualize_sample_trajectories: false # whether to visualize the generated samples' trajectories
auxiliary_estimation_only: false # whether to only estimate auxiliary outputs (e.g., confidence, affinity) for the input (generated) samples (potentially derived from external sources)
auxiliary_estimation_input_dir: null # if provided, an input directory of PDB-SDF file pairs (potentially derived from external sources) for which to estimate auxiliary outputs (e.g., confidence, affinity) and represent the outputs as a rank-ordered CSV file within the same directory
csv_path: null # the CSV filepath from which to parse benchmarking input data
esmfold_chunk_size: null # chunks axial attention computation to reduce memory usage from O(L^2) to O(L); equivalent to running a for loop over chunks of of each dimension; lower values will result in lower memory usage at the cost of speed; recommended values: 128, 64, 32
repeat_index: 1 # the repeat index to use for inference
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

RoseTTAFold-All-Atom inference¶

model/rfaa_inference.yaml¶

python_exec_path: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/RFAA/bin/python3 # the Python executable to use
rfaa_exec_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom # the RoseTTAFold-All-Atom directory in which to execute the inference scripts
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset} # the input directory with which to run inference
config_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/rf2aa/config/inference # the config directory with which to run inference
output_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_outputs/${dataset}_${repeat_index} # the output directory to which to save the inference results
max_cycles: 10 # the maximum number of recycling iterations to run
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
run_inference_directly: false # whether to run the inference code directly (true) or to rely on the user to run the generated scripts manually (false)
inference_config_name: null # the name of the inference config to use - NOTE: if `run_inference_directly` is true, this must reference a valid YAML config file name e.g., that was generated by `python posebench/models/rfaa_inference.py` with `run_inference_directly=false`
inference_dir_name: null # the name of the inference output directory to use
repeat_index: 1 # the repeat index to use for inference
skip_existing: true # whether to skip running inference if the prediction for a target already exists
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

Chai-1 inference¶

model/chai_inference.yaml¶

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_inputs/${dataset} # the input directory with which to run inference
output_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_outputs/${dataset}_${repeat_index} # the output directory to which to save the inference results
msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_chai_msas # the directory containing the `.aligned.pqt` MSA files prepared for Chai-1 via `posebench/data/components/prepare_chai_msas.py`; if not provided, Chai-1 will be run in single-sequence mode
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
repeat_index: 1 # the repeat index to use for inference
skip_existing: true # whether to skip running inference if the prediction for a target already exists
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

Boltz-1 inference¶

model/boltz_inference.yaml¶

model: boltz1 # the model to use for inference - NOTE: must be one of (`boltz1`, `boltz2`)
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_dir: ${oc.env:PROJECT_ROOT}/forks/boltz/prediction_inputs/${dataset} # the input directory with which to run inference
output_dir: ${oc.env:PROJECT_ROOT}/forks/boltz/prediction_outputs/${dataset}_${repeat_index} # the output directory to which to save the inference results
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
repeat_index: 1 # the repeat index to use for inference
use_potentials: true # whether to use inference-time potentials to potentially improve pose plausibility
skip_existing: true # whether to skip running inference if the prediction for a target already exists
pocket_only_baseline: false # whether to run the pocket-only baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference

Vina inference¶

model/vina_inference.yaml¶

# General inference arguments:
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
method: diffdock # the method from which to derive binding site predictions - NOTE: must be one of (`diffdock`, `dynamicbind`, `neuralplexer`, `flowdock`, `p2rank`, `ensemble`)
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
python2_exec_path: ${oc.env:PROJECT_ROOT}/forks/Vina/ADFR/bin/python # the path to the Python 2 executable
p2rank_exec_path: ${oc.env:PROJECT_ROOT}/forks/P2Rank/p2rank_2.4.2/prank # the path to the P2Rank executable
prepare_receptor_script_path: ${oc.env:PROJECT_ROOT}/forks/Vina/ADFR/CCSBpckgs/AutoDockTools/Utilities24/prepare_receptor4.py # the path to the prepare_receptor.py script
input_dir: ${resolve_method_output_dir:${method},${dataset},${method},${ensemble_ranking_method},${repeat_index},${pocket_only_baseline},${v1_baseline}} # the input directory with which to run inference
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
output_dir: ${oc.env:PROJECT_ROOT}/data/test_cases/${dataset}/vina_${method}_${dataset}_outputs_${repeat_index} # the output directory to which to save the inference results
cpu: 0 # the number of CPU workers to use with AutoDock Vina for parallel processing, 0 for all available
seed: null # the random seed to use with AutoDock Vina
exhaustiveness: 32 # the exhaustiveness to use with AutoDock Vina
ligand_ligand_distance_threshold: 25.0 # the distance threshold (in Angstrom) to use for finding shared binding sites amongst ligands
protein_ligand_distance_threshold: 10.0 # the heavy-atom distance threshold (in Angstrom) to use for finding protein binding sites amongst grouped ligands
binding_site_size_x: 25.0 # the x-axis size of the binding site box to use with AutoDock Vina
binding_site_size_y: 25.0 # the y-axis size of the binding site box to use with AutoDock Vina
binding_site_size_z: 25.0 # the z-axis size of the binding site box to use with AutoDock Vina
binding_site_spacing: 1.0 # the spacing of the binding site box (in Angstrom) to use with AutoDock Vina
num_modes: 40 # the number of binding modes (i.e., poses) to generate with AutoDock Vina
skip_existing: true # whether to skip existing output files
protein_filepath: null # the protein file path to use for inference
ligand_filepaths: null # the ligand file paths to use for inference
apo_protein_filepath: null # the apo protein file path to use for inference
input_id: null # the input ID to use for inference
repeat_index: 1 # the repeat index to use for inference
pocket_only_baseline: false # whether to run the pocket-only baseline
v1_baseline: false # whether to run the v1 baseline
max_num_inputs: null # if provided, the number of (dataset subset) inputs over which to run inference
# p2rank inference arguments:
p2rank_exec_utility: predict # the P2Rank executable utility to use for inference
p2rank_config: alphafold # the P2Rank configuration to use for inference
p2rank_enable_pymol_visualizations: false # whether to enable P2Rank's PyMOL visualizations

Ensemble inference¶

This configuration is used to specify how inference is performed with a method ensemble (e.g., via consensus ranking).

Note

This script not only enables inference with a method ensemble, but it also provides a unified wrapper with which one can relax and structure a method’s predictions in a CASP-compliant file format for scoring.

Ensemble generation¶

model/ensemble_generation.yaml¶

# General inference arguments:
ensemble_methods: [diffdock, dynamicbind, neuralplexer, rfaa] # the methods from which to gather predictions for ensembling - NOTE: must be one of (`diffdock`, `dynamicbind`, `neuralplexer`, `flowdock`, `rfaa`, `chai-lab`, `boltz`, `alphafold3`, `vina`, `tulip`)
generate_vina_scripts: false # whether to generate Vina scripts using other methods' binding site predictions - NOTE: `resume` must also be `true` when this is `true`, meaning other methods' predictions must have already been generated locally
rank_single_method_intrinsically: true # whether to rank single-method predictions using either `consensus` or `vina` ranking (false) or instead using their intrinsic (explicit) rank assignment (true)
output_bash_file_dir: ensemble_generation_scripts # the directory in which to save the generated Bash scripts
cuda_device_index: 0 # the CUDA device to use for inference, or `null` to use CPU
input_csv_filepath: ${oc.env:PROJECT_ROOT}/data/test_cases/5S8I_2LY/ensemble_inputs.csv # path to a CSV file containing the following columns: `protein_input`, `ligand_smiles`, `name`
temp_protein_dir: ${oc.env:PROJECT_ROOT}/data/ensemble_proteins # directory path to which to save temporary predicted protein structures
structure_prediction_script_path: ${oc.env:PROJECT_ROOT}/posebench/data/components/esmfold_batch_structure_prediction.py # path to the ESMFold structure prediction script to use
structure_prediction_chunk_size: null # optional chunk size to use during ESMFold structure prediction to reduce memory requirements
structure_prediction_cpu_only: false # whether to only use CPU for structure prediction
structure_prediction_cpu_offload: false # whether to offload structure prediction to CPU
max_method_predictions: 5 # maximum number of predictions to make with each method
method_top_n_to_select: 3 # number of top-ranked predictions to select from each method for subsequent ranking
skip_existing: false # whether to skip existing ensemble predictions
relax_method_ligands_pre_ranking: false # whether to relax the predicted ligands (method-specifically) before ranking
relax_method_ligands_post_ranking: true # whether to relax the predicted ligands (method-agnostically) after ranking
relax_add_solvent: false # whether to add solvent during relaxation
relax_prep_only: false # only prepare the input files for relaxation
relax_platform: "fastest" # platform on which to run relaxation
relax_log_level: "INFO" # logging level for relaxation
relax_protein: false # whether to relax the protein
relax_remove_initial_protein_hydrogens: false # whether to remove hydrogens from the initial protein
relax_assign_each_ligand_unique_force: false # whether to assign each ligand a unique force constant
relax_model_ions: true # whether to model ions
relax_cache_files: false # whether to cache the prepared relaxation files
relax_assign_partial_charges_manually: false # whether to assign partial charges manually
relax_max_final_e_value: 1000.0 # when relaxing the protein, maximum final energy value to permit
relax_max_num_attempts: 5 # when relaxing the protein, maximum number of relaxation attempts to perform
relax_num_processes: 1 # number of processes to use for relaxation
relax_skip_existing: false # whether to skip existing relaxation results
resume: false # whether to resume from a previous run after generating and manually running each method's prediction script
input_dir: null # optional path to the directory from which to load the ensemble predictions to rank - NOTE: currently, only `neuralplexer` and `flowdock` make use of this for inference output parsing
output_dir: ${oc.env:PROJECT_ROOT}/data/test_cases/5S8I_2LY/top_${ensemble_ranking_method}_ensemble_predictions # path to the directory to save the top-ranked ensemble predictions
export_file_format: casp15 # the (optional) format (i.e., `casp15`) in which to export top-ranked predictions
export_top_n: 5 # number of top-ranked predictions to export in CASP format
casp_author: "001" # group number to report in CASP format
casp_method: "Ligand_Predictor" # the method name to report in CASP format
combine_casp_output_files: false # whether to combine the CASP protein and ligand output files into a single file
generate_hpc_scripts: false # whether to generate HPC scripts for running the ensemble generation; if `false`, then local scripts will be generated instead
pocket_only_baseline: false # whether to run ensemble generation with only pocket-based baseline methods
ensemble_benchmarking: false # whether to run ensemble benchmarking
ensemble_benchmarking_dataset: posebusters_benchmark # the dataset to use for ensemble benchmarking - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
ensemble_benchmarking_repeat_index: 1 # the repeat index to use for ensemble benchmarking
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
ensemble_benchmarking_apo_protein_dir: ${oc.env:PROJECT_ROOT}/data/${ensemble_benchmarking_dataset}_set/${ensemble_benchmarking_dataset}_holo_aligned_predicted_structures # the directory containing the apo proteins to use for ensemble benchmarking
# DiffDock inference arguments:
diffdock_python_exec_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/DiffDock/bin/python3 # the Python executable to use
diffdock_exec_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock # the DiffDock directory in which to execute the inference scripts
diffdock_input_csv_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_ensemble_inputs.csv # the input CSV filepath with which to run inference
diffdock_inference_config_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/default_inference_args.yaml # the inference configuration file to use
diffdock_output_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_ensemble_output # the output directory to which to save the inference results
diffdock_model_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/workdir/v1.1/score_model # the directory in which the trained model is saved
diffdock_confidence_model_dir: ${oc.env:PROJECT_ROOT}/forks/DiffDock/workdir/v1.1/confidence_model # the directory in which the trained confidence model is saved
diffdock_inference_steps: 20 # the maximum number of inference (reverse diffusion) steps to run
diffdock_samples_per_complex: ${max_method_predictions} # the number of samples to generate per complex
diffdock_batch_size: 5 # the batch size to use for inference
diffdock_actual_steps: 19 # the actual number of inference steps to run (i.e., after how many steps to halt the reverse diffusion process)
diffdock_no_final_step_noise: true # whether to disable the final inference step's noise from being added
diffdock_skip_existing: true # whether to skip existing predictions
diffdock_v1_baseline: false # whether to run the v1 baseline
# DynamicBind inference arguments:
dynamicbind_python_exec_path: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/DynamicBind/bin/python3 # the Python executable to use
dynamicbind_exec_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind # the DynamicBind directory in which to execute the inference scripts
dynamicbind_dataset: ensemble # the dataset to use for inference - NOTE: must be one of (`ensemble`)
dynamicbind_input_protein_data_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/ensemble_predicted_structures # the input protein-ligand complex directory to recursively parse for protein inputs
dynamicbind_input_ligand_csv_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/dynamicbind_ensemble_inputs # the input CSV directory with which to run inference
dynamicbind_samples_per_complex: ${max_method_predictions} # the number of samples to generate per complex
dynamicbind_savings_per_complex: 1 # the (top-N) number of sample visualizations to save per complex
dynamicbind_inference_steps: 20 # the number of inference steps to run for each complex
dynamicbind_batch_size: 5 # the batch size to use for inference
dynamicbind_header: ensemble # name of the results directory to create
dynamicbind_num_workers: 1 # the number of workers to use for native relaxation during inference
dynamicbind_skip_existing: true # whether to skip existing predictions
# NeuralPLexer inference arguments:
neuralplexer_python_exec_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/NeuralPLexer/bin/python3 # the Python executable to use
neuralplexer_exec_dir: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer # the NeuralPLexer directory in which to execute the inference scripts
neuralplexer_input_data_dir: null # the input protein-ligand complex directory to recursively parse
neuralplexer_input_receptor_structure_dir: null # if not `null`, the input template protein structure directory to parse
neuralplexer_input_csv_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_ensemble_inputs.csv # the input CSV filepath with which to run inference
neuralplexer_skip_existing: true # whether to skip existing predictions
neuralplexer_task: batched_structure_sampling # the task to run - NOTE: must be one of (`single_sample_trajectory`, `batched_structure_sampling`, `structure_prediction_benchmarking`, `pdbbind_benchmarking`, `binding_site_recovery_benchmarking`)
neuralplexer_sample_id: 0 # the sample ID to use for inference
neuralplexer_template_id: 0 # the template ID to use for inference
neuralplexer_model_checkpoint: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/neuralplexermodels_downstream_datasets_predictions/models/complex_structure_prediction.ckpt # the model checkpoint to use for inference
neuralplexer_out_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_ensemble_outputs # the output directory to which to write the predictions
neuralplexer_n_samples: ${max_method_predictions} # the number of conformations to generate per complex
neuralplexer_chunk_size: 5 # the number of conformations to generate in parallel per complex
neuralplexer_num_steps: 40 # the number of steps to take in the sampling process
neuralplexer_sampler: langevin_simulated_annealing # the sampler to use - NOTE: must be one of (`DDIM`, `VPSDE`, `simulated_annealing_simple`, `langevin_simulated_annealing`)
neuralplexer_start_time: "1.0" # the start time at which to start sampling - NOTE: must be a string representation of a float
neuralplexer_max_chain_encoding_k: -1 # the maximum chain encoding `k` to use
neuralplexer_exact_prior: false # whether to use the exact prior
neuralplexer_discard_ligand: false # whether to discard the ligand
neuralplexer_discard_sdf_coords: true # whether to discard the SDF coordinates
neuralplexer_detect_covalent: false # whether to detect covalent bonds
neuralplexer_use_template: true # whether to use the input template protein structure
neuralplexer_separate_pdb: true # whether to separate the predicted protein structures into dedicated PDB files
neuralplexer_rank_outputs_by_confidence: true # whether to rank the output conformations, by default, by ligand confidence (if available) and by protein confidence otherwise
neuralplexer_plddt_ranking_type: ligand # the type of plDDT ranking to apply to generated samples - NOTE: must be one of (`protein`, `ligand`, `protein_ligand`)
neuralplexer_no_ilcl: false # whether to score the NeuralPLexer weights trained without an inter-ligand clash loss (ILCL)
# FlowDock inference arguments:
flowdock_python_exec_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/FlowDock/bin/python3 # the Python executable to use
flowdock_exec_dir: ${oc.env:PROJECT_ROOT}/forks/FlowDock # the FlowDock directory in which to execute the inference scripts
flowdock_input_data_dir: null # the input protein-ligand complex directory to recursively parse
flowdock_input_receptor_structure_dir: null # if not `null`, the input template protein structure directory to parse
flowdock_input_csv_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_ensemble_inputs.csv # the input CSV filepath to which parsed input data has been written
flowdock_skip_existing: true # whether to skip existing predictions
flowdock_sampling_task: batched_structure_sampling # the task to run - NOTE: must be one of (`batched_structure_sampling`)
flowdock_sample_id: 0 # the sample ID to use for inference
flowdock_input_receptor: null # NOTE: must be either a protein sequence string (with chains separated by `|`) or a path to a PDB file (from which protein chain sequences will be parsed)
flowdock_input_ligand: null # NOTE: must be either a ligand SMILES string (with chains/fragments separated by `|`) or a path to a ligand SDF file (from which ligand SMILES will be parsed)
flowdock_input_template: null # path to a protein PDB file to use as a starting protein template for sampling (with an ESMFold prior model)
flowdock_model_checkpoint: ${oc.env:PROJECT_ROOT}/forks/FlowDock/checkpoints/esmfold_prior_paper_weights.ckpt # the model checkpoint to use for inference
flowdock_out_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_ensemble_outputs # the output directory to which to write the predictions
flowdock_n_samples: 5 # the number of conformations to generate per complex
flowdock_chunk_size: 5 # the number of conformations to generate in parallel per complex
flowdock_num_steps: 40 # the number of steps to take in the sampling process
flowdock_latent_model: null # which type of latent model to use - NOTE: must be one of (`null`)
flowdock_sampler: VDODE # sampling algorithm to use - NOTE: must be one of (`ODE`, `VDODE`)
flowdock_sampler_eta: 1.0 # the variance diminishing factor for the `VDODE` sampler - NOTE: offers a trade-off between exploration (1.0) and exploitation (> 1.0)
flowdock_start_time: "1.0" # the start time at which to start sampling - NOTE: must be a string representation of a float
flowdock_max_chain_encoding_k: -1 # the maximum chain encoding `k` to use
flowdock_exact_prior: false # whether to use the exact prior
flowdock_prior_type: esmfold # the type of prior to use for sampling - NOTE: must be one of (`gaussian`, `harmonic`, `esmfold`)
flowdock_discard_ligand: false # whether to discard the ligand
flowdock_discard_sdf_coords: true # whether to discard the SDF coordinates
flowdock_detect_covalent: false # whether to detect covalent bonds
flowdock_use_template: true # whether to use the input template protein structure
flowdock_separate_pdb: true # whether to separate the predicted protein structures into dedicated PDB files
flowdock_rank_outputs_by_confidence: true # whether to rank the output conformations, by default, by ligand confidence (if available) and by protein confidence otherwise
flowdock_plddt_ranking_type: ligand # the type of plDDT ranking to apply to generated samples - NOTE: must be one of (`protein`, `ligand`, `protein_ligand`)
flowdock_visualize_sample_trajectories: false # whether to visualize the generated samples' trajectories
flowdock_auxiliary_estimation_only: false # whether to only estimate auxiliary outputs (e.g., confidence, affinity) for the input (generated) samples (potentially derived from external sources)
flowdock_auxiliary_estimation_input_dir: null # if provided, an input directory of PDB-SDF file pairs (potentially derived from external sources) for which to estimate auxiliary outputs (e.g., confidence, affinity) and represent the outputs as a rank-ordered CSV file within the same directory
flowdock_csv_path: null # the CSV filepath from which to parse benchmarking input data
flowdock_esmfold_chunk_size: null # chunks axial attention computation to reduce memory usage from O(L^2) to O(L); equivalent to running a for loop over chunks of of each dimension; lower values will result in lower memory usage at the cost of speed; recommended values: 128, 64, 32
# RoseTTAFold-All-Atom inference arguments:
rfaa_python_exec_path: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/RFAA/bin/python3 # the Python executable to use
rfaa_exec_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom # the RoseTTAFold-All-Atom directory in which to execute the inference scripts
rfaa_config_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/rf2aa/config/inference # the config directory with which to run inference
rfaa_output_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/inference/rfaa_ensemble_outputs # the output directory to which to save the inference results
rfaa_max_cycles: 10 # the maximum number recycling iterations to run
rfaa_inference_config_name: null # the name of the inference config to use - NOTE: if `run_inference_directly` is true, this must reference a valid YAML config file name e.g., that was generated by `python posebench/models/rfaa_inference.py` with `run_inference_directly=false`
rfaa_inference_dir_name: null # the name of the inference output directory to use
# Chai-1 inference arguments:
chai_out_path: ${oc.env:PROJECT_ROOT}/forks/chai-lab/inference/chai-lab_ensemble_outputs # the output directory to which to write the predictions
chai_skip_existing: true # whether to skip running inference if the prediction for a target already exists
# Boltz inference arguments:
boltz_out_path: ${oc.env:PROJECT_ROOT}/forks/boltz/inference/boltz_ensemble_outputs # the output directory to which to write the predictions
boltz_skip_existing: true # whether to skip running inference if the prediction for a target already exists
# AlphaFold 3 inference arguments:
alphafold3_out_path: ${oc.env:PROJECT_ROOT}/forks/alphafold3/inference/alphafold3_ensemble_outputs # the output directory to which to write the predictions
# Vina inference arguments:
vina_binding_site_methods: [p2rank] # the methods to use for Vina binding site prediction - NOTE: must be one of (`diffdock`, `dynamicbind`, `neuralplexer`, `flowdock`, `p2rank`)
vina_python2_exec_path: ${oc.env:PROJECT_ROOT}/forks/Vina/ADFR/bin/python # the path to the Python 2 executable
vina_prepare_receptor_script_path: ${oc.env:PROJECT_ROOT}/forks/Vina/ADFR/CCSBpckgs/AutoDockTools/Utilities24/prepare_receptor4.py # the path to the prepare_receptor.py script
vina_output_dir: ${oc.env:PROJECT_ROOT}/forks/Vina/inference/vina_ensemble_outputs # the output directory to which to save the inference results
vina_cpu: 0 # the number of CPU workers to use with AutoDock Vina for parallel processing, 0 for all available
vina_seed: null # the random seed to use with AutoDock Vina
vina_exhaustiveness: 32 # the exhaustiveness to use with AutoDock Vina
vina_ligand_ligand_distance_threshold: 25.0 # the distance threshold (in Angstrom) to use for finding shared binding sites amongst ligands
vina_protein_ligand_distance_threshold: 10.0 # the distance threshold (in Angstrom) to use for finding protein binding sites amongst grouped ligands
vina_binding_site_size_x: 25.0 # the x-axis size of the binding site box to use with AutoDock Vina
vina_binding_site_size_y: 25.0 # the y-axis size of the binding site box to use with AutoDock Vina
vina_binding_site_size_z: 25.0 # the z-axis size of the binding site box to use with AutoDock Vina
vina_binding_site_spacing: 1.0 # the spacing of the binding site box (in Angstrom) to use with AutoDock Vina
vina_num_modes: 40 # the number of binding modes (i.e., poses) to generate with AutoDock Vina
vina_skip_existing: true # whether to skip existing output files
vina_p2rank_exec_utility: predict # the P2Rank executable utility to use for inference
vina_p2rank_config: alphafold # the P2Rank configuration to use for inference
vina_p2rank_enable_pymol_visualizations: false # whether to enable P2Rank's PyMOL visualizations
# TULIP inference arguments:
tulip_output_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/inference/tulip_ensemble_outputs # the output directory to which to save the inference results

Structure relaxation¶

These configurations are used to specify how relaxation is (optionally) applied to a predicted protein-ligand complex structure using molecular dynamics (i.e., OpenMM).

Note

The inference_relaxation configuration describes the behavior of the script that serves as an entry point for the relaxation process. The minimize_energy configuration is a multi-ligand generalization of the main energy minimization script originally implemented for the PoseBusters software suite.

Inference relaxation (entry point)¶

model/inference_relaxation.yaml¶

method: diffdock # the method for which to relax predictions - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `flowdock`, `rfaa`, `chai-lab`, `boltz`, `alphafold3`, `vina`, `tulip`)
vina_binding_site_method: p2rank # the method to use for Vina binding site prediction - NOTE: must be one of (`diffdock`, `fabind`, `dynamicbind`, `neuralplexer`, `flowdock`, `rfaa`, `chai-lab`, `boltz`, `alphafold3`, `p2rank`)
dataset: posebusters_benchmark # the dataset for which to relax predictions - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
ensemble_ranking_method: consensus # the method with which to rank-order and select the top ensemble prediction for each target - NOTE: must be one of (`consensus`, `ff`)
num_processes: 1 # the number of parallel processes to use for relaxation
temp_dir: ${method}_${dataset}_cache_dir # temporary directory
add_solvent: false # whether to add solvent during relaxation
name: null # name of the relaxation target
prep_only: false # only prepare the input files
platform: "fastest" # platform on which to run relaxation
cuda_device_index: 0 # CUDA device index
log_level: "INFO" # logging level
protein_dir: ${resolve_method_protein_dir:${method},${dataset},${repeat_index},${pocket_only_baseline}} # the directory from which to load (potentially inferred) proteins
ligand_dir: ${resolve_method_ligand_dir:${method},${dataset},${vina_binding_site_method},${repeat_index},${pocket_only_baseline},${v1_baseline}} # the directory from which to load inferred ligands
output_dir: ${resolve_method_output_dir:${method},${dataset},${vina_binding_site_method},${ensemble_ranking_method},${repeat_index},${pocket_only_baseline},${v1_baseline}} # the output directory to which to save the relaxed predictions
relax_protein: false # whether to relax the protein - NOTE: currently periodically yields unpredictable protein-ligand separation
remove_initial_protein_hydrogens: false # whether to remove hydrogens from the initial protein
assign_each_ligand_unique_force: false # when relaxing the protein, whether to assign each ligand a unique force constant
model_ions: true # whether to model ions
cache_files: false # whether to cache the prepared relaxation files
assign_partial_charges_manually: false # whether to assign partial charges manually
report_initial_energy_only: false # skip relaxation and return only the initial energy of the complex structure
max_final_e_value: 1000.0 # when relaxing the protein, maximum final energy value to permit
max_num_attempts: 5 # when relaxing the protein, maximum number of relaxation attempts to perform
skip_existing: true # whether to skip existing relaxed predictions
repeat_index: 1 # the repeat index which was used for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
v1_baseline: false # whether to prepare the v1 baseline

Minimize energy (relaxation engine)¶

model/minimize_energy.yaml¶

protein_file: ??? # input protein file
ligand_file: ??? # input ligand file
output_file: ??? # ligand output file
protein_output_file: null # optional protein output file
complex_output_file: null # optional complex output file
temp_dir: "cache_dir" # temporary directory
add_solvent: false # whether to add solvent during relaxation
name: null # name of the relaxation target
prep_only: false # only prepare the input files
platform: "fastest" # platform on which to run relaxation
cuda_device_index: 0 # CUDA device index
log_level: "INFO" # logging level
relax_protein: false # whether to relax the protein
remove_initial_protein_hydrogens: false # whether to remove hydrogens from the initial protein
assign_each_ligand_unique_force: false # whether to assign each ligand a unique force constant
model_ions: true # whether to model ions
cache_files: true # whether to cache the prepared relaxation files
assign_partial_charges_manually: false # whether to assign partial charges manually
report_initial_energy_only: false # skip relaxation and return only the initial energy of the complex structure
max_final_e_value: 1000.0 # when relaxing the protein, maximum final energy value to permit
max_num_attempts: 5 # when relaxing the protein, maximum number of relaxation attempts to perform