Ensemble generation

posebench.models.ensemble_generation.assign_reference_residue_b_factors(protein_output_files: list[str], protein_reference_filepath: str) → list[str]

If B-factor column values are not already present, assign the reference protein structure’s per-residue confidence scores to each output protein.

Parameters:

protein_output_files – List of output protein structure PDB filepaths.
protein_reference_filepath – Path to the input protein structure PDB file.

Returns:

List of output protein structure PDB filepaths with the reference protein’s per-residue confidence scores.
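
The transfer of per-residue confidence scores described above can be sketched on raw PDB lines. This is a minimal illustration and not the PoseBench implementation: the helper names `residue_b_factors` and `assign_b_factors_sketch` are hypothetical, residues are matched by chain ID and residue number (an assumption), and the check for B-factors that are already present is omitted. The column slices follow the fixed-width PDB ATOM record layout.

```python
def residue_b_factors(pdb_lines):
    """Map (chain ID, residue number + insertion code) to the first B-factor seen for that residue."""
    b_factors = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            key = (line[21], line[22:27])
            b_factors.setdefault(key, float(line[60:66]))
    return b_factors


def assign_b_factors_sketch(output_lines, reference_lines):
    """Overwrite B-factor columns 61-66 of each output atom with its reference residue's value."""
    reference = residue_b_factors(reference_lines)
    updated = []
    for line in output_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 54:
            key = (line[21], line[22:27])
            if key in reference:
                # Pad short lines so the value always lands in columns 61-66.
                line = f"{line[:60].ljust(60)}{reference[key]:6.2f}{line[66:]}"
        updated.append(line)
    return updated
```

Residues with no counterpart in the reference are left unchanged in this sketch.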

posebench.models.ensemble_generation.consensus_rank_ensemble_predictions(cfg: DictConfig, method_ligand_positions: list[ndarray], ensemble_predictions_list: list[tuple[str, str, str]]) → dict[int, tuple[str, str, str, float]]

Consensus-rank the predictions to select the top prediction(s).

Parameters:

cfg – Configuration dictionary for runtime arguments.
method_ligand_positions – List of ligand positions from each method’s predictions.
ensemble_predictions_list – List of tuples of method name, output protein filepath, and output ligand filepath.

Returns:

Dictionary of consensus-ranked predictions indexed by each prediction’s consensus ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD.
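
As a rough illustration of ranking by average pairwise RMSD, the sketch below scores each prediction by its mean RMSD to all other predictions and ranks lower scores first. It assumes every position array has the same shape and atom ordering and performs no alignment; the helper name `consensus_rank_sketch` and the 1-based rank keys are hypothetical and not taken from the PoseBench source.

```python
import numpy as np


def consensus_rank_sketch(positions, predictions):
    """Rank predictions by their mean RMSD to every other prediction (lower is better)."""
    n = len(positions)
    coords = np.stack(positions)  # shape: (n, num_atoms, 3)
    diffs = coords[:, None] - coords[None, :]  # shape: (n, n, num_atoms, 3)
    rmsd = np.sqrt((diffs**2).sum(axis=-1).mean(axis=-1))  # shape: (n, n)
    # The diagonal is zero, so divide by n - 1 to average over the other predictions only.
    average = rmsd.sum(axis=1) / max(n - 1, 1)
    order = np.argsort(average, kind="stable")
    return {
        rank: (*predictions[i], float(average[i]))
        for rank, i in enumerate(order, start=1)
    }
```

Each value in the returned dictionary mirrors the documented layout: method name, output protein filepath, output ligand filepath, and the average pairwise RMSD.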

posebench.models.ensemble_generation.create_chai_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, cfg: DictConfig, output_filepath: str | None = None, generate_hpc_scripts: bool = True)

Create a bash script to run Chai-1 protein-ligand complex prediction.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
cfg – Configuration dictionary for runtime arguments.
output_filepath – Optional path to the output bash script file.
generate_hpc_scripts – Whether to generate HPC scripts for Chai-1.

posebench.models.ensemble_generation.create_diffdock_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run DiffDock protein-ligand complex prediction.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
output_filepath – Path to the output bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for DiffDock.

posebench.models.ensemble_generation.create_dynamicbind_bash_script(protein_filepath: str, ligand_smiles: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run DynamicBind protein-ligand complex prediction.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
output_filepath – Path to the output bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for DynamicBind.

posebench.models.ensemble_generation.create_flowdock_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run FlowDock protein-ligand complex prediction.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
output_filepath – Path to the output bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for FlowDock.

posebench.models.ensemble_generation.create_neuralplexer_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run NeuralPLexer protein-ligand complex prediction.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
output_filepath – Path to the output bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for NeuralPLexer.

posebench.models.ensemble_generation.create_rfaa_bash_script(fasta_filepaths: list[str], sdf_filepaths: list[str] | None, input_id: str, cfg: DictConfig, output_filepath: str | None = None, smiles_strings: list[str] | None = None, generate_hpc_scripts: bool = True)

Create a bash script to run RoseTTAFold-All-Atom protein-ligand complex prediction.

Parameters:

fasta_filepaths – List of FASTA filepaths.
sdf_filepaths – Optional list of SDF filepaths.
input_id – Input ID.
cfg – Configuration dictionary for runtime arguments.
output_filepath – Optional path to the output bash script file.
smiles_strings – Optional list of SMILES strings of the input ligands to use directly.
generate_hpc_scripts – Whether to generate HPC scripts for RoseTTAFold-All-Atom.

posebench.models.ensemble_generation.create_temporary_fasta_file(protein_sequence: str, name: str | None = None) → str

Create a temporary FASTA file for the input protein sequence.

Parameters:

protein_sequence – Amino acid sequence of the protein.
name – Optional name of the temporary FASTA file.

Returns:

Path to the temporary FASTA file.
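
One way such a helper could work is sketched below. The function name `write_temp_fasta`, the use of `name` as both the file stem and the FASTA header, and the fallback name `protein` are assumptions made for illustration; the actual behavior may differ.

```python
import os
import tempfile


def write_temp_fasta(protein_sequence, name=None):
    """Write a single-record FASTA file into a fresh temporary directory and return its path."""
    stem = name or "protein"  # fallback name is an assumption
    output_dir = tempfile.mkdtemp()
    fasta_path = os.path.join(output_dir, f"{stem}.fasta")
    with open(fasta_path, "w") as handle:
        handle.write(f">{stem}\n{protein_sequence}\n")
    return fasta_path
```

The caller is responsible for deleting the temporary directory once the file is no longer needed.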

posebench.models.ensemble_generation.create_vina_bash_script(binding_site_method: Literal['diffdock', 'fabind', 'dynamicbind', 'neuralplexer', 'flowdock', 'rfaa'], protein_filepath: str, ligand_filepath: str, apo_protein_filepath: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run Vina-based protein-ligand complex prediction.

Parameters:

binding_site_method – Name of the method used to predict the binding site.
protein_filepath – Path to the input protein structure PDB file.
ligand_filepath – Path to the input ligand structure SDF file.
apo_protein_filepath – Path to the predicted apo protein structure PDB file.
input_id – Input ID.
output_filepath – Path to the output bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for Vina.

posebench.models.ensemble_generation.dynamically_build_rfaa_input_config(fasta_filepaths: list[str], sdf_filepaths: list[str] | None, input_id: str, cfg: DictConfig, smiles_strings: list[str] | None = None) → str

Dynamically build the RoseTTAFold-All-Atom inference configuration file for input proteins and ligands.

Parameters:

fasta_filepaths – List of FASTA filepaths.
sdf_filepaths – Optional list of SDF filepaths.
input_id – Input ID.
cfg – Configuration dictionary for runtime arguments.
smiles_strings – Optional list of SMILES strings of the input ligands to use directly.

Returns:

Path to the dynamically built configuration file.

posebench.models.ensemble_generation.export_ligands_in_casp15_format(output_ligand_filepaths: list[str], output_ligand_sdf_file: str, sdf_header: str, method: str, append: bool = False, model_index: int | None = None, ligand_numbers: str | list[int] | None = None, ligand_names: str | list[str] | None = None)

Export the predicted ligand structures in CASP15 format.

Note that, for consistency when evaluating deep learning docking methods on the CASP15 benchmark, only a single pose is reported per protein-ligand model (i.e., 5 submitted protein-ligand models with 1 pose per model rather than 5 poses per model).

Parameters:

output_ligand_filepaths – List of output ligand structure SDF filepaths.
output_ligand_sdf_file – Path to the output ligand structure SDF file.
sdf_header – Header string for the SDF file.
method – Method name.
append – Whether to append the predicted ligand structures to the output file.
model_index – Optional index of the model to write to the SDF file.
ligand_numbers – Optional list of ligand numbers represented as a _-delimited string or a list of integers.
ligand_names – Optional list of ligand names represented as a _-delimited string or a list of strings.
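
Because `ligand_numbers` accepts either a `_`-delimited string or a list of integers, callers typically need to normalize it first. The sketch below shows one way to do so; the helper name `parse_ligand_numbers` is hypothetical and the handling of empty tokens is an assumption.

```python
def parse_ligand_numbers(ligand_numbers):
    """Normalize a `_`-delimited string or a list of integers into a list of integers."""
    if ligand_numbers is None:
        return None
    if isinstance(ligand_numbers, str):
        # Skip empty tokens such as those produced by a trailing underscore.
        return [int(token) for token in ligand_numbers.split("_") if token]
    return [int(number) for number in ligand_numbers]
```

For example, `parse_ligand_numbers("1_2_3")` and `parse_ligand_numbers([1, 2, 3])` yield the same list.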

posebench.models.ensemble_generation.export_proteins_in_casp_format(output_protein_filepaths: list[str], output_protein_pdb_file: str, pdb_header: str, append: bool = False, export_casp15_format: bool = False, model_index: int | None = None, gap_insertion_point: int | None = None)

Export the predicted protein structures in CASP format.

Parameters:

output_protein_filepaths – List of output protein structure PDB filepaths.
output_protein_pdb_file – Path to the output protein structure PDB file.
pdb_header – Header string for the PDB file.
append – Whether to append the predicted protein structures to the output file.
export_casp15_format – Whether to format the output file for CASP15 benchmarking.
model_index – Optional index of the model to write to the PDB file.
gap_insertion_point – Optional :-separated string representing the chain-residue pair index of the residue at which to insert a single index gap.

posebench.models.ensemble_generation.ff_rank_ensemble_predictions(cfg: DictConfig, ensemble_predictions_list: list[tuple[str, str, str]]) → dict[int, tuple[str, str, str, float]]

Rank the predictions using an OpenMM force field (FF) to select the top prediction(s) according to the criterion of minimum energy.

Parameters:

cfg – Configuration dictionary for runtime arguments.
ensemble_predictions_list – List of tuples of method name, output protein filepath, and output ligand filepath.

Returns:

Dictionary of FF-ranked predictions indexed by each prediction’s ranking and valued as its method name, output protein filepath, output ligand filepath, and energy score.
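
The minimum-energy criterion amounts to sorting predictions by energy in ascending order. The sketch below assumes the energies have already been computed (the OpenMM evaluation itself is omitted); the helper name `rank_by_energy` and the 1-based rank keys are hypothetical.

```python
def rank_by_energy(predictions, energies):
    """Rank predictions from lowest to highest energy and attach each energy to its prediction."""
    order = sorted(range(len(predictions)), key=lambda i: energies[i])
    return {
        rank: (*predictions[i], float(energies[i]))
        for rank, i in enumerate(order, start=1)
    }
```

Because Python's sort is stable, predictions with equal energies keep their input order.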

posebench.models.ensemble_generation.generate_ensemble_predictions(protein_filepath: str, ligand_smiles: str, input_id: str, cfg: DictConfig, generate_hpc_scripts: bool = True, method_filepaths_mapping: dict[str, list[tuple[str, str]]] | None = None) → tuple[dict[str, list[tuple[str, str]]] | None, bool]

Generate bound complex predictions using an ensemble of methods.

Parameters:

protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for the ensemble predictions.
method_filepaths_mapping – Optional mapping of method names to a list of tuples of protein and ligand filepaths.

Returns:

Dictionary of method names and their corresponding predictions as well as whether the prediction scripts were generated and now need to be run.

posebench.models.ensemble_generation.generate_method_prediction_script(method: str, protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool, method_filepaths_mapping: dict[str, list[tuple[str, str]]] | None = None)

Generate a script to run the method’s protein-ligand complex prediction.

Parameters:

method – Name of the method to generate a prediction script for.
protein_filepath – Path to the input protein structure PDB file.
ligand_smiles – SMILES string of the input ligand.
input_id – Input ID.
output_filepath – Path to the output Bash script file.
cfg – Configuration dictionary for runtime arguments.
generate_hpc_scripts – Whether to generate HPC scripts for the method.
method_filepaths_mapping – Optional mapping of method names to a list of tuples of protein and ligand filepaths.

posebench.models.ensemble_generation.get_method_predictions(method: str, target: str, cfg: DictConfig, binding_site_method: str | None = None, input_protein_filepath: str | None = None, is_ss_method: bool = False) → list[tuple[str, str]]

Get the predictions generated by the method.

Parameters:

method – Name of the method to get predictions for.
target – Name of the target protein-ligand pair.
cfg – Configuration dictionary for runtime arguments.
binding_site_method – Optional name of the method used to predict AutoDock Vina’s binding sites.
input_protein_filepath – Optional path to the input protein structure PDB file.
is_ss_method – Whether the method is a single-sequence method.

Returns:

List of method predictions, each as a tuple of the output protein filepath and the output ligand filepath.

posebench.models.ensemble_generation.insert_hpc_headers(method: str, gpu_partition: str = 'chengji-lab-gpu', gpu_account: str = 'chengji-lab', gpu_type: Literal['A100', 'H100', ''] = '', cpu_memory_in_gb: int = 59, time_limit: str = '7-00:00:00') → str

Insert batch headers for SLURM job scheduling.

Parameters:

method – Name of the method for which to generate a prediction script.
gpu_partition – Name of the GPU partition to use.
gpu_account – Name of the GPU account to use.
gpu_type – Type of GPU to request: 'A100', 'H100', or '' (the default).
cpu_memory_in_gb – Amount of CPU memory to request in GB.
time_limit – Time limit for the job as a SLURM-compatible string.

Returns:

Batch headers string for SLURM job scheduling.
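
For orientation, a batch header string of this kind could be assembled as in the sketch below. This is an assumption-laden illustration and not the function's actual output: the helper name `slurm_headers_sketch`, the choice of directives, the request for a single GPU, and the treatment of an empty `gpu_type` as "any GPU" are all hypothetical. The `#SBATCH` options themselves are standard SLURM options.

```python
def slurm_headers_sketch(
    method,
    gpu_partition,
    gpu_account,
    gpu_type="",
    cpu_memory_in_gb=59,
    time_limit="7-00:00:00",
):
    """Build a block of #SBATCH directives for a single-GPU job."""
    # An empty gpu_type is treated here as "no specific GPU model" (assumption).
    gres = f"gpu:{gpu_type}:1" if gpu_type else "gpu:1"
    lines = [
        f"#SBATCH --job-name={method}",
        f"#SBATCH --partition={gpu_partition}",
        f"#SBATCH --account={gpu_account}",
        f"#SBATCH --gres={gres}",
        f"#SBATCH --mem={cpu_memory_in_gb}G",
        f"#SBATCH --time={time_limit}",
    ]
    return "\n".join(lines) + "\n"
```

The time limit uses SLURM's `days-hours:minutes:seconds` format, so the default `7-00:00:00` requests seven days.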

posebench.models.ensemble_generation.main(cfg: DictConfig)

Generate predictions for a protein-ligand target pair using an ensemble of methods.

posebench.models.ensemble_generation.predict_protein_structure_from_sequence(python_exec_path: str, structure_prediction_script_path: str, fasta_filepath: str, output_pdb_dir: str, chunk_size: int | None = None, cpu_only: bool = False, cpu_offload: bool = False, cuda_device_index: int = 0)

Predict protein structure from amino acid sequence.

Parameters:

python_exec_path – Path to the Python executable with which to run Python scripts.
structure_prediction_script_path – Path to the ESMFold structure prediction script to run.
fasta_filepath – Path to the input FASTA file.
output_pdb_dir – Path to the output PDB directory.
chunk_size – Optional chunk size for structure prediction.
cpu_only – Whether to use CPU only for structure prediction.
cpu_offload – Whether to use CPU offloading for structure prediction.
cuda_device_index – Index of the CUDA device to use for structure prediction.

posebench.models.ensemble_generation.rank_ensemble_predictions(ensemble_predictions_dict: dict[str, list[tuple[str, str]]], name: str, cfg: DictConfig) → dict[int, tuple[str, str, str, float]]

Rank the predictions to select the top prediction(s).

Parameters:

ensemble_predictions_dict – Dictionary of method names and their corresponding predictions.
name – Name of the target protein-ligand pair.
cfg – Configuration dictionary for runtime arguments.

Returns:

Dictionary of consensus-ranked predictions indexed by each prediction’s consensus ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD or Vina energy score.

posebench.models.ensemble_generation.rank_key(file_path: str) → float

Define a custom key for ranking the predictions.

Parameters:

file_path – Path to the file to rank.

Returns:

The rank key for the file.

posebench.models.ensemble_generation.rfaa_get_chain_letter(index: int) → str

Get the RFAA chain letter based on index.
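
A plausible mapping from index to chain letter is sketched below. It assumes zero-based indexing starting at `A` and support for at most 26 chains; both are guesses for illustration, and the helper name `chain_letter_sketch` is hypothetical.

```python
import string


def chain_letter_sketch(index):
    """Map a zero-based chain index to an uppercase letter (0 -> 'A', 1 -> 'B', ...)."""
    if not 0 <= index < len(string.ascii_uppercase):
        raise ValueError(f"Chain index {index} is outside the supported range 0-25.")
    return string.ascii_uppercase[index]
```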

posebench.models.ensemble_generation.save_ranked_predictions(ranked_predictions: dict[int, tuple[str, str, str, float]], protein_input_filepath: str, name: str, ligand_numbers: str | list[int] | None, ligand_names: str | list[str] | None, ligand_tasks: str | None, cfg: DictConfig)

Save the top-ranked predictions to the output directory.

Parameters:

ranked_predictions – Dictionary of ranked predictions indexed by each prediction’s ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD or Vina energy score.
protein_input_filepath – Path to the input protein structure PDB file.
name – Name of the target protein-ligand pair.
ligand_numbers – Optional list of ligand numbers represented as a _-delimited string or a list of integers.
ligand_names – Optional list of ligand names represented as a _-delimited string or a list of strings.
ligand_tasks – Optional ligand tasks specification.
cfg – Configuration dictionary for runtime arguments.