Ensemble generation

posebench.models.ensemble_generation.assign_reference_residue_b_factors(protein_output_files: list[str], protein_reference_filepath: str) → list[str]

If B-factor column values are not already present, assign the reference protein structure’s per-residue confidence scores to each output protein.

Parameters:
  • protein_output_files – List of output protein structure PDB filepaths.

  • protein_reference_filepath – Path to the input protein structure PDB file.

Returns:

List of output protein structure PDB filepaths with the input protein’s per-residue confidence scores.
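As a rough illustration of the idea above, the following sketch copies B-factor values (PDB columns 61-66) from a reference structure onto matching residues of an output structure. It is not the PoseBench implementation: the helper names transfer_b_factors and make_atom_line are hypothetical, it operates on in-memory lines rather than filepaths, and it skips the check for already-present values.

```python
def transfer_b_factors(reference_lines, output_lines):
    """Copy per-residue B-factors (PDB columns 61-66) from reference to output lines."""
    # Key each residue by chain ID (column 22) and residue number plus
    # insertion code (columns 23-27); keep the first atom's value per residue.
    reference_b_factors = {}
    for line in reference_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            reference_b_factors.setdefault((line[21], line[22:27]), line[60:66])

    updated_lines = []
    for line in output_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            b_factor = reference_b_factors.get((line[21], line[22:27]))
            if b_factor is not None:
                # Splice the reference value into the B-factor field only.
                line = line[:60] + b_factor + line[66:]
        updated_lines.append(line)
    return updated_lines


def make_atom_line(serial, chain, residue_number, b_factor):
    """Build a fixed-width CA ATOM record at the origin (illustrative data only)."""
    return (
        f"ATOM  {serial:>5}  CA  ALA {chain}{residue_number:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{b_factor:>6.2f}"
    )


# Hypothetical example: the reference carries confidence scores, the output does not.
reference = [make_atom_line(1, "A", 1, 87.5), make_atom_line(2, "A", 2, 42.25)]
output = [make_atom_line(1, "A", 1, 0.0), make_atom_line(2, "A", 2, 0.0)]
updated = transfer_b_factors(reference, output)
```

Residues that have no counterpart in the reference are left unchanged in this sketch.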

posebench.models.ensemble_generation.consensus_rank_ensemble_predictions(cfg: DictConfig, method_ligand_positions: list[ndarray], ensemble_predictions_list: list[tuple[str, str, str]]) → dict[int, tuple[str, str, str, float]]

Consensus-rank the predictions to select the top prediction(s).

Parameters:
  • cfg – Configuration dictionary for runtime arguments.

  • method_ligand_positions – List of ligand positions from each method’s predictions.

  • ensemble_predictions_list – List of tuples of method name, output protein filepath, and output ligand filepath.

Returns:

Dictionary of consensus-ranked predictions indexed by each prediction’s consensus ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD.
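A minimal sketch of consensus ranking by average pairwise RMSD is shown here, assuming that every pose lists the same atoms in the same order and that rankings start at 1. The helper names rmsd and consensus_rank are hypothetical, and plain tuples stand in for the ndarray inputs of the documented function; the actual implementation may differ (for example, in atom matching or symmetry handling).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


def consensus_rank(predictions, ligand_positions):
    """Rank predictions by their average RMSD to all other predictions."""
    count = len(predictions)
    scored = []
    for i in range(count):
        others = [
            rmsd(ligand_positions[i], ligand_positions[j])
            for j in range(count)
            if j != i
        ]
        average = sum(others) / len(others) if others else 0.0
        scored.append((average, i))
    # The pose closest on average to every other pose is ranked first.
    scored.sort()
    return {
        rank: (*predictions[i], average)
        for rank, (average, i) in enumerate(scored, start=1)
    }


# Hypothetical single-atom poses placed along the x-axis.
predictions = [
    ("method_a", "a.pdb", "a.sdf"),
    ("method_b", "b.pdb", "b.sdf"),
    ("method_c", "c.pdb", "c.sdf"),
]
positions = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
ranked = consensus_rank(predictions, positions)
```

In this toy case the middle pose wins because it sits nearest to the other two, which is the intuition behind consensus ranking: an outlier pose accumulates a large average distance and drops to the bottom.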

posebench.models.ensemble_generation.create_diffdock_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run DiffDock protein-ligand complex prediction.

Parameters:
  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_smiles – SMILES string of the input ligand.

  • input_id – Input ID.

  • output_filepath – Path to the output bash script file.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for DiffDock.

posebench.models.ensemble_generation.create_dynamicbind_bash_script(protein_filepath: str, ligand_smiles: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run DynamicBind protein-ligand complex prediction.

Parameters:
  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_smiles – SMILES string of the input ligand.

  • output_filepath – Path to the output bash script file.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for DynamicBind.

posebench.models.ensemble_generation.create_neuralplexer_bash_script(protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run NeuralPLexer protein-ligand complex prediction.

Parameters:
  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_smiles – SMILES string of the input ligand.

  • input_id – Input ID.

  • output_filepath – Path to the output bash script file.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for NeuralPLexer.

posebench.models.ensemble_generation.create_rfaa_bash_script(fasta_filepaths: list[str], sdf_filepaths: list[str] | None, input_id: str, cfg: DictConfig, output_filepath: str | None = None, smiles_strings: list[str] | None = None, generate_hpc_scripts: bool = True)

Create a bash script to run RoseTTAFold-All-Atom protein-ligand complex prediction.

Parameters:
  • fasta_filepaths – List of FASTA filepaths.

  • sdf_filepaths – List of optional SDF filepaths.

  • input_id – Input ID.

  • cfg – Configuration dictionary for runtime arguments.

  • output_filepath – Optional path to the output bash script file.

  • smiles_strings – Optional list of SMILES strings of the input ligands to use directly.

  • generate_hpc_scripts – Whether to generate HPC scripts for RoseTTAFold-All-Atom.

posebench.models.ensemble_generation.create_temporary_fasta_file(protein_sequence: str, name: str | None = None) → str

Create a temporary FASTA file for the input protein sequence.

Parameters:
  • protein_sequence – Amino acid sequence of the protein.

  • name – Optional name of the temporary FASTA file.

Returns:

Path to the temporary FASTA file.
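The sketch below shows one way such a temporary FASTA file could be written with the standard library. The function name write_temporary_fasta, the fixed ">protein" header, and the choice to place named files in a fresh temporary directory are all assumptions made for illustration, not details taken from PoseBench.

```python
import os
import tempfile


def write_temporary_fasta(protein_sequence, name=None):
    """Write a single-record FASTA file and return its path."""
    if name is None:
        # Let the OS pick a unique filename; keep the file after closing.
        handle = tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False)
    else:
        # Assumed behavior: use the given name inside a new temporary directory.
        handle = open(os.path.join(tempfile.mkdtemp(), f"{name}.fasta"), "w")
    with handle:
        handle.write(f">protein\n{protein_sequence}\n")
    return handle.name


fasta_path = write_temporary_fasta("MKTAYIAKQR", name="demo")
with open(fasta_path) as fasta_file:
    fasta_text = fasta_file.read()
```

The caller is responsible for deleting the file (and its directory) once the downstream tool has read it.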

posebench.models.ensemble_generation.create_vina_bash_script(binding_site_method: Literal['diffdock', 'fabind', 'dynamicbind', 'neuralplexer', 'rfaa'], protein_filepath: str, ligand_filepath: str, apo_protein_filepath: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool = True)

Create a bash script to run Vina-based protein-ligand complex prediction.

Parameters:
  • binding_site_method – Name of the method used to predict the binding site.

  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_filepath – Path to the input ligand structure SDF file.

  • apo_protein_filepath – Path to the predicted apo protein structure PDB file.

  • input_id – Input ID.

  • output_filepath – Path to the output bash script file.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for Vina.

posebench.models.ensemble_generation.dynamically_build_rfaa_input_config(fasta_filepaths: list[str], sdf_filepaths: list[str] | None, input_id: str, cfg: DictConfig, smiles_strings: list[str] | None = None) → str

Dynamically build the RoseTTAFold-All-Atom inference configuration file for input proteins and ligands.

Parameters:
  • fasta_filepaths – List of FASTA filepaths.

  • sdf_filepaths – List of optional SDF filepaths.

  • input_id – Input ID.

  • cfg – Configuration dictionary for runtime arguments.

  • smiles_strings – Optional list of SMILES strings of the input ligands to use directly.

Returns:

Path to the dynamically built configuration file.

posebench.models.ensemble_generation.export_ligands_in_casp15_format(output_ligand_filepaths: list[str], output_ligand_sdf_file: str, sdf_header: str, method: str, append: bool = False, model_index: int | None = None, ligand_numbers: str | list[int] | None = None, ligand_names: str | list[str] | None = None)

Export the predicted ligand structures in CASP15 format.

Note that for the sake of consistency when evaluating deep learning docking methods on the CASP15 benchmark, we only report a single pose per protein-ligand model (i.e., 5 submitted protein-ligand models with 1 pose per model vs. 5 poses per model).

Parameters:
  • output_ligand_filepaths – List of output ligand structure SDF filepaths.

  • output_ligand_sdf_file – Path to the output ligand structure SDF file.

  • sdf_header – Header string for the SDF file.

  • method – Method name.

  • append – Whether to append the predicted ligand structures to the output file.

  • model_index – Optional index of the model to write to the SDF file.

  • ligand_numbers – Optional list of ligand numbers represented as an underscore-delimited string or a list of integers.

  • ligand_names – Optional list of ligand names represented as an underscore-delimited string or a list of strings.
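Because ligand_numbers and ligand_names each accept either a delimited string or a list, a caller-side normalization step might look like the sketch below. The helper names parse_ligand_numbers and parse_ligand_names are hypothetical, and the sketch assumes the delimiter is a single underscore with no escaping.

```python
def parse_ligand_numbers(ligand_numbers):
    """Normalize ligand numbers to a list of integers (or None)."""
    if ligand_numbers is None:
        return None
    if isinstance(ligand_numbers, str):
        # Split on underscores and skip empty tokens such as a trailing "_".
        return [int(token) for token in ligand_numbers.split("_") if token]
    return [int(number) for number in ligand_numbers]


def parse_ligand_names(ligand_names):
    """Normalize ligand names to a list of strings (or None)."""
    if ligand_names is None:
        return None
    if isinstance(ligand_names, str):
        return [token for token in ligand_names.split("_") if token]
    return [str(name) for name in ligand_names]


numbers = parse_ligand_numbers("1_2_3")
names = parse_ligand_names("ATP_MG")
```

Note that this simple split would break a ligand name that itself contains an underscore; passing a list avoids that ambiguity.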

posebench.models.ensemble_generation.export_proteins_in_casp_format(output_protein_filepaths: list[str], output_protein_pdb_file: str, pdb_header: str, append: bool = False, export_casp15_format: bool = False, model_index: int | None = None, gap_insertion_point: int | None = None)

Export the predicted protein structures in CASP format.

Parameters:
  • output_protein_filepaths – List of output protein structure PDB filepaths.

  • output_protein_pdb_file – Path to the output protein structure PDB file.

  • pdb_header – Header string for the PDB file.

  • append – Whether to append the predicted protein structures to the output file.

  • export_casp15_format – Whether to format the output file for CASP15 benchmarking.

  • model_index – Optional index of the model to write to the PDB file.

  • gap_insertion_point – Optional colon-separated string representing the chain-residue pair index of the residue at which to insert a single index gap.

posebench.models.ensemble_generation.ff_rank_ensemble_predictions(cfg: DictConfig, ensemble_predictions_list: list[tuple[str, str, str]]) → dict[int, tuple[str, str, str, float]]

Rank the predictions using an OpenMM force field (FF) to select the top prediction(s) according to the criterion of minimum energy.

Parameters:
  • cfg – Configuration dictionary for runtime arguments.

  • ensemble_predictions_list – List of tuples of method name, output protein filepath, and output ligand filepath.

Returns:

Dictionary of FF-ranked predictions indexed by each prediction’s ranking and valued as its method name, output protein filepath, output ligand filepath, and force field energy score.
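The minimum-energy criterion can be sketched as a simple sort once an energy has been computed for each prediction. The sketch below does not call OpenMM; it assumes the energies are already available as floats, that rankings start at 1, and that the helper name rank_by_minimum_energy is hypothetical.

```python
def rank_by_minimum_energy(predictions, energies):
    """Order predictions from lowest to highest energy and index them by rank."""
    order = sorted(range(len(predictions)), key=lambda index: energies[index])
    return {
        rank: (*predictions[index], energies[index])
        for rank, index in enumerate(order, start=1)
    }


# Hypothetical energies for three predictions (arbitrary units).
ff_predictions = [
    ("method_a", "a.pdb", "a.sdf"),
    ("method_b", "b.pdb", "b.sdf"),
    ("method_c", "c.pdb", "c.sdf"),
]
ff_energies = [-10.0, -35.5, 4.2]
ff_ranked = rank_by_minimum_energy(ff_predictions, ff_energies)
```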

posebench.models.ensemble_generation.generate_ensemble_predictions(protein_filepath: str, ligand_smiles: str, input_id: str, cfg: DictConfig, generate_hpc_scripts: bool = True, method_filepaths_mapping: dict[str, list[tuple[str, str]]] | None = None) → tuple[dict[str, list[tuple[str, str]]] | None, bool]

Generate bound complex predictions using an ensemble of methods.

Parameters:
  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_smiles – SMILES string of the input ligand.

  • input_id – Input ID.

  • target – Name of the target protein-ligand pair.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for the ensemble predictions.

  • method_filepaths_mapping – Optional mapping of method names to a list of tuples of protein and ligand filepaths.

Returns:

Dictionary of method names and their corresponding predictions as well as whether the prediction scripts were generated and now need to be run.

posebench.models.ensemble_generation.generate_method_prediction_script(method: str, protein_filepath: str, ligand_smiles: str, input_id: str, output_filepath: str, cfg: DictConfig, generate_hpc_scripts: bool, method_filepaths_mapping: dict[str, list[tuple[str, str]]] | None = None)

Generate a script to run the method’s protein-ligand complex prediction.

Parameters:
  • method – Name of the method to generate a prediction script for.

  • protein_filepath – Path to the input protein structure PDB file.

  • ligand_smiles – SMILES string of the input ligand.

  • input_id – Input ID.

  • output_filepath – Path to the output bash script file.

  • cfg – Configuration dictionary for runtime arguments.

  • generate_hpc_scripts – Whether to generate HPC scripts for the method.

  • method_filepaths_mapping – Optional mapping of method names to a list of tuples of protein and ligand filepaths.

posebench.models.ensemble_generation.get_method_predictions(method: str, target: str, cfg: DictConfig, binding_site_method: str | None = None, input_protein_filepath: str | None = None) → list[tuple[str, str]]

Get the predictions generated by the method.

Parameters:
  • method – Name of the method to get predictions for.

  • target – Name of the target protein-ligand pair.

  • cfg – Configuration dictionary for runtime arguments.

  • binding_site_method – Optional name of the method used to predict AutoDock Vina’s binding sites.

  • input_protein_filepath – Optional path to the input protein structure PDB file.

Returns:

List of method predictions, each as a tuple of the output protein filepath and the output ligand filepath.

posebench.models.ensemble_generation.insert_hpc_headers(method: str, gpu_partition: str = 'chengji-lab-gpu', gpu_account: str = 'chengji-lab', gpu_type: Literal['A100', 'H100'] = 'H100', cpu_memory_in_gb: int = 59, time_limit: str = '7-00:00:00') → str

Insert batch headers for SLURM job scheduling.

Parameters:
  • method – Name of the method for which to generate a prediction script.

  • gpu_partition – Name of the GPU partition to use.

  • gpu_account – Name of the GPU account to use.

  • gpu_type – Type of GPU to request ('A100' or 'H100').

  • cpu_memory_in_gb – Amount of CPU memory to request in GB.

  • time_limit – Time limit for the job as a SLURM-compatible string.

Returns:

Batch headers string for SLURM job scheduling.
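For readers unfamiliar with SLURM batch headers, the sketch below assembles a header string from parameters like those listed above. The function name build_slurm_headers, the placeholder partition and account defaults, and the particular set of #SBATCH directives are assumptions for illustration; the headers PoseBench actually emits may differ.

```python
def build_slurm_headers(
    method,
    gpu_partition="example-gpu",  # placeholder, not a real partition
    gpu_account="example-lab",  # placeholder, not a real account
    gpu_type="H100",
    cpu_memory_in_gb=59,
    time_limit="7-00:00:00",
):
    """Return SLURM batch headers requesting one GPU for the given method."""
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH --job-name={method}",
        f"#SBATCH --partition={gpu_partition}",
        f"#SBATCH --account={gpu_account}",
        f"#SBATCH --gres=gpu:{gpu_type}:1",  # one GPU of the requested type
        f"#SBATCH --mem={cpu_memory_in_gb}G",
        f"#SBATCH --time={time_limit}",  # days-hours:minutes:seconds
    ]
    return "\n".join(lines) + "\n"


headers = build_slurm_headers("diffdock")
```

A generated bash script would typically place this string at the very top, followed by the commands that launch the method.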

posebench.models.ensemble_generation.main(cfg: DictConfig)

Generate predictions for a protein-ligand target pair using an ensemble of methods.

posebench.models.ensemble_generation.predict_protein_structure_from_sequence(python_exec_path: str, structure_prediction_script_path: str, fasta_filepath: str, output_pdb_dir: str, chunk_size: int | None = None, cpu_only: bool = False, cpu_offload: bool = False, cuda_device_index: int = 0)

Predict protein structure from amino acid sequence.

Parameters:
  • python_exec_path – Path to the Python executable with which to run Python scripts.

  • structure_prediction_script_path – Path to the ESMFold structure prediction script to run.

  • fasta_filepath – Path to the input FASTA file.

  • output_pdb_dir – Path to the output PDB directory.

  • chunk_size – Optional chunk size for structure prediction.

  • cpu_only – Whether to use CPU only for structure prediction.

  • cpu_offload – Whether to use CPU offloading for structure prediction.

  • cuda_device_index – The optional index of the CUDA device to use for structure prediction.

posebench.models.ensemble_generation.rank_ensemble_predictions(ensemble_predictions_dict: dict[str, list[tuple[str, str]]], name: str, cfg: DictConfig) → dict[int, tuple[str, str, str, float]]

Rank the predictions to select the top prediction(s).

Parameters:
  • ensemble_predictions_dict – Dictionary of method names and their corresponding predictions.

  • name – Name of the target protein-ligand pair.

  • cfg – Configuration dictionary for runtime arguments.

Returns:

Dictionary of consensus-ranked predictions indexed by each prediction’s consensus ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD or Vina energy score.

posebench.models.ensemble_generation.rank_key(file_path: str) → float

Define a custom key for ranking the predictions.

Parameters:

file_path – Path to the file to rank.

Returns:

The rank key for the file.
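The documentation does not say how the key is derived from the path. Purely as an illustration of a custom sort key, the sketch below assumes filenames embed a rank as "rank<N>" and sorts numerically on that number; the filename pattern and the helper name rank_key_sketch are hypothetical.

```python
import os
import re


def rank_key_sketch(file_path):
    """Return the numeric rank embedded in a filename, or infinity if absent."""
    match = re.search(r"rank(\d+)", os.path.basename(file_path))
    # Files without a rank sort last.
    return float(match.group(1)) if match else float("inf")


paths = ["out/rank10.sdf", "out/rank2.sdf", "out/notes.txt", "out/rank1.sdf"]
sorted_paths = sorted(paths, key=rank_key_sketch)
```

Sorting on a numeric key avoids the lexicographic pitfall in which "rank10" would come before "rank2".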

posebench.models.ensemble_generation.rfaa_get_chain_letter(index: int) → str

Get the RFAA chain letter based on index.
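One plausible reading of this mapping is sketched below, assuming a zero-based index into the uppercase alphabet (0 maps to "A"). The base of the index, the handling of indices past "Z", and the helper name chain_letter_for_index are assumptions, not documented behavior.

```python
import string


def chain_letter_for_index(index):
    """Map a zero-based index to an uppercase chain letter (0 -> 'A')."""
    if not 0 <= index < len(string.ascii_uppercase):
        raise ValueError(f"Index {index} has no single-letter chain ID.")
    return string.ascii_uppercase[index]


letters = [chain_letter_for_index(index) for index in range(3)]
```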

posebench.models.ensemble_generation.save_ranked_predictions(ranked_predictions: dict[int, tuple[str, str, str, float]], protein_input_filepath: str, name: str, ligand_numbers: str | list[int] | None, ligand_names: str | list[str] | None, ligand_tasks: str | None, cfg: DictConfig)

Save the top-ranked predictions to the output directory.

Parameters:
  • ranked_predictions – Dictionary of ranked predictions indexed by each prediction’s ranking and valued as its method name, output protein filepath, output ligand filepath, and average pairwise RMSD or Vina energy score.

  • protein_input_filepath – Path to the input protein structure PDB file.

  • name – Name of the target protein-ligand pair.

  • ligand_numbers – Optional list of ligand numbers represented as an underscore-delimited string or a list of integers.

  • ligand_names – Optional list of ligand names represented as an underscore-delimited string or a list of strings.

  • ligand_tasks – Optional ligand tasks specification.

  • cfg – Configuration dictionary for runtime arguments.