Data utilities

posebench.utils.data_utils.combine_molecules(molecule_list: list[Mol]) → Mol

Combine a list of RDKit molecules into a single molecule.

Parameters:
molecule_list – A list of RDKit molecules.

Returns:
A single RDKit molecule.

posebench.utils.data_utils.count_num_residues_in_pdb_file(pdb_filepath: str) → int

Count the number of Cα atoms (i.e., residues) in a PDB file.

Parameters:
pdb_filepath – Path to the PDB file.

Returns:
Number of Cα atoms (i.e., residues) in the PDB file.
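
As a rough illustration of counting residues by their alpha carbons, the sketch below scans the fixed-width ATOM records of a PDB file using only the standard library. The helper name count_ca_atoms and the sample coordinates are hypothetical, and this is not necessarily how posebench implements the function. Alternate locations are not deduplicated here.

```python
import tempfile
from pathlib import Path


def count_ca_atoms(pdb_filepath: str) -> int:
    """Count ATOM records whose atom name is CA (hypothetical helper)."""
    count = 0
    with open(pdb_filepath) as handle:
        for line in handle:
            # Columns 13-16 of an ATOM record hold the atom name.
            # HETATM records are ignored, so calcium ions are not counted.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                count += 1
    return count


# Made-up two-residue fragment: two alpha carbons are present.
SAMPLE_PDB = "\n".join([
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  N   GLY A   2      13.559   5.125  -4.063  1.00  0.00           N",
    "ATOM      4  CA  GLY A   2      14.167   4.042  -3.300  1.00  0.00           C",
    "END",
])

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.pdb"
    sample_path.write_text(SAMPLE_PDB + "\n")
    num_residues = count_ca_atoms(str(sample_path))

print(num_residues)
```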

posebench.utils.data_utils.count_pdb_inter_residue_clashes(pdb_filepath: str, clash_cutoff: float = 0.63) → int

Count the number of inter-residue clashes in a protein PDB file. From:

Parameters:

  • pdb_filepath – Path to the PDB file.

  • clash_cutoff – The cutoff for what is considered a clash.

Returns:
The number of inter-residue clashes in the structure.
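
The reference does not say how clash_cutoff is applied. The sketch below shows one common convention, stated here purely as an assumption: two atoms from different residues clash when their distance is below clash_cutoff times the sum of their van der Waals radii. The radii, the helper name count_clashing_atom_pairs, the input tuple layout, and the rule of skipping sequence-adjacent residues are all illustrative choices, not the library's documented behavior.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (illustrative values, an assumption).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_clashing_atom_pairs(atoms, clash_cutoff=0.63):
    """Count atom pairs from different residues that sit too close together.

    atoms: list of (chain_id, residue_number, element, (x, y, z)) tuples.
    """
    num_clashes = 0
    for atom_a, atom_b in combinations(atoms, 2):
        chain_a, resnum_a, element_a, xyz_a = atom_a
        chain_b, resnum_b, element_b, xyz_b = atom_b
        if chain_a == chain_b and abs(resnum_a - resnum_b) <= 1:
            # Same or sequence-adjacent residue: skipped so that covalent
            # neighbours (e.g. the peptide bond) are not reported as clashes.
            continue
        # Assumed rule: a clash is a distance below clash_cutoff * (r_a + r_b).
        threshold = clash_cutoff * (VDW_RADII[element_a] + VDW_RADII[element_b])
        if math.dist(xyz_a, xyz_b) < threshold:
            num_clashes += 1
    return num_clashes


example_atoms = [
    ("A", 1, "C", (0.0, 0.0, 0.0)),
    ("A", 2, "N", (1.33, 0.0, 0.0)),  # adjacent to residue 1, so that pair is skipped
    ("A", 5, "C", (-1.5, 0.0, 0.0)),  # 1.5 angstroms from the first atom
    ("A", 9, "O", (10.0, 0.0, 0.0)),  # far from every other atom
]
print(count_clashing_atom_pairs(example_atoms))
print(count_clashing_atom_pairs(example_atoms, clash_cutoff=0.4))
```

Under this assumed rule, lowering clash_cutoff makes the check stricter about what counts as a clash, so fewer pairs are reported.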

posebench.utils.data_utils.create_sdf_file_from_smiles(smiles: str, output_sdf_file: str) → str

Create an SDF file from a SMILES string.

Parameters:

  • smiles – SMILES string of the molecule.

  • output_sdf_file – Path to the output SDF file.

Returns:
Path to the output SDF file.

posebench.utils.data_utils.extract_protein_and_ligands_with_prody(input_pdb_file: str, protein_output_pdb_file: str, ligands_output_sdf_file: str, sanitize: bool = True, add_element_types: bool = False, ligand_smiles: str | None = None) → Mol | None

Using ProDy, extract protein atoms and ligand molecules from a PDB file and write them to separate files.

Parameters:

  • input_pdb_file – The input PDB file.

  • protein_output_pdb_file – The output PDB file for the protein atoms.

  • ligands_output_sdf_file – The output SDF file for the ligand molecules.

  • sanitize – Whether to sanitize the ligand molecules.

  • add_element_types – Whether to add element types to the protein atoms.

  • ligand_smiles – The SMILES string of the ligand molecule.

Returns:
The combined final ligand molecule(s) as an RDKit molecule.

posebench.utils.data_utils.extract_sequences_from_protein_structure_file(protein_filepath: str | Path, structure: Structure | None = None) → list[str]

Extract the chain sequences from a protein structure file.

Parameters:

  • protein_filepath – Path to the protein structure file.

  • structure – Optional BioPython structure object to use instead.

Returns:
A list of protein sequences.
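
To make the shape of the return value concrete (one sequence string per chain), here is a minimal standard-library sketch that reads residue names from PDB ATOM records. The helper name chain_sequences_from_pdb_lines and the sample lines are hypothetical. The documented function works with BioPython structures, so treat this only as an illustration of the idea.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences_from_pdb_lines(lines):
    """Return one sequence string per chain, in order of first appearance."""
    residues_by_chain = {}  # chain ID -> list of one-letter codes
    seen_residues = set()
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]
        # Residue number (columns 23-26) plus insertion code (column 27).
        residue_key = (chain_id, line[22:27])
        if residue_key in seen_residues:
            continue
        seen_residues.add(residue_key)
        # Unknown residue names are written as "X".
        code = THREE_TO_ONE.get(line[17:20], "X")
        residues_by_chain.setdefault(chain_id, []).append(code)
    return ["".join(codes) for codes in residues_by_chain.values()]


sample_lines = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  CA  GLY A   2      14.167   4.042  -3.300  1.00  0.00           C",
    "TER",
    "ATOM      4  CA  SER B   1      20.000   4.000  -3.000  1.00  0.00           C",
]
sequences = chain_sequences_from_pdb_lines(sample_lines)
print(sequences)
```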

posebench.utils.data_utils.get_pdb_components_with_prody(pdb_id) → tuple

Split a protein-ligand PDB into protein and ligand components using ProDy.

Parameters:
pdb_id – PDB ID.

Returns:
Protein structure and ligand residues.

posebench.utils.data_utils.parse_inference_inputs_from_dir(input_data_dir: str | Path, pdb_ids: set[Any] | None = None) → list[tuple[str, str]]

Parse a data directory containing subdirectories of protein-ligand complexes and return corresponding SMILES strings and PDB IDs.

Parameters:

  • input_data_dir – Path to the input data directory.

  • pdb_ids – Optional set of IDs by which to filter processing.

Returns:
A list of tuples each containing a SMILES string and a PDB ID.

posebench.utils.data_utils.process_ligand_with_prody(ligand, res_name, chain, resnum, sanitize: bool = True, sub_smiles: str | None = None) → Mol

Add bond orders to a PDB ligand using ProDy.

  1. Select the ligand component with name “res_name”.

  2. Get the corresponding SMILES from pypdb.

  3. Create a template molecule from the SMILES in step 2.

  4. Write the PDB file to a stream.

  5. Read the stream into an RDKit molecule.

  6. Assign the bond orders from the template from step 3.

Parameters:

  • ligand – Ligand as generated by ProDy.

  • res_name – Residue name of the ligand to extract.

  • chain – Chain of the ligand to extract.

  • resnum – Residue number of the ligand to extract.

  • sanitize – Whether to sanitize the molecule.

  • sub_smiles – Optional SMILES string of the ligand molecule.

Returns:
Molecule with bond orders assigned.

posebench.utils.data_utils.renumber_biopython_structure_residues(structure: Structure, gap_insertion_point: str | None = None) → Structure

Renumber the residues of a BioPython structure, starting from 1 for each chain.

Parameters:

  • structure – BioPython structure object.

  • gap_insertion_point – Optional colon-separated string representing the chain-residue pair index of the residue at which to insert a single index gap.

Returns:
BioPython structure object with renumbered residues.

posebench.utils.data_utils.renumber_pdb_df_residues(input_pdb_file: str, output_pdb_file: str)

Renumber residues in a PDB file starting from 1 for each chain.

Parameters:

  • input_pdb_file – Path to the input PDB file.

  • output_pdb_file – Path to the output PDB file.
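
The per-chain renumbering can be pictured with the following standard-library sketch, which rewrites the residue-number columns of ATOM and HETATM records. The helper name renumber_pdb_lines and the sample records are hypothetical, and the documented function may work differently internally (its name suggests a DataFrame-based approach). Blanking the insertion code is a simplification chosen for this example.

```python
def renumber_pdb_lines(lines):
    """Return PDB lines with residues renumbered from 1 within each chain."""
    renumbered_lines = []
    counters = {}   # chain ID -> last residue number assigned
    last_seen = {}  # chain ID -> original residue field most recently seen
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            chain_id = line[21]
            # Original residue number (columns 23-26) plus insertion code (column 27).
            original = line[22:27]
            if last_seen.get(chain_id) != original:
                counters[chain_id] = counters.get(chain_id, 0) + 1
                last_seen[chain_id] = original
            # Overwrite the residue number and blank the insertion code.
            line = f"{line[:22]}{counters[chain_id]:>4} {line[27:]}"
        renumbered_lines.append(line)
    return renumbered_lines


sample_lines = [
    "ATOM      1  N   ALA A  10      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A  10      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  CA  GLY A  11      14.167   4.042  -3.300  1.00  0.00           C",
    "TER",
    "ATOM      4  CA  SER B   5      20.000   4.000  -3.000  1.00  0.00           C",
]
renumbered = renumber_pdb_lines(sample_lines)
for new_line in renumbered:
    print(new_line)
```

Because only the fixed-width residue fields are rewritten, every other column of each record, including the coordinates, is left untouched.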

posebench.utils.data_utils.write_pdb_with_prody(protein, pdb_name, add_element_types=False)

Write a protein to a PDB file using ProDy.

Parameters:

  • protein – Protein object from ProDy.

  • pdb_name – Base name for the PDB file.

  • add_element_types – Whether to add element types to the PDB file.

posebench.utils.data_utils.write_sdf(new_mol: Mol, pdb_name: str)

Write an RDKit molecule to an SD file.

Parameters:

  • new_mol – RDKit molecule.

  • pdb_name – Name of the output file.