Data utilities

posebench.utils.data_utils.combine_molecules(molecule_list: list[Mol]) → Mol

Combine a list of RDKit molecules into a single molecule.

Parameters:

molecule_list – A list of RDKit molecules.

Returns:

A single RDKit molecule.

posebench.utils.data_utils.count_num_residues_in_pdb_file(pdb_filepath: str) → int

Count the number of Cα atoms (i.e., residues) in a PDB file.

Parameters:

pdb_filepath – Path to PDB file.

Returns:

Number of Cα atoms (i.e., residues) in the PDB file.
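As an illustration of the counting idea, here is a minimal standard-library sketch that counts alpha-carbon ATOM records in PDB-formatted text. The name count_ca_atoms is hypothetical, and ignoring HETATM records (so that calcium ions named CA are not counted) is an assumption; the actual posebench implementation may differ.

```python
def count_ca_atoms(pdb_text: str) -> int:
    """Count ATOM records whose atom name is CA (hypothetical helper)."""
    count = 0
    for line in pdb_text.splitlines():
        # PDB fixed columns: record name in 1-6, atom name in 13-16.
        # Assumption: only ATOM records count, so HETATM calcium is skipped.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            count += 1
    return count


example_pdb = "\n".join([
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  CA  GLY A   2      13.000   7.000  -4.000  1.00  0.00           C",
    "HETATM    4 CA    CA A 101      20.000  20.000  20.000  1.00  0.00          CA",
])
print(count_ca_atoms(example_pdb))
```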

posebench.utils.data_utils.count_pdb_inter_residue_clashes(pdb_filepath: str, clash_cutoff: float = 0.63) → int

Count the number of inter-residue clashes in a protein PDB file. From: https://www.blopig.com/blog/2023/05/checking-your-pdb-file-for-clashing-atoms/

Parameters:
  • pdb_filepath – Path to the PDB file.

  • clash_cutoff – The cutoff for what is considered a clash.

Returns:

The number of inter-residue clashes in the structure.
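The following sketch shows one plausible reading of a cutoff-based clash count. It assumes that two atoms from different residues clash when their distance is below clash_cutoff times the sum of their van der Waals radii. That interpretation, the radii table, the input format, and the name count_inter_residue_clashes are all assumptions made for illustration. A real check would presumably also skip covalently bonded pairs such as peptide bonds and disulfide bridges, which this sketch does not do.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (illustrative values).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_inter_residue_clashes(atoms, clash_cutoff=0.63):
    """Count clashing atom pairs that belong to different residues.

    atoms: list of (residue_id, element, (x, y, z)) tuples (hypothetical format).
    """
    clashes = 0
    for (res_a, el_a, xyz_a), (res_b, el_b, xyz_b) in combinations(atoms, 2):
        if res_a == res_b:
            continue  # only inter-residue pairs are considered
        # Assumption: the cutoff scales the summed radii of the two atoms.
        threshold = clash_cutoff * (VDW_RADII[el_a] + VDW_RADII[el_b])
        if math.dist(xyz_a, xyz_b) < threshold:
            clashes += 1
    return clashes


example_atoms = [
    (("A", 1), "C", (0.0, 0.0, 0.0)),
    (("A", 1), "N", (1.0, 0.0, 0.0)),
    (("A", 5), "O", (1.5, 0.0, 0.0)),
    (("A", 9), "C", (10.0, 0.0, 0.0)),
]
print(count_inter_residue_clashes(example_atoms))
```

The pairwise loop is quadratic in the number of atoms; for full-size structures a spatial index such as a k-d tree would be the usual choice.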

posebench.utils.data_utils.create_sdf_file_from_smiles(smiles: str, output_sdf_file: str) → str

Create an SDF file from a SMILES string.

Parameters:
  • smiles – SMILES string of the molecule.

  • output_sdf_file – Path to the output SDF file.

Returns:

Path to the output SDF file.

posebench.utils.data_utils.extract_protein_and_ligands_with_prody(input_pdb_file: str, protein_output_pdb_file: str | None, ligands_output_sdf_file: str | None, sanitize: bool = True, add_element_types: bool = False, write_output_files: bool = True, load_hetatms_as_ligands: bool = False, ligand_smiles: str | None = None, ligand_expo_mapping: dict[str, Any] | None = None, permute_ligand_smiles: bool = False) → Mol | None

Using ProDy, extract protein atoms and ligand molecules from a PDB file and write them to separate files.

Parameters:
  • input_pdb_file – The input PDB file.

  • protein_output_pdb_file – The output PDB file for the protein atoms.

  • ligands_output_sdf_file – The output SDF file for the ligand molecules.

  • sanitize – Whether to sanitize the ligand molecules.

  • add_element_types – Whether to add element types to the protein atoms.

  • write_output_files – Whether to write the output files.

  • load_hetatms_as_ligands – Whether to load HETATM records as ligands if no ligands are initially found.

  • ligand_smiles – The SMILES string of the ligand molecule.

  • ligand_expo_mapping – The Ligand Expo mapping.

  • permute_ligand_smiles – Whether to permute the ligand SMILES string’s fragment components if necessary.

Returns:

The combined final ligand molecule(s) as an RDKit molecule.

posebench.utils.data_utils.extract_remarks_from_pdb(pdb_file: str, remark_number: int | None = None) → list[str]

Extract REMARK statements from a PDB file.

Parameters:
  • pdb_file – Path to the PDB file.

  • remark_number – Specific REMARK number to filter. If None, extracts all REMARKs.

Returns:

List of REMARK statements.
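The filtering behaviour described above can be sketched with the standard library alone. The name extract_remarks and the choice to operate on text rather than a file path are hypothetical; the sketch relies only on the PDB convention that the REMARK number occupies columns 8-10.

```python
def extract_remarks(pdb_text, remark_number=None):
    """Return REMARK lines, optionally restricted to one REMARK number."""
    remarks = []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK"):
            continue
        if remark_number is not None:
            # The REMARK number sits in columns 8-10 of the record.
            field = line[7:10].strip()
            if not field.isdigit() or int(field) != remark_number:
                continue
        remarks.append(line.rstrip())
    return remarks


example_remarks = "\n".join([
    "HEADER    HYDROLASE",
    "REMARK   2 RESOLUTION.    1.80 ANGSTROMS.",
    "REMARK   3 REFINEMENT.",
    "REMARK 350 BIOMOLECULE: 1",
])
print(extract_remarks(example_remarks, remark_number=350))
```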

posebench.utils.data_utils.extract_sequences_from_protein_structure_file(protein_filepath: str | Path, structure: Structure | None = None, exclude_hetero: bool = False) → list[str]

Extract the protein chain sequences from a protein structure file.

Parameters:
  • protein_filepath – Path to the protein structure file.

  • structure – Optional BioPython structure object to use instead.

  • exclude_hetero – Whether to exclude hetero (e.g., water) residues.

Returns:

A list of protein sequences.

posebench.utils.data_utils.get_pdb_components_with_prody(input_pdb_file: str, load_hetatms_as_ligands: bool = False) → tuple

Split a protein-ligand PDB file into protein and ligand components using ProDy.

Parameters:
  • input_pdb_file – Path to the input PDB file.

  • load_hetatms_as_ligands – Whether to load HETATM records as ligands if no ligands are initially found.

Returns:

Tuple of protein and ligand components.

posebench.utils.data_utils.parse_fasta(file_path: str, only_mols: list[Literal['protein', 'na']] | None = None, collate_by_pdb_id: bool = False) → dict[str, str]

Parse a FASTA file into a dictionary, optionally filtering by molecule type.

Parameters:
  • file_path – Path to the .txt FASTA file.

  • only_mols – List of molecule types to filter (e.g., ['protein', 'na']).

  • collate_by_pdb_id – Whether to group sequences by PDB ID.

Returns:

A dictionary where keys are sequence IDs and values are tuples (description, sequence).
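To make the parsing and filtering concrete, here is a small sketch that turns FASTA text into a dictionary of (description, sequence) tuples. It assumes headers in the style ">1abc_A mol:protein length:5  NAME", where a "mol:" token carries the molecule type. That header layout, the name parse_fasta_text, and the example records are assumptions for illustration, and grouping by PDB ID is not shown.

```python
def parse_fasta_text(fasta_text, only_mols=None):
    """Parse FASTA text into {sequence_id: (description, sequence)}."""
    records = {}
    current_id = None
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id, _, description = line[1:].partition(" ")
            # Assumption: the molecule type appears as a "mol:<type>" token.
            mol_type = None
            for token in description.split():
                if token.startswith("mol:"):
                    mol_type = token[len("mol:"):]
            if only_mols is not None and mol_type not in only_mols:
                current_id = None  # skip this record's sequence lines
            else:
                current_id = seq_id
                records[current_id] = (description, "")
        elif current_id is not None:
            # Sequences may span several lines; append each one.
            description, sequence = records[current_id]
            records[current_id] = (description, sequence + line)
    return records


example_fasta = (
    ">1abc_A mol:protein length:5  EXAMPLE PROTEIN\n"
    "MKTAY\n"
    ">1abc_B mol:na length:4  EXAMPLE DNA\n"
    "ACGT\n"
)
print(parse_fasta_text(example_fasta, only_mols=["protein"]))
```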

posebench.utils.data_utils.parse_inference_inputs_from_dir(input_data_dir: str | Path, pdb_ids: set[Any] | None = None) → list[tuple[str, str]]

Parse a data directory containing subdirectories of protein-ligand complexes and return corresponding SMILES strings and PDB IDs.

Parameters:
  • input_data_dir – Path to the input data directory.

  • pdb_ids – Optional set of IDs by which to filter processing.

Returns:

A list of tuples each containing a SMILES string and a PDB ID.

posebench.utils.data_utils.process_ligand_with_prody(ligand, res_name, chain, resnum, sanitize: bool = True, sub_smiles: str | None = None, ligand_expo_mapping: dict[str, Any] | None = None) → Mol

Add bond orders to a PDB ligand using ProDy:

  1. Select the ligand component with name “res_name”.
  2. Get the corresponding SMILES from pypdb.
  3. Create a template molecule from the SMILES in step 2.
  4. Write the PDB file to a stream.
  5. Read the stream into an RDKit molecule.
  6. Assign the bond orders from the template from step 3.

Parameters:
  • ligand – Ligand as generated by ProDy.

  • res_name – Residue name of the ligand to extract.

  • chain – Chain of the ligand to extract.

  • resnum – Residue number of the ligand to extract.

  • sanitize – Whether to sanitize the molecule.

  • sub_smiles – Optional SMILES string of the ligand molecule.

  • ligand_expo_mapping – Optional Ligand Expo mapping.

Returns:

Molecule with bond orders assigned.

posebench.utils.data_utils.read_ligand_expo(ligand_expo_url: str = 'http://ligand-expo.rcsb.org/dictionaries', ligand_expo_filename: str = 'Components-smiles-stereo-oe.smi') → dict[str, Any]

Read Ligand Expo data, first trying to find a file called Components-smiles-stereo-oe.smi in the current directory. If the file can’t be found, download it from the RCSB.

Parameters:
  • ligand_expo_url – URL to Ligand Expo.

  • ligand_expo_filename – Name of the Ligand Expo file.

Returns:

Ligand Expo as a dictionary keyed by ligand ID.
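The local-first lookup and the resulting dictionary can be sketched as follows. The sketch assumes the .smi file is tab-separated with columns SMILES, ligand ID, and name; that layout, the helper names read_text_local_first and parse_ligand_expo_smiles, and the shape of the returned values are assumptions for illustration rather than the library's actual behaviour.

```python
from pathlib import Path
from urllib.request import urlopen


def read_text_local_first(filename, base_url):
    """Return the file's text from disk if present, else fetch it from base_url."""
    path = Path(filename)
    if path.exists():
        return path.read_text()
    with urlopen(f"{base_url}/{filename}") as response:
        return response.read().decode("utf-8")


def parse_ligand_expo_smiles(smi_text):
    """Map ligand ID -> {"SMILES": ..., "Name": ...} (assumed column order)."""
    mapping = {}
    for line in smi_text.splitlines():
        # Assumption: fields are tab-separated as SMILES, ID, name.
        fields = [field for field in line.split("\t") if field]
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        smiles, ligand_id = fields[0], fields[1]
        name = fields[2] if len(fields) > 2 else ""
        mapping[ligand_id] = {"SMILES": smiles, "Name": name}
    return mapping


sample_smi = "CC(=O)O\tACY\tACETIC ACID\nO\tHOH\tWATER\n"
print(parse_ligand_expo_smiles(sample_smi)["ACY"]["SMILES"])
```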

posebench.utils.data_utils.renumber_biopython_structure_residues(structure: Structure, gap_insertion_point: str | None = None) → Structure

Renumber the residues of a BioPython structure, starting from 1 for each chain.

Parameters:
  • structure – BioPython structure object.

  • gap_insertion_point – Optional colon-separated string giving the chain and residue index of the residue at which to insert a single-index gap.

Returns:

BioPython structure object with renumbered residues.

posebench.utils.data_utils.renumber_pdb_df_residues(input_pdb_file: str, output_pdb_file: str)

Renumber residues in a PDB file starting from 1 for each chain.

Parameters:
  • input_pdb_file – Path to the input PDB file.

  • output_pdb_file – Path to the output PDB file.
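Per-chain renumbering can be illustrated directly on PDB text lines. This is a minimal sketch under stated assumptions: the name renumber_pdb_lines is hypothetical, insertion codes are cleared after renumbering, and the function works on a list of lines rather than on the data frame that the real function's name suggests.

```python
def renumber_pdb_lines(pdb_lines):
    """Renumber residues from 1 within each chain (hypothetical helper)."""
    renumbered = []
    last_number = {}  # chain ID -> last residue number assigned
    assigned = {}     # (chain ID, original resSeq + iCode) -> new number
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            renumbered.append(line)
            continue
        chain = line[21]        # chain ID is column 22
        original = line[22:27]  # resSeq (columns 23-26) plus iCode (column 27)
        key = (chain, original)
        if key not in assigned:
            last_number[chain] = last_number.get(chain, 0) + 1
            assigned[key] = last_number[chain]
        # Write the new number right-aligned and blank the insertion code.
        renumbered.append(f"{line[:22]}{assigned[key]:>4d} {line[27:]}")
    return renumbered


example_lines = [
    "ATOM      1  N   ALA A  10      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A  10      11.639   6.071  -5.147  1.00  0.00           C",
    "ATOM      3  N   GLY A  12      13.000   7.000  -4.000  1.00  0.00           N",
    "ATOM      4  N   SER B   7      20.000  20.000  20.000  1.00  0.00           N",
]
for new_line in renumber_pdb_lines(example_lines):
    print(new_line)
```

Because the replacement keeps the fixed-width columns intact, every output line has the same length as its input line.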

posebench.utils.data_utils.write_pdb_with_prody(atoms, pdb_name, add_element_types=False)

Write atoms to a PDB file using ProDy.

Parameters:
  • atoms – Atoms object from ProDy.

  • pdb_name – Base name for the PDB file.

  • add_element_types – Whether to add element types to the PDB file.

posebench.utils.data_utils.write_sdf(new_mol: Mol, pdb_name: str)

Write an RDKit molecule to an SD file.

Parameters:
  • new_mol – RDKit molecule.

  • pdb_name – Name of the output file.