Data utilities
- posebench.utils.data_utils.combine_molecules(molecule_list: list[Mol]) → Mol
Combine a list of RDKit molecules into a single molecule.
- Parameters:
molecule_list – A list of RDKit molecules.
- Returns:
A single RDKit molecule.
- posebench.utils.data_utils.count_num_residues_in_pdb_file(pdb_filepath: str) → int
Count the number of Cα (alpha-carbon) atoms, i.e., residues, in a PDB file.
- Parameters:
pdb_filepath – Path to PDB file.
- Returns:
Number of Cα atoms (i.e., residues) in the PDB file.
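The counting idea can be illustrated without any third-party library. The following is a minimal sketch, not the posebench implementation: `count_ca_atoms` is a hypothetical helper that assumes fixed-column PDB formatting and counts `ATOM` records whose atom name is `CA`. The toy coordinates are made up.

```python
import os
import tempfile


def count_ca_atoms(pdb_filepath: str) -> int:
    """Count ATOM records whose atom name is CA (one per standard residue)."""
    count = 0
    with open(pdb_filepath) as handle:
        for line in handle:
            # PDB columns 13-16 (slice 12:16) hold the atom name.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                count += 1
    return count


# Toy fragment with made-up coordinates: two residues plus a calcium ion.
toy_pdb = (
    "ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "ATOM      2  CA  ALA A   1      12.560  13.100   2.300  1.00 20.00           C\n"
    "ATOM      3  CA  GLY A   2      15.900  14.500   3.000  1.00 20.00           C\n"
    "HETATM    4 CA    CA A 101      20.000  20.000  20.000  1.00 20.00          CA\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.pdb")
    with open(toy_path, "w") as handle:
        handle.write(toy_pdb)
    num_residues = count_ca_atoms(toy_path)

print(num_residues)  # 2
```

Restricting the match to `ATOM` records keeps the calcium ion (a `HETATM` record whose atom name is also `CA`) out of the residue count.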
- posebench.utils.data_utils.count_pdb_inter_residue_clashes(pdb_filepath: str, clash_cutoff: float = 0.63) → int
Count the number of inter-residue clashes in a protein PDB file. From: https://www.blopig.com/blog/2023/05/checking-your-pdb-file-for-clashing-atoms/
- Parameters:
pdb_filepath – Path to the PDB file.
clash_cutoff – The cutoff for what is considered a clash.
- Returns:
The number of inter-residue clashes in the structure.
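One common way to define such a clash is to compare each inter-residue atom pair's distance with a fraction of the sum of the two atoms' van der Waals radii. The sketch below illustrates that idea only; it assumes `clash_cutoff` is that fraction, uses approximate radii, and takes atoms as plain tuples rather than reading a PDB file. `ATOM_RADII` and `count_clashes` are hypothetical names, and the sketch omits the exclusion of covalently bonded neighbours (e.g., peptide bonds) that a practical check needs.

```python
import math
from itertools import combinations

# Approximate van der Waals radii in angstroms (assumed values for illustration).
ATOM_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_clashes(atoms, clash_cutoff=0.63):
    """Count atom pairs from different residues closer than
    clash_cutoff * (radius_a + radius_b).

    atoms: list of (residue_id, element, (x, y, z)) tuples.
    """
    clashes = 0
    for (res_a, el_a, xyz_a), (res_b, el_b, xyz_b) in combinations(atoms, 2):
        if res_a == res_b:
            continue  # only inter-residue pairs are considered
        threshold = clash_cutoff * (ATOM_RADII[el_a] + ATOM_RADII[el_b])
        if math.dist(xyz_a, xyz_b) < threshold:
            clashes += 1
    return clashes


# Toy atoms with made-up coordinates.
toy_atoms = [
    (1, "C", (0.0, 0.0, 0.0)),
    (1, "N", (1.0, 0.0, 0.0)),
    (2, "O", (1.5, 0.0, 0.0)),
    (3, "C", (10.0, 0.0, 0.0)),
]

default_clashes = count_clashes(toy_atoms)
strict_clashes = count_clashes(toy_atoms, clash_cutoff=0.1)
print(default_clashes, strict_clashes)  # 2 0
```

Lowering the cutoff shrinks every pairwise threshold, so fewer contacts are reported as clashes.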
- posebench.utils.data_utils.create_sdf_file_from_smiles(smiles: str, output_sdf_file: str) → str
Create an SDF file from a SMILES string.
- Parameters:
smiles – SMILES string of the molecule.
output_sdf_file – Path to the output SDF file.
- Returns:
Path to the output SDF file.
- posebench.utils.data_utils.extract_protein_and_ligands_with_prody(input_pdb_file: str, protein_output_pdb_file: str | None, ligands_output_sdf_file: str | None, sanitize: bool = True, add_element_types: bool = False, write_output_files: bool = True, load_hetatms_as_ligands: bool = False, ligand_smiles: str | None = None, ligand_expo_mapping: dict[str, Any] | None = None, permute_ligand_smiles: bool = False) → Mol | None
Using ProDy, extract protein atoms and ligand molecules from a PDB file and write them to separate files.
- Parameters:
input_pdb_file – The input PDB file.
protein_output_pdb_file – The output PDB file for the protein atoms.
ligands_output_sdf_file – The output SDF file for the ligand molecules.
sanitize – Whether to sanitize the ligand molecules.
add_element_types – Whether to add element types to the protein atoms.
write_output_files – Whether to write the output files.
load_hetatms_as_ligands – Whether to load HETATM records as ligands if no ligands are initially found.
ligand_smiles – The SMILES string of the ligand molecule.
ligand_expo_mapping – The Ligand Expo mapping.
permute_ligand_smiles – Whether to permute the ligand SMILES string’s fragment components if necessary.
- Returns:
The combined final ligand molecule(s) as an RDKit molecule.
- posebench.utils.data_utils.extract_remarks_from_pdb(pdb_file: str, remark_number: int | None = None) → list[str]
Extract REMARK statements from a PDB file.
- Parameters:
pdb_file – Path to the PDB file.
remark_number – Specific REMARK number to filter. If None, extracts all REMARKs.
- Returns:
List of REMARK statements.
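The filtering behaviour can be sketched with the standard library alone. This is an illustration under stated assumptions, not the posebench code: `extract_remarks` is a hypothetical helper that works on a list of lines rather than a file path and assumes the REMARK number sits in PDB columns 8-10.

```python
def extract_remarks(pdb_lines, remark_number=None):
    """Return REMARK lines, optionally restricted to one REMARK number."""
    remarks = []
    for line in pdb_lines:
        if not line.startswith("REMARK"):
            continue
        if remark_number is not None:
            # Columns 8-10 hold the REMARK number; slice 6:10 also covers
            # the blank column 7, which strip() removes.
            field = line[6:10].strip()
            if not field.isdigit() or int(field) != remark_number:
                continue
        remarks.append(line.rstrip("\n"))
    return remarks


toy_lines = [
    "REMARK   2 RESOLUTION.    1.80 ANGSTROMS.",
    "REMARK   3 REFINEMENT.",
    "REMARK 350 BIOMOLECULE: 1",
    "ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00 20.00           N",
]

all_remarks = extract_remarks(toy_lines)
resolution_remarks = extract_remarks(toy_lines, remark_number=2)
print(len(all_remarks), resolution_remarks)
```

Passing `remark_number=None` keeps every REMARK line, mirroring the documented default.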
- posebench.utils.data_utils.extract_sequences_from_protein_structure_file(protein_filepath: str | Path, structure: Structure | None = None, exclude_hetero: bool = False) → list[str]
Extract the protein chain sequences from a protein structure file.
- Parameters:
protein_filepath – Path to the protein structure file.
structure – Optional BioPython structure object to use instead.
exclude_hetero – Whether to exclude hetero (e.g., water) residues.
- Returns:
A list of protein sequences.
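To make the effect of `exclude_hetero` concrete, here is a rough standard-library sketch of per-chain sequence extraction. It is not the BioPython-based implementation: `chain_sequences` and `THREE_TO_ONE` are hypothetical names, and mapping unrecognised residues (such as water) to `X` is an assumption made for this example.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines, exclude_hetero=False):
    """Build one sequence string per chain from fixed-column PDB lines."""
    sequences = {}  # chain ID -> one-letter codes, in order of first appearance
    seen = set()
    for line in pdb_lines:
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        if exclude_hetero and record == "HETATM":
            continue
        chain_id = line[21]
        res_name = line[17:20].strip()
        residue_key = (chain_id, line[22:27])  # residue number + insertion code
        if residue_key in seen:
            continue  # already counted via an earlier atom of this residue
        seen.add(residue_key)
        # Assumption: unknown residue names are written as "X".
        sequences.setdefault(chain_id, []).append(THREE_TO_ONE.get(res_name, "X"))
    return ["".join(codes) for codes in sequences.values()]


toy_lines = [
    "ATOM      1  N   ALA A   1",
    "ATOM      2  CA  ALA A   1",
    "ATOM      3  CA  GLY A   2",
    "HETATM    4  O   HOH A   3",
    "ATOM      5  CA  LYS B   1",
]

with_hetero = chain_sequences(toy_lines)
without_hetero = chain_sequences(toy_lines, exclude_hetero=True)
print(with_hetero, without_hetero)  # ['AGX', 'K'] ['AG', 'K']
```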
- posebench.utils.data_utils.get_pdb_components_with_prody(input_pdb_file: str, load_hetatms_as_ligands: bool = False) → tuple
Split a protein-ligand PDB file into protein and ligand components using ProDy.
- Parameters:
input_pdb_file – Path to the input PDB file.
load_hetatms_as_ligands – Whether to load HETATM records as ligands if no ligands are initially found.
- Returns:
Tuple of protein and ligand components.
- posebench.utils.data_utils.parse_fasta(file_path: str, only_mols: list[Literal['protein', 'na']] | None = None, collate_by_pdb_id: bool = False) → dict[str, str]
Parses a FASTA file into a dictionary and optionally filters by molecule type.
- Parameters:
file_path – Path to the .txt FASTA file.
only_mols – List of molecule types to filter (e.g., ['protein', 'na']).
collate_by_pdb_id – Whether to group sequences by PDB ID.
- Returns:
A dictionary where keys are sequence IDs and values are tuples (description, sequence).
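A minimal sketch of FASTA parsing with molecule-type filtering follows. Several details are assumptions rather than documented behaviour: headers are taken to carry a `mol:<type>` token (as in `>101m_A mol:protein length:5  MYOGLOBIN`), the hypothetical helper `parse_fasta_text` works on a string instead of a file path, it maps each ID to its sequence only (following the annotated return type), and it does not implement `collate_by_pdb_id`.

```python
def parse_fasta_text(text, only_mols=None):
    """Map sequence IDs to sequences, optionally keeping only some molecule types."""
    records = {}
    seq_id, keep = None, False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0]
            # Assumed header token of the form "mol:protein" or "mol:na".
            mol_type = next((t[4:] for t in tokens if t.startswith("mol:")), None)
            keep = only_mols is None or mol_type in only_mols
            if keep:
                records[seq_id] = ""
        elif keep and seq_id is not None:
            records[seq_id] += line  # sequences may span several lines
    return records


toy_fasta = """>101m_A mol:protein length:5  MYOGLOBIN
MVL
SE
>1abc_B mol:na length:4  DNA
ACGT
"""

all_records = parse_fasta_text(toy_fasta)
protein_records = parse_fasta_text(toy_fasta, only_mols=["protein"])
print(all_records, protein_records)
```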
- posebench.utils.data_utils.parse_inference_inputs_from_dir(input_data_dir: str | Path, pdb_ids: set[Any] | None = None) → list[tuple[str, str]]
Parse a data directory containing subdirectories of protein-ligand complexes and return corresponding SMILES strings and PDB IDs.
- Parameters:
input_data_dir – Path to the input data directory.
pdb_ids – Optional set of IDs by which to filter processing.
- Returns:
A list of tuples each containing a SMILES string and a PDB ID.
- posebench.utils.data_utils.process_ligand_with_prody(ligand, res_name, chain, resnum, sanitize: bool = True, sub_smiles: str | None = None, ligand_expo_mapping: dict[str, Any] | None = None) → Mol
Add bond orders to a PDB ligand using ProDy:
1. Select the ligand component with name “res_name”.
2. Get the corresponding SMILES from pypdb.
3. Create a template molecule from the SMILES in step 2.
4. Write the PDB file to a stream.
5. Read the stream into an RDKit molecule.
6. Assign the bond orders from the template from step 3.
- Parameters:
ligand – Ligand as generated by ProDy.
res_name – Residue name of the ligand to extract.
chain – Chain of the ligand to extract.
resnum – Residue number of the ligand to extract.
sanitize – Whether to sanitize the molecule.
sub_smiles – Optional SMILES string of the ligand molecule.
ligand_expo_mapping – Optional Ligand Expo mapping.
- Returns:
Molecule with bond orders assigned.
- posebench.utils.data_utils.read_ligand_expo(ligand_expo_url: str = 'http://ligand-expo.rcsb.org/dictionaries', ligand_expo_filename: str = 'Components-smiles-stereo-oe.smi') → dict[str, Any]
Read Ligand Expo data, first trying to find a file called Components-smiles-stereo-oe.smi in the current directory. If the file can’t be found, grab it from the RCSB.
- Parameters:
ligand_expo_url – URL to Ligand Expo.
ligand_expo_filename – Name of the Ligand Expo file.
- Returns:
Ligand Expo as a dictionary with ligand ID as the key.
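The "use the local copy, otherwise download it" pattern described above can be sketched as follows. This is an illustration, not the posebench implementation: `read_smiles_table` is a hypothetical helper, the `cache_dir` parameter is added for the example, and the file is assumed to be tab-separated with the SMILES string in the first column and the ligand ID in the second. The two rows are toy data.

```python
import os
import tempfile
import urllib.request


def read_smiles_table(url_base, filename, cache_dir="."):
    """Load a SMILES table keyed by ligand ID, downloading it only if absent."""
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        # No local copy: fetch the file once and keep it for later calls.
        urllib.request.urlretrieve(f"{url_base}/{filename}", local_path)
    mapping = {}
    with open(local_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                # Assumed column order: SMILES, ligand ID, (name).
                smiles, ligand_id = fields[0], fields[1]
                mapping[ligand_id] = smiles
    return mapping


# A pre-existing local file means no network access is attempted here.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_name = "Components-smiles-stereo-oe.smi"
    with open(os.path.join(tmp_dir, toy_name), "w") as handle:
        handle.write("O\tHOH\tWATER\nCCO\tEOH\tETHANOL\n")
    ligand_map = read_smiles_table(
        "http://ligand-expo.rcsb.org/dictionaries", toy_name, cache_dir=tmp_dir
    )

print(ligand_map)  # {'HOH': 'O', 'EOH': 'CCO'}
```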
- posebench.utils.data_utils.renumber_biopython_structure_residues(structure: Structure, gap_insertion_point: str | None = None) → Structure
Renumber residues in a PDB file using BioPython starting from 1 for each chain.
- Parameters:
structure – BioPython structure object.
gap_insertion_point – Optional colon-separated string representing the chain-residue pair index of the residue at which to insert a single index gap.
- Returns:
BioPython structure object with renumbered residues.
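The core of per-chain renumbering can be shown on plain data. The sketch below is not the BioPython-based implementation: `renumber_per_chain` is a hypothetical helper that operates on `(chain_id, residue_number)` pairs and does not model `gap_insertion_point`.

```python
def renumber_per_chain(residues):
    """Renumber residues from 1 within each chain, preserving input order.

    residues: ordered list of (chain_id, original_residue_number) pairs.
    """
    last_index = {}  # chain ID -> last number assigned in that chain
    renumbered = []
    for chain_id, _original_number in residues:
        last_index[chain_id] = last_index.get(chain_id, 0) + 1
        renumbered.append((chain_id, last_index[chain_id]))
    return renumbered


# Chain A starts at 5 and has a numbering gap; chain B starts at 101.
toy_residues = [("A", 5), ("A", 6), ("A", 9), ("B", 101), ("B", 102)]
renumbered = renumber_per_chain(toy_residues)
print(renumbered)
# [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2)]
```

Keeping a separate counter per chain is what makes each chain restart at 1 regardless of its original numbering.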
- posebench.utils.data_utils.renumber_pdb_df_residues(input_pdb_file: str, output_pdb_file: str)
Renumber residues in a PDB file starting from 1 for each chain.
- Parameters:
input_pdb_file – Path to the input PDB file.