Data utilities
- multicom_ligand.utils.data_utils.combine_molecules(molecule_list: list[Mol]) → Mol
Combine a list of RDKit molecules into a single molecule.
- Parameters:
molecule_list – A list of RDKit molecules.
- Returns:
A single RDKit molecule.
- multicom_ligand.utils.data_utils.count_num_residues_in_pdb_file(pdb_filepath: str) → int
Count the number of Cα (alpha-carbon) atoms (i.e., residues) in a PDB file.
- Parameters:
pdb_filepath – Path to PDB file.
- Returns:
Number of Cα (alpha-carbon) atoms (i.e., residues) in the PDB file.
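To make the counting rule concrete, here is a minimal sketch using only the Python standard library. It is not the package's implementation: the helper name `count_ca_atoms` and the example records are hypothetical, and the sketch assumes that only `ATOM` records are counted so that calcium ions (`HETATM` records also named `CA`) are excluded.

```python
import tempfile
from pathlib import Path


def count_ca_atoms(pdb_filepath: str) -> int:
    """Count ATOM records whose atom name is CA (alpha carbon)."""
    count = 0
    with open(pdb_filepath) as handle:
        for line in handle:
            # Columns 13-16 of a PDB record hold the atom name.
            # Restricting to ATOM records skips calcium ions (HETATM "CA").
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                count += 1
    return count


# Two residues (ALA, GLY) plus one calcium ion that must not be counted.
PDB_TEXT = "\n".join(
    [
        "ATOM      1  N   ALA A   1      11.104   6.134  -6.504",
        "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147",
        "ATOM      3  N   GLY A   2       9.300   5.900  -4.500",
        "ATOM      4  CA  GLY A   2       8.200   5.500  -3.600",
        "HETATM    5 CA    CA A 101      20.000  20.000  20.000",
        "END",
    ]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pdb_path = Path(tmp_dir) / "example.pdb"
    pdb_path.write_text(PDB_TEXT + "\n")
    num_residues = count_ca_atoms(str(pdb_path))

print(num_residues)
```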
- multicom_ligand.utils.data_utils.count_pdb_inter_residue_clashes(pdb_filepath: str, clash_cutoff: float = 0.63) → int
Count the number of inter-residue clashes in a protein PDB file. From: https://www.blopig.com/blog/2023/05/checking-your-pdb-file-for-clashing-atoms/
- Parameters:
pdb_filepath – Path to the PDB file.
clash_cutoff – The cutoff for what is considered a clash.
- Returns:
The number of inter-residue clashes in the structure.
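The sketch below illustrates one common way to define an inter-residue clash, assuming (as in the linked blog post) that `clash_cutoff` scales the sum of the two atoms' van der Waals radii. It is a simplified stand-in, not the package's code: the function name and tuple-based input are hypothetical, it uses a brute-force pair loop, and it does not exclude covalently bonded neighbours such as peptide bonds or disulfide bridges, which a real implementation would need to skip.

```python
import math
from itertools import combinations

# Van der Waals radii in angstroms (Bondi values) for common protein heavy atoms.
ATOM_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_inter_residue_clashes(atoms, clash_cutoff=0.63):
    """Count atom pairs from different residues that lie closer than
    clash_cutoff * (radius_a + radius_b).

    `atoms` is a list of (chain_id, residue_number, element, (x, y, z)) tuples.
    """
    clashes = 0
    for atom_a, atom_b in combinations(atoms, 2):
        chain_a, res_a, elem_a, xyz_a = atom_a
        chain_b, res_b, elem_b, xyz_b = atom_b
        if (chain_a, res_a) == (chain_b, res_b):
            continue  # pairs inside one residue are not inter-residue clashes
        if elem_a not in ATOM_RADII or elem_b not in ATOM_RADII:
            continue  # elements without a tabulated radius are ignored here
        threshold = clash_cutoff * (ATOM_RADII[elem_a] + ATOM_RADII[elem_b])
        if math.dist(xyz_a, xyz_b) < threshold:
            clashes += 1
    return clashes


atoms = [
    ("A", 1, "C", (0.0, 0.0, 0.0)),
    ("A", 1, "O", (-1.2, 0.0, 0.0)),  # same residue as the atom above
    ("A", 5, "C", (2.0, 0.0, 0.0)),   # 2.0 A from the first carbon
    ("A", 9, "N", (10.0, 0.0, 0.0)),  # far from everything
]

num_clashes = count_inter_residue_clashes(atoms)
print(num_clashes)
```

With the default cutoff, the carbon-carbon threshold is 0.63 × (1.70 + 1.70) = 2.142 Å, so the two carbons 2.0 Å apart are the only pair flagged. Lowering `clash_cutoff` to 0.5 drops the threshold to 1.7 Å and no pair is counted.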
- multicom_ligand.utils.data_utils.create_sdf_file_from_smiles(smiles: str, output_sdf_file: str) → str
Create an SDF file from a SMILES string.
- Parameters:
smiles – SMILES string of the molecule.
output_sdf_file – Path to the output SDF file.
- Returns:
Path to the output SDF file.
- multicom_ligand.utils.data_utils.extract_protein_and_ligands_with_prody(input_pdb_file: str, protein_output_pdb_file: str, ligands_output_sdf_file: str, sanitize: bool = True, add_element_types: bool = False, ligand_smiles: str | None = None) → Mol | None
Using ProDy, extract protein atoms and ligand molecules from a PDB file and write them to separate files.
- Parameters:
input_pdb_file – The input PDB file.
protein_output_pdb_file – The output PDB file for the protein atoms.
ligands_output_sdf_file – The output SDF file for the ligand molecules.
sanitize – Whether to sanitize the ligand molecules.
add_element_types – Whether to add element types to the protein atoms.
ligand_smiles – The SMILES string of the ligand molecule.
- Returns:
The combined final ligand molecule(s) as an RDKit molecule.
- multicom_ligand.utils.data_utils.extract_sequences_from_macromolecule_structure_file(pdb_filepath: str | Path, structure: Structure | None = None) → tuple[list[str], list[Literal['DNA', 'RNA', 'Protein']]]
Extract the chain sequences from a macromolecule structure file.
- Parameters:
pdb_filepath – Path to the PDB structure file.
structure – Optional BioPython structure object to use instead.
- Returns:
A list of macromolecule sequences and their corresponding molecule chain types (e.g., Protein).
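The shape of the return value can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration rather than the package's logic: the helper names `atom_line` and `sequences_from_pdb_lines` are hypothetical, chains are classified purely by residue names (`DA`/`DC`/`DG`/`DT` for DNA, `A`/`C`/`G`/`U` for RNA, anything else as protein), and unknown residues are written as `X`.

```python
PROTEIN_CODES = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
DNA_CODES = {"DA": "A", "DC": "C", "DG": "G", "DT": "T"}
RNA_CODES = {"A": "A", "C": "C", "G": "G", "U": "U"}


def atom_line(serial, atom_name, res_name, chain_id, res_seq):
    """Build a minimal fixed-width ATOM record (coordinates set to zero)."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain_id}{res_seq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}"
    )


def sequences_from_pdb_lines(pdb_lines):
    """Return (sequences, chain_types) with one entry per chain, in file order."""
    residues_by_chain = {}  # chain ID -> {residue key: residue name}, insertion-ordered
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()
        chain_id = line[21]
        res_key = line[22:27]  # residue number plus insertion code
        residues_by_chain.setdefault(chain_id, {}).setdefault(res_key, res_name)

    sequences, chain_types = [], []
    for residues in residues_by_chain.values():
        names = list(residues.values())
        if all(name in DNA_CODES for name in names):
            codes, chain_type = DNA_CODES, "DNA"
        elif all(name in RNA_CODES for name in names):
            codes, chain_type = RNA_CODES, "RNA"
        else:
            codes, chain_type = PROTEIN_CODES, "Protein"
        sequences.append("".join(codes.get(name, "X") for name in names))
        chain_types.append(chain_type)
    return sequences, chain_types


pdb_lines = [
    atom_line(1, " N  ", "MET", "A", 1),
    atom_line(2, " CA ", "MET", "A", 1),
    atom_line(3, " CA ", "GLY", "A", 2),
    atom_line(4, " CA ", "LYS", "A", 3),
    atom_line(5, " P  ", "DA", "B", 1),
    atom_line(6, " P  ", "DT", "B", 2),
    atom_line(7, " P  ", "G", "C", 1),
    atom_line(8, " P  ", "U", "C", 2),
]

sequences, chain_types = sequences_from_pdb_lines(pdb_lines)
print(sequences, chain_types)
```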
- multicom_ligand.utils.data_utils.get_pdb_components_with_prody(pdb_id) → tuple
Split a protein-ligand PDB structure into its protein, nucleic acid, and ligand components using ProDy.
- Parameters:
pdb_id – PDB ID
- Returns:
protein structure, nucleic acid structure, and ligand residues
- multicom_ligand.utils.data_utils.parse_inference_inputs_from_dir(input_data_dir: str | Path, pdb_ids: set[Any] | None = None) → list[tuple[str, str]]
Parse a data directory containing subdirectories of protein-ligand complexes and return corresponding SMILES strings and PDB IDs.
- Parameters:
input_data_dir – Path to the input data directory.
pdb_ids – Optional set of IDs by which to filter processing.
- Returns:
A list of tuples each containing a SMILES string and a PDB ID.
- multicom_ligand.utils.data_utils.process_ligand_with_prody(ligand, res_name, chain, resnum, sanitize: bool = True, sub_smiles: str | None = None) → Mol
Add bond orders to a PDB ligand using ProDy.
1. Select the ligand component with name “res_name”.
2. Get the corresponding SMILES from pypdb.
3. Create a template molecule from the SMILES in step 2.
4. Write the PDB file to a stream.
5. Read the stream into an RDKit molecule.
6. Assign the bond orders from the template from step 3.
- Parameters:
ligand – ligand as generated by ProDy
res_name – residue name of ligand to extract
chain – chain of ligand to extract
resnum – residue number of ligand to extract
sanitize – whether to sanitize the molecule
sub_smiles – optional SMILES string of the ligand molecule
- Returns:
molecule with bond orders assigned
- multicom_ligand.utils.data_utils.read_molecule(molecule_file: str, sanitize: bool = False, calc_charges: bool = False, remove_hs: bool = False) → Mol | None
Load an RDKit molecule from a given filepath.
- multicom_ligand.utils.data_utils.renumber_biopython_structure_residues(structure: Structure, gap_insertion_point: str | None = None) → Structure
Renumber the residues of a BioPython structure, starting from 1 for each chain.
- Parameters:
structure – BioPython structure object.
gap_insertion_point – Optional colon-separated string giving the chain-residue pair index of the residue at which to insert a single-index gap.
- Returns:
BioPython structure object with renumbered residues.
- multicom_ligand.utils.data_utils.renumber_pdb_df_residues(input_pdb_file: str, output_pdb_file: str)
Renumber residues in a PDB file starting from 1 for each chain.
- Parameters:
input_pdb_file – Path to the input PDB file.
output_pdb_file – Path to the output PDB file.
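Per-chain renumbering can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation (whose name suggests a DataFrame-based approach): the helpers `atom_line` and `renumber_pdb_lines` are hypothetical, the sketch works on in-memory lines rather than file paths, and it blanks insertion codes when writing the new numbers.

```python
def atom_line(serial, atom_name, res_name, chain_id, res_seq):
    """Build a minimal fixed-width ATOM record (coordinates set to zero)."""
    return (
        f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain_id}{res_seq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}"
    )


def renumber_pdb_lines(pdb_lines):
    """Renumber residues from 1 within each chain, preserving file order."""
    new_numbers = {}  # (chain ID, original residue field) -> new residue number
    next_number = {}  # chain ID -> last number handed out for that chain
    renumbered = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            chain_id = line[21]
            old_key = (chain_id, line[22:27])  # residue number plus insertion code
            if old_key not in new_numbers:
                next_number[chain_id] = next_number.get(chain_id, 0) + 1
                new_numbers[old_key] = next_number[chain_id]
            # Write the new number into columns 23-26 and blank the insertion code.
            line = f"{line[:22]}{new_numbers[old_key]:>4} {line[27:]}"
        renumbered.append(line)
    return renumbered


original_lines = [
    atom_line(1, " N  ", "ALA", "A", 10),
    atom_line(2, " CA ", "ALA", "A", 10),
    atom_line(3, " CA ", "GLY", "A", 11),
    atom_line(4, " CA ", "SER", "A", 25),
    atom_line(5, " CA ", "LYS", "B", 7),
]

renumbered = renumber_pdb_lines(original_lines)
for line in renumbered:
    print(line)
```

In this example, chain A residues 10, 11, and 25 become 1, 2, and 3, and chain B residue 7 becomes 1. Numbering gaps in the input are closed because new numbers follow order of appearance rather than the original values.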
- multicom_ligand.utils.data_utils.write_pdb_with_prody(macromolecule, pdb_name, add_element_types=False)
Write a protein or nucleic acid structure to a PDB file using ProDy.
- Parameters:
macromolecule – protein or nucleic acid object from ProDy
pdb_name – base name for the PDB file
add_element_types – whether to add element types to the PDB file