Data utilities

multicom_ligand.utils.data_utils.combine_molecules(molecule_list: list[Mol]) → Mol

Combine a list of RDKit molecules into a single molecule.

Parameters: molecule_list – A list of RDKit molecules.
Returns: A single RDKit molecule.

multicom_ligand.utils.data_utils.count_num_residues_in_pdb_file(pdb_filepath: str) → int

Count the number of Cα atoms (i.e., residues) in a PDB file.

Parameters: pdb_filepath – Path to the PDB file.
Returns: Number of Cα atoms (i.e., residues) in the PDB file.
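
As a rough illustration of the counting idea, the sketch below scans ATOM records for atoms named CA using the fixed-width PDB column layout. It relies only on the standard library; `count_ca_atoms` and `make_atom_line` are hypothetical names for this example and are not part of `multicom_ligand`, whose actual implementation may differ.

```python
import tempfile
from pathlib import Path


def make_atom_line(serial, name, res_name, chain, res_seq):
    """Build a fixed-width PDB ATOM record (hypothetical helper, dummy coordinates)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


def count_ca_atoms(pdb_filepath):
    """Count ATOM records whose atom name (columns 13-16) is CA."""
    count = 0
    with open(pdb_filepath) as handle:
        for line in handle:
            # Only ATOM records are inspected, so a calcium ion stored as a
            # HETATM record (also named "CA") is not counted as a residue.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                count += 1
    return count


# Toy input: two residues (ALA 1 and GLY 2) on chain A.
toy_lines = [
    make_atom_line(1, "N", "ALA", "A", 1),
    make_atom_line(2, "CA", "ALA", "A", 1),
    make_atom_line(3, "C", "ALA", "A", 1),
    make_atom_line(4, "N", "GLY", "A", 2),
    make_atom_line(5, "CA", "GLY", "A", 2),
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy.pdb"
    toy_path.write_text("\n".join(toy_lines) + "\n")
    num_residues = count_ca_atoms(toy_path)

print(num_residues)
```

Note that this simple scan would count a residue twice if its Cα atom is listed under several alternate locations, and it only approximates the residue count for chains that lack Cα atoms (e.g., nucleic acids).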

multicom_ligand.utils.data_utils.count_pdb_inter_residue_clashes(pdb_filepath: str, clash_cutoff: float = 0.63) → int

Count the number of inter-residue clashes in a protein PDB file. From: https://www.blopig.com/blog/2023/05/checking-your-pdb-file-for-clashing-atoms/

Parameters:

pdb_filepath – Path to the PDB file.
clash_cutoff – The cutoff for what is considered a clash.

Returns:

The number of inter-residue clashes in the structure.
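
This page does not spell out how `clash_cutoff` is applied. One common convention, assumed in the sketch below and not confirmed here, is to flag two atoms from different residues as clashing when their distance is smaller than `clash_cutoff` times the sum of their van der Waals radii. The radii are standard Bondi values; `count_clashes` and the tuple-based atom representation are hypothetical and chosen only to keep the example free of third-party dependencies.

```python
import math
from itertools import combinations

# Bondi van der Waals radii in angstroms (assumed radius set for this sketch).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_clashes(atoms, clash_cutoff=0.63):
    """Count clashing atom pairs that belong to different residues.

    `atoms` is a list of (residue_key, element, (x, y, z)) tuples.
    """
    clashes = 0
    for (res_a, el_a, xyz_a), (res_b, el_b, xyz_b) in combinations(atoms, 2):
        if res_a == res_b:
            continue  # intra-residue contacts are not inter-residue clashes
        if el_a not in VDW_RADII or el_b not in VDW_RADII:
            continue  # skip elements without a tabulated radius
        threshold = clash_cutoff * (VDW_RADII[el_a] + VDW_RADII[el_b])
        if math.dist(xyz_a, xyz_b) < threshold:
            clashes += 1
    return clashes


toy_atoms = [
    (("A", 1), "C", (0.0, 0.0, 0.0)),
    (("A", 2), "C", (2.0, 0.0, 0.0)),   # 2.0 A from the first atom
    (("A", 3), "O", (10.0, 0.0, 0.0)),  # far from everything
]
default_clashes = count_clashes(toy_atoms)       # threshold for C-C: 0.63 * 3.4
strict_clashes = count_clashes(toy_atoms, 0.5)   # threshold for C-C: 0.5 * 3.4
print(default_clashes, strict_clashes)
```

A practical implementation would likely also need to exclude covalently bonded neighbours, such as the peptide bond between consecutive residues, which this sketch does not attempt.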

multicom_ligand.utils.data_utils.create_sdf_file_from_smiles(smiles: str, output_sdf_file: str) → str

Create an SDF file from a SMILES string.

Parameters:

smiles – SMILES string of the molecule.
output_sdf_file – Path to the output SDF file.

Returns:

Path to the output SDF file.

multicom_ligand.utils.data_utils.extract_protein_and_ligands_with_prody(input_pdb_file: str, protein_output_pdb_file: str, ligands_output_sdf_file: str, sanitize: bool = True, add_element_types: bool = False, ligand_smiles: str | None = None) → Mol | None

Using ProDy, extract protein atoms and ligand molecules from a PDB file and write them to separate files.

Parameters:

input_pdb_file – The input PDB file.
protein_output_pdb_file – The output PDB file for the protein atoms.
ligands_output_sdf_file – The output SDF file for the ligand molecules.
sanitize – Whether to sanitize the ligand molecules.
add_element_types – Whether to add element types to the protein atoms.
ligand_smiles – The SMILES string of the ligand molecule.

Returns:

The combined final ligand molecule(s) as a single RDKit molecule, or None.

multicom_ligand.utils.data_utils.extract_sequences_from_macromolecule_structure_file(pdb_filepath: str | Path, structure: Structure | None = None) → tuple[list[str], list[Literal['DNA', 'RNA', 'Protein']]]

Extract the chain sequences from a macromolecule structure file.

Parameters:

pdb_filepath – Path to the PDB structure file.
structure – Optional BioPython structure object to use instead.

Returns:

A list of macromolecule sequences and their corresponding molecule chain types (e.g., Protein).
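
To make the shape of the return value concrete, here is a minimal standard-library sketch that derives one sequence and one chain type per chain from PDB ATOM records. It does not use BioPython, and the classification rule (a chain is DNA or RNA only if every residue name is a standard nucleotide) is an assumption of this example, as are the names `extract_chain_sequences` and `make_atom_line`.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
DNA_TO_ONE = {"DA": "A", "DC": "C", "DG": "G", "DT": "T"}
RNA_TO_ONE = {"A": "A", "C": "C", "G": "G", "U": "U"}


def make_atom_line(serial, name, res_name, chain, res_seq):
    """Build a fixed-width PDB ATOM record (hypothetical helper, dummy coordinates)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


def extract_chain_sequences(pdb_text):
    """Return (sequences, chain_types), one entry per chain in file order."""
    residues_by_chain = {}
    seen = set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]
        residue_key = (chain_id, line[22:27])  # residue number + insertion code
        if residue_key in seen:
            continue  # already recorded this residue from an earlier atom
        seen.add(residue_key)
        residues_by_chain.setdefault(chain_id, []).append(line[17:20].strip())

    sequences, chain_types = [], []
    for residue_names in residues_by_chain.values():
        if all(name in DNA_TO_ONE for name in residue_names):
            table, chain_type = DNA_TO_ONE, "DNA"
        elif all(name in RNA_TO_ONE for name in residue_names):
            table, chain_type = RNA_TO_ONE, "RNA"
        else:
            table, chain_type = THREE_TO_ONE, "Protein"
        # Unknown residue names are written as "X".
        sequences.append("".join(table.get(name, "X") for name in residue_names))
        chain_types.append(chain_type)
    return sequences, chain_types


toy_pdb = "\n".join([
    make_atom_line(1, "CA", "ALA", "A", 1),
    make_atom_line(2, "CA", "GLY", "A", 2),
    make_atom_line(3, "P", "DA", "B", 1),
    make_atom_line(4, "P", "DT", "B", 2),
])
sequences, chain_types = extract_chain_sequences(toy_pdb)
print(sequences, chain_types)
```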

multicom_ligand.utils.data_utils.get_pdb_components_with_prody(pdb_id) → tuple

Split a protein-ligand PDB file into protein, nucleic acid, and ligand components using ProDy.

Parameters: pdb_id – PDB ID.
Returns: Protein structure, nucleic acid structure, and ligand residues.

multicom_ligand.utils.data_utils.parse_inference_inputs_from_dir(input_data_dir: str | Path, pdb_ids: set[Any] | None = None) → list[tuple[str, str]]

Parse a data directory containing subdirectories of protein-ligand complexes and return corresponding SMILES strings and PDB IDs.

Parameters:

input_data_dir – Path to the input data directory.
pdb_ids – Optional set of IDs by which to filter processing.

Returns:

A list of tuples each containing a SMILES string and a PDB ID.

multicom_ligand.utils.data_utils.process_ligand_with_prody(ligand, res_name, chain, resnum, sanitize: bool = True, sub_smiles: str | None = None) → Mol

Add bond orders to a PDB ligand using ProDy.

1. Select the ligand component with name “res_name”.
2. Get the corresponding SMILES from pypdb.
3. Create a template molecule from the SMILES in step 2.
4. Write the PDB file to a stream.
5. Read the stream into an RDKit molecule.
6. Assign the bond orders from the template from step 3.

Parameters:

ligand – ligand as generated by prody
res_name – residue name of ligand to extract
chain – chain of ligand to extract
resnum – residue number of ligand to extract
sanitize – whether to sanitize the molecule
sub_smiles – optional SMILES string of the ligand molecule

Returns:

molecule with bond orders assigned

multicom_ligand.utils.data_utils.read_molecule(molecule_file: str, sanitize: bool = False, calc_charges: bool = False, remove_hs: bool = False) → Mol | None

Load an RDKit molecule from a given filepath.

multicom_ligand.utils.data_utils.renumber_biopython_structure_residues(structure: Structure, gap_insertion_point: str | None = None) → Structure

Renumber the residues of a BioPython structure, starting from 1 for each chain.

Parameters:

structure – BioPython structure object.
gap_insertion_point – Optional colon-separated string giving the chain and residue index (a chain-residue pair) of the residue at which to insert a single-index gap.

Returns:

BioPython structure object with renumbered residues.
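
The exact semantics of `gap_insertion_point` are not documented on this page. The sketch below shows one possible reading, stated as an assumption: the string has the form `chain:index`, the index refers to the new numbering, and the residue that would have received that index is shifted up by one so that the index itself stays unused. It operates on plain `(chain_id, residue_number)` tuples rather than a BioPython structure, and `renumber_with_gap` is a hypothetical name.

```python
def renumber_with_gap(residues, gap_insertion_point=None):
    """Renumber residues from 1 per chain, optionally skipping one index.

    `residues` is an ordered list of (chain_id, original_residue_number).
    `gap_insertion_point` is assumed to look like "A:2" (chain, new index).
    """
    gap_chain, gap_index = None, None
    if gap_insertion_point is not None:
        gap_chain, gap_index_text = gap_insertion_point.split(":")
        gap_index = int(gap_index_text)

    last_index = {}  # chain_id -> last index handed out on that chain
    renumbered = []
    for chain_id, _original_number in residues:
        next_index = last_index.get(chain_id, 0) + 1
        if chain_id == gap_chain and next_index == gap_index:
            next_index += 1  # leave this index unused to create the gap
        last_index[chain_id] = next_index
        renumbered.append((chain_id, next_index))
    return renumbered


toy_residues = [("A", 10), ("A", 11), ("A", 12), ("B", 5), ("B", 6)]
plain = renumber_with_gap(toy_residues)
gapped = renumber_with_gap(toy_residues, gap_insertion_point="A:2")
print(plain)
print(gapped)
```

Under this reading, chain A is numbered 1, 3, 4 when the gap is requested at `A:2`, while chain B is unaffected.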

multicom_ligand.utils.data_utils.renumber_pdb_df_residues(input_pdb_file: str, output_pdb_file: str)

Renumber residues in a PDB file starting from 1 for each chain.

Parameters:

input_pdb_file – Path to the input PDB file.
output_pdb_file – Path to the output PDB file.
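
For readers who want to see what per-chain renumbering means at the file level, the following sketch rewrites the residue-number field (columns 23-26) of ATOM and HETATM records so that each chain starts at 1. It works on plain text with the standard library; the function name suggests the library itself works through a data frame, which this example does not attempt to reproduce. `renumber_pdb_text` and `make_atom_line` are hypothetical names.

```python
def make_atom_line(serial, name, res_name, chain, res_seq):
    """Build a fixed-width PDB ATOM record (hypothetical helper, dummy coordinates)."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


def renumber_pdb_text(pdb_text):
    """Renumber residues from 1 within each chain; other records pass through."""
    next_number = {}   # chain_id -> last residue number assigned
    last_residue = {}  # chain_id -> original (number + insertion code) last seen
    output_lines = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            chain_id = line[21]
            original_key = line[22:27]
            if last_residue.get(chain_id) != original_key:
                # A new residue starts whenever the original key changes.
                next_number[chain_id] = next_number.get(chain_id, 0) + 1
                last_residue[chain_id] = original_key
            # Write the new number and blank out the insertion code (column 27).
            line = f"{line[:22]}{next_number[chain_id]:>4} {line[27:]}"
        output_lines.append(line)
    return "\n".join(output_lines)


toy_pdb = "\n".join([
    make_atom_line(1, "N", "ALA", "A", 10),
    make_atom_line(2, "CA", "ALA", "A", 10),
    make_atom_line(3, "CA", "GLY", "A", 11),
    make_atom_line(4, "CA", "SER", "B", 7),
])
renumbered_text = renumber_pdb_text(toy_pdb)
new_numbers = [line[22:26].strip() for line in renumbered_text.splitlines()]
print(new_numbers)
```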

multicom_ligand.utils.data_utils.write_pdb_with_prody(macromolecule, pdb_name, add_element_types=False)

Write a protein or nucleic acid structure to a pdb file using ProDy.

Parameters:

macromolecule – protein or nucleic acid object from prody
pdb_name – base name for the pdb file
add_element_types – whether to add element types to the pdb file

multicom_ligand.utils.data_utils.write_sdf(new_mol: Mol, pdb_name: str)

Write an RDKit molecule to an SD file.

Parameters:

new_mol – RDKit molecule
pdb_name – name of the output file