Data

This section describes the configurations for various data-related scripts.

Input data components

These data component configurations are used to modify how the input (apo) protein structures are predicted or aligned.

Protein apo-to-holo alignment

data/components/protein_apo_to_holo_alignment.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/ # where the processed datasets (e.g., PoseBusters Benchmark) are placed
predicted_structures_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_predicted_structures # where the predicted protein structures are placed
output_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # where the holo-aligned predicted apo structures should be stored
num_workers: 1 # number of CPU workers for parallel processing

Protein FASTA preparation

data/components/protein_fasta_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # where the processed datasets (e.g., PoseBusters Benchmark) are placed
out_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_sequences.fasta # the output FASTA file to produce

ESMFold sequence preparation

data/components/esmfold_sequence_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_sequences.fasta # the input FASTA file to modify
output_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_esmfold_sequences.fasta # the input FASTA file to modify

Method data parsers

These data parser configurations are used to modify how the input (output) protein-ligand complex structures of each method are prepared (extracted).

Binding site crop preparation

data/binding_site_crop_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
protein_ligand_distance_threshold: 4.0 # the heavy-atom distance threshold (in Angstrom) to use for finding protein binding site residues in interaction with ligand heavy atoms
num_buffer_residues: 7 # the number of sequence-regional buffer residues to include around the native binding site residues

DiffDock input preparation

data/diffdock_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_esmfold_structures # the input protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

FABind input preparation

data/fabind_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file

DynamicBind input preparation

data/dynamicbind_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_data_dir: null # the input protein structure directory to recursively parse during inference
output_csv_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/dynamicbind_${dataset}_inputs # the output CSV directory to which to write the parsed ligand SMILES strings
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose

NeuralPLexer input preparation

data/neuralplexer_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_receptor_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_esmfold_structures # if not `null`, the input template protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
input_receptor: null # the input protein sequence
input_ligand: null # the input ligand SMILES
input_template: null # the input template protein structure to optionally use
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

RoseTTAFold-All-Atom input preparation

data/rfaa_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

RoseTTAFold-All-Atom output extraction

data/rfaa_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/inference/rfaa_${dataset}_outputs_${repeat_index}
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the (i.e., `.` fragment-separated) complex ligand SMILES string of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run

TULIP output extraction

data/tulip_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `casp15`)
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/inference/tulip_${dataset}_outputs_${repeat_index}
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
method_top_n_to_select: 5 # the number of top models for each target to select for analysis
repeat_index: 1 # the repeat index to use