Data

This section describes the configurations for various data-related scripts.

Input data components

These data component configurations control how the input (apo) protein structures are predicted and aligned.

Protein apo-to-holo alignment

data/components/protein_apo_to_holo_alignment.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/ # where the processed datasets (e.g., PoseBusters Benchmark) are placed
predicted_structures_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_predicted_structures # where the predicted protein structures are placed
output_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # where the holo-aligned predicted apo structures should be stored
processing_esmfold_structures: false # whether to process the ESMFold predicted structures
num_workers: 1 # number of CPU workers for parallel processing
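
The path values above rely on two kinds of interpolation: `${oc.env:PROJECT_ROOT}` reads an environment variable, and `${dataset}` refers to another key in the same file. The sketch below illustrates how such a value expands. It is a stdlib-only stand-in for the interpolation performed by the configuration library, not that library's implementation; the `resolve` helper and the `/opt/posebench` location are hypothetical.

```python
import os
import re


def resolve(value: str, config: dict) -> str:
    """Expand `${oc.env:VAR}` and `${key}` references in a config string."""

    def substitute(match: re.Match) -> str:
        reference = match.group(1)
        if reference.startswith("oc.env:"):
            # Environment lookup, e.g. ${oc.env:PROJECT_ROOT}
            return os.environ[reference[len("oc.env:"):]]
        # Plain reference to another key in the same config, e.g. ${dataset}
        return str(config[reference])

    return re.sub(r"\$\{([^}]+)\}", substitute, value)


os.environ["PROJECT_ROOT"] = "/opt/posebench"  # hypothetical checkout location
config = {"dataset": "astex_diverse"}

output_dir = resolve(
    "${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures",
    config,
)
print(output_dir)
# /opt/posebench/data/astex_diverse_set/astex_diverse_holo_aligned_predicted_structures
```

Because every path is derived from `dataset`, changing that single key redirects the input, prediction, and output directories together.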

FASTA preparation

data/components/fasta_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # where the processed datasets (e.g., PoseBusters Benchmark) are placed
out_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_sequences.fasta # the output FASTA file to produce
include_all_biomolecules: false # whether to include FASTA entries for every type of biomolecule chain (e.g., protein, ligand) in the PDB file, rather than for protein chains only

ESMFold sequence preparation

data/components/esmfold_sequence_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_sequences.fasta # the input FASTA file to modify
output_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_esmfold_sequences.fasta # the output FASTA file to produce

Method data parsers

These data parser configurations control how each method's input protein-ligand complex structures are prepared and how its output structures are extracted.

Binding site crop preparation

data/binding_site_crop_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
protein_ligand_distance_threshold: 10.0 # the heavy-atom distance threshold (in Angstrom) to use for finding protein binding site residues in interaction with ligand heavy atoms
num_buffer_residues: 7 # the number of sequence-regional buffer residues to include around the native binding site residues
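
To make the roles of `protein_ligand_distance_threshold` and `num_buffer_residues` concrete, here is a minimal sketch of one way a binding site crop could be selected. The function name, the input layout (one list of heavy-atom coordinates per residue, in sequence order), the inclusive `<=` comparison, and the symmetric buffer on both sides of each contact residue are all assumptions made for illustration; the actual script may differ.

```python
import math


def select_binding_site(
    residue_atoms,
    ligand_atoms,
    protein_ligand_distance_threshold=10.0,
    num_buffer_residues=7,
):
    """Return sorted residue indices to keep in the crop.

    residue_atoms: list indexed by sequence position; each item is a list of
        (x, y, z) heavy-atom coordinates for that residue.
    ligand_atoms: list of (x, y, z) ligand heavy-atom coordinates.
    """
    # Residues with any heavy atom within the threshold of any ligand heavy atom.
    contact_residues = {
        index
        for index, atoms in enumerate(residue_atoms)
        if any(
            math.dist(atom, ligand_atom) <= protein_ligand_distance_threshold
            for atom in atoms
            for ligand_atom in ligand_atoms
        )
    }
    # Extend each contact residue by a sequence-local buffer (assumed symmetric).
    selected = set()
    for index in contact_residues:
        start = max(0, index - num_buffer_residues)
        stop = min(len(residue_atoms) - 1, index + num_buffer_residues)
        selected.update(range(start, stop + 1))
    return sorted(selected)


# Toy chain: 30 single-atom residues spaced 3.8 Angstrom apart along the x-axis.
residue_atoms = [[(3.8 * i, 0.0, 0.0)] for i in range(30)]
ligand_atoms = [(38.0, 0.0, 0.0)]  # sits on top of residue 10

crop = select_binding_site(residue_atoms, ligand_atoms)
print(crop[0], crop[-1])  # 1 19
```

In this toy case residues 8 through 12 fall within 10 Angstrom of the ligand, and the 7-residue buffer widens the crop to residues 1 through 19.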

DiffDock input preparation

data/diffdock_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
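
As a rough illustration of what ends up at `output_csv_path`, the sketch below writes a single entry built from values like `input_id`, `protein_filepath`, and `ligand_smiles`. The column names are an assumption based on the example inference CSV shipped with upstream DiffDock; the helper name, the example target, and the file path are hypothetical, and the fork used here may expect a different layout.

```python
import csv
import io

# Column names assumed from upstream DiffDock's example inference CSV.
DIFFDOCK_COLUMNS = [
    "complex_name",
    "protein_path",
    "ligand_description",
    "protein_sequence",
]


def write_diffdock_inputs(entries, handle):
    """Write (input_id, protein_filepath, ligand_smiles) tuples as CSV rows."""
    writer = csv.DictWriter(handle, fieldnames=DIFFDOCK_COLUMNS)
    writer.writeheader()
    for input_id, protein_filepath, ligand_smiles in entries:
        writer.writerow(
            {
                "complex_name": input_id,
                "protein_path": protein_filepath,
                "ligand_description": ligand_smiles,
                "protein_sequence": "",  # left empty because a structure file is given
            }
        )


# Hypothetical single-complex input; an in-memory buffer stands in for the CSV file.
buffer = io.StringIO()
write_diffdock_inputs(
    [("example_target", "data/example_target_protein.pdb", "CC(=O)Oc1ccccc1C(=O)O")],
    buffer,
)
print(buffer.getvalue())
```

When `protein_filepath`, `ligand_smiles`, and `input_id` are left as `null`, the entries would presumably be gathered by parsing `input_data_dir` for the selected `dataset` instead.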

FABind input preparation

data/fabind_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
pocket_only_baseline: false # whether to prepare the pocket-only baseline

DynamicBind input preparation

data/dynamicbind_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_data_dir: null # the input protein structure directory to recursively parse during inference
output_csv_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/dynamicbind_${dataset}_inputs # the output CSV directory to which to write the parsed ligand SMILES strings
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
pocket_only_baseline: false # whether to prepare the pocket-only baseline

NeuralPLexer input preparation

data/neuralplexer_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_receptor_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # if not `null`, the input template protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
input_receptor: null # the input protein sequence
input_ligand: null # the input ligand SMILES
input_template: null # the input template protein structure to optionally use
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

FlowDock input preparation

data/flowdock_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_receptor_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # if not `null`, the input template protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
input_receptor: null # the input protein sequence
input_ligand: null # the input ligand SMILES
input_template: null # the input template protein structure to optionally use
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

RoseTTAFold-All-Atom input preparation

data/rfaa_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

RoseTTAFold-All-Atom output extraction

data/rfaa_output_extraction.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/inference/rfaa_${dataset}_outputs_${repeat_index}
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the complex ligand SMILES string (with fragments separated by `.`) of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline
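
Since `repeat_index` is interpolated into both output paths, each inference repeat reads from and writes to its own directories. The short sketch below reproduces that naming for the two RoseTTAFold-All-Atom paths above; the helper name and the `/opt/posebench` project root are hypothetical.

```python
from pathlib import PurePosixPath


def rfaa_output_dirs(project_root, dataset, repeat_index):
    """Mirror the path templates of prediction_outputs_dir and inference_outputs_dir."""
    fork_dir = PurePosixPath(project_root) / "forks" / "RoseTTAFold-All-Atom"
    return {
        "prediction_outputs_dir": str(
            fork_dir / "prediction_outputs" / f"{dataset}_{repeat_index}"
        ),
        "inference_outputs_dir": str(
            fork_dir / "inference" / f"rfaa_{dataset}_outputs_{repeat_index}"
        ),
    }


# Hypothetical project root; three repeats give three separate directory pairs.
for repeat_index in (1, 2, 3):
    dirs = rfaa_output_dirs("/opt/posebench", "posebusters_benchmark", repeat_index)
    print(dirs["prediction_outputs_dir"])
    print(dirs["inference_outputs_dir"])
```

The same `${dataset}_${repeat_index}` pattern appears in the other output extraction configurations, so the value of `repeat_index` used for extraction should match the one used when inference was run.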

Chai-1 input preparation

data/chai_input_preparation.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline

Chai-1 output extraction

data/chai_output_extraction.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/inference/chai-lab_${dataset}_outputs_${repeat_index}
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the complex ligand SMILES string (with fragments separated by `.`) of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline

AlphaFold 3 output extraction

data/af3_output_extraction.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/alphafold3/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/alphafold3/inference/alphafold3_${dataset}_outputs_${repeat_index}
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the complex ligand SMILES string (with fragments separated by `.`) of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline

TULIP output extraction

data/tulip_output_extraction.yaml

dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `casp15`)
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/inference/tulip_${dataset}_outputs_${repeat_index}
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
method_top_n_to_select: 5 # the number of top models for each target to select for analysis
repeat_index: 1 # the repeat index to use