Data
This section describes the configurations for various data-related scripts.
Input data components
These data component configurations control how the input (apo) protein structures are predicted, aligned, or analyzed, and how one data format is converted to another.
Convert mmCIF to PDB
data/components/convert_mmcif_to_pdb.yaml
input_mmcif_dir: ???
output_pdb_dir: ???
dataset: "N/A"
lowercase_id: false
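The `???` values above follow the OmegaConf convention for mandatory fields that have no default and must be supplied by the user. A minimal sketch, assuming the config has already been loaded into a plain dictionary, of checking which mandatory fields are still unset; the `find_missing` helper is hypothetical and not part of the project:

```python
# Hypothetical helper: list config keys still set to the mandatory-value marker.
MISSING = "???"


def find_missing(cfg: dict) -> list[str]:
    """Return the keys whose value is still the `???` placeholder."""
    return [key for key, value in cfg.items() if value == MISSING]


# Mirrors the defaults listed above.
defaults = {
    "input_mmcif_dir": "???",
    "output_pdb_dir": "???",
    "dataset": "N/A",
    "lowercase_id": False,
}

print(find_missing(defaults))  # ['input_mmcif_dir', 'output_pdb_dir']
```

If these configs are consumed through Hydra, as the `${oc.env:...}` syntax elsewhere on this page suggests, such fields are typically supplied as `key=value` overrides on the command line.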
ESMFold sequence preparation
data/components/esmfold_sequence_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_sequences.fasta # the input FASTA file to modify
output_fasta_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_esmfold_sequences.fasta # the output FASTA file to produce
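Most paths on this page combine two interpolation forms: `${oc.env:PROJECT_ROOT}` reads an environment variable, and `${dataset}` refers to another key in the same config. The sketch below is a simplified stand-in for OmegaConf's resolution, written with the standard library only, to show what such a path expands to. The `resolve` helper and the `/opt/PoseBench` value are illustrative assumptions.

```python
import os
import re

# Simplified stand-in for OmegaConf interpolation; handles only the two
# forms used on this page: ${oc.env:VAR} and ${key}.
_PATTERN = re.compile(r"\$\{(oc\.env:)?([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve(template: str, cfg: dict) -> str:
    """Expand ${oc.env:VAR} from the environment and ${key} from cfg."""

    def substitute(match: re.Match) -> str:
        is_env, name = match.group(1), match.group(2)
        return os.environ[name] if is_env else str(cfg[name])

    return _PATTERN.sub(substitute, template)


os.environ["PROJECT_ROOT"] = "/opt/PoseBench"  # illustrative value only
cfg = {"dataset": "astex_diverse"}
template = "${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_sequences.fasta"
resolved = resolve(template, cfg)
print(resolved)
# /opt/PoseBench/data/astex_diverse_set/reference_astex_diverse_sequences.fasta
```

Changing `dataset` therefore redirects every dependent path at once, which is why most configs only need that one key overridden.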
FASTA preparation
data/components/fasta_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # where the processed datasets (e.g., PoseBusters Benchmark) are placed
out_file: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/reference_${dataset}_sequences.fasta # the output FASTA file to produce
include_all_biomolecules: false # instead of just protein chains, whether to include FASTA entries for each type of biomolecule chain (e.g., protein, ligand) in the PDB file
Plot dataset RMSD
data/components/plot_dataset_rmsd.yaml
data_dir: ${oc.env:PROJECT_ROOT}/data
usalign_exec_path: ??? # the path to a local USAlign executable (e.g., ~/Programs/USalign/USalign)
Prepare Boltz-1 MSAs
data/components/prepare_boltz_msas.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_msas # where the original MSA files are placed
output_msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_boltz_msas # where the processed MSA files should be stored
skip_existing: true # whether to skip processing if the output file already exists
pocket_only_baseline: false # whether to prepare the pocket-only baseline
Prepare Chai-1 MSAs
data/components/prepare_chai_msas.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_msas # where the original MSA files are placed
output_msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_chai_msas # where the processed MSA files should be stored
skip_existing: true # whether to skip processing if the output file already exists
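The `skip_existing` flag in the MSA preparation configs lets an interrupted run resume without redoing finished work. A minimal sketch of that pattern follows. It is not the project's implementation: the `plan_outputs` helper, the file names, and the `.out` suffix are all hypothetical.

```python
import tempfile
from pathlib import Path


def plan_outputs(
    input_dir: Path, output_dir: Path, skip_existing: bool = True, out_suffix: str = ".out"
) -> list[tuple[Path, Path]]:
    """Pair each input file with its output path, optionally skipping finished ones."""
    jobs = []
    for src in sorted(p for p in input_dir.iterdir() if p.is_file()):
        # Hypothetical naming scheme: keep the stem, swap in the output suffix.
        dst = output_dir / f"{src.stem}{out_suffix}"
        if skip_existing and dst.exists():
            continue  # already processed in an earlier run
        jobs.append((src, dst))
    return jobs


# Demo in a throwaway directory: one of two inputs already has an output.
with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp, "msas"), Path(tmp, "processed_msas")
    in_dir.mkdir()
    out_dir.mkdir()
    (in_dir / "target_a.msa").touch()
    (in_dir / "target_b.msa").touch()
    (out_dir / "target_a.out").touch()
    pending = [src.name for src, _ in plan_outputs(in_dir, out_dir, skip_existing=True)]
    everything = [src.name for src, _ in plan_outputs(in_dir, out_dir, skip_existing=False)]

print(pending)     # ['target_b.msa']
print(everything)  # ['target_a.msa', 'target_b.msa']
```

Setting `skip_existing: false` is the safer choice after the input MSAs have changed, since stale outputs would otherwise be kept.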
Protein apo-to-holo alignment
data/components/protein_apo_to_holo_alignment.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/ # where the processed datasets (e.g., PoseBusters Benchmark) are placed
predicted_structures_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_predicted_structures # where the predicted protein structures are placed
output_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # where the holo-aligned predicted apo structures should be stored
processing_esmfold_structures: false # whether to process the ESMFold predicted structures
num_workers: 1 # number of CPU workers for parallel processing
Method data parsers
These data parser configurations control how each method's input protein-ligand complex structures are prepared and how its output structures are extracted.
Binding site crop preparation
data/binding_site_crop_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
protein_ligand_distance_threshold: 10.0 # the heavy-atom distance threshold (in Angstrom) to use for finding protein binding site residues in interaction with ligand heavy atoms
num_buffer_residues: 7 # the number of sequence-regional buffer residues to include around the native binding site residues
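To illustrate how `protein_ligand_distance_threshold` and `num_buffer_residues` could interact, the sketch below first selects residues with a heavy atom within the threshold of any ligand heavy atom, then pads each selected residue with sequence neighbours. This is one plausible reading of the two comments above, not the project's code. The `crop_residue_indices` helper, the inclusive cutoff, and the interpretation of the buffer as padding on each side are assumptions.

```python
import math

Coord = tuple[float, float, float]


def crop_residue_indices(
    residue_atoms: dict[int, list[Coord]],
    ligand_atoms: list[Coord],
    distance_threshold: float = 10.0,
    num_buffer_residues: int = 7,
) -> list[int]:
    """Select residues near the ligand, then pad each hit along the sequence."""
    # Step 1: residues with any heavy atom within the threshold of any ligand heavy atom.
    site = {
        idx
        for idx, atoms in residue_atoms.items()
        if any(math.dist(atom, lig) <= distance_threshold for atom in atoms for lig in ligand_atoms)
    }
    # Step 2: add up to `num_buffer_residues` sequence neighbours on each side (assumed).
    padded = {
        neighbour
        for idx in site
        for neighbour in range(idx - num_buffer_residues, idx + num_buffer_residues + 1)
        if neighbour in residue_atoms
    }
    return sorted(padded)


# Toy chain: 30 residues, one atom each, spaced 3.8 Angstrom apart along x.
chain = {i: [(3.8 * i, 0.0, 0.0)] for i in range(30)}
ligand = [(57.0, 0.0, 0.0)]  # sits on top of residue 15
cropped = crop_residue_indices(chain, ligand)
print(cropped[0], cropped[-1])  # 6 24
```

In this toy case residues 13 to 17 fall within 10 Angstrom of the ligand, and the buffer of 7 extends the crop to residues 6 to 24.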
DiffDock input preparation
data/diffdock_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # the input protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/DiffDock/inference/diffdock_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
FABind input preparation
data/fabind_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/FABind/inference/fabind_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
pocket_only_baseline: false # whether to prepare the pocket-only baseline
DynamicBind input preparation
data/dynamicbind_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_protein_data_dir: null # the input protein structure directory to recursively parse during inference
output_csv_dir: ${oc.env:PROJECT_ROOT}/forks/DynamicBind/inference/dynamicbind_${dataset}_inputs # the output CSV directory to which to write the parsed ligand SMILES strings
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
pocket_only_baseline: false # whether to prepare the pocket-only baseline
NeuralPLexer input preparation
data/neuralplexer_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_receptor_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # if not `null`, the input template protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/NeuralPLexer/inference/neuralplexer_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
input_receptor: null # the input protein sequence
input_ligand: null # the input ligand SMILES
input_template: null # the input template protein structure to optionally use
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
FlowDock input preparation
data/flowdock_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
input_receptor_structure_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_holo_aligned_predicted_structures # if not `null`, the input template protein structure directory to parse
output_csv_path: ${oc.env:PROJECT_ROOT}/forks/FlowDock/inference/flowdock_${dataset}_inputs.csv # the output CSV filepath to which to write the parsed input data
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
input_receptor: null # the input protein sequence
input_ligand: null # the input ligand SMILES
input_template: null # the input template protein structure to optionally use
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
RoseTTAFold-All-Atom input preparation
data/rfaa_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
RoseTTAFold-All-Atom output extraction
data/rfaa_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/RoseTTAFold-All-Atom/inference/rfaa_${dataset}_outputs_${repeat_index}
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the (i.e., `.` fragment-separated) complex ligand SMILES string of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline
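The `ligand_smiles` comment above refers to a `.` fragment-separated string. In SMILES, `.` separates disconnected components, so several ligands of one complex can be written as a single string. A minimal sketch of splitting such a string into its fragments; the `split_ligand_smiles` helper is hypothetical and the molecules are only examples:

```python
def split_ligand_smiles(ligand_smiles: str) -> list[str]:
    """Split a `.`-separated SMILES string into per-fragment SMILES."""
    return [fragment for fragment in ligand_smiles.split(".") if fragment]


# Ethanol and water written as a single two-fragment SMILES string.
fragments = split_ligand_smiles("CCO.O")
print(fragments)  # ['CCO', 'O']
```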
Chai-1 input preparation
data/chai_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
Chai-1 output extraction
data/chai_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/chai-lab/inference/chai-lab_${dataset}_outputs_${repeat_index}
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the (i.e., `.` fragment-separated) complex ligand SMILES string of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline
Boltz-1 input preparation
data/boltz_input_preparation.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
msa_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set/${dataset}_boltz_msas # the directory containing the `.csv` MSA files prepared for Boltz via `posebench/data/components/prepare_boltz_msas.py`; if not provided, Boltz will be run in single-sequence mode
output_scripts_path: ${oc.env:PROJECT_ROOT}/forks/boltz/prediction_inputs/${dataset} # the output directory in which to save the input files
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
protein_filepath: null # the path to the protein structure file to use
ligand_smiles: null # the ligand SMILES string for which to predict the binding pose
input_id: null # the input ID to use for inference
pocket_only_baseline: false # whether to prepare the pocket-only baseline
Boltz-1 output extraction
data/boltz_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_inputs_dir: ${oc.env:PROJECT_ROOT}/forks/boltz/prediction_inputs/${dataset}
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/boltz/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/boltz/inference/boltz_${dataset}_outputs_${repeat_index}
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the (i.e., `.` fragment-separated) complex ligand SMILES string of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline
AlphaFold 3 output extraction
data/af3_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `dockgen`, `casp15`)
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/alphafold3/prediction_outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/alphafold3/inference/alphafold3_${dataset}_outputs_${repeat_index}
input_data_dir: ${oc.env:PROJECT_ROOT}/data/${dataset}_set # the input protein-ligand complex directory to recursively parse
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
complex_filepath: null # if not `null`, this should be the path to the complex PDB file for which to extract outputs
complex_id: null # if not `null`, this should be the complex ID of the single complex for which to extract outputs
ligand_smiles: null # if not `null`, this should be the (i.e., `.` fragment-separated) complex ligand SMILES string of the single complex for which to extract outputs
output_dir: null # if not `null`, this should be the path to the output file to which to write the extracted outputs
repeat_index: 1 # the repeat index with which inference was run
pocket_only_baseline: false # whether to prepare the pocket-only baseline
TULIP output extraction
data/tulip_output_extraction.yaml
dataset: posebusters_benchmark # the dataset to use - NOTE: must be one of (`posebusters_benchmark`, `astex_diverse`, `casp15`)
prediction_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/outputs/${dataset}_${repeat_index}
inference_outputs_dir: ${oc.env:PROJECT_ROOT}/forks/TULIP/inference/tulip_${dataset}_outputs_${repeat_index}
posebusters_ccd_ids_filepath: ${oc.env:PROJECT_ROOT}/data/posebusters_pdb_ccd_ids.txt # the path to the PoseBusters PDB CCD IDs file that lists the targets that do not contain any crystal contacts
dockgen_test_ids_filepath: ${oc.env:PROJECT_ROOT}/data/dockgen_set/split_test.txt # the path to the DockGen test set IDs file
method_top_n_to_select: 5 # the number of top models for each target to select for analysis
repeat_index: 1 # the repeat index to use
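The `method_top_n_to_select` setting limits analysis to the best few models per target. How TULIP's models are ranked and named is not stated here, so the sketch below assumes each model comes with an integer rank where 1 is best. The `select_top_n` helper and the file names are hypothetical.

```python
def select_top_n(ranked_models: list[tuple[int, str]], top_n: int = 5) -> list[str]:
    """Keep the `top_n` best-ranked model paths (rank 1 is assumed to be best)."""
    return [path for _, path in sorted(ranked_models)[:top_n]]


# Hypothetical (rank, filename) pairs for one target, in arbitrary order.
models = [
    (3, "rank3.sdf"), (1, "rank1.sdf"), (7, "rank7.sdf"), (2, "rank2.sdf"),
    (5, "rank5.sdf"), (4, "rank4.sdf"), (6, "rank6.sdf"),
]
selected = select_top_n(models, top_n=5)
print(selected)  # ['rank1.sdf', 'rank2.sdf', 'rank3.sdf', 'rank4.sdf', 'rank5.sdf']
```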