How to prepare PoseBench data

Downloading Astex, PoseBusters, DockGen, and CASP15 data

# fetch, extract, and clean-up preprocessed Astex Diverse, PoseBusters Benchmark, DockGen, and CASP15 data (~3 GB) #
wget https://zenodo.org/records/14629652/files/astex_diverse_set.tar.gz
wget https://zenodo.org/records/14629652/files/posebusters_benchmark_set.tar.gz
wget https://zenodo.org/records/14629652/files/dockgen_set.tar.gz
wget https://zenodo.org/records/14629652/files/casp15_set.tar.gz
tar -xzf astex_diverse_set.tar.gz
tar -xzf posebusters_benchmark_set.tar.gz
tar -xzf dockgen_set.tar.gz
tar -xzf casp15_set.tar.gz
rm astex_diverse_set.tar.gz
rm posebusters_benchmark_set.tar.gz
rm dockgen_set.tar.gz
rm casp15_set.tar.gz

Downloading benchmark method predictions

# fetch, extract, and clean-up benchmark method predictions to reproduce paper results (~19 GB) #
# AutoDock Vina predictions and results
wget https://zenodo.org/records/14629652/files/vina_benchmark_method_predictions.tar.gz
tar -xzf vina_benchmark_method_predictions.tar.gz
rm vina_benchmark_method_predictions.tar.gz
# DiffDock predictions and results
wget https://zenodo.org/records/14629652/files/diffdock_benchmark_method_predictions.tar.gz
tar -xzf diffdock_benchmark_method_predictions.tar.gz
rm diffdock_benchmark_method_predictions.tar.gz
# DynamicBind predictions and results
wget https://zenodo.org/records/14629652/files/dynamicbind_benchmark_method_predictions.tar.gz
tar -xzf dynamicbind_benchmark_method_predictions.tar.gz
rm dynamicbind_benchmark_method_predictions.tar.gz
# NeuralPLexer predictions and results
wget https://zenodo.org/records/14629652/files/neuralplexer_benchmark_method_predictions.tar.gz
tar -xzf neuralplexer_benchmark_method_predictions.tar.gz
rm neuralplexer_benchmark_method_predictions.tar.gz
# RoseTTAFold-All-Atom predictions and results
wget https://zenodo.org/records/14629652/files/rfaa_benchmark_method_predictions.tar.gz
tar -xzf rfaa_benchmark_method_predictions.tar.gz
rm rfaa_benchmark_method_predictions.tar.gz
# Chai-1 predictions and results
wget https://zenodo.org/records/14629652/files/chai_benchmark_method_predictions.tar.gz
tar -xzf chai_benchmark_method_predictions.tar.gz
rm chai_benchmark_method_predictions.tar.gz
# AlphaFold 3 predictions and results
wget https://zenodo.org/records/14629652/files/af3_benchmark_method_predictions.tar.gz
tar -xzf af3_benchmark_method_predictions.tar.gz
rm af3_benchmark_method_predictions.tar.gz
# CASP15 predictions and results for all methods
wget https://zenodo.org/records/14629652/files/casp15_benchmark_method_predictions.tar.gz
tar -xzf casp15_benchmark_method_predictions.tar.gz
rm casp15_benchmark_method_predictions.tar.gz

Downloading benchmark method interactions

# fetch, extract, and clean-up benchmark method interactions to reproduce paper results (~12 GB) #
# cached ProLIF interactions for notebook plots
wget https://zenodo.org/records/14629652/files/posebench_notebooks.tar.gz
tar -xzf posebench_notebooks.tar.gz
rm posebench_notebooks.tar.gz

Downloading sequence databases (required only for RoseTTAFold-All-Atom inference)

# acquire multiple sequence alignment databases for RoseTTAFold-All-Atom (~2.5 TB)
cd forks/RoseTTAFold-All-Atom/

# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates [81G] (including *_a3m.ffdata, *_a3m.ffindex)
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz

cd ../../

Downloading PDB metadata

# download and extract the PDB's FASTA sequence files
mkdir -p ./data/pdb_data/
wget -P ./data/pdb_data/ https://files.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.gz
find ./data/pdb_data/ -type f -name "*.gz" -exec gzip -d {} \;

Predicting apo protein structures using ESMFold (optional, preprocessed data available)

First create all the corresponding FASTA files for each protein sequence

python3 posebench/data/components/fasta_preparation.py dataset=posebusters_benchmark
python3 posebench/data/components/fasta_preparation.py dataset=astex_diverse
python3 posebench/data/components/fasta_preparation.py dataset=dockgen
python3 posebench/data/components/fasta_preparation.py dataset=casp15

To generate the apo version of each protein structure, create ESMFold-ready versions of the combined FASTA files prepared above by the script fasta_preparation.py for the PoseBusters Benchmark and Astex Diverse sets, respectively

python3 posebench/data/components/esmfold_sequence_preparation.py dataset=posebusters_benchmark
python3 posebench/data/components/esmfold_sequence_preparation.py dataset=astex_diverse
python3 posebench/data/components/esmfold_sequence_preparation.py dataset=dockgen
python3 posebench/data/components/esmfold_sequence_preparation.py dataset=casp15

Then, predict each apo protein structure using ESMFold’s batch inference script

python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/posebusters_benchmark_set/reference_posebusters_benchmark_esmfold_sequences.fasta -o data/posebusters_benchmark_set/posebusters_benchmark_esmfold_predicted_structures --skip-existing
python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/astex_diverse_set/reference_astex_diverse_esmfold_sequences.fasta -o data/astex_diverse_set/astex_diverse_esmfold_predicted_structures --skip-existing
python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/dockgen_set/reference_dockgen_esmfold_sequences.fasta -o data/dockgen_set/dockgen_esmfold_predicted_structures --skip-existing
python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/casp15_set/reference_casp15_esmfold_sequences.fasta -o data/casp15_set/casp15_esmfold_predicted_structures --skip-existing

NOTE: Having a CUDA-enabled device available when running ESMFold is highly recommended

NOTE: ESMFold may not be able to predict apo protein structures for a handful of exceedingly-long (e.g., >2000 token) input sequences

Lastly, align each apo protein structure to its corresponding holo protein structure counterpart for each dataset, taking ligand conformations into account during each alignment

conda activate PyMOL-PoseBench
python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=posebusters_benchmark processing_esmfold_structures=true num_workers=1
python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=astex_diverse processing_esmfold_structures=true num_workers=1
python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=dockgen processing_esmfold_structures=true num_workers=1
python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=casp15 processing_esmfold_structures=true num_workers=1
conda deactivate

Note

The preprocessed Astex Diverse, PoseBusters Benchmark, DockGen, and CASP15 data (available via Zenodo at https://doi.org/10.5281/zenodo.14629652) provide pre-holo-aligned protein structures predicted by AlphaFold 3 (and alternatively MIT-licensed ESMFold) for these respective datasets. Accordingly, users must ensure their usage of such predicted protein structures from AlphaFold 3 aligns with AlphaFold 3’s Terms of Use at https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md.