How to prepare PoseBench data

Downloading Astex, PoseBusters, DockGen, and CASP15 data

# fetch, extract, and clean up the preprocessed Astex Diverse, PoseBusters Benchmark, DockGen, and CASP15 data (~3 GB)
wget https://zenodo.org/records/11477766/files/astex_diverse_set.tar.gz
wget https://zenodo.org/records/11477766/files/posebusters_benchmark_set.tar.gz
wget https://zenodo.org/records/11477766/files/dockgen_set.tar.gz
wget https://zenodo.org/records/11477766/files/casp15_set.tar.gz
tar -xzf astex_diverse_set.tar.gz
tar -xzf posebusters_benchmark_set.tar.gz
tar -xzf dockgen_set.tar.gz
tar -xzf casp15_set.tar.gz
rm astex_diverse_set.tar.gz
rm posebusters_benchmark_set.tar.gz
rm dockgen_set.tar.gz
rm casp15_set.tar.gz
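
The commands above repeat the same three steps for each archive, so they can be condensed into a loop. The following is a minimal sketch, assuming `wget` and `tar` are installed; the helper names `archive_url` and `fetch_archive` are illustrative and not part of PoseBench. It uses `wget -c` so that an interrupted download can be resumed.

```shell
# Base URL of the Zenodo record used in the commands above.
ZENODO_FILES_URL="https://zenodo.org/records/11477766/files"

# Print the download URL for an archive name (hypothetical helper).
archive_url() {
    printf '%s/%s.tar.gz\n' "$ZENODO_FILES_URL" "$1"
}

# Download, extract, and delete one archive. Each step runs only if the
# previous one succeeded, so a failed download or extraction leaves the
# archive file in place.
fetch_archive() {
    wget -c "$(archive_url "$1")" &&
        tar -xzf "$1.tar.gz" &&
        rm "$1.tar.gz"
}

# Usage (downloads ~3 GB in total):
# for name in astex_diverse_set posebusters_benchmark_set dockgen_set casp15_set; do
#     fetch_archive "$name" || break
# done
```

The same loop should work for the prediction archives by substituting their names, since they are hosted under the same Zenodo record.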

Downloading benchmark method predictions

# fetch, extract, and clean up the benchmark method predictions to reproduce paper results (~19 GB)
# DiffDock predictions and results
wget https://zenodo.org/records/11477766/files/diffdock_benchmark_method_predictions.tar.gz
tar -xzf diffdock_benchmark_method_predictions.tar.gz
rm diffdock_benchmark_method_predictions.tar.gz
# FABind predictions and results
wget https://zenodo.org/records/11477766/files/fabind_benchmark_method_predictions.tar.gz
tar -xzf fabind_benchmark_method_predictions.tar.gz
rm fabind_benchmark_method_predictions.tar.gz
# DynamicBind predictions and results
wget https://zenodo.org/records/11477766/files/dynamicbind_benchmark_method_predictions.tar.gz
tar -xzf dynamicbind_benchmark_method_predictions.tar.gz
rm dynamicbind_benchmark_method_predictions.tar.gz
# NeuralPLexer predictions and results
wget https://zenodo.org/records/11477766/files/neuralplexer_benchmark_method_predictions.tar.gz
tar -xzf neuralplexer_benchmark_method_predictions.tar.gz
rm neuralplexer_benchmark_method_predictions.tar.gz
# RoseTTAFold-All-Atom predictions and results
wget https://zenodo.org/records/11477766/files/rfaa_benchmark_method_predictions.tar.gz
tar -xzf rfaa_benchmark_method_predictions.tar.gz
rm rfaa_benchmark_method_predictions.tar.gz
# TULIP predictions and results
wget https://zenodo.org/records/11477766/files/tulip_benchmark_method_predictions.tar.gz
tar -xzf tulip_benchmark_method_predictions.tar.gz
rm tulip_benchmark_method_predictions.tar.gz
# AutoDock Vina predictions and results
wget https://zenodo.org/records/11477766/files/vina_benchmark_method_predictions.tar.gz
tar -xzf vina_benchmark_method_predictions.tar.gz
rm vina_benchmark_method_predictions.tar.gz
# Astex Diverse, PoseBusters Benchmark (w/ pocket-only results), DockGen, and CASP15 consensus ensemble predictions and results
wget https://zenodo.org/records/11477766/files/astex_diverse_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/dockgen_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/casp15_ensemble_benchmark_method_predictions.tar.gz
tar -xzf astex_diverse_ensemble_benchmark_method_predictions.tar.gz
tar -xzf posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
tar -xzf dockgen_ensemble_benchmark_method_predictions.tar.gz
tar -xzf casp15_ensemble_benchmark_method_predictions.tar.gz
rm astex_diverse_ensemble_benchmark_method_predictions.tar.gz
rm posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
rm dockgen_ensemble_benchmark_method_predictions.tar.gz
rm casp15_ensemble_benchmark_method_predictions.tar.gz
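
Because these archives total roughly 19 GB, a download may be interrupted before it completes. The sketch below shows one possible safeguard: it lists the archive contents with `tar -tzf` as an integrity check and only then extracts and deletes the archive. The function name `extract_if_intact` is an assumption made for illustration.

```shell
# Extract an archive and delete it, but only if tar can read it end to end.
# A truncated or corrupt archive is kept so the download can be retried.
extract_if_intact() {
    archive="$1"
    if tar -tzf "$archive" > /dev/null 2>&1; then
        tar -xzf "$archive" && rm "$archive"
    else
        echo "corrupt or incomplete archive, keeping it: $archive" >&2
        return 1
    fi
}

# Usage:
# extract_if_intact diffdock_benchmark_method_predictions.tar.gz
```

Listing the archive first reads it twice, which costs some time on large files but avoids leaving a partially extracted directory behind.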

NOTE: The Zenodo archive posebusters_benchmark_set.tar.gz referenced above also includes pocket-only versions of the dataset’s holo-aligned predicted protein structures. To reproduce the pocket-only experiments on the PoseBusters Benchmark set, add the argument pocket_only_baseline=true to each command below that runs PoseBusters Benchmark inference with a baseline method. Before doing so, rename any existing directories that contain PoseBusters Benchmark inference results for that method; otherwise, the new pocket-only results will be merged into them. See the config files within configs/data/, configs/model/, and configs/analysis/ for more details.
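
The renaming step in the note above could be scripted along these lines. This is a sketch only: the function name `set_aside_results`, the `full_protein` suffix, and the example path are placeholders, since the actual results directories depend on the config files mentioned in the note.

```shell
# Rename an existing results directory so that new pocket-only output
# cannot be merged into it. The directory path ($1) and the optional
# suffix ($2) are placeholders chosen for this example.
set_aside_results() {
    results_dir="${1%/}"                          # strip one trailing slash, if any
    backup_dir="${results_dir}_${2:-full_protein}"
    [ -d "$results_dir" ] || return 0             # nothing to protect
    if [ -e "$backup_dir" ]; then
        echo "refusing to overwrite existing $backup_dir" >&2
        return 1
    fi
    mv "$results_dir" "$backup_dir"
}

# Usage with a made-up path; substitute the directory your configs point to:
# set_aside_results path/to/posebusters_benchmark_inference_results
```

Refusing to overwrite an existing backup keeps an earlier set of results from being lost if the helper is run twice.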

Downloading sequence databases (required only for RoseTTAFold-All-Atom inference)

# acquire multiple sequence alignment databases for RoseTTAFold-All-Atom (~2.5 TB)
cd forks/RoseTTAFold-All-Atom/

# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex)
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz

cd ../../
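
Given the ~2.5 TB estimate above, it may be worth checking free disk space before starting these downloads. The sketch below relies on the POSIX output format of `df -P`, in which the fourth column of the second line is the available space in 1024-byte blocks; the helper names `free_gib` and `has_free_gib` are illustrative.

```shell
# Print the free space, in whole GiB, on the filesystem holding directory $1.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1048576 }'
}

# Succeed only if the filesystem holding directory $1 has at least $2 GiB free.
has_free_gib() {
    [ "$(free_gib "$1")" -ge "$2" ]
}

# Usage: 2500 GiB is a deliberately conservative stand-in for ~2.5 TB.
# has_free_gib forks/RoseTTAFold-All-Atom 2500 || echo "not enough disk space" >&2
```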

Predicting apo protein structures using ESMFold (optional, preprocessed data available)

First, create the corresponding FASTA files for the protein sequences in each dataset:

python3 posebench/data/components/protein_fasta_preparation.py dataset=posebusters_benchmark
python3 posebench/data/components/protein_fasta_preparation.py dataset=astex_diverse

Next, to generate the apo version of each protein structure, create ESMFold-ready versions of the combined FASTA files that the script protein_fasta_preparation.py prepared above for the PoseBusters Benchmark and Astex Diverse sets:

python3 posebench/data/components/esmfold_sequence_preparation.py dataset=posebusters_benchmark
python3 posebench/data/components/esmfold_sequence_preparation.py dataset=astex_diverse

Then, predict each apo protein structure using ESMFold’s batch inference script:

python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/posebusters_benchmark_set/posebusters_benchmark_esmfold_sequences.fasta -o data/posebusters_benchmark_set/posebusters_benchmark_esmfold_structures --skip-existing
python3 posebench/data/components/esmfold_batch_structure_prediction.py -i data/astex_diverse_set/astex_diverse_esmfold_sequences.fasta -o data/astex_diverse_set/astex_diverse_esmfold_structures --skip-existing

NOTE: Having a CUDA-enabled device available when running ESMFold is highly recommended.

NOTE: ESMFold may fail to predict apo protein structures for a handful of exceedingly long input sequences (e.g., more than 2,000 tokens).
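
To see in advance which inputs might hit the length limit mentioned in the note above, one option is to scan the ESMFold-ready FASTA file for long records. The sketch below assumes one token per sequence character and counts every non-whitespace character in a record, including any separator characters the file may contain, so the reported lengths are an approximation. The function name `list_long_sequences` is hypothetical.

```shell
# Print "<length><TAB><header>" for every FASTA record in file $1 that is
# longer than $2 characters (default 2000). Sequence lines belonging to one
# record are concatenated before counting.
list_long_sequences() {
    awk -v max="${2:-2000}" '
        function report() {
            if (header != "" && len > max) printf "%d\t%s\n", len, header
        }
        /^>/ { report(); header = substr($0, 2); len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { report() }
    ' "$1"
}

# Usage:
# list_long_sequences data/posebusters_benchmark_set/posebusters_benchmark_esmfold_sequences.fasta
```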

Lastly, align each apo protein structure to its holo counterpart in the PoseBusters Benchmark or Astex Diverse set, taking ligand conformations into account during each alignment:

python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=posebusters_benchmark num_workers=1
python3 posebench/data/components/protein_apo_to_holo_alignment.py dataset=astex_diverse num_workers=1
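
The four steps above can be chained per dataset, as in the following sketch. The script paths and arguments are copied from the commands above; the wrapper names `run_step` and `prepare_apo_structures` and the `DRY_RUN` variable are assumptions made for illustration. It should be run from the repository root.

```shell
# Execute a command, or only print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run FASTA preparation, ESMFold sequence preparation, structure prediction,
# and apo-to-holo alignment for dataset $1, stopping at the first failure.
prepare_apo_structures() {
    dataset="$1"
    data_dir="data/${dataset}_set"
    components="posebench/data/components"

    run_step python3 "$components/protein_fasta_preparation.py" dataset="$dataset" &&
    run_step python3 "$components/esmfold_sequence_preparation.py" dataset="$dataset" &&
    run_step python3 "$components/esmfold_batch_structure_prediction.py" \
        -i "$data_dir/${dataset}_esmfold_sequences.fasta" \
        -o "$data_dir/${dataset}_esmfold_structures" --skip-existing &&
    run_step python3 "$components/protein_apo_to_holo_alignment.py" \
        dataset="$dataset" num_workers=1
}

# Usage:
# for dataset in posebusters_benchmark astex_diverse; do
#     prepare_apo_structures "$dataset" || break
# done
```

Setting `DRY_RUN=1` before the loop prints the commands without running them, which is a cheap way to confirm the paths before committing GPU time.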

NOTE: The preprocessed Astex Diverse, PoseBusters Benchmark, DockGen, and CASP15 data available via Zenodo already provide holo-aligned predicted protein structures for each of these datasets.