bioRxiv
New Results

AllTheBacteria – all bacterial genomes assembled, available, and searchable

Martin Hunt, Leandro Lima, Daniel Anderson, George Bouras, Michael Hall, Jane Hawkey, Oliver Schwengers, Wei Shen, John A. Lees, Zamin Iqbal
doi: https://doi.org/10.1101/2024.03.08.584059
Martin Hunt
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
2 Nuffield Department of Medicine, University of Oxford, Oxford, UK
3 National Institute of Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Headley Way, Oxford, UK
4 Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, Oxford, UK
Leandro Lima
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
Daniel Anderson
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
George Bouras
5 Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, Australia
6 The Department of Surgery – Otolaryngology Head and Neck Surgery, Central Adelaide Local Health Network, Adelaide, Australia
Michael Hall
7 Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Australia
8 Center for Clinical Research, University of Queensland Centre for Clinical Research, Faculty of Medicine, The University of Queensland, Brisbane, QLD, Australia
Jane Hawkey
9 Department of Infectious Diseases, School of Translational Medicine, Monash University, Melbourne, Victoria 3004, Australia
Oliver Schwengers
10 Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen 35392, Germany
Wei Shen
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
11 Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, China
John A. Lees
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
  • For correspondence: zi245{at}bath.ac.uk jlees{at}ebi.ac.uk
Zamin Iqbal
1 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
12 Milner Centre for Evolution, University of Bath, UK
  • For correspondence: zi245{at}bath.ac.uk jlees{at}ebi.ac.uk

Abstract

The bacterial sequence data publicly available via the global DNA archives is a vast potential source of information on the evolution of bacteria. However, most of this sequence data is unassembled or, where assembled, was produced without a consistent assembler or quality control. Although this data has great potential, these inconsistencies make it unsuitable for large-scale analyses and hard for most researchers to reuse. Therefore, in our previous effort, we released a uniformly assembled set of 661,405 genomes, consisting of all publicly available whole-genome-sequenced bacterial isolate data up to a cutoff of November 2018, enriched with various search indexes to make the data easier to sort and use. In this study, we first extend the dataset up to August 2024 with the same consistent assembly pipeline, more than tripling the number of genomes available. We also expand the scope of the dataset beyond genomes, as we begin a global collaborative project to generate annotations, species-specific analyses, evolutionary data, new search indexes, and protein structural data. Our collaboration is therefore grass-roots, driven by the needs of different research communities within microbiology.

In this paper, we describe the project as of release 2024-08, comprising 2,440,377 assemblies. All 2.4 million genomes have been uniformly reprocessed to assess quality criteria and to give taxonomic abundance estimates with respect to the GTDB phylogeny. We further enrich the dataset with sequence annotations from Bakta, antimicrobial resistance predictions from AMRFinderPlus, and AlphaFold2 protein structure predictions for the 17.7M hypothetical proteins. With an evolution-informed compression approach, the full set of genomes occupies just 130Gb, a reduction of ~23x compared to compressing individual assemblies. To make the resource as accessible as possible, we also provide multiple search indexes, a method for alignment to the full dataset, and cloud-based access to all the genomes.
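
As a rough illustration of the scale figures quoted above, the following Python sketch works through the arithmetic. It uses only the numbers stated in the abstract; the function and variable names are hypothetical, and the baseline size it derives (about 130 × 23 ≈ 3,000 Gb for individually compressed assemblies) is an inference from the approximate 23x ratio, not a figure reported by the authors.

```python
# Figures quoted in the abstract (names are illustrative, not from the paper).
PREVIOUS_RELEASE_GENOMES = 661_405   # earlier release, cutoff November 2018
RELEASE_2024_08_GENOMES = 2_440_377  # release 2024-08
COMPRESSED_SIZE_GB = 130             # size with evolution-informed compression, as quoted
COMPRESSION_GAIN = 23                # approximate reduction vs. compressing assemblies individually


def fold_increase(new_count: int, old_count: int) -> float:
    """Return how many times larger the new release is than the old one."""
    return new_count / old_count


def implied_baseline_size(compressed_size: float, gain: float) -> float:
    """Estimate the size without the gain, assuming size scales linearly with it."""
    return compressed_size * gain


growth = fold_increase(RELEASE_2024_08_GENOMES, PREVIOUS_RELEASE_GENOMES)
baseline = implied_baseline_size(COMPRESSED_SIZE_GB, COMPRESSION_GAIN)

print(f"Growth since the previous release: {growth:.2f}-fold")
print(f"Implied size when compressing assemblies individually: ~{baseline} Gb")
```

The growth factor of roughly 3.7 is consistent with the statement that the number of genomes has more than tripled.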

The AllTheBacteria data (https://allthebacteria.org/) has already been used independently in multiple other analyses. Our goal is to make this a self-sustaining, community-driven resource that increases the accessibility and reuse of bacterial genomes for a wide range of purposes.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

  • Fixed incorrect statistic in Figure 1 (the number of high-quality assemblies was 2.3 million, not 1.9 million)

  • https://osf.io/xv7q9/

  • https://allthebacteria.readthedocs.io/en/latest/overview.html#current-status

  • https://allthebacteria.org/

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Posted August 28, 2025.
bioRxiv 2024.03.08.584059; doi: https://doi.org/10.1101/2024.03.08.584059

Subject Area

  • Bioinformatics