
Web Crawl Refusals: Insights From Common Crawl

  • Conference paper
  • Published in: Passive and Active Measurement (PAM 2025)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15567)


Abstract

Web crawlers are an indispensable tool for collecting research data. However, they may be blocked by servers for various reasons. This can reduce their coverage. In this early-stage work, we investigate server-side blocks encountered by Common Crawl (CC). We analyze page contents to cover a broader range of refusals than previous work. We construct fine-grained regular expressions to identify refusal pages with precision, finding that at least 1.68% of sites in a CC snapshot exhibit a form of explicit refusal. Significant contributors include large hosters. Our analysis categorizes the forms of refusal messages, from straight blocks to challenges and rate-limiting responses. We are able to extract the reasons for nearly half of the refusals we identify. We find an inconsistent and even incorrect use of HTTP status codes to indicate refusals. Examining the temporal dynamics of refusals, we find that most blocks resolve within one hour, but also that 80% of refusing domains block every request by CC. Our results show that website blocks deserve more attention as they have a relevant impact on crawling projects. We also conclude that standardization to signal refusals would be beneficial for both site operators and web crawlers.
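The paper's actual regular expressions are published by the authors (see the footnote linking to their repository). As a rough illustration of the approach, not the paper's expressions, a matcher over page contents might look like this minimal Python sketch (all patterns below are simplified, hypothetical examples):

```python
import re

# Hypothetical, simplified refusal patterns -- the authors' real,
# fine-grained expressions are at https://github.com/mstfnsr/web_refusal_regex.
REFUSAL_PATTERNS = [
    re.compile(r"access (?:to this (?:page|site) )?(?:has been )?denied", re.I),
    re.compile(r"you have been (?:blocked|banned)", re.I),          # straight block
    re.compile(r"too many requests", re.I),                          # rate limiting
    re.compile(r"complete the (?:captcha|security check)", re.I),    # challenge
]

def looks_like_refusal(page_text: str) -> bool:
    """Return True if the page body matches any refusal pattern."""
    return any(p.search(page_text) for p in REFUSAL_PATTERNS)
```

Classifying by which pattern fires is one way such pages can be tagged into categories (blocks, challenges, rate limiting) as the abstract describes.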



Notes

  1. https://github.com/mstfnsr/web_refusal_regex.

References

  1. Pepyaka webserver. https://webtechsurvey.com/technology/pepyaka. Accessed May 2024

  2. Ablove, A., et al.: Digital discrimination of users in sanctioned states: the case of the Cuba embargo. In: 33rd USENIX Security Symposium (USENIX Security 2024), Philadelphia, PA, pp. 3909–3926. USENIX Association (2024). https://www.usenix.org/conference/usenixsecurity24/presentation/ablove

  3. Afroz, S., Tschantz, M.C., Sajid, S., Qazi, S.A., Javed, M., Paxson, V.: Exploring server-side blocking of regions. arXiv abs/1805.11606 (2018). https://api.semanticscholar.org/CorpusID:44131334

  4. Ahmad, S.S., Dar, M.D., Zaffar, M.F., Vallina-Rodriguez, N., Nithyanand, R.: Apophanies or epiphanies? How crawlers impact our understanding of the web. In: Proceedings of The Web Conference 2020 (WWW 2020), pp. 271–280 (2020)


  5. Asghari, H.: pyasn. https://github.com/hadiasghari/pyasn

  6. Center for Applied Internet Data Analysis (CAIDA): AS Organizations Dataset (2024). https://catalog.caida.org/dataset/as_organizations. Accessed May 2024

  7. Common Crawl: November/December 2023 crawl archive now available. https://www.commoncrawl.org/blog/november-december-2023-crawl-archive-now-available. Accessed May 2024

  8. Darer, A., Farnan, O., Wright, J.: Automated discovery of internet censorship by web crawling. In: Proceedings of the 10th ACM Conference on Web Science (WebSci 2018), pp. 195–204 (2018)


  9. Fielding, R.T., Nottingham, M.: Additional HTTP Status Codes. RFC 6585 (2012). https://doi.org/10.17487/RFC6585. https://www.rfc-editor.org/info/rfc6585

  10. Fielding, R.T., Nottingham, M., Reschke, J.: HTTP Semantics. RFC 9110 (2022). https://doi.org/10.17487/RFC9110. https://www.rfc-editor.org/info/rfc9110

  11. Holz, R., Braun, L., Kammenhuber, N., Carle, G.: The SSL landscape - a thorough analysis of the X.509 PKI using active and passive measurements. In: Proceedings of the ACM/USENIX 11th Annual Internet Measurement Conference (IMC), Berlin, Germany (2011)


  12. http.dev: HTTP status codes. https://http.dev/status

  13. Reuters Institute for the Study of Journalism: How many news websites block AI crawlers (2023). https://reutersinstitute.politics.ox.ac.uk/how-many-news-websites-block-ai-crawlers#:~:text=Examining. Accessed May 2024

  14. Invernizzi, L., Thomas, K., Kapravelos, A., Comanescu, O., Picod, J.M., Bursztein, E.: Cloak of visibility: detecting when machines browse a different web. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 743–758 (2016)


  15. Koster, M., Illyes, G., Zeller, H., Sassman, L.: Robots Exclusion Protocol. RFC 9309 (2022). https://doi.org/10.17487/RFC9309. https://www.rfc-editor.org/info/rfc9309

  16. Leonard, D., Loguinov, D.: Demystifying service discovery: implementing an internet-wide scanner. In: Proceedings of the ACM SIGCOMM Conference on Internet Measurement (IMC 2010), pp. 109–122 (2010)


  17. McDonald, A., et al.: 403 forbidden: a global view of CDN geoblocking. In: Proceedings of the Internet Measurement Conference (IMC 2018), pp. 218–230 (2018)


  18. Nagel, S.: Common crawl: data collection and use cases for NLP (2023). http://nlpl.eu/skeikampen23/nagel.230206.pdf. Accessed May 2024

  19. Niaki, A.A., et al.: ICLab: a global, longitudinal internet censorship measurement platform. In: 2020 IEEE Symposium on Security and Privacy (SP), pp. 135–151 (2020)


  20. Tschantz, M.C., Afroz, S., Sajid, S., Qazi, S.A., Javed, M., Paxson, V.: A bestiary of blocking: the motivations and modes behind website unavailability. In: 8th USENIX Workshop on Free and Open Communications on the Internet (FOCI 2018) (2018)


  21. Vastel, A., Rudametkin, W., Rouvoy, R., Blanc, X.: FP-crawlers: studying the resilience of browser fingerprinting to block crawlers. In: NDSS Workshop on Measurements, Attacks, and Defenses for the Web, MADWeb 2020 (2020)


  22. Wan, G., et al.: On the origin of scanning: the impact of location on Internet-wide scans. In: Proceedings of the ACM Internet Measurement Conference (IMC 2020), pp. 662–679 (2020)


  23. Zeber, D., et al.: The representativeness of automated web crawls as a surrogate for human browsing. In: Proceedings of the Web Conference 2020 (WWW 2020), pp. 167–178 (2020)



Acknowledgments

This work was partially supported by the research project ‘CATRIN’ (NWA. 1215.18.003) as part of the Dutch Research Council’s (NWO) National Research Agenda (NWA).

Author information


Corresponding author

Correspondence to Mostafa Ansar.


Ethics declarations

Ethics

This work raises no ethical concerns. The CC dataset was ethically created [18]. We ran fewer than \(70 \times 10^{3}\) DNS queries (about \(12 \times 10^{3}\) FQDN samples) to identify their NS and PTR records, distributing them over time and sequentially to avoid network or nameserver load.
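The paced, sequential lookups described above could be sketched as follows. The paper does not specify its tooling; the function names, the use of the Python standard library, and the one-second delay are illustrative assumptions:

```python
import ipaddress
import socket
import time

def reverse_name(ip: str) -> str:
    """Build the in-addr.arpa / ip6.arpa name queried for a PTR record."""
    return ipaddress.ip_address(ip).reverse_pointer

def paced_ptr_lookups(ips, delay_s=1.0):
    """Resolve PTR records strictly one at a time, sleeping between
    queries so no nameserver sees a burst of traffic (hypothetical pacing)."""
    results = {}
    for ip in ips:
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
            results[ip] = hostname
        except OSError:
            results[ip] = None  # no PTR record or lookup failure
        time.sleep(delay_s)
    return results
```

Running queries sequentially with a fixed delay, as sketched here, keeps the aggregate rate well below what would stress any single resolver or authoritative server.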

A Additional Figures and Tables


Table 9. Frequency of status codes in the pruned set. Asterisks mark unofficial status codes
Table 10. Frequency of tags in refusals
Table 11. Frequency of status codes in refusals. The finer granularity (< 50 refusals) for the tail was obtained by manual investigation of refusals captured by the more generic regular expressions. Asterisks refer to unofficial status codes.
Fig. 7. Refusal rate per FQDN by reason.

Fig. 8. Refusal rate per FQDN by tag.

Table 12. A subset of regular expressions with labels and full page textual contents


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Ansar, M., Sperotto, A., Holz, R. (2025). Web Crawl Refusals: Insights From Common Crawl. In: Testart, C., van Rijswijk-Deij, R., Stiller, B. (eds) Passive and Active Measurement. PAM 2025. Lecture Notes in Computer Science, vol 15567. Springer, Cham. https://doi.org/10.1007/978-3-031-85960-1_9


