
Web Crawl Refusals: Insights From Common Crawl

  • Conference paper
  • Published in: Passive and Active Measurement (PAM 2025)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15567)


Abstract

Web crawlers are an indispensable tool for collecting research data. However, they may be blocked by servers for various reasons. This can reduce their coverage. In this early-stage work, we investigate server-side blocks encountered by Common Crawl (CC). We analyze page contents to cover a broader range of refusals than previous work. We construct fine-grained regular expressions to identify refusal pages with precision, finding that at least 1.68% of sites in a CC snapshot exhibit a form of explicit refusal. Significant contributors include large hosters. Our analysis categorizes the forms of refusal messages, from straight blocks to challenges and rate-limiting responses. We are able to extract the reasons for nearly half of the refusals we identify. We find an inconsistent and even incorrect use of HTTP status codes to indicate refusals. Examining the temporal dynamics of refusals, we find that most blocks resolve within one hour, but also that 80% of refusing domains block every request by CC. Our results show that website blocks deserve more attention as they have a relevant impact on crawling projects. We also conclude that standardization to signal refusals would be beneficial for both site operators and web crawlers.
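The paper's actual regular expressions are published by the authors (see the footnote linking to their repository). As a rough illustration of the approach, not the paper's expressions, a matcher over page contents might look like this minimal Python sketch (all patterns below are simplified, hypothetical examples):

```python
import re

# Hypothetical, simplified refusal patterns -- the authors' real,
# fine-grained expressions are at https://github.com/mstfnsr/web_refusal_regex.
REFUSAL_PATTERNS = [
    re.compile(r"access (?:to this (?:page|site) )?(?:has been )?denied", re.I),
    re.compile(r"you have been (?:blocked|banned)", re.I),          # straight block
    re.compile(r"too many requests", re.I),                          # rate limiting
    re.compile(r"complete the (?:captcha|security check)", re.I),    # challenge
]

def looks_like_refusal(page_text: str) -> bool:
    """Return True if the page body matches any refusal pattern."""
    return any(p.search(page_text) for p in REFUSAL_PATTERNS)
```

Classifying by which pattern fires is one way such pages can be tagged into categories (blocks, challenges, rate limiting) as the abstract describes.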



Notes

  1. https://github.com/mstfnsr/web_refusal_regex.

References

  1. Pepyaka webserver. https://webtechsurvey.com/technology/pepyaka. Accessed May 2024

  2. Ablove, A., et al.: Digital discrimination of users in sanctioned states: the case of the Cuba embargo. In: 33rd USENIX Security Symposium (USENIX Security 2024), Philadelphia, PA, pp. 3909–3926. USENIX Association (2024). https://www.usenix.org/conference/usenixsecurity24/presentation/ablove

  3. Afroz, S., Tschantz, M.C., Sajid, S., Qazi, S.A., Javed, M., Paxson, V.: Exploring server-side blocking of regions. arXiv abs/1805.11606 (2018). https://api.semanticscholar.org/CorpusID:44131334

  4. Ahmad, S.S., Dar, M.D., Zaffar, M.F., Vallina-Rodriguez, N., Nithyanand, R.: Apophanies or epiphanies? How crawlers impact our understanding of the web. In: Proceedings of The Web Conference 2020 (WWW 2020), pp. 271–280 (2020)


  5. Asghari, H.: pyasn. https://github.com/hadiasghari/pyasn

  6. Center for Applied Internet Data Analysis (CAIDA): AS Organizations Dataset (2024). https://catalog.caida.org/dataset/as_organizations. Accessed May 2024

  7. Common Crawl: November/December 2023 crawl archive now available. https://www.commoncrawl.org/blog/november-december-2023-crawl-archive-now-available. Accessed May 2024

  8. Darer, A., Farnan, O., Wright, J.: Automated discovery of internet censorship by web crawling. In: Proceedings of the 10th ACM Conference on Web Science (WebSci 2018), pp. 195–204 (2018)


  9. Fielding, R.T., Nottingham, M.: Additional HTTP Status Codes. RFC 6585 (2012). https://doi.org/10.17487/RFC6585. https://www.rfc-editor.org/info/rfc6585

  10. Fielding, R.T., Nottingham, M., Reschke, J.: HTTP Semantics. RFC 9110 (2022). https://doi.org/10.17487/RFC9110. https://www.rfc-editor.org/info/rfc9110

  11. Holz, R., Braun, L., Kammenhuber, N., Carle, G.: The SSL landscape - a thorough analysis of the X.509 PKI using active and passive measurements. In: Proceedings of the ACM/USENIX 11th Annual Internet Measurement Conference (IMC), Berlin, Germany (2011)


  12. http.dev: HTTP status codes. https://http.dev/status

  13. Reuters Institute for the Study of Journalism: How many news websites block AI crawlers (2023). https://reutersinstitute.politics.ox.ac.uk/how-many-news-websites-block-ai-crawlers#:~:text=Examining. Accessed May 2024

  14. Invernizzi, L., Thomas, K., Kapravelos, A., Comanescu, O., Picod, J.M., Bursztein, E.: Cloak of visibility: detecting when machines browse a different web. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 743–758 (2016)


  15. Koster, M., Illyes, G., Zeller, H., Sassman, L.: Robots Exclusion Protocol. RFC 9309 (2022). https://doi.org/10.17487/RFC9309. https://www.rfc-editor.org/info/rfc9309

  16. Leonard, D., Loguinov, D.: Demystifying service discovery: implementing an internet-wide scanner. In: Proceedings of the ACM SIGCOMM Conference on Internet Measurement (IMC 2010), pp. 109–122 (2010)


  17. McDonald, A., et al.: 403 forbidden: a global view of CDN geoblocking. In: Proceedings of the Internet Measurement Conference (IMC 2018), pp. 218–230 (2018)


  18. Nagel, S.: Common crawl: data collection and use cases for NLP (2023). http://nlpl.eu/skeikampen23/nagel.230206.pdf. Accessed May 2024

  19. Niaki, A.A., et al.: ICLab: a global, longitudinal internet censorship measurement platform. In: 2020 IEEE Symposium on Security and Privacy (SP), pp. 135–151 (2020)


  20. Tschantz, M.C., Afroz, S., Sajid, S., Qazi, S.A., Javed, M., Paxson, V.: A bestiary of blocking: the motivations and modes behind website unavailability. In: 8th USENIX Workshop on Free and Open Communications on the Internet (FOCI 2018) (2018)


  21. Vastel, A., Rudametkin, W., Rouvoy, R., Blanc, X.: FP-crawlers: studying the resilience of browser fingerprinting to block crawlers. In: NDSS Workshop on Measurements, Attacks, and Defenses for the Web, MADWeb 2020 (2020)


  22. Wan, G., et al.: On the origin of scanning: the impact of location on Internet-wide scans. In: Proceedings of the ACM Internet Measurement Conference (IMC 2020), pp. 662–679 (2020)


  23. Zeber, D., et al.: The representativeness of automated web crawls as a surrogate for human browsing. In: Proceedings of the Web Conference 2020 (WWW 2020), pp. 167–178 (2020)



Acknowledgments

This work was partially supported by the research project ‘CATRIN’ (NWA. 1215.18.003) as part of the Dutch Research Council’s (NWO) National Research Agenda (NWA).

Author information


Corresponding author

Correspondence to Mostafa Ansar.


Ethics declarations

Ethics

This work raises no ethical concerns. The CC dataset was ethically created [18]. We ran fewer than \(70 \times 10^{3}\) DNS queries (about \(12 \times 10^{3}\) FQDN samples) to identify their NS and PTR records, distributing them over time and sequentially to avoid network or nameserver load.
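The paced, sequential lookups described above could be sketched as follows. The paper does not specify its tooling; the function names, the use of the Python standard library, and the one-second delay are illustrative assumptions:

```python
import ipaddress
import socket
import time

def reverse_name(ip: str) -> str:
    """Build the in-addr.arpa / ip6.arpa name queried for a PTR record."""
    return ipaddress.ip_address(ip).reverse_pointer

def paced_ptr_lookups(ips, delay_s=1.0):
    """Resolve PTR records strictly one at a time, sleeping between
    queries so no nameserver sees a burst of traffic (hypothetical pacing)."""
    results = {}
    for ip in ips:
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
            results[ip] = hostname
        except OSError:
            results[ip] = None  # no PTR record or lookup failure
        time.sleep(delay_s)
    return results
```

Running queries sequentially with a fixed delay, as sketched here, keeps the aggregate rate well below what would stress any single resolver or authoritative server.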

A Additional Figures and Tables


Table 9. Frequency of status codes in the pruned set. Asterisks mark unofficial status codes
Table 10. Frequency of tags in refusals
Table 11. Frequency of status codes in refusals. The finer granularity (< 50 refusals) for the tail was obtained by manual investigation of refusals captured by the more generic regular expressions. Asterisks refer to unofficial status codes.
Fig. 7. Refusal rate per FQDN by reason.

Fig. 8. Refusal rate per FQDN by tag.

Table 12. A subset of regular expressions with labels and full page textual contents


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Ansar, M., Sperotto, A., Holz, R. (2025). Web Crawl Refusals: Insights From Common Crawl. In: Testart, C., van Rijswijk-Deij, R., Stiller, B. (eds) Passive and Active Measurement. PAM 2025. Lecture Notes in Computer Science, vol 15567. Springer, Cham. https://doi.org/10.1007/978-3-031-85960-1_9


