Web Crawling Project
A crawler is a program that
retrieves and stores pages from the Web, commonly for a Web search
engine. A crawler often has to download hundreds of millions of pages
in a short period of time and has to constantly monitor and refresh
the downloaded pages. In addition, the crawler should avoid putting
too much pressure on the visited Web sites and the crawler's local
network, because they are intrinsically shared resources.
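
To make this concrete, here is a minimal sketch of such a crawler in Python. It is a hypothetical illustration, not code from this project: the breadth-first frontier, the 5-second courtesy delay, and the crude link extractor are all simplifying assumptions.

    import re
    import time
    import urllib.parse
    import urllib.request
    from collections import deque

    COURTESY_DELAY = 5.0   # assumed per-host delay; real crawlers tune this
    last_fetch = {}        # host -> time of the most recent request to it

    def polite_fetch(url):
        """Fetch one page, pausing so the same host is not hit too often."""
        host = urllib.parse.urlsplit(url).hostname
        wait = COURTESY_DELAY - (time.time() - last_fetch.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)   # be gentle: the site is a shared resource
        last_fetch[host] = time.time()
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    def extract_links(html_bytes, base):
        """Crude href extraction; a real crawler would use an HTML parser."""
        text = html_bytes.decode("utf-8", "replace")
        return [urllib.parse.urljoin(base, h)
                for h in re.findall(r'href="([^"]+)"', text)]

    def crawl(seeds, max_pages=100):
        """Breadth-first crawl of the Web graph starting from the seeds."""
        frontier, seen, pages = deque(seeds), set(seeds), {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                pages[url] = polite_fetch(url)
            except (OSError, ValueError):
                continue       # skip unreachable or malformed URLs
            for link in extract_links(pages[url], url):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages
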
In this project we studied how to build an effective Web crawler
that retrieves "high-quality" pages quickly while keeping the
downloaded pages "fresh." Toward that goal, we identified
popular and reasonable definitions of the "importance" of pages
and proposed simple algorithms that can identify important pages
at an early stage of a crawl [11].
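
For illustration, one of the simple ordering ideas explored in [11] is to use a page's backlink (in-link) count as its importance estimate and to visit the currently best-ranked frontier URL next. The sketch below is a hypothetical rendering of that idea: fetch and links_of stand in for the crawler's download and link-extraction steps, and a real crawler would replace the linear max() scan with a priority queue.

    from collections import defaultdict

    def ordered_crawl(seeds, fetch, links_of, max_pages=100):
        """Visit the frontier URL with the most in-links seen so far.

        The in-link counts are only estimates based on the portion of
        the Web crawled so far; [11] studies how well such estimates
        steer the crawl toward important pages.
        """
        inlinks = defaultdict(int)     # url -> in-links observed so far
        frontier = set(seeds)
        visited, pages = set(), {}
        while frontier and len(pages) < max_pages:
            # Pick the currently most "important" URL on the frontier.
            url = max(frontier, key=lambda u: inlinks[u])
            frontier.discard(url)
            visited.add(url)
            pages[url] = fetch(url)
            for link in links_of(pages[url], url):
                inlinks[link] += 1     # refine the importance estimate
                if link not in visited:
                    frontier.add(link)
        return pages
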
We also explored how to parallelize the crawling process so as to
maximize the aggregate download rate while minimizing the overhead
of parallelization [7].
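
One coordination scheme in this setting, among the architectures evaluated in [7], is to partition the URL space among the crawler processes, for example by hashing each URL's host, so that every site is owned by exactly one process and cross-partition links are forwarded to their owner. A minimal sketch, with the process count and hash choice as assumptions:

    import hashlib
    import urllib.parse

    NUM_CRAWLERS = 4   # assumed number of parallel crawler processes

    def owner(url):
        """Assign every host to exactly one crawler process.

        Hash-partitioning by host keeps all pages of a site in one
        process, so per-host politeness needs no cross-process
        coordination; a link that falls in another partition is
        forwarded to its owner instead of being fetched locally.
        """
        host = urllib.parse.urlsplit(url).hostname or ""
        digest = hashlib.md5(host.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

    # Example: route discovered links to the responsible process.
    outboxes = [[] for _ in range(NUM_CRAWLERS)]
    for link in ["http://example.com/a", "http://example.org/b"]:
        outboxes[owner(link)].append(link)
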
In addition, we experimentally and theoretically studied how Web
pages change over time and proposed an optimal page refresh policy
that maximizes the "freshness" of the downloaded pages
[1, 2, 3, 5, 6, 9].
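
For intuition, the freshness framework used in this line of work can be summarized for a single page. Under the common simplifying assumption that a page changes according to a Poisson process with rate lambda, the sketch below derives the time-averaged freshness of a page re-downloaded at a fixed interval I; this is a simplified special case, not the full refresh policy of the cited papers.

    % Freshness of a single page under a Poisson change model (sketch).
    % The page changes at rate \lambda and is re-downloaded every I time units.
    F(p;t) =
      \begin{cases}
        1 & \text{if the local copy of $p$ is up to date at time $t$,}\\
        0 & \text{otherwise,}
      \end{cases}
    \qquad
    \bar{F}(p) = \lim_{t\to\infty} \frac{1}{t}\int_0^t F(p;s)\,ds .

    % A copy fetched at the start of a refresh interval is still fresh at
    % offset s with probability e^{-\lambda s}, so averaging over one
    % interval of length I gives
    \bar{F}(p)
      = \frac{1}{I}\int_0^I e^{-\lambda s}\,ds
      = \frac{1 - e^{-\lambda I}}{\lambda I}.
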
Finally, we investigated automatic ways to download content from the "Hidden Web"
[2], automatic ways to detect mirrors
(or replicated collections of pages) on the Web
[8],
and potential changes to the existing HTTP protocol to make
the crawling process much more efficient
[12].
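
As one illustration of the kind of page comparison mirror detection requires, documents can be reduced to sets of word "shingles" and compared by Jaccard similarity; pairs of near-identical pages then reveal candidate replicated collections. This is a generic building block sketched under assumed parameters, not the specific algorithm of [8]:

    def shingles(text, w=8):
        """Set of all w-word windows ("shingles") in a document."""
        words = text.split()
        return {" ".join(words[i:i + w])
                for i in range(max(len(words) - w + 1, 1))}

    def resemblance(a, b, w=8):
        """Jaccard similarity of the two documents' shingle sets."""
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    # Two pages count as near-replicas when their resemblance exceeds an
    # (assumed) threshold such as 0.9; applying this page by page across
    # two sites flags candidate mirrored collections.
    print(resemblance("a b c d e f g h i", "a b c d e f g h j", w=4))
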
Related Publications

Junghoo Cho and Hector Garcia-Molina, "Parallel Crawlers." In Proceedings of the 11th World Wide Web Conference (WWW11), Honolulu, Hawaii, May 2002.

Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina, "Finding Replicated Web Collections." In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD), May 2000.

Junghoo Cho and Sougata Mukherjea, "Crawling Images on the Web." In Proceedings of the Third International Conference on Visual Information Systems (Visual99), Amsterdam, The Netherlands, June 1999.

Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, "Efficient Crawling Through URL Ordering." In Proceedings of the 7th World Wide Web Conference (WWW7), Brisbane, Australia, April 1998.

Onn Brandman, Junghoo Cho, Hector Garcia-Molina, and Narayanan Shivakumar, "Crawler-Friendly Web Servers." In Proceedings of the Workshop on Performance and Architecture of Web Servers (PAWS), held in conjunction with ACM SIGMETRICS 2000, Santa Clara, California, June 2000.