Ash Nallawalla's blog

Google has a sitemap.xml file!


We should not be surprised by this factoid, but check out hxxp://www.google.com/sitemap.xml (replace xx with tt). It is 4 MB in size. If you thought that it would be a sitemap index file consisting of thousands of sitemaps, you’d be mistaken.

The file is 142,111 lines long. Assuming four lines per entry plus three wrapper lines (the arithmetic fits exactly), that means there are 35,527 URL entries in it. What are the interesting pages?
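The entry count can be sanity-checked two ways: by arithmetic on the line count, or by parsing the XML and counting `<url>` elements. The following is a minimal sketch, not the method used for the post. The four-lines-per-entry layout, the three wrapper lines, the sample data and both function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry sample, laid out the way the line count suggests:
# four lines per <url> entry plus three wrapper lines (assumed layout).
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<priority>1.0</priority>
</url>
<url>
<loc>http://www.example.com/about.html</loc>
<priority>0.5</priority>
</url>
</urlset>
"""


def entries_from_line_count(total_lines, lines_per_entry=4, wrapper_lines=3):
    """Estimate the number of entries from a line count under the assumed layout."""
    body_lines = total_lines - wrapper_lines
    if body_lines < 0 or body_lines % lines_per_entry:
        raise ValueError("line count does not fit the assumed layout")
    return body_lines // lines_per_entry


def count_url_entries(xml_text):
    """Count <url> elements, ignoring whichever XML namespace the file uses."""
    root = ET.fromstring(xml_text)
    return sum(1 for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "url")


if __name__ == "__main__":
    print(entries_from_line_count(142111))
    print(count_url_entries(SAMPLE_SITEMAP))
```

Parsing is the more reliable check, since it does not depend on how the file happens to be line-wrapped.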

  • http://www.google.com/a/help/intl/en/admins/overview.html looks interesting, but try loading it in your browser and you are taken to http://www.google.com/a/help/intl/en/index.html
  • http://www.google.com/a/cpanel/domain doesn’t load either; you end up at http://www.google.com/a/cpanel/domain/new instead. Weird.
  • http://www.google.com/a/interest leads to http://www.google.com/a/cpanel/interest, which happens to be a 404. Will Google get penalised? Will it lose PR? [I am just parodying forum newbies, relax.]
  • There are plenty of pages relating to ads (AdWords and AdSense), which is to be expected, along with the usual corporate pages, April Fool gags, zeitgeist, etc.
  • Numerous foreign-language versions of its content for its overseas markets.
  • Numerous university searches, such as http://www.google.com/univ/calpoly – where is Gopher these days?
  • Only the home page has a priority of 1.0; the rest are all 0.5.
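The priority observation in the last bullet is easy to check by tallying the `<priority>` values. This is a rough sketch under stated assumptions: it assumes the file uses the sitemaps.org 0.9 namespace (the real file may declare a different one), and the sample data and the `tally_priorities` name are made up for illustration. Entries without a `<priority>` element are counted as 0.5, which is the default the sitemap protocol defines.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Assumed namespace; adjust if the file declares a different schema.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
DEFAULT_PRIORITY = "0.5"  # protocol default when <priority> is omitted

# Hypothetical sample: a home page at 1.0, one explicit 0.5, one omitted.
PRIORITY_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://www.example.com/ads/</loc><priority>0.5</priority></url>
  <url><loc>http://www.example.com/univ/</loc></url>
</urlset>
"""


def tally_priorities(xml_text):
    """Return a Counter mapping each priority string to the number of entries."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for url in root.iter(NS + "url"):
        value = (url.findtext(NS + "priority") or "").strip()
        counts[value or DEFAULT_PRIORITY] += 1
    return counts


if __name__ == "__main__":
    for priority, count in sorted(tally_priorities(PRIORITY_SAMPLE).items()):
        print(priority, count)
```

On a file like the one described, the tally should show a single entry at 1.0 and everything else at 0.5.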

Google also has a robots.txt file, but it doesn’t reference this sitemap.
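Checking whether a robots.txt references a sitemap comes down to looking for `Sitemap:` lines. Here is a small sketch that does this on plain text; the sample robots.txt contents and the `sitemap_directives` name are hypothetical, and fetching the file is left out so the example stays self-contained.

```python
def sitemap_directives(robots_txt):
    """Return the URLs named in Sitemap: lines of a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")     # split on the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt bodies, one without and one with a sitemap reference.
WITHOUT_SITEMAP = """User-agent: *
Disallow: /search
"""

WITH_SITEMAP = """User-agent: *
Disallow: /search
Sitemap: http://www.example.com/sitemap.xml
"""

if __name__ == "__main__":
    print(sitemap_directives(WITHOUT_SITEMAP))
    print(sitemap_directives(WITH_SITEMAP))
```

An empty list corresponds to the situation described above: a robots.txt that exists but says nothing about the sitemap.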

Yes, a pretty small site, if you took out the non-English content. It all fits in a single sitemap.xml file.

Ash Nallawalla

Search strategist experienced in large, complex websites.
