SEO

The SEO mystery of the missing sitemap XML requests finally solved

We did it. We solved one of the big unsolved SEO (Search Engine Optimization) mysteries of modern times. It took quite some time and dragged us down into the deep pits of the TCP/IP and HTTP 1.1 specifications, but finally we emerged victorious.

What’s the mystery:

Sometimes in Google Webmaster Tools -> Sitemaps you see error messages like: “We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.”

[Screenshot: Google Webmaster Tools -> Sitemaps error message]

Then you start investigating: yes, the file is there; yes, it is accessible to Googlebot; yes, it has content; yes, it follows the format specified on sitemaps.org.

Then you dig deeper: the error logs, the access logs, all the logs, and then you realize: that f*cking GET request does not exist in the timespan Google reported (to be sure, you look at a longer timespan, and still there is nothing there).

No GET request!

You look at the network, you look at the DNS: could it be that the requests went astray? You dig and dig, and even bother to file a ticket with your server housing company. Still, nothing comes up, no leads.

Then you dig deeper: TCP/IP and HTTP 1.1

  • You realize that Googlebot makes multiple GET requests (we counted up to 11) in one single TCP/IP connection (which is OK according to the HTTP 1.1 spec).
  • You realize (with the help of Stack Overflow) that these multiple GET requests in the same TCP/IP connection are processed in sequence (one after the other).
  • You realize that if one of these GET requests has a major time lag (is much slower than the other GET requests), Google cuts the TCP/IP connection.
  • Because all the GET requests in the connection are processed in sequence, all the GET requests after the cut are lost. You don’t see them in the error/access logs as they were never processed, even though they were sent (see the sketch just after this list).
  • You see an error in Google Webmaster Tools, without a trace in your logfiles.
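
To make that concrete, here is a minimal sketch in Python of the failure mode described above. Everything specific in it is an assumption for illustration: the host, the paths, the timeout that stands in for Googlebot’s patience, and the idea that the requests are pipelined (sent before the earlier responses are read), which is what the missing log entries suggest.

    # Minimal sketch of the failure mode: several GET requests share one single
    # TCP/IP connection, responses come back in sequence, and the client gives
    # up as soon as one response is too slow. HOST, PATHS and PATIENCE are
    # hypothetical; a plain socket stands in for Googlebot.
    import socket

    HOST = "www.example.com"                      # hypothetical host
    PATHS = ["/", "/robots.txt", "/sitemap.xml"]  # hypothetical resources
    PATIENCE = 2.0                                # assumed per-read timeout in seconds

    # One single TCP/IP connection for all requests (HTTP 1.1 keep-alive).
    sock = socket.create_connection((HOST, 80), timeout=PATIENCE)

    # Send every GET request up front on that one connection.
    for path in PATHS:
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {HOST}\r\n"
            "Connection: keep-alive\r\n"
            "\r\n"
        )
        sock.sendall(request.encode("ascii"))

    # Responses arrive strictly in sequence. If one stalls past the timeout,
    # the connection gets cut and everything queued behind it is lost: the
    # server may never process those requests, so they never reach the logs.
    answered = []
    try:
        for path in PATHS:
            data = sock.recv(65536)   # naive read; a real client parses the HTTP framing
            if not data:
                break                 # the server closed the connection early
            answered.append(path)
    except socket.timeout:
        print("one response was too slow, cutting the connection")
    finally:
        sock.close()

    print("responses actually read:", answered)

Whatever is queued behind the slow response never makes it into answered, and on the server side it may never be processed at all, which is exactly why the GET requests are missing from the access logs.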

SEO Mystery solved.

If you don’t understand a single word I just wrote, please remember: we are geeks.



15 Comments

  • submitshop January 7, 2011 at 3:21 am

    Nice discovery, friends. I have to test that immediately. How did you find out about that – is there a log or tool that enables you to see which HTTP requests share the same TCP connection?

  • Uğur Eskici December 3, 2010 at 8:24 am

    Nice discovery, Franz. Thanks for sharing.

  • Ty October 14, 2010 at 4:46 pm

    You just saved SEO managers a lot of time!

  • franz enzenhofer October 12, 2010 at 7:11 am

    first: thx
    second: dedicated servers in a high quality server housing environment; the “slower than the rest” requests were something out of the ordinary, an issue within the “garbage collection” logic.

  • Alistair Lattimore October 11, 2010 at 11:16 pm

    Franz,

    Great investigative work to nut that one out.

    In your particular instance, was the ‘slow’ request anything out of the ordinary, or just your server deciding to serve that particular resource more slowly than others? On the server front, is your site on good quality hosting or on cheaper style hosting?

    Al.

  • Colin October 11, 2010 at 1:27 pm

    That is some next level analysis!

  • Der Zuckerbäcker October 11, 2010 at 10:04 am

    very interesting. I have to test that immediately.

  • Philip October 11, 2010 at 6:08 am

    This is great news! I also have a problem like this. I will try what you did and maybe get mine to start working. Thanks.
    SEOP.com

  • Philip Tellis October 10, 2010 at 4:47 am

    I’d say serve Googlebot HTTP/1.0 so that it won’t use keep-alive, or just configure Apache to turn off keep-alive for Googlebot.
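
    In Apache terms, a minimal sketch of that second suggestion would be a mod_setenvif rule in the server config; the “Googlebot” User-Agent pattern here is an assumption, and the two 1.0 variables are only needed if you also want to force HTTP/1.0 behaviour rather than just drop keep-alive:

    # Hypothetical User-Agent match: disable keep-alive for Googlebot,
    # and optionally treat the whole exchange as HTTP/1.0.
    BrowserMatch "Googlebot" nokeepalive downgrade-1.0 force-response-1.0

    Bear in mind this only hides the symptom; as franz notes elsewhere in the thread, the real cure was fixing the requests that were much slower than the rest.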

  • Steve De Jonghe October 9, 2010 at 6:18 pm

    no offense or anything, but I wouldn’t call it “solved”, feels more like you identified it, which is already a big step, I suppose…

  • Matt Cutts October 9, 2010 at 3:25 pm

    Interesting–good find.

  • Moravec October 9, 2010 at 3:13 pm

    Yeah nice discovery. Now what?

  • Anonymous October 8, 2010 at 12:13 pm

    @til: As Franz said, we captured a pcap file with tcpdump:

    tcpdump -w foo.pcap net 66.249.0.0/16 # 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

    Then download the aptly named foo.pcap and analyze it with Wireshark.
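
    If you’d rather not click around in Wireshark, a rough command-line equivalent (assuming a reasonably recent tshark, where -Y is the display filter; older builds use -R) lists every HTTP request next to the TCP stream it travelled on, so requests that share a connection share a stream number:

    tshark -r foo.pcap -Y http.request -T fields -e tcp.stream -e ip.src -e http.request.uri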

  • franz enzenhofer October 8, 2010 at 11:54 am

    we used an installation of http://www.wireshark.org/ on the server.

    >make sure that all requests are getting served fast enough?
    that’s the funny thing: all requests must get served – pretty much – equally fast. googlebot’s decision whether a connection gets cut seems to be measured relatively. if you have a “fast, fast, fast, very slow, fast, fast” package, googlebot will likely cut the connection at “very slow”. if you have a “slow, slow, slow, slow, slow, very slow, slow, slow” package, googlebot does not seem to cut the connection (but it will spend far fewer resources on the site in general).

    so yes, the best way is to serve all requests faster, but you could also deliver all requests equally slowly (not recommended).

  • til October 8, 2010 at 11:19 am

    Wow, interesting. How did you find out about that – is there a log or tool that enables you to see which http requests share the same tcp connection?

    And what can one do against it – make sure that all requests are getting served fast enough?
