It’s just data

Confirmed: Google Hates XHTML?

Tom Pike: Question: Does Google support pages sent as application/xhtml+xml?

No.

Call this anecdotal if you like, but a few days ago I mentioned two products which have been widely reviewed, and yet my entry appears fairly high in the search results.  Pleasingly, traffic is starting to flow to that post, particularly when these product names are combined with the word “Ubuntu”.  I hope that these people found something that they consider useful.

The next day, I nixed content negotiation for my home page.

The day after that, I referenced a new tool created by Morten Frederiksen.  Echoes of my post were picked up by Google, but the original was not.  Joe’s post (served as faux XHTML to Google), however, does rank highly — what’s up with that?

It is my belief that the W3C could learn a lot from the way the Python 3000 effort is being planned, and in particular the plans to back-port changes to Python 2.6 (and 2.7!) that will ease the transition.

But mostly what will drive the transition to Python 3K is that people will start writing code that only works with Python 3.0.  I don’t begin to presume that I have enough clout to move the mighty Google to action, but perhaps if the folks at the W3C who authored or supported the XHTML standard had created enough meaningful content and served it with the proper media type, Google (and Microsoft!) might have put XHTML support a bit higher on the priority list.

Update: my post is now at the top of the search results.  Manual intervention?  Google dance?  I’ll probably never know...


Yahoo Search treats you better. So does Microsoft. Google does index your front page. It even, helpfully, offers to render it to HTML.

Sad ...

Posted by Jacques Distler at

It’s not surprising that they ignore XHTML.  Google is an advertising company!  There’s currently zero benefit* to companies switching to XHTML.  Those that have switched are practically guaranteed to be serving as text/html anyway, so why should Google grok application/xhtml+xml?  (Of course the answer is “it’s danged easy to parse, so why not?!?”)

For most intents and purposes, XHTML has failed, and it’s a damned shame it has.  Until Microsoft produces a user agent that can accept application/xhtml+xml (and preferably lists application/xhtml+xml in its HTTP_ACCEPT header), the incentive for supporting it remains low.

* please read as “practically zero perceived benefit”

Posted by Josh Peters at

Until Microsoft produces a user agent that can accept application/xhtml+xml (and preferably lists application/xhtml+xml in its HTTP_ACCEPT header), the incentive for supporting it remains low.

I don’t believe that it is fair to pin this on Microsoft.  I believe that if those who created XHTML had had the courage of their convictions, both Google and Microsoft would have had no choice.

I also believe that there should have been a maintenance release or two of HTML4.  In HTML5, the root element MAY have an xmlns attribute, but only if it matches the one defined by XHTML; and void elements may have a terminating slash character in their start tags.
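
For example, under those rules a single document can be parsed as either HTML or XML without edits; a minimal sketch:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>One document, two parsers</title></head>
<body>
<p>A void element with a terminating slash: <br/></p>
</body>
</html>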

It is these small touches that make transition easier.

Posted by Sam Ruby at

(Of course the answer is “it’s danged easy to parse, so why not?!?”)

Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification on parsing text/html.

Posted by Anne van Kesteren at

Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification on parsing text/html.

Anne, why would they parse it as XML? Just tokenize it like I imagine they must currently do for HTML. Google doesn’t care about the structure of the document, just the words contained within.

Posted by Bill Mill at

Things like the internal subset, namespaces, et cetera make XML not that much easier to parse than HTML, I think. I suppose the tree construction phase is slightly less involved, but I doubt there’s much difference overall now that we have a specification on parsing text/html.

Google is perfectly happy indexing Sam’s atom feed (with its <content type="xhtml">). If they can handle his XHTML content, when sent as application/atom+xml, there’s no reason they couldn’t handle the same content sent as application/xhtml+xml.

Posted by Jacques Distler at
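
(An easy way to see the distinction Jacques is drawing; the feed path here is a guess, and the expected result assumes the feed really is served as Atom:)

$ curl -s -I http://intertwingly.net/blog/index.atom | grep Content-Type
Content-Type: application/atom+xml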

Bill, if you did it that way you wouldn’t be able to properly handle tags such as foo:html, where foo is bound to http://www.w3.org/1999/xhtml.

Jacques, I wasn’t really arguing about Google’s capabilities. I hoped that much was clear :-)

Posted by Anne van Kesteren at

Bill, if you did it that way you wouldn’t be able to properly handle tags such as foo:html, where foo is bound to http://www.w3.org/1999/xhtml.

Pardon me if I’m being dense, but I’d still assume that they don’t parse at that level. I’m under the impression that they’d just parse the page for words, and put the page into their huge word<->document matrix. They don’t need to know what <foo:html> means, just store the important tokens and throw away the ones in your stoplist.

Posted by Bill Mill at
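
(A toy version of the model Bill describes, with tags thrown away, a small hard-coded stoplist, and term counts standing in for one row of the word<->document matrix; a sketch only, since tags can’t be stripped reliably with sed:)

$ curl -s http://intertwingly.net/blog/ |
    sed -e 's/<[^>]*>/ /g' |             # throw away anything tag-shaped
    tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | # lowercase, split into words
    grep -vwE 'the|a|an|and|of|to|is' |  # drop stoplist words
    sort | uniq -c | sort -rn | head     # count the remaining tokens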

I assert that Google does care about the structure of the document, at least to the point that <a href="http://example.com/"> is relevant to their PageRank algorithm.

Beyond that, the assertion that a draft WHATWG document, which concerns itself with minutiae such as the content type of titles, somehow retroactively makes Google’s job easier is somewhat absurd.  What it hopefully will do, years down the road, is make more people apply HTML consistently; and that will make Google’s job easier.

Posted by Sam Ruby at

Joe’s post, however, does rank highly — what’s up with that?

Could it be that Google is DOCTYPE sniffing, and prefers Joe’s “XHTML 1.0 Transitional” DOCTYPE over your “html”?

Pardon me if I’m being dense, but I’d still assume that they don’t parse at that level. I’m under the impression that they’d just parse the page for words, and put the page into their huge word<->document matrix.

I’ve always heard that Google puts more weight on words in <h1> and <title> tags, and that, all other things being equal, it prefers semantic documents to non-semantic ones, so I think they are doing more than just pulling out the words and tossing the tags.

Posted by Kevin H at

“I don’t begin to presume that I have enough clout to move the mighty Google to action,”

“Google Hates”

Not a great start. Btw, I didn’t know that Google’s bots accepted application/html. Did you check that before serving?

Posted by Bill de hOra at

“application/html.”

Mistake - that should be application/xhtml+xml.

Posted by Bill de hOra at

I’m not sure Google actually looks at links. There is at least one search bot out there that follows everything that looks like a URL: the logs of xopus.com are full of XPaths.

Posted by Sjoerd Visscher at

Looks like manual intervention.

It hasn’t happened (yet) for this post, which ought to come out on top but doesn’t.

Posted by Jacques Distler at

Hi, I’m an engineer at Google. I meant to stop by several days ago (sorry that it took me a while to get here). I chatted with a crawl person, and he said that Google should handle xhtml+xml pretty well.

It can take a few days for Google to crawl/index/rank individual pages well. I really don’t think that there was any manual intervention in this case at all, but feel free to drop me an email if you’d like to discuss it more. Jacques Distler said a few days ago that this post didn’t rank for “google hates xhtml”, but now it does, for example (and Google didn’t do anything special for this post). I think sometimes search engines just need a short time to find/crawl/index/rank a page well.

Posted by Matt Cutts at

Hi Matt, thanks for stopping by.

Specific question: why does Google say “File Format: Unrecognized” on the top result for this search?

Posted by Sam Ruby at

Sam, I’m checking into the specifics of why we show that. IE6 doesn’t seem to handle that file type very well (offering to download the page as a file), but I think Google could still present the link better (that “Unrecognized” is unfortunate, for example).

Out of curiosity, how would you want that snippet to look in your ideal world? Maybe for browsers like Firefox that handle the page fine, just not even tell the user that it’s a different type of file?

Posted by Matt Cutts at

Out of curiosity, how would you want that snippet to look in your ideal world? Maybe for browsers like Firefox that handle the page fine, just not even tell the user that it’s a different type of file?

Let’s make it interesting.  I’ve restored content negotiation and enhanced my regular expression.  A few examples:

$ curl -s -I -H 'Accept:text/html,application/*' http://intertwingly.net/blog/ | grep Content-Type
Content-Type: application/xhtml+xml;charset=utf-8

$ curl -s -I -H 'Accept:application/xhtml+xml' http://intertwingly.net/blog/ | grep Content-Type
Content-Type: application/xhtml+xml;charset=utf-8

$ curl -s -I -H 'Accept:text/html' http://intertwingly.net/blog/ | grep Content-Type
Content-Type: text/html; charset=utf-8

Net result: the same content (with negotiated metadata) is sent to every browser from Lynx to Firefox 2.0, and each displays the content to the best of its abilities.  Lynx won’t understand any of the images; IE6 won’t understand some of the CSS or any of the SVG, but will display ads, since Google AdSense relies on document.write, which only works when the page is served as text/html; and browsers that support SVG will get the MIME type they need to trigger SVG support but, as a byproduct, they won’t see ads.

Based on this, I would suggest that Google not tell any user that it is a different type of file.

Question: what Accept header is sent by the Googlebot crawler?

Posted by Sam Ruby at

Mark got me your email, so I’ll drop you a line directly. I think we send ‘*/*’ for the Accept: header.

Posted by Matt Cutts at

MSIE 6.0 sends the following:

image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/x-shockwave-flash, */*

Note: no mention of EITHER text/html or application/xhtml+xml.  Nice.  NOT!

Beginnings of a test script.

Posted by Sam Ruby at

Actually, that’s only when you install plugins and such. The default is */*, much like it is for Safari. Accept is pretty useless nowadays.

Posted by Anne van Kesteren at

I just tried it on MSIE 7.0, and got similar results.

Since my wife and I don’t use IE except when we have to, I don’t believe either of us has intentionally installed any plugins.

As to the purported uselessness of the Accept header, do you have any alternate suggestions on how to serve web pages which contain inline SVG to a variety of browsers in a way that gracefully degrades?

Posted by Sam Ruby at

Checking if application/xhtml+xml occurs in an Accept header with a q value different from 0 should probably work fine in the majority of cases. Plus maybe specifically sniffing for Safari. Or you could use your script to inject the SVG dynamically and just serve everything up as text/html.

Posted by Anne van Kesteren at
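
(One reading of the “sniffing for Safari” suggestion, expressed in the style of the rewrite rules quoted below; a hypothetical sketch, not something either site is known to use:)

# serve XHTML when the Accept header asks for it, OR when the UA looks
# like Safari, whose default Accept header is just */*
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} Safari
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]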

Checking if application/xhtml+xml occurs in an Accept header with a q value different from 0 should probably work fine in the majority of cases.

OK, that confirms that the Accept header is useful.

My current set of .htaccess rules is as follows:

RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} ^([^.]*|.*\.html)$
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]

RewriteCond %{HTTP_ACCEPT} !image/gif
RewriteCond %{HTTP_ACCEPT} text/html\s*;\s*q=0\.?0*(\s|,|$) [OR]
RewriteCond %{HTTP_ACCEPT} !text/html
RewriteCond %{HTTP_ACCEPT} (application|\*)/(xhtml\+xml|\*)
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0\.?0*(\s|,|$)
RewriteCond %{REQUEST_URI} ^([^.]*|.*\.html)$
RewriteRule .* - [T=application/xhtml+xml;charset=utf-8]
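
To spot-check these rules against the headers discussed above: an Accept header of just */* (what Matt says Googlebot sends) satisfies the second block, while MSIE’s header fails both blocks because it includes image/gif but never mentions application/xhtml+xml.  Assuming the rules behave as written:

$ # Googlebot-style request: no image/gif, no text/html, matches */*
$ curl -s -I -H 'Accept: */*' http://intertwingly.net/blog/ | grep Content-Type
Content-Type: application/xhtml+xml;charset=utf-8

$ # MSIE-style request: image/gif is present, so the default applies
$ curl -s -I -H 'Accept: image/gif, image/jpeg, */*' http://intertwingly.net/blog/ | grep Content-Type
Content-Type: text/html; charset=utf-8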

Or you could use your script to inject the SVG dynamically and just serve everything up as text/html.

If you are talking about this script, it doesn’t work for syndication, and it doesn’t work reliably for Firefox.

Posted by Sam Ruby at

Plus maybe specifically sniffing for Safari.

Why do you need to sniff for Safari?

The only reason I can see is to disable sending application/xhtml+xml to that particular XHTML-UA.

I disable sending application/xhtml+xml to Safari for S5 slideshows (no inline SVG for Safari users!) because its support for JavaScript in XHTML is completely broken.

Otherwise, the usual Accept header logic works just fine for Safari.

Posted by Jacques Distler at

because its support for JavaScript in XHTML is completely broken

If a Safari user out there could confirm the value of HTTP_ACCEPT on this page, I would appreciate it.

Also, if anybody notes ways in which my weblog does not gracefully degrade for Safari users please let me know, as I include both SVG and JavaScript in my pages, but both are done in ways that are intended to gracefully degrade.

Posted by Sam Ruby at

Of course the SVG renders just fine in the current WebKit Nightlies.

JavaScript is hopelessly broken (as I said). And there are some minor CSS issues.

But no show-stoppers.

Well... OK ... since cookies don’t work, OpenID is problematic.

Posted by Jacques Distler at

it doesn’t work reliably for Firefox

The bug comments seem to indicate that it works on trunk.  Can you confirm?  And if so, who cares about Firefox 2?  No one of importance runs release builds of a browser.

Posted by Mark at

No one of importance runs release builds of a browser.

I did consider giving it the full CADT treatment, and resolving it worksforme, but then I realized that just kicking it out of my product would work as well, while giving me a thin veneer of productivity.

Posted by Phil Ringnalda at

Giving up on application/xhtml+xml

Until today, I had been following the recommendation of the W3C Validator and serving the XHTML pages of this blog as application/xhtml+xml (to all clients except Internet Explorer). Unfortunately, it appears that Google just doesn’t like to index...

Excerpt from pseudogreen at
