It’s just data

Happy Birthday, Feed Validator

The Feed Validator has been giving advice for five years as of today.  From a modest beginning of 300 test cases, there now are over two thousand.

My favorite post on this topic during these past five years is Common Feed Errors.  Time to revisit.

Missing atom:link with rel="self" (3072)
This is a relatively new recommendation from the RSS Advisory Board.  One thing I don’t remember sharing before is that typically when I do these checks, I find that fully one out of three feeds have already fixed these messages by the time I can recheck the feed myself.  This message appears to be no exception.  In addition, it already is fixed in WordPress HEAD.  Needless to say, I expect the frequency of this message to go down quickly.
XML Parsing error: syntax error (901)
This is the glaring exception to the one out of three rule I mentioned above.  XML Errors in Feeds still is the most systematic analysis of such errors that I am aware of to date.  It would be nice if that study were to be updated.
Email address is missing real name (822)
Another new recommendation.  Again, this should work itself out over time.  Adding either your real name, or a recognizable pseudonym, should increase usability with a number of feed aggregators.
item should contain a guid element (634)
This is not a new recommendation.  From the original RSS 2.0 spec:
In all cases, it’s recommended that you provide the guid, and if possible make it a permalink.
Undefined parent element: child (574)
This message covers two separate symptoms: typos and people not knowing about RSS 2.0’s support for namespaces.  While this issue is considerably less troublesome than non-well-formed feeds, what is a concern is that this many years after the RSS 2.0 spec was released, this problem is as prevalent as it is.
element must be an RFC-822 date-time (500)
This continues to be the most problematic date format ever.  I’m pleased to see that extensions such as SSE have moved away from it.
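A minimal sketch of one way to avoid hand-formatting these dates, assuming Python and its standard library rather than anything the Feed Validator itself prescribes: let `email.utils.format_datetime` produce the string. The helper name `rfc822_date` is made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def rfc822_date(dt):
    """Format an aware datetime as an RFC 822 style date-time (hypothetical helper)."""
    # usegmt=True emits "GMT" rather than "+0000"; it requires a UTC datetime,
    # so convert first. Day and month names are always English here.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)

print(rfc822_date(datetime(2007, 10, 22, 12, 30, tzinfo=timezone.utc)))
# prints "Mon, 22 Oct 2007 12:30:00 GMT"
```

Because the day name is computed from the date rather than typed by hand, it cannot disagree with it.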
Feeds should not be served with the type/subtype media type (479)
Misconfigured servers, often serving feeds as either text/html or text/plain, have regrettably led browser vendors and even spec writers to conclude that content sniffing is a necessity.
Your feed appears to be encoded as “this”, but your server is reporting “that” (376)
Another way in which servers are commonly misconfigured: the use of text/xml in ways that don’t comply with RFC 3023.
HTTP Error (381)
It is clear that not everybody has mastered even the most basic concepts of the internet; many still need a bit of help.  Don’t laugh, undoubtedly there are areas where you aren’t an expert.  Now look again at that count.  That many people needed additional help when the Feed Validator said that their feed was not found, or that there was a server error.  In the past week alone.
Image title doesn’t match channel title (278)
Another, relatively new, recommendation.
Invalid email address (274)
In general, this means that people are incorrectly using RSS 2.0 core elements when the Dublin Core extension is what they really want.
Self reference doesn’t match document location (214)
Sometimes this simply means that there are multiple URIs which can be used to fetch a feed (example http://www.example.com/… vs http://example.com/…), but in other cases there is a real problem.
element should not contain script tag (172)
Most well-maintained aggregators these days strip scripts from incoming feeds, so if you include such things in your feeds with the expectation that users will see the effects, you will often be disappointed.  Unfortunately, this often affects embedded YouTube videos.
Invalid HTML (166)
While HTML grammar rules are fairly lax (especially when compared against XML), there actually are some rules.  While browsers routinely deal with common variations (at times, with minor differences), the more important consideration is that a simple unmatched quote may confuse the code that scans your markup for security risks.  This can lead to users seeing widely divergent, often severely stripped, output.
element should not contain script attribute (150)
Same basic issue, but in this case dealing with attributes like onclick.
UnicodeError: decoding error, invalid data (146)
This is a common enough subclass of well-formedness errors that it merits its own message.  And, yes, that means that this count really should be added to the SAX Error count above.  Most commonly this error occurs when people write code that essentially does a bit-for-bit copy of data from a webpage (which defaults to iso-8859-1 encoding), to an XML feed (which defaults to utf-8).
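The byte-for-byte copy described above can be reproduced in a few lines. This is only an illustrative sketch in Python; the sample string and variable names are assumptions, not taken from any real feed.

```python
# Bytes as they might arrive from a page encoded as iso-8859-1.
raw = "café".encode("iso-8859-1")  # b'caf\xe9'

# Copied unchanged into a feed that claims utf-8, the bytes are invalid data.
try:
    raw.decode("utf-8")
    copied_ok = True
except UnicodeDecodeError:
    copied_ok = False

# Decode with the source encoding first, then encode for the feed.
fixed = raw.decode("iso-8859-1").encode("utf-8")  # b'caf\xc3\xa9'

print(copied_ok, fixed)
```

The point is that the conversion has to go through text: decode with the encoding the page actually used, then encode with the encoding the feed declares.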
Invalid URI character (93)
Most commonly, a space character.
Undefined named entity (86)
This is yet another well-formedness error common enough to merit its own message.  &nbsp; and &mdash; are not predefined in XML.
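A quick way to see the difference, sketched with Python's standard `xml.etree.ElementTree` parser (the helper `is_well_formed` is hypothetical, and the Feed Validator's own checks may work differently):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<title>a&nbsp;b</title>"))   # HTML named entity, undefined in XML
print(is_well_formed("<title>a&#160;b</title>"))   # numeric character reference
print(is_well_formed("<title>a &amp; b</title>"))  # one of XML's five predefined entities
```

Only the first document is rejected; a numeric character reference is the portable substitute for an HTML named entity.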
The XML encoding does not appear to match the characters used (83)
This is a variation on Unicode Errors.  In this case what you have is an incorrect encoding, but one that technically is legal, like taking data that is either utf-8 or win-1252 encoded and declaring it as iso-8859-1.  In many cases, what you will see in a feed is incorrect numeric character references, like &#146; when what is desired is a right single quote or &#8217;.
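Numeric references in the 128 to 159 range name control characters in Unicode, so a reference such as &#146; almost always started life as a win-1252 byte value. A rough sketch of a repair, assuming Python; the function name `fix_cp1252_references` is invented for illustration:

```python
import re

def fix_cp1252_references(text):
    """Rewrite decimal references 128-159 as Unicode code points (hypothetical helper)."""
    def remap(match):
        number = int(match.group(1))
        try:
            # Interpret the number as a win-1252 byte, not a Unicode code point.
            char = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is undefined in win-1252; leave it alone
        return "&#%d;" % ord(char)
    # Matches only the decimal references 128 through 159.
    return re.sub(r"&#(1(?:2[89]|[345]\d));", remap, text)

print(fix_cp1252_references("It&#146;s just data"))
# prints "It&#8217;s just data"
```

This handles decimal references only; hexadecimal references would need a second pattern.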
Incorrect day of week (83)
All I can say is that the sheer frequency of this error flabbergasts me.  People even have been known to protest when they get this message.  Again, don’t laugh, one day it could be you.
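For anyone who wants to check their own dates, here is a small sketch in Python. It assumes the standard library's `email.utils.parsedate_to_datetime`, which ignores the day name while parsing; `day_of_week_matches` is a hypothetical helper, not part of the Feed Validator.

```python
from email.utils import parsedate_to_datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_of_week_matches(date_string):
    """True if the leading day name agrees with the date (hypothetical helper)."""
    if "," not in date_string:
        return True  # the day of week is optional, so there is nothing to contradict
    claimed = date_string.split(",", 1)[0].strip()
    # The parser skips the day name, so the weekday is derived from the date alone.
    actual = DAY_NAMES[parsedate_to_datetime(date_string).weekday()]
    return claimed == actual

print(day_of_week_matches("Mon, 22 Oct 2007 12:30:00 GMT"))  # prints True
print(day_of_week_matches("Tue, 22 Oct 2007 12:30:00 GMT"))  # prints False
```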
Email address is not in recommended format (81)
Another new recommendation, but one that affects relatively few feeds.
Missing recommended iTunes parent element: child (75)
iTunes is optional, but if you add itunes elements you might as well follow the recommendations.
element should not contain HTML (75)
People still try to put escaped HTML in some of the darndest locations.  But I am pleased to report that this is down slightly from before.
Image link doesn’t match channel link (66)
Another long-standing recommendation -- this one is down significantly from prior times.
element must be a full URI (65)
Also down significantly.

Speaking of The XML encoding does not appear to match the characters used... ;)

Posted by Miles at

On the Feed Validator's fifth birthday, Sam Ruby recompiles a list of the most common feed errors. Are your friends making any of these?

[link] [more]...

Excerpt from programming: what's new online at

As ever, it seems Dave Winer disagrees with some of your finer points.

Posted by Noah Slater at

Miles: fixed.  Thanks!

Noah: hopefully the past and present RSS Advisory Board members can find a way to come to a common set of recommendations.  Alternatively, Dave is welcome to create his own profile, which I will honor.  And, as always, the Feed Validator source code is open source and he or anybody else is welcome to modify it to suit their needs and host it wherever they like.

Posted by Sam Ruby at

links for 2007-10-22

Happy Birthday, Feed Validator...

Excerpt from Breyten's Dev Blog at

And, as always, the Feed Validator source code is open source and he or anybody else is welcome to modify it to suit their needs and host it wherever they like.

Bah! It’s a lot more fun to complain. Besides, with just a single validator, taking over the world becomes much easier.

Posted by James Snell at

Sam,

Are your “profile conformance” tables live or snapshots?
[link]

- Mark

Posted by Mark Woodman at

Snapshots.  The code to produce them is in the same directory.  Feel free to modify it to (for example) use another feed list or to search for different messages.  Does require that you download the Feed Validator and modify the sys.path.insert line accordingly.  If you come up with something interesting, please let me know.

Posted by Sam Ruby at

Random Bits

The Atompub WG has concluded. Many thanks to the co-chairs Tim and Paul, to the secretary Sam, to the editors Joe, Bill, Mark and Robert, and to the many folks who contributed. The mailing lists will stay up and work continues on various Atom...

Excerpt from snellspace.com at

For those who care about these things, it appears that SAXError and UndefinedElement are over-counted.  The primary cause of both seems to be people who try to validate HTML pages which contain no obvious feed links, and then can’t resist clicking on the error messages produced.

I’ve modified the feed validator to no longer include the error message section in such situations.  Example.

Posted by Sam Ruby at

Nice breakdown of errors!

For the next version of Spinn3r I was going to try to ship a lot more stats about our crawler.

I released a CMS generator breakdown a few months ago:

Some stats I can think of...

If you can think of anything else I can try to fit it in.

Not many people have access to a crawler which is indexing 12M blogs so I figure it’s a good place to compute these stats.

Posted by Kevin Burton at
