Both Blogdex and KeepMedia seem to think that they each have exclusive rights to the domain of small positive integers. I wonder what part of Globally Unique Identifier people have difficulty understanding? Seems clear enough to me.
What happens when guids collide? Well, at least one popular aggregator will decide that you have already seen the post.
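To make the failure concrete, here is a minimal sketch of an aggregator that tracks seen items by guid alone. It is not the code of any particular aggregator; the feed names, titles, and the unseen_items helper are made up for illustration.

```python
def unseen_items(feeds):
    """Yield (feed_name, title) for items whose guid has not been seen yet.

    Assumption: the aggregator keys its "already seen" set on the bare guid,
    without also taking the feed it came from into account.
    """
    seen = set()
    for feed_name, items in feeds:
        for guid, title in items:
            if guid in seen:
                continue  # treated as a repeat, even if it is a different post
            seen.add(guid)
            yield feed_name, title


# Two hypothetical feeds that both number their posts 1, 2, 3, ...
feeds = [
    ("feed-a", [("1", "First post from A"), ("2", "Second post from A")]),
    ("feed-b", [("1", "Unrelated post from B"), ("3", "Another post from B")]),
]

shown = list(unseen_items(feeds))
for feed_name, title in shown:
    print(feed_name, title)
```

The post from feed-b with guid "1" never shows up, because feed-a got there first.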
Because of this, the feedvalidator will now flag both of these feeds as having guids which are NotSufficientlyUnique. Thanks to Joseph Walton, such messages will be clearly marked as warnings and not cause the feed to be flagged as invalid.
While neither RSS 1.0 nor Atom can absolutely prevent collisions, both rdf:about and atom:id are defined as URIs. This means that identifiers like these small integers (by virtue of not containing a colon) will be evaluated as relative to the source page. Which means that they only need to be locally unique.
Furthermore, both rdf:about and atom:id are defined as being required. Which is a good thing.
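As a rough illustration of the relative URI point, here is a sketch using Python's standard urllib.parse.urljoin. The example.org and example.com feed URLs are hypothetical, and resolve_id is just a name chosen for this sketch.

```python
from urllib.parse import urljoin


def resolve_id(feed_url, identifier):
    """Resolve an identifier as a URI reference against the feed's own URL."""
    return urljoin(feed_url, identifier)


# The same bare identifier "1" in two different (hypothetical) feeds.
id_a = resolve_id("http://example.org/a/index.rdf", "1")
id_b = resolve_id("http://example.com/b/feed.xml", "1")

# An identifier that is already an absolute URI is left alone.
id_abs = resolve_id("http://example.org/a/index.rdf", "http://example.org/posts/1")

print(id_a)
print(id_b)
print(id_abs)
```

Once resolved, the two "1" values no longer collide, which is why local uniqueness is enough in this case.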
If you're an MT user, I wish you the best of luck with atom:id.
Sam,
Thanks for adding this to the validator. On a related note I'd like to thank you and Mark for writing the feed validator. I've lost count of the number of times I've gotten mail or bugs filed about how some feed doesn't work in RSS Bandit, which was quickly resolved by sending the person to the Feed Validator.
I do wonder about some other decisions you've made. A while ago I noticed the validator complains if a feed's content contains Javascript because this could be used for malicious purposes. However you don't issue a warning if a feed contains enclosures or binary content in ATOM. This seems very inconsistent. After all, downloading and executing arbitrary binary content is just as bad as executing arbitrary Javascript. Even an MP3 file can be used to 0wn your box, just look at the Winamp buffer overflow.
Is there a particular reason for this decision or did you guys just overlook this issue?
Dare: binary doesn't, by itself, mean executable. So, from that perspective, whether the mode is xml, escaped, or base64 should make no difference. What does, however, make a difference is whether the data, once unescaped or decoded, actually contains a script or not.
And, in case you are wondering, the validator does contain explicit code to detect scripts encoded in base64 in Atom feeds. Here is a testcase and here are the results.
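For the curious, the idea can be sketched in a few lines of Python. This is not the validator's actual code: the contains_script function and the very simple pattern are assumptions made for illustration, and a real check would need to look for far more than a script tag.

```python
import base64
import html
import re

# Deliberately simplistic: only looks for an opening <script> tag.
SCRIPT_PATTERN = re.compile(rb"<\s*script\b", re.IGNORECASE)


def contains_script(content, mode="xml"):
    """Decode content according to its mode, then look for a script tag.

    The mode only decides how to get back to the underlying data;
    the check itself is the same in all three cases.
    """
    if mode == "base64":
        data = base64.b64decode(content)
    elif mode == "escaped":
        data = html.unescape(content).encode("utf-8")
    else:
        data = content.encode("utf-8")
    return bool(SCRIPT_PATTERN.search(data))


hidden = base64.b64encode(b"<script>alert(1)</script>").decode("ascii")
harmless = base64.b64encode(b"hello world").decode("ascii")

print(contains_script(hidden, mode="base64"))
print(contains_script(harmless, mode="base64"))
print(contains_script("&lt;script&gt;x&lt;/script&gt;", mode="escaped"))
```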
If you have other conditions you would like to see checked for, please let us know, preferably by opening a bug report or feature request on sourceforge.
Bo, I think that there are a number of aspects to this that need to be teased apart.
The best place to look for the primary reason why GUIDs were introduced into RSS 2.0 is in the comments section of that spec. The primary use case seems to be to allow "aggregators to not repeat items, even if there have been editing changes". Editing changes. Changes which would affect the value of a hash.
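A small sketch of the difference, assuming an aggregator that hashes the title and body to decide whether it has seen an item. The entry, its URL, and the content_hash helper are all hypothetical.

```python
import hashlib


def content_hash(title, body):
    """Hash the visible content of an entry (one possible choice of fields)."""
    return hashlib.sha256((title + "\n" + body).encode("utf-8")).hexdigest()


original = {
    "guid": "http://example.org/2004/03/31/post",  # hypothetical guid
    "title": "Guids",
    "body": "Teh first draft.",
}
# The author fixes a typo; the guid stays the same.
edited = dict(original, body="The first draft.")

hash_changed = (
    content_hash(original["title"], original["body"])
    != content_hash(edited["title"], edited["body"])
)
guid_changed = original["guid"] != edited["guid"]

print("hash changed:", hash_changed)
print("guid changed:", guid_changed)
```

An aggregator keyed on the hash would show the corrected entry as new; one keyed on the guid would not.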
A second use case for guids would be to enable the identification of blog entries that have been syndicated. Everything I say shows up in Planet Apache. It looks like weblogs 2 asp.net is a similar service. There are many others. In such circumstances, there is the potential for uniqueness to span feeds, and this presumably is the motivation for guids to be defined as Globally unique ids.
A third use case deals with portability of weblogs. Quite frankly, that use case remains speculative and controversial. It will be interesting to see how that discussion turns out.
Bottom line, while it may be tolerable for the same blog entry to occasionally be issued a new guid / rdf:about / atom:id value, it is never tolerable for two different blog entries to have the same guid / rdf:about / atom:id. That is what the feedvalidator is trying to guard against.
Sam,
Most binary content is dangerous, regardless of whether it is executable or not. Accepting arbitrary MP3 files or Word documents can lead to as many security issues as accepting arbitrary HTML fragments with embedded script depending on what tools you are using.
I'll probably file a bug report on SF later today. Thanks for the response.
Dare, can you explain the danger of "binary"? Text systems are also subject to buffer overflows. Escaped data is just another form of encoding.
Should everything be disallowed except for inline XML?
For that matter, should all RSS 2.0 enclosures be flagged?
It seems to me that the issue is more the mime type than the encoding used. HTML (however obscured) has certain dangers, Word files have others.
Tomas,
If the feed validator is going to go through the trouble of parsing the entire HTML content to ferret out Javascript in onload attributes and in style elements, I don't see why warning a user that an enclosure or binary content in an ATOM feed could be dangerous, because it is a Word document or a Windows screen saver file, is that much more extraordinary. Personally, I find it more extraordinary that the validator flags img tags with onmouseover as issues but lets you have an enclosure or binary content that could be an arbitrary executable file.
Sam,
My suggestion would be for all content that is neither text nor HTML to be flagged as potentially dangerous. This is as accurate as claiming that Javascript in HTML content is potentially dangerous.
I generally agree that entries should have globally unique identifiers, and on the web they are URIs. There is an alternative, for systems to determine an entry's identity through a combination of its characteristics (as done by FOAF) but without a framework in which to use this (e.g. you-know-what) I think it would be too much work.
Still, the MT problem Phil points to is tricky.
Just as a strawman, might there be a way of using, rather than URIs, URIRefs? i.e. use frag IDs to look after post versioning, so the original version of a post might be:
tag:www.intertwingly.net/blog,2004-03-31:weblog.1
and a revised version
tag:www.intertwingly.net/blog,2004-03-31:weblog.1#2
or even
tag:www.intertwingly.net/blog,2004-03-31:weblog.1#2004-04-01
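Purely to illustrate the strawman, a consumer could split such an identifier into the part that names the post and the part that names the version. The split_version helper is hypothetical, and nothing here is an agreed convention.

```python
def split_version(identifier):
    """Split an identifier into (post id, version); version is '' if absent."""
    post_id, _, version = identifier.partition("#")
    return post_id, version


original = "tag:www.intertwingly.net/blog,2004-03-31:weblog.1"
revised = "tag:www.intertwingly.net/blog,2004-03-31:weblog.1#2"

same_post = split_version(original)[0] == split_version(revised)[0]
versions = (split_version(original)[1], split_version(revised)[1])

print(same_post)
print(versions)
```

Both identifiers name the same post; only the fragment tells the versions apart.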
re. dangerous code - I'm afraid Dare's probably right, once you start deciding things are potentially dangerous there's a vast array of material to take into account. Personally I'd leave it to the client to figure out (i.e. if it automatically runs executables or scripts, then it will rapidly become unpopular!)
if it automatically runs executables or scripts, then it will rapidly become unpopular!
Which is why the world has migrated so rapidly from IE and Outlook to Mozilla and Evolution.
I've filed a bug on this. Good thing I happened by here!
(Not sure what company policy is on representing myself like this, but standard disclaimer applies, etc.)