Both Blogdex and KeepMedia seem to think that they each have exclusive rights to the domain of small positive integers. I wonder what part of Globally Unique Identifier people have difficulty understanding? Seems clear enough to me.
Because of this, the feedvalidator will now flag both of these feeds as having guids which are NotSufficientlyUnique. Thanks to Joseph Walton, such messages will be clearly marked as warnings and not cause the feed to be flagged as invalid.
While neither RSS 1.0 nor Atom can absolutely prevent collisions, both rdf:about and atom:id are defined as URIs. This means that identifiers such as bare integers (by virtue of not containing a colon) will be evaluated as relative to the source page, which in turn means that they only need to be locally unique.
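To make the relative-URI point concrete, here is a minimal sketch using Python's standard urllib.parse. The feed URLs are made up for illustration, and this only shows generic URI reference resolution, not what any particular aggregator or the feedvalidator actually does.

```python
from urllib.parse import urljoin, urlparse


def resolve_id(base_url: str, identifier: str) -> str:
    """Resolve an identifier as a URI reference against the feed's own URL."""
    return urljoin(base_url, identifier)


def is_absolute(identifier: str) -> bool:
    """An identifier with no scheme (no 'http:' etc.) is a relative reference."""
    return bool(urlparse(identifier).scheme)


# Hypothetical feed locations, both using the bare integer "42" as an id.
id_a = resolve_id("http://example.com/feeds/index.rdf", "42")
id_b = resolve_id("http://example.org/news/atom.xml", "42")

print(id_a)  # http://example.com/feeds/42
print(id_b)  # http://example.org/news/42
print(is_absolute("42"))  # False
```

Once resolved, the two ids no longer collide, even though both feeds wrote "42".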
Furthermore, both rdf:about and atom:id are defined as being required, which is a good thing.
If you're an MT user, I wish you the best of luck with atom:id.
Sam,
Thanks for adding this to the validator. On a related note, I'd like to thank you and Mark for writing the feed validator. I've lost count of the number of times I've gotten mail or bugs filed about how some feed doesn't work in RSS Bandit, which was quickly resolved by sending the person to the Feed Validator.
I do wonder about some other decisions you've made. A while ago I noticed the validator complains if a feed's content contains JavaScript because this could be used for malicious purposes. However, you don't issue a warning if a feed contains enclosures or binary content in Atom. This seems very inconsistent. After all, downloading and executing arbitrary binary content is just as bad as executing arbitrary JavaScript. Even an MP3 file can be used to 0wn your box, just look at the Winamp buffer overflow.
Is there a particular reason for this decision or did you guys just overlook this issue?
Dare: binary doesn't, by itself, mean executable. So, from that perspective, whether the mode is xml, escaped, or base64 should make no difference. What does, however, make a difference is whether the data, once unescaped or decoded, actually contains a script or not.
And, in case you are wondering, the validator does contain explicit code to detect scripts encoded in base64 in Atom feeds. Here is a testcase and here are the results.
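To illustrate the idea of checking the content after decoding rather than judging it by its mode, here is a rough sketch. It is not the validator's actual code: the function names and the deliberately crude regular expression are assumptions made for this example, and a real check would need a proper HTML parser.

```python
import base64
import html
import re

# Crude, illustrative pattern: script tags, on* event attributes, javascript: URLs.
SCRIPT_PATTERN = re.compile(r"<\s*script\b|\son\w+\s*=|javascript:", re.IGNORECASE)


def decode_content(payload: str, mode: str) -> str:
    """Undo the encoding named by the content's mode (xml, escaped, or base64)."""
    if mode == "base64":
        return base64.b64decode(payload).decode("utf-8", errors="replace")
    if mode == "escaped":
        return html.unescape(payload)
    return payload  # mode "xml": already inline markup


def contains_script(payload: str, mode: str) -> bool:
    """Look for script-like markup in the decoded content, whatever the mode."""
    return bool(SCRIPT_PATTERN.search(decode_content(payload, mode)))


encoded = base64.b64encode(b"<script>alert(1)</script>").decode("ascii")
print(contains_script(encoded, "base64"))
print(contains_script("&lt;script&gt;alert(1)&lt;/script&gt;", "escaped"))
```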
If you have other conditions you would like to see checked for, please let us know, preferably by opening a bug report or feature request on sourceforge.
Am I missing something? What on earth is the point of giving GUIDs to different posts? Isn't that what URLs are for? Can't any client generate a GUID just by hashing the post content? And, personally, I don't buy the scenario of moving your blogs across different domains. I'd say if the URL of your blog changes you have a new blog, and more than likely you want people to treat your blog as a completely new and separate entity. These kinds of fuzzy 'these-two-are-the-same-but-really-they're-not' situations always cause huge headaches in distributed systems for very little gain.
Bo, as far as the reason, reference the Globally Unique Identifier link in Sam's post, esp. "It's up to the source of the feed to establish the uniqueness of the string." Sounds like the validator is following the spec. How is hashing a post "globally unique"? For all you know, that's how Blogdex and KeepMedia generate guids. And if the URL of a blog changes, but the content moved with it, shouldn't the GUIDs stay the same? To me, "moving a blog" is the same as relocating the content. For example, the URL of Sam's blog has changed from when it was hosted at Userland, but the content moved with it. I haven't checked, but I assume that the guids stayed the same with the move.
Bo, I think that there are a number of aspects to this that need to be teased apart.
The best place to look for the primary reason why GUIDs were introduced into RSS 2.0 is in the comments section of that spec. The primary use case seems to be to allow "aggregators to not repeat items, even if there have been editing changes." Editing changes. Changes which would affect the value of a hash.
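A small sketch of why a content hash fails this use case while a publisher-assigned guid does not. The entry data and the choice of SHA-1 are assumptions for illustration only.

```python
import hashlib


def content_hash(text: str) -> str:
    """Identify an entry purely by hashing its body text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical entry, first as published and then after a typo fix.
original = {"guid": "http://example.com/2004/03/post-1", "body": "Teh quick brown fox."}
edited = {**original, "body": "The quick brown fox."}

# What an aggregator remembers after seeing the original.
seen_by_guid = {original["guid"]}
seen_by_hash = {content_hash(original["body"])}

# The guid survives the edit, so the item is recognized as a repeat.
is_repeat_by_guid = edited["guid"] in seen_by_guid
# The hash changes with the edit, so the item looks brand new.
is_repeat_by_hash = content_hash(edited["body"]) in seen_by_hash

print(is_repeat_by_guid, is_repeat_by_hash)  # True False
```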
A second use case for guids would be to enable the identification of blog entries that have been syndicated. Everything I say shows up in Planet Apache. It looks like weblogs 2 asp.net is a similar service. There are many others. In such circumstances, there is the potential for uniqueness to span feeds, and this presumably is the motivation for guids to be defined as globally unique IDs.
A third use case deals with portability of weblogs. Quite frankly, that use case remains speculative and controversial. It will be interesting to see how that discussion turns out.
Bottom line, while it may be tolerable for the same blog entry to occasionally be issued a new guid / rdf:about / atom:id value, it is never tolerable for two different blog entries to have the same guid / rdf:about / atom:id. That is what the feedvalidator is trying to guard against.
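The condition being guarded against can be sketched as a simple collision check across entries gathered from several feeds. This is an illustration only, not how the feedvalidator implements its NotSufficientlyUnique warning; the function name and the sample entries are made up.

```python
from collections import defaultdict


def find_collisions(entries):
    """Given (guid, link) pairs, return guids claimed by more than one distinct entry."""
    links_by_guid = defaultdict(set)
    for guid, link in entries:
        links_by_guid[guid].add(link)
    return {
        guid: sorted(links)
        for guid, links in links_by_guid.items()
        if len(links) > 1
    }


# Hypothetical entries aggregated from two feeds that both use small integers.
entries = [
    ("1", "http://example.com/stories/first"),
    ("1", "http://example.org/articles/alpha"),
    ("http://example.net/2004/03/unique", "http://example.net/2004/03/unique"),
]

collisions = find_collisions(entries)
print(collisions)
```

The same entry showing up twice with the same guid and link is not reported; only two different entries sharing one guid are.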
Sam,
Most binary content is dangerous, regardless of whether it is executable or not. Accepting arbitrary MP3 files or Word documents can lead to as many security issues as accepting arbitrary HTML fragments with embedded script depending on what tools you are using.
I'll probably file a bug report on SF later today. Thanks for the response.
Dare,
To warn about javascript being embedded in a feed is within the scope of what one might expect from a Feed-validator to do. But, surely, one cannot expect the Feed-validator to take on the role of Norton Anti-Virus and examine every possible type of binary file? Is the difference not obvious?
Tomas,
If the feed validator is going to go through the trouble of parsing the entire HTML content to ferret out JavaScript in onload attributes and in style elements, I don't see why warning a user that an enclosure or binary content in an Atom feed could be dangerous (a Word document or a Windows screen saver file, say) is that much more extraordinary. Personally, I find it more extraordinary that the validator flags img tags with onmouseover as issues but lets you have an enclosure or binary content that could be an arbitrary executable file.
Sam,
My suggestion would be for all content that is neither text nor HTML to be flagged as potentially dangerous. This is as accurate as claiming that JavaScript in HTML content is potentially dangerous.
I generally agree that entries should have globally unique identifiers, and on the web they are URIs. There is an alternative, for systems to determine an entry's identity through a combination of its characteristics (as done by FOAF) but without a framework in which to use this (e.g. you-know-what) I think it would be too much work.
Still, the MT problem Phil points to is tricky.
Just as a strawman, might there be a way of using, rather than URIs, URIRefs? i.e. use frag IDs to look after post versioning, so the original version of a post might be:
re. dangerous code - I'm afraid Dare's probably right, once you start deciding things are potentially dangerous there's a vast array of material to take into account. Personally I'd leave it to the client to figure out (i.e. if it automatically runs executables or scripts, then it will rapidly become unpopular!)
Sam, do you know if anyone has done a feed of exploits to test aggregators against, and also to test whether we are right that they are exploitable? I threatened to do it back at the time of the Great Platypus Attack, but I still haven't gotten around to it, and I'm beginning to suspect that in a number of cases our assumptions about what is and isn't dangerous could use some refining by actual test.
Sam, both the syndication scenario and the relocation scenario provide arguments for why you don't want atom:id to be relative -- it provides yet another way to make mistakes.
Hey Sam, thanks for pointing this out. I'd just like to point out that I have filed a patent for all integers up to 1 billion, but until that application comes back I've switched the GUIDs to a fully qualified URL. I had originally used just the internal Blogdex ID for brevity's sake and thought that I changed it a while back. It should be working now.