It’s just data

Uniqueness

It seems that Blogdex and KeepMedia each think that they have exclusive rights to the domain of small positive integers.  I wonder what part of Globally Unique Identifier people have difficulty understanding?  It seems clear enough to me.

What happens when guids collide?  Well, at least one popular aggregator will decide that you have already seen the post.
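
A collision plays out roughly like this (a minimal sketch of hypothetical aggregator logic, not any particular product's code):

```python
# An aggregator that keys seen items by guid alone will silently
# drop a *different* post from another feed that reuses the guid.
seen_guids = set()

def process_item(guid, title):
    """Return the title to display, or None if the guid was already seen."""
    if guid in seen_guids:
        return None  # treated as a duplicate, even across unrelated feeds
    seen_guids.add(guid)
    return title

assert process_item("12345", "Post from Blogdex") == "Post from Blogdex"
# A KeepMedia post that happens to reuse guid "12345" is swallowed:
assert process_item("12345", "Post from KeepMedia") is None
```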

Because of this, the feedvalidator will now flag both of these feeds as having guids which are NotSufficientlyUnique.  Thanks to Joseph Walton, such messages will be clearly marked as warnings and not cause the feed to be flagged as invalid.

While neither RSS 1.0 nor Atom can absolutely prevent collisions, both rdf:about and atom:id are defined as URIs.  This means that an identifier which does not contain a colon is a relative reference, and will be resolved relative to the source page.  Which means that such identifiers only need to be locally unique.
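
For example, resolving the same colon-less identifier against two different feed URLs (hypothetical addresses) yields distinct absolute URIs, which is why local uniqueness suffices:

```python
from urllib.parse import urljoin

# A colon-less identifier is a relative URI reference, so it is
# resolved against the feed's base URI.  Two feeds using the same
# local id "entry-42" still end up with distinct absolute identifiers.
base_a = "http://example.com/feed.rdf"   # hypothetical feed URLs
base_b = "http://example.org/feed.rdf"

print(urljoin(base_a, "entry-42"))  # http://example.com/entry-42
print(urljoin(base_b, "entry-42"))  # http://example.org/entry-42
```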

Furthermore, both rdf:about and atom:id are defined as being required.  Which is a good thing.


If you're an MT user, I wish you the best of luck with atom:id

[link]

Posted by Richard at

Sam,
Thanks for adding this to the validator. On a related note I'd like to thank you and Mark for writing the feed validator. I've lost count of the number of times I've gotten mail or bugs filed about how some feed doesn't work in RSS Bandit, which was quickly resolved by sending the person to the Feed Validator.

I do wonder about some other decisions you've made. A while ago I noticed the validator complains if a feed's content contains Javascript because this could be used for malicious purposes. However you don't issue a warning if a feed contains enclosures or binary content in ATOM. This seems very inconsistent. After all, downloading and executing arbitrary binary content is just as bad as executing arbitrary Javascript. Even an MP3 file can be used to 0wn your box, just look at the Winamp buffer overflow.

Is there a particular reason for this decision or did you guys just overlook this issue?

Posted by Dare Obasanjo at

Dare: binary doesn't, by itself, mean executable.  So, from that perspective, whether the mode is xml, escaped, or base64 should make no difference.  What does, however, make a difference is whether the data, once unescaped or decoded, actually contains a script or not.

And, in case you are wondering, the validator does contain explicit code to detect scripts encoded in base64 in Atom feeds.  Here is a testcase and here are the results.
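
The kind of check involved can be sketched like this (an illustration of the idea, not the validator's actual code):

```python
import base64
import re

def looks_scripty(encoded: str) -> bool:
    """Decode base64 content and look for script markers in the result."""
    decoded = base64.b64decode(encoded).decode("utf-8", errors="replace")
    return bool(re.search(r"<script\b|javascript:", decoded, re.IGNORECASE))

payload = base64.b64encode(b"<script>alert('hi')</script>").decode("ascii")
assert looks_scripty(payload)
assert not looks_scripty(base64.b64encode(b"<p>harmless</p>").decode("ascii"))
```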

If you have other conditions you would like to see checked for, please let us know, preferably by opening a bug report or feature request on sourceforge.

Posted by Sam Ruby at

Am I missing something? What on earth is the point of giving GUIDs to different posts? Isn't that what URLs are for? Can't any client generate a GUID just by hashing the post content? And, personally, I don't buy the scenario of moving your blogs across different domains. I'd say if the URL of your blog changes you have a new blog, and more than likely you want people to treat your blog as a completely new and separate entity. These kinds of fuzzy 'these-two-are-the-same-but-really-they're-not' situations always cause huge headaches in distributed systems for very little gain.

Posted by Bo at

Bo, as far as the reason, reference the Globally Unique Identifier link in Sam's post, esp. "It's up to the source of the feed to establish the uniqueness of the string.".  Sounds like the validator is following the spec.  How is hashing a post "globally unique"?  For all you know, that's how Blogdex and KeepMedia generate guids.  And if the URL of a blog changes, but the content moved with it, shouldn't the GUIDs stay the same?  To me, "moving a blog" is the same as relocating the content.  For example, the URL of Sam's blog has changed from when it was hosted at Userland, but the content moved with it.  I haven't checked, but I assume that the guids stayed the same with the move.

Posted by Gordon Weakliem at

Bo, I think that there are a number of aspects to this that need to be teased apart.

The best place to look for the primary reason why GUIDs were introduced into RSS 2.0 is in the comments section of that spec.  The primary use case seems to be to allow "aggregators to not repeat items, even if there have been editing changes".  Editing changes.  Changes which would affect the value of a hash.
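
A quick illustration of why a content hash fails that use case:

```python
import hashlib

# Editing an entry changes its content hash, so a hash of the content
# cannot serve as a stable guid across "editing changes".
original = "We fear change."
revised = "We fear change!"   # a one-character edit

h1 = hashlib.sha1(original.encode()).hexdigest()
h2 = hashlib.sha1(revised.encode()).hexdigest()
assert h1 != h2  # an aggregator keying on the hash sees a brand-new item
```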

A second use case for guids would be to enable the identification of blog entries that have been syndicated.  Everything I say shows up in Planet Apache.  It looks like weblogs 2 asp.net is a similar service.  There are many others.  In such circumstances, there is the potential for uniqueness to span feeds, and this presumably is the motivation for guids to be defined as Globally unique ids.

A third use case deals with portability of weblogs.  Quite frankly, that use case remains speculative and controversial.  It will be interesting to see how that discussion turns out.

Bottom line, while it may be tolerable for the same blog entry to occasionally be issued a new guid / rdf:about / atom:id value, it is never tolerable for two different blog entries to have the same guid / rdf:about / atom:id.  That is what the feedvalidator is trying to guard against.

Posted by Sam Ruby at

Sam,
  Most binary content is dangerous, regardless of whether it is executable or not. Accepting arbitrary MP3 files or Word documents can lead to as many security issues as accepting arbitrary HTML fragments with embedded script, depending on what tools you are using.

I'll probably file a bug report on SF later today. Thanks for the response.

Posted by Dare Obasanjo at

Dare,
To warn about Javascript being embedded in a feed is within the scope of what one might expect a feed validator to do. But, surely, one cannot expect the feed validator to take on the role of Norton Anti-Virus and examine every possible type of binary file? Is the difference not obvious?

Posted by Tomas at

Dare, can you explain the danger of "binary"?  Text systems are also subject to buffer overflows.  Escaped data is just another form of encoding.

Should everything be disallowed except for inline XML?

For that matter, should all RSS 2.0 enclosures be flagged?

It seems to me that the issue is more the mime type than the encoding used. HTML (however obscured) has certain dangers, Word files have others.
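
The mime-type-first approach could be sketched like this (a hypothetical policy for illustration, not the validator's actual rules):

```python
# Flag content by its declared mime type, regardless of whether it
# arrives inline as XML, escaped, or base64-encoded.
RISKY_TYPES = {
    "text/html",                  # can carry obscured script
    "application/msword",         # macro-capable documents
    "application/octet-stream",   # arbitrary binaries
}  # illustrative list only

def should_warn(mime_type: str) -> bool:
    """Return True if the declared type warrants a security warning."""
    return mime_type.lower() in RISKY_TYPES

assert should_warn("application/msword")
assert not should_warn("image/png")
```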

Posted by Sam Ruby at

Tomas,
  If the feed validator is going to go through the trouble of parsing the entire HTML content to ferret out Javascript in onload attributes and in style elements, I don't see why warning a user that an enclosure or binary content in an ATOM feed could be a dangerous Word document or Windows screen saver file is any more extraordinary. Personally, I find it more extraordinary that the validator flags img tags with onmouseover as issues but lets you have an enclosure or binary content that could be an arbitrary executable file.

  Sam,
  My suggestion would be for all non-textual or non-HTML content to be flagged as potentially dangerous. This is as accurate as claiming that Javascript in HTML content is potentially dangerous.

Posted by Dare Obasanjo at

FOR GREAT UNIQUENESS in python check out mxUID.

Posted by John Beimler at

I generally agree that entries should have globally unique identifiers, and on the web they are URIs. There is an alternative, for systems to determine an entry's identity through a combination of its characteristics (as done by FOAF) but without a framework in which to use this (e.g. you-know-what) I think it would be too much work.
Still, the MT problem Phil points to is tricky.

Just as a strawman, might there be a way of using, rather than URIs, URIRefs? i.e. use frag IDs to look after post versioning, so the original version of a post might be:

tag:www.intertwingly.net/blog,2004-03-31:weblog.1

and a revised version

tag:www.intertwingly.net/blog,2004-03-31:weblog.1#2

or even

tag:www.intertwingly.net/blog,2004-03-31:weblog.1#2004-04-01

re. dangerous code - I'm afraid Dare's probably right, once you start deciding things are potentially dangerous there's a vast array of material to take into account. Personally I'd leave it to the client to figure out (i.e. if it automatically runs executables or scripts, then it will rapidly become unpopular!)

Posted by Danny at

if it automatically runs executables or scripts, then it will rapidly become unpopular!

Which is why the world has migrated so rapidly from IE and Outlook to Mozilla and Evolution.

Posted by Mark at

Sam, do you know if anyone has done a feed of exploits to test aggregators against, and also to test whether we are right that they are exploitable? I threatened to do it back at the time of the Great Platypus Attack, but I still haven't gotten around to it, and I'm beginning to suspect that in a number of cases our assumptions about what is and isn't dangerous could use some refining by actual test.

Posted by Phil Ringnalda at

Sam, both the syndication scenario and the relocation scenario provide arguments for why you don't want atom:id to be relative -- it provides yet another way to make mistakes.

Posted by Sam Ruby at

I've filed a bug on this. Good thing I happened by here!

(Not sure what company policy is on representing myself like this, but standard disclaimer applies, etc.)

Posted by a KeepMedia developer at

Hey Sam, thanks for pointing this out. I'd just like to point out that I have filed a patent for all integers up to 1 billion, but until that application comes back I've switched the GUIDs to a fully qualified URL. I had originally used just the internal Blogdex ID for brevity's sake and thought that I changed it a while back. It should be working now.

Posted by Cameron at

Reuters joins the club.

Posted by Sam Ruby at
