Just over a month ago, Tim Bray pointed to both Jacques’ Atom Torture Test and Planet Intertwingly. Regarding the latter, he noted with evident delight that NetNewsWire was able to tell him which entries he had already seen, because Planet made an effort to retain Atom ids.
Until today, it didn’t occur to me that those two were related. Programs which couldn’t handle such things as MathML do a disservice by resyndicating mangled or neutered content. This brings up a number of interesting questions. I’m going to take a stab at answering them, but in all honesty, this is a subject for interesting debate.
I have no problem with transformations which should be lossless. Making relative URIs absolute, or even rebasing them, should be OK. Adding or removing ignorable whitespace, and even lowercasing element names, should be OK.
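As a small illustration of the first kind of lossless transformation, here is a sketch of resolving relative URIs against a base. The function name `absolutize` and the example.org URLs are made up for the example; only `urllib.parse.urljoin` is real standard-library API.

```python
from urllib.parse import urljoin

def absolutize(href: str, base: str) -> str:
    """Resolve a possibly relative href against the entry's base URI."""
    return urljoin(base, href)

# Hypothetical base URI, e.g. taken from xml:base or the feed's own address.
base = "http://example.org/blog/2006/"

print(absolutize("images/fig1.png", base))         # relative path
print(absolutize("/about", base))                  # host-relative path
print(absolutize("http://other.example/x", base))  # already absolute, unchanged
```

The resolved link points at the same resource as before, which is why this sort of rewrite arguably leaves the entry's content intact.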
No matter how extensive the test suite, bugs are a fact of life. If there is a substantive difference that creeps in unintentionally, this should be treated as a bug and fixed.
Policy decisions are another matter. If styles or scripts are stripped, then either a new atom:id needs to be minted or that particular entry needs to not be re-syndicated.
Unsupported features, like MathML or inline SVG, are a special case of policy decisions, and probably should be treated likewise.
The first step is that the Feed Parser needs to be modified to return a flag for each entry indicating whether that entry has been sanitized.
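To make the idea concrete, here is a rough sketch of what a per-entry flag could look like. This is not the Feed Parser's actual API: `Entry`, `parse_entry`, and `strip_scripts` are hypothetical names, and the regex-based stripper is only a stand-in policy for the demonstration, not a safe sanitizer.

```python
import re
from dataclasses import dataclass

@dataclass
class Entry:
    id: str
    content: str
    sanitized: bool = False  # True when the sanitizer altered the content

# Toy policy (assumption): remove <script> and <style> elements outright.
_STRIP = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)

def strip_scripts(markup: str) -> str:
    return _STRIP.sub("", markup)

def parse_entry(entry_id: str, raw_content: str, sanitizer=strip_scripts) -> Entry:
    """Run the sanitizer and record whether it changed anything."""
    cleaned = sanitizer(raw_content)
    return Entry(entry_id, cleaned, sanitized=(cleaned != raw_content))

touched = parse_entry("tag:example.org,2006:1", "<p>hi</p><script>alert(1)</script>")
clean = parse_entry("tag:example.org,2006:2", "<p>hi</p>")
print(touched.sanitized, clean.sanitized)
```

A caller could then decide, entry by entry, whether to mint a new id or skip re-syndication.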
Since you can never really trust the producer (or republisher) to reliably sanitize feed entries, sanitization is always going to be, ultimately, the purview of the User-Agent.
I objected to having the MathML content of my entries stripped, without a new <atom:id> assigned. That’s an honest change in content, little different, really, than stripping out all the cuss-words.
While it’s true that there are not a lot of Atom clients that could tell the difference, such clients do exist. In any case, it’s a bad precedent for the future where, presumably, such clients will be more common.
And, anyway, you haven’t really sanitized a feed, until you’ve changed all instances of “hell” to “heck.”
Disagree. The presence of a source element should be enough to tip off a consumer. Suppose Symantec started making a Planet product that sanitized content coming into a corporate environment. Should they mint new atom:id elements? I don’t think so.
I’m shocked to say that I agree with Robert. Sanitizing content in an entry does not change the identity of that entry. Perhaps some other piece of metadata should be changed (like atom:updated and/or some extension element) and an atom:source should be inserted if it’s not already there, but changing atom:id? Definitely not.
“Sanitize,” in this context, means “change the <atom:content> in some essentially lossy way.” Depending on what your Policy is (which could include the removal of cuss words, or of mentions of Vi@gra), the <atom:content> of different instances of the same <atom:id> can no longer be considered “the same.”
If User-Agents are going to treat an <atom:entry> as read, if it has the same <atom:id>, how does the presence of an <atom:source> element help?
How is the user supposed to know that some entry marked as “read” is not the same as the entry he actually read, because the content of the latter had been "sanitized"?
Jacques: The presence of the atom:source allows the client a way of going back and locating the original, unmodified entry. Polite intermediaries modifying an entry should likely use some form of extension element that indicates that some form of third-party modification has been performed.
What steps do actual UAs take to guard against spoofing attacks? And what “sanitization” Policies of republishers are “acceptable” and will not trigger that anti-spoofing code?
Either way, there’s the question: should republishers disclose their sanitization Policies? Should they flag individual entries as having been sanitized? And, if so, how?
I agree with Robert and James: re-publishers MUST use source to provide some sort of traceability, but if they start changing IDs, you don’t have a prayer of ever achieving sync between these systems.
Sanitizing content in an entry does not change the identity of that entry
Policy is a tricky issue, I can think of several levels:
<ol>
<li>Fixing up broken HTML in partial content feeds (content produced without attempting to close tags)</li>
<li>Modifying content to produce safer RSS - e.g. removing or editing style tags or attributes</li>
<li>Outright content modification (hell becomes heck)</li>
</ol>
At first glance, I thought that markup is generally OK to modify without substantially changing the content, but your example shows that’s not the case - MathML is markup. I’m not especially familiar with it, but it seems to me that stripping that markup would be removing important information, changing the meaning of the equation represented.
At first glance, I thought that markup is generally OK to modify without substantially changing the content
At second glance, you wouldn’t think that.
Take an entry with several XHTML tables in it, and strip out all <table>, <tr> and <td> elements (because, I mean, who uses those anyway?). Not much left to the meaning of the original content, is there?
MathML is just a slightly more exotic version of the same problem.
J. Distler: “Either way, there’s the question: should republishers disclose their sanitization Policies? Should they flag individual entries as having been sanitized? And, if so, how?”
Some form of extension element is likely the right approach. What that extension would look like, I’m not sure. Microsoft’s SSE or my thought-experiment around Atom revisions could likely provide interesting starting points. Alternatively, or in addition, a “via” link would likely be appropriate to add to the entry. In any case, the exact how is debatable. What should not be debatable is whether or not atom:id ever changes.
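One way to picture the approach argued for here: a republisher keeps atom:id, adds an atom:source if there is none, and attaches a marker saying the entry was modified. The sketch below is only a thought experiment; the `ex:modified` element and its namespace URI are invented for illustration and are not part of Atom or any published extension.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
EXT = "http://example.org/ns/resyndication"  # hypothetical extension namespace

ET.register_namespace("atom", ATOM)
ET.register_namespace("ex", EXT)

def mark_modified(entry, feed_id, feed_title, note):
    """Add atom:source if absent plus a hypothetical ex:modified marker.

    atom:id is deliberately left untouched.
    """
    if entry.find(f"{{{ATOM}}}source") is None:
        source = ET.SubElement(entry, f"{{{ATOM}}}source")
        ET.SubElement(source, f"{{{ATOM}}}id").text = feed_id
        ET.SubElement(source, f"{{{ATOM}}}title").text = feed_title
    marker = ET.SubElement(entry, f"{{{EXT}}}modified")  # invented element
    marker.text = note
    return entry

SAMPLE = (
    '<entry xmlns="http://www.w3.org/2005/Atom">'
    "<id>tag:example.org,2006:entry-1</id><title>Example</title></entry>"
)
entry = mark_modified(
    ET.fromstring(SAMPLE), "tag:example.org,2006:feed", "Example Feed", "MathML stripped"
)
print(ET.tostring(entry, encoding="unicode"))
```

A client that understands the marker could follow the source back to the original; one that does not simply ignores the unknown element.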
Take an entry with several XHTML tables in it, and strip out all <table>, <tr> and <td> elements (because, I mean, who uses those anyway?). Not much left to the meaning of the original content, is there?
I remember the days when table was new, and not widely supported... and what you describe is exactly what happened.
I’ve spent some time looking at UFP, and the reason why MathML is currently stripped is simple: the only secure way to display HTML is to only allow a known subset of tags through. MathML is not “known” to the UFP. This is complicated by the fact that namespace processing is somewhat incomplete.
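For readers who have not seen this style of sanitizer, here is a minimal sketch of the "only let a known subset through" idea, built on Python's standard `html.parser`. It is not the UFP's code: the tag and attribute sets are illustrative assumptions, and it makes no attempt to vet attribute values such as `javascript:` URLs.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "em", "strong", "a", "ul", "ol", "li"}  # illustrative subset
ALLOWED_ATTRS = {"href", "title"}                            # illustrative subset

class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the sanitized markup
        self.dropped = set()  # names of tags that were not let through

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.dropped.add(tag)
            return
        kept = []
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                kept.append(' %s="%s"' % (name, escape(value or "", quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text survives even when its enclosing tag is dropped.
        self.out.append(escape(data, quote=False))

def sanitize(markup):
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out), parser.dropped

cleaned, dropped = sanitize("<p>Euler: <math><mi>e</mi></math></p>")
print(cleaned, sorted(dropped))
```

Anything unknown, MathML included, loses its tags while its text is kept, which is exactly how an equation ends up as a run of bare characters.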
I’m also with Robert and James. Sanitization is only changing a representation of the resource (/creating a new one). The (conceptual) resource stays the same, hence should keep the same identifier.
If it’s of any use, the MathML elements and attributes that I use are listed here.
That, indeed, is helpful. What would also be helpful is if you can verify two things: (1) the outer element of any MathML sequence is always <math> (this is how I read the MathML specification), and (2) your constrained subset requires that all valid descendant elements of the <math> element be in the MathML namespace. From the spec, I see xlink, OpenMath, and SVG counter-examples. If those are not allowed for the moment, it would be easier for me. Additionally, it would be much easier if I didn’t have to worry about elements in the XHTML namespace as descendants of the <math> element.
(1) the outer element of any MathML sequence is always <math> (this is how I read the MathML specification)
Correct.
(2) your constrained subset requires that all valid descendant elements of the <math> element be in the MathML namespace. From the spec, I see xlink, OpenMath, and SVG counter-examples. If those are not allowed for the moment, it would be easier for me.
Currently, the only foreign namespaces that can occur are xlink:type, xlink:show and xlink:href attributes on the mrow element (for turning a mathematical expression into a hyperlink).
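The constraints described in this exchange (every element under <math> in the MathML namespace, with xlink:type, xlink:show and xlink:href on mrow as the only foreign attributes) could be checked along these lines. This is a sketch using `xml.etree.ElementTree`; the function name `foreign_nodes` is made up, and it checks namespaces only, not element or attribute names.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"
XLINK = "http://www.w3.org/1999/xlink"
ALLOWED_XLINK = {"{%s}%s" % (XLINK, name) for name in ("type", "show", "href")}
MROW = "{%s}mrow" % MATHML

def foreign_nodes(math):
    """List elements and attributes under <math> that break the constraints."""
    problems = []
    for el in math.iter():
        if not isinstance(el.tag, str):
            continue  # comments / processing instructions
        if not el.tag.startswith("{%s}" % MATHML):
            problems.append("element " + el.tag)
            continue
        for attr in el.attrib:
            if not attr.startswith("{"):
                continue  # attribute in no namespace
            if attr in ALLOWED_XLINK and el.tag == MROW:
                continue  # the permitted hyperlink case
            problems.append("attribute %s on %s" % (attr, el.tag))
    return problems

ok = ET.fromstring(
    '<math xmlns="%s" xmlns:xlink="%s">'
    '<mrow xlink:type="simple" xlink:href="http://example.org/"><mi>x</mi></mrow>'
    "</math>" % (MATHML, XLINK)
)
bad = ET.fromstring(
    '<math xmlns="%s"><svg xmlns="http://www.w3.org/2000/svg"/></math>' % MATHML
)
print(foreign_nodes(ok), foreign_nodes(bad))
```

An empty list means the fragment stays within the constrained subset; anything else names the offending node.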
Embedding inline SVG would be an interesting project for the future, to deal with the fact that MathML is inadequate to express certain more-complicated mathematical formulae. (See this discussion of XYpic.sty and DCpic.sty)
I don’t have any use for Content-MathML (throw away half the Spec right there) or OpenMath (similar in goals to Content-MathML).
The trailing slash is MT’s notation for an empty element: <mprescripts />, <none />. Dunno what your question about mroot is. It’s an element with two children (<mroot> base index </mroot>).
Can you provide a live example of XLink being used in MathML?
Hmmm. Fascinating. You’ve found a bug in itex2MML. I’ve never actually used that feature, and itex2MML doesn’t implement it correctly. It’s designed to produce MathML fragments that look like
Sam is experimenting with making sure that PlanetPlanet can properly resyndicate Atom entries that contain MathML. Just to test things out, I ran the feed through Abdera’s parser and tried an XPath to select all of the contained MathML...
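For anyone wanting to try a similar check without Abdera, the same kind of query can be approximated in Python. This is a stand-in, not the Abdera code: it uses `xml.etree.ElementTree`, whose XPath support is limited, and the sample feed is invented.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://www.w3.org/1998/Math/MathML",
}

def mathml_fragments(feed_xml):
    """Return every MathML <math> element found anywhere in the feed."""
    root = ET.fromstring(feed_xml)
    return root.findall(".//m:math", NS)

# Invented sample feed with one MathML fragment inside XHTML content.
FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry>'
    "<id>tag:example.org,2006:entry-1</id>"
    '<content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">'
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>'
    "</div></content></entry></feed>"
)

found = mathml_fragments(FEED)
print(len(found))
```

If the MathML was stripped somewhere along the way, the query comes back empty.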