It’s just data

Re-syndicating vs sanitizing

Just over a month ago, Tim Bray pointed to both Jacques’ Atom Torture Test and Planet Intertwingly. Regarding the latter, he noted with evident delight that NetNewsWire was able to tell him which entries he had already seen, because Planet made an effort to retain atom ids.

Until today, it didn’t occur to me that those two were related.  Programs that can’t handle such things as MathML do a disservice by resyndicating mangled or neutered content.  This brings up a number of interesting questions.  I’m going to take a stab at answering them, but in all honesty, this is a subject for interesting debate.

The first step is that the Feed Parser needs to be modified to return a flag for each entry indicating whether that entry has been sanitized.
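To make the idea of a per-entry flag concrete, here is a minimal sketch built on an allowlist sanitizer. It is not the Feed Parser’s actual code or API: `sanitize_entry`, `ALLOWED_TAGS` and the `sanitized` key are hypothetical names, and the allowlist is deliberately tiny.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
# Hypothetical allowlist for illustration; a real one would be much longer
# and would also vet attributes, which this sketch does not.
ALLOWED_TAGS = {f"{{{XHTML}}}{name}" for name in ("div", "p", "em", "strong", "a")}


def _strip_unknown(parent):
    """Remove disallowed descendants in place; return True if anything was dropped."""
    changed = False
    previous = None
    for child in list(parent):
        if child.tag in ALLOWED_TAGS:
            changed |= _strip_unknown(child)
            previous = child
            continue
        # Drop the element and its subtree, but keep the text that followed it.
        tail = child.tail or ""
        if previous is not None:
            previous.tail = (previous.tail or "") + tail
        else:
            parent.text = (parent.text or "") + tail
        parent.remove(child)
        changed = True
    return changed


def sanitize_entry(content_xml):
    """Return the cleaned content plus a flag saying whether it was altered.

    The root element itself is assumed to be an acceptable wrapper.
    """
    root = ET.fromstring(content_xml)
    changed = _strip_unknown(root)
    return {"content": ET.tostring(root, encoding="unicode"), "sanitized": changed}
```

A republisher could then consult `result["sanitized"]` when deciding how to label the entry it passes along.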


Since you can never really trust the producer (or republisher) to reliably sanitize feed entries, sanitization is always going to be, ultimately, the purview of the User-Agent.

I objected to having the MathML content of my entries stripped, without a new <atom:id> assigned. That’s an honest change in content, little different, really, than stripping out all the cuss-words.

While it’s true that there are not a lot of Atom clients that could tell the difference, such clients do exist. In any case, it’s a bad precedent for the future where, presumably, such clients will be more common.

And, anyway, you haven’t really sanitized a feed, until you’ve changed all instances of “hell” to “heck.”

Posted by Jacques Distler at

Policy decisions are another matter.

Disagree. The presence of a source element should be enough to tip off a consumer. Suppose Symantec started making a Planet product that sanitized content coming into a corporate environment. Should they mint new atom:id elements? I don’t think so.

Posted by Robert Sayre at

I’m shocked to say that I agree with Robert.  Sanitizing content in an entry does not change the identity of that entry.  Perhaps some other piece of metadata should be changed (like atom:updated and/or some extension element) and an atom:source should be inserted if it’s not already there, but changing atom:id? Definitely not.

Posted by James Snell at

“Sanitize,” in this context, means “change the <atom:content> in some essentially lossy way.” Depending on what your Policy is (which could include the removal of cuss words, or of mentions of Vi@gra), the <atom:content> of different instances of the same <atom:id> can no longer be considered “the same.”

If User-Agents are going to treat an <atom:entry> as read, if it has the same <atom:id>, how does the presence of an <atom:source> element help?

How is the user supposed to know that some entry marked as “read” is not the same as the entry he actually read, because the content of the latter had been "sanitized"?

Posted by Jacques Distler at

How is the user supposed to know that some entry marked as “read” is not the same as the entry he actually read

User-Agent policy issue:

RFC 4287, section 8.4

Posted by Robert Sayre at

Jacques: The presence of the atom:source allows the client a way of going back and locating the original, unmodified entry.  Polite intermediaries modifying an entry should likely use some form of extension element that indicates that some form of third-party modification has been performed.

Posted by James Snell at

Ah. OK. Sanitization as a spoofing attack.

That turns the question on its head.

What steps do actual UAs take to guard against spoofing attacks? And what “sanitization” Policies of republishers are “acceptable” and will not trigger that anti-spoofing code?

Either way, there’s the question: should republishers disclose their sanitization Policies? Should they flag individual entries as having been sanitized? And, if so, how?

Posted by Jacques Distler at

I agree with Robert and James: re-publishers MUST use source to provide some sort of traceability, but if they start changing IDs, you don’t have a prayer of ever achieving sync between these systems.

Sanitizing content in an entry does not change the identity of that entry

Policy is a tricky issue; I can think of several levels:
<ol>
<li>Fixing up broken HTML in partial content feeds (content produced without attempting to close tags)</li>
<li>Modifying content to produce safer RSS - e.g. removing or editing style tags or attributes</li>
<li>Outright content modification (hell becomes heck)</li>
</ol>
At first glance, I thought that markup is generally OK to modify without substantially changing the content, but your example shows that’s not the case: MathML is markup.  I’m not especially familiar with it, but it seems to me that stripping that markup would be removing important information, changing the meaning of the equation represented.

Posted by Gordon Weakliem at

At first glance, I thought that markup is generally OK to modify without substantially changing the content

At second glance, you wouldn’t think that.

Take an entry with several XHTML tables in it, and strip out all <table>, <tr> and <td> elements (because, I mean, who uses those anyway?). Not much left to the meaning of the original content, is there?

MathML is just a slightly more exotic version of the same problem.

Posted by Jacques Distler at

J. Distler: “Either way, there’s the question: should republishers disclose their sanitization Policies? Should they flag individual entries as having been sanitized? And, if so, how?”

Some form of extension element is likely the right approach.  What that extension would look like, I’m not sure.  Microsoft’s SSE or my thought-experiment around Atom revisions could likely provide interesting starting points.  Alternatively, or in addition, a “via” link would likely be appropriate to add to the entry.  In any case, the exact how is debatable.  What should not be debatable is whether or not atom:id ever changes.
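As a rough sketch of what a polite intermediary might do, the following keeps atom:id untouched, adds an atom:source if one is missing, adds a “via” link, and flags the entry with an extension element. The extension namespace `http://example.org/ns/sanitized`, its `modified` element, and the function name `mark_sanitized` are all invented here for illustration; no such extension is defined anywhere in this discussion.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Hypothetical extension namespace, made up for this sketch.
EXT = "http://example.org/ns/sanitized"


def mark_sanitized(entry, feed_id, feed_title, original_url):
    """Annotate a modified entry in place; atom:id is deliberately left alone."""
    if entry.find(f"{{{ATOM}}}source") is None:
        # Record where the entry came from, using the originating feed's metadata.
        source = ET.SubElement(entry, f"{{{ATOM}}}source")
        ET.SubElement(source, f"{{{ATOM}}}id").text = feed_id
        ET.SubElement(source, f"{{{ATOM}}}title").text = feed_title
    # Point back at the unmodified original.
    ET.SubElement(entry, f"{{{ATOM}}}link", rel="via", href=original_url)
    # Hypothetical flag telling consumers that a third party altered the content.
    ET.SubElement(entry, f"{{{EXT}}}modified").text = "true"
    return entry
```

What the flag should actually be called, and what it should carry, is exactly the part that remains open.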

Posted by James Snell at

Take an entry with several XHTML tables in it, and strip out all <table>, <tr> and <td> elements (because, I mean, who uses those anyway?). Not much left to the meaning of the original content, is there?

I remember the days when table was new, and not widely supported... and what you describe is exactly what happened.

I’ve spent some time looking at UFP, and the reason why MathML is currently stripped is simple: the only secure way to display HTML is to only allow a known subset of tags through.  MathML is not “known” to the UFP.  This is complicated by the fact that namespace processing is somewhat incomplete.

I believe that these problems are solvable.

Posted by Sam Ruby at

the only secure way to display HTML is to only allow a known subset of tags through.

Not an unexpected answer.

If it’s of any use, the MathML elements and attributes that I use are listed here.

That doesn’t solve the general problem (anyone know a safe subset of SVG?), but it is, perhaps, a start ...

Posted by Jacques Distler at

I’m also with Robert and James. Sanitization is only changing a representation of the resource (/creating a new one). The (conceptual) resource stays the same, hence should keep the same identifier.

Posted by Danny at

If it’s of any use, the MathML elements and attributes that I use are listed here.

That, indeed, is helpful.  What would also be helpful is if you can verify two things: (1) the outer element of any MathML sequence is always <math> (this is how I read the MathML specification), and (2) your constrained subset requires that all valid descendant elements of the <math> element be in the MathML namespace.  From the spec, I see xlink, OpenMath, and SVG counter-examples.  If those are not allowed for the moment, it would be easier for me.  Additionally, it would be much easier if I didn’t have to worry about elements in the XHTML namespace as descendants of the <math> element.

Posted by Sam Ruby at

(1) the outer element of any MathML sequence is always <math> (this is how I read the MathML specification)

Correct.

(2) your constrained subset requires that all valid descendant elements of the <math> element be in the MathML namespace.  From the spec, I see xlink, OpenMath, and SVG counter-examples.  If those are not allowed for the moment, it would be easier for me.

Currently, the only foreign namespaces that can occur are xlink:type, xlink:show and xlink:href attributes on the mrow element (for turning a mathematical expression into a hyperlink).
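The constraints discussed so far (outer element is `math`, every descendant is in the MathML namespace, and the only foreign-namespace attributes are the three xlink ones on `mrow`) can be checked mechanically. This is only a sketch of that structural check, with the hypothetical name `in_constrained_subset`; it does not vet individual element or attribute names against any list.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"
XLINK = "http://www.w3.org/1999/xlink"
XLINK_ATTRS = {f"{{{XLINK}}}{name}" for name in ("type", "show", "href")}


def in_constrained_subset(math):
    """Return True if the tree satisfies the structural constraints above."""
    if math.tag != f"{{{MATHML}}}math":
        return False
    for elem in math.iter():
        # Every element, the root included, must live in the MathML namespace.
        if not elem.tag.startswith(f"{{{MATHML}}}"):
            return False
        for attr in elem.attrib:
            if not attr.startswith("{"):
                continue  # attribute without a namespace; not checked here
            # Namespaced attributes: only the xlink trio, and only on mrow.
            if elem.tag != f"{{{MATHML}}}mrow" or attr not in XLINK_ATTRS:
                return False
    return True
```

Embedded SVG, XHTML or OpenMath descendants all fail this check, which matches the “not allowed for the moment” simplification.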

Embedding inline SVG would be an interesting project for the future, to deal with the fact that MathML is inadequate to express certain more-complicated mathematical formulae. (See this discussion of XYpic.sty and DCpic.sty)

I don’t have any use for Content-MathML (throw away half the Spec right there) or OpenMath (similar in goals to Content-MathML).

Posted by Jacques Distler at

Two, hopefully final, requests:

Can you explain what mprescripts/,none/,mroot means in this list?

Can you provide a live example of XLink being used in MathML?  A quick Google search did not turn up much.

Posted by Sam Ruby at

The trailing slash is MT’s notation for an empty element: <mprescripts />, <none />. Dunno what your question about mroot is. It’s an element with two children (<mroot> base index </mroot>).

Can you provide a live example of XLink being used in MathML?

Hmmm. Fascinating. You’ve found a bug in itex2MML. I’ve never actually used that feature, and itex2MML doesn’t implement it correctly. It’s designed to produce MathML fragments that look like

<mrow xlink:type="simple" xlink:show="replace" xlink:href="http://golem.ph.utexas.edu">
<mrow>
<mi>a</mi><mo>+</mo><mi>b</mi>
</mrow></mrow>

But the result ends up being ill-formed because itex2MML never actually declares the xlink namespace. I shall have to fix that.
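The effect is easy to reproduce with any namespace-aware parser. A small sketch using Python’s standard library (the helper name `parses` is just for this example): the fragment above is rejected as written and accepted once the xlink prefix is bound.

```python
import xml.etree.ElementTree as ET

# The fragment from above, with a slot for an optional namespace declaration.
FRAGMENT = (
    '<mrow {decl}xlink:type="simple" xlink:show="replace" '
    'xlink:href="http://golem.ph.utexas.edu">'
    "<mrow><mi>a</mi><mo>+</mo><mi>b</mi></mrow></mrow>"
)


def parses(xml_text):
    """Return True if a namespace-aware parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


undeclared = FRAGMENT.format(decl="")
declared = FRAGMENT.format(decl='xmlns:xlink="http://www.w3.org/1999/xlink" ')
```

Here `parses(undeclared)` is false because the `xlink` prefix is unbound, while `parses(declared)` is true. In a full document the declaration could equally sit on an ancestor such as `<math>`.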

Posted by Jacques Distler at

I’m testing a fix.

Posted by Sam Ruby at

Abdera, XPath and MathML

Sam is experimenting with making sure that PlanetPlanet can properly resyndicate Atom entries that contain MathML. Just to test things out, I ran the feed through Abdera’s parser and tried an XPath to select all of the contained MathML...
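The excerpt refers to Abdera’s (Java) parser, whose code is not shown here. As a loose analogue only, the same kind of query can be sketched with Python’s ElementTree and its limited XPath support; `find_mathml` and the prefix `m` are names chosen for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "m": "http://www.w3.org/1998/Math/MathML",
}


def find_mathml(feed_xml):
    """Return every MathML <math> element found anywhere in the feed."""
    root = ET.fromstring(feed_xml)
    return root.findall(".//m:math", NS)
```

If the resyndicated feed has had its MathML stripped, the returned list is simply empty, which makes this a quick check on whether a republisher passed the markup through.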

Excerpt from snellspace.com at
