A little over a month ago, I outlined how I would like to see the feed parser reorganized. I’ve now put a little meat on the bones, in the form of running code. Not just for the feed parser, but also for Planet. I also did it all in Ruby, so I named this little experiment Mars. Warning: this version is 0.0.1. It just barely runs end-to-end. Feed it real data, and it will choke on some of it. But it can now produce partial results.
Inventory:
<a>&a</a> and <a a='<'/> without error). In all, REXML isn’t too bad... as long as you don’t depend on it for serialization or deserialization or XPath, or expect quick turnaround on bug fixes or responses on their mailing list. In the event the chosen parser fails to parse the document, the HTML5lib liberal XML parser will be used, and a bozo flag will be set on the document itself.
There should be one—and preferably only one—obvious way to do it.
This module is clearly opinionated software in that it will transmogrify feeds which use less obvious constructs into more obvious ones.
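The strict-parse-then-fallback behavior described above can be sketched roughly like this. The method name, the return shape, and the use of a nil stand-in for the liberal parse result are my assumptions, not Mars’s actual API:

```ruby
require 'rexml/document'

# Rough sketch of strict parsing with a bozo flag. A real implementation
# would retry with a liberal parser (html5lib) instead of returning nil.
def parse_feed(source)
  [REXML::Document.new(source), false]
rescue REXML::ParseException
  [nil, true]  # bozo: the feed was not well-formed XML
end
```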
Also provided is a portion (wellformed/(atom10|rss)/*.xml) of the feedparser test suite. Check the comments to see what is not yet supported (mostly elements like cloud and textInput).
All in all, I’m pleased with how compact this code is. If anybody wants to join in on the fun, it is available as a bzr repository and there are plenty of test cases ready to be ported.
If you’re looking at rewriting the Planet code, I have a request for you to consider. I suspect this won’t be feasible, but it would be really nice if the aggregated feed from the Planet could have the absolute minimum of filtering and transmogrifying applied to it. Two reasons:
1. When you filter the feed, you butcher a lot of the content. Stuff which you may not want in your HTML view (for security reasons or whatever), is often very useful for people viewing the feed from a desktop feed reader (think embedded videos, and microformats).
2. I’ve recently started experimenting with duplicate detection across feeds (I’ve reconsidered my previous position on the subject). The problem I’m finding is that the feed from Planet Intertwingly often contains different content to the source feeds, and I flag any such changes in an effort to identify hack attempts. This becomes somewhat counter-productive (and annoying) when items are being flagged almost all the time.
I can go into more detail if this seems like something that could be addressed. If not, I understand.
absolute minimum of filtering and transmogrifying
A few questions:
A suggestion that, if it pans out, could possibly even be applied today: if there is a <source> element present, treat the entry as being of a lower “fidelity” than the original.
Bear in mind that I’m doing my own normalizing/transmogrifying. Encoding, relative URIs, even changes in feed format shouldn’t be much of an issue. And obviously a lot of that sort of thing has to be done for you to produce a unified feed. I’m more worried about changes to the actual text content in the feed. I’m not even comparing markup, so you would think this should work fairly well, but it doesn’t.
I haven’t looked at it in much detail yet (didn’t think there was much point if you can’t do anything about it), but one example I saw was when you converted an html atom:summary to plain text. Stripping the markup wasn’t a problem - it was the whitespace that became an issue. In HTML, whitespace is generally not significant, but in plain text it is. So your conversion appeared to me as a significant change in content.
I could probably deal with that particular issue myself with a more relaxed string comparison, but the underlying problem remains - if you’re changing the message content, then something is bound to break sooner or later.
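The whitespace issue is easy to reproduce. This toy Ruby comparison (illustrative only, not Planet’s actual code) shows why a byte-for-byte check flags the converted summary as changed:

```ruby
# The same summary before and after a typical HTML-to-text whitespace
# collapse; a strict string comparison now reports a content change.
html_summary  = "An example\n    summary with\n    wrapped lines"
plain_summary = html_summary.gsub(/\s+/, ' ')
changed       = (html_summary != plain_summary)  # true, despite equal "content"
```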
if there is a <source> element present, treat the entry as being of a lower “fidelity” than the original.
I’m already doing that. However, consider this scenario. The user refreshes a Planet feed and gets a bunch of new messages, which he immediately reads. He’s also subscribed to some of the feeds individually, and when those feeds refresh he gets a new set of messages, all of which he has already read. However, since the content now appears to be different, the app is obliged to inform the user (otherwise he won’t know that the messages he previously read may have been fake). That’s what I’m trying to avoid.
Also, a source element doesn’t solve my first problem (in the case of feeds that I’m only seeing via the Planet). I’d like to be able to view an embedded video in the Planet feed without having to subscribe to the source feed separately.
When you filter the feed, you butcher a lot of the content.
Perhaps we could dispense with the editorializing?
Bear in mind that I’m doing my own normalizing/transmogrifying.
I see. When we do it, it’s butchering. When you do it, it’s cute and fluffy like something out of a Calvin and Hobbes cartoon.
Perhaps we could dispense with the editorializing?
Yeah, that’s what I’m asking. Although I probably wouldn’t have said “editorializing”. Maybe "butchering"?
When we do it, it’s butchering. When you do it, it’s cute and fluffy like something out of a Calvin and Hobbes cartoon.
No, when I do it, it’s still butchering. But I’m not republishing the butchered data with an atom:id that implies it’s the same thing as the original.
rawr!
Tigers are great.
relative URIs ... shouldn’t be much of an issue. ... I’m more worried about changes to the actual text content in the feed.
What about relative URIs in the actual text content?
In HTML, whitespace is generally not significant, but in plain text it is.
Only if you want it to be. RFC 4287 § 3.1.1.1
Stripping the markup wasn’t a problem
Mars doesn’t intentionally strip valid markup. But the code is new and bound to be full of bugs.
But I’m not republishing the butchered data with an atom:id that implies it’s the same thing as the original.
I also note that you haven’t answered my original questions. There will be changes to the feed. For you to be able to do any duplicate detection, the original id will need to be propagated — either as the id (per the spec) or in an extension (as Google Reader does).
The codebase is young and easily refactored. If you have specific proposals, I’ll be glad to work through them with you.
What about relative URIs in the actual text content?
By text content, I meant text and only text (not markup). You’re presumably not making changes to bits of plain text that look like they might be relative URIs.
Only if you want it to be. RFC 4287 § 3.1.1.1
In HTML, whitespace is generally not significant, but in plain text it is.
I realise that. I don’t have a problem with you choosing to ignore whitespace when you display it. However, you’re republishing the content with changes that make it impossible for me to make that choice.
Mars doesn’t intentionally strip valid markup.
I was referring to what I saw in Planet Intertwingly (which is presumably still Venus), namely the conversion of an html atom:summary into a plain text atom:summary. Obviously that would result in markup being stripped if there was any, but I can’t find any examples now of summaries containing markup (that couldn’t at least be converted) so maybe that never happens. Either way, I didn’t have a problem with that.
I also note that you haven’t answered my original questions.
Looking back I think I answered all your questions, except “What if the original feed isn’t well formed?”. The answer is the same for all of them. It doesn’t matter. I understand that you need to make certain syntactic changes to the source data in order to produce a unified, valid atom feed as output. That’s not a problem. It’s semantic changes that bother me.
Removing a video from someone’s message is not the same as converting their encoding from UTF-16 to UTF-8. When it comes to correcting well-formedness errors or resolving relative URIs in RSS, the issue becomes less clear cut because the semantics weren’t clear to begin with, but those are edge cases. And with any luck, we’ll probably agree on those semantic interpretations anyway.
There will be changes to the feed.
Understandable. As I said above: syntactic ok; semantic not so ok.
If you have specific proposals, I’ll be glad to work through them with you.
Well here’s one idea. Can you not pull out your filtering/whitelisting code into a separate module? Then rather than applying it to the feed content before writing to the cache, let the cache keep the unfiltered content, and only apply the whitelist module when generating the HTML for your web view. That way the feed can still be generated with the unfiltered content.
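A minimal sketch of that split, with hypothetical names (Mars and Venus organize their filtering differently, and a production sanitizer must also drop script bodies and dangerous attributes — this is only an illustration of the proposed separation):

```ruby
# Toy whitelist filter, applied only at HTML-generation time so that the
# cached feed content can stay unfiltered. Tag list and names are invented.
module Whitelist
  ALLOWED = %w(p a em strong ul ol li blockquote pre code)

  def self.filter(html)
    html.gsub(/<\/?([a-z0-9]+)[^>]*>/i) do |tag|
      ALLOWED.include?($1.downcase) ? tag : ''
    end
  end
end
```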
Can you not pull out your filtering/whitelisting code into a separate module?
The purpose of this particular refactoring is to enable exactly that sort of experimentation. Part of the experimentation will be to enable publishers of planets to set policies. I, for example, don’t like the thought of my planet being used as a vector to distribute a script attack. I also see value in videos, so if I can find a policy I am comfortable with, such will start showing up in the html output as well as the feed.
When it comes to correcting well-formedness errors ... in RSS, the issue becomes less clear cut
There is no spec governing the correction of non-well-formed Atom feeds either.
not the same as converting their encoding from UTF-16 to UTF-8
Unicode Normalization Form C or Unicode Normalization Form KC?
Part of the experimentation will be to enable publishers of planets to set policies. I, for example, don’t like the thought of my planet being used as a vector to distribute a script attack.
I figured you’d say that. :)
Now consider someone of a religious persuasion who doesn’t want their planet feed being used to distribute foul, blasphemous language. As a result, they apply a filter to all feed content that automatically removes any swearing or references to Richard Dawkins. At what point do you consider the content in such a feed to be a derivative work?
And if it is a derivative work, should the atom:ids not be different from the ids identifying the original work? I’m not really convinced either way.
I suspect there are legal implications too, but that’s not of much interest to me.
Mark:
There is no ... well-formed Atom ...
Amazing how you can twist what someone says when you quote selectively.
Unicode Normalization Form C or Unicode Normalization Form KC?
not the same as converting their encoding from UTF-16 to UTF-8
I wasn’t aware that Unicode Normalization was required when converting from UTF-16 to UTF-8. Oh wait, it isn’t. I choose “none of the above”.
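That point is easy to verify in Ruby (UTF-16LE is chosen here to sidestep BOM handling; this is just an illustration):

```ruby
# Re-encoding is purely syntactic: the characters survive a UTF-16 round
# trip untouched, with no Unicode normalization involved.
original = "caf\u00E9 and na\u00EFve"
round    = original.encode('UTF-16LE').encode('UTF-8')
round == original  # true
```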
I flag any such changes in an effort to identify hack attempts.
A modest proposal: When attempting to identify “hack” attempts, assuming you already have both pieces of content, don’t do so with a simple string-based comparison. Tokenize the content first, resolve any URIs, and then compare the two token sets. Intentionally make the tokenization step lossy, but avoid any lossiness that might be obviously exploitable.
And if you don’t have both pieces of content, maybe you should?
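One way the tokenize-and-compare idea could look in Ruby. The specific lossy choices (crude tag stripping, word-only tokens, downcasing) are illustrative assumptions, and the URI-resolution step is omitted:

```ruby
# Lossy tokenization: strip tags, downcase, keep only word-ish tokens.
def tokens(content)
  text = content.downcase.gsub(/<[^>]*>/, ' ')  # crude tag strip
  text.scan(/[a-z0-9]+/)                        # deliberately lossy
end

# Two pieces of content "match" if their token streams match.
def same_content?(a, b)
  tokens(a) == tokens(b)
end
```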
Is this meant to continue as an experiment or does this effort mean you will stop work on Venus and UFP?
I honestly don’t know. I don’t have any current plans to stop work on Venus and UFP. Meanwhile, I am continuing to commit functionality to Mars, and have set up a temporary, parallel planet running the absolute latest.
The biggest factor to me is the size of the development community that each code base attracts.
When it comes to correcting well-formedness errors or resolving relative URIs in RSS, the issue becomes less clear cut because the semantics weren’t clear to begin with, but those are edge cases.
Happy? It doesn’t change the fact that you’re on your own with non-well-formed feeds of any format. Unlike HTML5, the Atom Working Group chose not to deal with the issue of error correction. But of course you knew that already.
I wasn’t aware that Unicode Normalization was required when converting from UTF-16 to UTF-8.
This thread began because you were complaining about the problems you were having trying to compare strings for equality. I naively assumed that you gave a shit about whether they were, you know, equal. Just out of curiosity, how DO you plan to compare strings, once Sam is done bending the world to your whim?
This thread began because you were complaining about the problems you were having trying to compare strings for equality.
The things that get you folks fired up.
But Mark is right, if you actually want to make a big deal out of “hack” attempts, normalization has to be done. If it’s even a possibility that a legitimate republisher of content has normalized the content (which it is), then you have to take that into account. So one of the transformations you’d make during tokenization would probably be to convert to UTF-8 and normalize. I’d probably use the same normalization steps done for IDN, because lossiness is desirable here.
That said, I think this is an incredible waste of time. If “hack” attempts actually matter to you, I’d probably go with a more low-tech solution. Again, assuming you already have both sets of content, just create a “tabbed” interface that lets you select either content source for display. This should be trivial for both web and desktop applications. Diff as necessary.
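Mark’s NFC/NFKC point, in concrete terms: a republisher that normalizes content changes the codepoints but not the text, so a raw string comparison misfires. A quick illustration with Ruby’s stdlib unicode_normalize:

```ruby
# The same word with a precomposed é versus e + combining acute accent:
# unequal as raw strings, equal after NFC normalization.
a = "caf\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
b = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
raw_equal  = (a == b)                                                # false
norm_equal = a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)  # true
```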
Where is the bzr repository? bzr branch [link] doesn’t seem to work (sorry, I’m new to bzr).
Also, rfeedparser seems to be a pretty good project. Why did you choose to roll your own? Just curious what rfeedparser’s shortcomings were.
Lovely... now my email address is harvestable. Good thing I tagged it.
Sam, how about putting a warning on the E-mail box? Something like, “WARNING: if you do not supply a URI, I will display your address publicly!”
(Sorry, I would have sent this privately if I could have easily found your contact info...)
Thank you all for your suggestions, but you’re trying to solve problems that I don’t have. And for now I’ve decided to pull this feature anyway.
My initial request still stands: it would be nice if the feed from Planet Intertwingly returned the source content unfiltered. However, I can accept that that’s not likely to happen. I guess I always have the option of subscribing to the individual feeds myself.
bzr branch http://intertwingly.net/code/mars/ doesn’t seem to work
Try bzr get. More info here.
rfeedparser seems to be a pretty good project. Why did you choose to roll your own?
It shares a design approach with feedparser. Namely that it provides the sanitization, with no access to the original. It “butchers” extensions. It converts everything to a Hash, which (with Venus’s design) needs to be converted back to an XML document, and then (if you use HTMLTmpl) back to a Hash. Both use sgmllib (or equivalent) instead of html5lib.
I honestly don’t know how far this experiment will take me, but so far it looks promising.
my email address is harvestable
Removed. Note: that field is optional.
Hi Sam. So far I’ve been enjoying playing with your code. Two questions: what license are you releasing it under? Some variation of the Python license, like the old Planet code?
Also, in config.rb:
next if line.split(nil,2).first.downcase and 'rR'.include?(line[0])
What is the intent of this line? afaict, the first expression always returns true and the second one will return true if the line begins with ‘r’ or ‘R’. What do you have against config options that begin with “r” and don’t have any leading whitespace? :)
Thanks!
what license are you releasing it under?
Also, in config.rb:
next if line.split(nil,2).first.downcase and 'rR'.include?(line[0])
That’s a bug. You can find the original Python in ConfigParser.py,
if line.split(None, 1)[0].lower() == 'rem' and line[0] in "rR":
If you haven’t done so recently, bzr pull the latest, as there have been a number of fixes. If you are in a position to do so, please publish any changes you may make in a bzr repository so that I can pull them from you.
I’ve settled on the following as a fix to the config problem cited above:
next if line =~ /^rem(\s|$)/i
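For the record, the behavior that regex gives (a quick illustration, not part of Mars):

```ruby
# Skip "rem" comment lines, case-insensitively, without also skipping
# options that merely begin with the letters r-e-m.
rem_comment = lambda { |line| !!(line =~ /^rem(\s|$)/i) }
```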
The biggest factor to me is the size of the development community that each code base attracts.
Where would said community gather? I’ve got some ideas for contributions--is there a mailing list, or are the comments on this entry the best place to discuss it?
I coded up a patch that adds support for importing subscription lists from OPML files with REXML and provides sane default config values: [link]. (This is my first bzr use, so if there’s a better way to submit contributions I’d love to hear it.)
I’m a little curious as to why you recommend running ruby with the -rubygems switch every time rather than writing out a simple “require 'rubygems'” once in the actual code. Seems like one more thing to forget.
Would be great if it could use OPML or YAML by default as the INI format feels quite foreign to Ruby. Right now my patch just uses OPML if the config filename contains “opml”, otherwise it runs it through the backwards-compatible parser. Subclassing ConfigParser from PythonConfigParser seems a little odd though.
This is my first bzr use, so if there’s a better way to submit contributions I’d love to hear it.
Simply scp/rsync/ftp your entire directory structure up to technomancy.us, and let me know the URI I can use to get to it. That’s it. Nothing to install. I can then bzr pull from you, and vice versa. Or, for more complicated situations, bzr merge followed by bzr commit.
Would be great if it could use OPML or YAML by default as the INI format feels quite foreign to Ruby.
Supporting YAML would be cool. And, obviously, quite easy.
Right now my patch just uses OPML if the config filename contains “opml”, otherwise it runs it through the backwards-compatible parser.
We could change it to
begin
  send "read_#{filename.split('.').last}", filename
rescue NoMethodError
  STDERR.puts "Unsupported file format"
  exit
end
Subclassing ConfigParser from PythonConfigParser seems a little odd though.
Feel free to rename the class.
Sam,
Do you have any opinion about the best order for porting the test cases? I have some spare cycles to work on them. Finish the feedparser cases first? Input through to output? Middle out? Any guidance you have would be appreciated.
Do you have any opinion about the best order for porting the test cases?
At the moment, IMHO, the most important piece of missing functionality is support for a second templating system. XSLT is sufficient for my needs and for proof of concept, but I imagine that most would prefer another templating system. I’d like for one of the templating systems supported to be htmltmpl compatible, but that doesn’t have to be the next one.
bzr seems easier to publish than git
The procedure is exactly the same in git; you need to do more only if you want more efficient synch than via HTTP. And I’ve been told that this works for Mercurial also.
Hi Sam. You say in the readme, “REXML version 3.1.6 won’t do.” Well, apparently neither will 3.1.7.1:
2) Failure:
test_102(XmlParserTestCase) [./test/xmlparser.rb:10]:
<nil> is not true.
svn, as you mention, works just fine.
Hi Sam,
I have a dot release (0.4) available for processing haml templates in Mars. Haml is described at [link]. The Bazaar repository is available at [link]. While the code still needs work, the template (see index.html.haml) produces a very good facsimile of the Mars version of Planet Intertwingly.
The processing engine relies on harvest.rb for loading the template environment. Template variables are pure Ruby and have the same names as in Planet Venus (see docs/templates.html).
I ported the tmpl filter test cases from Venus. Haml passes 40 of the 51 cases. Many of the 11 that fail seem like they are related to encoding issues in harvest.
So... 1) Are the tmpl filter test cases appropriate given the difference between harvest and feedparser, 2) if so, any guidance about what needs changing--if anything--in harvest, and 3) because htmltmpl is in Python, how do you define htmltmpl compatibility?
Finally, thanks for sharing the fun.
Hi Sam. Here’s a bug fix: [github]
Currently, if a feed is ill-formed, and you’re using libxml, it will be skipped rather than parsed with html5. This patch fixes it:
diff --git a/planet/xmlparser.rb b/planet/xmlparser.rb
index b7d3093..729e287 100644
--- a/planet/xmlparser.rb
+++ b/planet/xmlparser.rb
@@ -31,7 +31,7 @@ module Planet
       doc = REXML::Document.new source
     end
     bozo = false
-  rescue
+  rescue Exception => e
     # If everything is being bozo'd, enable this to see why.
     # print "PARSE ERROR: #{$!}\n  #{$!.backtrace.join("\n  ")}\n"
Good catch, though I’d prefer something more along the lines of
if node.elements.size == 0 && node.text == nil
  node.text = '' unless HTML5::VOID_ELEMENTS.include? node.name
end
It is not so much the retaining of the element that matters to me, but letting the HTML5 packages maintain the list of element names.
And more important to me is the test cases. If you don’t get around to it, I’ll try to write up a test case each for these patches.
planet.intertwingly.net is produced using Venus. Your feed has:
<updated>2008-07-10T22:30:05Z</updated>
Which, per the spec, indicates the most recent instant in time when an entry or feed was modified in a way the publisher considers significant.
Your planet is subscribed to your RSS 2.0 feed, where the correct behavior in this matter is undefined.
Right. So the difference is that I’m subscribed to the rss and you’re subscribed to the atom. Thanks for taking the time to point that out.
It seems to me that for a Wordpress user, “a way the publisher considers significant” means “always”. I guess I could patch my Wordpress to make it let me control that at edit time.
You’re a star, thanks again. Shouldn’t you be asleep over there at this time? And how did you (according to my display) post your reply four minutes before my question?
a) Posted by Ciaran at 07:49:17
b) Posted by Sam Ruby at 07:45
And, why is that particular time the only one out of all the ones on this page that doesn’t have seconds? No need to answer any of this - I’ll find the answers to my own stupid questions this time.
Shouldn’t you be asleep over there at this time?
I tend to be an early riser. This morning, more so than most, apparently.
why is that particular time the only one out of all the ones on this page that doesn’t have seconds?
I’ve seen that before, intermittently; but have not been able to track it down. Dates in the page are initially in GMT, and are converted on the client side to local time.
Ok, after much cursing I see what’s happening. When you’re iterating through the ‘times’ array in localizeDates() you sometimes adjust the headers to correspond to the local time. When this happens, it causes the array to change - in the case above, the header adjustment happens after my comment “Right. So the difference....”, and after the adjustment, an element has been removed from the array. The loop iterator, i, however, is unchanged, so the net result is that your comment starting “Idea 272...” gets skipped and the loop carries on from the following comment.
Hope that makes sense.
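The failure mode described above (the page code is JavaScript iterating a live NodeList, but the pattern is language-neutral) can be reproduced in Ruby terms:

```ruby
# Removing elements from a collection while iterating it by index skips
# the element that shifts into the freed slot.
items = %w(a b c d)
seen  = []
i = 0
while i < items.length
  seen << items[i]
  items.delete_at(i) if items[i] == 'b'  # removal shifts 'c' into slot i...
  i += 1                                 # ...so 'c' is never visited
end
# seen is ["a", "b", "d"]; iterating over a copy (items.dup) avoids this
```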
OK, I added two lines which will copy the live NodeList to a static Array before iterating over it. You may need to refresh before you see the effect.
Thanks!