It’s just data

JSON for Map/Reduce

James Snell: Abdera has always included the ability to serialize Atom entries to JSON. The mapping, however, was far from ideal. So I rewrote it. The new serialization is VERY verbose but covers extensions, provides better handling of XHTML content, etc. I ran my initial try by Sam Ruby, who offered some refinements, and I made some changes. The new output is demonstrated here (a JSON serialization of Sam Ruby’s blog feed). The formatting is very rough, which I’ll be working to fix up, but you should be able to get the basic idea.

Based on the comments, Patrick and Elias do not seem amused.  Guys, I’ve got a use case in mind, and I wonder if you wouldn’t mind helping me?

Imagine I have a database designed from the ground up for JSON.  One where incremental map/reduce jobs replace queries.  The data I plan to put in that database is from feeds: RSS 1.0, RSS 2.0, Atom 0.3, whatever; I don’t care.  With the components that go into Venus (UFP, HTML5LIB, and reconstitute) I can do a LOT of normalization.  Which is good, because I’d like to do all the normalization I can once, so that the subsequent map/reduce tasks can focus more on the problem they are trying to solve and less on the syntax.

The map/reduce jobs will typically be written in JavaScript.  By that I mean what you get when you apt-get install spidermonkey-bin and run from the command line, and not what you get when you run within Firefox.  If you like, other languages could be substituted, if a strong enough case could be made.

The set of potential microformats is unbounded.  I’d like to be ready to handle microformats that haven’t been invented yet.  But to provide some specifics to this use case, let’s consider hCalendar.  It contains dates and locations.

First, do all three of you agree that this is a reasonable use case?  If so, what would the ideal JSON format be for this case?  Remember, I’m willing to throw virtually unlimited resources at the one-time pre-normalization step in the hopes that such efforts can help shave microseconds off of subsequent map tasks.

Again, to keep this grounded, try to sketch out the code for the map task.  Input is a key and a single JSON document; output is an array of [[dtstart, dtend, location], key].  I don’t care if the locations were originally in summary, content, description, content:encoded, or even title elements; I simply want the data.
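
To make the target concrete, here is a minimal sketch of the shape I have in mind, assuming a hypothetical extractHCalendar() helper that does the actual digging; the point is the contract, not the extraction:

function map(key, doc) {
  // extractHCalendar() is hypothetical: it walks the normalized JSON
  // entry and returns [{dtstart, dtend, location}, ...] for each event.
  var events = extractHCalendar(doc);
  var results = [];
  for (var i = 0; i < events.length; i++) {
    var e = events[i];
    results.push([[e.dtstart, e.dtend, e.location], key]);
  }
  return results;
}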


I certainly think this is a worthy experiment.  It boils down to a question of when you boil the data out of your microformats.

Are we stretching microformats further than they were intended?  They work great while in markup, but if you’re going to convert the markup to data (for whatever reason), then I think the case for them becomes murky.  Why wouldn’t you convert the microformat markup to data as well?  I.e., at the outermost div or whatever for the mf’d content, add a new sub-element called hCalendar with an attribute or text content consisting of your JSON literal.

For Atom/RSS feeds, I think it would be nice to have a JSON representation of the feed and entry structures (not necessarily a machine-built translation; one that actually makes sense), where you could ‘attach’ extracted data to the feed itself, instead of forcing someone to poke around the actual data.  Again, in this case, a property named hCalendar with a value of [dtstart, dtend, location] (or whatever).

That’s what feels right to me.

On the other hand, I’ve thought for a while now that the XPath/XSLT of JSON is ... your favorite programming language, and so this is a bit of a real-life experiment to see how that works out.

Posted by Patrick Mueller at

Patrick, I’d like to play back what I think I hear you saying, just so that we are in sync.

If I knew a priori what data I would like to mine, I could extract just that data up front and either discard or store the rest as a blob.  The trouble is, I don’t know what data I am going to want to mine next.  Does that make sense?

Perhaps hCalendar was too ambitious an example.  A simpler example might do, then.  My mememes display on Planet Intertwingly simply looks for href attributes.  It doesn’t much care where it finds these attributes; it simply extracts them all and then reduces the result.
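
In rough form, the map side of that is nothing more than a walk over the JSON tree.  A sketch, assuming entries arrive as plain JSON objects:

// Collect every "href" value, wherever it appears in the tree.
function collectHrefs(node, hrefs) {
  hrefs = hrefs || [];
  if (node instanceof Array) {
    for (var i = 0; i < node.length; i++) {
      collectHrefs(node[i], hrefs);
    }
  } else if (node !== null && typeof node == "object") {
    for (var name in node) {
      if (name == "href") {
        hrefs.push(node[name]);
      } else {
        collectHrefs(node[name], hrefs);
      }
    }
  }
  return hrefs;
}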

Is this stretching href further than it was "intended"?  To be frank, I don’t care what the original intent of the data was, I just care that I can mine it easily.

I can do much of that today with Venus using the XML based Atom 1.0 format.  Would JSON make that easier or harder?

Posted by Sam Ruby at

Patrick, consider the following scenario.

Say I have been keeping a database of feed items for the past 5 years. Now I hear about a microformat that everyone else has been using, but which I was unaware of; or I discover that some interesting data can be mined from HTML by looking at a common semantic use (million-dollar markup style); or I just never used to care about hEvent data and now I suddenly do for whatever reason.

If I have been normalising the data, and I am interested in mining my historical dataset, then my normalisation must have been designed from the start to conserve enough information that its output would accommodate that sort of mining in the future. That also precludes parsing microformats in the normalisation step, because it would be necessary to re-run the normalisation on historical data in order to parse out currently unsupported microformats.

What I need is a mapping that conserves enough information so that microformats can be parsed out of the normalised output.

Posted by Aristotle Pagaltzis at

Since the normalized data is just going to be a hash similar to the ones produced by the UFP, one could supplement this hash by adding an “extractions” key at the top level that points to a hash of named, previously calculated extractions of microformat data.  Something like <code>{ "extractions": {"hcalendar": [["firststart", "firstend", "location"], ...]}}</code>.  If the extraction isn’t in the hash already, a function is called up to walk the hash, extract the data, store the extraction in the hash for later use, and pass on the extracted data to Map.
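
As a rough sketch (extractHCalendar here is hypothetical):

// Look for a previously stored extraction; compute and cache it otherwise.
function getExtraction(doc, name, extractor) {
  doc.extractions = doc.extractions || {};
  if (!(name in doc.extractions)) {
    doc.extractions[name] = extractor(doc);
  }
  return doc.extractions[name];
}

// e.g., inside a map task:
var events = getExtraction(doc, "hcalendar", extractHCalendar);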

Speaking of the UFP, why couldn’t the UFP Atom tests be used to help structure and check the output of Atom2Json or some similar library?

Posted by Jeff Hodges at

If the extraction isn’t in the hash already, a function is called up to walk the hash, extract the data, store the extraction in the hash for later use, and pass on the extracted data to Map.

Perhaps we have a nomenclature problem, but “extract the data” is what Map is all about.  And with databases like CouchDB, the result is a “view” which amounts to an index, stored separately from the original data.  And CouchDB will do maps and reduces incrementally, so changes in other documents that affect a given view won’t require map to be run again on unchanged documents.
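
For reference, a view function in that style might look something like this (a sketch, assuming an emit(key, value) interface and the same hypothetical extractHCalendar() as before):

function(doc) {
  var events = extractHCalendar(doc);  // hypothetical helper
  for (var i = 0; i < events.length; i++) {
    var e = events[i];
    emit([e.dtstart, e.dtend, e.location], doc._id);
  }
}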

So, again, I see canonicalization or normalization as something I have tools at my disposal to address.  A specific reduce job simply has to know about the output of a given map job, so that’s an easy problem.  The one remaining problem I see: what’s the best normalization that would make the authoring of map scripts easiest?

Posted by Sam Ruby at

Sam, I think your use case is perfectly fine, but I think we need to separate the issues here. The first is Atom as JSON and the other is HTML as JSON. CouchDB, or any other db that accepts BLOBs for that matter, cannot always be aware of all the possible formats in it. It should just store them. We could have inside JSON many things like XML, HTML, XHTML, SVG, RDFa, N3, Turtle, etc. I think supporting any given format is a problem for the map/reduce subsystem. I think we should always store the original content and, with time, add/improve the parsers to extract information from them.

Having said that, I’m more interested in the map/reduce code to efficiently pass tuples around until we get the desired results, whether it be microformats or RDFa. I think we need to extend spidermonkey to leverage code like UFP/html5lib so we don’t have to reinvent the wheel, but I have yet to see the need for “pre-parsing” to universal JSON before it goes into the db.

Posted by Elias Torres at

I have yet to see the need for “pre-parsing” to universal JSON before it goes into the db.

Hmmm.  Devil’s advocate time.  Perhaps by outlining the other extreme, we can profitably find the middle ground.

XML as a data format allows attributes to be quoted using either single quotes or double quotes.  Let’s standardize on quoting attributes using single quotes.  Now, XML allows double quotes in either attribute values or text to be expressed as &quot; so let’s do that too.  Consistently.  Finally, let’s take the whole Atom document and add a double quote at the beginning and a double quote at the end, and voilà, we have an instant JSON-compatible encoding of an Atom 1.0 document.  No information is lost in the process, and everything else is “over the top”; after all, “does it matter when the conversion actually happens?”
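
Concretely, the entire “encoding” amounts to a one-line JSON document (a contrived illustration, not a serious proposal):

"<feed xmlns='http://www.w3.org/2005/Atom'><title>An &quot;instant&quot; JSON feed</title></feed>"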

Posted by Sam Ruby at

I can see where you are taking me. I’m ok with your transformation (or XML-escaping-into-JSON), but I’m still trying to see what you gain by it? Or by turning XML into a valid JSON string? Or Atom syntactically into JSON? Or an AST representation of the DOM in JSON? All of which could be done when you are consuming the information. The first map call would use a specific parser to give you the tuples you want from the original content, and the rest of the reduce/map/reduce/* iterations can extract what you ultimately need for the view/result.

Posted by Elias Torres at

Elias: spend just a few minutes looking at CouchDB.  It is scalable.  It is distributed.  It has replication.  It is heading towards a true map/reduce (it isn’t there yet, but I’m confident it will be shortly).  Sharding is in the plans too.

Every document in CouchDB is a JSON object.  The input to map is JSON.  The output is JSON.  The default language for map functions is JavaScript.  If you follow this to its logical conclusion, what you see is a need to represent in JSON anything from which you would like to map or reduce.

You can take the position that no meaningful metadata ever comes out of content, or you can attack content with regular expressions, or you can build an HTML5Parser in JavaScript; or you can come up with a JSON representation of HTML.

Posted by Sam Ruby at

I’m waiting ’til a few more of the cool guys experiment before I write some code myself. I have gone as far as reading the tutorial, and now I can at least follow your programs written in Erlang. In regards to CouchDB, I’ll definitely have to dig in deeper. I have followed Damien since the early days of CouchDB, and having done a lot with Notes myself, I see the validity of his efforts, but with my recent work on databases, replication, distribution, scalability, and sharding, I’d need to see it to believe it.

<strike>You can take the position that no meaningful metadata ever comes out of content, or you can attack content with regular expressions,</strike> or <maybe>you can build an HTML5Parser in JavaScript</maybe>; <obviously-given-these-options-this-one-wins>or you can come up with a JSON representation of HTML</obviously-given-these-options-this-one-wins>, <new>is it possible to reuse existing libraries? or would that break Erlang’s distributed, scalable, multi-core power, etc?</new>

I’ll give my last rant: my point is that sooner or later something else will be shoved into a JSON property that needs parsing (text ical, text vcard, etc.). But if your point is that HTML should be a special case and we should handle it natively, just as RDBMSes are doing with XML today, I’ll buy in. But please let’s not serialize XHTML syntactically into JSON. Let’s use html5lib to normalize/sanitize the content and dump a proper DOM output in JSON so JS can effortlessly iterate over it.
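
Something like this, for instance: a JsonML-flavored sketch of <code><p class="note">Hi <a href="http://example.com/">there</a></p></code>, though I’m not claiming this is exactly what html5lib would emit:

["p", {"class": "note"},
  "Hi ",
  ["a", {"href": "http://example.com/"}, "there"]
]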

Posted by Elias Torres at

But please let’s not serialize XHTML syntactically into JSON. Let’s use html5lib to normalize/sanitize the content and dump a proper DOM output in JSON so JS can effortlessly iterate over it.

Can you tell me what the difference between these two options is?  An example of what a “proper DOM output in JSON” would look like would be sufficient.

Posted by Sam Ruby at


Wow, interesting situation. A couple of random coppers in the well:

“Every document in CouchDB is a JSON object.  The input to map is JSON.  The output is JSON.  The default language for map functions is JavaScript.  If you follow this to its logical conclusion, what you see is a need to represent in JSON anything from which you would like to map or reduce.”

Or: Atom is XML; Atom payloads are XML* - if you follow this to its logical conclusion, CouchDB needs to natively support XML (with JavaScript-style operational semantics) - or at least the DOM.

The process of extracting embedded data from the markup seems like something well suited to happening pretty locally, in a per-content block, in its own process, i.e. parallelise the parsing. Why not?

I could well be wrong, but the full XML(/DOM)2JSON sounds like a difficult approach to the kind of scenario you describe. Personally I’d probably approach it by treating the source payload opaquely in the Atom/JSON, whether or not the data extraction took place at Atom-to-JSON time or later, after the material was in the store.

Either way I’d start by looking at GRDDL, “XSLT-like transforms in Erlang” (Google) and RDF/JSON [link]

Posted by Danny at

I dunno. After looking closely at the format, my suggestions have already been addressed, except maybe for different handling of namespaces. You’d know better, but I’m not sure if html5lib handles namespaces the same way. Maybe treat the tag as <code>"svg:svg" : { "xmlns:prefix" : "http://..." }</code>, but I don’t think it really matters.
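
For instance, something like this hypothetical rendering (using the real SVG namespace URI):

{"svg:svg": {"xmlns:svg": "http://www.w3.org/2000/svg"}}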

My distaste for the format was in proposing it as a way for most browsers to consume feeds in JSON and replacing the default Abdera JSON writer. The current format violates DRY and it’s most likely unnecessary within a browser. My only remaining question is whether you think this format should always be the one used for Atom as JSON, or whether it was done simply for this CouchDB exercise?

Posted by Elias Torres at

BTW, Sam, could you add Jabber notifications to the email in the comment form when a new comment has been posted to an entry? pleeeeeeeeeeaase.

Posted by Elias Torres at

Maybe treat the tag as “svg:svg”

Agreed.

My distaste for the format was in proposing it as a way for most browsers to consume feeds in JSON and replacing the default Abdera JSON writer.

I’m guessing that James would prefer that there be one format.  James?

The current format violates DRY

Strongly agreed.  I’ve mentioned that offline to James on numerous occasions, but so far he’s ignored me.  :-)

and it’s most likely unnecessary within a browser.

Now, here’s an interesting question.  Some people have a preference for document.createElement; others see nothing wrong with the use of innerHTML.  Do you have an opinion on this matter?

The format that you reacted so negatively to initially is slanted towards the former.  This could also help if the client wants to filter or augment the markup.

My only remaining question is whether you think this format should always be the one used for Atom as JSON, or whether it was done simply for this CouchDB exercise?

Again, these are the DRY and multiple-formats questions.

BTW, Sam, could you add Jabber notifications to the email in the comment form when a new comment has been posted to an entry? pleeeeeeeeeeaase.

I already get Jabber notifications, what’s your problem?

Oh, YOU want to get notified?  :-)

Any suggestions for the UI?

Posted by Sam Ruby at

So to digress from the ostensible topic, Elias asked for “Jabber notifications to the email in the comment form”...

I’m all for yet more notification options and every system seems to expand until it can generate email or feeds. I thought that once you had feeds, you’d have all you need. I guess I’m missing the need for Jabber.

The current best practice in tracking one’s comments or attention data is to use co.mments or similar and subscribe to one’s conversation feed.

Or should one also ask for a universal twitter event broker service that also generates sms updates and such?

Posted by Koranteng Ofosu-Amaah at

I’m guessing that James would prefer that there be one format.  James?

Yep.

Strongly agreed.  I’ve mentioned that offline to James on numerous occasions, but so far he’s ignored me.  :-)

Bah! I haven’t ignored anyone.  I just haven’t made all the requested changes yet.  I’ve been waiting to see what direction this conversation was going to take.  As always, I am very open to suggestions.  I will say that examples of the kind of output you want to see will work a whole heck of a lot better than just complainin' about it :-)

Posted by James Snell at

The format that you reacted so negatively to initially is slanted towards the former.  This could also help if the client wants to filter or augment the markup.

Yes, in most cases I’d prefer having the HTML as a string, but what a use case you present here: when the client wants to sanitize the (X)HTML. I knew from the beginning that this would be an uneven discussion. Anyway, I’m trying to be microformatic and focus on the 80/20, at least for the most common representation. :)

Oh, YOU want to get notified?  :-)

But of course! Although, if you do, you’d force me to always comment on your entries. For starters, just another checkbox next to “remember info” and “live preview”: “keep me notified” or whatever. I think we could do a few things in any order: check the email for Jabber; default to email; lastly, check the URL for two things: if OpenID, could we have an attribute that specifies a preferred notification mechanism? Else, maybe a meta-link on my weblog indicating how to contact me. I bet the latter wouldn’t work at mass scale because of spambots abusing it. Oh well, just thinking out loud.

Koranteng: subscribing to a feed is so early century. Disclaimer: I don’t use Twitter, but I do use Jabber. :)

Posted by Elias Torres at

James: I will say that examples of the kind of output you want to see will work a whole heck of a lot better than just complainin' about it :-)

boo!

Elias: whine, whine, whine.

Posted by Elias Torres at

When will Google let us run our own Map/Reduce programs?

After reading about Sam Ruby’s issues with JSON for Map/Reduce, it got me thinking. How long before Google will let us run our own Map/Reduce programs on their clusters? We all know one of the best ways to scale is to push the operation to the...

Excerpt from Planet RDF at

Atom2Json, 2nd try

Based on feedback I’ve made some more tweaks to the atom-to-json serializer in Abdera. Here’s an example of the current output....

Excerpt from snellspace.com at

Compelling discussion here - reminds me of a thread on the XSPF mailing list about crafting a JSON transform. The tension in that thread was between a more verbose transformation (generic and reversible XML -> JSON) and one which took the known aspects of the XSPF schema as a prerequisite for simplifying the JSON.

We ended up using the latter, with the compromise that the one area of the XSPF spec which allows for arbitrary XML allows for arbitrary JSON (allowing rather than requiring a verbose XML -> JSON transform).

I know this isn’t quite a discussion about standardizing Atom in JSON, but should such a discussion arise, bear in mind that JSON’s principal virtue is simplicity, and that coders want to use it because it makes their lives easier. Perhaps providing a node in the JSONified Atom which allows for arbitrary JSON is just the ticket. In that node, Sam, you could store your DOM-style parse tree of the entry’s content, and your CouchDB jobs could reference it.

Thinking about it this way separates concerns: Atom should have the simplest JSON serialization that could possibly work, and it looks like that serialization will need a bucket where users can pour extra complexity (like parse trees and other application-specific data). Keeping the extra complexity out of the standard JSON elements lowers the barrier to entry and raises the readability of the rest of the format.
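
For example (the property names here are hypothetical, just to illustrate the bucket idea):

{
  "title": "Example entry",
  "content": "<p>Some <em>markup</em></p>",
  "extensions": {
    "content-tree": ["p", {}, "Some ", ["em", {}, "markup"]]
  }
}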

Posted by Chris Anderson at

ATOM over JSON

This was posted as a comment to Sam Ruby’s discussion on the merits of a verbose JSON serialization of ATOM entries’ HTML content to CouchDB-style map/reduce jobs. I found the direction a potential ATOM -> JSON serialization was...

Excerpt from Daytime Running Lights : Coding at


Playing with functional programming

I’m still hearing a colleague of mine saying: “all the human problems are solvable with an Object Oriented approach.” I didn’t argue with her for long. She maybe never had to think about other kinds of problems, problems that require breaking...

Excerpt from Yoan Blanc’s weblog at

CouchDB, XML, and E4X

On the upcoming E4X support in CouchDB....

Excerpt from about:cmlenz (Christopher Lenz) at
