Danny Ayers: data integration on the web (and elsewhere) offers a continuous stream of problems like this, across diverse domains, formats and services. In other words, the web of data is the Whole Problem.
I have some data. It is in a format readily amenable to XSLT transformations. If you know of, or can readily create, an XSLT transformation which takes Atom 1.0 as input and produces RDF (either RDF/XML or N3, it matters not to me) as output, the “RDF Tax” for me is essentially zero.
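The mapping needn't even be XSLT to be cheap. As a rough sketch of the idea - a toy Atom entry, a hypothetical Dublin Core mapping covering only a couple of elements, nothing like a full Atom-to-RDF transform - here is the shape of it in plain Python, emitting N-Triples:

```python
# Sketch only: map a toy Atom 1.0 entry to RDF N-Triples using
# Dublin Core properties. The entry XML, the subject URI and the
# element-to-property mapping are invented for illustration.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "http://purl.org/dc/elements/1.1/"

entry_xml = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>RDF Tax</title>
  <id>http://example.org/post/1</id>
  <updated>2006-02-20T12:00:00Z</updated>
</entry>
"""

def entry_to_ntriples(xml_text):
    entry = ET.fromstring(xml_text)
    # Use atom:id as the subject URI for the entry's triples.
    subject = entry.findtext(ATOM + "id").strip()
    mapping = {"title": DC + "title", "updated": DC + "date"}
    triples = []
    for atom_name, prop in mapping.items():
        value = entry.findtext(ATOM + atom_name)
        if value is not None:
            triples.append('<%s> <%s> "%s" .' % (subject, prop, value.strip()))
    return "\n".join(triples)

print(entry_to_ntriples(entry_xml))
```

A real transform would handle links, authors, content and escaping, but the point stands: for data already in clean XML, the conversion cost is small.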
So, tell me what you want, tell me what you plan to do with it, and I’ll do my part.
Meanwhile, I have a request. Would it be possible to add something to the Planet RDF feed which is the equivalent of atom:source/atom:title, so I can correctly attribute posts to the original owner in my automated excerpts?
[Comment on Sam Ruby post RDF Tax; his comments choked] If the domain of your data is already covered by Atom, then pretty soon you won’t even need to worry about the XSLT - just publish Atom and let GRDDL look after the conversion for the consumer...

For someone outside RDF/semweb work, this is perhaps a fine-grained distinction to be dwelling on, but I find people too easily gloss over the differences between RDF, N3 and the various OWL languages, especially when these are mixed and matched and serialised and shipped over HTTP. If you take the notion of safe operations, mix in content negotiation/sniffing, and place it one or two layers up the stack - where we are arguing about logical conclusions instead of character encoding - you get a sense of the mire. I have real doubts about whether semantic web languages layer properly on top of web infrastructure.
Bill, I don’t have the foggiest idea what you’re talking about, but I’ve never known you to be an idiot, so...
Why would the content type of an OWL document change its meaning? Are you saying N3 is not 1-1 mappable to RDF/XML? How is RDF-over-HTTP different from RDF/XML-on-my-hard-drive or RDF-triples-in-memory-with-rdflib? (These are not related questions, I’m just throwing them out there.)
I haven’t thought about RDF in quite a while, and apparently the big news while I was gone is that people like Danny have figured out that they’re going to have to pay the RDF tax themselves, and they’ve cooked up some framework called GRDDL to map everything in the world to RDF. (I may have botched the details a bit.) Is that what he’s talking about here, and if not, why not? Is that what you’re talking about?
Bill, I think Sam meant Turtle, a subset of the N3 syntax that is a direct serialization of an RDF graph.
Mark, Bill’s talking about problems regarding ambiguous or incorrect authoritative metadata. That said, my understanding is that N3 is a strict superset of the RDF model, so I’m not sure what the issue would be in this case.
GRDDL is a way to connect an XSL transformation with an arbitrary XML document, either through an explicit declaration in the document, a declaration in the “namespace document”, or with an XHTML profile. It won’t provide a standard way for the RDF community to associate transformations with various XML formats, but will give content authors and the creators of XML languages a way to associate a mapping, if they so choose. I’m sure Danny can explain better.
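The in-document case is just an attribute on the root element. A minimal sketch of a consumer finding it - the feed and stylesheet URL are invented, the attribute name and namespace are per the GRDDL spec:

```python
# Sketch: locate an in-document GRDDL declaration, i.e. the
# grddl:transformation attribute on the root element, which points at
# one or more XSLT stylesheets. Document and stylesheet URL invented.
import xml.etree.ElementTree as ET

GRDDL_ATTR = "{http://www.w3.org/2003/g/data-view#}transformation"

doc = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:grddl="http://www.w3.org/2003/g/data-view#"
      grddl:transformation="http://example.org/xslt/atom2rdf.xsl">
  <title>Example</title>
</feed>
"""

def grddl_transformations(xml_text):
    root = ET.fromstring(xml_text)
    # The attribute value is a space-separated list of stylesheet URIs.
    return root.get(GRDDL_ATTR, "").split()

print(grddl_transformations(doc))
```

A GRDDL-aware consumer would fetch each listed stylesheet, apply it to the document, and treat the output as RDF.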
You give me too much credit. But I will note that I said RDF/XML, not RDF. My understanding is that both N3 and RDF/XML are serializations of RDF.
I suspect that the biggest problem isn’t formats, or even semantics, but logistics. This data is currently stored in over 11,000 files on my hard drive, and hourly rsync’ed to my hosting provider. Converting this data to another format is likely the least of the issues; having every query fetch, transform, and collate each of these files is clearly a non-starter.
N3 is Tim Berners-Lee’s noodling language; it includes RDF but adds a whole lot more (formulas and stuff). Turtle follows the same syntax style but sticks to RDF graphs, so it’s effectively equivalent to RDF/XML. It’s also used in SPARQL as the basis for graph patterns. (There’s also the NTriples serialisation, a subset of Turtle which covers RDF in a canonical kind of style; it’s used in some of the W3C tests.)
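To make the family resemblance concrete, here is one statement (invented URIs) in Turtle, with its flattened N-Triples form alongside:

```turtle
# Turtle: prefixes and abbreviated terms
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://example.org/post/1> dc:title "RDF Tax" .

# The same triple in N-Triples: no prefixes, one full triple per line
# <http://example.org/post/1> <http://purl.org/dc/elements/1.1/title> "RDF Tax" .
```

Every N-Triples document is valid Turtle, and every Turtle document describes an RDF graph; it’s N3 proper that goes beyond the graph model.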
Bill asks a good question in relation to different treatment of RDF/RDFS/OWL/N3 etc, which each have different levels of expressivity and inference. RDF is the simple graph model, RDFS adds things like subclassing capabilities, OWL adds things like cardinality constraints (I suspect there are only a couple of people who understand what N3 adds). But at least as far as the official specs go, the interpretation angle is covered, in principle at least. The languages are layered, logically they’re monotonic (new knowledge can’t break old knowledge) and follow the open world assumption (unknown things are unknown, not assumed to be false as in SQL). So although an OWL-capable consumer may be able to infer more than an RDF-only consumer, the inferences won’t contradict any of the RDF-only statements in the RDF interpretation. It’s not unlike a mustIgnore rule in XML, moved a little along the line. It may be desirable to track provenance, particularly where material is/has been aggregated, processed and republished, pretty much as in the case of Atom.
Things do get stickier when you bring in the mime types, but still the same general principles apply as with other web languages - it’s up to the consumer what interpretation is applied, based on the chain of authority, the document contents and local capabilities/requirements. And caveat lector.
Justin’s right in strict terms, but I don’t see a problem with Mark framing it as RDF folks “paying the tax” themselves. Although people have been saying for a long time that the model is the important part of RDF, clearly syntax isn’t insignificant. There has been recognition within the RDF community that people are reluctant to publish RDF/XML, when as far as most publishers are concerned the net benefits aren’t clear. So moving on...
If you’ve got unambiguous data published as clean XML like Atom or microformats, and a mapping to RDF has been determined, then it’s possible to treat the plain XML serialisation as Yet Another RDF syntax, one for the specific domain. GRDDL is one way of automating the interpretation of the syntax into the RDF model using standard XML tools; parsers like Raptor can read such stuff directly (and even RSS tag soup).
Logistics is definitely an issue, and I guess it’s just down to cost/benefit for each case. I’m in the (slow, ongoing) process of reorganising my own data, material that’s persisted in various different forms - the blog’s a native RDF store, I’ve got stuff in an RDB, quite a few static HTML, RDF/XML, Turtle & text files. Because I want a rich, consistent query/presentation interface across the lot, my easiest option is to integrate as RDF but then expose different views as appropriate - HTML, Atom, RDF/XML & SPARQL (incidentally the latest ARQ SPARQL engine has Lucene built in, which should be a boon given the doc nature of most of the data).
“Are you saying N3 is not 1-1 mappable to RDF/XML?”
Yes. N3 is a ‘bigger’ (more expressively powerful) language. Now, the formal definition of N3 was embedded in Python source code last time I looked, but iirc, it had cool things like negation and existential quantifiers, which RDF doesn’t. Such things allow a semweb reasoner (or a fancy-dan rules engine) like cwm to draw a different range of conclusions. RDF is brain dead compared to OWL and other description logics used in certain industry settings today.
The problem is whether I want some downstream consumer to be applying a more powerful reasoner than I had in mind - most people are not going to be good at thinking through the logical consequences of their metadata. I’m pretty sure this sort of thing is why the Atom WG explicitly specified that atom:rights wasn’t for machine 2 machine processing.
“How is RDF-over-HTTP different from RDF/XML-on-my-hard-drive or RDF-triples-in-memory-with-rdflib?”
If HTTP headers are authoritative, then if I serve RDF/XML as application/xml, as best I can tell, you have no business treating it as RDF. If I serve OWL as application/rdf+xml you have no business treating it as OWL. I can short-circuit your systems, and can’t technically be held responsible for the conclusions you drew by treating that data as language X instead of language Y when you ran it through a semantic web rules engine.
But as you’ve documented as well as anyone, no-one really pays any attention to conneg and authoritative HTTP headers. And that’s fine when most of what we’re doing is presentational in nature. Yet, I have no idea how any of this is meant to work when you start mixing and matching formal languages in a single document and running the data through rules engines instead of browsers and planets. It’s complicated enough just figuring out unicode.
Maybe it won’t be a problem, and maybe the W3C is already planning for success. Nothing here is new - databases that use SSNs as primary keys, credit rating checkers, and spam engines draw bogus inferences every second of every day. The semweb will simply help to make that seamless ;)
“if I serve RDF/XML as application/xml, as best I can tell, you have no business treating it as RDF” - I’ve argued that myself in the past, but what is a consumer expected to do with anything served as application/xml? Ok, there are omnivores like XSLT, but otherwise for the doc to be useful you have to go sniff. I think this points to strongly favouring more specific types. (I may be off-base on this, but really don’t have the will to re-read RFC 3023).
There has been a lot more attention given to media types & conneg around RDF since the TAG finding on httpRange-14 and the appearance of GRDDL. I’ve heard convincing pragmatic arguments that HTTP authority is more trouble than it’s worth, but in this context at least it does seem to be useful.
I can’t think of any situation where you could short-circuit my systems. With RDF/S & OWL, the worst case is an OWL DL reasoner finding the data inconsistent - something that has to be factored into the design of any system using an OWL DL reasoner.
Beyond that, arbitrary rules are no different than any current arbitrary processing. If I decide two HTML documents are similar through word matching, you can’t technically be held responsible if I got it wrong.
Re. your last paragraph - re. bogus inferences, I don’t disagree. But I do reckon in the near term on the web the big gains are to be had through RDF as a fairly dumb data language, with minimal inference. As if MySQL were made distributed with URIs as keys.
On the slightly apocalyptic angle you imply, the nearest parallels I can think of in recent years are the Y2K meltdown and the net gridlock caused by everyone doing RSS polling. The web proved to be pretty resilient in both cases. Fingers crossed it’ll be ok with a bit of service interwiring.
Services based on SemWeb languages do have a better chance than usual of behaving predictably, given their formal logic base. But to prove such stuff, we’ll probably need to wire together RDF/OWL docs and reasoners over the web ;-)
“but what is a consumer expected to do with anything served as application/xml?”
Hi Danny. The answer is not much. But I’ll claim there’s a qualitative difference between introspecting an RSS 1.0 file to render it in FeedDemon, and introspecting an RSS 1.0 file to see what ontologies and facts are present so it can be run through a rules engine.
“I can’t think of any situation where you could shortcircuit my systems”
What I meant was short-circuiting via the publisher’s intent. If I do serve OWL under an RDF/XML media type, why are you sniffing instead of treating it as I asked? Is content sniffing suddenly ok when it comes to semweb activity? Which comes back to my original point - the layering here seems to be off. Microformats, by the way, present exactly the same problems, but uF are less likely to be run through a rules engine.
I think I can hijack this post a second time (sorry Sam). Here are 2 datapoints and an opinion:
- The web-as-is doesn’t function properly without content sniffing. That includes syndication technology.
- I and others are building systems where, to obtain the data in another format (e.g. Atom), you hit a new URL. You do not negotiate with the resource. You follow rel links or read someone’s “Web API” document.
- I’m certain microformats and rdf based languages will require content sniffing to be useful.
I see a direct mismatch between what’s being built right now and what the architectural canon says should be built. If media type declarations actually matter, we seem to be building out very badly; if they don’t matter, some of the W3C’s findings are in need of questioning.
“If you serve OWL as application/rdf+xml, you’re following the recommendations of the Web Ontology Working Group.”
Nick, that’s the issue in a nutshell. Given RDF and OWL are different languages, that recommendation shows disregard for web architecture. What I’m trying to figure out is whether the recommendation or the architecture needs to be questioned.
I’ve always thought that serving content as application/xml authoritatively declares the content to be XML, but that doesn’t preclude it from being a special type of XML, such as RDF/XML. It is just partial information. Similarly, sending an email attachment as application/octet-stream doesn’t mean that the content authoritatively isn’t a zip file.
If application/xml was an assertion that the content type was from the complement of the set of IANA registered XML types at the time of publication, then that is just fragile and silly. What should applications do - deliberately break overnight when a new MIME-type registration gets posted?
If there were a recommendation for publishers to use the most specific MIME type available, then I think a SHOULD would probably be too strong a requirement.
Tim O’Reilly has a follow-up to his Metaweb post, My “Outdated View” of the Semantic Web. Snippet: Just to be clear, I’ve always loved the vision of the Semantic Web. But much of the early work at the W3C always seemed to me to be a case of...