It’s just data

Fun with XPath

Last twenty weblog entries which:


Last 20 with a comment by me:

http://www.intertwingly.net/blog/?q=//atom:feed[contains(atom:entry/atom:author,%20'Obasanjo')]

Not bad. It seems your XPath engine doesn't support multiple boolean expressions in the predicate. I tried the following query but kept getting 404s:

http://www.intertwingly.net/blog/?q=//atom:feed[contains(atom:entry/atom:author,%20'Winer')%20and%20contains(atom:entry/atom:author,%20'Pilgrim')%20]

Posted by Dare Obasanjo at
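
For anyone retyping queries like the one above by hand, here is a minimal Python sketch of percent-encoding an XPath expression into the ?q= parameter. The function name build_query_url and the set of characters left unescaped are assumptions made for illustration; only the base URL and the XPath itself come from the comment above. The sketch cannot tell you whether a 404 comes from the encoding or from a query that simply matches nothing.

```python
from urllib.parse import quote

BASE_URL = "http://www.intertwingly.net/blog/?q="

def build_query_url(xpath: str) -> str:
    """Percent-encode an XPath expression and append it to the query URL."""
    # Keep XPath punctuation readable; spaces are encoded as %20.
    return BASE_URL + quote(xpath, safe="/:[](),'")

# The two-condition predicate from the comment above, written with plain spaces.
xpath = ("//atom:feed[contains(atom:entry/atom:author, 'Winer') "
         "and contains(atom:entry/atom:author, 'Pilgrim') ]")
url = build_query_url(xpath)
print(url)
```

Encoding in one place keeps the readable XPath and the URL form from drifting apart when a predicate grows a second condition.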

Seeing that both //atom and //xhtml nodes are supported, how does your search engine handle those? Does it look at a single data set that includes both atom and xhtml versions of your posts, or does it route to two different data sets?

Posted by Jay Fienberg at

Dare, it seems that Mark Pilgrim simply signs his name as Mark.  A more reliable indicator would be the presence of "diveintomark" in the url, thus: posts containing comments by both Dave Winer and Mark Pilgrim.

Posted by Sam Ruby at

Jay, single dataset.  My atom version of my posts contain my xhtml version of my posts, inside the <atom:content> element.

Posted by Sam Ruby at
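
A small Python sketch of what "single dataset" can look like in practice: one parsed entry answers both atom: and xhtml: queries because the XHTML sits inside the atom:content element. The sample entry, the example.org links, and the Atom namespace URI are assumptions for illustration, not the actual files behind this site.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://purl.org/atom/ns#",        # assumed Atom namespace URI
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Hypothetical entry: XHTML markup nested inside atom:content.
SAMPLE = """<atom:entry xmlns:atom="http://purl.org/atom/ns#">
  <atom:title>Fun with XPath</atom:title>
  <atom:content>
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p>See <a href="http://example.org/one">one</a> and
         <a href="http://example.org/two">two</a>.</p>
    </div>
  </atom:content>
</atom:entry>"""

entry = ET.fromstring(SAMPLE)

# An atom-level query and an xhtml-level query against the same tree.
title = entry.findtext("atom:title", namespaces=NS)
links = [a.get("href") for a in entry.findall(".//xhtml:a", NS)]

print(title)
print(links)
```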

Wow, nice stuff.  You're doing this with a lots-of-little-files XML repository?

Keep doing this, and I might just have to clean up my blogging data and start playing with this.  :)

Posted by l.m.orchard at

Very nice. You won't of course forget this is tying the relational semantics to a tree structure, and that has inherent limitations...

Posted by Danny at

Les: yes, lots of little files.

Danny: care to identify a tangible limitation?  I like a good challenge...

Posted by Sam Ruby at

Thanks for the info Sam.

To pick up your challenge to Danny about relational vs tree:

How about: show all the entries that have the same first word (i.e., without specifying what that word is).

This is a kind-of relational (recursive) join query.

But, in general, I bet that searching for matches on blog entries probably isn't really a case where the relational / tree structure limitations can be really explored.

The blog entries are, in one sense, essentially a single table (see my "table" syndication format view of my blog at: http://icite.net/blog/200309/really_tabular_synidcation.html ).

Posted by Jay Fienberg at

Jay, I may be misunderstanding what you are suggesting, but that sounds to me like something XSLT excels at.  In pseudo-code, what one can do with XSLT is:

foreach entry
  $id=id
  $word=substring-before(entry.content,' ')
  foreach preceding::entry(word=$word)
    print match($id,id)

Posted by Sam Ruby at
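
The pseudo-code above can be sketched in Python to make the loop concrete. This is an illustration under stated assumptions, not the XSLT itself: the sample feed, the Atom namespace URI, and the helper names first_word and first_word_matches are all made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://purl.org/atom/ns#"}  # assumed Atom namespace URI

# Hypothetical three-entry feed; entries 1 and 3 share a first word.
SAMPLE_FEED = """<feed xmlns="http://purl.org/atom/ns#">
  <entry><id>1</id><content>Fun with XPath</content></entry>
  <entry><id>2</id><content>Data flow of comments</content></entry>
  <entry><id>3</id><content>Fun with caching</content></entry>
</feed>"""

def first_word(text):
    """Mimic substring-before(text, ' '): empty string when there is no space."""
    head, sep, _ = text.partition(" ")
    return head if sep else ""

def first_word_matches(feed_xml):
    """Return (id, earlier_id) pairs for entries sharing a first word."""
    root = ET.fromstring(feed_xml)
    seen = []      # (id, first word) of the preceding entries
    matches = []
    for entry in root.findall("atom:entry", NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=NS)
        content = entry.findtext("atom:content", default="", namespaces=NS)
        word = first_word(content)
        # Inner loop over preceding entries, as in the pseudo-code.
        for prev_id, prev_word in seen:
            if word and prev_word == word:
                matches.append((entry_id, prev_id))
        seen.append((entry_id, word))
    return matches

print(first_word_matches(SAMPLE_FEED))
```

Comparing only against preceding entries reports each pair once and never pairs an entry with itself.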

Sam,
  I'm trying to figure out why your query is so complex. Why not:

  for $e in collection("atom-files-directory")//atom:entry
  let $word := substring-before($e/atom:content/text(), ' ')
  where $word = $id
  return $e

My XQuery is a little rusty since I haven't kept up with the spec drafts but that should work. 

PS: You aren't accepting posts sent to your blog via the CommentAPI. Is this a bug on your end or mine?

Posted by Dare Obasanjo at

Dare, I may not fully understand Jay's example, but he did indicate that a join was required.

P.S.  I just tried a few test posts via the Comment API, and they appeared to work.  Can you capture a wire trace?

Posted by Sam Ruby at

Sam, your XSLT pseudo-code looks like it will work for what I was thinking it wouldn't work for, so I was wrong about this as being an example showing a limitation with a tree structure.

For my example, I was thinking of a query in SQL like:

select a.id from entries a join entries b where substr(a.entry,0,locate(a.entry,' ')) = substr(b.entry,0,locate(b.entry,' '))

And I was thinking that this couldn't be expressed in a single XPath statement. But, SQL vs XPath is not the same issue as graph vs tree anyway.

Posted by Jay Fienberg at
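
A runnable take on the self-join idea above, sketched with Python's built-in sqlite3 module. The table layout and rows are made up for illustration, and SQLite's instr() stands in for locate(), so treat this as one possible dialect rather than the exact query from the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, entry TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [
        (1, "Fun with XPath"),
        (2, "Data flow of comments"),
        (3, "Fun with caching"),
    ],
)

# Self-join on the first word; a.id < b.id reports each pair once
# and keeps an entry from matching itself.
QUERY = """
SELECT a.id, b.id
FROM entries a
JOIN entries b
  ON substr(a.entry, 1, instr(a.entry, ' ') - 1)
   = substr(b.entry, 1, instr(b.entry, ' ') - 1)
WHERE a.id < b.id
  AND instr(a.entry, ' ') > 0
ORDER BY a.id, b.id
"""

pairs = conn.execute(QUERY).fetchall()
print(pairs)
```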

OK, I see where I misunderstood his example. The XQuery should be

(: assumes the atom prefix is declared in the query prolog :)
for $e in collection("atom-files-directory")//atom:entry
let $word := substring-before(string($e/atom:content), ' ')
where some $e2 in collection("atom-files-directory")//atom:entry
      satisfies not($e2 is $e)
        and substring-before(string($e2/atom:content), ' ') = $word
return $e

Posted by Dare Obasanjo at

This is brilliant stuff, Sam. I see that expressions that haven't been executed (and therefore aren't cached) take some time. Do you have any thoughts on this and on the DoS issue of XPath searches? What would you do to prevent your site from being DoS-attacked with complex queries?

Posted by Asbjørn Ulsberg at

The short answer is: if you try it, I'll block your ip address.  ;-)

The longer answer is: there are many ways to do a DoS against my site or any site.  As you note, I do have a cache, so I can easily put a per-day cap on the number of unique queries I will serve (effectively disabling new unique queries for a day or so), enabling the rest of the site to be served.

Posted by Sam Ruby at
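
The per-day cap described above might look roughly like the following Python sketch. It is not the code running on this site; the class name CappedQueryCache, the QuotaExceeded exception, and the evaluate callback are all hypothetical.

```python
import datetime

class QuotaExceeded(Exception):
    """Raised when the daily budget for new, uncached queries is spent."""

class CappedQueryCache:
    def __init__(self, evaluate, max_new_per_day):
        self.evaluate = evaluate            # function that actually runs a query
        self.max_new_per_day = max_new_per_day
        self.cache = {}                     # query string -> cached result
        self.day = None                     # day the counter below refers to
        self.new_today = 0                  # unique queries evaluated that day

    def lookup(self, query, today=None):
        # Cached queries are always served, even after the cap is hit.
        if query in self.cache:
            return self.cache[query]
        today = today or datetime.date.today()
        if today != self.day:
            # First new query of a new day: reset the counter.
            self.day, self.new_today = today, 0
        if self.new_today >= self.max_new_per_day:
            raise QuotaExceeded(query)
        self.new_today += 1
        self.cache[query] = self.evaluate(query)
        return self.cache[query]
```

Checking the cache before the quota is the point of the design: once the cap is reached, only new unique queries are refused, and everything already cached keeps being served.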

The DOS issue is a little overstated. The example that was given for Syncato exploited a bug in Pathan that results in an infinite query execution time. This is fixed in a newer Pathan release that the site hasn't yet been upgraded to. Syncato also has a cache that would mitigate any repeat requests on the same query (assuming it's not exploiting a bug). Regardless, it is always possible to DOS a site that generates content dynamically.

XPath may be slightly worse than previous tools, but this should not in any way dissuade anyone from exploring its potential. I can think of dozens of reasons why you "shouldn't" be doing this kind of thing, but I made an explicit decision to shove those aside and focus on the exploration of what power this kind of thing brings.

Posted by Kimbro Staken at

Data Flow

Data flow of comments to feeds, focusing on how indexing and caching work. Les Orchard types in this comment without needing to worry about formatting.  It is stored here in blosxom format as well-formed XHTML.  The index page is regenerated with an up... [more]

Trackback from Sam Ruby

at

Paul Ford of Ftrain.com has been using an XML-based content management system for years. Although his is messy and not-so-dynamic, it performs much like Syncato.

Posted by Taylor House at

On the DoS issue: maybe the answer is just to keep the execution timeout for XPath queries low. Then heavy queries will be terminated (and not take up much CPU time), and fast queries will be executed and preferably cached afterwards.

Posted by Asbjørn Ulsberg at
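
One way to get a hard execution timeout is to evaluate each query in a child process and kill it when the budget runs out. The Python sketch below uses throwaway scripts as stand-ins for a real XPath evaluation; the helper name run_with_timeout and the time budgets are assumptions, not how any of the engines mentioned here work.

```python
import subprocess
import sys

def run_with_timeout(script: str, seconds: float):
    """Run script in a child Python process.

    Returns the child's stdout, or None if it exceeded the time budget.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=seconds,   # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout.strip()

# Stand-ins for a cheap query and a runaway query.
fast = run_with_timeout("print(2 + 2)", seconds=5.0)
slow = run_with_timeout("import time; time.sleep(30)", seconds=0.5)

print(fast)
print(slow)
```

Only results that come back inside the budget would be candidates for caching; a None result means the query was cut off.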

Atom2Yaml

The goal is to support these queries.  It will be interesting to see how _why handles the second one given that he is currently cheating on the content element.  ;-)... [more]

Trackback from Sam Ruby

at

Top 3 Features I Want To Add To RSS Bandit

Early on when I started working on RSS Bandit I used to take my cues for features from other .NET aggregators like Syndirella and SharpReader. However in the past couple of months I've realized that RSS Bandit is more featureful and provides more ...

Pingback from Dare Obasanjo aka Carnage4Life - Top 3 Features I Want To Add To RSS Bandit

at

Render Services; Enhanced XHTML

Recently I had a business meeting where someone liked very much the SlideML presentation format - they struggle with Powerpoint. As I also showed them the KAYWA Blogsoftware, the question came up, if one could write SlideML via the Bloginterface. I...

Excerpt from Bitflux Blog at

Beyond XPath

Sam Ruby has some RDF questions. Typically I'm too knackered to give a proper answer right now. But if any RDF...... [more]

Trackback from Raw

at

Using XPath to mine XHTML

This morning, I finally decided to install libxml2 and see what all the fuss was about, in particular with respect to XPath. What followed is best described as an enlightening experience. XPath is a beautifully elegant way of addressing "nodes" ...

Pingback from Simon Willison: Using XPath to mine XHTML

at

What XPath is, and why its a Good Thing

For a while now some colleagues have been raving about XPath, but I must admit it’s something I’ve never really looked into. In a brief post Simon has managed to not only explain what XPath is, but also why it’s...... [more]

Trackback from magpiebrain

at

Content Management and Data Mining with RDF, XPath, XHTML and the rest...

Simon Willison has a good post about using XPath to mine XHTML. In it he says "XHTML is an ideal...... [more]

Trackback from Raw

at

2004-01-15 links

PGP Signing FOAF Files XHTML 1.0 Symbol Character References XHTML 1.0 Latin-1 Character References Languages/xml/xpath XHTML 1.0 Special Character References Foaf-check XHTML Web Design for Beginners - Part 2 Fun with XPath...

Excerpt from dealmeida.net at

Ming the Mechanic on Micro-Content.....

Flemming Funch raps it out. My reply below.... "Microcontent" seems to be one of the buzzwords now. So, what is that, really? Jakob Nielsen, interface guru, used it (first?) in 1998 about stuff like titles, headlines and subject lines. The idea being...

Excerpt from Marc's Voice at

So, apologies to your Apache log, but I see you still support the XHTML queries, but the atom-namespace sample queries you provide are now (wrongly, in some cases, I think) broken. Is this particular feature of your blog no longer working, or is my XPath just too poor?

Posted by Phil Wilson at
