Fun with XPath
Last twenty weblog entries which:
Seeing that both //atom and //xhtml nodes are supported, how does your search engine handle those? Does it look at a single data set that includes both atom and xhtml versions of your posts, or does it route to two different data sets?
Posted by Jay Fienberg at
Dare, it seems that Mark Pilgrim simply signs his name as Mark. A more reliable indicator would be the presence of "diveintomark" in the url, thus: posts containing comments by both Dave Winer and Mark Pilgrim.
Posted by Sam Ruby at
Jay, single dataset. The atom version of each of my posts contains the xhtml version of that post, inside the <atom:content> element.
Posted by Sam Ruby at
Wow, nice stuff. You're doing this with a lots-of-little-files XML repository?
Keep doing this, and I might just have to clean up my blogging data and start playing with this. :)
Posted by l.m.orchard at
Very nice. You won't of course forget this is tying the relational semantics to a tree structure, and that has inherent limitations...
Posted by Danny at
Les: yes, lots of little files.
Danny: care to identify a tangible limitation? I like a good challenge...
Posted by Sam Ruby at
Thanks for the info Sam.
To pick up your challenge to Danny about relational vs tree:
How about: show all the entries that have the same first word (i.e., without specifying what that word is).
This is a kind-of relational (recursive) join query.
But, in general, I bet that searching for matches on blog entries isn't a case where the relational / tree structure limitations can really be explored.
The blog entries are, in one sense, essentially a single table (see my "table" syndication format view of my blog at: http://icite.net/blog/200309/really_tabular_synidcation.html ).
Posted by Jay Fienberg at
Jay, I may be misunderstanding what you are suggesting, but that sounds to me like something XSLT excels at. In pseudo-code, what one can do with XSLT is:
foreach entry
  $id = id
  $word = substring-before(entry.content, ' ')
  foreach preceding::entry(word=$word)
    print match($id, id)
Posted by Sam Ruby at
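For readers who want to try the "same first word" query outside XSLT, here is a minimal Python sketch. It is not what the blog's search engine does. It swaps the nested "compare with every preceding entry" loop for a single pass that groups entries by first word, which gives the same matches. The Atom 1.0 namespace, the sample feed, and the function names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: entries use the Atom 1.0 namespace; the sample feed is made up.
NS = {"atom": "http://www.w3.org/2005/Atom"}

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>e1</id><content>Fun with XPath</content></entry>
  <entry><id>e2</id><content>Data flow of comments</content></entry>
  <entry><id>e3</id><content>Fun with XQuery</content></entry>
  <entry><id>e4</id><content>Atom2Yaml</content></entry>
</feed>"""


def first_word(entry):
    """Return the first whitespace-delimited word of an entry's content."""
    content = entry.find("atom:content", NS)
    text = "".join(content.itertext()) if content is not None else ""
    words = text.split()
    return words[0] if words else ""


def entries_sharing_first_word(feed_xml):
    """Map each first word to the ids of the entries that start with it,
    keeping only words shared by two or more entries."""
    groups = defaultdict(list)
    for entry in ET.fromstring(feed_xml).iterfind("atom:entry", NS):
        word = first_word(entry)
        if word:
            entry_id = entry.findtext("atom:id", default="", namespaces=NS)
            groups[word].append(entry_id)
    return {word: ids for word, ids in groups.items() if len(ids) > 1}


matches = entries_sharing_first_word(SAMPLE_FEED)
print(matches)
```

Unlike substring-before(content, ' '), which yields an empty string when the content has no space, this sketch treats a one-word entry as having that word as its first word.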
Sam,
I'm trying to figure out why your query is so complex. Why not:
for $e in collection("atom-files-directory")//atom:entry
let $word := substring-before($e/atom:content/text(), ' ')
where $word = $id
return $e
My XQuery is a little rusty since I haven't kept up with the spec drafts but that should work.
PS: You aren't accepting posts sent to your blog via the CommentAPI. Is this a bug on your end or mine?
Posted by Dare Obasanjo at
Dare, I may not fully understand Jay's example, but he did indicate that a join was required.
P.S. I just tried a few test posts via the Comment API, and they appeared to work. Can you capture a wire trace?
Posted by Sam Ruby at
Sam, your XSLT pseudo-code looks like it will work for what I was thinking it wouldn't work for, so I was wrong to offer this as an example of a limitation of a tree structure.
For my example, I was thinking of a query in SQL like:
select a.id from entries a join entries b where substr(a.entry,0,locate(a.entry,' ')) = substr(b.entry,0,locate(b.entry,' '))
And I was thinking that this couldn't be expressed in a single XPath statement. But, SQL vs XPath is not the same issue as graph vs tree anyway.
Posted by Jay Fienberg at
OK, I see where I misunderstood his example. The XQuery should be
for $e in collection("atom-files-directory")//atom:entry
let $word := substring-before($e/atom:content/text(), ' ')
where exists(
  for $e2 in collection("atom-files-directory")//atom:entry
  let $word2 := substring-before($e2/atom:content/text(), ' ')
  where not($e2 is $e) and $word = $word2
  return true()
)
return $e
This is brilliant stuff, Sam. I see that expressions that haven't been executed (and therefore aren't cached) take some time. Do you have any thoughts on this and on the DoS issue of XPath searches? What would you do to prevent your site from being DoS-attacked with complex queries?
Posted by Asbjørn Ulsberg at
The short answer is: if you try it, I'll block your ip address. ;-)
The longer answer is: there are many ways to do a DoS against my site or any site. As you note, I do have a cache, so I can easily put a per-day cap on the number of unique queries I will serve (effectively disabling new unique queries for a day or so), enabling the rest of the site to be served.
Posted by Sam Ruby at
The DoS issue is a little overstated. The example that was given for Syncato exploited a bug in Pathan that results in an infinite query execution time. This is fixed in a newer Pathan release that the site hasn't yet been upgraded to. Syncato also has a cache that would mitigate any repeat requests on the same query (assuming it's not exploiting a bug). Regardless, it is always possible to DoS a site that generates content dynamically.
XPath may be slightly worse than previous tools, but this should not in any way dissuade anyone from exploring its potential. I can think of dozens of reasons why you "shouldn't" be doing this kind of thing, but I made an explicit decision to shove those aside and focus on the exploration of what power this kind of thing brings.
Posted by Kimbro Staken at
Data Flow
Data flow of comments to feeds, focusing on how indexing and caching work. Les Orchard types in this comment without needing to worry about formatting. It is stored here in blosxom format as well-formed XHTML. The index page is regenerated with an up...
Trackback from Sam Ruby at
Paul Ford, of Ftrain.com, has been using an XML-based content management system for years. Although his is messy and not-so-dynamic, it performs much like Syncato.
Posted by Taylor House at
On the DoS issue: maybe the answer is just to keep the execution timeout for XPath queries low. Then heavy queries will be terminated (and not take up much CPU time), and fast queries will be executed and preferably cached afterwards.
Posted by Asbjørn Ulsberg at
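The three mitigations raised in this thread (a result cache, a per-day cap on new unique queries, and a low execution timeout) can be combined in a few lines. The following is only a sketch under stated assumptions, not how this site or Syncato implements them: `GuardedSearch`, `QueryBudgetExceeded`, and `toy_evaluate` are hypothetical names, and the timeout is cooperative, meaning the evaluator has to call `check_deadline()` itself. A real XPath engine would need its own way to interrupt a running query.

```python
import time
from datetime import date


class QueryBudgetExceeded(Exception):
    """Raised when a new query is refused or runs past its deadline."""


class GuardedSearch:
    """Cache results, cap new unique queries per day, enforce a time budget."""

    def __init__(self, evaluate, max_new_per_day=100, timeout_s=2.0):
        self.evaluate = evaluate  # callable(query, check_deadline) -> result
        self.max_new_per_day = max_new_per_day
        self.timeout_s = timeout_s
        self.cache = {}
        self.day = None
        self.new_today = 0

    def search(self, query):
        # Cached queries are always served, even after the cap is reached.
        if query in self.cache:
            return self.cache[query]
        # Reset the counter when the calendar day changes.
        today = date.today()
        if today != self.day:
            self.day, self.new_today = today, 0
        if self.new_today >= self.max_new_per_day:
            raise QueryBudgetExceeded("daily cap on new queries reached")
        self.new_today += 1
        deadline = time.monotonic() + self.timeout_s

        def check_deadline():
            if time.monotonic() > deadline:
                raise QueryBudgetExceeded("query exceeded its time budget")

        result = self.evaluate(query, check_deadline)
        self.cache[query] = result
        return result


def toy_evaluate(query, check_deadline):
    """Stand-in for an XPath engine: substring match over made-up titles."""
    hits = []
    for title in ["Fun with XPath", "Data Flow", "Atom2Yaml"]:
        check_deadline()  # cooperative timeout check between units of work
        if query.lower() in title.lower():
            hits.append(title)
    return hits


search = GuardedSearch(toy_evaluate, max_new_per_day=2)
first = search.search("xpath")   # evaluated, then cached
again = search.search("xpath")   # served from the cache
search.search("atom")            # second new query of the day
try:
    search.search("flow")        # third new query is refused
    capped = False
except QueryBudgetExceeded:
    capped = True
```

Counting a query against the cap before evaluating it means that queries which time out still use up the budget, which is the behaviour one would want against deliberately expensive expressions.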
Atom2Yaml
The goal is to support these queries. It will be interesting to see how _why handles the second one given that he is currently cheating on the content element. ;-)...
Trackback from Sam Ruby at
Top 3 Features I Want To Add To RSS Bandit
Early on when I started working on RSS Bandit I used to take my cues for features from other .NET aggregators like Syndirella and SharpReader. However in the past couple of months I've realized that RSS Bandit is more featureful and provides more ...
Pingback from Dare Obasanjo aka Carnage4Life - Top 3 Features I Want To Add To RSS Bandit at
Render Services; Enhanced XHTML
Recently I had a business meeting where someone liked very much the SlideML presentation format - they struggle with Powerpoint. As I also showed them the KAYWA Blogsoftware, the question came up, if one could write SlideML via the Bloginterface. I...
Excerpt from Bitflux Blog at
Beyond XPath
Sam Ruby has some RDF questions. Typically I'm too knackered to give a proper answer right now. But if any RDF......
Trackback from Raw at
Using XPath to mine XHTML
This morning, I finally decided to install libxml2 and see what all the fuss was about, in particular with respect to XPath. What followed is best described as an enlightening experience. XPath is a beautifully elegant way of addressing "nodes" ...
Pingback from Simon Willison: Using XPath to mine XHTML at
What XPath is, and why its a Good Thing
For a while now some colleagues have been raving about XPath, but I must admit it's something I’ve never really looked into. In a brief post Simon has managed to not only explain what XPath is, but also why its......
Trackback from magpiebrain at
Content Management and Data Mining with RDF, XPath, XHTML and the rest...
Simon Willison has a good post about using XPath to mine XHTML. In it he says "XHTML is an ideal......
Trackback from Raw at
2004-01-15 links
PGP Signing FOAF Files XHTML 1.0 Symbol Character References XHTML 1.0 Latin-1 Character References Languages/xml/xpath XHTML 1.0 Special Character References Foaf-check XHTML Web Design for Beginners - Part 2 Fun with XPath...
Excerpt from dealmeida.net at
Ming the Mechanic on Micro-Content.....
Flemming Funch raps it out. My reply below.... "Microcontent" seems to be one of the buzzwords now. So, what is that, really? Jakob Nielsen, interface guru, used it (first?) in 1998 about stuff like titles, headlines and subject lines. The idea being...
Excerpt from Marc's Voice at
So, apologies to your Apache log, but I see you still support the XHTML queries, but the atom-namespace sample queries you provide are now (wrongly, in some cases, I think) broken. Is this particular feature of your blog no longer working, or is my XPath just too poor?
Posted by Phil Wilson at
Last 20 with a comment by me:
http://www.intertwingly.net/blog/?q=//atom:feed[contains(atom:entry/atom:author,%20'Obasanjo')]
Not bad. It seems your XPath engine doesn't support multiple boolean expressions in the predicate. I tried the following query but kept getting 404s:
http://www.intertwingly.net/blog/?q=//atom:feed[contains(atom:entry/atom:author,%20'Winer')%20and%20contains(atom:entry/atom:author,%20'Pilgrim')%20]
Posted by Dare Obasanjo at
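Whatever caused the 404s, there is a detail in the queries above that is worth checking. In XPath 1.0, contains() converts a node-set argument to the string-value of its first node only, so contains(atom:entry/atom:author, 'Winer') looks only at the first atom:author in the feed. A form that tests every author would be //atom:feed[atom:entry/atom:author[contains(., 'Winer')] and atom:entry/atom:author[contains(., 'Pilgrim')]]. The sketch below does not run an XPath engine. It mimics the two readings in plain Python over a made-up feed, and the namespace, sample data, and function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: Atom 1.0 namespace; the feed below is a made-up post whose
# entries are the post itself followed by two comments.
NS = {"atom": "http://www.w3.org/2005/Atom"}

POST = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><author><name>Sam Ruby</name></author></entry>
  <entry><author><name>Dave Winer</name></author></entry>
  <entry><author><name>Mark Pilgrim</name></author></entry>
</feed>"""


def author_strings(feed):
    """String-values of every atom:entry/atom:author, in document order."""
    authors = feed.iterfind("atom:entry/atom:author", NS)
    return ["".join(author.itertext()) for author in authors]


def first_node_contains(feed, needle):
    """Mimics contains(atom:entry/atom:author, needle): only the first
    node of the node-set is converted to a string and searched."""
    authors = author_strings(feed)
    return bool(authors) and needle in authors[0]


def any_node_contains(feed, needle):
    """Mimics atom:entry/atom:author[contains(., needle)]: true when any
    author node contains the needle."""
    return any(needle in author for author in author_strings(feed))


feed = ET.fromstring(POST)
first_only = (first_node_contains(feed, "Winer")
              and first_node_contains(feed, "Pilgrim"))
any_author = (any_node_contains(feed, "Winer")
              and any_node_contains(feed, "Pilgrim"))
print(first_only, any_author)
```

With this sample data the first reading rejects the post, because the first author is Sam Ruby, while the second reading accepts it.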