Dave's search is based on Google. Mine on Swish++. Both display the full weblog entries of any matches in reverse chronological order.
One interesting feature of Dave's: Google indexes pages based not just on their content, but on the link text in links pointing to them. So searching for (pick your favorite insult aimed at Dave) will return the posts where that was most often linked.
Sam, have you figured out how to get swish++ to do a search on an exact string, like "Sam Ruby"? By default it will look for "sam and ruby" which will present a bunch of irrelevant results, I'd expect...
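To make the distinction in that question concrete, here is a minimal sketch of the two matching behaviours: requiring every word versus requiring the exact phrase. The function names and sample texts are made up for illustration, and nothing here reflects how swish++ itself is implemented or queried.

```python
import re

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_all_words(query, text):
    """True if every query word appears somewhere in the text (AND search)."""
    return set(words(query)) <= set(words(text))

def matches_phrase(query, text):
    """True only if the query words appear next to each other, in order."""
    q, t = words(query), words(text)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Hypothetical sample entries
relevant = "Sam Ruby wrote about search."
irrelevant = "Sam bought a ruby ring."

print(matches_all_words("Sam Ruby", irrelevant))  # AND search accepts it
print(matches_phrase("Sam Ruby", irrelevant))     # phrase search rejects it
```

An AND search accepts both sample entries, while the phrase match keeps only the one where the two words sit side by side.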
I think that Dave's ordering of the results chronologically hides the point of doing the search through Google. If you leave the results ordered by relevance, you get something you can't get any other way. Google's method of determining relevance still kicks everyone else's butt.
However, even with chronological sorting, this does mean that the weblog's server software doesn't have to have any code that deals with indexing or searching content; you can just leave it to Google. This lowers the barrier for adding this functionality to your server-based weblog, freeing developers to work on something else.
Plus, I can think of several interesting ways to explore this search integration further, for example retrieving a list of pages that point to a particular result URL (either inter- or intra-blog), finding similar documents (likewise), using Google Sets to expand the search keywords (Google Sets are not currently available via the Google API), correcting misspelled searches, and likely others that don't occur to me immediately.
However, as cool as this is, I don't think this qualifies as a true mind-bomb, but more of a mini mind-grenade. Or you can consider it a single bomblet in the whole web services cluster-mind-bomb. It's really a continuation of the same thought.
I'm really not at all sure why Dave is so excited about this. As Michael points out, it could be useful for people who can't install a native search engine, so I'm more than happy to give it credit for being nifty. But how do we go from that to the "most innovative and intelligent search technology for weblogs today"?
I mean, is that it? Maybe I'm not paying close enough attention, but don't most blogging apps provide on-board search? Maybe I'm missing something obvious. I've been wrong before, so maybe there's something to it that eludes me.
I must agree with Roger, in that I couldn't see where the excitement is on this. I'm assuming, Sam, this was the point of your alternative implementation.
Dave has a passion for his work; I don't begrudge him that.
My "alternative implementation" is more than that - it is prior art. It is the way that my search interface has always worked. The substantive difference between Dave's implementation and mine is what search engine is used in the back end. I started with Lucene. Now I use swish++. I plan to explore swish-e. Google would certainly not be difficult.
The point of this method of searching, which I think has been made before very nicely by Jon Udell in his book Practical Internet Groupware, is that URL structure constitutes an 'API' in its own right, and can and ought to be exploited by applications.
To put it another way, that precise use kind of validates what the RESTians like to tell us: give endpoints a structure!
As others have stated, there are numerous ways to index a site: using the API or scraping Google, swish++, indexing service, etc.
The jump would be to create a standardized interface to search a website/blog. Who cares what they use on the backend? Rather, provide me with a feed of my search returning everything from views & rankings to the date, links, comments, and the post itself. Then, let me decide how I want to view it. The extension being that search engines and things like Feedster would be able to not only parse & index; they would also have the ability to deal with sites & blogs in a different way, possibly even making the distinction between a website and a blog in the index space, as well as, hopefully, increasing the accuracy of the index.
Using the Google API for the reason it was built isn't that memorable, especially when you have to circumvent the "key" issue.
Sam, it's not about being able to access the archives, but about search being sorted by a particular metadata criterion using the URL as the identifier. In this case, rather naturally, time. Your search code does that already, though I am guessing the time axis in your sort comes directly out of the search.
More generally, not just on blogs but on arbitrary websites too, designing URLs along some metadata axes and combining them with Google is a 'poor man's' way of getting sorted search. I wonder whether, for blogs, any other metadata is useful: an example might be the conversations one participates in across blogs, but that's substantially more complex, involving comments, trackbacks and other people's blog entries (a good app, though, for a centralized registration service).
BTW, given that you used Lucene, why swish-e and not Lupy?
Rahul
Time is but one potential axis for a search. I suspect that once you start exploring other metadata axes, you will find sorting by URLs limiting.
What URLs do is uniquely identify the resource. Using the URL to select the result set from the content management system works even if there are multiple paths to that item. Once you have selected the items, you can choose to format and sort them in multiple ways. At this point, you have access to all the relevant metadata, so sorting is trivial.
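As a rough sketch of that select-then-sort idea (not the actual code behind any of the search pages discussed here), the snippet below uses a made-up in-memory store in place of a content management system. The names `ENTRIES`, `URL_TO_ID`, `select_entries` and `sort_entries`, along with all URLs and metadata values, are assumptions for illustration only.

```python
from datetime import date

# Hypothetical stand-in for a content management system: entries by id.
ENTRIES = {
    1: {"title": "Lucene notes", "date": date(2003, 3, 30), "comments": 2},
    2: {"title": "Swish++ search", "date": date(2003, 4, 2), "comments": 9},
    3: {"title": "Google API", "date": date(2003, 4, 10), "comments": 5},
}

# Several URLs may lead to the same entry; each one still identifies it.
URL_TO_ID = {
    "/blog/1": 1,
    "/blog/2": 2,
    "/blog/2003/04/02/swish": 2,  # second path to entry 2
    "/blog/3": 3,
}

def select_entries(urls):
    """Map search-hit URLs to entries, dropping unknown URLs and duplicates."""
    ids = []
    for url in urls:
        entry_id = URL_TO_ID.get(url)
        if entry_id is not None and entry_id not in ids:
            ids.append(entry_id)
    return [ENTRIES[i] for i in ids]

def sort_entries(entries, axis, reverse=True):
    """Sort the selected entries along any metadata axis they carry."""
    return sorted(entries, key=lambda entry: entry[axis], reverse=reverse)

# URLs as a search back end might return them, in arbitrary order
hits = ["/blog/2003/04/02/swish", "/blog/1", "/blog/2", "/blog/3"]
selected = select_entries(hits)
by_date = [e["title"] for e in sort_entries(selected, "date")]
by_comments = [e["title"] for e in sort_entries(selected, "comments")]
```

The URL is only used to pick out the items; the choice of sort axis is made afterwards against the full metadata, so time is just one option among several.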
Swish++ and Swish-e index the data significantly faster than Lucene does.
Sam, we have a search engine like yours at UserLand; we've had it since 1998 or so. It works pretty well, but when Google came along, I started using it more, because it works better. Over time the Google advantage became greater: it's fast, when it crashes Google brings it back up, when it gets slow they pay for more hardware, etc.
The only problem with Google was that it mashed up the order of the searches and often quoted meaningless bits of text from the archive page it found. Then a couple of weeks ago I needed to get a chronological listing of my writing on some subject. Then I noticed that I could infer the date from the URLs, and that with the Google API I could have any form of presentation I wanted. This linkup between the URLs returned by Google and the chronology of the weblog was the "aha" in this process.
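A minimal sketch of that linkup, assuming permalinks that embed the date as /YYYY/MM/DD (the URL layout, the example.com addresses and the function names are all hypothetical, and the call to the search service itself is left out):

```python
import re
from datetime import date

# Assumed permalink layout: a /YYYY/MM/DD segment somewhere in the URL.
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$|[.#?])")

def date_from_url(url):
    """Return the date embedded in the URL, or None if there isn't one."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2003/13/45/ is not a real date
        return None

def sort_newest_first(urls):
    """Keep only dated URLs and order them in reverse chronological order."""
    dated = [(date_from_url(u), u) for u in urls]
    dated = [(d, u) for d, u in dated if d is not None]
    dated.sort(key=lambda pair: pair[0], reverse=True)
    return [u for _, u in dated]

# URLs in the order a relevance-ranked search might return them
results = [
    "http://example.com/2003/04/02/some-post.html",
    "http://example.com/about.html",
    "http://example.com/2003/04/10.html",
    "http://example.com/2002/12/25/other-post.html",
]
ordered = sort_newest_first(results)
```

Pages without a date in the URL, such as the about page here, simply drop out of the chronological listing.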
Anyway, sorry to say you don't have prior art, unless your blog search engine predates your blog. ;->
Dave, that search engine doesn't sort by date. It tries to sort by relevance, something that Google does exceptionally well.
You and I have now both chosen a different path, deciding that time is extremely relevant. And within the limited scope of weblogs (as compared to the internet as a whole), my experience has been overwhelmingly positive.
Sam,
Thanks for the pointer. Lupy is probably slower still, being Lucene in Python.
Agreed that other metadata would make sorting by URL limiting, especially if there are many axes. If there are few, though, prior thought into their design (good examples are categories and conversations) is worth it compared to asking a user to manually input metadata, which, in my experience, never happens...