libxml2 screams
Inside the libxml2 Python distribution are a few tests. One of them is named, simply enough, xpath. The purpose of this test is apparently to parse a small file, evaluate an XPath expression against it, clean up, and repeat this in a loop one thousand times. This runs in subsecond time on my machine.
What this leads me to conclude is that libxml2 is optimized for parsing lots of small files. So I tested the theory by running a more realistic query against all of the weblog entries on my site. The result was still subsecond.
Sweet.
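For anyone who wants to try the same kind of measurement, here is a minimal sketch of the parse, query, repeat loop described above. It uses Python's standard-library `xml.etree.ElementTree` rather than the libxml2 bindings, so it says nothing about libxml2's own speed. The sample document, the path expression, and the function names are made up for illustration.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real test parses a small file from disk.
SAMPLE = (
    "<doc>"
    "<entry id='1'><title>first</title></entry>"
    "<entry id='2'><title>second</title></entry>"
    "</doc>"
)

def run_once(xml_text, path):
    """Parse the document, evaluate the path, and return the matched text."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]

def time_loop(xml_text, path, iterations=1000):
    """Repeat parse + query `iterations` times; return (seconds, last result)."""
    start = time.perf_counter()
    result = []
    for _ in range(iterations):
        result = run_once(xml_text, path)
    return time.perf_counter() - start, result

if __name__ == "__main__":
    elapsed, titles = time_loop(SAMPLE, "./entry/title")
    print(f"1000 iterations: {elapsed:.3f}s, titles={titles}")
```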
That does not mean that I shouldn't migrate to an XML database, but merely that I don't need to do so today.
What it does mean is that I can spend my time thinking about what I want my URL space to look like and designing the schema I choose to expose. Some things are obvious: it makes sense to have all of the structure exposed instead of obscured, and to use a date format that can easily be collated.
As far as the URL space goes, I want to make sure that the results are readily cacheable. Thinking about the usage pattern, what I am likely to find is:
- not overly complicated queries
- queries which are ad-hoc and therefore hard to optimize for
- the number of unique queries issued per day is likely to be small
- the total number of queries issued (including repetitions) may be highly variable. All it takes is for someone like Jon Udell to post a few links to cause this to happen.
Given this usage pattern, it would seem that my existing cache exactly fits this requirement. Sweet.
I'll probably play with this for a few days before I deploy it publicly.
Les: that is the direction I am exploring. To be honest, the built-in unix metadata is a mixed blessing. For example, I just found out that created dates don't survive copies.
Also, my current weblog source directory has over nine thousand files in it. I'm starting to see some operations slow down. I also can no longer issue commands like "grep foo *", as Unix complains that the command line is too long.
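One way around the command-line limit is to avoid expanding `*` in the shell at all and walk the directory from a script instead. A minimal sketch, assuming a flat directory of text files; the function name `grep_dir` is hypothetical:

```python
import os

def grep_dir(directory, needle):
    """Yield (filename, line_number, line) for each line containing needle."""
    # Collect the file names first so no shell glob expansion is involved.
    with os.scandir(directory) as entries:
        names = sorted(entry.name for entry in entries if entry.is_file())
    for name in names:
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield name, number, line.rstrip("\n")
```

Something like `for hit in grep_dir(".", "foo"): print(*hit)` would then stand in for "grep foo *", however many files the directory holds.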
Things I am exploring: putting comments in the same file as the parent. Splitting up the directories by date. Who knows, I might even move away from my sequential numbering schema (a holdover from when I kept bidirectional synchronization with my Radio weblog).
In any case, I will likely try to keep the file system metadata in synch. Lots of fun things to play with.
Posted by Sam Ruby at
Sadly, the mysterious 3rd time isn't "create time" at all, although many people seem to think it is. It's technically "inode change time". Exact semantics probably vary between Unixen, but I would expect that it may well change for something as simple as a rename operation, or a hard link.
-Dom
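This is easy to check with `os.stat`. A minimal sketch, assuming a POSIX system; the helper name `ctime_around_chmod` is hypothetical. It changes only a file's permissions and reports the change time and modification time before and after.

```python
import os
import tempfile

def ctime_around_chmod():
    """Return (ctime_before, ctime_after, mtime_before, mtime_after) in ns."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "entry.txt")
        with open(path, "w") as handle:
            handle.write("hello\n")
        before = os.stat(path)
        os.chmod(path, 0o600)  # metadata-only change; contents are untouched
        after = os.stat(path)
        return (before.st_ctime_ns, after.st_ctime_ns,
                before.st_mtime_ns, after.st_mtime_ns)
```

The modification time stays put while the change time is free to move forward. On a fast machine the two change times can come out equal because of timestamp granularity.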
Posted by Dominic Mitchell at
Here's a Unix "best practice" from the heyday of eToys -- use the file system as a semi-balanced tree by creating N directories (you choose N based on what OS you're using), chosen to roughly balance out the load across all directories, and then stash the files in there.
For example, with SKUs: 494943.html goes in directory './49/49' as file '43.html'.
Several orders of magnitude in speed improvement, but you need to know about your inodes.
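The SKU example can be sketched in a few lines. This is only an illustration of the layout described here: the function name `shard_path` and its `levels` and `width` parameters are hypothetical, and the right number of directories depends on the file system.

```python
import os

def shard_path(filename, levels=2, width=2):
    """Turn the leading characters of a file name into nested directories."""
    stem, ext = os.path.splitext(filename)
    if len(stem) <= levels * width:
        raise ValueError(f"{filename!r} is too short to shard")
    # One directory per group of `width` leading characters.
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    rest = stem[levels * width:] + ext
    return "./" + "/".join(parts + [rest])

print(shard_path("494943.html"))  # ./49/49/43.html
```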
Posted by Wilhelm at
- NetNewsWire Pro XML Database JuJu Bright Eyed Mister Zen Mac OS X Web Services Xindice Realm kstaken@xmldatabases.org Apache Xindice native XML database XML:DB initiative XML:DB XML database API - Xindice (Formally dbXML) - Xindice XML-RPC...
Excerpt from RUBEN VIDAL at
It almost sounds like you're heading for an XML-ish blosxom, given the many-little-files scenario. Neat! (Of course, that's minus all the leveraging of built-in Unix metadata, but hey!)
Posted by l.m.orchard at