Inside the libxml2 Python
distribution are a few tests. One of
them is named, simply enough, xpath. The purpose of this test
apparently is to parse a small file, evaluate an xpath expression
against it, clean up, and repeat this in a loop one thousand
times. This runs in subsecond time on my machine.
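The shape of that test can be sketched roughly as follows. This is not the libxml2 test itself: it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without libxml2 installed, and the sample document, the path expression, and the function names are all made up for illustration.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the small file the test parses.
SAMPLE = "<doc><entry id='1'>hello</entry><entry id='2'>world</entry></doc>"


def run_once(xml_text, path):
    """Parse the document, evaluate the path, and return the matched text."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


def run_loop(xml_text, path, iterations=1000):
    """Repeat parse-and-query; return the last result and elapsed seconds."""
    start = time.perf_counter()
    result = None
    for _ in range(iterations):
        result = run_once(xml_text, path)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    matched, seconds = run_loop(SAMPLE, ".//entry[@id='2']")
    print(matched, "in", round(seconds, 3), "seconds")
```

ElementTree only supports a subset of XPath, so timings from this sketch say nothing about libxml2 itself.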
What this leads me to conclude is that libxml2 is optimized for
parsing lots of small files. So I tested the theory by
running a more realistic query against all of the weblog entries on
my site. The result was still subsecond.
Sweet.
That does not mean that I shouldn't migrate to an XML database,
but merely that I don't need to do so today.
What it does mean is that I can spend my time thinking about
what I want my url space to look like and designing the schema I
choose to expose. Some things are obvious: it makes
sense to have all of the structure exposed instead of obscured,
and to use a date format that can easily be collated.
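One format with that property is ISO 8601, which I am assuming here purely for illustration; no format has been chosen yet. With fixed-width fields, and all timestamps in the same time zone, sorting the strings as plain text gives the same order as sorting the dates they represent. The sample timestamps below are invented.

```python
from datetime import datetime

# Invented sample timestamps, all in the same (unstated) time zone.
stamps = [
    "2003-11-02T09:15:00",
    "2003-01-30T18:00:00",
    "2003-11-02T08:59:59",
]

# Plain string comparison, as a file listing or text index would do.
sorted_as_text = sorted(stamps)

# Comparison after parsing each string into a real datetime.
sorted_as_dates = sorted(stamps, key=datetime.fromisoformat)
```

The two lists come out identical, which is what makes such a format easy to collate without parsing.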
As far as the url space goes, I want to make sure that the
results are readily cacheable. Thinking about the usage
pattern, what I am likely to find is:
not overly complicated queries
queries which are ad-hoc and therefore hard to optimize for
the number of unique queries issued per day is likely to be small
the total number of queries issued (including repetitions) may be highly variable. All it takes is for someone like Jon Udell to post a few links to cause this to happen.
Given this usage pattern, it would seem that my existing cache
exactly fits this requirement. Sweet.
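To see why few unique queries with many repetitions suits a cache, here is a minimal sketch. It is not my existing cache: an in-process memo from functools.lru_cache stands in for it, and cached_query, serve, and the evaluations list are hypothetical names introduced only for this example.

```python
from functools import lru_cache

# Records every time the expensive step really runs. In practice that
# step would parse the entries and evaluate the query; here it is faked.
evaluations = []


@lru_cache(maxsize=128)
def cached_query(expression):
    """Evaluate a query once; later calls with the same text hit the memo."""
    evaluations.append(expression)
    return "results for " + expression


def serve(requests):
    """Answer a stream of requests, most of which are repeats."""
    return [cached_query(expression) for expression in requests]
```

However many times a popular link is followed, the underlying work is done once per distinct query.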
I'll probably play with this for a few days before I deploy it
publicly.
It almost sounds like you're heading for an XML-ish blosxom, given the many-little-files scenario. Neat! (Of course, that's minus all the leveraging of built-in Unix metadata, but hey!)
Les: that is the direction I am exploring. To be honest, the built-in unix metadata is a mixed blessing. For example, I just found out that created dates don't survive copies.
Also, my current weblog source directory has over nine thousand files in it. I'm starting to see some operations slow down. I also can no longer issue commands like "grep foo *", as Unix complains that the command line is too long.
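One way around the command line limit is to never expand the wildcard at all and instead walk the directory one entry at a time. A rough sketch, assuming plain text files in a single flat directory; grep_dir is a hypothetical helper, not something I actually use:

```python
import os


def grep_dir(directory, needle):
    """Return sorted names of regular files in `directory` containing `needle`."""
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # skip subdirectories and other non-files
            with open(entry.path, encoding="utf-8", errors="replace") as handle:
                if any(needle in line for line in handle):
                    matches.append(entry.name)
    return sorted(matches)
```

Because the file names never pass through a shell, the number of files in the directory does not matter.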
Things I am exploring: putting comments in the same file as the parent. Splitting up the directories by date. Who knows, I might even move away from my sequential numbering scheme (a holdover from when I kept bidirectional synchronization with my Radio weblog).
In any case, I will likely try to keep the file system metadata in synch. Lots of fun things to play with.
Sadly, the mysterious 3rd time isn't "create time" at all, although many people seem to think it is. It's technically "inode change time". Exact semantics probably vary between Unixen, but I would expect that it may well change for something as simple as a rename operation, or a hard link.
Here's a Unix "best practice" from the heyday of eToys -- use the file system as a semi-balanced tree by creating N directories (you choose N based on what OS you're using), chosen to roughly balance out the load across all directories, and then stash the files in there.
For, e.g., SKUs: 494943.html goes in directory './49/49' as file '43.html.'
Several orders of magnitude in speed improvement, but you need to know about your inodes.
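The SKU mapping above can be sketched like this. The width and depth defaults simply reproduce the '49/49' example, and sharded_path is a hypothetical name; the right fan-out depends on your file system, as noted.

```python
import posixpath


def sharded_path(filename, root=".", width=2, depth=2):
    """Map a name like '494943.html' to a nested path like './49/49/43.html'."""
    stem, ext = posixpath.splitext(filename)
    if len(stem) <= width * depth:
        raise ValueError("name too short to shard: " + filename)
    # The first `depth` chunks of `width` characters become directories.
    parts = [stem[i * width:(i + 1) * width] for i in range(depth)]
    # Whatever is left of the stem, plus the extension, is the file name.
    rest = stem[width * depth:]
    return posixpath.join(root, *parts, rest + ext)
```

Calling sharded_path("494943.html") gives './49/49/43.html'. Note that names with little variety in their leading characters would pile into a few directories, so the balance depends on how the names are distributed.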