It’s just data

libxml2 screams

Inside the libxml2 Python distribution are a few tests.  One of them is named, simply enough, xpath.  The purpose of this test apparently is to parse a small file, evaluate an XPath expression against it, clean up, and repeat this in a loop one thousand times.  This runs in subsecond time on my machine.
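
For readers who want to try a loop of this shape without the libxml2 bindings installed, here is a rough sketch.  It is not the xpath test from the libxml2 distribution: it swaps in Python's standard xml.etree.ElementTree (which supports only a subset of XPath), and the sample document, the path expression, and the function names are all made up for illustration.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical sample document; the input file used by the real test is not shown here.
SAMPLE = (
    "<doc>"
    "<entry id='1'><title>one</title></entry>"
    "<entry id='2'><title>two</title></entry>"
    "</doc>"
)


def run_once(xml_text, path):
    """Parse a small document and evaluate one path expression against it."""
    root = ET.fromstring(xml_text)
    return root.findall(path)


def benchmark(iterations=1000):
    """Repeat parse-and-query in a loop; return (total matches, elapsed seconds)."""
    start = time.perf_counter()
    matches = 0
    for _ in range(iterations):
        matches += len(run_once(SAMPLE, ".//entry/title"))
    return matches, time.perf_counter() - start


if __name__ == "__main__":
    matches, elapsed = benchmark()
    print(f"{matches} matches in {elapsed:.3f}s")
```

Timings from this stand-in say nothing about libxml2 itself; it only shows the structure of the measurement.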

What this leads me to conclude is that libxml2 is optimized for parsing lots of small files.  So I tested the theory by running a more realistic query against all of the weblog entries on my site.  The result was still subsecond.

Sweet.

That does not mean that I shouldn't migrate to an XML database, but merely that I don't need to do so today.

What it does mean is that I can spend my time thinking about what I want my URL space to look like and designing the schema I choose to expose.  Some things are obvious: it makes sense to have all of the structure exposed instead of obscured, and to use a date format that can easily be collated.

As far as the URL space goes, I want to make sure that the results are readily cacheable.  Thinking about the usage pattern, what I am likely to find is:

Given this usage pattern, it would seem that my existing cache exactly fits this requirement.  Sweet.

I'll probably play with this for a few days before I deploy it publicly.


It almost sounds like you're heading for an XML-ish blosxom, given the many-little-files scenario.  Neat!  (Of course, that's minus all the leveraging built-in unix metadata, but hey!)

Posted by l.m.orchard at

Les: that is the direction I am exploring.  To be honest, the built-in unix metadata is a mixed blessing.  For example, I just found out that created dates don't survive copies.
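
A small sketch of the effect, under some assumptions: it uses the modification time as a stand-in, since a true creation time is not portably available through os.stat on most Unix systems, and the file names and the helper function are invented for the example.  A plain copy gets a fresh timestamp, while a metadata-preserving copy (shutil.copy2, roughly what cp -p does) carries the old one across.

```python
import os
import shutil
import tempfile


def copy_and_compare(preserve):
    """Copy a file that has an old timestamp; return (source mtime, copy mtime)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "entry.txt")
        dst = os.path.join(tmp, "entry-copy.txt")
        with open(src, "w") as f:
            f.write("hello\n")

        old = 1_000_000_000  # an arbitrary timestamp in September 2001
        os.utime(src, (old, old))  # set access and modification times

        if preserve:
            shutil.copy2(src, dst)  # copies contents and file metadata
        else:
            shutil.copy(src, dst)   # copies contents and permission bits only

        return os.stat(src).st_mtime, os.stat(dst).st_mtime


if __name__ == "__main__":
    print("plain copy:     ", copy_and_compare(preserve=False))
    print("preserving copy:", copy_and_compare(preserve=True))
```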

Also, my current weblog source directory has over nine thousand files in it.  I'm starting to see some operations slow down.  I also can no longer issue commands like "grep foo *" as Unix complains that the command line length is too long.

Things I am exploring: putting comments in the same file as the parent.  Splitting up the directories by date.  Who knows, I might even move away from my sequential numbering scheme (a holdover from when I kept bidirectional synchronization with my Radio weblog).

In any case, I will likely try to keep the file system metadata in synch.  Lots of fun things to play with.

Posted by Sam Ruby at

$ ggrep -r foo .

:-)

Posted by James Aylett at

Or even:

# grep -r foo parent_dir

:-)

Posted by Adam Shand at

Mark Pilgrim : libxml2 screams...

Excerpt from HotLinks - Level 1 at

Sadly, the mysterious 3rd time isn't "create time" at all, although many people seem to think it is.  It's technically "inode change time".  Exact semantics probably vary between Unixen, but I would expect that it may well change for something as simple as a rename operation, or a hard link.

-Dom

Posted by Dominic Mitchell at

Here's a Unix "best practice" from the heyday of eToys -- use the file system as a semi-balanced tree by creating N directories (you choose N based on what OS you're using), chosen to roughly balance out the load across all directories, and then stash the files in there.

For, e.g., SKUs: 494943.html goes in directory './49/49' as file '43.html'.

Several orders of magnitude in speed improvement, but you need to know about your inodes.
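
One way to read the SKU example as code, as a sketch only: the function name and the width and levels parameters are assumptions made for illustration, with two levels of two characters each chosen to match './49/49' above.  The right values depend on the file system, as noted.

```python
import os


def shard_path(filename, width=2, levels=2):
    """Map a flat file name onto nested directories built from its leading characters.

    The first `levels` chunks of `width` characters become directory names;
    whatever is left of the name (plus the extension) becomes the file name.
    """
    stem, ext = os.path.splitext(filename)
    prefix_len = width * levels
    if len(stem) <= prefix_len:
        raise ValueError(f"name too short to shard: {filename!r}")
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(".", *parts, stem[prefix_len:] + ext)


if __name__ == "__main__":
    print(shard_path("494943.html"))  # ./49/49/43.html on a POSIX system
```

Splitting on the leading characters keeps related names together; if the names cluster badly (for example, sequential IDs that all share a prefix), splitting on the trailing characters or on a hash of the name would spread the load more evenly.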

Posted by Wilhelm at
