Introduction
Rationale
The current API proposals include a simple GET-based search entry point. More sophisticated interfaces are likely to appear; this document discusses some issues relating to these, and looks to how much these various interfaces can be harmonised.
Observations
-
an XML-based POST interface fits well with the rest of the project.
-
POST interfaces will be optional ("MAY" in standard RFC terminology), while the GET interface will likely be considered more important ("SHOULD").
-
Sharing vocabulary between all APIs would be good (eg: always using "title" to mean an entry's title, rather than having some APIs call it "title", some "headline", and so forth).
-
It should be possible to define relatively few search structures, being the layout of the data POSTed to the search URI, while allowing the search syntax to vary more. For instance, both XPath and a non-standard "My-Atom-Search-Syntax" might be used to construct details of filters, using the same search structure to lay these queries out.
-
All the APIs should be able to use the same XML format to return the search results.
-
The current proposed GET-based interface returns results by reference, but it is sometimes desirable to return the entries inline (for instance, see AggregatorApi). This issue has yet to be fully addressed here.
Work required
There are two levels of detail here, the first being more conceptual (questions like "should we do this?" and "where does this fit?"), and the second being more detailed.
Conceptual work
-
Working with the existing PieApi (as in RestEchoApiDiscuss) to see how far we should go towards the GET interface view we currently have here
-
If JoeGregorio's suggestions to make search unnecessary in the editing facet are adopted, we run on our own, and just need to ensure we meet the base needs of AggregatorApi (probably)
-
Deciding how discovery of search structures and syntaxes might work. Structures currently feel rather like separate entries in the introspection document, so do we need to extend the syntax of that to cover syntaxes, or do we make structure+syntax into a separate API entry in the introspection document? If the former, how does the Producer (ie the software implementing the API entry points) know which syntax is required?
Detailed work
-
Specifying how the feed XML maps onto a search vocabulary for structures and/or syntaxes that don't interrogate the XML directly
-
Specifying the structures fully
-
Specifying the syntaxes fully (XPath and XQuery are beyond our scope, but the GET syntax needs tying down)
-
Work on all structures to support both result-by-reference and result-inline (or for only some structures, if not all need them)
-
For structures that support result-by-reference, specify what the result XML looks like, including namespace
-
For structures that support partial result-inline (eg the document-centric structure), specify what the result XML looks like, including namespace (this may be the same as for result-by-reference)
Search vocabulary
The search vocabulary gives a mapping from elements and attributes of the data of a feed to textual nouns that can be used in searches.
Some search structures may not need a vocabulary separate from what the standard <feed> XML format defines (the document-centric structure presented below does not, for instance). The other search structures, including the simple GET interface, should use a shared vocabulary derived from, and as close as possible to, the XML format names. Those structures that do use the vocabulary may choose to restrict the nouns they make accessible.
As well as nouns derived from the data, meta-nouns may prove useful. An example from the draft API would be for the position of an entry in the entire feed, ordered by (decreasing) modification time. This might be called 'modification-position', or something equivalent but more terse. (In the current draft API this is pretty much the only noun, and so isn't given explicitly; since here we are considering richer searches, even with the GET interface, we prefer to make anything that might be searched against into an explicit noun or meta-noun.)
Proposed vocabulary mapping
A simple mapping from the XML data to textual nouns would be to take advantage of the element/attribute hierarchy, from the <feed> element down to the element or attribute in question. For instance, given the XML structure as follows:
<?xml version="1.0" encoding="utf-8"?>
<feed version="0.2" xmlns="http://purl.org/atom/ns#">
  <!-- required elements -->
  <title>dive into mark</title>
  <link>http://diveintomark.org/</link>
  <modified>2003-08-05T12:29:29Z</modified>
  <!-- optional elements -->
  <tagline>A lot of effort went into making this effortless</tagline><!-- 1 -->
  <id>tag:diveintomark.org,2003:3</id>
  <generator name="Movable Type">http://www.movabletype.org/?v=2.64</generator><!-- 2 name attribute -->
  <copyright>Copyright (c) 2003, Mark Pilgrim</copyright>
  <entry>
    <!-- required elements -->
    <title>Atom 0.2 snapshot</title>
    <link>http://diveintomark.org/2003/08/05/atom02</link>
    <id>tag:diveintomark.org,2003:3.2397</id>
    <issued>2003-08-05T08:29:29-04:00</issued>
    <modified>2003-08-05T18:30:02Z</modified><!-- 3 -->
    <!-- optional elements -->
    <created>2003-08-05T12:29:29Z</created>
    <summary>The Atom 0.2 snapshot is out. Here are some sample feeds.</summary>
    <author>
      <name>Mark Pilgrim</name>
      <url>http://diveintomark.org/</url><!-- 4 -->
      <email>f8dy@example.com</email>
    </author>
    <contributor>
      <name>Sam Ruby</name>
      <url>http://intertwingly.net/blog/</url>
      <email>rubys@example.com</email>
    </contributor>
    <contributor>
      <name>Joe Gregorio</name>
      <url>http://bitworking.org/</url>
      <email>joe@example.com</email>
    </contributor>
    <content type="application/xhtml+xml" mode="xml" xml:lang="en-us"><!-- 5 xml:lang attribute -->
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>The Atom 0.2 snapshot is out ... [snip]</p>
      </div>
    </content>
  </entry>
</feed>
we could map the commented elements and attributes to the following nouns:
-
tagline
-
generator@name
-
entry.modified
-
entry.author.url
-
entry.content@xml:lang
Note that there cannot be mappings for anything inside the <content> element, since that is the boundary of what our XML structure concerns itself with.
'.' was chosen as a separator for elements because it is readable and convenient, and we are unlikely to generate XML element names containing it.
'@' was chosen as a separator for attributes by analogy with XPath. '.' could be used instead, and this may prove clearer - the natural reading of 'generator@name' is almost exactly the opposite of the intention here.
This mapping does not consider extension elements or attributes in different namespaces. It is intended that search structures that do not need a vocabulary would be used in these cases.
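As a rough illustration of the proposed mapping, the following Python sketch walks a feed document and lists the nouns it yields. The function names (feed_nouns, _segment) are hypothetical, and several choices are assumptions rather than part of the proposal: attributes on the <feed> element itself are skipped, container elements such as 'entry.author' are listed as nouns alongside their children, and repeated elements (such as several <contributor> elements) produce a single noun.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"
CONTENT_TAG = f"{{{ATOM_NS}}}content"


def _segment(qualified_name):
    """Return the noun segment for a tag or attribute name, or None for extensions."""
    if not qualified_name.startswith("{"):
        return qualified_name  # unqualified attribute such as 'name'
    namespace, _, local = qualified_name[1:].partition("}")
    if namespace == ATOM_NS:
        return local
    if namespace == XML_NS:
        return "xml:" + local  # e.g. xml:lang
    return None  # other namespaces are outside the vocabulary


def feed_nouns(feed_xml):
    """List the nouns reachable from <feed>, using '.' for elements and '@' for attributes."""
    found = []

    def add(noun):
        if noun not in found:
            found.append(noun)

    def walk(element, path):
        if path:  # assumption: attributes of <feed> itself are not mapped
            for attribute in element.attrib:
                segment = _segment(attribute)
                if segment is not None:
                    add(f"{path}@{segment}")
        if element.tag == CONTENT_TAG:
            return  # <content> is the boundary; nothing inside it is mapped
        for child in element:
            segment = _segment(child.tag)
            if segment is None:
                continue
            child_path = f"{path}.{segment}" if path else segment
            add(child_path)
            walk(child, child_path)

    walk(ET.fromstring(feed_xml), "")
    return found
```

Run against the sample feed above, this would include 'tagline', 'generator@name', 'entry.modified', 'entry.author.url' and 'entry.content@xml:lang' among its results, and nothing below 'entry.content'.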
Need we consider anything outside <entry>?
There is an open question: is there anything outside the <entry> element itself we might want to filter on? With the current XML structure, it seems unlikely, but this may change. If not, clearly a common subpart of the mapped nouns could be dropped (eg: the 'entry.' prefix in the above proposal; those not sharing it would not be accessible anyway).
Search structures
GET query
The GET interface effectively gives a very simple search structure, where a (small) set of filters is provided, all of which must be satisfied for an entry to appear in the search results.
Each filter takes the form of a boolean-valued binary operator from the search syntax, acting on (the value of) a noun from the vocabulary and a literal value. For instance, you might filter using the contains operator, the 'entry.author.name' noun and the literal string 'James' to find all entries written by someone called James.
The filters are expressed in the GET query string as '&'-separated attribute=value pairs, with the attribute formed from the noun and operator, and the value being the literal value from the filter. The attribute name is constructed as 'search-' noun ['-' operator], with an omitted operator taken to be the equality operator.
For example, with a query API URI of /search?:
GET /search?search-modified-order-gt=15
GET /search?search-entry.title=My+entry
GET /search?search-entry.author.name-contains=James&search-entry.modified-gt=2003-08-05
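A Producer receiving such a query has to split each attribute name back into noun and operator. The Python sketch below shows one possible way to do this; the function name parse_search_query, the internal name 'eq' for the unnamed equality operator, and the fixed operator set are all assumptions for illustration. Because a noun may itself contain a hyphen (as in 'modified-order'), the sketch only treats the final hyphen-separated part as an operator when it is a known operator name.

```python
from urllib.parse import parse_qsl

# Assumed operator names; 'eq' stands in for the unnamed equality operator.
KNOWN_OPERATORS = {"contains", "gt", "lt"}
PREFIX = "search-"


def parse_search_query(query_string):
    """Turn a GET query string into a list of (noun, operator, literal) filters."""
    filters = []
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        if not name.startswith(PREFIX):
            continue  # not a search filter; ignore other parameters
        rest = name[len(PREFIX):]
        noun, separator, operator_name = rest.rpartition("-")
        if separator and operator_name in KNOWN_OPERATORS:
            filters.append((noun, operator_name, value))
        else:
            # No recognised operator suffix: the whole remainder is the noun.
            filters.append((rest, "eq", value))
    return filters


if __name__ == "__main__":
    print(parse_search_query(
        "search-entry.author.name-contains=James&search-entry.modified-gt=2003-08-05"
    ))
```

One consequence of this approach is that a noun ending in '-gt', '-lt' or '-contains' could not be searched with the equality operator, which may be worth bearing in mind when choosing noun names or the operator delimiter.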
Document-centric
The POST data could contain the skeleton of a <feed> element, but with the contents of the different searchable parts of the XML being filter expressions in the relevant search syntax.
An example, using XPath as the search syntax (derived from the "Atom 0.2 snapshot" example above):
<?xml version="1.0" encoding="utf-8"?>
<feed version="0.2" xmlns="http://purl.org/atom/ns#">
  <!-- Put no clause on the title -->
  <title />
  <link />
  <modified />
  <entry>
    <title />
    <author>
      <!-- Use the XPath 'contains' function to find all entries where the author's name contains 'James' -->
      <name>contains(., 'James')</name>
    </author>
    <!-- Get entries where 'modified' is newer than 5 August 2003 -->
    <modified>. > 2003-08-05</modified>
  </entry>
</feed>
Any elements provided in the POSTed search data would be supplied for entries returned in the search. For example, the above search might return:
<?xml version="1.0" encoding="utf-8"?>
<feed version="0.2" xmlns="http://purl.org/atom/ns#search-results">
  <title>My Feed</title>
  <link>http://example.com/myfeed/</link>
  <modified>2003-08-05T12:29:29Z</modified>
  <entry>
    <title>Mumble mumble title</title>
    <author>
      <name>James Aylett</name>
    </author>
    <modified>2003-08-05T18:30:02Z</modified>
  </entry>
  <entry>
    <title>Another entry</title>
    <author>
      <name>James Aylett</name>
    </author>
    <modified>2003-08-07T12:17:31Z</modified>
  </entry>
</feed>
(A different namespace needs to be used (which implies another Schema), because we may omit elements that the usual XML format declares as required.)
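To make the document-centric idea more concrete, here is a minimal Python sketch of how a Producer might read a POSTed skeleton: every leaf element is treated as requested in the results, and any leaf with non-empty text contributes a filter expression. The function name read_skeleton and the dotted path notation are assumptions for illustration; evaluating the filter expressions themselves (for example as XPath) is left out.

```python
import xml.etree.ElementTree as ET


def read_skeleton(skeleton_xml):
    """Return (requested_paths, filters) for a document-centric search skeleton.

    requested_paths lists every leaf element in document order;
    filters maps a leaf's path to its filter expression, if it has one.
    """
    requested, filters = [], {}

    def walk(element, prefix):
        for child in element:
            local_name = child.tag.split("}", 1)[-1]  # drop the namespace part
            path = f"{prefix}.{local_name}" if prefix else local_name
            if len(child):
                walk(child, path)  # container element: descend
                continue
            requested.append(path)
            expression = (child.text or "").strip()
            if expression:
                filters[path] = expression

    # ElementTree discards XML comments by default, so explanatory comments
    # in the skeleton do not appear as children here.
    walk(ET.fromstring(skeleton_xml), "")
    return requested, filters
```

Applied to a skeleton like the one above, this would report 'entry.author.name' and 'entry.modified' as filtered, and those plus 'title', 'link', 'modified' and 'entry.title' as requested in the results.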
Query-centric
A simpler structure than the above would focus on the query in the POST data. It needs to give the query itself (in whatever search syntax is chosen), and also potentially which elements of the feed, entry and so forth to supply in the results. This is more compact than the document-centric structure, and has the advantage of grouping all filters into one place.
An example, using XQuery as the syntax:
<?xml version="1.0" encoding="utf-8"?>
<search-entries xmlns="http://purl.org/atom/ns#search-query-centric">
  <query>//entry[fn:contains(author/name, "James") and modified gt "2003-08-05"]</query>
</search-entries>
For this structure, it appears much harder to find a convenient way of asking for only some elements to appear in the result document. It might be best to restrict this structure to just the query itself, perhaps with a choice of result-by-reference (as with the current GET interface in the draft API) and result-inline (as our examples here work).
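If the structure is restricted to just the query, a client has very little to build. The Python sketch below wraps a query string in the <search-entries> document shown earlier; the function name build_query_document is hypothetical, and the namespace is simply the one used in the example, not a settled value. ElementTree serialises the namespace with a generated prefix rather than as a default namespace, which is equivalent as far as XML namespaces are concerned.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example above; not a settled value.
QUERY_NS = "http://purl.org/atom/ns#search-query-centric"


def build_query_document(query):
    """Return the POST body (as bytes) for a query-centric search."""
    root = ET.Element(f"{{{QUERY_NS}}}search-entries")
    query_element = ET.SubElement(root, f"{{{QUERY_NS}}}query")
    query_element.text = query  # quotes and angle brackets are escaped on output
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    body = build_query_document(
        '//entry[fn:contains(author/name, "James") and modified gt "2003-08-05"]'
    )
    print(body.decode("utf-8"))
```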
Search syntaxes
Simple GET
The simple GET interface is designed to work with the GET query structure, and so need only define its operators. The following are expected to be required:
-
equality (unnamed operator)
-
contains
-
gt (greater than)
-
lt (less than)
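The four operators could be evaluated along the following lines. This Python sketch is only one possible reading: it assumes each entry has already been reduced to a mapping from noun to a single string value, that 'eq' names the unnamed equality operator, and that 'gt' and 'lt' compare numerically when both sides look like numbers and as plain strings otherwise (which happens to order ISO 8601 timestamps sensibly when they share a timezone). None of these comparison rules has been specified yet.

```python
import operator


def _ordered(compare):
    """Build a gt/lt test that prefers numeric comparison over string comparison."""
    def test(actual, literal):
        try:
            return compare(float(actual), float(literal))
        except ValueError:
            return compare(actual, literal)  # fall back to string ordering
    return test


# Assumed operator table; 'eq' stands in for the unnamed equality operator.
OPERATORS = {
    "eq": operator.eq,
    "contains": lambda actual, literal: literal in actual,
    "gt": _ordered(operator.gt),
    "lt": _ordered(operator.lt),
}


def matches(entry, filters):
    """True if the entry satisfies every (noun, operator, literal) filter.

    An entry that lacks a noun fails any filter on that noun.
    """
    return all(
        noun in entry and OPERATORS[op](entry[noun], literal)
        for noun, op, literal in filters
    )
```

For instance, an entry with 'entry.author.name' of 'James Aylett' and 'entry.modified' of '2003-08-05T18:30:02Z' would satisfy the pair of filters (contains 'James', gt '2003-08-05') used in the GET examples.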
XPath
XPath is a fairly natural syntax to consider for an XML-based format. An example was given above.
XQuery
XQuery is a fairly natural syntax to consider for an XML-represented database. An example was given above.
Discussion
[JamesAylett] I want to leave this all here to get comments, but my feelings at the moment are that for the vocabulary we should just consider within <entry>, use '.' to separate both attributes and elements, give access to everything for the GET interface, and start locking down metanouns. If necessary, we could come up with a way for the GET structure to complain that it doesn't recognise all your nouns (and return nothing). Given there isn't that much even in the maximal example of 0.2 on Mark's site, and I'm explicitly barring extensions from the vocabulary, this shouldn't be too onerous (but is really powerful).
I also think, after much consideration, that we really are going to need at least those three search structures, as they really seem to solve different types of problem.
-
[AsbjornUlsberg] In the GET API, I think we need to differentiate between the '-' that separates 'search' from what is being searched on, and the '-' that separates what is being searched on from the operator used. If we keep hyphens for the first, and replace the second with colons, we'd get something like /search?search-entry.author.name:contains=James&search-entry.modified:gt=2003-08-05. My point is only that the delimiter should be different for element words and element/action.
[JamesAylett] I personally think this is more confusing, not less, but perhaps I'm thinking of the different sections of the query variable name in different ways (in particular I don't see the 'search' as part of the element word). I think we probably need opinions from other people on this to resolve it ...
[AsbjornUlsberg] As we haven't (or have we?) settled on a naming standard of attributes and elements in Atom's data model, I don't think we can decide anything on this point either. But I feel that it's important that we don't use dash '-' as an element/action delimiter if it's also chosen as a word delimiter (instead of camelCasing and underscore_separating) in the data model. I don't think of the different sections of the query in different ways, either -- I think it's a "whole" which should be unified, indisputable and unconfusing. If the same delimiter character means different things in different contexts (though in the same query), I'll get confused. But maybe that's just me.
[JoeGregorio] Thoughts on finding entries to edit
[JamesAylett] Nice. Having the entryURI (ie a URI where the Pie XML representation is available) in the main feed document strikes me as a major step forward; advocating having it linked from the HTML version is even better. This might make search an unnecessary part of the editing facet of PieApi, so that search is a completely orthogonal facet.