Categorization - Atom Wiki

Categories are an important component, and available in most blogging systems. How should we support them in Atom?

Convergence on this is important, as it affects both what a WellFormedEntry contains, and the AtomApi.

[ZhangYining RefactorOk] Categorization is also a way (perhaps the way) for a reader to subscribe to only those of a weblogger's feeds that they are interested in reading.

A single category for an entry: optional. Multiple categories for an entry: optional.

It's possible that many blogs (esp. personal blogs) do not want or have any categories.
Some webloggers blog in multiple languages, and readers may want to subscribe only to the feed written in a language they understand. xml:lang might not help here unless the Atom reader provides features that look at xml:lang and do the filtering.
Many webloggers (techies, for example) blog about both technical and personal topics; some readers might be interested only in the tech-related ones.

NormanWalsh writes, "My entries are also divided into broad categories and have subjects. The distinction between category and subject is a bit vague, but it's roughly where does the essay fit into the general framework of the universe of things described by my blog as a whole (category) and what interesting things, people, places, events, etc., are mentioned by particular log entries (subjects)."

In that context, one might consider a "category" like a table-of-contents and "subjects" like an index.

Blosxom has a hierarchy of categories, mirroring the filesystem/URL namespace. Title of the post is quite different from category.

Upcoming MovableTypePro allegedly also has hierarchical categories. B2Evolution has hierarchical categories too.

MovableType allows an entry to be associated with multiple categories. I believe B2Evolution and RadioUserland can do this too.

[RuiCarmo RefactorOK] here's a thought: wiki entries can be categorized by references/backlinks to other entries. Why not sets of interrelated entries, instead of fixed categories? PhpWiki has a SubPages concept, but I found it lacking, so I implemented SeeAlso. I've never actually needed categories since.

Yes. Give incremental clustering of entries at least as much status as predefined categories. This is another use for Containers.

[AdriaanTijsseling RefactorOK] Categories must be included. A SeeAlso is certainly useful, but being able to categorize data is intrinsic to human nature. Plus, it is already widespread in most blogs. Best to have hierarchical categorization with the possibility of multiple categories.

[ChristianCrumlish, RefactorOk] Where does the idea of categories fit into this model? Clearly categories are optional, but even if you have them, they can be conceptually handled in different ways. For example, in the Radio model, the default "Home" category can be unapplied, and the interface encourages multiple-category application. In the MT model, there is no equivalent of "not on the home page" and the interface encourages single-category application. I don't know if these distinctions have ramifications for the data model, but since I'm interop/interchange-advocate boy, I'm wondering.

[MattMower RefactorOk] My RadioUserland weblog uses categories as a way of routing content to different weblogs (e.g. my public blog, a test blog and an intranet blog). As implemented by UserLand I find categories unsuitable for use as an organising tool on a single weblog. This is because they have to be created in advance, have no relationships to each other or the weblog and duplicate information.

Instead we have developed a client (k-collector) that allows each post to be associated with multiple topics from a shared topic set. These topics are then presented in the RSS (via the ENT extension) for use in filtering & routing posts. At the moment our topics are hierarchical, based upon a type (e.g. Person, Place, Thing), but that may change to allow multiple levels of hierarchy. The way we present topics in ENT is designed to encourage their being backed by an XTM topic map which further defines the topic and its relations.

[DavidEngel, RefactorOk] There seems to be a good deal of support for internal categorization. I'm interested in knowing how external categorization / linkage would work (the hinted at DMOZ, LOC).

[Skware, RefactorOk] I think it's worth thinking of internal categories as attributes or keywords that are associated with the entry, rather than the entry being associated with a category. In UML speak, I'm saying that an Entry has a category attribute, rather than that a Category contains entries. This leaves the problem of indexing and collecting categories as an external part of the spec.

[DiegoDoval RefactorOk] Movable Type, for example, uses categoryIDs. The categoryIDs are then attributes of the entries. Setting category info as attributes of an entry reflects common usage today. By using a category ID we'd be adding a level of indirection, enough to let applications handle the relationship as they require.
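
The indirection described here can be sketched roughly as follows. This is only an illustration, not Movable Type's actual data model: the `Category` and `Entry` classes, their field names, and the sample labels are all hypothetical. The entry carries category IDs as plain attributes, and the application decides how to resolve them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    # Hypothetical category record; the field names are illustrative only.
    id: str
    label: str


@dataclass
class Entry:
    title: str
    # The entry stores only category IDs, not the categories themselves.
    category_ids: list[str] = field(default_factory=list)


def labels_for(entry: Entry, categories: dict[str, Category]) -> list[str]:
    """Resolve an entry's category IDs to labels, skipping unknown IDs."""
    return [categories[cid].label for cid in entry.category_ids if cid in categories]


# Made-up sample data: ID "99" is deliberately not defined anywhere.
categories = {
    "1": Category(id="1", label="Tech"),
    "2": Category(id="2", label="Personal"),
}
entry = Entry(title="Hello", category_ids=["1", "2", "99"])
print(labels_for(entry, categories))
```

Because the entry holds only IDs, a category can be renamed or given extra structure without touching any entry that refers to it.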

[JeffreyWinter, RefactorOk] I have always thought that the XBEL format represented an interesting means of providing direct categorization, and an ability to directly manage categories via an HTTP/XML API. I've written up some thoughts on the subject here. Something similar could be considered here, although this type of hierarchical representation may strike some as too complicated. I find it pretty valuable myself :).

[ArveBersvendsen, RefactorOk] When we're looking at categorization, we should also look into TopicMapping

[NicholasAvenell, RefactorOk] I'm tempted to look at Categories as another relationship, as TrackBack and SeeAlso would be. This is partly selfish, because my blogging system can relate an entry to a category through a phrase that is separate from the title. For example, the entry "Introducing the ESF Specification" would relate to the category "ESF" by the phrase "Original announcement". That makes a simple "this belongs to this category" methodology too simple.

[ZhangYining, RefactorOk]

<feed>
  ...
  <categories>
    <category>
      <id>1</id>
      <description>ESF</description>
      <link>http://link/to/ESF/site</link>
    </category>
  </categories>
  ...
  <entry>
    ...
    <title>...</title>
    <relation>
      <relationRole>http://purl.org/atom/category</relationRole>
      <relationTitle>Original Announcement</relationTitle>
      <relationHref>1</relationHref>
    </relation>
  </entry>
  ...
</feed>

[AdriaanTijsseling RefactorOK] Can we start building from this proposal? It looks decent and readable to me.
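
As a rough illustration of how a consumer might read the relation-based proposal, the sketch below resolves each category relation against the feed-level category list. It assumes a well-formed version of the example, and it fills the elided title with the entry name from NicholasAvenell's comment. The function name `entry_categories` is made up for this sketch, and none of the element names are settled Atom syntax.

```python
import xml.etree.ElementTree as ET

# Role URI taken from the proposal; it marks a relation as a category link.
CATEGORY_ROLE = "http://purl.org/atom/category"

FEED_XML = """
<feed>
  <categories>
    <category>
      <id>1</id>
      <description>ESF</description>
      <link>http://link/to/ESF/site</link>
    </category>
  </categories>
  <entry>
    <title>Introducing the ESF Specification</title>
    <relation>
      <relationRole>http://purl.org/atom/category</relationRole>
      <relationTitle>Original Announcement</relationTitle>
      <relationHref>1</relationHref>
    </relation>
  </entry>
</feed>
"""


def entry_categories(feed):
    """Yield (entry title, relation title, category description) triples."""
    # Index the feed-level categories by their <id>.
    by_id = {
        cat.findtext("id"): cat.findtext("description")
        for cat in feed.findall("./categories/category")
    }
    for entry in feed.findall("entry"):
        for rel in entry.findall("relation"):
            if rel.findtext("relationRole") != CATEGORY_ROLE:
                continue  # some other kind of relation, e.g. a TrackBack
            href = rel.findtext("relationHref")
            # by_id.get returns None when the href matches no known category.
            yield entry.findtext("title"), rel.findtext("relationTitle"), by_id.get(href)


feed = ET.fromstring(FEED_XML.strip())
results = list(entry_categories(feed))
print(results)
```

The relation title ("Original Announcement") travels with the link, which is what distinguishes this from a bare "entry belongs to category" flag.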

Dublin Core

DublinCore has a "category" definition, which they call "subject":

Comment: Typically, Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

DublinCore also defines several qualified subjects, such as Dewey Decimal Classification and Library of Congress Subject Headings. We can define our own qualifications or, more likely, leave it open where one provides a URI of a provider of categories and then keywords or identifiers drawn from that provider.

The following examples show terms in the Atom namespace that would be derived from elements in DublinCore, i.e. 'subject' IS-A 'dc:subject'. 'provider' is not a DublinCore attribute, but one that we would define to support our needs.

An unqualified subject, like just a keyword, might look like:

  <entry xmlns="uri/of/Atom">
       :
       :
    <subject>reverberation</subject>
    <subject>resounding</subject>
  </entry>

An entry using keywords and DMOZ:

  <entry xmlns="uri/of/Atom">
       :
       :
    <subject>reverberation</subject>
    <subject provider="http://dmoz.org/">Arts:Movies:Titles:L:Looking for an Atom</subject>
  </entry>

An entry describing a location:

  <entry xmlns="uri/of/Atom">
    <title>In the City of New York</title>
       :
    <subject provider="http://geourl.org/">40.7650070, -73.9861298</subject>
  </entry>

For this example, see also GeoLocation.
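
A reader of such entries might separate plain keywords from provider-qualified subjects along these lines. This is a minimal sketch that assumes the placeholder namespace "uri/of/Atom" and the proposed `provider` attribute from the examples above. `read_subjects` is a hypothetical helper, not part of any Atom library.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "uri/of/Atom"  # placeholder namespace used in the examples above

ENTRY_XML = """
<entry xmlns="uri/of/Atom">
  <subject>reverberation</subject>
  <subject provider="http://dmoz.org/">Arts:Movies:Titles:L:Looking for an Atom</subject>
</entry>
"""


def read_subjects(entry):
    """Return (provider, value) pairs; provider is None for plain keywords."""
    return [
        (el.get("provider"), (el.text or "").strip())
        for el in entry.findall(f"{{{ATOM_NS}}}subject")
    ]


entry = ET.fromstring(ENTRY_XML.strip())
subjects = read_subjects(entry)
for provider, value in subjects:
    print(provider or "(keyword)", "->", value)
```

Note that the value is treated as an opaque string here; interpreting "Arts:Movies:..." or a latitude/longitude pair is left to whatever understands the named provider.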

[DannyAyers] I do like the above use of DC but I think we need to support as wide a range of categorisation mechanisms as possible (DC, ENT, TMs, RDF etc.). One solution would be to include a <metadata> element that can contain any valid XML which would refer to its parent element (see also ExtraInterop).

[AsbjornUlsberg] Calling an element "metadata" is kind of silly, as everything in Atom except the text inside <content> is metadata. An <addinfo> (additional information) element, or something in that direction would be better, imho.

The name of the element wouldn't really be an issue I don't think - metadata just seemed the obvious choice, and it's in use already in SVG. - Danny

[AdamRice] Categorization is an interesting-but-hairy problem. There are many different schemes, all of which Echo should, ideally, accommodate. Let's see:

Simple, author-defined ("this entry is in the sushi category").
Multiple, author-defined ("this entry is in the sushi and favorite restaurants categories").
Hierarchical ("businesses: restaurants: sushi"). Multiple-hierarchical is also possible.
External: this would be working from a list/hierarchy of categories someone else has created in the interest of uniformity. This would require a reference to where that scheme is defined. This is actually a limited case of the next scheme.
Author-defined/external mapping: my categories "sushi," "pizza," and "barbecue" all correspond to a commonly defined "restaurants" category for the purposes of aggregation.

This is just a first stab at defining the different categorization schemes--I'm sure others can think of more. FWIW, I consider keywords and categories related but not identical (so does Movable Type, for that matter). Perhaps once we nail down how things are categorized we can nail down a syntax for representing categories. This also suggests publishing the author's personal categorization scheme as a reference point. It might be a GoodThing for Echo to provide a structure for doing so, but not to require it anytime anyone wanted to use categories.

Working from this, I'll suggest

Each category or hierarchical category wrapped in a separate <subject>
Hierarchical tiers indicated through colons (or some other reserved character). Ideally parsers will recognize hierarchy by that; less desirable would be to use an attribute in the subject tag.
Externally derived categories get difficult to represent, but there are three key pieces of information: a link to a definition for that categorization scheme, how this entry is categorized according to that scheme, and how this entry is categorized according to the author's personal scheme. The DC <subject provider="..."> notation gets us two out of three. If we can put IDs into the subject tags, then we might be able to get away with using two subject tags for the same thing, and put <subject link="..."> in the local category name to associate back to the universal category name, but this is hairy and I don't like it.
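
The colon-delimited suggestion could be handled by a parser roughly as follows. This is a sketch under the assumption that ":" is the reserved character and that whitespace around tiers is insignificant. Both helper names are invented for illustration.

```python
from __future__ import annotations


def split_hierarchy(subject: str, sep: str = ":") -> list[str]:
    """Split a delimited subject into tiers, trimming whitespace around each tier."""
    return [tier.strip() for tier in subject.split(sep) if tier.strip()]


def is_under(subject: str, ancestor: str, sep: str = ":") -> bool:
    """True if `subject` equals `ancestor` or sits below it in the hierarchy."""
    tiers = split_hierarchy(subject, sep)
    prefix = split_hierarchy(ancestor, sep)
    return tiers[: len(prefix)] == prefix


print(split_hierarchy("businesses: restaurants: sushi"))
print(is_under("businesses: restaurants: sushi", "businesses:restaurants"))
print(is_under("businesses: pizza", "businesses: restaurants"))
```

One consequence of this approach is that the reserved character can no longer appear inside a category name unless some escaping rule is also defined.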

[BrianMcCallister] XML is hierarchical! I would suggest representing categorical hierarchies via nested elements rather than delimited tokens. I would further specify that categories can be freely manipulated by anyone in the stream of getting you the feed. In other words, an aggregator is free to re-categorize, remove subjects, change subjects, etc. Categorization is a suggestion.
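
For comparison, a nested-element representation might be walked like this. The markup is purely hypothetical: the comment above does not specify element or attribute names, so the nested `<subject name="...">` form and the `leaf_paths` helper are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested form of "businesses: restaurants: sushi" and a sibling.
NESTED_XML = """
<subject name="businesses">
  <subject name="restaurants">
    <subject name="sushi"/>
    <subject name="pizza"/>
  </subject>
</subject>
"""


def leaf_paths(node, prefix=()):
    """Yield the full path of names for every leaf <subject> under `node`."""
    path = prefix + (node.get("name"),)
    children = node.findall("subject")
    if not children:
        yield path
    for child in children:
        yield from leaf_paths(child, path)


root = ET.fromstring(NESTED_XML.strip())
paths = list(leaf_paths(root))
print(paths)
```

Nesting avoids a reserved delimiter character and lets several leaves share their ancestors, at the cost of a more verbose entry.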

[JakobVoss] Category is a must but do not make it too complicated nor undefined. There are only two cases:

* freely created by the author (Keywords)
* fixed folders to choose from (Categories)

A keyword is just a string and only the author really knows what is meant by it. A category must be defined somewhere, so you have to provide a URI/URL. Categories that are not related to any controlled vocabulary or formal classification scheme are only stupid keywords.

  <entry xmlns="uri/of/Atom">
    <title>You have to see this movie!</title>
       :
    <keyword>Star Trek XVI</keyword>
    <category provider="http://dmoz.org/">Arts:Movies</category>
    <category provider="http://myblog.org/myowncategory/">recommendations</category>
  </entry>

And do not try to model hierarchical categories on this level! A category is a category no matter how it is related to other categories (hierarchical, oppositional, related...). You do not want to model all these relations.
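
The keyword/category split proposed here could be enforced by a consumer along the following lines. This is a sketch only: it assumes the placeholder namespace and the `keyword`, `category` and `provider` names from the example above, and `classify` is a hypothetical helper. The category value is deliberately kept as an opaque string, with no attempt to interpret any hierarchy in it.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "uri/of/Atom"}  # placeholder namespace from the example above

ENTRY_XML = """
<entry xmlns="uri/of/Atom">
  <title>You have to see this movie!</title>
  <keyword>Star Trek XVI</keyword>
  <category provider="http://dmoz.org/">Arts:Movies</category>
  <category provider="http://myblog.org/myowncategory/">recommendations</category>
</entry>
"""


def classify(entry):
    """Return (keywords, categories); categories are (provider, value) pairs."""
    keywords = [k.text for k in entry.findall("atom:keyword", NS)]
    categories = []
    for cat in entry.findall("atom:category", NS):
        provider = cat.get("provider")
        if not provider:
            # Under this proposal a category without a defining URI is not a category.
            raise ValueError(f"category {cat.text!r} has no provider URI")
        categories.append((provider, cat.text))
    return keywords, categories


entry = ET.fromstring(ENTRY_XML.strip())
keywords, categories = classify(entry)
print(keywords)
print(categories)
```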

CategoryExtension, CategoryMetadata, CategoryModel, CategoryRss