It’s just data

HTMLifying and unHTMLifying

If you look around, you can find feeds with titles with escaped HTML markup, and content that is plain text.  However, your application may very well want plain text titles, and HTML content.  One of the basic premises of Atom is that such data will be unambiguously identified.
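
A minimal sketch of what "unambiguously identified" buys you, written against Python 3's standard library rather than the Python 2.3 this post targets. The (kind, value) pairing and the helper names as_text and as_html are assumptions made for illustration; they are not the atomef.py API.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects character data and drops the tags around it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def as_text(kind, value):
    """Return plain text for a value declared as 'text' or 'html'."""
    if kind == "text":
        return value
    if kind == "html":
        parser = _TextExtractor()
        parser.feed(value)
        parser.close()
        return "".join(parser.parts)
    raise ValueError("unknown kind: %r" % kind)


def as_html(kind, value):
    """Return HTML markup for a value declared as 'text' or 'html'."""
    if kind == "text":
        # Plain text must be escaped before it is dropped into markup.
        return html.escape(value, quote=False)
    if kind == "html":
        return value
    raise ValueError("unknown kind: %r" % kind)
```

With the kind declared, a title carrying escaped markup can be flattened for a list widget with as_text("html", title), and plain text content can be made safe for an HTML view with as_html("text", content). Neither direction requires guessing.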

atomef.py extends lazydom.py with two methods:

wx3pa.py is an update to an earlier wx3pa.py demonstrating this support.  In particular, as titles are displayed in a list, they need to be plain text.  Also note the rendering of the word "façade" in this post.

While this was developed on, and targeted for, Python 2.3, two patches are included to enable this to run on Python 2.2.


"non-Unicode enabled applications" should be fixed, if any exists. What possible reason is there for accepting the HTML-encoded version of a high-bit character but not the character itself? In fact, we should strive to get rid of this HTML-encoding altogether -- it only makes everything more complicated and harder-to-use.

Posted by Aaron Swartz at

Unicode support is a little bit like accessibility. Most developers (or managers) do not see the benefits in supporting it. If the world were completely English, we would be just fine with ASCII ;)

Posted by Claude at

Python's support for Unicode is improving.  Note that the following doesn't work in Python 2.2:

from atomef import unescape
print unescape('Señor González')

I'm pleased to say that this does work in 2.3.

Posted by Sam Ruby at

I agree with Aaron. This looks dirty. When a string can be represented as XML character data ("text node"), it is pointless to introduce additional escaping. The XML spec requires XML processors to support UTF-8 and UTF-16. An app that is supposed to handle XML and can't even pass though Unicode as opaque strings can be considered broken. (Putting HTML entity references in XML source generally does not work, because the entities would have to be declared in the DTD and the XML processor would have to process the DTD, so I'm assuming the intent is to first escape the data using pseudo-entity references and then to use this string as the text node value on the parsed level so that additional escaping takes place when writing to an XML file.)

Posted by Henri Sivonen at

s/pass though/pass through/ in the comment above

Posted by Henri Sivonen at

Henri, I'm not sure, but you might be mixing two things.  If you look at Aaron's existing RSS feed, you will see instances of things like:

<![CDATA[&#8220;]]>

As an aside, I am curious as to how Aaron would represent the entire CDATA string above if it were contained inside the content of his feed, as CDATA doesn't nest.

Posted by Sam Ruby at
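
On the nesting question above: one common workaround (not necessarily what any tool in this thread does) is to split every "]]>" across two adjacent CDATA sections. The helper name cdata is hypothetical, and the sketch uses Python 3's ElementTree only to check the result.

```python
import xml.etree.ElementTree as ET


def cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so each section stays valid."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# The string from the comment above, carried as literal text.
inner = "<![CDATA[&#8220;]]>"
doc = "<content>" + cdata(inner) + "</content>"

# A conforming parser joins the two sections back into the original string.
recovered = ET.fromstring(doc).text
```

The other option is to skip CDATA and escape the markup characters, which is the route the next comment describes.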

Closing All My Tags

HTMLifying and unHTMLifying — is this what we've made of the Web? It's not HTML or CSS or XML, it's the crap we stuff between the brackets. Purpose....... [more]

Trackback from Crushing Blow at

Sam, I tried that snippet on 2.2 and 2.3 and got the same error on both (ASCII encoding error: ordinal not in range(128)) -- are you sure you didn't just declare a defaultencoding of UTF-8 for 2.3?

(I feel that UTF-8 should be the default encoding, but I'm not sure that I would really consider doing that "improved Unicode support" in a serious way.)

To answer your second question, xmltramp is smart enough to encode that as '&lt;![CDATA[&amp;#8220;]]&amp;gt;'.

Posted by Aaron Swartz at

I simply installed Python 2.3 on Windows.  In any case, it appears that Python still has a ways to go to handle Unicode consistently and completely.  Until then, there still is some value in 7-bit ASCII-equivalent HTML representations of Unicode.

re: xmltramp, from my admittedly limited testing, it doesn't appear to handle CDATA correctly.  Here are two examples:

xmltramp.parse("<x><![CDATA[>]]></x>").__repr__(1)

u'<x>></x>'

xmltramp.parse("<x>]]&gt;</x>").__repr__(1)

u'<x>]]&amp;gt;</x>'

Posted by Sam Ruby at

The first is correct AFAIK. The second is a bug due to my misreading the spec; I've fixed it and released 2.13.

Posted by Aaron Swartz at

Indeed, the first is correct.  I've rerun the test with (what I intended) "<![CDATA[<]]>" and the output is (correctly) "&lt;".  This is with 2.12.

Where is 2.13 posted?

Posted by Sam Ruby at
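
For comparison, the same CDATA test cases can be run through a parse-then-serialize round trip with Python 3's ElementTree. This is a sketch of the expected behaviour, not a statement about xmltramp; the function name roundtrip is made up here.

```python
import xml.etree.ElementTree as ET


def roundtrip(source):
    """Parse an XML string and serialize it again as text."""
    return ET.tostring(ET.fromstring(source), encoding="unicode")


# CDATA is only a way of writing text; the parser hands back the bare character.
gt_text = ET.fromstring("<x><![CDATA[>]]></x>").text

# On output a literal '<' has to be escaped again.
lt_out = roundtrip("<x><![CDATA[<]]></x>")

# ']]&gt;' decodes to ']]>' and must not be double-escaped on the way out.
end_out = roundtrip("<x>]]&gt;</x>")
```

ElementTree also writes a bare ">" as "&gt;", which is stricter than required: outside of the "]]>" sequence, an unescaped ">" is legal in character data.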

Oops, didn't check the error code of my upload. It's there now:

http://www.aaronsw.com/2002/xmltramp/xmltramp.py

Posted by Aaron Swartz at

It doesn't look like attributes are escaped:

xmltramp.parse('<x a="&lt;"/>').__repr__()

u'<x a="<"></x>'

Posted by Sam Ruby at

Oops, I forgot to integrate a patch I received for that. Fixed in 2.14

Posted by Aaron Swartz at

I like that 2.14 has "better Unicode".  Here's another bug report:

xmltramp.parse("""<x a='"'/>""").__repr__()

u'<x a="""></x>'

Posted by Sam Ruby at
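
Both attribute cases above come down to choosing a quote character and escaping whatever conflicts with it. The standard library's xml.sax.saxutils.quoteattr does this; the wrapper empty_element below is a hypothetical helper for illustration (Python 3).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def empty_element(name, **attrs):
    """Serialize an empty element, quoting each attribute value safely."""
    parts = [name]
    for key, value in sorted(attrs.items()):
        # quoteattr escapes &, < and > and picks a quote style that fits.
        parts.append("%s=%s" % (key, quoteattr(value)))
    return "<" + " ".join(parts) + "/>"


lt_case = empty_element("x", a="<")
quote_case = empty_element("x", a='"')
```

When the value contains a double quote but no single quote, quoteattr switches to single quotes rather than emitting "&quot;", so the second case serializes back to the form it was parsed from.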

Yeah, there was a subtlety having to do with __str__ and __unicode__ that I didn't know about before.

OK, fixed in 2.15.

Posted by Aaron Swartz at

New test case:

xmltramp.parse('<a xmlns="http://a"><b xmlns="http://b"/></a>')

Posted by Sam Ruby at

Wow, that's a major annoyance in the SAX spec. Fixed in 2.16.

Posted by Aaron Swartz at
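
The thread doesn't say what went wrong with the nested namespace case. As a reference point only, this is how Python 3's ElementTree reads the same document: the inner default namespace declaration overrides the outer one for b.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a xmlns="http://a"><b xmlns="http://b"/></a>')

# ElementTree spells a qualified name as '{namespace-uri}localname'.
tags = [element.tag for element in root.iter()]
```

Here tags comes out as ['{http://a}a', '{http://b}b'].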

New test case:

doc='<title>Postel&#8217;s Law Has No Exceptions</title>'
xmltramp.parse(xmltramp.parse(doc).__repr__(1))

Posted by Sam Ruby at

xmltramp is returning a unicode string, which you're trying to parse as an XML document. The Python XML parser converts this to a string in the default encoding for it. Apparently your default encoding doesn't support Unicode characters, so it throws an error.

This works if your default encoding is something enlightened, like UTF-8. But really the Python XML parser should convert Unicode strings to UTF-8, not the default encoding (since XML's default encoding is UTF-8 and it supports all Unicode characters), so the bug is there.

I could include a workaround for this (have the parse function convert its argument to UTF-8), but I don't think I'm going to since a) it's not my fault, b) it's fixable by the user without modifying xmltramp, and c) it won't come up very often. Let me know if you disagree.

Posted by Aaron Swartz at
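
The user-side workaround described above (encode to UTF-8 before handing the string back to the parser) might look like the following. This is a sketch in Python 3 using ElementTree in place of xmltramp, so the calls differ from the test case; the function name reparse is an assumption.

```python
import xml.etree.ElementTree as ET


def reparse(doc):
    """Parse, serialize to text, then parse again from explicit UTF-8 bytes."""
    first = ET.fromstring(doc)
    serialized = ET.tostring(first, encoding="unicode")
    # Encoding explicitly avoids any dependence on a process-wide default.
    return ET.fromstring(serialized.encode("utf-8"))


doc = "<title>Postel&#8217;s Law Has No Exceptions</title>"
title = reparse(doc).text
```

With no encoding declaration in the bytes, an XML parser assumes UTF-8, which is why the explicit encode is enough.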

"non-Unicode enabled applications" should be fixed, if any exists.

Posted by Sam Ruby at

I'm not sure what you're getting at. I was referring to applications that supported escapes but not UTF-8. Not applications that support UTF-8 but not Python's internal decoded Unicode memory structure which isn't even a real string, let alone an XML file.

Posted by Aaron Swartz at

The input to xmltramp.parse is a file?  I thought it was a string.

Note that the user of xmltramp does not need to know anything about SAX or StringIO.  These are implementation choices of xmltramp.

What should one expect as the output of __repr__(1)?  Valid XML?  What is the default encoding for XML?  What is the default encoding for Python?  The former can be controlled.  The latter can be detected.

= = =

Meta comment: in a perfect world, there would only be perfect tools.  xmltramp is certainly above average, and yet it was not difficult to find several bugs in a short period of time - either directly attributable to xmltramp itself or indirectly via its dependencies.

Meanwhile, "enabling Unicode to be handled safely even by non-Unicode enabled applications" is, IMHO, a necessity.  Because non-Unicode enabled applications do exist.  And, yes, they should be fixed.  But until then, things like toHtml, toString, and html2xml are valuable things to have.

Note: for consistency with minidom, I'll probably rename these methods to be ashtml and asstring.  And add asxml.

Posted by Sam Ruby at
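
One way to produce a 7-bit representation that a non-Unicode application can pass along untouched is the xmlcharrefreplace error handler, which turns each non-ASCII character into a numeric character reference. A minimal sketch in Python 3; the name to_ascii_html is hypothetical and is not the toHtml method mentioned above.

```python
import html


def to_ascii_html(text):
    """Escape markup characters, then replace non-ASCII with &#NNN; references."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


safe = to_ascii_html("fa\u00e7ade")

# An HTML-aware consumer gets the original character back.
restored = html.unescape(safe)
```

The result contains only ASCII, so it survives any pipeline that is merely 7-bit clean, and it round-trips through html.unescape.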

You've found 3 bugs in my support of XML and 1 in my support of SAX. If anything, that merely reinforces my opinion that XML sucks.

The input to parse is an XML file in a Python string.

If you disagree with the API of __repr__(1), I'm happy to fix that.

I'm still waiting for you to point me to an app that requires your encoding job.

And your HTML conversion system munged up your link with <em> tags.

Posted by Aaron Swartz at

Page up.  Click on wx3pa.py.

Posted by Sam Ruby at

This seems to work:

header = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">'
def utf8ize(t):
    # Prepend a charset declaration and encode the unicode text as UTF-8.
    return header + t.encode('utf-8')

Posted by Aaron Swartz at

I just tried it, using the March 2003 version of wx3pa against your RSS 1.0 feed.  Selecting the item "What A Little Bug Can Do" results in:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 24: ordinal not in range(128)

Posted by Sam Ruby at
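
The byte 0xe2 in that traceback is consistent with UTF-8 encoded punctuation (a curly quote such as U+2019, for instance) being decoded as ASCII. Which character the feed item actually contained is an assumption; the sketch below (Python 3) reuses the title from an earlier test case just to reproduce the shape of the error.

```python
# UTF-8 bytes for a string containing U+2019 (right single quotation mark).
raw = "Postel\u2019s Law".encode("utf-8")

try:
    raw.decode("ascii")
    failed_at = None
except UnicodeDecodeError as err:
    # err.start is the offset of the first byte ASCII cannot represent.
    failed_at = err.start

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("utf-8")
```

The failure goes away once the bytes are decoded with their real encoding instead of whatever the default happens to be.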

Quick links

Sam Ruby: HTMLifying and unHTMLifying The Apple Store (U.S.): Belkin iPod Voice Recorder: The iPod is no longer read-only DrBacchus' Journal: Fluid dynamics EFF: MP3 Caper (Produced by BookmarkBlogger.)...

Excerpt from 0xDECAFBAD at

