If you look around, you can find feeds with titles with escaped
HTML markup, and content that is plain text. However, your
application may very well want plain text titles, and HTML
content. One of the basic premises of Atom is that such data
will be unambiguously identified.
toHtml will introduce HTML entity
references as appropriate for special characters and symbols,
enabling Unicode to be handled safely even by non-Unicode enabled
applications.
toString will strip all markup and convert HTML entity
references to their equivalent Unicode characters.
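
The two conversions described above can be sketched roughly as follows. This is a minimal illustration in modern Python, not the code from wx3pa.py: the names to_html and to_string are hypothetical stand-ins for toHtml and toString, and the sketch assumes that only non-ASCII characters need entity references.

```python
from html.entities import codepoint2name
from html.parser import HTMLParser


def to_html(text):
    """Replace non-ASCII characters with HTML entity references.

    Hypothetical stand-in for toHtml: named entities are used where
    HTML defines one, numeric character references otherwise.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                      # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])
        else:
            out.append("&#%d;" % cp)
    return "".join(out)


class _TextOnly(HTMLParser):
    """Collects character data and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entity references into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_string(markup):
    """Strip markup and decode entity references (stand-in for toString)."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()                              # flush any buffered text
    return "".join(parser.parts)
```

With these definitions, to_html turns the ç in "façade" into &ccedil;, and to_string turns it back while dropping any surrounding tags.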
wx3pa.py is an update to an earlier wx3pa.py demonstrating this
support. In particular, as titles are displayed in a list, they need
to be plain text. Also note the rendering of the word "façade" in
this post.
While this was developed on, and targeted for, Python 2.3, two
patches are included to enable this to run on Python 2.2.
"non-Unicode enabled applications" should be fixed, if any exist. What possible reason is there for accepting the HTML-encoded version of a high-bit character but not the character itself? In fact, we should strive to get rid of this HTML-encoding altogether -- it only makes everything more complicated and harder-to-use.
Unicode support is a little bit like accessibility. Most developers (or managers) do not see the benefits in supporting it. If the world were completely English, we would be just fine with ASCII ;)
I agree with Aaron. This looks dirty. When a string can be represented as XML character data ("text node"), it is pointless to introduce additional escaping. The XML spec requires XML processors to support UTF-8 and UTF-16. An app that is supposed to handle XML and can't even pass through Unicode as opaque strings can be considered broken. (Putting HTML entity references in XML source generally does not work, because the entities would have to be declared in the DTD and the XML processor would have to process the DTD, so I'm assuming the intent is to first escape the data using pseudo-entity references and then to use this string as the text node value on the parsed level so that additional escaping takes place when writing to an XML file.)
Henri, I'm not sure, but you might be mixing two things. If you look at Aaron's existing rss feed, what you will see is instances of things like:
<![CDATA[“]]>
As an aside, I am curious as to how Aaron would represent the entire CDATA string above if it were contained inside the content of his feed, as CDATA doesn't nest.
HTMLifying and unHTMLifying — is this what we've made of the Web? It's not HTML or CSS or XML, it's the crap we stuff between the brackets. Purpose.......
Sam, I tried that snippet on 2.2 and 2.3 and got the same error on both (ASCII encoding error: ordinal not in range(128)) -- are you sure you just didn't declare a defaultencoding of UTF-8 for 2.3?
(I feel that UTF-8 should be the default encoding, but I'm not sure that I would really consider doing that "improved Unicode support" in a serious way.)
To answer your second question, xmltramp is smart enough to encode that as '<![CDATA[&#8220;]]&gt;'.
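
For readers following the CDATA question, here is a small sketch of two generic ways to carry a literal CDATA string as element content. It uses only the Python standard library, not xmltramp, and cdata_wrap is a hypothetical helper name; nothing here is a claim about how xmltramp works internally.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

literal = "<![CDATA[\u201c]]>"   # the string we want to carry as content


def cdata_wrap(text):
    """Wrap text in CDATA, splitting any ']]>' across two sections.

    Hypothetical helper: CDATA sections do not nest, so the terminator
    is broken up so that no single section contains it.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def content_of(xml_text):
    """Parse a one-element document and return its full character data."""
    doc = minidom.parseString(xml_text.encode("utf-8"))
    return "".join(node.data for node in doc.documentElement.childNodes)


# Option 1: ordinary escaping of &, < and >
escaped_doc = "<content>%s</content>" % escape(literal)

# Option 2: CDATA with the terminator split across sections
cdata_doc = "<content>%s</content>" % cdata_wrap(literal)

round_trip_escaped = content_of(escaped_doc)
round_trip_cdata = content_of(cdata_doc)
```

Both documents parse back to the same character data, which is the point: the choice between escaping and CDATA is a serialization detail.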
I simply installed Python 2.3 on Windows. In any case, it appears that Python still has a ways to go to handle Unicode consistently and completely. Until then, there still is some value in 7 bit ASCII equivalent HTML representations of Unicode.
re: xmltramp, from my admittedly limited testing, it doesn't appear to handle CDATA correctly. Here are two examples:
xmltramp is returning a unicode string, which you're trying to parse as an XML document. The Python XML parser converts this to a string in the default encoding for it. Apparently your default encoding doesn't support Unicode characters, so it throws an error.
This works if your default encoding is something enlightened, like UTF-8. But really the Python XML parser should convert Unicode strings to UTF-8, not the default encoding (since XML's default encoding is UTF-8 and it supports all Unicode characters), so the bug is there.
I could include a workaround for this (have the parse function convert its argument to UTF-8), but I don't think I'm going to since a) it's not my fault, b) it's fixable by the user without modifying xmltramp, and c) it won't come up very often. Let me know if you disagree.
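
The workaround mentioned above (converting the argument to UTF-8 before parsing) might look roughly like this. It is a sketch in modern Python built on xml.sax directly; parse_text and _TextCollector are hypothetical names, not part of xmltramp, and it assumes the document carries no encoding declaration that contradicts UTF-8.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class _TextCollector(ContentHandler):
    """Gathers all character data seen during the parse."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(source):
    """Parse XML given as text or bytes and return its character data.

    Text input is encoded to UTF-8 first, so the parser never depends
    on the interpreter's default encoding.
    """
    if isinstance(source, str):
        source = source.encode("utf-8")
    handler = _TextCollector()
    xml.sax.parseString(source, handler)
    return "".join(handler.chunks)
```

Calling parse_text on a string containing curly quotes should then behave the same regardless of how the interpreter is configured.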
I'm not sure what you're getting at. I was referring to applications that supported escapes but not UTF-8. Not applications that support UTF-8 but not Python's internal decoded Unicode memory structure, which isn't even a real string, let alone an XML file.
The input to xmltramp.parse is a file? I thought it was a string.
Note that the user of xmltramp does not need to know anything about SAX or StringIO. These are implementation choices of xmltramp.
What should one expect as the output of __repr__(1)? Valid XML? What is the default encoding for XML? What is the default encoding for Python? The former can be controlled. The latter can be detected.
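
As a rough illustration of "controlled" versus "detected": the sketch below uses xml.dom.minidom rather than xmltramp, and serialize is a hypothetical helper. The XML encoding is chosen explicitly at serialization time, while Python's default is only read.

```python
import sys
from xml.dom import minidom


def serialize(doc, encoding="utf-8"):
    """Serialize a minidom document to bytes with an explicit encoding.

    minidom writes a matching XML declaration and falls back to numeric
    character references for characters the encoding cannot represent.
    """
    return doc.toxml(encoding=encoding)


# Detected: whatever the interpreter reports as its default
python_default = sys.getdefaultencoding()

# Controlled: the encoding of the XML we emit
doc = minidom.parseString("<title>\u201cfa\u00e7ade\u201d</title>".encode("utf-8"))
as_utf8 = serialize(doc)
as_ascii = serialize(doc, "ascii")
```

The ASCII output is the kind of 7-bit representation discussed in this thread: the same document, with the non-ASCII characters written as numeric references.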
= = =
Meta comment: in a perfect world, there would only be perfect tools. xmltramp is certainly above average, and yet it was not difficult to find several bugs in a short period of time - either directly attributable to xmltramp itself or indirectly via its dependencies.
Meanwhile, "enabling Unicode to be handled safely even by non-Unicode enabled applications" is, IMHO, a necessity. Because non-Unicode enabled applications do exist. And, yes, they should be fixed. But until then, things like toHtml, toString, and html2xml are valuable things to have.
Note: for consistency with minidom, I'll probably rename these methods to be ashtml and asstring. And add asxml.