If you look around, you can find feeds whose titles contain escaped
HTML markup and whose content is plain text. However, your
application may very well want plain-text titles and HTML
content. One of the basic premises of Atom is that such data
toHtml will introduce
references as appropriate for special characters and symbols,
enabling Unicode to be handled safely even by non-Unicode enabled applications.
toString will strip all markup and convert HTML entity
references to their equivalent Unicode characters.
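
As a rough illustration of the two conversions described above, here is a minimal sketch in present-day Python 3 rather than the Python 2.3 this was written for. The names to_html and to_string are hypothetical stand-ins, not the actual toHtml and toString code, and the use of numeric character references for non-ASCII characters is an assumption.

```python
import html
from html.parser import HTMLParser


def to_html(text):
    """Escape markup characters, then replace every non-ASCII character
    with a numeric character reference so the result is 7-bit ASCII.
    (Hypothetical helper, not the original toHtml.)"""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


class _TextOnly(HTMLParser):
    """Collects text nodes and discards all tags."""

    def __init__(self):
        # convert_charrefs=True resolves entity references in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_string(markup):
    """Strip all markup and resolve entity references to Unicode.
    (Hypothetical helper, not the original toString.)"""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(to_html("façade"))                              # fa&#231;ade
print(to_string("<b>fa&ccedil;ade</b> &amp; more"))   # façade & more
```

A title destined for a plain-text list would go through to_string, while content handed to a consumer that cannot be trusted with Unicode would go through to_html.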
is an update to an earlier
demonstrating this support. In particular, as titles are
displayed in a list, they need to be plain text. Also note
the rendering of the word "façade" in
While this was developed on, and targeted for, Python 2.3, two
patches are included to enable this to run on Python 2.2.
"non-Unicode enabled applications" should be fixed, if any exists. What possible reason is there for accepting the HTML-encoded version of a high-bit character but not the character itself? In fact, we should strive to get rid of this HTML-encoding altogether -- it only makes everything more complicated and harder-to-use.
I agree with Aaron. This looks dirty. When a string can be represented as XML character data ("text node"), it is pointless to introduce additional escaping. The XML spec requires XML processors to support UTF-8 and UTF-16. An app that is supposed to handle XML and can't even pass through Unicode as opaque strings can be considered broken. (Putting HTML entity references in XML source generally does not work, because the entities would have to be declared in the DTD and the XML processor would have to process the DTD, so I'm assuming the intent is to first escape the data using pseudo-entity references and then to use this string as the text node value on the parsed level so that additional escaping takes place when writing to an XML file.)
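
The point about undeclared entities can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree in Python 3, which is an assumption about tooling and not something used in this thread; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(source):
    """Return True if the source is accepted as well-formed XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# A named HTML entity with no DTD declaring it is rejected.
print(parses("<title>fa&ccedil;ade</title>"))
# A numeric character reference needs no declaration.
print(parses("<title>fa&#231;ade</title>"))
# The literal character is fine as well.
print(parses("<title>façade</title>"))
```

Only the first form fails, which is why named HTML entities cannot simply be dropped into XML source.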
I simply installed Python 2.3 on Windows. In any case, it appears that Python still has a ways to go to handle Unicode consistently and completely. Until then, there still is some value in 7 bit ASCII equivalent HTML representations of Unicode.
re: xmltramp, from my admittedly limited testing, it doesn't appear to handle CDATA correctly. Here are two examples:
xmltramp is returning a unicode string, which you're trying to parse as an XML document. The Python XML parser converts this to a string in the default encoding for it. Apparently your default encoding doesn't support Unicode characters, so it throws an error.
This works if your default encoding is something enlightened, like UTF-8. But really the Python XML parser should convert Unicode strings to UTF-8, not the default encoding (since XML's default encoding is UTF-8 and it supports all Unicode characters), so the bug is there.
I could include a workaround for this (have the parse function convert its argument to UTF-8), but I don't think I'm going to since a) it's not my fault, b) it's fixable by the user without modifying xmltramp, and c) it won't come up very often. Let me know if you disagree.
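
For reference, the user-side workaround mentioned above (converting the argument to UTF-8 before parsing) might look roughly like this. It is a sketch in Python 3 using xml.dom.minidom, not xmltramp itself, and parse_text is a hypothetical name. In Python 3 the parser already accepts Unicode strings, so the explicit encode only mirrors the idea.

```python
import xml.dom.minidom


def parse_text(source):
    """Parse XML given as text or bytes, always handing the parser UTF-8
    bytes instead of relying on any default encoding.
    (Hypothetical wrapper, not part of xmltramp.)"""
    if isinstance(source, str):
        source = source.encode("utf-8")
    return xml.dom.minidom.parseString(source)


doc = parse_text("<title>façade</title>")
print(doc.documentElement.firstChild.data)
```

Bytes without an XML declaration are read as UTF-8, XML's default, so the non-ASCII character survives the round trip.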
I'm not sure what you're getting at. I was referring to applications that supported escapes but not UTF-8, not applications that support UTF-8 but not Python's internal decoded Unicode memory structure, which isn't even a real string, let alone an XML file.
The input to xmltramp.parse is a file? I thought it was a string.
Note that the user of xmltramp does not need to know anything about SAX or StringIO. These are implementation choices of xmltramp.
What should one expect as the output of __repr__(1)? Valid XML? What is the default encoding for XML? What is the default encoding for Python? The former can be controlled. The latter can be detected.
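
A small sketch of "controlled" versus "detected", again in Python 3 with xml.dom.minidom, which is an assumption and not xmltramp's own serialization: the encoding of serialized XML is chosen explicitly, while Python's default encoding is simply queried.

```python
import sys
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<title>façade</title>")

# Controlled: ask for UTF-8 explicitly; the result is bytes that begin
# with an XML declaration naming that encoding.
data = doc.toxml(encoding="utf-8")
print(data)

# Detected: Python reports its own default string encoding.
print(sys.getdefaultencoding())
```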
= = =
Meta comment: in a perfect world, there would only be perfect tools. xmltramp is certainly above average, and yet it was not difficult to find several bugs in a short period of time - either directly attributable to xmltramp itself or indirectly via its dependencies.
Meanwhile, "enabling Unicode to be handled safely even by non-Unicode enabled applications" is, IMHO, a necessity. Because non-Unicode enabled applications do exist. And, yes, they should be fixed. But until then, things like toHtml, toString, and html2xml are valuable things to have.
Note: for consistency with minidom, I'll probably rename these methods to ashtml and asstring. And add asxml.