It’s just data

Clean-utf-8-for-XML

clean_utf8_for_xml.c

Is the 0x8F on line 99 meant to be 0xBF? This is not a loaded question; I have no idea. It just jumped out at me because the right-hand sides of the blocks above and below are all 0xBF.

Posted by Jon Dowland at

Apparently, that’s the “top” of the Unicode range, in the so-called “plane 16”.  Take a look at the regular expression here.
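To illustrate why 0x8F rather than 0xBF appears in that one place, here is a minimal sketch of a four-byte UTF-8 check. It is not taken from clean_utf8_for_xml.c; the function name is_valid_utf8_4byte is hypothetical. Unicode stops at U+10FFFF, so when the lead byte is 0xF4 the second byte may only range from 0x80 to 0x8F, whereas the other lead bytes allow up to 0xBF.

```c
/* Hypothetical helper for illustration only; not part of
 * clean_utf8_for_xml.c.  Returns 1 if the four bytes at s form a
 * well-formed four-byte UTF-8 sequence, 0 otherwise. */
int is_valid_utf8_4byte(const unsigned char *s)
{
    unsigned char lo, hi;   /* allowed range for the second byte */

    if (s[0] == 0xF0) {
        /* U+10000..U+3FFFF: a second byte below 0x90 would be overlong */
        lo = 0x90; hi = 0xBF;
    } else if (s[0] >= 0xF1 && s[0] <= 0xF3) {
        /* U+40000..U+FFFFF: full continuation range */
        lo = 0x80; hi = 0xBF;
    } else if (s[0] == 0xF4) {
        /* U+100000..U+10FFFF (plane 16): capped at 0x8F */
        lo = 0x80; hi = 0x8F;
    } else {
        return 0;           /* not a valid four-byte lead byte */
    }

    if (s[1] < lo || s[1] > hi) return 0;
    if (s[2] < 0x80 || s[2] > 0xBF) return 0;
    if (s[3] < 0x80 || s[3] > 0xBF) return 0;
    return 1;
}
```

With this check, the bytes F4 8F BF BF (U+10FFFF) are accepted, while F4 90 80 80 is rejected because it would encode a value past the end of the Unicode range.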

Posted by Sam Ruby at

Are you willing for this code to be treated as in the public domain? Because it might be handy. :)

Posted by Paul Findlay at

Paul, I am comfortable with this code being treated as covered by either the MIT License or the Apache License, Version 2.0.  If neither of these is acceptable to you, let me know and I will see what I can do to accommodate.

Posted by Sam Ruby at

My fast XML generation library genx (see [link]) has routines genxCheckText and genxScrubText - the latter brutally removes anything that’s not either well-formed UTF-8 or a valid XML character.

Posted by Tim Bray at

Diff for clean compile on OS X (Darwin 8.7.1):

--- clean_utf8_for_xml.c.orig   2006-07-04 07:57:28.000000000 -0500
+++ clean_utf8_for_xml.c        2006-07-05 21:25:13.000000000 -0500
@@ -1,4 +1,4 @@
-#include <malloc.h>
+#include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
 
@@ -11,7 +11,7 @@
  * At a minimum, XML markup characters needs to be escaped.
  *
  * In the normal case, this code does nothing more than a quick scan of
- * the input, and returns it back.  If, however, it finds something amis
+ * the input, and returns it back.  If, however, it finds something amiss
  * it will allocate another block of memory and attempt to correct a few of
  * the most common errors.  If this occurs, it is the callers responsibility
  * to free the block that was allocated.


Posted by Paul Smith at

GentleCMS Development Log: Part 3

The extract method is basically done. I’m sure it could be improved a bit more, but it seems to be fairly effective. I added a few extra features beyond the original URI class’s capabilities, such as supplying a base uri to resolve...

Excerpt from Sporkmonger at

    if (*in == 0x09 && *in == 0x0A && *in == 0x0D) {
        *c++ = *in;
    } else {

This looks to me as if it should be

    if (*in == 0x09 || *in == 0x0A || *in == 0x0D) {
        *c++ = *in;
    } else {

As for using it in my own projects: are the MIT and Apache licenses compatible with the GPL? What do I need to do to make it "right"?
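As a rough illustration of the && versus || point: a single byte can never equal 0x09, 0x0A and 0x0D at the same time, so the && form is always false and tab, line feed and carriage return would never be copied through. The sketch below is an assumption about how such a filter might look, not the code from clean_utf8_for_xml.c. The names is_allowed_control and drop_forbidden_controls are hypothetical, and simply dropping the other control bytes is an illustrative choice.

```c
#include <stddef.h>

/* Hypothetical helper: the only bytes below 0x20 that XML 1.0 allows
 * are tab, line feed and carriage return.  Note the ||, not &&. */
int is_allowed_control(unsigned char b)
{
    return b == 0x09 || b == 0x0A || b == 0x0D;
}

/* Hypothetical filter: copies n bytes from in to out, skipping control
 * bytes below 0x20 that XML 1.0 forbids.  Bytes at or above 0x20 are
 * copied unchanged (no UTF-8 validation is attempted here).
 * out must have room for n bytes.  Returns the number of bytes written. */
size_t drop_forbidden_controls(const unsigned char *in, size_t n,
                               unsigned char *out)
{
    unsigned char *c = out;
    size_t i;

    for (i = 0; i < n; i++) {
        if (in[i] >= 0x20 || is_allowed_control(in[i])) {
            *c++ = in[i];
        }
    }
    return (size_t)(c - out);
}
```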

Posted by Christian Forster at

Christian: good catch. Patches by both you and Paul have been applied.  I’ve also added an explicit MIT/X11 license header to the code.  The FSF has deemed this license to be GPL compatible.

Posted by Sam Ruby at
