It’s just data

Urlnorm

urlnorm.py

Passes the tests defined in PaceCanonicalIds.  Passes all but three of the tests defined in MNot's urlnorm.py, as I interpret the specs differently for these three.

Only exercised significantly for http URIs.

Testcases welcome.


License?

Posted by Mark at

I want to check with mnot first.  Default_ports and the second set of tests are the only substantial reuse from his codebase.  My preference is the Python License.

Posted by Sam Ruby at

Hi Sam,

Interesting. Given that you're shooting for RFC2396bis, which gets rid of separate path params, you should probably use urlparse.urlsplit instead of urlparse.urlparse (and likewise urlunsplit instead of urlunparse).
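
For reference, the difference in Python 2's urlparse module (which urlnorm.py uses):

  import urlparse

  # urlparse() pulls the old ;params component out of the path:
  print urlparse.urlparse('http://example.com/a;x=1?q=2#f')
  # -> scheme='http', netloc='example.com', path='/a', params='x=1',
  #    query='q=2', fragment='f'

  # urlsplit() leaves params in the path, matching RFC2396bis:
  print urlparse.urlsplit('http://example.com/a;x=1?q=2#f')
  # -> scheme='http', netloc='example.com', path='/a;x=1',
  #    query='q=2', fragment='f'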

This line:
  (auth,host,port)=re.search('([^@]*@)?([^:]*):?(.*)',auth).groups()
seems to find (userinfo, host, port); is that what you meant?

Similarly what's going on here?
  if auth=="@": auth=""
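
Checking what that pattern actually captures (the sample netlocs here are made up):

  import re

  for netloc in ['user:pw@example.com:8080', 'example.com', '@example.com']:
      print re.search('([^@]*@)?([^:]*):?(.*)', netloc).groups()
  # -> ('user:pw@', 'example.com', '8080')
  # -> (None, 'example.com', '')
  # -> ('@', 'example.com', '')

So an empty userinfo comes through as a bare '@', which I assume is what that check is catching.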

WRT atom:id, why go through all this when you can just do a lexical compare on the strings; if they're just IDs, why do they need to be normalised at this level?
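
(By a lexical compare I mean a plain string match, i.e.:

  # Equivalent as http URIs, but distinct as opaque id strings:
  print 'HTTP://Example.com:80/' == 'http://example.com/'   # False

which is fine so long as a publisher never varies how an id is written.)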

WRT license, just ack me and link to the original; license yours however you like. I'm also amenable to folding changes back in if you like.

Posted by Mark Nottingham at

Possibly this is outside of the scope you're imagining for the function, but in the context of normalizing urls 'in the wild', it would be useful to strip() whitespace and line returns from the url as a whole, and perhaps from the parsed components as well.  This would resolve cases like:

"http:// www.mysite.com" and

"""http://www.mysite.com

"""

Posted by Phil McCluskey at

This is going to be difficult since your comment parser will munge these, but here goes:

[link]
[link]
[link]
[link]

Is your intention to produce an ultra-liberal URL normalizer?  If not, you'll need some defensive code to guard against invalid URLs, such as ones that include unescaped high-bit characters.

Posted by http://diveintomark.org/ at

Wow, that didn't work at all.  Let's try without the scheme.  These are all http URLs:

@example.com/
:@example.com/
127.0.0.1/
127.0.0.1:80/

Posted by Mark at

OK, I've updated urlnorm based on the feedback above.

The way http://:@example.com/ was handled previously was a bug; it is now normalized to http://example.com/.  Unless I'm missing something, http://127.0.0.1/ is correctly normalized.
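
In test-case form, here is what I now expect (illustrative only; urlnorm() is assumed as the entry point, and the actual function name may differ):

  tests = [
      ('http://:@example.com/',  'http://example.com/'),
      ('http://@example.com/',   'http://example.com/'),
      ('http://example.com:80/', 'http://example.com/'),
      ('http://127.0.0.1/',      'http://127.0.0.1/'),
  ]
  for given, expected in tests:
      # urlnorm() is assumed as the normalizing entry point
      assert urlnorm(given) == expected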

I'd like non-ASCII characters in URIs to be handled the same way a typical query works: these characters are escaped.
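
Roughly like this (a sketch; the exact pipeline in urlnorm may differ):

  import unicodedata
  from urllib import quote

  # A non-ASCII path segment, escaped the way a typical query is:
  # NFC-normalize, UTF-8-encode, then percent-escape.
  path = u'/caf\xe9'
  print quote(unicodedata.normalize('NFC', path).encode('utf-8'))
  # -> /caf%C3%A9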

My comment parser won't munge URIs found in code.

Posted by Sam Ruby at

I reckon maybe canonicalization isn't such a good idea after all. Too fiddly. I didn't think so before, but after looking at the source, there's an awful lot for someone to get wrong in those 200 loc, and then what if there's any minor change to RFC 2396? Unless someone's prepared to maintain a public domain repository of normalizers for every language under the sun, the additional complexity is asking for non-compliant feeds.

Posted by Danny at

Danny, well then perhaps you shouldn't look at the source.  ;-)

Seriously: can you name one URI on your site that is not normalized?  Actually, looking at your feed, I can name exactly one: http://dannyayers.com, and that particular one is not likely to appear as an entry id.

My plan is to add this code to the feedvalidator; whether that results in error, warning, or informational messages is still to be decided.  So, I'm quite prepared to worry about the "fiddly bits", but I seriously doubt that many other people will have to.

P.S.  Most of the 200 loc are comments, tests, and blank lines.

Posted by Sam Ruby at

Should also replace backslashes in the path with slashes:
  r"http://example.com\test.html" --> "http://example.com/test.html"
  r"http://example.com/a\test.html" --> "http://example.com/a/test.html"
  r"http://example.com\a\test.html" --> "http://example.com/a/test.html"

Posted by Anonymous at

No, backslashes should be translated to %5C.
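
That is, escape the character rather than reinterpret it (sketch):

  from urllib import quote

  # Backslash is not a legal URI character; escaping preserves the
  # data instead of guessing it was meant as a path separator:
  print quote('http://example.com\\a\\test.html', safe=':/')
  # -> http://example.com%5Ca%5Ctest.html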

Posted by Sam Ruby at

Looks good so far, but you could add a "bozo" mode to the parser that handles backslashes and unintended spaces as requested above. Just a thought.

Posted by Asbjørn Ulsberg at

This is what I meant when I asked about whether you were planning on making this an ultra-liberal normalizer.  There are lots of goofy things you can do to URLs that work in IE, or particular versions of Netscape, or something.

Posted by Mark at

There are a number of "goofy" things that are legal.  My intent is not to mimic IE, Netscape, etc., but to faithfully implement the rules listed in the initial comment in this source file.  Of course, people are free to compare the output of this function with the input to see if anything changed, and may choose to make value judgments based on this.  In fact, when I originally wrote this function, it was my intent for the feedvalidator to do exactly that.

Suggestions for new rules, comments on the existing rules, and testcases are all welcome.

Posted by Sam Ruby at

Hey, this is a nice article! I took the mentioned test cases (and some more) and adjusted them to my own URL "normalizer". But it is more a URL "fixer" than a normalizer, so RFC compliance is not guaranteed :) For example, I replace backslashes with slashes in the path part (as suggested above); this fixes some broken Windows-ish URL paths. Implementation (as url_norm()) and unit tests.
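
The backslash part of the fixer boils down to this (sketch; deliberately lossy and not RFC-compliant):

  def fix_backslashes(url):
      # Treat every backslash as a misplaced path separator.
      return url.replace('\\', '/')

  print fix_backslashes(r'http://example.com\a\test.html')
  # -> http://example.com/a/test.html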

Posted by Calvin at

Hey, Sam. Mind if we use this in iPodder?

Regards,
Garth.

Posted by Garth T Kidd at

Garth: you are welcome to do so.  My understanding is that the Python license is GPL compatible.

Posted by Sam Ruby at

Answer by cobra libre for How can I normalize a URL in python

Because this page is a top result for Google searches on the topic, I think it’s worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports,...

Excerpt from How can I normalize a URL in python - Stack Overflow at

In my environment (Python 2.5.1, urllib 1.17), I had one offending http:// URL:

fer-martin.com/flying-suit-up/

which raised an exception:

TypeError: decoding Unicode is not supported

I was passing the URLs to be normalised as url.encode('utf-8'), so they arrive as <type 'str'>, but this one kept coming back as <type 'unicode'> even after .encode().  Something was going wrong in the unquote(string) call.  I changed clean() as follows:

  def clean(string):
      string = unquote(string)
      if type(string) == type(unicode()):
          # unquote() handed back a unicode object here; the
          # unicode(string, 'utf-8', ...) call below would then raise
          # "TypeError: decoding Unicode is not supported", so force
          # it down to a UTF-8 byte string first.
          print 'changing from unicode to str'
          string = string.encode('utf-8')
          string = str(string)

      string = unicode(string, 'utf-8', 'replace')
      return unicodedata.normalize('NFC', string).encode('utf-8')

Posted by Manos at

URL Normalization in Clojure

Bandwidth is often one of the first bottlenecks you’ll hit when web crawling. So, it’s in your best interest to crawl each page only once (ignoring recrawls). In order to know that you’ve already crawled a page you need to keep an...

Excerpt from X-Combinator at
