Unescaping HTML in Python

Posted by Michael Giarlo on August 01, 2008

Dear Future Me,

You've forgotten how to decode (or unescape) HTML or XML in Python again, haven't you?  My, my, that old age does catch up with you.

Well, it turns out that xml.sax.saxutils.unescape() works like a charm.  I'm certain that edge cases lurk here and there, so caveat, um, coder.
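
For Future Me's benefit, here is a minimal sketch of the default behavior. The sample strings are made up for illustration; the only library call is xml.sax.saxutils.unescape() itself, which out of the box decodes &amp;, &lt;, and &gt; and nothing else.

```python
from xml.sax.saxutils import unescape

# Hypothetical sample input, not taken from any real document.
escaped = "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"
decoded = unescape(escaped)
print(decoded)

# The ampersand is replaced last, so doubly escaped text
# is only unwrapped by one level.
once = unescape("&amp;lt;")
print(once)
```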

UPDATE: Edge case found. By default, unescape() only decodes &amp;, &lt;, and &gt;. It will not touch &apos; or &quot; unless you pass them in as extra entities:
import xml.sax.saxutils
xml.sax.saxutils.unescape("<p>This is &quot;markup&quot;</p>", {"&apos;": "'", "&quot;": '"'})
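
To save some typing, the extra entities can be wrapped up once and reused. This is only a sketch: the names QUOTE_ENTITIES and unescape_with_quotes are my own inventions, not part of the standard library, and the sample string is made up.

```python
from xml.sax.saxutils import unescape

# Hypothetical constant: the two quote entities unescape() skips by default.
QUOTE_ENTITIES = {"&apos;": "'", "&quot;": '"'}


def unescape_with_quotes(text):
    """Hypothetical helper: unescape() plus the two quote entities."""
    return unescape(text, QUOTE_ENTITIES)


sample = "&lt;p&gt;It&apos;s &quot;markup&quot;&lt;/p&gt;"

# Without the extra mapping, the quote entities are left alone.
without_extras = unescape(sample)

# With the mapping, they are decoded along with the defaults.
with_extras = unescape_with_quotes(sample)

print(without_extras)
print(with_extras)
```

On Python 3.4 and later, html.unescape() in the standard library decodes all named and numeric character references, so it may be the simpler choice when the input is HTML rather than XML.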

Comments

  1. gsf Fri, 01 Aug 2008 15:30:15 UTC

    Don't forget that, as noted at http://wiki.python.org/moin/EscapingXml, you can pass in additional entities like so:

    [pre]
    >>> from xml.sax.saxutils import unescape
    >>> unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
    '\' "'
    [/pre]

