Parsing of undecoded UTF-8 will give garbage when decoding entities

July 20th, 2007 at 10:16 am

I’m using HTML::TreeBuilder to analyze HTML files. On some pages, although the analysis works fine, the following warning message is printed:

Parsing of undecoded UTF-8 will give garbage when decoding entities at D:/Perl/site/lib/HTML/TreeBuilder.pm line 96.

The warning comes from the HTML::Parser module, which TreeBuilder uses under the hood. After searching the web and the newsgroups, I found this solution: run the page contents through decode_utf8 (from the Encode module) before passing them to TreeBuilder, so that the parser receives decoded characters rather than raw UTF-8 bytes:


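use strict;
use warnings;
use HTML::TreeBuilder;
use Encode qw(decode_utf8);

my $contents = ...; # HTML webpage contents (raw UTF-8 bytes)

# Decode the bytes into Perl characters before handing them to the parser
my $htree = HTML::TreeBuilder->new_from_content(decode_utf8($contents));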
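
To see what decode_utf8 changes, here is a minimal sketch that uses only the core Encode module. The sample string is a made-up stand-in for a fetched page, not real page data, and the variable names are my own. It compares the length of the raw bytes with the length of the decoded string:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical sample: the raw UTF-8 bytes of a short snippet containing
# two accented letters (each stored as two bytes) and one HTML entity.
my $bytes = "caf\xC3\xA9 &amp; cr\xC3\xA8me";

# decode_utf8 interprets the bytes as UTF-8 and returns a character string.
my $chars = decode_utf8($bytes);

# Each two-byte sequence collapses into a single character after decoding.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# utf8::is_utf8 reports whether Perl has marked the string as characters.
print "marked as characters: ", (utf8::is_utf8($chars) ? "yes" : "no"), "\n";
```

The decoded string is shorter than the byte string because every multi-byte sequence becomes one character. That decoded form is what the fix above passes to new_from_content.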

One Response to “Parsing of undecoded UTF-8 will give garbage when decoding entities”

  1. Nothus Says:

    Yup, that did the trick for me! Thanks. It seems that MediaWiki pages like to spit out UTF-8!