I'm using HTML::TreeBuilder to analyze HTML files. On some pages, although the analysis works fine, the following warning message is printed:
Parsing of undecoded UTF-8 will give garbage when decoding entities at D:/Perl/site/lib/HTML/TreeBuilder.pm line 96.
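The warning comes from the HTML::Parser module, which TreeBuilder uses under the hood. It is issued when the parser is handed raw UTF-8 bytes rather than decoded Perl characters while entity decoding is enabled: any entity that expands to a character above code point 127 would then end up mixed with still-encoded bytes. I searched the web and the newsgroups and found this solution, which the HTML::Parser documentation also recommends: before passing the page contents to TreeBuilder, run them through decode_utf8: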

use HTML::TreeBuilder;
use Encode;

my $contents = ...; # HTML webpage contents (raw UTF-8 bytes)
# Decode the bytes into Perl characters before parsing
my $htree = HTML::TreeBuilder->new_from_content(decode_utf8($contents));
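
Note that decode_utf8 assumes the page really is UTF-8. If some pages might arrive in another encoding, a slightly more defensive variant is sketched below. The helper name decode_page, the Latin-1 fallback, and the sample $raw string are my own assumptions for illustration, not part of either module:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::TreeBuilder;

# Hypothetical helper: try a strict UTF-8 decode first; if the bytes are
# not valid UTF-8, fall back to ISO-8859-1 (an assumed default).
sub decode_page {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps $bytes intact for the fallback.
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Sample input: UTF-8 bytes for "café" plus an entity
my $raw   = "<html><body><p>caf\xC3\xA9 &amp; more</p></body></html>";
my $htree = HTML::TreeBuilder->new_from_content(decode_page($raw));

my $p = $htree->look_down(_tag => 'p');
print length($p->as_text), " characters in the paragraph\n";

$htree->delete;    # free the tree when done
```

Because the content is decoded before it reaches the parser, the warning should no longer appear, and text returned by the tree is a character string that must be encoded again before being written out as bytes.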
