A transcoding bug in simple_html_dom, the PHP HTML-parsing library

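Over the past few days I have been using simple_html_dom to scrape some articles. Websites in China are mostly encoded as GBK, GB2312, or UTF-8, with GB2312 and UTF-8 the most common.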


My version of simple_html_dom has a method, convert_text, that looks like this.







    // PaperG - Function to convert the text from one character set to another if the two sets are not the same.
    function convert_text($text)
    {
        global $debug_object;
        if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}

        $converted_text = $text;
        $sourceCharset = "";
        $targetCharset = "";

        if ($this->dom)
        {
            $sourceCharset = strtoupper($this->dom->_charset);
            $targetCharset = strtoupper($this->dom->_target_charset);
        }
        if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}

        if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
        {
            // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
            if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
            {
                $converted_text = $text;
            }
            else
            {
                $converted_text = iconv($sourceCharset, $targetCharset, $text);
            }
        }

        // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
        if ($targetCharset == 'UTF-8')
        {
            // The UTF-8 BOM is the byte sequence EF BB BF (backslash escapes, not forward slashes).
            if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
            {
                $converted_text = substr($converted_text, 3);
            }
            if (substr($converted_text, -3) == "\xef\xbb\xbf")
            {
                $converted_text = substr($converted_text, 0, -3);
            }
        }

        return $converted_text;
    }




Look at this line:







    $converted_text = iconv($sourceCharset, $targetCharset, $text); 




It causes incorrect conversion. For example, it turns GB2312 text into:







4月26日在<span style="color:#C03">
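
The cut-off output above is consistent with iconv stopping at a byte sequence it cannot convert. One plausible cause, assumed here rather than established by the text above, is a page that declares GB2312 but contains characters that only exist in GBK. The following is a minimal sketch of a more tolerant conversion under that assumption. The helper names normalize_source_charset and convert_text_tolerant are hypothetical and are not part of simple_html_dom; the sketch uses only PHP's built-in iconv() with the //IGNORE suffix.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): normalise a charset label.
// Assumption: pages labelled GB2312 may contain GBK-only characters, and GBK
// is a superset of GB2312, so decoding as GBK is the safer choice.
function normalize_source_charset(string $charset): string
{
    $charset = strtoupper(trim($charset));
    return $charset === 'GB2312' ? 'GBK' : $charset;
}

// Hypothetical replacement for the bare iconv() call in convert_text.
function convert_text_tolerant(string $text, string $source, string $target = 'UTF-8'): string
{
    $source = normalize_source_charset($source);
    $target = strtoupper(trim($target));

    // Nothing to do when the charset is unknown or already matches the target.
    if ($source === '' || $source === $target) {
        return $text;
    }

    // //IGNORE asks iconv to drop characters it cannot convert instead of
    // aborting; @ suppresses the notice iconv may still emit for bad input.
    $converted = @iconv($source, $target . '//IGNORE', $text);

    // iconv() returns false on failure; fall back to the untouched input
    // rather than returning an empty or truncated string.
    return $converted === false ? $text : $converted;
}
```

In convert_text, the call iconv($sourceCharset, $targetCharset, $text) could then be swapped for convert_text_tolerant($text, $sourceCharset, $targetCharset). The behaviour of //IGNORE varies between iconv implementations, so this should be checked against the pages actually being scraped.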