crawler returns garbled text when scraping a page

exports.index = function (req, res) {
    var Crawler = require("crawler").Crawler;

    var c = new Crawler({
        "maxConnections": 10,
        "debug": true,
        "forceUTF8": true,
        // This will be called for each crawled page
        "callback": function (error, result, $) {
            // $ is a jQuery instance scoped to the server-side DOM of the page
            var te = $("#top .gengxin table  tr:first").html();
            console.log(te);
            res.render('index', { title: te });
        }
    });

    c.queue("http://psv.tgbus.com");
};

Server-side output:

GET http://psv.tgbus.com …
Got http://psv.tgbus.com (107044 bytes)…
forceUTF8 true
Detected charset windows-1252 (95% confidence)
<td width="10" valign="top" class="">茂驴陆茂驴陆</td><td class=""><a class="" href="http://psv.tgbus.com/yxgl/201303/20130314105417.shtml" title="茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆寐济柯矫柯矫柯矫柯矫陆茂驴陆寐柯矫柯矫柯矫柯?nbsp;茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆寐济柯矫柯矫柯矫柯矫柯矫柯矫柯? target="_blank"><font color="#FF0000">茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆寐济柯矫柯矫柯矫柯矫陆茂驴陆寐柯矫柯矫柯矫柯?nbsp;茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆茂驴陆寐济柯矫柯矫柯矫柯矫柯矫柯矫柯?/font></a></td><td align="right" class="" width="40"><font color="red">03-14</font></td>

Is the windows-1252 encoding not supported? Scraping UTF-8 pages works fine for me.

anuxs #1 • 13 years ago

The first problem any crawler has to deal with is character encoding. I suggest using fetch, which converts encodings automatically.

shiedman #2 • 13 years ago

It is not that windows-1252 is unsupported. The jschardet library that crawler calls misdetects gb2312 as windows-1252. Short of modifying crawler's code, there is no fix.
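
To illustrate what a wrong charset guess does to Chinese text, here is a minimal sketch. It is not crawler's or jschardet's actual code. It assumes a Node.js build with full ICU, so that the built-in TextDecoder accepts the "gbk" and "windows-1252" labels. The helper name decodeAs and the hard-coded bytes are made up for the example.

```javascript
// GB2312/GBK bytes for the two characters "中文" (hard-coded for illustration).
var gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Hypothetical helper: decode raw bytes with an explicitly chosen charset label.
// Assumes Node.js was built with full ICU; otherwise TextDecoder throws a RangeError.
function decodeAs(bytes, charset) {
    return new TextDecoder(charset).decode(bytes);
}

var wrong = decodeAs(gbkBytes, "windows-1252"); // what a misdetection leads to
var right = decodeAs(gbkBytes, "gbk");          // the encoding the bytes were written in

console.log("decoded as windows-1252:", wrong);
console.log("decoded as gbk:", right);
```

The bytes are identical in both calls; only the charset label differs. That is why a detector that guesses wrong produces garbage even though the target encoding itself is supported.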

ronincn #3 • 13 years ago

Fetch the page yourself, then parse the DOM with jQuery.
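
One possible shape for the "fetch it yourself" route, as a rough sketch under assumptions: keep the response body as raw bytes, take the charset from the Content-Type header or from a <meta> declaration instead of guessing statistically, and only then decode. The names sniffCharset, decodeBody, and samplePage are hypothetical, and the example again assumes a Node.js build with full ICU. The network request and the jQuery step are left out.

```javascript
// Hypothetical helper: pick a charset label from the Content-Type header,
// falling back to a <meta> declaration near the top of the raw body.
function sniffCharset(contentType, bodyBytes) {
    var fromHeader = /charset\s*=\s*["']?([\w-]+)/i.exec(contentType || "");
    if (fromHeader) {
        return fromHeader[1].toLowerCase();
    }
    // latin1 maps every byte to one character, so the ASCII markup survives intact.
    var head = bodyBytes.subarray(0, 1024).toString("latin1");
    var fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(head);
    return fromMeta ? fromMeta[1].toLowerCase() : "utf-8";
}

// Hypothetical helper: decode the raw body using the declared charset.
// Assumes full ICU; an unknown label makes TextDecoder throw a RangeError.
function decodeBody(contentType, bodyBytes) {
    return new TextDecoder(sniffCharset(contentType, bodyBytes)).decode(bodyBytes);
}

// Made-up sample page: ASCII markup plus the GB2312 bytes for "中文".
var samplePage = Buffer.concat([
    Buffer.from('<meta http-equiv="Content-Type" content="text/html; charset=gb2312"><td>', "latin1"),
    Buffer.from([0xd6, 0xd0, 0xce, 0xc4]),
    Buffer.from("</td>", "latin1")
]);

var html = decodeBody("text/html", samplePage);
console.log(html);
```

The decoded string can then be handed to whatever server-side DOM library you prefer. Pages that declare no charset, or declare the wrong one, would still need a fallback.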

jathya2 #4 • 12 years ago

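@anuxs Thank you so much. I spent a whole day on this. request, bufferHelper, needle, iconv, spider: none of them helped with the encoding problem. Only at the very end did I come across fetch… It solved all sorts of encoding problems and is very general-purpose… This is an old thread, but it really did solve my problem. By the way, I had no idea what keywords to search for on http://stackoverflow.com about encoding problems; my English is too poor.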

XadillaX #5 • 12 years ago

You can also use nodegrassex for crawling.