Question: collecting unusual URLs with a crawler written in Node.js.

This is my first time writing a crawler in Node.js. I want it to collect every link on a page.

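After reading some posts on the forum, I tried two approaches: (1) using the request and cheerio modules to parse the DOM tree and pull out the URLs; (2) using regular expressions to match URLs.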
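To make the second approach concrete, here is a minimal sketch of regex-based link extraction using only Node's built-in `url` module. The function name `extractLinks` and the sample paths (`a.php`, `b.html`, `logo.png`) are invented for illustration; only the base address comes from this thread. It does not decode HTML entities and will miss attributes written without quotes.

```javascript
const { URL } = require('url');

// Collect href/src attribute values from raw HTML and resolve them
// against the page URL. Returns a de-duplicated array of absolute URLs.
function extractLinks(html, baseUrl) {
  // Matches href="..." or src='...' with matching single or double quotes.
  const attrPattern = /\b(?:href|src)\s*=\s*(["'])(.*?)\1/gi;
  const links = new Set();
  let match;
  while ((match = attrPattern.exec(html)) !== null) {
    const raw = match[2].trim();
    // Skip empty values, in-page anchors and non-navigable schemes.
    if (!raw || raw.startsWith('#') || /^(?:javascript|mailto|data):/i.test(raw)) {
      continue;
    }
    try {
      links.add(new URL(raw, baseUrl).href);
    } catch (err) {
      // Ignore values that cannot be parsed as a URL.
    }
  }
  return Array.from(links);
}

// Hypothetical sample markup, not taken from the real test site.
const sampleHtml = [
  '<a href="/demo/aisec/a.php?id=1">one</a>',
  "<a href='b.html#top'>two</a>",
  '<img src="http://demo.aisec.cn/img/logo.png">',
  '<a href="javascript:void(0)">skip</a>'
].join('\n');

console.log(extractLinks(sampleHtml, 'http://demo.aisec.cn/demo/aisec/'));
```

Resolving through `new URL(raw, baseUrl)` turns relative links into absolute ones, so the crawler can queue them directly.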

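Neither approach handles certain special cases, such as URLs produced by ajax calls or by JavaScript code. For example, some links are only generated after the user clicks a button. Could anyone point me to a module or technique that covers these cases?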
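As a partial workaround that stays within static parsing, one could scan inline `<script>` bodies and `on*` event-handler attributes for string literals that look like URLs. The sketch below is only a heuristic under stated assumptions: the name `extractScriptUrls`, the list of file extensions, and the sample markup are all invented for illustration. It cannot see URLs that are assembled at runtime or that only exist after a click has run; those need a real browser engine such as PhantomJS to execute the page.

```javascript
const { URL } = require('url');

// Heuristic: find quoted literals in inline scripts and on* handlers that
// look like absolute URLs, root-relative paths, or common page file names.
function extractScriptUrls(html, baseUrl) {
  const snippets = [];
  let m;

  // Bodies of inline <script> elements.
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  while ((m = scriptPattern.exec(html)) !== null) {
    snippets.push(m[1]);
  }

  // Values of event-handler attributes such as onclick="...".
  const handlerPattern = /\bon\w+\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  while ((m = handlerPattern.exec(html)) !== null) {
    snippets.push(m[1] !== undefined ? m[1] : m[2]);
  }

  // Assumed notion of "URL-like": http(s) URL, path starting with "/",
  // or a file name ending in a common page extension plus optional query.
  const literalPattern =
    /(["'])((?:https?:\/\/|\/)[^"'\s]+|[\w.\/-]+\.(?:php|html?|aspx?|jsp)(?:\?[^"'\s]*)?)\1/gi;

  const found = new Set();
  for (const code of snippets) {
    literalPattern.lastIndex = 0;
    while ((m = literalPattern.exec(code)) !== null) {
      try {
        found.add(new URL(m[2], baseUrl).href);
      } catch (err) {
        // Not a parseable URL; skip it.
      }
    }
  }
  return Array.from(found);
}

// Hypothetical page fragment, not taken from the real test site.
const samplePage = [
  '<button onclick="location.href=\'next.php?id=2\'">Go</button>',
  '<script>',
  '  $.get("/demo/aisec/ajax.php?page=1", render);',
  '</script>'
].join('\n');

console.log(extractScriptUrls(samplePage, 'http://demo.aisec.cn/demo/aisec/'));
```

Expect false positives and misses from this kind of pattern matching; it is best treated as a supplement to DOM parsing, not a replacement for rendering the page.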

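Also, here is the address the crawler is actually being tested against: http://demo.aisec.cn/demo/aisec/. The goal is for the crawler to collect every link on it. Any advice is appreciated!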

nnabuuu #1 • 11 years ago

Just use phantom.js for this... As I recall, phantom.js can already be integrated into node.js.

asfman #2 • 11 years ago

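I don't think phantom is integrated into node.js; you still have to install phantom.js separately. Once it is installed, run npm install spooky. See the spooky repo on GitHub for how to use it.

	// fetchUrl and res are assumed to come from the surrounding
	// HTTP route handler; they are not defined in this snippet.
	var Spooky = require('spooky');
	var cheerio = require('cheerio');

	var spooky = new Spooky({
		child: { transport: 'http' },
		casper: {
			pageSettings: { loadImages: false, loadPlugins: false },
			verbose: false
		}
	}, function (err) {
		if (err) {
			var e = new Error('Failed to initialize SpookyJS');
			e.details = err;
			throw e;
		}

		spooky.start(fetchUrl);

		// Receives the rendered page emitted from the CasperJS side.
		spooky.on('html', function (doc) {
			//console.log(doc.url); // final URL that was fetched
			var $ = cheerio.load(doc.html);
			var product = {};
			//todo
			res.json(product);
		});

		// This step runs inside CasperJS after the page has loaded;
		// evaluate() runs in the page and returns its URL and full HTML.
		spooky.then(function () {
			this.emit('html', this.evaluate(function () {
				return {
					url: location.href,
					html: document.querySelector('html').outerHTML
				};
			}));
		});
		spooky.run();
	});

	spooky.on('error', function (e, stack) {
		res.status(500).json({error: (stack ? JSON.stringify(stack) : "spooky error")});
	});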

nodevc #3 • 11 years ago (author)

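@asfman I installed all three libraries (phantomjs, casperjs, spooky) and then ran hello.js from spooky's example directory. The program threw an error. I am running on Windows, and all three libraries were installed with npm. The error is: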

	events.js:72
	        throw er; // Unhandled 'error' event
	              ^
	Error: spawn ENOENT
	    at errnoException (child_process.js:1011:11)
	    at Process.ChildProcess._handle.onexit (child_process.js:802:34)

Any idea what the cause is?
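For context, `spawn ENOENT` means Node's `child_process.spawn` could not find the executable it was asked to start. One thing worth checking, as an assumption about this particular setup and not a confirmed diagnosis, is whether the `phantomjs` and `casperjs` commands are reachable through PATH. The sketch below does that lookup with built-in modules only; `findOnPath` is an illustrative helper, not part of any of the libraries mentioned.

```javascript
const fs = require('fs');
const path = require('path');

// Return the full path of the first file named `name` found on PATH,
// or null. On Windows the PATHEXT extensions (.EXE, .CMD, ...) are tried too.
function findOnPath(name, env = process.env) {
  const isWindows = process.platform === 'win32';
  const pathVar = env.PATH || env.Path || '';
  const dirs = pathVar.split(path.delimiter).filter(Boolean);
  const exts = isWindows
    ? [''].concat((env.PATHEXT || '.COM;.EXE;.BAT;.CMD').split(';'))
    : [''];

  for (const dir of dirs) {
    for (const ext of exts) {
      const candidate = path.join(dir, name + ext);
      if (fs.existsSync(candidate) && fs.statSync(candidate).isFile()) {
        return candidate;
      }
    }
  }
  return null;
}

// Report where each tool resolves, if anywhere.
for (const tool of ['phantomjs', 'casperjs']) {
  console.log(tool, '->', findOnPath(tool) || 'not found on PATH');
}
```

If a tool only resolves to a `.cmd` or `.bat` file, note that `spawn` does not apply PATHEXT on its own, so a bare command name can still fail on Windows even though it works in a terminal.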

nnabuuu #4 • 11 years ago

@asfman There are some unofficial solutions. They seem to cost some performance, but they should be fine for ordinary use.

See https://github.com/sgentle/phantomjs-node