Scraping the Article List of a WeChat Official Account

Make sure your environment (Node.js and AnyProxy) is already set up. If it is not, see my other post: "Setting up the AnyProxy packet-capture tool on Linux (CentOS)".

1: Find the WeChat official account request in the AnyProxy GUI

  • Tap the official account > history articles

[Image: gzh]

[Image: history]

2: Open the AnyProxy console and find the request that fetches the article list

[Image: gzhArticleList]

  • The captured request details are as follows:
General:
	Method: GET
	URL:https://mp.weixin.qq.com/mp/homepage?__biz=MzU5NDg5NTYwMw==&hid=1&sn=8df2b8b64b3e0453f94c71f6b459f5b0&scene=18&devicetype=iOS12.3.1&version=17000529&lang=zh_CN&nettype=WIFI&ascene=7&session_us=gh_938b0fe09d67&fontScale=100&pass_ticket=36XjBZ5ioJgiXfGDKW6pNteF14izqa4qEGa807pDRTAO2Kh%2FHwI7%2FFYh5NNBGor2&wx_header=1

Header:
	Host :mp.weixin.qq.com
	...

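We have now found the request that fetches the article list, and its response is plain HTML source, so all that remains is to write our own page parser.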
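As a small illustration of what identifies this request, the sketch below checks a URL against the host, path, and `__biz` parameter seen in the capture above and reads its query parameters. The helper names `isArticleListRequest` and `queryParams` are hypothetical, and the sample URL is a shortened copy of the captured one; treat it as a sketch rather than the matching logic used later in this post.

```javascript
// Sketch: recognise the article-list request and read its query parameters.
const { URL } = require('url');

// Shortened copy of the URL captured above (only a few parameters kept).
const sampleUrl = 'https://mp.weixin.qq.com/mp/homepage?__biz=MzU5NDg5NTYwMw==&hid=1&scene=18';

// Hypothetical helper: true when the URL looks like the history-list request.
function isArticleListRequest(rawUrl) {
  const u = new URL(rawUrl);
  return u.hostname === 'mp.weixin.qq.com'
    && u.pathname === '/mp/homepage'
    && u.searchParams.has('__biz');
}

// Hypothetical helper: collects the query parameters into a plain object.
function queryParams(rawUrl) {
  return Object.fromEntries(new URL(rawUrl).searchParams);
}

console.log(isArticleListRequest(sampleUrl), queryParams(sampleUrl).__biz);
```

The `__biz` value identifies the official account, which is why it is a convenient key to match on.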

A few AnyProxy features you need to know

AnyProxy lets developers define custom rules, so we can hook our own code into the proxy flow.

Rule interface documentation

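  • summary: introductory text for the rule module, which AnyProxy shows to the user; it can be a function or a plain string
  • beforeSendRequest(requestDetail)
    • AnyProxy calls beforeSendRequest before sending the request to the server, passing requestDetail
    • requestDetail:
      • protocol: the protocol of the request, http or https
      • requestOptions: the options of the request about to be sent
      • requestData: the request body
      • url: the request URL
      • _req: the original request object
  • beforeSendResponse(requestDetail, responseDetail) (the core hook):
    • AnyProxy calls beforeSendResponse before sending the response to the client, passing requestDetail and responseDetail
    • requestDetail: the same parameter as in beforeSendRequest
    • responseDetail:
      • response: the server's response, with three fields: statusCode, header, and body
      • _res: the original server response object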

See the AnyProxy documentation for the full rule interface.

Usage example:

module.exports = {
  // Description of the rule module
  summary: 'my customized rule for AnyProxy',
  // Intercept a request before it is sent to the server
  *beforeSendRequest(requestDetail) { /* ... */ },
  // Process a response before it is sent to the client
  *beforeSendResponse(requestDetail, responseDetail) { /* ... */ },
  // Decide whether to handle an HTTPS request
  *beforeDealHttpsRequest(requestDetail) { /* ... */ },
  // Called when a request fails
  *onError(requestDetail, error) { /* ... */ },
  // Called when the HTTPS connection to the server fails
  *onConnectError(requestDetail, error) { /* ... */ }
};

With these features in mind, we can use the beforeSendResponse hook to intercept the data.
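To make the hook's return values concrete, here is a minimal sketch that can be run without AnyProxy. It assumes the documented contract that returning null leaves the response unchanged while returning `{ response: ... }` replaces it. The `runHook` driver, the mock detail objects, and the `X-Inspected` header are all hypothetical and exist only for illustration; AnyProxy itself drives the generator.

```javascript
// Mock rule with the same generator-method shape as an AnyProxy rule module.
const rule = {
  summary: 'demo rule (illustration only)',
  *beforeSendResponse(requestDetail, responseDetail) {
    if (!/mp\/homepage\?__biz=/.test(requestDetail.url)) {
      return null; // not the article-list request: leave the response untouched
    }
    // Copy the response and tag it with a hypothetical header.
    const newResponse = Object.assign({}, responseDetail.response);
    newResponse.header = Object.assign({}, newResponse.header, { 'X-Inspected': '1' });
    return { response: newResponse };
  }
};

// Hypothetical driver standing in for AnyProxy: runs the generator to completion.
function runHook(gen) {
  let step = gen.next();
  while (!step.done) step = gen.next(step.value);
  return step.value;
}

// Mock objects shaped like the requestDetail / responseDetail described above.
const mockRequest = { url: 'https://mp.weixin.qq.com/mp/homepage?__biz=abc' };
const mockResponse = { response: { statusCode: 200, header: {}, body: Buffer.from('<html></html>') } };
console.log(runHook(rule.beforeSendResponse(mockRequest, mockResponse)).response.header);
```

For pure scraping, returning null is enough, since the goal is to read the response rather than change it.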

1: Write the parser

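// Runs inside the sandbox created in rule.js, which supplies utils, cheerio, htmlString and result.
function log(log){
    utils.log('{DParser wechat:history article list: ' + log);
}

try{

    var $ = cheerio.load(htmlString);

    $('script').each(function (i, ele) {
        // The article data is in the 14th <script> tag (index 13) of the captured page; this may change.
        if(i === 13){
            var content = $(this).text();
            // Stub the window object that the inline script expects.
            var appendText = 'var window = {__initCatch:false};';
            // Strip the seajs.use call name so the script can be evaluated without seajs.
            content = content.replace('seajs.use','');
            content = appendText + content;
            // Evaluate the script and hand back its data variable.
            result.content = eval('(function() {'+content+'; return data;})()');
        }
    });


}catch(exp){
    log('Parsing exception:' + exp.message);
}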

The parser is simple: it is written against the HTML source captured in the AnyProxy console.
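As an alternative to eval, the inline script could be run in a Node.js vm context with stubbed globals. This is a different technique from the parser above, shown only as a sketch. The `inlineScript` string is a hypothetical, heavily simplified stand-in for the real page script, and it assumes the script declares a top-level `data` variable and calls `seajs.use`.

```javascript
const vm = require('vm');

// Hypothetical stand-in for the page's inline script (the real one is far larger).
const inlineScript = `
  seajs.use('some/module', function () {});
  var data = { appmsg_list: [{ title: 'demo', link: 'http://example.com/a' }] };
`;

// Runs the script with stubbed window/seajs globals and returns its data variable.
function extractData(scriptText) {
  const sandbox = {
    window: { __initCatch: false }, // same stub the parser above prepends
    seajs: { use() {} }             // no-op instead of deleting the call name
  };
  // Top-level var declarations become properties of the sandbox object.
  vm.runInNewContext(scriptText, sandbox, { timeout: 1000 });
  return sandbox.data;
}

console.log(JSON.stringify(extractData(inlineScript)));
```

Note that the vm module is not a security boundary, so this mainly keeps the page's variables out of the rule's own scope rather than making untrusted code safe to run.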

2: Write the custom AnyProxy rule file

rule.js

var cheerio = require('cheerio');
var urlencode = require('urlencode');
var utils = require('./utils');
var vm = require('vm');
var lineReader = require('line-reader');
var iconv = require('iconv-lite');

//Parser for the article list page (it uses jQuery-style parsing of DOM elements, which is very convenient)
var parser = 'function log(log){utils.log("{DParser wechat:history article list: "+log)}try{var $=cheerio.load(htmlString);$("script").each(function(i,ele){if(i===13){var content=$(this).text();var appendText="var window = {__initCatch:false};";content=content.replace("seajs.use","");content=appendText+content;result.content=eval("(function() {"+content+"; return data;})()")}})}catch(exp){log("Parsing exception:"+exp.message)};';

module.exports = {
    summary: "抓取微信公众号历史文章",
    * beforeSendResponse(requestDetail, responseDetail) {
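        //Only intercept the request that fetches the official account's history article list
        if (/mp\/homepage\?__biz=/.test(requestDetail.url)) {
            //Log message: "entered the interceptor, start parsing the HTML source"
            console.log('[INFO]进入拦截器,开始解析html源码........................')
            //Get the HTML source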
            var htmlString = responseDetail.response.body.toString();
            try {
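                //Define the parameters the sandbox needs
                var sharedParam = {
                    vm:vm,                       //Node.js core module used to create an isolated sandbox
                    iconv:iconv,                 //encodes/decodes the source
                    htmlString:htmlString,       //the raw HTML source
                    utils:utils,                 //custom utility helpers
                    cheerio:cheerio,             //turns HTML into a jQuery-style object (our beloved $)
                    parserInfo:{},               //basic information about the parser
                    result:{}                    //receives the object returned by the parser
                }
                //Run the parser
                vm.runInNewContext(parser,sharedParam);
                //sharedParam.result is the data returned by the parser
                //Log message: "scraped history article list of the official account"
                console.log('爬取到的公众号历史文章列表......................' + JSON.stringify(sharedParam.result));
                //Code that saves the result to a database can go here
                //saveToDB(sharedParam.result);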
                return null;
            } catch (e) {
                console.log('[ERROR]' + e.toString())
                return null;
            }
        } else {
            return null
        }
    }
};
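The rule leaves saveToDB unimplemented. As a rough stand-in, the sketch below flattens the parsed result and appends it to a JSON Lines file instead of a database. The function names `toRecords` and `saveToFile`, the chosen fields, and the file format are all assumptions made for illustration; the field names themselves (title, link, author, sendtime) are taken from the sample output shown later in this post.

```javascript
const fs = require('fs');

// Keeps a few fields from each article and converts sendtime (seconds) to ISO-8601.
function toRecords(result) {
  const list = (result && result.content && result.content.appmsg_list) || [];
  return list.map((item) => ({
    title: item.title,
    link: item.link,
    author: item.author,
    sentAt: new Date(item.sendtime * 1000).toISOString()
  }));
}

// Hypothetical stand-in for saveToDB: appends one JSON document per line.
function saveToFile(result, filePath) {
  const records = toRecords(result);
  const lines = records.map((r) => JSON.stringify(r) + '\n').join('');
  fs.appendFileSync(filePath, lines, 'utf8');
  return records.length; // number of records written
}
```

Inside the rule, this could be called as `saveToFile(sharedParam.result, './articles.jsonl')` in place of the commented-out saveToDB line, where the file path is again only an example.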

3: Write the AnyProxy startup script and load the rule file

index.js

//Load AnyProxy
const AnyProxy = require('anyproxy');
//Load the custom rule module
const rule = require('./rule');
const options = {
    //Port the proxy listens on
    port: 8001,
    //Custom rule module to apply
    rule: rule,
    //Web console settings
    webInterface: {
        enable: true,
        webPort: 8002
    },
    //Bandwidth limit in kb/s
    throttle: 10000,
    //Intercept all HTTPS requests
    forceProxyHttps: true,
    wsIntercept: true, //enable WebSocket interception
    silent: false
};
const proxyServer = new AnyProxy.ProxyServer(options);

proxyServer.on('ready', () => { console.log("proxyServer is ready") });
//Log the error text (the concatenation previously sat outside the console.log call)
proxyServer.on('error', (e) => { console.log("proxyServer Error:" + e.toString()) });
proxyServer.start();

4: Start it

[root@10-10-127-163 script]# node index.js 
[AnyProxy Log][2019-08-31 17:36:55]: throttle :10000kb/s
[AnyProxy Log][2019-08-31 17:36:56]: Http proxy started on port 8001
[AnyProxy Log][2019-08-31 17:36:56]: web interface started on port 8002
[AnyProxy Log][2019-08-31 17:36:56]: Active rule is: 抓取微信公众号历史文章
proxyServer is ready

5: Open the article list on the phone. The console prints output like the following:

[AnyProxy Log][2019-08-31 17:37:33]: received request to: POST extshort.weixin.qq.com/mmtls/6a86f614
[AnyProxy Log][2019-08-31 17:37:34]: received request to: POST extshort.weixin.qq.com/mmtls/6a8737bb
[AnyProxy Log][2019-08-31 17:37:34]: received https CONNECT request mp.weixin.qq.com
[AnyProxy Log][2019-08-31 17:37:34]: will forward to local https server
[AnyProxy Log][2019-08-31 17:37:34]: [internal https]proxy server for mp.weixin.qq.com established
[AnyProxy Log][2019-08-31 17:37:34]: received request to: GET mp.weixin.qq.com/mp/homepage?__biz=MzU5NDg5NTYwMw==&hid=1&sn=8df2b8b64b3e0453f94c71f6b459f5b0&scene=18&devicetype=iOS12.3.1&version=17000529&lang=zh_CN&nettype=WIFI&ascene=7&session_us=gh_938b0fe09d67&fontScale=100&pass_ticket=36XjBZ5ioJgiXfGDKW6pNteF14izqa4qEGa807pDRTAO2Kh%2FHwI7%2FFYh5NNBGor2&wx_header=1
[INFO]进入拦截器,开始解析html源码........................
爬取到的公众号历史文章列表......................{"content":{"appmsg_list":[{"aid":"2247483712_1","title":"像人体骨骼系统一样,设计B端产品架构","cover":"http://mmbiz.qpic.cn/mmbiz_jpg/3WiafXak9EliaRPhteHKh1JNwwMXiaxDFGG1OhYWcKeqC8n8mDEczds6icuUvleP0rR9TFvjC43icm3TMMT8VqZoD2A/0","link":"http://mp.weixin.qq.com/s?__biz=MzU5NDg5NTYwMw==&mid=2247483712&idx=1&sn=6a637a6273a8c42795223e80e8bbdb13&scene=19#wechat_redirect","digest":"做好一个产品的产品架构,能清晰地组织好业务系统的逻辑、明确指导产品的设计、迭代、优化。而细化到B端产品架构上,我认为有以下四步需要注意。","appmsgid":2247483712,"itemidx":1,"type":9,"item_show_type":0,"copyright_stat":11,"author":"李雨","sendtime":1561125701},{"aid":"2247483704_1","title":"如何设计B端产品的首页?","cover":"http://mmbiz.qpic.cn/mmbiz_jpg/3WiafXak9EljTnF9I807iaVYtBV0EicRLFxTRoEFCc353CMM6TO78NFhbzniaj4gTnX0EGmjrODdowqStWhGmrI2Bg/0","link":"http://mp.weixin.qq.com/s?__biz=MzU5NDg5NTYwMw==&mid=2247483704&idx=1&sn=5620deec222c18e80907d40ad6edf245&scene=19#wechat_redirect","digest":"为什么要单独总结B端产品的首页怎么","appmsgid":2247483704,"itemidx":1,"type":9,"item_show_type":0,"copyright_stat":11,"author":"李雨","sendtime":1560433716},{"aid":"2247483698_1","title":"基于5W2H方法,设计深入业务的用户访谈方案","cover":"http://mmbiz.qpic.cn/mmbiz_jpg/3WiafXak9ElhBbyJXJJa6vD7kt0c7CHyA5AIYZF0WYem29lKEFo2aObpfj3AFeicwtiblTLFnt2Ckp0iahia16aT4sg/0","link":"http://mp.weixin.qq.com/s?__biz=MzU5NDg5NTYwMw==&mid=2247483698&idx=1&sn=dd966daee32d739118d86b16970ba579&scene=19#wechat_redirect","digest":"对于B端产品经理,用户访谈是是产品经理深入客户场景、了解客户业务的重要途径。","appmsgid":2247483698,"itemidx":1,"type":9,"item_show_type":0,"copyright_stat":11,"author":"李雨","sendtime":1559570057},{"aid":"2247483689_1","title":"深入业务,打造行业背景下的BI系统","cover":"http://mmbiz.qpic.cn/mmbiz_jpg/3WiafXak9EliawcLO9w8YN5icOVOElibc8PLVHCqoMx6YjhrVtibWdQ3dIJ30b17P8lPVQyLlgrdcH3xeP6HseyXJQw/0","link":"http://mp.weixin.qq.com/s?__biz=MzU5NDg5NTYwMw==&mid=2247483689&idx=1&sn=c1f9f6fa02753d104741fb7b0841d1db&scene=19#wechat_redirect","digest":"那如何搭建行业背景下的BI系统那?主要分为两大步骤。首先通过需求分析深入业务,明确系统解决的问题。然后,结合业务,整理源数据,制定指标和算法,设计展现形式,最后完成数据分析的设计。下文结合实例,详细讲解如何搭建行业背景下的BI系统。","appmsgid":2247483689,"itemidx":1,"type":9,"item_show_type":0,"copyright_stat":11,"author":"李雨","sendtime":1558532301}]}}

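[Image: article]

As the screenshot above shows, the article links, thumbnails, like counts, read counts, and comment counts have all been collected successfully.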

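This post does not go into how the like and read counts are collected. There are too many details: it took a lot of comparison testing of the various tokens, sign, sn, and so on to arrive at a usable URL. If you are interested, download the complete code from my Git repository to see the details. Also, I mainly work on Java backends and am only a beginner at Node.js, so please read with a forgiving eye ☺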

Full source code

# Nodejs  Crawler
