How to Scrape WeChat Official Account Articles with Python

2023-06-19 09:42

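This article walks through one way to scrape WeChat official account articles: capture the WeChat client's traffic with a proxy server, then extract the article list from the captured responses. Most readers are probably not yet familiar with the process, so it is offered here as a reference.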

The steps are as follows:

I. Install the proxy server

The proxy used here is AnyProxy. Its key feature is that it can capture the content of HTTPS requests.

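1 Run npm install -g anyproxy in a command prompt or terminal; on macOS, prefix the command with sudo;

2 Generate the root CA, which HTTPS interception requires: run sudo anyproxy --root (sudo may not be needed on Windows);

3 Start AnyProxy by running sudo anyproxy -i; the -i flag tells AnyProxy to intercept and parse HTTPS traffic;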

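4 Install the certificate on the phone: open http://localhost:8002/fetchCrtFile in the phone's browser to download the rootCA.crt file. Replace localhost with the IP address of the computer running AnyProxy, and make sure the phone and the computer are on the same local network.

5 Set the proxy: in the phone's Wi-Fi connection settings, configure a proxy whose server address is the IP address of the computer running AnyProxy. The default proxy port is 8001;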

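Now open WeChat and tap into any official account's message history or any article; the corresponding requests should scroll by in the terminal.

6 On the computer, open http://localhost:8002 in a browser to see AnyProxy's web interface. Open a message history page in WeChat, then look at the web interface again: the address of the history page will appear in the scrolling request list.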
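To make the addresses in steps 4 to 6 concrete, here is a minimal Python sketch that is not part of the original workflow. It builds the proxy, web interface, and certificate URLs for a given host and creates a urllib opener that sends requests through AnyProxy. The function names and the example IP address are illustrative assumptions; ports 8001 and 8002 are the defaults named above.

```python
import urllib.request

ANYPROXY_PROXY_PORT = 8001  # default proxy port named in step 5
ANYPROXY_WEB_PORT = 8002    # web interface port named in steps 4 and 6


def anyproxy_endpoints(host: str) -> dict:
    """Build the three addresses the steps above refer to for a given host."""
    return {
        "proxy": f"http://{host}:{ANYPROXY_PROXY_PORT}",
        "web_ui": f"http://{host}:{ANYPROXY_WEB_PORT}",
        "root_ca": f"http://{host}:{ANYPROXY_WEB_PORT}/fetchCrtFile",
    }


def build_proxied_opener(host: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS requests through AnyProxy."""
    proxy = anyproxy_endpoints(host)["proxy"]
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # 192.168.1.10 is a placeholder LAN address, not a value from the article.
    for name, url in anyproxy_endpoints("192.168.1.10").items():
        print(f"{name}: {url}")
```

HTTPS requests sent through such an opener only validate if the machine making them trusts AnyProxy's root CA, just as the phone must in step 4.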

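II. Scrape the article list with SPY

Because the results need to be saved to a database, I used SPY, a crawler tool I developed myself; if you do not need to save to a database, Chrome is enough.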
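For readers who do not have SPY, the database step could be approximated in Python. The following is only a sketch under stated assumptions: it uses SQLite from the standard library, and the table name, the helper names, and the choice of content_url as the primary key are all hypothetical. The five columns mirror the fields that the script later in this section saves.

```python
import sqlite3

# The five fields collected by the script in this section.
FIELDS = ("author", "content_url", "cover", "digest", "title")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a SQLite database with a hypothetical articles table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "content_url TEXT PRIMARY KEY, "
        "author TEXT, cover TEXT, digest TEXT, title TEXT)"
    )
    return conn


def save_articles(conn: sqlite3.Connection, articles: list) -> int:
    """Insert articles, skipping URLs already stored; return the row count."""
    rows = [tuple(a.get(f, "") for f in FIELDS) for a in articles]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO articles "
            "(author, content_url, cover, digest, title) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Keying on content_url means that re-running the scrape does not create duplicate rows; pass a file path to open_store to keep the data between runs.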

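1 On the phone, open the official account's list of past articles and scroll to the very bottom so that every article is loaded.

2 Open SPY, enter the address http://localhost:8002, and paste in the code.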

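The rough logic of the code is:

a. Read the article list data captured from mp/profile_ext?action=home&__biz=MzA3ODkyNDg4OA=

b. The article list is loaded asynchronously, so for now you have to scroll down the list on the phone by hand until every article has been loaded.

c. SPY then extracts all of the article data and saves it to the database.

The code is as follows: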

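// Collect every finished request row in AnyProxy's web interface.
var results = [];
var doms = document.querySelectorAll('.record_status_done');
var pages = [];
doms.forEach(function(dom) {
    // The fifth cell of each row carries the request URL in its title attribute.
    var cell = dom.children[4];
    var url = cell ? cell.getAttribute('title') : null;
    // Keep only the article-list requests.
    if (url && /\/mp\/profile_ext\?action=getmsg&/i.test(url)) {
        pages.push(dom);
    }
});
var step = 0;
if (pages.length > 0) {
    stepByStep();
} else {
    // Nothing matched, so hand back the empty result instead of throwing.
    spy.getResult(results);
}
function stepByStep() {
    // Open the detail panel for the current request.
    pages[step].click();
    var res;
    // Give the panel one second to render the response body.
    setTimeout(function() {
        var body = document.querySelector('.resBodyContent');
        if (body) {
            try {
                // general_msg_list is itself a JSON string, hence the double parse.
                res = JSON.parse(JSON.parse(body.innerText).general_msg_list).list;
            } catch (e) {
                console.log('parse failed', step, e);
            }
        }
        if (res) {
            res.forEach(function(r) {
                if (r.app_msg_ext_info) {
                    var target = r.app_msg_ext_info;
                    console.log(target, step, 'num');
                    var obj_save = {
                        author: target.author,
                        content_url: target.content_url,
                        cover: target.cover,
                        digest: target.digest,
                        title: target.title
                    };
                    spy.save(obj_save); // persist the record through SPY
                    results.push(obj_save);
                    console.log(results.length, step);
                }
            });
        } else {
            console.log(res, body);
        }
        step = step + 1;
        // Close the detail panel before moving on to the next request.
        setTimeout(function() {
            document.querySelector('.escBtn').click();
        }, 1000);
        if (step < pages.length) {
            setTimeout(stepByStep, 3000);
        } else {
            spy.getResult(results);
        }
    }, 1000);
}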
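The parsing step in the script above can also be expressed in Python, which may be useful when the response bodies are exported from AnyProxy rather than read inside the browser. This is a sketch, not the author's method: the function name is hypothetical, and the html.unescape call assumes that content_url may arrive HTML-escaped (for example with &amp;amp; in place of &amp;), which the source does not state.

```python
import html
import json


def extract_articles(response_body: str) -> list:
    """Pull the article fields out of one captured article-list response.

    The body is a JSON object whose general_msg_list value is itself a
    JSON-encoded string, so it has to be decoded twice.
    """
    payload = json.loads(response_body)
    raw_list = payload.get("general_msg_list")
    if not raw_list:
        return []
    messages = json.loads(raw_list).get("list", [])

    articles = []
    for message in messages:
        info = message.get("app_msg_ext_info")
        if not info:
            continue  # entries without article data are skipped
        articles.append({
            "author": info.get("author", ""),
            # Assumption: the URL may be HTML-escaped in the response.
            "content_url": html.unescape(info.get("content_url", "")),
            "cover": info.get("cover", ""),
            "digest": info.get("digest", ""),
            "title": info.get("title", ""),
        })
    return articles
```

The function returns a list of plain dictionaries with the same five keys the JavaScript version saves, so the output can be written to any store.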