位置：首页-资讯-后端开发

Python3 爬虫 requests

2023-01-31 08:21

短信预约 -IT技能 免费直播动态提醒

刚学Python爬虫不久，迫不及待的找了一个网站练手，新笔趣阁：一个小说网站。

安装Python以及必要的模块（requests，bs4），不了解requests和bs4的同学可以去官网看个大概之后再回来看教程

刚开始写爬虫的小白都有一个疑问，进行到什么时候爬虫还会结束呢？答案是：爬虫是在模拟真人在操作，所以当页面中的next链接不存在的时候，就是爬虫结束的时候。

1.用一个queue来存储需要爬虫的链接，每次都从queue中取出一个链接，如果queue为空，则程序结束
2.requests发出请求，bs4解析响应的页面，提取有用的信息，将next的链接存入queue
3.用os来写入txt文件

需要把域名和爬取网站对应的ip 写入host文件中，这样可以跳过DNS解析，不这样的话，代码运行一段时间会卡住不动

'''
抓取新笔趣阁https://www.xbiquge6.com/单个小说
爬虫线路： requests - bs4 - txt
Python版本： 3.7
OS： windows 10
'''
import requests
import time
import sys
import os
import queue
from bs4 import BeautifulSoup 
# 用一个队列保存url
q = queue.Queue()
# 首先我们写好抓取网页的函数
def get_content(url):

    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }

        r = requests.get(url=url, headers=headers)
        r.encoding = 'utf-8'
        content = r.text
        return content
    except:
        s = sys.exc_info()
        print("Error '%s' happened on line %d" % (s[1], s[2].tb_lineno))
        return " ERROR "

# 解析内容
def praseContent(content):
    soup = BeautifulSoup(content,'html.parser')
    chapter = soup.find(name='div',class_="bookname").h1.text
    content = soup.find(name='div',id="content").text
    save(chapter, content)
    next1 = soup.find(name='div',class_="bottem1").find_all('a')[2].get('href')
    # 如果存在下一个章节的链接，则将链接加入队列
    if next1 != '/0_638/':
        q.put(base_url+next1)
    print(next1)
# 保存数据到txt
def save(chapter, content):
    filename = "修罗武神.txt"
    f =open(filename, "a+",encoding='utf-8')
    f.write("".join(chapter)+'\n')
    f.write("".join(content.split())+'\n') 
    f.close

# 主程序
def main():
    start_time = time.time()
    q.put(first_url)
    # 如果队列为空，则继续
    while not q.empty():
        content = get_content(q.get())
        praseContent(content)
    end_time = time.time()
    project_time = end_time - start_time
    print('程序用时', project_time)

# 接口地址
base_url = 'https://www.xbiquge6.com'
first_url = 'https://www.xbiquge6.com/0_638/1124120.html'
if __name__ == '__main__':
    main()

结果蛮成功的吧，就是过程比较慢，程序用时1个半小时。。23333继续学习，有改进方案的欢迎提出来，一起交流。
QQ:1156381157

免责声明：

① 本站未注明“稿件来源”的信息均来自网络整理。其文字、图片和音视频稿件的所属权归原作者所有。本站收集整理出于非商业性的教育和科研之目的，并不意味着本站赞同其观点或证实其内容的真实性。仅作为临时的测试数据，供内部测试之用。本站并未授权任何人以任何方式主动获取本站任何信息。

② 本站未注明“稿件来源”的临时测试数据将在测试完成后最终做删除处理。有问题或投稿请发送至: 邮箱/279061341@qq.com QQ/279061341

爬虫 requests

阅读原文内容投诉