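Scraping Douban's Now-Playing Movies in Python with XPath Attribute Selectors
Preface
A quick note: this article is for study and research only and has no other purpose.
GitHub repository: GitHub project repository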
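Page analysis
The page to scrape is: https://movie.douban.com/cinema/nowplaying/nanjing/
The region at the end of the URL can be changed to suit your needs, so I won't go into it. The page shows only 15 movies until you click the control that expands the full list, so when driving it with selenium we need to add click logic that runs after the page opens. The page looks like this:
In the source opened with F12, right-click the element to copy its XPath, then verify the path with the XPath Helper tool.
So that a layout change doesn't break the lookup, I changed the XPath to select by class name.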
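A minimal sketch of selecting by class name instead of by a copied absolute path. The helper names, the `nowplaying` container id, and the `more` class are assumptions for illustration, not values confirmed by the page; substitute whatever class the expand control actually carries. `click_expand_all` assumes Selenium 4 and imports it lazily, so the XPath helper can be tried without a browser.

```python
def class_xpath(tag, class_name, root_id=None):
    """Build an XPath that matches `tag` elements carrying `class_name`.

    The concat/normalize-space idiom matches one whole class token, so
    'more' does not accidentally match 'more-link'.
    """
    token_test = (
        "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % class_name
    )
    prefix = "//*[@id='%s']" % root_id if root_id else ""
    return "%s//%s[%s]" % (prefix, tag, token_test)


def click_expand_all(browser, xpath, timeout=10):
    """Wait until the element is clickable, then click it (Selenium 4)."""
    # Imported lazily so class_xpath works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    button = WebDriverWait(browser, timeout).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )
    button.click()


# Hypothetical values: 'nowplaying' as the container id, 'more' as the class.
EXPAND_XPATH = class_xpath("a", "more", root_id="nowplaying")
```

An explicit wait like this is one alternative to a fixed sleep: it returns as soon as the control is clickable and raises `TimeoutException` if it never appears.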
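Next, look at the information for each movie.
On inspection, we can take the nowplaying div as the root node and select the nodes beneath it whose class is list-item; the attributes on those nodes hold the data we want.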
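The idea above can be tried on a small sample before touching the real page. This is only a sketch: the markup below is invented, and the `data-*` attribute names are assumptions chosen to mirror the item fields defined later in this article. It uses the standard library's ElementTree, which supports a limited XPath subset and needs well-formed markup; inside the Scrapy project the same path would go through `response.xpath` instead.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; attribute names are assumptions, not copied from the page.
SAMPLE_HTML = """
<html><body>
  <div id="nowplaying">
    <ul class="lists">
      <li class="list-item" data-title="Movie A" data-score="8.1"
          data-release="2021" data-duration="120 min" data-region="Region A"
          data-director="Director A" data-actors="Actor 1 / Actor 2"></li>
      <li class="list-item" data-title="Movie B" data-score="7.4"
          data-release="2020" data-duration="98 min" data-region="Region B"
          data-director="Director B" data-actors="Actor 3"></li>
    </ul>
  </div>
  <div id="upcoming">
    <ul class="lists">
      <li class="list-item" data-title="Movie C"></li>
    </ul>
  </div>
</body></html>
"""

FIELDS = ("title", "score", "release", "duration", "region", "director", "actors")


def extract_movies(html):
    """Return one dict per list-item node found under the nowplaying div."""
    root = ET.fromstring(html.strip())
    # Root the search at the nowplaying div so other lists are ignored.
    nodes = root.findall(".//div[@id='nowplaying']//li[@class='list-item']")
    # Missing attributes become empty strings rather than raising.
    return [{f: node.get("data-" + f, "") for f in FIELDS} for node in nodes]


movies = extract_movies(SAMPLE_HTML)
```

Because the search starts from the nowplaying div, the list-item under the other div in the sample is not picked up.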
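That looks workable, so let's create the project and start coding along these lines.
Implementation
Create the project
Use the scrapy command to create a project named douban_playing.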
scrapy startproject douban_playing
Item definition
Define the entity that holds the movie information.
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanPlayingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Movie title
    title = scrapy.Field()
    # Movie score
    score = scrapy.Field()
    # Release year
    release = scrapy.Field()
    # Running time
    duration = scrapy.Field()
    # Region
    region = scrapy.Field()
    # Director
    director = scrapy.Field()
    # Lead actors
    actors = scrapy.Field()
Middleware
The main addition is the click that expands the full movie list, which takes a few extra lines of code.
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import time

from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException


class DoubanPlayingSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DoubanPlayingDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # return None
        try:
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            spider.browser.find_element_by_xpath("/…
[… the source is cut off here; the text resumes partway through the request headers in the project settings …]
    …*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingSpiderMiddleware': 543,
}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'douban_playing.middlewares.DoubanPlayingDownloaderMiddleware': 543,
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_playing.pipelines.DoubanPlayingPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
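Running and verifying
As usual, rather than invoking the scrapy command directly, write a small Python script that runs the command. Pay attention to where that script is placed.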
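A launcher along these lines is one way to do that. It is a sketch rather than the article's own script: the function names are made up for illustration, and the spider name passed in is an assumption because the spider code is not shown here. `scrapy.cmdline.execute` is Scrapy's command-line entry point; it is imported lazily so the helper can be imported without Scrapy installed.

```python
import os


def build_crawl_command(spider_name):
    """Return the argv list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir=None):
    """Run the spider in-process through Scrapy's command-line entry point."""
    # Imported lazily so this module can be imported without Scrapy installed.
    from scrapy import cmdline

    # Scrapy looks for scrapy.cfg starting from the working directory,
    # which is why the location of the launcher script matters.
    if project_dir is not None:
        os.chdir(project_dir)
    cmdline.execute(build_crawl_command(spider_name))
```

Placed next to scrapy.cfg in the project root, the script would end with a call such as `run_spider("douban_playing")`, where `douban_playing` stands in for whatever `name` the spider declares.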
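Here is the result after running it.
Perfect!
Summary
I've been writing a number of crawler examples lately, learning and exploring as I go. I'm recording the implementation process here to share it and to have something to look back on later.
A quote to share:
"Of love, no one knows where it begins, where it rests, how it knots, how it unravels, where it goes, or where it ends." (from 雪中悍刀行)
If this article was useful to you, please give it a like. Thanks!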