How to Build an Automated Crawler with Python and Selenium

2023-06-28 19:35

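This article explains how to write an automated crawler with Python and Selenium. The approach is simple, quick to set up, and practical. Let's walk through it step by step.

Brief introduction:

Selenium is a web automation testing tool, originally developed for automated website testing. It runs directly against a real browser and supports all mainstream browsers, including headless ones such as PhantomJS (whose developers suspended work on it in 2018; Chrome with chromedriver can do the same job). It accepts commands that make the browser load pages automatically, extract the data you need, and even take page screenshots.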

1. Installation

pip install selenium -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Download the browser driver

Google Chrome is used here.

http://npm.taobao.org/mirrors/chromedriver/

Check your own browser's version and download the driver that matches it.
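In practice, "matching" generally means that the driver and the browser share the same major version number. The sketch below shows that comparison in plain Python; the helper names and the version strings are illustrative assumptions, not something the article or Selenium provides.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_matches_browser(browser_version: str, driver_version: str) -> bool:
    """True when both versions share the same major number."""
    return major_version(browser_version) == major_version(driver_version)


# Illustrative version strings: the first pair matches, the second does not.
print(driver_matches_browser("114.0.5735.199", "114.0.5735.90"))
print(driver_matches_browser("114.0.5735.199", "113.0.5672.63"))
```

The browser version itself can be read from Chrome's "About" page.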

Put the unzipped driver in the directory that contains your python.exe.
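To confirm the driver ended up somewhere it can be found, a small check like the following may help. It is a sketch using only the standard library: `locate_driver` is a hypothetical helper that looks next to the running interpreter first and then falls back to the system PATH.

```python
import os
import shutil
import sys
from typing import Optional


def locate_driver(name: str = "chromedriver") -> Optional[str]:
    """Look next to the Python interpreter first, then fall back to PATH."""
    exe_dir = os.path.dirname(sys.executable)
    for candidate in (name, name + ".exe"):
        path = os.path.join(exe_dir, candidate)
        if os.path.isfile(path):
            return path
    return shutil.which(name)  # None when the driver is not on PATH either


print(locate_driver() or "chromedriver not found")
```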

3. Examples

3.1 Download the driver for your browser version

http://npm.taobao.org/mirrors/chromedriver/

Put the unzipped driver in the directory that contains your python.exe.

3.2 Test code: open a web page and print its title

from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    web.get("https://baidu.com")
    print(web.title)

3.3 A small example

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

if __name__ == '__main__':
    web = Chrome()
    url = 'https://ac.nowcoder.com/acm/home'
    web.get(url)
    # Locate the <a> tag to click
    el = web.find_element(By.XPATH, '/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')
    # Click it
    el.click()
    # Full path of one target link, for reference:
    # "/html/body/div/div[3]/div[1]/div[2]/div[2]/div[2]/div[1]/h5/a"
    # Scrape the content we want
    lists = web.find_elements(
        By.XPATH,
        "/html/body/div/div[3]/div[1]/div[2]/div[@class='platform-item js-item ']/div[2]/div[1]/h5/a")
    print(len(lists))
    for i in lists:
        print(i.text)

3.4 Automatic input and navigation

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

if __name__ == '__main__':
    web = Chrome()
    url = 'https://ac.nowcoder.com/acm/home'
    web.get(url)
    el = web.find_element(By.XPATH, '/html/body/div/div[3]/div[1]/div[1]/div[1]/div/a')
    el.click()
    time.sleep(1)  # Give the page a moment to render the search box
    input_el = web.find_element(By.XPATH, '/html/body/div/div[3]/div[1]/div[1]/div[1]/form/input[1]')
    input_el.send_keys('牛客', Keys.ENTER)
    # do something
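A fixed `time.sleep(1)` either wastes time or is too short when the page is slow. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose; the sketch below only illustrates the underlying polling idea in plain Python. `wait_until` is a hypothetical helper, not part of Selenium.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 10.0, interval: float = 0.5) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # For example, the element is not in the DOM yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with the driver from the example above:
# input_el = wait_until(lambda: web.find_element(
#     By.XPATH, '/html/body/div/div[3]/div[1]/div[1]/div[1]/form/input[1]'), timeout=5)
```

The call returns as soon as the element can be found, so the script waits only as long as it has to.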

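4. Enable headless mode

Headless mode decides whether the browser runs with a visible window or without one.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

option = Options()  # Instantiate the options object
option.add_argument("--headless")  # Add the headless argument

if __name__ == '__main__':
    # Specify the driver location; otherwise it is looked up on the PATH
    # (for example, in the Python interpreter directory).
    # The raw string keeps backslash sequences such as \v from being read as escapes.
    service = Service(r'D:\PyProject\spider\venv\Scripts\chromedriver.exe')
    web = Chrome(service=service, options=option)
    web.get("https://baidu.com")
    print(web.title)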

5. Save a page screenshot

from selenium.webdriver import Chrome

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("https://baidu.com")
    print(web.title)
    web.save_screenshot('baidu.png')  # Save a screenshot of the current page to the current folder
    web.close()  # Close the current page

6. Simulate input and clicks

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

if __name__ == '__main__':
    web = Chrome()
    web.maximize_window()  # Maximize the browser window
    web.get("https://baidu.com")
    el = web.find_element(By.ID, 'kw')  # Search box
    el.send_keys('Harris-H')
    btn = web.find_element(By.ID, 'su')  # Search button
    btn.click()
    # web.close()  # Close the current page

Baidu now seems able to detect Selenium, so an image CAPTCHA may also be required.

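6.1 Find nodes by their text

from selenium.webdriver.common.by import By

# Find the link whose text is exactly "百度一下"
driver.find_element(By.LINK_TEXT, "百度一下")
# Get the list of links whose text contains the given fragment (partial match)
driver.find_elements(By.PARTIAL_LINK_TEXT, "度一下")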

6.2 Get the text and attributes of a node

ele.text  # Text of the current node
ele.get_attribute("data-click")  # Value of the given attribute

6.3 Print information about the current page

print(driver.page_source)  # Source of the page
print(driver.get_cookies())  # Cookies of the page
print(driver.current_url)  # URL of the current page

6.4 Close the browser

driver.close()  # Close the current page (window)
driver.quit()  # Quit the whole browser
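If a script raises an error before it reaches `quit()`, the browser and driver processes can be left running. One way to guard against that is a small context manager, sketched below. `browser` is a hypothetical helper; it relies only on the driver having a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def browser(factory):
    """Create a driver with `factory()` and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # Closes every window and ends the driver session


# Hypothetical usage:
# from selenium.webdriver import Chrome
# with browser(Chrome) as web:
#     web.get("https://baidu.com")
#     print(web.title)
```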

6.5 Simulate mouse scrolling

from selenium.webdriver import Chrome
import time

if __name__ == '__main__':
    driver = Chrome()
    driver.get(
        "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=78000241_12_hao_pg&wd=selenium%20js%E6%BB%91%E5%8A%A8&fenlei=256&rsv_pq=8215ec3a00127601&rsv_t=a763fm%2F7SHtPeSVYKeWnxKwKBisdp%2FBe8pVsIapxTsrlUnas7%2F7Hoo6FnDp6WsslfyiRc3iKxP2s&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=31&rsv_sug1=17&rsv_sug7=100&rsv_sug2=0&rsv_btype=i&inputT=9266&rsv_sug4=9770")
    # 1. Scroll down (scrollTop=1000 moves the view 1000px from the top)
    js = "document.documentElement.scrollTop=1000"
    # Execute the JavaScript
    driver.execute_script(js)
    time.sleep(2)
    # 2. Scroll back to the top
    js = "document.documentElement.scrollTop=0"
    driver.execute_script(js)  # Execute the JavaScript
    time.sleep(2)
    driver.close()
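A fixed `scrollTop` value does not necessarily reach the bottom, especially on pages that load more content as you scroll. A common pattern is to keep scrolling until the page height stops growing. The sketch below assumes only that the driver offers `execute_script`; `scroll_to_bottom` and its parameters are hypothetical names.

```python
import time


def scroll_to_bottom(driver, pause: float = 1.0, max_rounds: int = 20) -> int:
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    height_js = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(height_js)
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight)")
        time.sleep(pause)  # Let lazily loaded content arrive
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds
```

`max_rounds` caps the loop so that an endlessly growing feed cannot keep the script running forever.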

7. ChromeOptions

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://110.52.235.176:9999")  # Add a proxy
options.add_argument("--headless")  # Headless mode
options.add_argument("--lang=en-US")  # Display pages in English
# Disable rendering of images and stylesheets
prefs = {"profile.managed_default_content_settings.images": 2, 'permissions.default.stylesheet': 2}
options.add_experimental_option("prefs", prefs)
service = Service(r"D:\ProgramApp\chromedriver\chromedriver73.exe")
driver = webdriver.Chrome(service=service, options=options)
driver.get("http://httpbin.org/ip")
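When the same script has to run with different settings (with or without a proxy, headless or not), it can help to build the argument list in one place. This is a minimal sketch; `chrome_arguments` is a hypothetical helper, and the flags are the ones already used above.

```python
from typing import List, Optional


def chrome_arguments(headless: bool = True,
                     proxy: Optional[str] = None,
                     lang: Optional[str] = None) -> List[str]:
    """Return the command-line switches for the requested settings."""
    args = []
    if headless:
        args.append("--headless")
    if proxy:
        args.append(f"--proxy-server={proxy}")
    if lang:
        args.append(f"--lang={lang}")
    return args


for arg in chrome_arguments(proxy="http://110.52.235.176:9999", lang="en-US"):
    print(arg)  # Each string would be passed to options.add_argument(arg)
```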

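8. Moving a verification slider

Goal: pass a slider CAPTCHA

Locate the button
Press and hold the slider
Drag the button

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

if __name__ == '__main__':
    chrome_obj = webdriver.Chrome()
    chrome_obj.get('https://www.helloweba.net/demo/2017/unlock/')
    # 1. Locate the slider handle
    click_obj = chrome_obj.find_element(
        By.XPATH, '//div[@class="bar1 bar"]/div[@class="slide-to-unlock-handle"]')
    # 2. Press and hold
    # Create an action chain object; its argument is the browser object
    action_obj = webdriver.ActionChains(chrome_obj)
    # Click and hold; the argument is the located handle
    action_obj.click_and_hold(click_obj)
    # Get the handle's width and height
    size_ = click_obj.size
    # Track width minus handle width is the distance to move along the x axis (to the right)
    width_ = 298 - size_['width']
    print(width_)
    # 3. Drag by that distance, then 4. release; perform() runs the queued actions
    action_obj.move_by_offset(width_, 0).release().perform()
    time.sleep(6)
    chrome_obj.quit()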
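The example above drags the handle in a single jump. Some slider checks look at how the pointer moves, so a frequently used variation splits the distance into several steps that speed up and then slow down. Whether a particular site needs this is an assumption; the sketch below only shows how such a list of offsets could be produced, and `build_track` is a hypothetical helper.

```python
from typing import List


def build_track(distance: int, steps: int = 10) -> List[int]:
    """Split `distance` into `steps` integer offsets that always sum to `distance`.

    Weights rise linearly towards the middle and fall again, so the drag
    starts slowly, speeds up, and then slows down.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    weights = [min(i, steps + 1 - i) for i in range(1, steps + 1)]
    total = sum(weights)
    track = [distance * w // total for w in weights]
    track[-1] += distance - sum(track)  # Put the rounding remainder in the last step
    return track


# Hypothetical usage with the action chain from the example above:
# for dx in build_track(width_):
#     action_obj.move_by_offset(dx, 0)
# action_obj.release().perform()
print(build_track(250))
```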

9. Opening multiple windows and switching pages

Sometimes a browser window has many tabs open, and you need to switch between them. Selenium provides driver.switch_to.window() for this (older tutorials call it switch_to_window). The handle of the page to switch to can be found in driver.window_handles.

from selenium import webdriver

if __name__ == '__main__':
    driver = webdriver.Chrome()
    driver.get("https://www.baidu.com/")
    driver.implicitly_wait(2)
    driver.execute_script("window.open('https://www.douban.com/')")
    driver.switch_to.window(driver.window_handles[1])
    print(driver.page_source)
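Indexing `window_handles[1]` relies on the new tab being second in the list, and that ordering may not be guaranteed. A slightly more defensive sketch remembers the handles that existed before the new tab was opened and switches to whichever one is new. `switch_to_new_window` is a hypothetical helper that uses only `driver.window_handles` and `driver.switch_to.window`.

```python
def switch_to_new_window(driver, known_handles):
    """Switch to the first handle that is not in `known_handles` and return it."""
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    raise LookupError("no new window was opened")


# Hypothetical usage:
# before = set(driver.window_handles)
# driver.execute_script("window.open('https://www.douban.com/')")
# switch_to_new_window(driver, before)
```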

10. Cookie operations

# 1. Get all cookies:
for cookie in driver.get_cookies():
    print(cookie)
# 2. Look up one cookie by its name (returns the whole cookie dict):
value = driver.get_cookie(key)
# 3. Delete all cookies:
driver.delete_all_cookies()
# 4. Delete a single cookie:
driver.delete_cookie(key)
# 5. Add a cookie:
driver.add_cookie({"name": "password", "value": "111111"})
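`driver.get_cookies()` returns a list of dictionaries, each with at least a `name` and a `value`. A crawler often wants these as a plain mapping or as a `Cookie` request header so that another HTTP client can reuse the session. The helpers and sample data below are illustrative assumptions, not part of Selenium.

```python
from typing import Dict, Iterable, Mapping


def cookies_to_dict(cookies: Iterable[Mapping]) -> Dict[str, str]:
    """Collapse Selenium-style cookie dicts into a {name: value} mapping."""
    return {c["name"]: c["value"] for c in cookies}


def cookie_header(cookies: Iterable[Mapping]) -> str:
    """Format the cookies as the value of an HTTP Cookie header."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Sample data in the shape returned by driver.get_cookies(); the values are made up.
sample = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com"},
    {"name": "lang", "value": "en", "domain": "example.com"},
]
print(cookies_to_dict(sample))
print(cookie_header(sample))
```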

11. Simulated login

Here we simulate logging in to our school's academic affairs system:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By

if __name__ == '__main__':
    web = Chrome()
    web.get('http://bkjx.wust.edu.cn/')
    username = web.find_element(By.ID, 'userAccount')
    username.send_keys('xxxxxxx')  # Enter your own student ID here
    password = web.find_element(By.ID, 'userPassword')
    password.send_keys('xxxxxxx')  # Enter your own password here
    btn = web.find_element(By.XPATH, '//*[@id="ul1"]/li[4]/button')
    btn.click()
    # do something
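Rather than typing the student ID and password into the script, one option is to read them from environment variables so they never land in source control. This is only a sketch: `load_credentials` and the variable names `JWC_USER` and `JWC_PASSWORD` are hypothetical.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the account and password from environment variables.

    JWC_USER and JWC_PASSWORD are hypothetical names chosen for this sketch.
    """
    try:
        return env["JWC_USER"], env["JWC_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None


# Hypothetical usage with the login example above:
# account, secret = load_credentials()
# username.send_keys(account)
# password.send_keys(secret)
```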

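Since there is no slider or similar verification, this is quite simple. After logging in, just carry on with whatever operations you need.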