xpath解析库

在上⼀章中, 我们基本上掌握了抓取整个⽹⻚的基本技能。但是呢, ⼤多数情况下, 我们并不需要整个⽹⻚的内容, 只是需要那么⼀⼩部分。怎么办呢? 这就涉及到了数据提取的问题。

本课程中, 提供三种解析⽅式:

  • xpath解析
  • bs4解析
  • re解析

这三种⽅式可以混合进⾏使⽤, 完全以结果做导向, 只要能拿到你想要的数据。⽤什么⽅案并不重要。当你掌握了这些之后。再考虑性能的问题。

XPath是⼀⻔在 XML ⽂档中查找信息的语⾔, XPath可⽤来在 XML⽂档中对元素和属性进⾏遍历,⽽我们熟知的HTML恰巧属于XML的⼀个⼦集,所以完全可以⽤xpath去查找html中的内容。

⾸先, 先了解⼏个概念:

图片

在上述html中,

  1. book, id, name, price….都被称为节点.

  2. Id, name, price, author被称为book的⼦节点

  3. book被称为id, name, price, author的⽗节点

  4. id, name, price,author被称为同胞节点

OK~ 有了这些基础知识后, 我们就可以开始了解xpath的基本语法了

在python中想要使⽤xpath,需要安装lxml模块。

1
pip install lxml

基础⽤法:

  1. 将要解析的html内容构造出etree对象.

  2. 使⽤etree对象的xpath()⽅法配合xpath表达式来完成对数据的提取

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from lxml import etree
html = """
<book>
<id>1</id>
<name>野花遍地⾹</name>
<price>1.23</price>
<nick>臭⾖腐</nick>
<author>
<nick id="10086">周⼤强</nick>
<nick id="10010">周芷若</nick>
<nick>周杰伦</nick>
<nick>蔡依林</nick>
<div>
<nick>惹了</nick>
</div>
</author>
<partner>
<nick id="ppc">胖胖陈</nick>
<nick id="ppbc">胖胖不陈</nick>
</partner>
</book>
"""
et = etree.XML(html)
# 根据节点进⾏搜索
# result = et.xpath("/book")
# result = et.xpath("/book/id") # /在开头表示⽂档最开始, /在中间表示⼉⼦
# result = et.xpath("/book//nick") # //表示后代
result = et.xpath("/book/*/nick") # *表示通配符
print(result)

xpath如何提取属性信息. 我们上⼀段真实的HTML来给各位讲解⼀下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<title>Title</title>
</head>
<body>
<ul>
<li><a href="http://www.baidu.com">百度
</a></li>
<li><a href="http://www.google.com">⾕
歌</a></li>
<li><a href="http://www.sogou.com">搜狗
</a></li>
</ul>
<ol>
<li><a href="feiji">⻜机</a></li>
<li><a href="dapao">⼤炮</a></li>
<li><a href="huoche">⽕⻋</a></li>
</ol>
<div>李嘉诚</div>
<div>胡辣汤</div>
</body>
</html>
1
2
3
4
5
6
7
8
9
10
from lxml import etree
tree = etree.parse("1.html")
result = tree.xpath("/html/body/ul/li/a/@href")
print(result)
result = tree.xpath("/html/body/ul/li")
for li in result:
print(li.xpath("./a/@href")) # 局部解析
result = tree.xpath("//div[@class='job']/text()")
# [@class='xxx']属性选取 text()获取⽂本
print(result)

实战案例:

一、58二手房标题

1
2
3
4
5
6
7
8
9
10
11
12
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests
if __name__ == "__main__":
headers ={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
url = "https://bj.58.com/ershoufang/"
page_text = requests.get(url,headers = headers).text
tree = etree.HTML(page_text)
titles = tree.xpath("//h3/text()")
for title in titles:
print(title)

图片

二:彼岸壁纸下载

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests
import os
if __name__ == "__main__":
headers ={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
url = "https://pic.netbian.com/4kfengjing/"
page_text = requests.get(url,headers = headers).text.encode("ISO-8859-1")

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]//li')

if not os.path.exists("./piclibs"):
os.mkdir('./piclibs')

for li in li_list:
img_src = 'https://pic.netbian.com/'+li.xpath('./a/img/@src')[0]
# print()
img_name = li.xpath('./a/img/@alt')[0]+'.jpg'
# print(img_name,img_src)
img_data = requests.get(url=img_src,headers = headers).content
img_path = "piclibs/"+img_name
with open(img_path,"wb") as fp:
fp.write(img_data)
print(img_name,"下载成功")

图片

三、全国城市名称爬取

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree
import requests
import os
if __name__ == "__main__":
headers ={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3884.400 QQBrowser/10.8.4560.400'}
url = "https://www.aqistudy.cn/historydata/"
page_text = requests.get(url,headers = headers).text

tree = etree.HTML(page_text)
#全部城市
all_city_list = tree.xpath('//div[@class="bottom"]//li')
all_city_names = []
for li in all_city_list:
city_name = li.xpath("./a/text()")[0]
all_city_names.append(city_name)
for city in all_city_names:
print(city)
# print(all_city_names)

#热门城市
# hot_city_names = []
# hot_city_list = tree.xpath('//div[@class="hot"]//li')
# for li in hot_city_list:
# city_name = li.xpath("./a/text()")[0]
#
# hot_city_names.append(city_name)
# print(hot_city_names)

图片