
Lianjia New Housing Scraping Practice

Scrape new-home listing data from Lianjia with Python and persist it to a database.

Keep practicing; practice makes perfect.

Preface

The main content of this article follows; the example below is for reference.

I. Page Analysis

As usual, we start by analyzing the page: the information we want sits in the <li> elements under a <ul>, so there is not much to explain there.

The plan: visit all 85-odd listing pages, randomize the User-Agent and IP proxy on each request, and work out the request format of the page URLs.

After that it is just a matter of locating the relevant elements with XPath and writing the parsed data into the database.
On to the code.
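The two ideas above (the pg{} pagination format and XPath selection of the listing <li> elements) can be sketched in isolation. The HTML fragment below is a hypothetical stand-in that only mimics the class names observed on the real page:

```python
from lxml import etree

# Pagination template observed on the site: pages 1..85
base_url = 'https://nj.fang.lianjia.com/loupan/pg{}/'
urls = [base_url.format(i) for i in range(1, 86)]

# Minimal HTML fragment mimicking one listing <li> (class names as on the page)
sample = '''
<ul>
  <li class="resblock-list post_ulog_exposure_scroll has-results">
    <div class="resblock-desc-wrapper">
      <div class="resblock-name"><a>示例小区</a></div>
    </div>
  </li>
</ul>
'''
tree = etree.HTML(sample)
# Exact class-string match, as in the scraper below
lis = tree.xpath('//li[@class="resblock-list post_ulog_exposure_scroll has-results"]')
names = [li.xpath('.//div[@class="resblock-name"]/a/text()')[0] for li in lis]
```

Running the same two XPath expressions against the live pages yields one record per listing.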

II. Writing the Code

1. Creating the database table

The SQL is as follows:

CREATE TABLE `xinfang` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`resblock_type` varchar(10) DEFAULT NULL,
`sale_status` varchar(10) DEFAULT NULL,
`area` varchar(10) DEFAULT NULL,
`location` varchar(255) DEFAULT NULL,
`resblock_room` varchar(10) DEFAULT NULL,
`resblock_area` varchar(30) DEFAULT NULL,
`main_price` varchar(150) DEFAULT NULL,
`second` varchar(30) DEFAULT NULL,
`img_url` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=843 DEFAULT CHARSET=utf8;
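The ten non-`id` columns above map one-to-one to the insert used later. As a sketch, the statement can be generated from the column list with pymysql's `%s` placeholders, so every value is escaped by the driver rather than pasted in with str.format:

```python
# Columns match the xinfang table; id is AUTO_INCREMENT and omitted.
columns = ['name', 'resblock_type', 'sale_status', 'area', 'location',
           'resblock_room', 'resblock_area', 'main_price', 'second', 'img_url']
sql = 'INSERT INTO xinfang ({}) VALUES ({})'.format(
    ', '.join('`%s`' % c for c in columns),   # backtick-quote column names
    ', '.join(['%s'] * len(columns)))          # one placeholder per column
# Usage: cursor.execute(sql, row_tuple) — pymysql escapes each value safely.
```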

2. The scraper

The code is as follows:

import random

import pymysql
import requests
from fake_useragent import UserAgent
from lxml import etree

# Proxy pool (free proxies expire quickly; replace with working addresses)
proxy_addresses = [
    '112.84.53.165:9999', '171.35.169.58:9999', '49.86.180.142:9999',
    '113.194.131.190:9999', '110.243.22.233:9999', '123.169.163.99:9999',
    '123.163.117.140:9999', '113.195.20.166:9999', '114.235.23.237:9000',
    '202.109.157.64:9000', '171.35.175.31:9999', '113.195.168.235:9999',
    '125.108.75.135:9000', '123.101.237.3:9999', '139.155.41.15:8118',
    '118.212.104.240:9999',
]
# Request headers with a random User-Agent
headers = {
    'Referer': 'https://nj.fang.lianjia.com/',
    'User-Agent': UserAgent().random,
}

if __name__ == '__main__':
    print('Opening database...')
    conn = pymysql.Connect(host='localhost', port=3306, user='root',
                           password='', db='spider', charset='utf8')
    cursor = conn.cursor()
    # URL template, filled in with str.format
    base_url = 'https://nj.fang.lianjia.com/loupan/pg{}/'
    print('Initializing URLs...')
    # 85 listing pages in total
    urls = [base_url.format(i) for i in range(1, 86)]
    print('Crawling...')
    wrapper = './div[@class="resblock-desc-wrapper"]'
    for url in urls:
        # Pick a random proxy; requests expects lowercase scheme keys, and the
        # 'https' key is required because the site is served over https
        address = random.choice(proxy_addresses)
        proxies = {'http': 'http://' + address, 'https': 'http://' + address}
        response = requests.get(url=url, headers=headers, proxies=proxies)
        # Parse the response with lxml
        res_data = etree.HTML(response.content.decode())
        # Locate every listing <li>
        lis = res_data.xpath('//li[@class="resblock-list post_ulog_exposure_scroll has-results"]')
        for li in lis:
            # Image URL (strip the thumbnail size suffix)
            img_url = li.xpath('./a/img/@data-original')[0].split('.592x432.jpg')[0]
            # Complex name
            name = li.xpath(wrapper + '/div[@class="resblock-name"]/a[1]/text()')[0]
            # The remaining fields follow the same pattern
            resblock_type = li.xpath(wrapper + '/div[@class="resblock-name"]/span[1]/text()')[0]
            sale_status = li.xpath(wrapper + '/div[@class="resblock-name"]/span[2]/text()')[0]
            area = li.xpath(wrapper + '/div[@class="resblock-location"]/span[1]/text()')[0]
            location = '/'.join([
                li.xpath(wrapper + '/div[@class="resblock-location"]/span[1]/text()')[0],
                li.xpath(wrapper + '/div[@class="resblock-location"]/span[2]/text()')[0],
                li.xpath(wrapper + '/div[@class="resblock-location"]/a[1]/text()')[0],
            ])
            # Some fields may be missing; fall back to a placeholder
            # ("暂无信息" = no information) so the insert does not fail
            resblock_room = li.xpath(wrapper + '/a[@class="resblock-room"]/span[1]/text()')
            resblock_room = resblock_room[0] if resblock_room else '暂无信息'
            resblock_area = li.xpath(wrapper + '/div[@class="resblock-area"]/span[1]/text()')
            resblock_area = resblock_area[0] if resblock_area else '暂无信息'
            main_price = li.xpath(wrapper + '/div[@class="resblock-price"]/div[@class="main-price"]/span[@class="number"]/text()')[0]
            second = li.xpath(wrapper + '/div[@class="resblock-price"]/div[@class="second"]/text()')
            second = second[0] if second else '暂无信息'
            # Parameterized insert: pymysql escapes every value, avoiding the
            # quoting errors that str.format causes on strings with quotes
            sql = ('insert into xinfang(name, resblock_type, sale_status, area, '
                   'location, resblock_room, resblock_area, main_price, `second`, img_url) '
                   'values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)')
            # Execute inside a transaction: commit on success, roll back on error
            try:
                cursor.execute(sql, (name, resblock_type, sale_status, area, location,
                                     resblock_room, resblock_area, main_price, second, img_url))
                conn.commit()
            except Exception as e:
                print(e)
                conn.rollback()
    print('Crawl finished, closing database...')
    # Close the database connection; the program ends here
    conn.close()
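When looping over 85 pages like this, a short random pause between requests reduces the load on the site and the chance of being blocked. This helper is a sketch and not part of the original script; the interval bounds are arbitrary:

```python
import random
import time

def polite_sleep(low=1.0, high=3.0):
    """Sleep for a random interval between low and high seconds and return it."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` once at the top of the page loop is usually enough.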


Results

The program output is as follows:

Title: Lianjia New Housing Scraping Practice

Author: fanchen

Published: 2021-01-17 14:10:21

Last updated: 2021-02-05 17:58:30

Original link: http://88fanchen.github.io/posts/b3588f8a/

License: CC BY-NC-ND 4.0 International. Please keep the original link and author when reposting.