Scrape Lianjia's second-hand housing listings and store them in a database for later review.
Page analysis
Inspecting the page shows that the site does not use a separated front end and back end: the listing data is rendered directly into the HTML, so we extract the page elements with XPath.
Each li element holds the corresponding div elements containing the relevant information:
Analyzing the request URL:
Only the number after pg changes between pages, as the sketch below shows; that completes the page analysis.
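A quick sketch of the URL pattern (the base URL is the Nanjing listing page used later in this article):

```python
# Listing pages differ only in the number after "pg".
base_url = 'https://nj.lianjia.com/ershoufang/pg{}/'
for page in range(1, 4):
    print(base_url.format(page))
# https://nj.lianjia.com/ershoufang/pg1/
# https://nj.lianjia.com/ershoufang/pg2/
# https://nj.lianjia.com/ershoufang/pg3/
```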
Importing libraries
The code is as follows:
```python
from lxml import etree
from fake_useragent import UserAgent
import requests
import random
import pymysql

# Pool of free proxies, one of which is chosen at random per request.
# requests selects a proxy by the (lowercase) URL scheme, and the target
# site is served over HTTPS, so the keys must be 'https'.
proxy_pool = [{'https': '175.43.151.3:9999'}, {'https': '220.249.149.140:9999'},
              {'https': '175.44.108.206:9999'}, {'https': '120.83.101.115:9999'},
              {'https': '175.42.122.233:9999'}, {'https': '60.13.42.107:9999'},
              {'https': '113.195.152.127:9999'}, {'https': '36.248.133.196:9999'},
              {'https': '120.83.105.95:9999'}, {'https': '112.111.217.160:9999'},
              {'https': '171.12.221.158:9999'}, {'https': '113.121.72.221:9999'}]

headers = {
    'Host': 'nj.lianjia.com',
    'User-Agent': UserAgent().random  # random User-Agent string
}

# MySQL connection; fill in your own username, password, and database name.
conn = pymysql.connect(host='localhost', port=3306, user='username',
                       password='password', db='database', charset='utf8')
```
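The article does not show the table schema, so here is a minimal sketch that matches the INSERT statement used below; the id column and the VARCHAR types and lengths are my assumptions:

```python
# One-off table setup, run once before scraping. Column names match the
# INSERT in get_page(); the column types and sizes are assumptions.
ddl = '''
CREATE TABLE IF NOT EXISTS lianjia (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255),
    flood      VARCHAR(255),
    address    VARCHAR(255),
    followInfo VARCHAR(255),
    totalPrice VARCHAR(32),
    unitPrice  VARCHAR(64)
) DEFAULT CHARSET = utf8
'''
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
```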
Writing the scraping function
The code is as follows:
```python
def get_page(url):
    # Fetch the listing page through a random proxy and parse it with lxml.
    response = requests.get(url=url, headers=headers,
                            proxies=random.choice(proxy_pool)).text
    parse_data = etree.HTML(response)
    # Each <li> under ul.sellListContent is one listing.
    li_list = parse_data.xpath('//ul[@class="sellListContent"]/li')
    for li in li_list:
        title = li.xpath('./div[@class="info clear"]/div[@class="title"]/a/text()')[0]
        # Community and area, joined with a dash.
        flood = li.xpath('./div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]/a[1]/text()')[0] + \
                '- ' + \
                li.xpath('./div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]/a[2]/text()')[0]
        address = li.xpath('./div[@class="info clear"]/div[@class="address"]/div[@class="houseInfo"]/text()')[0]
        followInfo = li.xpath('./div[@class="info clear"]/div[@class="followInfo"]/text()')[0]
        totalPrice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]/span[1]/text()')[0] + '万'
        unitPrice = li.xpath('./div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]/span[1]/text()')[0]
        # Parameterized INSERT: avoids breaking the query when a scraped value
        # contains quotes, and avoids SQL injection.
        sql = ('insert into lianjia(title, flood, address, followInfo, totalPrice, unitPrice) '
               'values (%s, %s, %s, %s, %s, %s)')
        cursor = conn.cursor()
        try:
            cursor.execute(sql, (title, flood, address, followInfo, totalPrice, unitPrice))
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()
```
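Free proxies such as those in proxy_pool are often dead, and requests.get() without a timeout can hang indefinitely. A hedged variant of the fetch step (the timeout and retry count are arbitrary choices, not from the original code) retries with a fresh random proxy on failure:

```python
def fetch(url, retries=3, timeout=10):
    # Try up to `retries` times, picking a fresh random proxy each time.
    for attempt in range(retries):
        try:
            return requests.get(url, headers=headers,
                                proxies=random.choice(proxy_pool),
                                timeout=timeout).text
        except requests.RequestException as e:
            print('attempt {} failed: {}'.format(attempt + 1, e))
    raise RuntimeError('all {} attempts failed for {}'.format(retries, url))
```

get_page() could then call fetch(url) in place of the bare requests.get(...) call.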
Writing the main function
The code is as follows:
```python
if __name__ == '__main__':
    base_url = 'https://nj.lianjia.com/ershoufang/pg{}/'
    # Crawl listing pages 1-100.
    for i in range(1, 101):
        get_page(base_url.format(i))
        print('Storing page {}...'.format(i))
    conn.close()
```
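Fetching 100 pages back to back may trigger Lianjia's anti-crawling measures. One common precaution, not part of the original code, is a short random pause between pages (the 1-3 second range here is an arbitrary choice):

```python
import time

if __name__ == '__main__':
    base_url = 'https://nj.lianjia.com/ershoufang/pg{}/'
    for i in range(1, 101):
        get_page(base_url.format(i))
        print('Storing page {}...'.format(i))
        time.sleep(random.uniform(1, 3))  # assumed polite delay between pages
    conn.close()
```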
Run results
Run conditional queries against the table to pull out the data you want:
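For example, a conditional query over the scraped table via the same pymysql connection; the filter value here is hypothetical:

```python
# Hypothetical example: listings whose houseInfo text (stored in the
# "address" column) mentions "2室1厅" (2 bedrooms, 1 living room).
with conn.cursor() as cursor:
    cursor.execute(
        'SELECT title, flood, totalPrice, unitPrice FROM lianjia '
        'WHERE address LIKE %s',
        ('%2室1厅%',)
    )
    for row in cursor.fetchall():
        print(row)
```

Note that totalPrice is stored as text such as "300万", so numeric range filters would require stripping the 万 suffix or adding a numeric column.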
Note: this example is for learning purposes only.