Scraping gushiwen.cn (古诗文网) with Python

Author: xin  Date: 22-12-10 20:12:02  Views: 1109

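Python is well suited to web scraping. To fetch pages you need the requests library, which can be installed with pip install requests. The parsing step also relies on lxml for XPath support (pip install lxml).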

Steps:

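1. Send a GET request with requests.get. Some sites have anti-scraping measures, so set a user-agent header. Check the returned status code: 200 means the request succeeded, and the response body is then passed to the parsing function: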

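import requests
import setheader  # not shown in this post; presumably a local helper whose get_ua() returns a User-Agent string

def get(url):
    # Send the GET request with a browser-like User-Agent and a timeout.
    resp = requests.get(url,
                        headers={
                            "user-agent": setheader.get_ua()
                        },
                        timeout=10)
    # Only parse the page if the request succeeded.
    if resp.status_code == 200:
        parse(resp.text)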
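The setheader module used above is not shown in the post. As a rough sketch, assuming get_ua() simply returns a random browser User-Agent string, a stand-in could look like the following. The USER_AGENTS list and the build_headers helper are hypothetical names introduced here for illustration only.

```python
import random

# Illustrative User-Agent strings (assumption); any realistic browser strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0",
]


def get_ua():
    """Return one User-Agent string chosen at random."""
    return random.choice(USER_AGENTS)


def build_headers():
    """Build the headers dict that would be passed to requests.get(url, headers=...)."""
    return {"user-agent": get_ua()}
```

Rotating the User-Agent per request is only one simple way to look less like a script; a site may still throttle by IP or request rate.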

2. Parse the data with XPath and store the fields you need in a CSV file. Then check whether there is a next page; if there is, call get again to scrape it:

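from lxml import etree

def parse(html):
    root = etree.HTML(html)
    # Each poem on the page sits in its own <div class="sons"> block.
    sons = root.xpath('//div[@class="left"]//div[@class="sons"]')
    for son in sons:
        title = son.xpath('./div[1]/p[1]//text()')
        auther = son.xpath('./div[1]/p[@class="source"]//text()')
        content = son.xpath('./div[1]/div[@class="contson"]//text()')
        if title and auther:
            print(f'Saving poem: "{title[0]}"')
            # Build a fresh dict per poem instead of reusing a global one.
            item = {
                'title': title[0],
                'auther': auther[0],
                # xpath() returns a list of text fragments; join them into one string.
                'content': ''.join(content).strip(),
            }
            save4csv(item)
    # xpath() returns an empty list on the last page, so check before indexing.
    next_href = root.xpath('//a[@class="amore"]/@href')
    if next_href:
        get("https://so.gushiwen.cn" + str(next_href[0]))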
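Because get and parse call each other once per page, a long listing can run into Python's recursion limit. A minimal sketch of the same pagination written as a loop is shown below. The names next_page_url, crawl, fetch_and_parse, and max_pages are hypothetical and not part of the original post; fetch_and_parse stands for any function that downloads one page, saves its poems, and returns the list of next-page hrefs found by the XPath query.

```python
from urllib.parse import urljoin

BASE_URL = "https://so.gushiwen.cn"


def next_page_url(hrefs, base=BASE_URL):
    """Return the absolute URL of the next page, or None if there is none.

    `hrefs` is the list returned by an XPath query such as
    root.xpath('//a[@class="amore"]/@href'); it is empty on the last page.
    """
    if not hrefs:
        return None
    return urljoin(base, str(hrefs[0]))


def crawl(start_url, fetch_and_parse, max_pages=50):
    """Follow next-page links in a loop instead of recursing.

    `fetch_and_parse(url)` is a hypothetical callable supplied by the caller.
    `max_pages` is a safety cap so a link cycle cannot loop forever.
    """
    url, visited = start_url, []
    while url and len(visited) < max_pages:
        visited.append(url)
        url = next_page_url(fetch_and_parse(url))
    return visited
```

The loop stops when no next-page link is found or when max_pages is reached, whichever comes first.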

3. Save the scraped content to a CSV file:

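import os
from csv import DictWriter

# Column names, inferred from the keys set in parse().
header_field = ['title', 'auther', 'content']

def save4csv(item):
    has_header = os.path.exists("gushiwen.csv")
    # newline="" avoids blank rows on Windows; utf-8 keeps the Chinese text intact.
    with open("gushiwen.csv", "a", newline="", encoding="utf-8") as f:
        writer = DictWriter(f, header_field)
        if not has_header:
            writer.writeheader()
        writer.writerow(item)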
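To see the CSV step in isolation, here is a small self-contained sketch that cleans the text fragments XPath returns and appends them as one row. It uses only the standard library; clean_item, append_row, and FIELDS are hypothetical names, and the column order is an assumption based on the keys used in the post (including its "auther" spelling).

```python
import csv
import os

# Assumed column order; the key spelling follows the post.
FIELDS = ["title", "auther", "content"]


def clean_item(title, auther, content):
    """Collapse the lists of text fragments returned by XPath into plain strings."""
    return {
        "title": "".join(title).strip(),
        "auther": "".join(auther).strip(),
        "content": "".join(part.strip() for part in content),
    }


def append_row(path, item):
    """Append one row, writing the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerow(item)
```

Checking the file size as well as its existence means an empty file left over from an interrupted run still gets a header row.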

# requests # lxml
