Downloading Biquge Novels with Scrapy

Author: xin · 2023-01-13 08:47:14

Scrapy can be installed directly with pip install scrapy.

Create the project: scrapy startproject bqg

Create the spider: scrapy genspider bqgspider bqg.com
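
These two commands generate the standard Scrapy skeleton, roughly like this (bqgspider.py appears after genspider runs):

bqg/
    scrapy.cfg            # deploy configuration
    bqg/
        __init__.py
        items.py          # item definitions (used in step 4)
        middlewares.py
        pipelines.py      # item pipelines (used in step 5)
        settings.py       # project settings
        spiders/
            __init__.py
            bqgspider.py  # the spider generated above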


Once the basic project skeleton is in place, first edit settings.py:

BOT_NAME = "bqg"

SPIDER_MODULES = ["bqg.spiders"]
NEWSPIDER_MODULE = "bqg.spiders"

# identify the crawler to the site; the URL is the template placeholder
USER_AGENT = "bqg (+http://www.yourdomain.com)"

# wait 3 seconds between requests so the site is not hammered
DOWNLOAD_DELAY = 3

# enable the pipeline that writes chapters to disk (step 5 below)
ITEM_PIPELINES = {
    "bqg.pipelines.BqgPipeline": 300,
}

# defaults written by recent Scrapy project templates
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

Now we can start writing the spider:

1. First, change allowed_domains and start_urls (genspider filled in bqg.com as a placeholder; the actual target mirror here is ydxrf.com):

class BqgspiderSpider(scrapy.Spider):
    name = "bqgspider"
    allowed_domains = ["ydxrf.com"]
    # listing page for one novel category on the site
    start_urls = ["http://www.ydxrf.com/sort/6_1/"]

2. Collect each novel's name and link from the category page, and pass them to the chapter-list parser:

def parse(self, response, **kwargs):
    # each <li> under #newslist is one novel on the category page
    li_list = response.xpath('//div[@id="newslist"]/div/ul/li')
    for li in li_list:
        xs_name = li.xpath('./span[@class="s2"]/a/text()').get()
        xs_link = li.xpath('./span[@class="s2"]/a/@href').get()
        # hrefs are site-relative, so prepend the domain
        xs_link = 'http://www.ydxrf.com' + xs_link
        yield scrapy.Request(
            url=xs_link,
            callback=self.parse_chapter,
            cb_kwargs={'xs_name': xs_name},  # carry the novel name along
        )
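
The plain string concatenation above works because this site's hrefs are site-relative. response.urljoin is a slightly more robust alternative that handles relative and absolute hrefs alike:

xs_link = response.urljoin(li.xpath('./span[@class="s2"]/a/@href').get())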

3. Collect the chapter names and links for each novel, and pass them on to the content parser:

def parse_chapter(self, response, **kwargs):
    xs_name = kwargs['xs_name']
    # each <dd> under #list/dl is one chapter of the novel
    li_list = response.xpath('//div[@id="list"]/dl/dd')
    for li in li_list:
        chapter_name = li.xpath('./a/text()').get()
        chapter_link = li.xpath('./a/@href').get()
        chapter_link = 'http://www.ydxrf.com' + chapter_link
        yield scrapy.Request(
            url=chapter_link,
            callback=self.parse_content,
            cb_kwargs={
                'xs_name': xs_name,
                'chapter_name': chapter_name,
            },
        )

4. Extract the chapter text, then hand the novel name, chapter name, and content to the pipeline:

def parse_content(self, response, **kwargs):
    xs_name = kwargs['xs_name']
    chapter_name = kwargs['chapter_name']
    # grab every paragraph of the chapter body and join with newlines
    content = response.xpath('//div[@id="htmlContent"]/p/text()').getall()
    content_str = '\n'.join(content)

    yield BqgItem(xs_name=xs_name, chapter_name=chapter_name, content_str=content_str)
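
The BqgItem yielded here is not shown elsewhere in the post: its three fields have to be declared in items.py, and the spider needs from bqg.items import BqgItem at the top. A minimal sketch matching the fields used above:

import scrapy


class BqgItem(scrapy.Item):
    xs_name = scrapy.Field()       # novel name
    chapter_name = scrapy.Field()  # chapter title
    content_str = scrapy.Field()   # chapter body text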

5. Save the novels as txt files:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BqgPipeline:
    def process_item(self, item, spider):
        print('Saving: ' + item['xs_name'] + ' ' + item['chapter_name'])
        # one folder per novel; makedirs also creates missing parents and,
        # with exist_ok=True, is a no-op when the folder already exists
        novel_dir = f'C:/Users/Administrator/Desktop/xs/{item["xs_name"]}'
        os.makedirs(novel_dir, exist_ok=True)
        # one UTF-8 txt file per chapter
        with open(f'{novel_dir}/{item["chapter_name"]}.txt', 'w', encoding='utf-8') as f:
            f.write(item['content_str'])
        return item
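
With the pipeline saved, run the crawl from the project root:

scrapy crawl bqgspider

One caveat: chapter titles can contain characters that Windows forbids in filenames (such as ? or :). A sanitizing step before the open() call avoids crashes on such titles; a quick sketch with re (safe_name is a hypothetical name, not in the original code):

import re

safe_name = re.sub(r'[\\/:*?"<>|]', '_', item['chapter_name'])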

