Downloading Biquge (笔趣阁) novels with Scrapy

Scrapy can be installed directly with pip install scrapy.

Create the project: scrapy startproject bqg

Create the spider: scrapy genspider bqgspider bqg.com

Once the basic project skeleton has been generated, first edit the settings.py file:
```python
BOT_NAME = "bqg"

SPIDER_MODULES = ["bqg.spiders"]
NEWSPIDER_MODULE = "bqg.spiders"

USER_AGENT = "bqg (+http://www.yourdomain.com)"

# Throttle requests so we don't hammer the site
DOWNLOAD_DELAY = 3

ITEM_PIPELINES = {
    "bqg.pipelines.BqgPipeline": 300,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
Now we can start writing the spider:

1. First, modify allowed_domains and start_urls:
```python
import scrapy


class BqgspiderSpider(scrapy.Spider):
    name = "bqgspider"
    allowed_domains = ["ydxrf.com"]
    start_urls = ["http://www.ydxrf.com/sort/6_1/"]
```
2. Get the name and link of each novel in this category, and pass them on to the method that parses the novel's chapter list:
```python
    def parse(self, response, **kwargs):
        li_list = response.xpath('//div[@id="newslist"]/div/ul/li')
        for li in li_list:
            xs_name = li.xpath('./span[@class="s2"]/a/text()').get()
            xs_link = li.xpath('./span[@class="s2"]/a/@href').get()
            xs_link = 'http://www.ydxrf.com' + xs_link
            # Pass the novel name along to the next callback via cb_kwargs
            yield scrapy.Request(
                url=xs_link,
                callback=self.parse_chapter,
                cb_kwargs={'xs_name': xs_name},
            )
```
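Note that the relative href is turned into an absolute URL by plain string concatenation here; Scrapy responses also provide response.urljoin, which resolves relative paths against the page URL. The equivalent standard-library behavior (the example URL below is illustrative, not taken from the site):

```python
from urllib.parse import urljoin

# A root-relative href is resolved against the page's URL
base = 'http://www.ydxrf.com/sort/6_1/'
print(urljoin(base, '/book/123/'))  # http://www.ydxrf.com/book/123/
```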
3. Get each chapter's name and link, and pass them on to the method that parses the chapter content:
```python
    def parse_chapter(self, response, **kwargs):
        xs_name = kwargs['xs_name']
        li_list = response.xpath('//div[@id="list"]/dl/dd')
        for li in li_list:
            chapter_name = li.xpath('./a/text()').get()
            chapter_link = li.xpath('./a/@href').get()
            chapter_link = 'http://www.ydxrf.com' + chapter_link
            # Forward both the novel name and the chapter name to the content parser
            yield scrapy.Request(
                url=chapter_link,
                callback=self.parse_content,
                cb_kwargs={'xs_name': xs_name, 'chapter_name': chapter_name},
            )
```
4. Get the chapter text, then pass the novel name, chapter name, and content to the pipeline:
```python
    def parse_content(self, response, **kwargs):
        xs_name = kwargs['xs_name']
        chapter_name = kwargs['chapter_name']
        content = response.xpath('//div[@id="htmlContent"]/p/text()').getall()
        content_str = '\n'.join(content)
        yield BqgItem(xs_name=xs_name, chapter_name=chapter_name, content_str=content_str)
```
5. Save each chapter as a .txt file (pipelines.py):
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BqgPipeline:
    def process_item(self, item, spider):
        print('Saving: ' + item['xs_name'] + ' - ' + item['chapter_name'])
        novel_dir = f'C:/Users/Administrator/Desktop/xs/{item["xs_name"]}'
        # makedirs with exist_ok=True creates missing parent directories and
        # avoids the race between checking for the directory and creating it
        os.makedirs(novel_dir, exist_ok=True)
        with open(f'{novel_dir}/{item["chapter_name"]}.txt', 'w', encoding='utf-8') as f:
            f.write(item['content_str'])
        return item
```
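One caveat with this pipeline: chapter titles scraped from the page can contain characters that are invalid in Windows filenames (such as ?, :, or *), which would make open() fail. A small hypothetical helper (not part of the original pipeline) to sanitize the name before writing:

```python
import re


def safe_filename(name: str) -> str:
    # Replace characters that Windows forbids in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()


print(safe_filename('第1章 他是谁?'))  # 第1章 他是谁_
```

With the project in place, run the spider with: scrapy crawl bqgspider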