[Scrapy Crawler] Scraping NetEase News with the Automated crawl Template

Reader submission, 2022-09-22

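Scraping NetEase News Data

1. Create the project
2. Modify items.py
3. Define the spider from a crawl template
   3.1 Create the crawl spider template
   3.2 XPath selectors
   3.3 Analyze the page content
4. Modify the generated spider file
   4.1 Imports
   4.2 A brief introduction to regular expressions
   4.3 Callback functions
5. Modify the pipeline file
   5.1 Import the CSV exporter
   5.2 Define the processing methods
6. Results

Manual anti-scraping note: original blog address

Organizing this material took effort; please respect the author's work.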

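1. Create the project

In a command-line window, run scrapy startproject news.

Scrapy then creates the project files automatically.

The role of each file was covered in detail in the previous post; refer back to it if needed.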

2. Modify items.py

Open the items.py file that Scrapy generated.

Edit it to declare the fields to collect: the thread ID, news title, URL, time, source, source URL, and news body.

import scrapy


class NewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    news_thread = scrapy.Field()
    news_title = scrapy.Field()
    news_url = scrapy.Field()
    news_time = scrapy.Field()
    news_source = scrapy.Field()
    source_url = scrapy.Field()
    news_body = scrapy.Field()

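3. Define the spider from a crawl template

3.1 Create the crawl spider template

In a command-line window, generate a spider from the crawl template with scrapy genspider -t crawl news163 news.163.com. The command creates a news163.py file in the spiders folder.

Note: run the command from the project root directory and check it for typos. -t selects the template named after it (crawl), news163 is the spider name, and the trailing news.163.com is the NetEase News domain.

Compared with the basic template, the crawl template adds a link extractor and crawling rules, which makes it easier to automate deeper crawls that follow links.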

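3.2 XPath selectors

Scrapy supports both XPath and CSS selectors. CSS selectors were covered in an earlier crawler example, so this section adds XPath. The basic syntax is as follows.

① Written by hand:

/html/head/title selects the title element

/html/head/title/text() extracts the title text

//td selects td elements at any depth (a double slash searches the whole document)

//div[@class='mine'] selects div elements whose class attribute is mine

② Copied from the browser:

The browser's developer tools can copy the XPath of one specific element. The copied expression usually starts with // instead of spelling out the full path from the root. For example, copying the XPath of one element on the page gives: //*[@id="js_top_news"]/div[2]/ul/li[2]/a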
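The hand-written patterns above can be tried without running a spider. The sketch below is only an illustration: it uses Python's standard xml.etree.ElementTree, which supports a limited XPath subset, as a stand-in for Scrapy's response.xpath, and the sample markup is made up. ElementTree searches relative to an element, so /html/head/title becomes ./head/title and //td becomes .//td.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from news.163.com).
SAMPLE = """<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="mine"><table><tr><td>first cell</td></tr></table></div>
    <div class="other">ignored</div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Counterpart of /html/head/title/text(): locate the title, read its text.
title_text = root.find("./head/title").text

# Counterpart of //td: td elements at any depth.
td_texts = [td.text for td in root.findall(".//td")]

# Counterpart of //div[@class='mine']: divs whose class attribute is "mine".
mine_divs = root.findall(".//div[@class='mine']")

print(title_text, td_texts, len(mine_divs))
```

Inside a spider the same expressions would be passed to response.xpath, for example response.xpath('/html/head/title/text()').get().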

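3.3 Analyze the page content

Open the NetEase News site in Google Chrome and view the page source to confirm that the fields declared in items.py can actually be obtained (those fields were in fact chosen after inspecting the page source).

Using Inspect, confirm that the title, time, URL, source URL, and body each map to a tag in the page, for example the body text.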

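4. Modify the generated spider file

4.1 Imports

Open the generated spider file and start editing. Besides the three imports that the template adds automatically, we also need to import NewsItem from news.items. (This relies on Python packages: as mentioned at the start, the presence of an __init__.py file marks a folder as a package, so it can be imported directly without installing anything.)

Note: the spider class (News163Spider here) must inherit from CrawlSpider, because the file was generated from the crawl template.

import scrapy
from news.items import NewsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class News163Spider(CrawlSpider):
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['...']  # the start URL is missing from the source text

    rules = (
        Rule(LinkExtractor(allow=r'/18/04\d+/*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

In Rule(LinkExtractor(allow=r'/18/04\d+/*'), callback='parse_item', follow=True), the allow argument holds a regular expression (the core thing we need to write), callback names the method that handles each matched page, and follow controls whether links found on matched pages are followed further.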
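To get a feel for which links a pattern such as r'/18/04\d+/*' selects, it can be tried with the standard re module. This is a rough sketch: it assumes the link extractor searches for the pattern anywhere in the absolute URL, and the URLs are invented for illustration.

```python
import re

ALLOW = re.compile(r'/18/04\d+/*')  # the pattern used in the Rule

# Hypothetical URLs, made up for this illustration.
urls = [
    "https://news.163.com/18/0405/10/ABCDEFGH0001899N.html",
    "https://news.163.com/17/1231/09/ZYXWVUTS0001875P.html",
    "https://news.163.com/special/about/",
]

# Keep only the URLs in which the pattern occurs somewhere.
matched = [u for u in urls if ALLOW.search(u)]
print(matched)
```

Only the first URL contains /18/04 followed by digits, so it is the only one kept.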

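4.2 A brief introduction to regular expressions

A systematic treatment will come in a dedicated crawler series; here we only cover the basics needed for this project. A regular expression is made up of literal characters and operators.

One pattern worth remembering is ".*?", the lazy (non-greedy) match: it consumes as little as possible and stops at the first point where the rest of the pattern matches. It handles most cases; the remaining patterns we have to write by hand.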
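The difference between the greedy ".*" and the lazy ".*?" is easiest to see side by side. The snippet below is a small illustration with the standard re module; the HTML string is made up.

```python
import re

# Hypothetical markup with two links.
html = '<a href="/a.html">A</a><a href="/b.html">B</a>'

# Greedy: .* runs to the last closing quote it can find.
greedy = re.search(r'href="(.*)"', html).group(1)

# Lazy: .*? stops at the first closing quote, so each link is captured alone.
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)
print(lazy)
```

The greedy version swallows everything between the first href and the last quote, while the lazy version yields the two paths separately.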

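Compare the link tags of several news articles to see what their URLs have in common.

The URL of the first news article is: [URL missing from the source]. A pattern covering such URLs then goes into the rule (the pattern itself is also missing from the source):

rules = (
    Rule(LinkExtractor(allow=r'...'), callback='parse_item', follow=True),
)

Then run the spider from the command line: scrapy crawl news163

The log shows responses with status 200, which means the requests succeeded.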

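4.3 Callback functions

parse_item is the callback we need to implement. Start with two simpler fields: the thread (the last segment of the URL with its final five characters removed) and the title (usually the contents of the page's title tag).

    def parse_item(self, response):
        item = NewsItem()
        # last URL segment minus its final five characters
        item['news_thread'] = response.url.strip().split("/")[-1][:-5]
        self.get_title(response, item)
        return item

    def get_title(self, response, item):
        title = response.css('title::text').extract()
        if title:
            print("title:{}".format(title[0]))
            item['news_title'] = title[0]

Save the file and run the spider again; the title of each matched page is printed to the console.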
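The thread expression can be checked on its own. The sketch below wraps it in a helper (thread_from_url is a name introduced here for illustration) and applies it to a made-up URL, assuming article URLs end in .html, which is five characters long.

```python
def thread_from_url(url):
    """Return the last path segment of url without its final five characters."""
    return url.strip().split("/")[-1][:-5]


# Hypothetical article URL with a trailing newline to show what strip() does.
sample_url = "https://news.163.com/18/0405/10/ABCDEFGH0001899N.html\n"
thread = thread_from_url(sample_url)
print(thread)
```

If a URL ended in a different extension, the fixed slice of five characters would cut in the wrong place, so the assumption is worth checking against real links.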

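Next, get the time. Use Inspect on the page to find the tag that holds the publication time, then select that tag with a CSS selector.

The code is below. The string handling after time[0] cleans the raw text into a normally formatted time value.

    self.get_time(response, item)  # this call goes inside the callback

    def get_time(self, response, item):
        time = response.css('div.post_time_source::text').extract()
        if time:
            print('time:{}'.format(time[0].strip().replace("来源", "").replace('\u3000:', "")))
            item['news_time'] = time[0].strip().replace("来源", "").replace('\u3000:', "")

Running the spider now also prints the cleaned time for each article.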
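The cleanup chain used for the time field is plain string processing, so it can be tried outside Scrapy. In this sketch clean_time is a helper name introduced for illustration, and the raw string is an assumed example of what the selector might return; the real page text may differ.

```python
def clean_time(raw):
    """Trim whitespace, then drop the '来源' label and the separator before it."""
    return raw.strip().replace("来源", "").replace("\u3000:", "")


# Assumed raw text: padding, the timestamp, an ideographic space, then "来源:".
raw_text = "\n                2018-04-05 10:20:30\u3000来源: "
cleaned = clean_time(raw_text)
print(cleaned)
```

The order of the calls matters: the label is removed first, which leaves the ideographic space directly in front of the colon so that the last replace can remove both.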

Next, get the news source. The page source shows that it is stored in an element with an id attribute, so the element can be located directly (an id is unique within the page).

The code for the news source is as follows.

    self.get_source(response, item)  # this call goes inside the callback

    def get_source(self, response, item):
        # '#' selects the element by its id
        source = response.css("#ne_article_source::text").extract()
        if source:
            print("source:{}".format(source[0]))
            item['news_source'] = source[0]

The URL of the original article is obtained in a similar way, so the code is given directly (note that this time we read an attribute of the element, not its text).

    self.get_source_url(response, item)

    def get_source_url(self, response, item):
        source_url = response.css("#ne_article_source::attr(href)").extract()
        if source_url:
            print("source_url:{}".format(source_url[0]))
            item['source_url'] = source_url[0]

For the news body, the reference code is given directly as well.

    self.get_text(response, item)

    def get_text(self, response, item):
        text = response.css(".post_text p::text").extract()
        if text:
            print("text:{}".format(text))
            item['news_body'] = text

The same goes for the news URL (the URL of the page itself).

    self.get_url(response, item)

    def get_url(self, response, item):
        url = response.url
        if url:
            item['news_url'] = url

At this point the complete news163.py looks like this:

import scrapy
from news.items import NewsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class News163Spider(CrawlSpider):
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['...']  # the start URL is missing from the source text

    rules = (
        # the allow pattern is missing from the source text
        Rule(LinkExtractor(allow=r'...'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = NewsItem()
        item['news_thread'] = response.url.strip().split("/")[-1][:-5]
        self.get_title(response, item)
        self.get_time(response, item)
        self.get_source(response, item)
        self.get_source_url(response, item)
        self.get_text(response, item)
        self.get_url(response, item)
        return item

    def get_url(self, response, item):
        url = response.url
        if url:
            item['news_url'] = url

    def get_text(self, response, item):
        text = response.css(".post_text p::text").extract()
        if text:
            print("text:{}".format(text))
            item['news_body'] = text

    def get_source_url(self, response, item):
        # '#' selects the element by its id
        source_url = response.css("#ne_article_source::attr(href)").extract()
        if source_url:
            #print("source_url:{}".format(source_url[0]))
            item['source_url'] = source_url[0]

    def get_source(self, response, item):
        source = response.css("#ne_article_source::text").extract()
        if source:
            print("source:{}".format(source[0]))
            item['news_source'] = source[0]

    def get_time(self, response, item):
        time = response.css('div.post_time_source::text').extract()
        if time:
            print('time:{}'.format(time[0].strip().replace("来源", "").replace('\u3000:', "")))
            item['news_time'] = time[0].strip().replace("来源", "").replace('\u3000:', "")

    def get_title(self, response, item):
        title = response.css('title::text').extract()
        if title:
            print("title:{}".format(title[0]))
            item['news_title'] = title[0]

Save the file and run the spider from the command line. Important: do not run this command too often while debugging, or the site may start refusing your requests.

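5. Modify the pipeline file

5.1 Import the CSV exporter

To store the data locally we need a file format. A comma-separated values (CSV) file meets this need and is a widely used format for storing tabular data.

from scrapy.exporters import CsvItemExporter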

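5.2 Define the processing methods

First comes the initializer, which opens the file that collects the data and starts the exporter.

    def __init__(self):
        self.file = open('news_data.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

Next, define the method that runs when the spider closes. It finishes the export and closes the file so that no resources are left open.

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

Finally, the processing method exports each item and then returns it.

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item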
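The open, write, close lifecycle of the three methods above can be tried without Scrapy. The sketch below is a stand-in, not the pipeline itself: it swaps CsvItemExporter for the standard csv module, and the class name CsvLifecycleDemo, the temporary path, and the sample item are all made up for illustration.

```python
import csv
import os
import tempfile


class CsvLifecycleDemo:
    """Mimics the pipeline lifecycle: open on init, write per item, close at end."""

    def __init__(self, path, fieldnames):
        self.file = open(path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=fieldnames)
        self.writer.writeheader()

    def process_item(self, item):
        self.writer.writerow(item)
        return item  # hand the item back, as a pipeline stage does

    def close(self):
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "news_data.csv")
demo = CsvLifecycleDemo(path, ["news_title", "news_url"])
demo.process_item({"news_title": "Hypothetical headline",
                   "news_url": "https://news.163.com/example.html"})
demo.close()

# Read the file back to confirm that the row was written.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Returning the item from process_item matters in a real pipeline too, because later pipeline stages receive whatever the previous stage returns.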

That completes the pipeline code. The pipeline still has to be enabled in settings.py by uncommenting the ITEM_PIPELINES setting.
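For a project created with scrapy startproject news, the uncommented entry in settings.py would typically look like the fragment below. It assumes the pipeline class keeps the name NewsPipeline used in this post; 300 is the priority value from the generated template, and lower numbers run earlier.

```python
# settings.py (fragment): enable the CSV pipeline.
ITEM_PIPELINES = {
    "news.pipelines.NewsPipeline": 300,
}
```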

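The complete pipelines.py is shown below. Check the indentation carefully (Sublime Text has an option to convert all indentation to tabs, which keeps it consistent). Also watch the encoding, or the saved text may appear garbled: the encoding argument sets how the CSV file is written, so choose one that suits the scraped content and the program used to open the file (gbk in the code below).

from scrapy.exporters import CsvItemExporter


class NewsPipeline(object):
    def __init__(self):
        self.file = open('news_data.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='gbk')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item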
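The garbling mentioned above comes from writing bytes in one encoding and reading them in another. A small standalone illustration, using a made-up sample string:

```python
text = "网易新闻"  # sample string for illustration

utf8_bytes = text.encode("utf-8")  # 3 bytes per character here
gbk_bytes = text.encode("gbk")     # 2 bytes per character here

# Reading UTF-8 bytes as if they were GBK produces garbled text.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Reading with the encoding that was used for writing restores the text.
restored = gbk_bytes.decode("gbk")

print(len(utf8_bytes), len(gbk_bytes))
print(garbled != text, restored == text)
```

The same applies to news_data.csv: a file written with one encoding must be opened with that same encoding to display correctly.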

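6. Results

Finally, run the crawl command in the command-line window. While the scraped content is printed in the window, a news_data.csv file is also created in the news folder.

With the data saved in news_data.csv, this Scrapy project for scraping NetEase News is complete.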
