Scrapy Component Analysis
- Spiders -> web page parsers
- Item Pipeline -> data pipeline
- Scheduler -> scheduler
- Downloader -> downloader
- Scrapy Engine -> core engine (a sketch of how these components are enabled in settings.py follows this list)
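The components are wired together through the project settings. Below is a minimal sketch of how pipelines and downloader middlewares are enabled; the module and class names (myproject, JsonWriterPipeline, CustomHeadersMiddleware) are hypothetical placeholders, not Scrapy defaults.

```python
# settings.py -- sketch only; module/class names are hypothetical
BOT_NAME = 'myproject'

# Enable an Item Pipeline; the number decides the order pipelines run in (lower runs first)
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}

# Enable a custom DownloaderMiddleware sitting between the Engine and the Downloader
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomHeadersMiddleware': 543,
}
```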
Scrapy Execution Flow Analysis
- The Spider (the crawler we write for a given website) yields a Request, which is sent to the Engine (the Spider produces Requests and processes Responses)
- The Engine takes the Request and hands it to the Scheduler
- The Scheduler produces a Request and gives it back to the Engine
- Once the Engine has the Request from the Scheduler (note: it now comes from the Scheduler, not from the Spider), it passes it step by step through the configured DownloaderMiddlewares until it reaches the Downloader
- After the Downloader has fetched the page, the HttpResponse travels back through the configured DownloaderMiddlewares, step by step, to the Engine
- The Engine takes the Response and sends it to the Spider for parsing and analysis (for example, extracting page fields with regular expressions or CSS selectors)
- The Spider's output falls into two categories, Items and Requests; both are sent to the Engine, which routes Items to step 8 and sends Requests back through the cycle starting at step 2
- The Engine sends the Items received from the Spider to the Item Pipelines, which persist the data step by step into different storage backends such as JSON, MySQL, or ES (a minimal spider sketch illustrating the whole cycle follows this list)
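To make the flow concrete, here is a minimal spider sketch. The target site (quotes.toscrape.com, the Scrapy tutorial sandbox), the CSS selectors, and the field names are illustrative only; the point is that yielded dict items are routed to the Item Pipelines, while yielded Requests re-enter the cycle at step 2.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # illustrative spider; the target site and selectors are placeholders
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yielded items -> Engine -> Item Pipelines
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # a yielded Request -> Engine -> Scheduler (the cycle restarts at step 2)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```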
Source Code Analysis
Scrapy's core code lives under the scrapy/core folder of the scrapy package.
(The downloader supports several download types.) The spider, pipeline, and middleware are the parts you write yourself; a minimal pipeline sketch follows.
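As an example of that user-written part, here is a minimal Item Pipeline sketch that persists items to a JSON Lines file. The class name and the output file name (items.jl) are illustrative assumptions, not Scrapy defaults.

```python
import json

class JsonWriterPipeline:
    # illustrative pipeline; 'items.jl' is just an example output path
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line, then pass it on to the next pipeline
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```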
A Brief Look at the Engine Source
```python
... ...
# Steps 1-2 of the execution flow: the Engine takes the Request and hands it to the Scheduler
def schedule(self, request, spider):
    self.signals.send_catch_log(signal=signals.request_scheduled,
                                request=request, spider=spider)
    # Call the scheduler's enqueue_request method to put the request into the Scheduler
    if not self.slot.scheduler.enqueue_request(request):
        self.signals.send_catch_log(signal=signals.request_dropped,
                                    request=request, spider=spider)
... ...

# Step 3 of the execution flow: the Engine pulls the next Request from the Scheduler
def _next_request_from_scheduler(self, spider):
    slot = self.slot
    request = slot.scheduler.next_request()
    # When the crawler first starts, _next_request_from_scheduler runs first,
    # but the Scheduler holds no request at that point, so start_urls are read from the Spider instead
    if not request:
        return
    d = self._download(request, spider)
    d.addBoth(self._handle_downloader_output, request, spider)
    d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.remove_request(request))
    d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.nextcall.schedule())
    d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    return d
... ...
```
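The `self._download(request, spider)` call above is what drives the Request through the DownloaderMiddleware chain described in steps 4-5. A minimal custom DownloaderMiddleware sketch (the class name and header value are illustrative) shows the two hooks involved:

```python
class CustomHeadersMiddleware:
    # illustrative middleware; the User-Agent value is arbitrary

    def process_request(self, request, spider):
        # called on the way from the Engine towards the Downloader (step 4)
        request.headers.setdefault('User-Agent', 'my-crawler/1.0')
        return None  # None means: keep going down the middleware chain

    def process_response(self, request, response, spider):
        # called on the way back from the Downloader to the Engine (step 5)
        spider.logger.debug('Got %s for %s', response.status, request.url)
        return response
```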
Constructor parameters of the Request class (produced by the Spider)
```python
class Request(object_ref):
    # url:         the request URL
    # callback:    callback invoked with the response of this request
    # method:      HTTP method
    # headers:     request headers
    # body:        request body
    # cookies:     browser cookies; after an automatic login Scrapy adds the cookies
    #              to subsequent requests automatically. This is handled by the built-in
    #              scrapy.downloadermiddlewares.cookies.CookiesMiddleware
    # meta:        metadata that can be passed along with the Request
    # encoding:    page encoding, UTF-8 by default
    # priority:    scheduling priority inside the Scheduler
    # dont_filter: whether to skip duplicate filtering for identical requests
    # errback:     callback invoked on failure
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
```
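For illustration, yielding such a Request from a spider callback might look like the sketch below; the URLs, the meta key, and the error callback are hypothetical.

```python
import scrapy

class BooksSpider(scrapy.Spider):
    # illustrative spider; URLs and meta keys are placeholders
    name = 'books'
    start_urls = ['http://example.com/catalog']

    def parse(self, response):
        yield scrapy.Request(
            url='http://example.com/catalog?page=2',
            callback=self.parse,        # parse the next page with the same callback
            meta={'page': 2},           # readable later as response.meta['page']
            priority=10,                # scheduled ahead of lower-priority requests
            dont_filter=False,          # identical requests are still deduplicated
            errback=self.handle_error,  # invoked if the request fails
        )

    def handle_error(self, failure):
        self.logger.error('Request failed: %s', failure)
```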
Constructor parameters of the Response class (produced by the Downloader)
```python
class Response(object_ref):
    # url      the URL of the page
    # status   HTTP status code, 200 (success) by default
    # headers  response headers returned by the server
    # body     response body
    # request  the previously yielded Request that this Response corresponds to
    def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
        self.headers = Headers(headers or {})
        self.status = int(status)
        self._set_body(body)
        self._set_url(url)
        self.request = request
        self.flags = [] if flags is None else list(flags)
```
Its subclasses are HtmlResponse, TextResponse, and XmlResponse.
```python
from scrapy.http.response.text import TextResponse

class HtmlResponse(TextResponse):
    pass
```
```python
class TextResponse(Response):
    ... ...
    # the Response already exposes a Selector so that the xpath and css methods can be called directly
    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    # XPath selector
    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    # CSS selector
    def css(self, query):
        return self.selector.css(query)
    ... ...
```
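Because TextResponse wires a Selector into the response, a spider callback can call response.xpath and response.css directly. A short sketch (the spider name and selectors are illustrative):

```python
import scrapy

class TitleSpider(scrapy.Spider):
    # illustrative spider used only to show xpath/css access on the response
    name = 'titles'
    start_urls = ['http://example.com/']

    def parse(self, response):
        title = response.xpath('//title/text()').extract_first()  # XPath selector
        links = response.css('a::attr(href)').extract()           # CSS selector
        self.logger.info('%s -> title=%r, %d links', response.url, title, len(links))
```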