Scrapy


Day 1 of learning distributed crawlers.

Writing XPath

For learning XPath, install XPath Helper from the Chrome Web Store; it speeds things up considerably.

Example: grabbing today's flash-sale product info from the JD homepage.

//div[@class="slider_wrapper"]/a[position()<5][@title]

Sample matches (titles and prices, kept as scraped):

三星Galaxy Note20 Ultra 5G(SM-N9860)S Pen&三星笔记 120Hz自适应屏幕 5G手机 游戏手机 12GB+512GB 迷雾金 ¥9899.00 / ¥9999.00
清风抽纸纸巾整箱24包金装原木3130抽婴儿适用卫生纸餐巾纸抽 ¥57.90 / ¥169.00
vivo Y3s 5000mAh大电池长续航 128GB大内存 AI智慧摄影 全网通新品手机 海风青 4GB+64GB ¥999.00 / ¥1098.00
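A minimal sketch of evaluating that XPath in code with Scrapy's Selector; fetching the page with requests and the browser-like User-Agent header are illustrative assumptions, and the JD homepage markup may have changed since this was written:

import requests
from scrapy import Selector

# fetch the JD homepage; a browser-like UA makes the markup more likely to match
html = requests.get("https://www.jd.com/",
                    headers={"User-Agent": "Mozilla/5.0"}).text

sel = Selector(text=html)
# same XPath as above: the first four slider links that carry a title attribute
for a in sel.xpath('//div[@class="slider_wrapper"]/a[position()<5][@title]'):
    print(a.xpath('./@title').get())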

A few tips

  • Turning off verbose log output

Add LOG_LEVEL = "WARNING" to settings.py.

  • Running the spider

To run a Scrapy project inside PyCharm, create a .py file in the project root (the same directory as scrapy.cfg):

from scrapy import cmdline

# equivalent to running "scrapy crawl <spider_name>" in a terminal
cmdline.execute("scrapy crawl spider_name".split())
  • A small note

Hand scraped data back with yield rather than return; it is delivered to the pipeline, where you can define custom processing methods.

Before a pipeline is used, it must be enabled in settings.py, i.e. uncomment the ITEM_PIPELINES entry; a minimal sketch follows below.
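A minimal sketch of enabling and writing a pipeline, assuming the project is named tencent as created below; the class and file names follow Scrapy's project template:

# settings.py — uncomment (or add) the ITEM_PIPELINES entry;
# the number 300 is the pipeline's order (lower runs first)
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}

# pipelines.py — every item yielded by the spider passes through here
class TencentPipeline:
    def process_item(self, item, spider):
        print(item)        # custom handling goes here (save to DB, file, ...)
        return item        # return the item so later pipelines also see it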

A Scrapy spider for Tencent job postings

1. First, create a project

The command to create the project:

scrapy startproject tencent

Then cd into the project and generate a spider:

scrapy genspider hr careers.tencent.com
# hr is the spider's name
# careers.tencent.com is the domain the spider is allowed to crawl

2. Analyze the page structure and URL composition

Open https://careers.tencent.com/search.html.

The fields to scrape are the job title, job category, responsibilities, and requirements shown in each listing.

Viewing the page source, none of these fields appear, only some JS. This means https://careers.tencent.com/search.html is not the spider's start_url, so the first step is to find the real start_url behind this page.

Press F12 to open DevTools, refresh the page, and check the XHR requests under the Network tab; the fields displayed on the page show up there, so our start_url is one of these requests. The Headers tab gives the URL:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1617199191983&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn


Next, work out how the URL changes so the spider can paginate automatically. Click page 2 and look at the URL again:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1617199575064&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn

The pattern is easy to spot: apart from the timestamp, only the pageIndex value changes. A quick sanity check of the endpoint is sketched below.
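As a quick check outside Scrapy you can hit the endpoint directly; this sketch assumes the requests library is installed, and the field names come from the JSON response seen in DevTools (the same ones the spider extracts later):

import requests

# URL template from above, with pageIndex left as a placeholder
API = ("https://careers.tencent.com/tencentcareer/api/post/Query"
       "?timestamp=1617199191983&countryId=&cityId=&bgIds=&productId="
       "&categoryId=&parentCategoryId=&attrId=&keyword="
       "&pageIndex={}&pageSize=10&language=zh-cn&area=cn")

data = requests.get(API.format(2)).json()
for job in data["Data"]["Posts"]:
    print(job["RecruitPostName"], job["CategoryName"])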

Next comes the URL of the job detail page.

Open a job detail page; the source again contains only JS, so repeat the method above to find the URL and its payload:

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1617199879005&postId=1377253503910551552&language=zh-cn

The parameter that changes here is postId, which matches the value shown in the Preview tab. The same kind of quick check works here too, as sketched below.
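A sketch of fetching one detail record, reusing the postId from the URL above; Responsibility and Requirement are the fields the spider extracts later:

import requests

DETAIL = ("https://careers.tencent.com/tencentcareer/api/post/ByPostId"
          "?timestamp=1617199879005&postId={}&language=zh-cn")

data = requests.get(DETAIL.format("1377253503910551552")).json()
print(data["Data"]["Responsibility"])
print(data["Data"]["Requirement"])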


3. Write the spider

import scrapy
import json

'''
Example API URLs captured in DevTools:

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1617180957390&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1617181089850&postId=1123175628615454720&language=zh-cn

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1617181122812&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
'''


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    # listing URL template; only pageIndex changes between pages
    first_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1617180957390&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # detail-page URL template; postId is filled in per job
    detail_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1617181089850&postId={}&language=zh-cn"
    start_urls = [first_url.format(1)]

    def parse(self, response):
        for page in range(1, 9):
            url = self.first_url.format(page)
            # hand each listing URL to parse_one; yielding new Requests back
            # to the scheduler like this is also the biggest difference
            # between a distributed crawler and an ordinary one
            yield scrapy.Request(
                url=url,
                callback=self.parse_one,
                # page 1 duplicates the start_url request, so without
                # dont_filter the dupefilter would drop it here
                dont_filter=(page == 1),
            )

    def parse_one(self, response):
        # parse the response body as JSON to make the fields easy to reach
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = {}
            # pull the fields out of the JSON
            item['job_name'] = job['RecruitPostName']
            item['job_type'] = job['CategoryName']
            postid = job['PostId']
            detail_url = self.detail_url.format(postid)
            # hand the detail-page URL to parse_two
            yield scrapy.Request(
                url=detail_url,
                # explained below: the key 'item' is an arbitrary name,
                # the value is the item dict built above
                meta={'item': item},
                callback=self.parse_two
            )

    def parse_two(self, response):
        item = response.meta.get('item')  # the 'item' dict passed via meta above
        data = json.loads(response.text)
        item['job_duty'] = data['Data']['Responsibility']
        item['job_require'] = data['Data']['Requirement']
        yield item
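With the spider saved, it can be started with scrapy crawl hr, or via the PyCharm runner file described in the tips above:

# run.py — placed next to scrapy.cfg, as described in the tips above
from scrapy import cmdline

cmdline.execute("scrapy crawl hr".split())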

The Request class

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None, cb_kwargs=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        if not isinstance(priority, int):
            raise TypeError(f"Request priority not an integer: {priority!r}")
        self.priority = priority

        if callback is not None and not callable(callback):
            raise TypeError(f'callback must be a callable, got {type(callback).__name__}')
        if errback is not None and not callable(errback):
            raise TypeError(f'errback must be a callable, got {type(errback).__name__}')
        self.callback = callback
        self.errback = errback

        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter

        self._meta = dict(meta) if meta else None
        self._cb_kwargs = dict(cb_kwargs) if cb_kwargs else None
        self.flags = [] if flags is None else list(flags)

As you can see there are many parameters; the most commonly used are url, callback, cookies, and meta.

meta is a dict; it is how data is carried from one callback to the next, as in parse_one above. The signature also shows cb_kwargs, a newer alternative sketched below.
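Since cb_kwargs appears in the signature above, newer Scrapy versions can pass data to the callback as keyword arguments instead of through meta. A minimal, hypothetical demo spider (the name and URL are only for illustration):

import scrapy


class CbKwargsDemo(scrapy.Spider):
    # hypothetical minimal spider, only to show the cb_kwargs hand-off
    name = 'cb_kwargs_demo'
    start_urls = ['https://careers.tencent.com/search.html']

    def parse(self, response):
        item = {'job_name': 'example'}
        yield scrapy.Request(
            url=response.url,
            dont_filter=True,            # re-request the same URL for the demo
            cb_kwargs={'item': item},    # dict keys become callback kwargs
            callback=self.parse_detail,
        )

    def parse_detail(self, response, item):
        # 'item' arrives directly as an argument; no response.meta needed
        item['status'] = response.status
        yield item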

