
【Python Web Crawler】02. Scraping Data with Regular Expressions and Writing It to a File

admin, November 7, 2018, 22:45 【Python | Web Crawler】 1953 views


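# Scraping Data with Regular Expressions

This post uses the `requests` and `re` modules to build a crawler.

## Common Symbols

### Ordinary Characters

|Character|Meaning|Example|
|---|---|---|
|`.`|Matches any single character except the newline `\n`|`a.b` matches `acb`, `a&b`, `aib`|
|`\`|Escape character: makes a character with a special meaning match literally|`.` is a special character; to match a literal `.`, write `\.`|
|`[...]`|Character set: matches any one character in the set|`a[bcd]` matches `ab`, `ac`, `ad`|

### Predefined Character Classes

There are six of them. The equivalents below hold for ASCII text; for `str` patterns Python 3 also matches the corresponding Unicode characters unless `re.ASCII` is set.

|Class|Meaning|
|---|---|
|`\d`|Matches one digit; equivalent to `[0-9]`|
|`\D`|Matches one non-digit character; equivalent to `[^0-9]`|
|`\s`|Matches any whitespace character, including space, tab, and form feed; equivalent to `[ \f\n\r\t\v]`|
|`\S`|Matches any non-whitespace character; equivalent to `[^ \f\n\r\t\v]`|
|`\w`|Matches any word character, including the underscore; equivalent to `[A-Za-z0-9_]`|
|`\W`|Matches any non-word character; equivalent to `[^A-Za-z0-9_]`|

### Quantifiers

|Quantifier|Meaning|Example|
|---|---|---|
|`*`|Matches the preceding character 0 or more times|`ab*c` matches `ac`, `abc`, `abbc`, and so on|
|`+`|Matches the preceding character 1 or more times|`ab+c` matches `abc`, `abbc`, `abbbc`, and so on|
|`?`|Matches the preceding character 0 or 1 time|`ab?c` matches `ac` and `abc`|
|`{m}`|Matches the preceding character exactly m times|`ab{3}c` matches `abbbc`|
|`{m,n}`|Matches the preceding character m to n times (no space after the comma)|`ab{1,3}c` matches `abc`, `abbc`, `abbbc`|

### Boundary Matching

|Anchor|Meaning|Example|
|---|---|---|
|`^`|Matches the start of the string|`^abc` matches strings that start with `abc`|
|`$`|Matches the end of the string|`abc$` matches strings that end with `abc`|
|`\A`|Matches only at the start of the string|`\Aabc`|
|`\Z`|Matches only at the end of the string|`abc\Z`|

Boundary matching is rarely used in crawlers, because most of the data a crawler extracts sits inside tags.

### `(.*?)`

The pattern `(.*?)` is very common in crawlers. The parentheses `()` mark the group whose content is returned as the result, and `.*?` is a non-greedy match of any characters.

```python
>>> import re
>>> a = 'xxIxxfsdfdsxxlovexxtewrwexxpythonxx'
>>> res = re.findall('xx(.*?)xx', a)
>>> res
['I', 'love', 'python']
```

## The re Module and Its Functions

### search()

`search()` finds the first substring that matches the pattern and returns a match object (or `None` if nothing matches).

```python
re.search(pattern, string, flags=0)
```

- `pattern` is the regular expression to match
- `string` is the string to search
- `flags` controls how matching is done, for example whether it is case-sensitive or multi-line

```python
>>> a = 'one1two2three3'
>>> res = re.search(r'\d+', a)
>>> res
<_sre.SRE_Match object; span=(3, 4), match='1'>
>>> res.group()  # return the matched string
'1'
```

The object representation above is from Python 3.6; Python 3.7 and later print `<re.Match object; span=(3, 4), match='1'>`.

### sub()

`sub()` replaces the matches found in a string.

```python
re.sub(pattern, repl, string, count=0, flags=0)
```

- `pattern` is the regular expression to match
- `repl` is the replacement string
- `string` is the original string to search and replace in
- `count` is the maximum number of replacements; the default `0` replaces every match
- `flags` controls how matching is done, for example whether it is case-sensitive or multi-line

```python
>>> phone = '123-4567-8900'
>>> res = re.sub(r'\D', '', phone)
>>> res
'12345678900'
```

`sub()` is similar to the string method `replace()`, but more flexible: it selects the text to replace with a regular expression, which `replace()` cannot do. `sub()` is not used much in crawlers, since a crawler collects data rather than replacing it.

### findall()

`findall()` finds every substring that matches the pattern and returns the results as a list.

```python
>>> a = 'one1two2three3'
>>> res = re.findall(r'\d+', a)
>>> res
['1', '2', '3']
```

`findall()` is the most frequently used function in crawlers, for example to collect every price on a page:

```html
<span class="result_price">¥<i>298</i>起/晚</span>
<span class="result_price">¥<i>236</i></span>
```

In the pattern, the `¥` sign must be written exactly as it appears in the page source, which may be an HTML entity rather than the literal character.

```python
>>> import re
>>> import requests
>>> req = requests.get('http://cd.xiaozhu.com/')
>>> prices = re.findall('<span class="result_price">¥<i>(.*?)</i>.*?</span>', req.text)
>>> prices
['238', '235', '298', '258', '268', '368', '228', '178', '388', '188', '188', '278', '327', '288', '188', '208', '218', '258', '258', '248', '189', '308', '368', '328']
>>> len(prices)
24
```

Scraping with regular expressions takes less code than the earlier approaches because the parsing step disappears: the HTML returned by `requests` is already a string, so a regular expression can extract data from it directly.

### Flags

The `re` module provides optional flags that control how matching is done.

|Flag|Description|
|---|---|
|`re.I`|Makes matching case-insensitive|
|`re.L`|Makes matching locale-dependent|
|`re.M`|Multi-line matching; affects `^` and `$`|
|`re.S`|Makes `.` match every character, including the newline|
|`re.U`|Interprets characters according to the Unicode character set; affects `\w`, `\W`, `\b`, `\B`|
|`re.X`|Allows a more flexible layout so that the regular expression is easier to read|

In crawlers `re.S` is the most commonly used flag, because it lets a match span line breaks. For example, extracting the text inside a tag:

```python
>>> import re
>>> a = '<div>文字</div>'
>>> w = re.findall('<div>(.*?)</div>', a)
>>> w
['文字']
```

When the tag content spans several lines:

```python
>>> b = """<div>
... 换行的文字
... </div>
... """
>>> w = re.findall('<div>(.*?)</div>', b)
>>> w
[]
```

By default `.` does not match a newline, so `.*?` cannot cross the line breaks between `<div>` and `</div>` and nothing is found. Passing `re.S` makes `.` match newlines as well, which allows the match to span lines.

```python
>>> w = re.findall('<div>(.*?)</div>', b, re.S)
>>> w
['\n换行的文字\n']
```

The result contains newline characters, so the data needs cleaning before it is stored in a database:

```python
>>> w = re.findall('<div>(.*?)</div>', b, re.S)
>>> w
['\n换行的文字\n']
>>>
>>> w[0].strip()
'换行的文字'
```

## Example 1: Scraping a Full Novel

The scraped content is written to a local file.

### Approach

All chapters need to be scraped. Manual browsing shows the following URLs:

- Chapter 1: http://www.doupoxs.com/doupocangqiong/2.html
- Chapter 2: http://www.doupoxs.com/doupocangqiong/5.html
- Chapter 3: http://www.doupoxs.com/doupocangqiong/6.html
- Chapter 4: http://www.doupoxs.com/doupocangqiong/7.html
- Chapter 5: http://www.doupoxs.com/doupocangqiong/8.html
- Chapter 6: http://www.doupoxs.com/doupocangqiong/9.html
- Chapter 7: http://www.doupoxs.com/doupocangqiong/10.html
- Chapter 8: http://www.doupoxs.com/doupocangqiong/11.html

Following the URL pattern, page numbers 1, 3, and 4 return a 404 page, so `res.status_code` is used to check whether a page exists.

### Crawler Code and Analysis

```python
import re
import requests
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

# Open the output file in append mode (it is created if it does not exist)
file = open('实例：爬取全文小说.txt', 'a+', encoding='utf-8')


def get_content(url):
    res = requests.get(url, headers=headers)
    # res.text comes out garbled for this site even though the page declares
    # <meta charset="utf-8">, so the raw bytes in res.content are decoded explicitly.
    if res.status_code == 200:  # only parse pages that actually exist
        # A str pattern cannot be applied to res.content directly
        # (TypeError: cannot use a string pattern on a bytes-like object).
        html = res.content.decode('utf-8')
        title = re.findall('<h1>(.*?)</h1>', html)[0]
        # Write the chapter title followed by a separator line
        file.write('\n{}\n{}\n'.format(title, '*' * 180))
        # Extract the text of every <p> tag
        contents = re.findall('<p>(.*?)</p>', html, re.S)
        for content in contents:
            print(content)
            file.write(content + '\n')
    else:
        print('Cannot access ' + url)


if __name__ == '__main__':
    # get_content('http://www.doupoxs.com/doupocangqiong/12.html')  # single-page test
    # range(2, 10) yields page numbers 2 through 9
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(page) for page in range(2, 10)]
    for url in urls:
        get_content(url)
        time.sleep(1)
    file.close()
```

## Example 2: Scraping Jokes from Qiushibaike

### Approach

Manual browsing of the text section of Qiushibaike shows the following URLs:

- Page 1: https://www.qiushibaike.com/text/ is the same as https://www.qiushibaike.com/text/page/1/
- Page 2: https://www.qiushibaike.com/text/page/2/
- Page 3: https://www.qiushibaike.com/text/page/3/
- Page 4: https://www.qiushibaike.com/text/page/4/
- Page 5: https://www.qiushibaike.com/text/page/5/

The data to scrape: user ID, user gender, user level, joke text, number of "funny" votes, and number of comments.

Gender is encoded in the class name, and the element text is the user level:

- Male: `<div class="articleGender manIcon">40</div>`
- Female: `<div class="articleGender womenIcon">30</div>`

### Crawler Code and Analysis

```python
import re
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}


def get_info(url):
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    # Each entry is a (gender, level) tuple, e.g. [('man', '40'), ('women', '25'), ...]
    sex_levels = re.findall('<div class="articleGender (.*?)Icon">(.*?)</div>', res.text, re.S)
    # There is a \n before <span> and \n\n after </span>; the .*? parts absorb them
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>.*?</div>', res.text, re.S)
    votes = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i> 好笑</span>', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    info_list = list()
    # Combine the parallel lists into one dict per joke
    for name, sex_level, content, vote, comment in zip(names, sex_levels, contents, votes, comments):
        info = {
            'name': name.strip(),
            'sex': sex_level[0],
            'level': sex_level[1],
            'content': content.strip(),
            'vote': vote,
            'comment': comment
        }
        info_list.append(info)
    return info_list


if __name__ == '__main__':
    # get_info('https://www.qiushibaike.com/text/page/1/')
    # range(1, 2) yields page 1 only; widen the range to scrape more pages
    urls = ['https://www.qiushibaike.com/text/page/{}/'.format(page) for page in range(1, 2)]
    file = open('实例：爬取糗百的段子信息.txt', 'a+', encoding='utf-8')
    for url in urls:
        for info in get_info(url):
            file.write(info['name'] + '\n')
            file.write(info['sex'] + '\n')
            file.write(info['level'] + '\n')
            file.write(info['content'] + '\n')
            file.write(info['vote'] + '\n')
            file.write(info['comment'] + '\n')
    file.close()
```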
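
The parsing and file-writing steps of the Qiushibaike example can be tried without any network request. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a hypothetical fragment modeled only on the tag shapes described in this post (it is not real page source), and `parse_items` and `save_items` are illustrative helper names, not part of `re` or `requests`. The regular expressions are the ones used in the example, written as raw strings, and the file is opened in a `with` block so it is closed even if a write fails.

```python
import re

# Hypothetical page fragment, modeled on the tag shapes described in this post
SAMPLE_HTML = """
<h2>
user_a
</h2>
<div class="articleGender manIcon">40</div>
<div class="content">
<span>
first joke
</span>

</div>
<span class="stats-vote"><i class="number">12</i> 好笑</span>
<i class="number">3</i> 评论
<h2>
user_b
</h2>
<div class="articleGender womenIcon">30</div>
<div class="content">
<span>
second joke
</span>

</div>
<span class="stats-vote"><i class="number">7</i> 好笑</span>
<i class="number">0</i> 评论
"""


def parse_items(html):
    """Extract one dict per joke from an HTML string (illustrative helper)."""
    # re.S lets .*? cross the line breaks inside the tags
    names = re.findall(r'<h2>(.*?)</h2>', html, re.S)
    sex_levels = re.findall(r'<div class="articleGender (.*?)Icon">(.*?)</div>', html, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>.*?</div>', html, re.S)
    votes = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i> 好笑</span>', html)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', html)
    items = []
    for name, (sex, level), content, vote, comment in zip(names, sex_levels, contents, votes, comments):
        items.append({
            'name': name.strip(),        # strip the surrounding newlines
            'sex': sex,
            'level': level,
            'content': content.strip(),
            'vote': vote,
            'comment': comment,
        })
    return items


def save_items(items, path):
    """Append each field of each item on its own line (illustrative helper)."""
    with open(path, 'a', encoding='utf-8') as f:
        for item in items:
            for key in ('name', 'sex', 'level', 'content', 'vote', 'comment'):
                f.write(item[key] + '\n')


if __name__ == '__main__':
    for item in parse_items(SAMPLE_HTML):
        print(item)
```

Because `zip()` stops at the shortest list, a pattern that misses on some entries silently drops or misaligns records; comparing the lengths of the intermediate lists is a cheap sanity check when adapting the patterns to a real page.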