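Python Web Scraping

Explanation:

Crawler: a program that automatically fetches information from the internet and extracts the parts that are valuable to us.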

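Usage:

Since we want to crawl information from the internet, we need one tool to access and fetch it, and another to process the XML and HTML we retrieve so that we can filter out the information we need: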

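Libraries used

Sending requests to the internet is the job of the requests library.

Processing XML and HTML calls for another library, BeautifulSoup.

If you also need to download files, Python 2 offers the urllib2 library; in Python 3 (which the pip3 commands below assume) the same functionality lives in the standard-library module urllib.request.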
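As a rough illustration of the download step, here is a minimal sketch that uses urllib.request from the Python 3 standard library. The helper name download and the file names are made up for this example, and a local file:// URL stands in for a real http(s) address so that the sketch runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream the resource at `url` into the file `dest`; return its size in bytes.

    `download` is a hypothetical helper written for this example.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return Path(dest).stat().st_size


# Offline demo: a file:// URL stands in for a real http(s):// address.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.txt"
source.write_bytes(b"hello crawler")

target = workdir / "copy.txt"
size = download(source.as_uri(), target)
print(size, target.read_bytes())
```

Copying in chunks with shutil.copyfileobj avoids loading a large file into memory all at once.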

Installation:

pip3 install beautifulsoup4
pip3 install requests

Importing:

import requests
from bs4 import BeautifulSoup

How to use them:

Basic operations:

import requests
from bs4 import BeautifulSoup

url = 'http://blog.like4h.top/'

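resp = requests.get(url)     # named resp so it does not shadow the standard re module
#print(resp.text)
bs = BeautifulSoup(resp.text, 'html.parser')     # parse with Python's built-in HTML parser
#print(bs.prettify())      # returns the page markup, neatly indented
like4h = bs.prettify()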

# f = open('like4h.html','w')      # using the file operations covered earlier, write the result to a local file
# f.write(like4h)
# f.close()

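#print(bs.title)     # print the page's <title> tag
#print(bs.head)      # print the whole <head> tag; when several tags share a name, only the first one is returned
#print(bs.a)       # print the first <a> tag
#print(bs.a.attrs)   # print all attributes of that tag as a dict
print(bs.a.attrs['href'])    # print the value of one attribute of the tag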
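The same tag and attribute access can be tried without touching the network. The sketch below parses a small HTML string that was made up for illustration (it is not taken from the blog above) and also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<a href="/first.html" title="First post">First</a>
<a href="/second.html" title="Second post">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string   # text inside <title>
first_link = soup.a              # only the first <a> is returned
link_attrs = first_link.attrs    # all attributes of that tag, as a dict
# .get() returns None instead of raising KeyError when the attribute is absent
missing = first_link.get("data-id")

print(page_title)
print(link_attrs["href"])
print(missing)
```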

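Searching with find_all():

Now that we have a basic understanding of a page's HTML, its tags, attributes, and values, we turn to find_all(), the most frequently used search function in the bs4 module, which helps us pull out exactly the content we want.

For more on find_all(), see the link below: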

https://www.cnblogs.com/zipon/p/6129280.html

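import re

import requests
from bs4 import BeautifulSoup

url = 'http://blog.like4h.top/'

resp = requests.get(url)     # not named re: that would shadow the re module used below
#print(resp.text)
bs = BeautifulSoup(resp.text, 'html.parser')    # parse with Python's built-in HTML parser
#print(bs.prettify())      # returns the page markup, neatly indented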
like4h = bs.prettify()

# Test

# xml = BeautifulSoup('<a href="/2022/03/03/97570718.html" title="文件操作与多线程"></a>','html.parser')

# print(xml.a.attrs['title'])


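tags = bs.find_all('img')   # find every <img> tag
tags = bs.find_all(['a','img'])    # search for several tag names at once
tags = bs.find_all(class_='article-title')   # match by class; returns the complete tags that carry this class
tags = bs.find_all(class_=['post-meta-date-created', 'article-title'])  # match tags carrying any of several classes (a bare list without class_ would be read as tag names)
tags = bs.find_all(href=re.compile('/2022/03/03/'))   # regex search: every tag whose href contains /2022/03/03/ (requires import re, so the response object must not also be named re)
tags = bs.find_all('a', class_='article-title')    # every <a> tag whose class is article-title

for tag in tags:
    #print(tag)              # print each matching <a> tag
    print(tag.string)        # print the text inside the tag
    #print(tag.attrs['title'])  # print the tag's title attribute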
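To see what each kind of find_all() filter returns, here is a small offline sketch. The HTML snippet is invented for the example and only imitates the class names used above; the limit argument is an extra find_all() option not covered in the listing.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup that imitates the class names used in this post.
html = """
<div>
  <a class="article-title" href="/2022/03/03/97570718.html" title="Post A">Post A</a>
  <a class="article-title" href="/2022/02/01/12345678.html" title="Post B">Post B</a>
  <span class="post-meta-date-created">2022-03-03</span>
  <img src="/img/cover.png">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("img")                       # one tag name
by_names = soup.find_all(["a", "img"])               # any of several tag names
by_class = soup.find_all(class_="article-title")     # one class
by_classes = soup.find_all(
    class_=["post-meta-date-created", "article-title"]
)                                                    # any of several classes
by_regex = soup.find_all(href=re.compile(r"/2022/03/03/"))  # attribute matched by regex
first_only = soup.find_all("a", class_="article-title", limit=1)  # stop after one hit

link_texts = [tag.string for tag in by_class]
for name, result in [("img", by_name), ("a or img", by_names),
                     ("classes", by_classes), ("regex", by_regex)]:
    print(name, len(result))
print(link_texts)
```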

CSS selectors:

import requests
import re
from bs4 import BeautifulSoup

url = 'http://blog.like4h.top/'

re_html = requests.get(url)
#print(re_html.text)
bs = BeautifulSoup(re_html.text,'html.parser')

#   CSS selectors

print(bs.select('title'))   # select tags by tag name
print(bs.select('.article-title'))   # select by class (a class selector starts with a dot); returns a list of every tag that carries the class

#   Searching by id

print(bs.select("#page-header"))  # select the tag with id=page-header (# means id); also returns a list of matches
print(bs.select('a[class="article-title"]'))   # select <a> tags whose class attribute is exactly article-title; returns a list
print(bs.select('head > meta'))  # select every <meta> that is a direct child of <head>
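The selector forms above can be checked against a tiny page without any network access. This is a sketch on made-up markup; select_one(), which returns the first match or None rather than a list, is added here as a convenience and does not appear in the listing above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """
<html><head>
<meta charset="utf-8">
<meta name="author" content="demo">
<title>Demo</title>
</head><body>
<div id="page-header"><a class="article-title" href="/a.html">A</a></div>
<p class="article-title">Not a link</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("title")                     # by tag name
by_class = soup.select(".article-title")          # by class: matches the <a> and the <p>
by_id = soup.select("#page-header")               # by id
links = soup.select('a[class="article-title"]')   # tag name plus attribute value
metas = soup.select("head > meta")                # direct children only
first_link = soup.select_one("#page-header a")    # first match, or None if nothing matches

print(len(by_class), len(by_id), len(links), len(metas))
print(first_link["href"])
```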

children = bs.head.children     # iterator over the direct children of <head> (tags and text nodes)
children = bs.head.contents     # the same nodes as .children, but returned as a list
for child in children:
    print(child)
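One detail that is easy to miss: both .children and .contents yield text nodes (such as the newline between two tags) as well as tags. The following sketch, on a made-up snippet, shows one possible way to keep only the tags by checking each node against bs4's Tag class.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two tags separated by a newline text node.
html = '<head><title>Demo</title>\n<meta charset="utf-8"></head>'
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents          # a list of child nodes
children = list(soup.head.children)    # .children is an iterator over the same nodes

# Keep only real tags, dropping the whitespace text node.
tag_children = [node for node in soup.head.children if isinstance(node, Tag)]

print(len(contents))
print([tag.name for tag in tag_children])
```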