This article is about 1,060 characters; estimated reading time: 3 minutes.
Libraries used: urllib, re, bs4, selenium.
Fetch the HTML from a URL:
html = urlopen(url).read().decode("utf-8")
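A minimal, self-contained sketch of this step. To keep it runnable without network access, a `data:` URL (which `urlopen` also accepts) stands in for a real `http(s)` address; in the article's setting `url` would be a Baidu Baike page.

```python
from urllib.request import urlopen

# Assumption: a data: URL replaces a real page so the sketch runs offline.
url = "data:text/html;charset=utf-8,<html><h1>demo</h1></html>"
html = urlopen(url).read().decode("utf-8")
print(html)
```

`read()` returns raw bytes, so the explicit `decode("utf-8")` is what turns the response into a Python string.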
Build a BeautifulSoup object from the HTML:
soup = BeautifulSoup(html,"html.parser")
find_all() searches the BeautifulSoup parse tree and returns a list of every tag that matches the given conditions; regular expressions can be used in the attribute filters:
urls = soup.find_all("a", {"target": "_blank", "href": re.compile("^/item/(%.{2})+")})
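A small self-contained demonstration of this filter. The HTML snippet below is hypothetical, written in the style of Baike entry links; only anchors with `target="_blank"` and an `href` of the form `/item/%XX%XX...` should survive the filter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating Baike-style links (not fetched from the site).
html = ('<a target="_blank" href="/item/%E8%AF%8D%E6%9D%A1">entry</a>'
        '<a target="_blank" href="/other/x">wrong path</a>'
        '<a href="/item/%E8%AF%8D">missing target</a>')
soup = BeautifulSoup(html, "html.parser")

# Both filters must match: the literal target attribute, and the regex on href.
urls = soup.find_all("a", {"target": "_blank",
                           "href": re.compile("^/item/(%.{2})+")})
print([a["href"] for a in urls])
```

The regex `^/item/(%.{2})+` accepts only percent-encoded paths such as `/item/%E8%AF%8D`, which is how the crawler restricts itself to entry pages whose titles are URL-encoded Chinese.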
Starting from one Baidu Baike entry, repeatedly jump to a randomly chosen linked entry, for 20 hops:
from urllib.request import urlopen
import re
from bs4 import BeautifulSoup
from random import sample

baseUrl = r"https://baike.baidu.com"
his = [r"/item/%E5%85%A8%E5%9B%BD%E9%9D%92%E5%B0%91%E5%B9%B4%E4%BF%A1%E6%81%AF%E5%AD%A6%E5%A5%A5%E6%9E%97%E5%8C%B9%E5%85%8B%E8%81%94%E8%B5%9B"]
for i in range(20):
    if len(his) == 0:
        break
    html = urlopen(baseUrl + his[-1]).read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")
    print(i, soup.h1.get_text(), " url: " + his[-1])
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("^/item/(%.{2})+")})
    if len(sub_urls) != 0:
        # pick one eligible link at random and descend into it
        his.append(sample(sub_urls, 1)[0]["href"])
        # print("Next :", his[-1])
    else:
        # dead end: no eligible links on this page, back up one level
        print("No eligible href is found.")
        his.pop()
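The random step in the walk hinges on one idiom: `sample(seq, 1)` returns a one-element list, and `[0]` extracts that element. A tiny sketch (the link list here is made up for illustration):

```python
from random import sample

# Hypothetical candidate links, standing in for the find_all() results.
links = ["/item/a", "/item/b", "/item/c"]

# sample(links, 1) draws one element without replacement and wraps it
# in a list; [0] unwraps it to get the chosen link itself.
next_link = sample(links, 1)[0]
print(next_link in links)
```

Using `random.choice(links)` would be the more direct idiom for picking a single element, but the original code's `sample(..., 1)[0]` is equivalent.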
Reposted from: http://sgkzi.baihongyu.com/