This article is about 1,060 characters; estimated reading time: 3 minutes.
Libraries used: urllib, re, bs4, selenium.
Fetch the HTML from a URL:
html = urlopen(url).read().decode("utf-8")
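A minimal, self-contained sketch of this step. To keep it runnable without network access, a `data:` URL (which `urlopen` also accepts) stands in for a real `http(s)` address; in the article's setting `url` would be a Baidu Baike page.

```python
from urllib.request import urlopen

# Assumption: a data: URL replaces a real page so the sketch runs offline.
url = "data:text/html;charset=utf-8,<html><h1>demo</h1></html>"
html = urlopen(url).read().decode("utf-8")
print(html)
```

`read()` returns raw bytes, so the explicit `decode("utf-8")` is what turns the response into a Python string.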
Build a BeautifulSoup object from the HTML:
soup = BeautifulSoup(html,"html.parser")
find_all() searches the BeautifulSoup parse tree and returns a list of every tag that matches the given conditions; regular expressions can be used in the attribute filters:
urls = soup.find_all("a", {"target": "_blank", "href": re.compile("^/item/(%.{2})+")})
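A small self-contained demonstration of this filter. The HTML snippet below is hypothetical, written in the style of Baike entry links; only anchors with `target="_blank"` and an `href` of the form `/item/%XX%XX...` should survive the filter.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating Baike-style links (not fetched from the site).
html = ('<a target="_blank" href="/item/%E8%AF%8D%E6%9D%A1">entry</a>'
        '<a target="_blank" href="/other/x">wrong path</a>'
        '<a href="/item/%E8%AF%8D">missing target</a>')
soup = BeautifulSoup(html, "html.parser")

# Both filters must match: the literal target attribute, and the regex on href.
urls = soup.find_all("a", {"target": "_blank",
                           "href": re.compile("^/item/(%.{2})+")})
print([a["href"] for a in urls])
```

The regex `^/item/(%.{2})+` accepts only percent-encoded paths such as `/item/%E8%AF%8D`, which is how the crawler restricts itself to entry pages whose titles are URL-encoded Chinese.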
Starting from one Baidu Baike entry, repeatedly jump to a randomly chosen linked entry, for 20 hops:
from urllib.request import urlopen
import re
from bs4 import BeautifulSoup
from random import sample

baseUrl = r"https://baike.baidu.com"
his = [r"/item/%E5%85%A8%E5%9B%BD%E9%9D%92%E5%B0%91%E5%B9%B4%E4%BF%A1%E6%81%AF%E5%AD%A6%E5%A5%A5%E6%9E%97%E5%8C%B9%E5%85%8B%E8%81%94%E8%B5%9B"]
for i in range(20):
    if len(his) == 0:
        break
    html = urlopen(baseUrl + his[-1]).read().decode("utf-8")
    soup = BeautifulSoup(html, "html.parser")
    print(i, soup.h1.get_text(), " url: " + his[-1])
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("^/item/(%.{2})+")})
    if len(sub_urls) != 0:
        # pick one eligible link at random and descend into it
        his.append(sample(sub_urls, 1)[0]["href"])
        # print("Next :", his[-1])
    else:
        # dead end: no eligible links on this page, back up one level
        print("No eligible href is found.")
        his.pop()
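The random step in the walk hinges on one idiom: `sample(seq, 1)` returns a one-element list, and `[0]` extracts that element. A tiny sketch (the link list here is made up for illustration):

```python
from random import sample

# Hypothetical candidate links, standing in for the find_all() results.
links = ["/item/a", "/item/b", "/item/c"]

# sample(links, 1) draws one element without replacement and wraps it
# in a list; [0] unwraps it to get the chosen link itself.
next_link = sample(links, 1)[0]
print(next_link in links)
```

Using `random.choice(links)` would be the more direct idiom for picking a single element, but the original code's `sample(..., 1)[0]` is equivalent.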
Reposted from: http://sgkzi.baihongyu.com/