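Task
Visit the famous-lines index page of the Gushiwen classical poetry site (https://so.gushiwen.cn/mingjus/).
Scrape the famous lines and their sources (including links) from that page and save them to a text file, poems.txt. Each famous line takes one line, followed by its indented detail text, in this format:
Number (starting from 1, right-aligned to a width of 3): line--source(link to the full poem)
Indented by two spaces: the poem's translation, annotations, and appreciation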
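The record format can be sketched as a small helper. This is only an illustration: the name `format_entry`, its parameters, and the sample values are assumptions, not part of the original script.

```python
def format_entry(rank, quote, source, url, detail):
    """Build one record: a header line plus detail text indented by two spaces."""
    # Right-align only the number to a width of 3, as the task requires.
    header = f"{rank:>3}: {quote}--{source}({url})"
    # Indent every line of the detail text by two spaces.
    body = "\n".join("  " + line for line in detail.splitlines())
    return header + "\n" + body + "\n"


# Hypothetical sample values, used only to show the layout.
entry = format_entry(
    1, "some line", "Author, Title", "https://example.com/poem",
    "first note\nsecond note",
)
print(entry, end="")
```

Applying the width to the number alone, rather than to the whole line, is what keeps the colons aligned once the numbering reaches two and three digits.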
Environment setup
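Make sure the following Python libraries are installed: requests and beautifulsoup4, plus lxml, which the script below passes to BeautifulSoup as its parser.
You can install them with the following command:
pip install requests beautifulsoup4 lxml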
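To confirm the installation before running the scraper, a quick check along these lines may help. It is a sketch using only the standard library; `missing_modules` is a hypothetical helper name, and the module list assumes the script imports `requests` and `bs4` and uses the `lxml` parser.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Note that the beautifulsoup4 package is imported under the name bs4.
missing = missing_modules(["requests", "bs4", "lxml"])
print("missing:", ", ".join(missing) if missing else "none")
```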
Code
from bs4 import BeautifulSoup as BS
import requests

BASE_URL = "https://so.gushiwen.cn"

# Fetch the famous-lines index page and select one div.cont per entry.
soup = BS(requests.get(BASE_URL + "/mingjus/").content.decode("utf-8"), "lxml")
content = soup.select('body > div.main3 > div.left > div.sons > div.cont')

# Write to poems.txt, the file name the task asks for.
with open("poems.txt", 'w', encoding='utf-8') as fs:
    for rank, item in enumerate(content, start=1):
        # links[0] is the famous line itself; links[1], if present, is its source.
        links = item.find_all('a')
        url = BASE_URL + links[0]['href']

        # Fetch the linked page and indent each line of its text by two spaces.
        temp_soup = BS(requests.get(url).content.decode("utf-8"), "lxml")
        temp_content = temp_soup.select('#sonsyuanwen > div.cont > div.contson')
        detail_lines = []
        for block in temp_content:
            for text_line in block.text.split('\n'):
                detail_lines.append("  " + text_line)

        poem = links[0].text
        if len(links) == 1:
            poet = "没有出处"  # "no source given"
        else:
            poet = "出自" + links[1].text  # "from" + source

        # Pad only the number to width 3; padding the whole line has no visible effect.
        fs.write(f"{rank:>3}: {poem}--{poet}({url})\n")
        # End the detail text with a newline so the next entry starts on its own line.
        if detail_lines:
            fs.write('\n'.join(detail_lines) + '\n')
Results
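One way to sanity-check the generated file against the format in the task is to parse its header lines. The following is a rough sketch, not part of the original script: `HEADER_RE` and `parse_header` are hypothetical names, and the pattern assumes a header of the form "number: line--source(url)" with the number padded to at least three characters.

```python
import re

# Groups: padded number, famous line, source, link.
HEADER_RE = re.compile(r"^( *\d+): (.+)--(.+)\((https?://\S+)\)$")


def parse_header(line):
    """Return the fields of a header line, or None if the line is not a header."""
    m = HEADER_RE.match(line.rstrip("\n"))
    # Reject lines whose number field is narrower than three characters.
    if m is None or len(m.group(1)) < 3:
        return None
    number, quote, source, url = m.groups()
    return {"rank": int(number), "quote": quote, "source": source, "url": url}


# Hypothetical sample lines, used only to exercise the parser.
samples = [
    "  1: some line--some source(https://so.gushiwen.cn/example.aspx)",
    "  an indented line of translation text",
]
for sample in samples:
    print(parse_header(sample))
```

Running `parse_header` over every line of poems.txt and checking that the ranks of the matching lines count up from 1 would catch both misaligned numbers and entries that ran together on one line.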