Task

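Visit the famous-quotes index page of the Gushiwen classical poetry site (https://so.gushiwen.cn/mingjus/), scrape the quotes and their sources (including links), and save them to a text file named poems.txt. Each quote starts on its own line, in the following format:

Number (starting at 1, right-aligned in a 3-character field): quote--source (link to the full poem)
  (indented by two spaces: the translation, annotations, and appreciation of the quote)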
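The record layout described above can be sketched with the standard library alone. This is a minimal illustration, not part of the original script: the helper name `format_entry`, the sample strings, and the placeholder URL are all assumptions made for the example.

```python
import textwrap


def format_entry(rank, quote, source, url, details):
    """Format one record: a header line, then the details indented by two spaces."""
    # {rank:>3} right-aligns the number in a 3-character field
    header = f"{rank:>3}: {quote}--{source}({url})"
    # textwrap.indent prefixes every non-blank line of the details with two spaces
    body = textwrap.indent(details, "  ")
    return header + "\n" + body


entry = format_entry(
    1,
    "sample quote",
    "sample source",
    "https://example.com/poem",  # placeholder URL, not a real page
    "first detail line\nsecond detail line",
)
print(entry)
```

Formatting the number on its own, rather than the whole header line, is what keeps records 1 through 999 aligned.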

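Environment setup

Make sure the following Python libraries are installed:

  • requests
  • beautifulsoup4
  • lxml (the parser that the code below passes to BeautifulSoup)

They can be installed with:

pip install requests beautifulsoup4 lxml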

Code

from bs4 import BeautifulSoup as BS
import requests

BASE_URL = "https://so.gushiwen.cn"

# Fetch the famous-quotes index page and select one node per quote
soup = BS(requests.get(BASE_URL + "/mingjus/").content.decode("utf-8"), "lxml")
content = soup.select('body > div.main3 > div.left > div.sons > div.cont')

with open("poems.txt", 'w', encoding='utf-8') as fs:
    for rank, item in enumerate(content, start=1):
        # Links in the entry: the first is the quote, the second (if present) its source
        links = item.find_all('a')
        url = BASE_URL + links[0]['href']
        temp_soup = BS(requests.get(url).content.decode("utf-8"), "lxml")

        # Text from the linked page, with every line indented by two spaces
        detail_lines = []
        for block in temp_soup.select('#sonsyuanwen > div.cont > div.contson'):
            for text_line in block.text.split('\n'):
                detail_lines.append("  " + text_line)
        line2 = '\n'.join(detail_lines)

        poem = links[0].text
        if len(links) == 1:
            poet = "没有出处"  # "no source given"
        else:
            poet = "出自" + links[1].text  # "from" + source title

        # Right-align the number itself in a 3-character field
        line1 = f"{rank:>3}: {poem}--{poet}({url})"
        fs.write(line1 + '\n')
        fs.write(line2 + '\n')
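The script above builds each detail-page URL by prefixing the site root to the link's href, which only works if every href is site-relative. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative and absolute hrefs alike. The helper name `absolute_url` and the sample paths are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

INDEX_URL = "https://so.gushiwen.cn/mingjus/"


def absolute_url(href):
    """Resolve an href found on the index page against the page's own URL."""
    return urljoin(INDEX_URL, href)


# Both hrefs below are invented examples, not real paths on the site.
print(absolute_url("/example/page.aspx"))
print(absolute_url("https://so.gushiwen.cn/example/page.aspx"))
```

Both calls print the same absolute URL, so the rest of the script would not need to know which form the page uses.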

Results