GitHub: https://github.com/buriy/python-readability
PyPI: https://pypi.org/project/readability-lxml/
Installation
$ pip install readability-lxml
Code example
# -*- coding: utf-8 -*-
from readability import Document
import requests
url = "https://blog.csdn.net/mouday/article/details/94021769"
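response = requests.get(url, timeout=10)  # do not hang forever on a slow server
response.raise_for_status()  # stop early on an HTTP error page
response.encoding = "utf-8"  # this page is served as UTF-8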
doc = Document(response.text)
print(doc.title())  # page title
print(doc.summary())  # main content, returned as cleaned-up HTML
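After trying this on several pages, I found that the main content is extracted correctly for some of them, while for other sites the extracted result is wrong.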
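Since extraction does not work on every site, it can help to flag suspicious results automatically. The following is only a rough sketch using the standard library: the helpers `visible_text` and `looks_extracted` and the 200-character threshold are my own assumptions, not part of readability, and a short result does not always mean the extraction failed.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def looks_extracted(summary_html, min_chars=200):
    """Hypothetical heuristic: treat very short text as a failed extraction.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    return len(visible_text(summary_html)) >= min_chars
```

With the example above, this could be called as `looks_extracted(doc.summary())`, and pages that return False could be checked by hand.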