Python: Extracting a Web Page's Title and Main Content with readability-lxml

github: https://github.com/buriy/python-readability
pypi: https://pypi.org/project/readability-lxml/

Installation

$ pip install readability-lxml

Code example

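# -*- coding: utf-8 -*-

from readability import Document
import requests

url = "https://blog.csdn.net/mouday/article/details/94021769"
# A timeout keeps the script from hanging on an unresponsive server
response = requests.get(url, timeout=10)
response.encoding = "utf-8"

# Document parses the raw HTML and scores its nodes to find the article body
doc = Document(response.text)

print(doc.title())     # page title
print(doc.summary())   # main content, returned as cleaned HTML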
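
Note that doc.summary() returns an HTML fragment rather than plain text. If only the text is needed, the tags can be stripped afterwards. Below is a minimal sketch that uses only the standard library's html.parser. The names TextExtractor and html_to_text are hypothetical helpers written for this illustration and are not part of readability-lxml, and the sample string merely stands in for the output of doc.summary().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Stand-in for the HTML returned by doc.summary()
sample = (
    "<html><body><div><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<script>var x = 1;</script></div></body></html>"
)
print(html_to_text(sample))
```

In the example above, this would be called as html_to_text(doc.summary()). Joining the pieces with a single space is a simplification; paragraph breaks are lost, which may or may not matter depending on the use case.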

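After trying it on several pages, I found that the main content is extracted correctly for some of them, but not for pages from certain other sites.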