Hi everyone, I'm He San, a post-80s veteran programmer and independent developer.
Today I want to share a seriously cool document-processing tool called SmolDocling. The name sounds cute, but its abilities are anything but small: it is a compact vision-language model built specifically for documents that mix text with images, tables, and other elements.
Simply put, SmolDocling works like an all-round document interpreter: whether you hand it an academic paper full of charts or a technical document full of code blocks, it handles both with ease. Take a photo of a book page containing tables, formulas, and images, and it will recognize each of those elements, preserve the original layout structure, and convert the page into the format you want, such as Markdown or HTML. Sounds practical, doesn't it?
Its greatest strength is being small but complete. Despite its compact size, its feature set is remarkably broad: OCR text recognition, table-structure extraction, code-format restoration, and even precise parsing of mathematical formulas and chart data. For example, upload a photo of a math exam sheet covered in complex formulas, and it will not only recognize each formula but also convert it into LaTeX code automatically. Pretty impressive, right?
🚀 Features:
- 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- 🔍 OCR (Optical Character Recognition) – Accurately extracts text from images.
- 📐 Layout and Localization – Preserves document structure and the bounding boxes of document elements.
- 💻 Code Recognition – Detects and formats code blocks, including indentation.
- 🔢 Formula Recognition – Identifies and processes mathematical expressions.
- 📊 Chart Recognition – Extracts and interprets chart data.
- 📑 Table Recognition – Supports column and row headers for structured table extraction.
- 🖼️ Figure Classification – Differentiates figures and graphical elements.
- 📝 Caption Correspondence – Links captions to their related images and figures.
- 📜 List Grouping – Organizes and structures list elements correctly.
- 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 OCR with Bounding Boxes – Runs OCR over regions specified by bounding boxes.
- 📂 General Document Processing – Trained on both scientific and non-scientific documents.
- 🔄 Seamless Docling Integration – Imports into Docling and exports to multiple formats.
- 💨 Fast Inference Using vLLM – Averages 0.35 seconds per page on an A100 GPU.
🚧 Coming soon!
- 📊 Better chart recognition 🛠️
- 📚 One-shot multi-page inference ⏱️
- 🧪 Chemistry recognition
- 📙 Datasets
Getting started examples
📄 Single-page image inference using Transformers 🤖
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)
# MD
print(doc.export_to_markdown())
🚀 Fast batch inference using vLLM
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir
import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path
# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/" # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."
# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192)
chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"
image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])
start_time = time.time()
total_tokens = 0
for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")
    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]
    doctags = output.outputs[0].text
    img_fn = os.path.splitext(img_file)[0]
    output_filename = img_fn + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(doctags)
    # To convert to Docling Document, MD, HTML, etc.:
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)
    # export as any format
    # HTML
    # output_path_html = Path(OUTPUT_DIR) / f"{img_fn}.html"
    # doc.save_as_html(output_path_html)
    # MD
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"
    doc.save_as_markdown(output_path_md)
print(f"Total time: {time.time() - start_time:.2f} sec")
ONNX inference
# Prerequisites:
# pip install onnxruntime
# pip install onnxruntime-gpu
from transformers import AutoConfig, AutoProcessor
from transformers.image_utils import load_image
import onnxruntime
import numpy as np
import os
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
os.environ["OMP_NUM_THREADS"] = "1"
# cuda
os.environ["ORT_CUDA_USE_MAX_WORKSPACE"] = "1"
# 1. Load models
## Load config and processor
model_id = "ds4sd/SmolDocling-256M-preview"
config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
## Load sessions
# !wget https://hf-mirror.com/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/vision_encoder.onnx
# !wget https://hf-mirror.com/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/embed_tokens.onnx
# !wget https://hf-mirror.com/ds4sd/SmolDocling-256M-preview/resolve/main/onnx/decoder_model_merged.onnx
# cpu
# vision_session = onnxruntime.InferenceSession("vision_encoder.onnx")
# embed_session = onnxruntime.InferenceSession("embed_tokens.onnx")
# decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx")
# cuda
vision_session = onnxruntime.InferenceSession("vision_encoder.onnx", providers=["CUDAExecutionProvider"])
embed_session = onnxruntime.InferenceSession("embed_tokens.onnx", providers=["CUDAExecutionProvider"])
decoder_session = onnxruntime.InferenceSession("decoder_model_merged.onnx", providers=["CUDAExecutionProvider"])
## Set config values
num_key_value_heads = config.text_config.num_key_value_heads
head_dim = config.text_config.head_dim
num_hidden_layers = config.text_config.num_hidden_layers
eos_token_id = config.text_config.eos_token_id
image_token_id = config.image_token_id
end_of_utterance_id = processor.tokenizer.convert_tokens_to_ids("<end_of_utterance>")
# 2. Prepare inputs
## Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
## Load image and apply processor
image = load_image("https://ibm.biz/docling-page-with-table")
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")
## Prepare decoder inputs
batch_size = inputs['input_ids'].shape[0]
past_key_values = {
    f'past_key_values.{layer}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32)
    for layer in range(num_hidden_layers)
    for kv in ('key', 'value')
}
image_features = None
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
position_ids = np.cumsum(inputs['attention_mask'], axis=-1)
# 3. Generation loop
max_new_tokens = 8192
generated_tokens = np.array([[]], dtype=np.int64)
for i in range(max_new_tokens):
    inputs_embeds = embed_session.run(None, {'input_ids': input_ids})[0]
    if image_features is None:
        ## Only compute vision features if not already computed
        image_features = vision_session.run(
            ['image_features'],  # List of output names or indices
            {
                'pixel_values': inputs['pixel_values'],
                'pixel_attention_mask': inputs['pixel_attention_mask'].astype(np.bool_)
            }
        )[0]
        ## Merge text and vision embeddings
        inputs_embeds[inputs['input_ids'] == image_token_id] = image_features.reshape(-1, image_features.shape[-1])
    logits, *present_key_values = decoder_session.run(None, dict(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        position_ids=position_ids,
        **past_key_values,
    ))
    ## Update values for next generation loop
    input_ids = logits[:, -1].argmax(-1, keepdims=True)
    attention_mask = np.ones_like(input_ids)
    position_ids = position_ids[:, -1:] + 1
    for j, key in enumerate(past_key_values):
        past_key_values[key] = present_key_values[j]
    generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
    if (input_ids == eos_token_id).all() or (input_ids == end_of_utterance_id).all():
        break  # Stop predicting
doctags = processor.batch_decode(
    generated_tokens,
    skip_special_tokens=False,
)[0].lstrip()
print(doctags)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())
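The generation loop above is plain greedy decoding with hand-managed state: take the argmax of the last position's logits, append the token, and stop at an end token. Stripped of the actual model, the control flow can be sketched like this (`fake_logits` below is an invented stand-in for `decoder_session.run`, purely for illustration):

```python
import numpy as np

EOS_ID = 2

def fake_logits(token_id: int) -> np.ndarray:
    # Stand-in for the decoder: deterministically maps each token to
    # "token + 1" so the loop reaches EOS_ID and terminates.
    nxt = min(token_id + 1, EOS_ID)
    logits = np.zeros((1, 1, EOS_ID + 1))
    logits[0, 0, nxt] = 1.0
    return logits

input_ids = np.array([[0]])                  # start token
generated = np.array([[]], dtype=np.int64)
for _ in range(10):                          # max_new_tokens
    logits = fake_logits(int(input_ids[0, 0]))
    input_ids = logits[:, -1].argmax(-1, keepdims=True)  # greedy pick
    generated = np.concatenate([generated, input_ids], axis=-1)
    if (input_ids == EOS_ID).all():          # same stop test as above
        break
print(generated.tolist())  # [[1, 2]]
```

The real loop adds two optimizations on top of this skeleton: vision features are computed once and cached, and past key/values are fed back so each step only processes the newest token.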
Even more thoughtful is that SmolDocling ships with a tagging scheme called DocTags. It is like giving every element in a document its own ID card so the model never mixes them up: an image gets a <figure> tag, a code block gets a <code> tag, and so on. That makes conversion to other formats both efficient and accurate, with no more worrying about the layout turning into a mess.
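To make the idea concrete, here is a toy sketch in Python. The tag names echo the ones mentioned above, but this is illustrative pseudo-markup, not the real DocTags grammar, and the renderer mapping is invented for the example:

```python
import re

# Hypothetical tagged page, invented for illustration only.
tagged = ("<figure>chart of Q3 sales</figure>"
          "<code>print('hi')</code>"
          "<text>Summary follows.</text>")

# Because every element carries an explicit tag, converting to another
# format reduces to mapping each tag to a renderer, e.g. a Markdown-ish one:
renderers = {
    "figure": lambda body: f"![figure: {body}]",
    "code": lambda body: "    " + body,  # render code as an indented block
    "text": lambda body: body,
}

parts = re.findall(r"<(\w+)>(.*?)</\1>", tagged, flags=re.S)
markdown = "\n\n".join(renderers[tag](body) for tag, body in parts)
print(markdown)
```

In the real pipeline this mapping is what `DoclingDocument.export_to_markdown()` and `save_as_html()` do for you, driven by the actual DocTags vocabulary.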
It is fast, too: on an A100 GPU it averages just 0.35 seconds per page. Whether you are a researcher processing papers or an office worker organizing reports, it can save you real time. Best of all, the team keeps upgrading it; stronger chart analysis and chemical-formula recognition are on the roadmap, which is genuinely exciting.
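For a sense of scale, 0.35 seconds per page works out to roughly 170 pages per minute:

```python
seconds_per_page = 0.35  # average reported for an A100 GPU
pages_per_minute = 60 / seconds_per_page
print(round(pages_per_minute))  # 171
```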
All in all, SmolDocling is like a versatile document butler that makes complex mixed-media document processing simple and even fun. If you regularly deal with all kinds of documents, keep an eye on this gem; it might boost your productivity more than you expect!
🔥 Bonus time: follow the WeChat official account 何三笔记 and reply with the keyword「20250217」to receive the Tsinghua-produced "DeepSeek Essentials" series for free:
- [Tsinghua, Vol. 1] DeepSeek from Beginner to Expert.pdf
- [Tsinghua, Vol. 2] DeepSeek Empowering the Workplace.pdf
- [Tsinghua, Vol. 3] How Ordinary People Can Seize the DeepSeek Opportunity.pdf
- [Tsinghua, Vol. 4] DeepSeek + DeepResearch: Making Research as Easy as Chatting.pdf
- [Tsinghua, Vol. 5] DeepSeek and AI Hallucinations.pdf