HyperAIHyperAI

Command Palette

Search for a command to run...

Reader-LM: Convert HTML to MarkDown Quickly and Efficiently

1. Tutorial Introduction

该教程使用的基础算力为 RTX 4090 。

Reader-LM is a series of small language models developed by Jina AI in 2024, specifically for converting raw HTML content on the web into clear and tidy Markdown format. These models include Reader-LM-0.5B and Reader-LM-1.5B, which excel in processing long texts and multilingual content, supporting context lengths up to 256K bytes.

The Reader-LM models are designed to address the need for efficient and economical data extraction from noisy web content. They outperform several large language models such as GPT-4o and Gemini-1.5-Flash in HTML to Markdown conversion tasks, while being smaller and more suitable for running in resource-constrained environments.

The model is trained on a curated collection of HTML content and its corresponding Markdown content. This tutorial demonstrates how to convert HTML to markdown using reader-lm-1.5b or reader-lm-0.5b.

请注意!模型的输入(即提示)是原始 HTML—不需要前缀指令。

2. Operation steps

1. 启动容器后点击 API 地址即可进入 Web 界面 (需要完成实名认证,无需打开工作空间)
2. WebUI Demo 详细教程
* 模型输入:一定要注意模型的输入(即提示)是原始 HTML—不需要前缀指令。

* 模型选择:jina 提供了 2 个参数量不同的模型,分别为 reader-lm-1.5B 和 reader-lm-0.5B,可根据自己的需要进行选择。

* 这里我们选择一个示例点击提交即可看到模型输出结果,一定要注意模型的输入(即提示)是原始 HTML—不需要前缀指令。
* 生成结果
  • Reader LM Output: the result of using the model output;
  • Markdownify Output: markdownify is a Python library that can convert HTML content to Markdown format. This library is particularly useful when you need to display data originally in HTML format on a platform that supports Markdown.
    • Save the file as shown in the figure below: Two md files are generated each time, the file name is timestamp + generation method, and the save directory is: ./HTML-to-Markdown/output_md/「timestamp」_「generation method」.md 

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
Reader-LM: Convert HTML to MarkDown Quickly and Efficiently | Tutorials | HyperAI