CED: Catalog Extraction from Documents

Tong Zhu; Guoliang Zhang; Zechang Li; Zijian Yu; Junfei Ren; Mengsong Wu; Zhefeng Wang; Baoxing Huai; Pingfu Chao; Wenliang Chen

Abstract

Sentence-by-sentence information extraction from long documents is an exhausting and error-prone task. As indicators of a document's skeleton, catalogs naturally chunk documents into segments and provide informative cascade semantics, which can help reduce the search space. Despite their usefulness, catalogs are hard to extract without assistance from external knowledge. For documents that adhere to a specific template, regular expressions are a practical way to extract catalogs. However, handcrafted heuristics are not applicable when processing documents from different sources with diverse formats. To address this problem, we build a large manually annotated corpus, which is the first dataset for the Catalog Extraction from Documents (CED) task. Based on this corpus, we propose a transition-based framework for parsing documents into catalog trees. The experimental results demonstrate that our proposed method outperforms baseline systems and shows good transfer ability. We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents. Data and code are available at https://github.com/Spico197/CatalogExtraction
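
To make the idea of a transition-based parse into a catalog tree more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not list the action inventory, so the four action names below (`SUB_HEADING`, `SUB_TEXT`, `CONCAT`, `REDUCE`), the `Node` class, and the `build_catalog_tree` helper are all assumptions made for illustration. In a real system a trained classifier would predict each action from the stack top and the incoming segment; here the action sequence is supplied by hand.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical action inventory (an assumption for this sketch).
SUB_HEADING = "sub_heading"  # attach segment as a heading under the stack top, then push it
SUB_TEXT = "sub_text"        # attach segment as body text under the stack top, then push it
CONCAT = "concat"            # append segment to the text of the stack top
REDUCE = "reduce"            # pop the stack top; consumes no segment


@dataclass
class Node:
    text: str
    kind: str  # "root", "heading", or "text"
    children: List["Node"] = field(default_factory=list)


def build_catalog_tree(segments: List[str], actions: List[str]) -> Node:
    """Apply a sequence of actions to raw text segments and return the tree root."""
    root = Node("ROOT", "root")
    stack = [root]
    i = 0  # index of the next unconsumed segment
    for action in actions:
        if action == REDUCE:
            if len(stack) == 1:
                raise ValueError("cannot reduce the root")
            stack.pop()
            continue
        if i >= len(segments):
            raise ValueError("no segment left for action " + action)
        segment = segments[i]
        i += 1
        if action == CONCAT:
            if len(stack) == 1:
                raise ValueError("cannot concat onto the root")
            stack[-1].text += segment
        elif action in (SUB_HEADING, SUB_TEXT):
            kind = "heading" if action == SUB_HEADING else "text"
            node = Node(segment, kind)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            raise ValueError("unknown action " + action)
    if i != len(segments):
        raise ValueError("not all segments were consumed")
    return root


# Example: segments as they might come out of a PDF-to-text converter.
segments = [
    "1 Overview", "This report covers ", "fiscal 2022.",
    "1.1 Scope", "Only domestic units.",
    "2 Risks", "Market risk.",
]
actions = [
    SUB_HEADING, SUB_TEXT, CONCAT, REDUCE,
    SUB_HEADING, SUB_TEXT, REDUCE, REDUCE, REDUCE,
    SUB_HEADING, SUB_TEXT,
]
tree = build_catalog_tree(segments, actions)
```

In this example the two top-level headings become children of the root, "1.1 Scope" is nested under "1 Overview", and the two fragments of the first paragraph are merged into one text node by the `CONCAT` step.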

Code Repositories

spico197/catalogextraction (official, PyTorch)

Benchmarks

Benchmark: catalog-extraction-on-chcatext
Methodology: TRACER
Metrics: F1: 82.39
