Automatically identify the structure of a document’s contents

 

The Document Structure Analysis API identifies the main structural components of a document or email, extracting titles, section headings, subject, recipient, sender, and more to generate an outline resembling a “table of contents” of the document or message. Use it to get an overview of the component structure of a document.

MeaningCloud’s Document Structure Analysis API

Unfortunately, not all documents come with a built-in table of contents. Many documents and other content (such as emails) are presented as a plain sequence of words that must be read from beginning to end to get an idea of their structure. MeaningCloud’s Document Structure Analysis API automatically extracts that structure from both documents (title, section headings, and subsections) and emails (recipient, sender, subject).

The result is a structural understanding of the content: the components of the document and their titles, exactly as they appear in the original.
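To make the idea of a document or email "structure" concrete, here is a minimal local sketch in Python. It is not MeaningCloud's API or its method: it only uses the Python standard library to pull headings out of HTML markup and sender, recipient, and subject out of an email, which is a far simpler task than what the service does. The names `HeadingCollector`, `document_outline`, and `email_components` are hypothetical and chosen for illustration only.

```python
from email import message_from_string
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h6> elements (illustrative only)."""

    HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.outline = []    # list of (level, heading text)
        self._level = None   # level of the heading currently open, if any
        self._buffer = []    # text fragments seen inside the open heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_LEVELS:
            self._level = self.HEADING_LEVELS[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the heading only when the matching end tag arrives.
        if self._level is not None and self.HEADING_LEVELS.get(tag) == self._level:
            self.outline.append((self._level, "".join(self._buffer).strip()))
            self._level = None


def document_outline(html):
    """Return the headings of an HTML document in reading order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.outline


def email_components(raw_message):
    """Return the sender, recipient, and subject headers of a raw email."""
    msg = message_from_string(raw_message)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
    }
```

Calling `document_outline` on an HTML page returns an outline that resembles a table of contents, and `email_components` returns the three email fields named above. Plain text without markup would need language-based cues, which this sketch does not attempt.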

Document structure analysis applications

Automatically identifying the parts of a document provides you with a structural view that can be very useful in a variety of applications.


Knowledge management

When an organization’s knowledge is stored in thousands of documents, identifying the components that make up each one lets you put that knowledge to better use.


Content publishing

Complementing content with a description of its structure makes it easier to reuse and more valuable.


Communication surveillance

Automatically analyzing the structure of a collection of emails makes it possible to detect suspicious patterns in compliance applications.

Highlights of the Document Structure Analysis API

The Document Structure Analysis API is powerful, versatile, and useful in a wide range of scenarios.

 

Multilingual

It works regardless of the language the text is written in.

 

Powerful

It leverages both document markup and language markers.

 

For documents and emails

It identifies parts of documents and email components.

 

Flexible and easy to integrate

It supports various formats, and its standard interface allows for easy integration with any application.