Our APIs use the latest version of the Apache Tika toolkit to detect and extract metadata and structured text from submitted content. A comprehensive list of supported formats can be found on this page. The summary below covers the main formats supported by MeaningCloud's APIs, and it applies to both URLs and files submitted for analysis.

The following table shows, for each type of format, how the content is handled and which specific formats are supported:

Type | Behavior | Specific formats supported
Markup language formats | Everything contained between markup tags will be analyzed. The tags will be ignored. | HTML; XML; RSS
General text formats | The whole content of the files will be analyzed. | Plain text; RTF
Documents | The whole content of the files will be analyzed. | PDF; Microsoft Office (Word, Excel, PowerPoint, Publisher, Outlook, Visio); Open Document Format (ODF) (writer, spreadsheet, presentation); OOXML; iWork; Electronic Publication Format (EPUB)
Media file formats | Only the metadata associated with the file will be analyzed. | AIFF; WAV; MIDI; AAC; ASF; FLAC; MS-Wave; MPEG; Ogg/Vorbis; NIST SPHERE; Sun AU
Other formats | | Any email extension that complies with the Mbox format
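
As a rough illustration of the behaviors in the table above, the following Python sketch guesses which behavior to expect for a file name and mimics what "everything between markup tags is analyzed, the tags are ignored" means for a small HTML fragment. It is only a local approximation: the extension map, the category labels (`markup`, `full_text`, `metadata_only`) and the function names `expected_behavior` and `text_between_tags` are hypothetical, and the service itself detects formats with Apache Tika rather than by file extension.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath

# Hypothetical extension map, for illustration only. The real service
# relies on Apache Tika's detection, not on the file extension.
BEHAVIOR_BY_EXTENSION = {
    ".html": "markup", ".htm": "markup", ".xml": "markup", ".rss": "markup",
    ".txt": "full_text", ".rtf": "full_text",
    ".pdf": "full_text", ".docx": "full_text", ".odt": "full_text",
    ".epub": "full_text",
    ".aiff": "metadata_only", ".wav": "metadata_only",
    ".mid": "metadata_only", ".flac": "metadata_only",
}


def expected_behavior(file_name: str) -> str:
    """Return the assumed behavior label for a file name, or 'unknown'."""
    suffix = PurePosixPath(file_name).suffix.lower()
    return BEHAVIOR_BY_EXTENSION.get(suffix, "unknown")


class _TextCollector(HTMLParser):
    """Collects the text found between tags and discards the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_between_tags(markup: str) -> str:
    """Approximate the markup behavior: keep the text, drop the tags."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.chunks)


if __name__ == "__main__":
    print(expected_behavior("report.PDF"))
    print(expected_behavior("song.flac"))
    print(text_between_tags("<p>Hello <b>world</b></p>"))
```

A check like this can be useful before submitting content, for example to warn that an audio file will contribute only its metadata to the analysis. The text actually extracted by the service may differ from what this sketch produces.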