NLP technologies: state of the art, trends and challenges

This post presents MeaningCloud’s vision on the state of Natural Language Processing technology by the end of 2019, based on our work with customers and research projects.

NLP technology has practically achieved human quality (or even surpassed it) in many different tasks. This is mainly due to advances in machine learning/deep learning techniques, which make it possible to exploit large sets of training data to build language models, but also to improvements in core text processing engines and the availability of semantic knowledge databases.

NLP Tasks at human level

NLP is everywhere, as the technology behind many NLP tasks is approaching human quality:

  • Text categorization is the most popular task, used for spam detection, message routing or content analysis (see the sketch after this list)
  • Topic extraction is also a common task, mainly for tagging unstructured content and creating recommendation systems
  • Text clustering is the preferred unsupervised algorithm for exploratory analysis and trending topic detection
  • Fuzzy search and matching, for similarity detection, plagiarism, catalogue analysis, etc.
  • Machine translation
  • Core text processing tasks (parsing, semantic tagging, disambiguation) are the basis for the other tasks
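
As a minimal illustration of the first of these tasks, the following sketch trains a toy text categorizer with scikit-learn (the messages, labels and library choice are purely illustrative; production models need far more data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: a message and the category it should be routed to.
texts = ["Win a free prize now", "Invoice attached for your order",
         "Cheap meds online", "Meeting rescheduled to Friday"]
labels = ["spam", "billing", "spam", "scheduling"]

# TF-IDF features plus a linear classifier form a basic categorization pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Your order invoice is ready"]))  # expected: ['billing']
```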

Other popular tasks worth mentioning are information extraction, text understanding, chatbot implementation, summarization and text generation.

Traditional NLP techniques, such as rule-based models, dependency parsing or finite-state automata, are still in use, although machine learning and, specifically, deep learning have brought many advances to NLP tasks such as text categorization or semantic disambiguation.
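
For example, a dependency parse of the kind these core engines produce can be obtained with an off-the-shelf library such as spaCy (shown here purely as an illustration, not as the engine discussed in this post):

```python
import spacy

# Requires a model, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The bank approved the loan despite the risk.")

# Each token exposes its dependency label and syntactic head,
# which downstream tasks and rule-based components can build on.
for token in doc:
    print(f"{token.text:>10} --{token.dep_}--> {token.head.text}")
```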

Machine/deep learning

Machine learning makes model building easy and fast, but the drawback is that the resulting systems are most often a black box where adding new knowledge is hard or impossible (apart from adding more samples to the training data and rebuilding the model).

In addition, machine learning has not yet become the general solution for NLP tasks because of the lack of training corpora (large tagged datasets):

  • This problem is partially overcome by advanced techniques (such as transfer learning with transformers and improved attention layers) and pretrained language models (such as Google’s BERT, OpenAI’s GPT-2, ELMo or Microsoft’s MT-DNN). Transfer learning, in the context of NLP, is essentially the ability to train a model on one dataset and then adapt that model to perform different NLP functions on a different dataset. This shows promising results in generic domains like text generation, summary extraction or machine translation, consistently surpassing previous state-of-the-art results (see the sketch after this list).
  • Unfortunately, pretrained models are mostly language-dependent (mainly for English) and domain-independent (generic), and transfer learning has not yet advanced enough to adapt them to languages with less training data or to domains with specific vocabulary.
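
As a minimal sketch of what this kind of transfer learning looks like in code, the snippet below loads a pretrained BERT and takes one fine-tuning step on a single made-up example (it assumes a recent version of the Hugging Face transformers library and an invented three-class task):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Generic pretrained English BERT plus a fresh, task-specific classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# One labelled example from a hypothetical target dataset.
inputs = tokenizer("The product arrived damaged.", return_tensors="pt")
labels = torch.tensor([1])  # e.g. class 1 = "complaint"

# A single fine-tuning step: the pretrained weights are adapted to the new task
# instead of being trained from scratch.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
```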

Thus, traditional NLP methods, although more demanding in terms of human work, are still the best choice for many scenarios, as errors are generally easy to correct and precision can be fine-tuned incrementally.

Hybrid solutions

Deep learning is, in general, the best choice for text categorization when a large volume of training data is available. When training data is scarce, more classical machine learning techniques such as decision trees or SVMs generally provide better results at a lower computational cost.

Hybrid solutions combining machine learning (the machine’s opinion) with rule-based post-filtering (a human-like correction) provide the best results in terms of precision, and they are likely to become popular in the near future.
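
A highly simplified sketch of such a hybrid setup is shown below: a small scikit-learn classifier whose output is corrected by hand-written overrides (the data, categories and cue words are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Machine-learning step: a simple SVM classifier trained on toy data.
texts = ["card was blocked abroad", "cannot log into the app",
         "what is the mortgage rate", "transfer did not arrive"]
labels = ["cards", "digital", "mortgages", "transfers"]
clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)

# Rule-based post-filter: hand-written overrides that correct the
# classifier's opinion when an unambiguous cue is present.
OVERRIDES = {"mortgage": "mortgages", "iban": "transfers"}

def categorize(text: str) -> str:
    prediction = clf.predict([text])[0]
    for cue, category in OVERRIDES.items():
        if cue in text.lower():
            return category   # human-like correction wins
    return prediction         # otherwise trust the model

print(categorize("need help with my mortgage payment"))  # -> 'mortgages'
```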

Additionally, some machine/deep learning techniques are becoming helpful for supporting humans in the process of building/improving models:

  • Rule induction techniques for generating a first draft rule model.
  • Semantic expansion techniques (such as word/sentence embeddings) for improving rule recall (see the sketch below).
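
For instance, the second idea can be sketched with pretrained word embeddings from gensim: the nearest neighbours of a rule’s seed terms become candidate terms for a human to review (the vector model and seed terms here are illustrative):

```python
import gensim.downloader as api

# Pretrained word vectors (GloVe here, purely as an example resource).
vectors = api.load("glove-wiki-gigaword-100")

# Seed terms taken from an existing categorization rule.
seed_terms = ["invoice", "refund"]

# Expand each seed with its nearest neighbours in embedding space;
# a human then reviews the candidates before adding them to the rule.
for term in seed_terms:
    candidates = [word for word, _ in vectors.most_similar(term, topn=5)]
    print(term, "->", candidates)
```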

The Near or Mid-term Future

Enhanced pretrained models for more languages and for specific domains (e.g. banking, marketing), ready to be used in non-generic scenarios, are yet to come.

Enhanced transfer learning techniques, allowing further adaptation of those pretrained models using (reduced) domain-specific training data, have yet to be developed.

Currently, deep learning is still costly in terms of the hardware needed to train models and run the resulting services, but hardware and machine-learning-as-a-service platforms will become cheaper and more accessible in the near future.

Automatic optimization of model parameters, as in today’s AutoML offerings, will improve with techniques such as evolutionary algorithms, simplifying model building and achieving better results (see the sketch below).
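
A toy version of that idea, using a (1+1) evolution strategy to tune a single hyperparameter of a small text classifier, might look like this (the data and parameter range are invented; real AutoML systems search much larger spaces):

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["great service", "terrible support", "very helpful staff",
         "slow and rude", "quick response", "never buying again",
         "friendly and fast", "awful experience"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

def fitness(c: float) -> float:
    """Cross-validated accuracy for a given regularization strength C."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC(C=c))
    return cross_val_score(model, texts, labels, cv=4).mean()

# (1+1) evolution strategy: mutate the parameter, keep the mutant if it is better.
best_c, best_score = 1.0, fitness(1.0)
for _ in range(20):
    candidate = max(1e-3, best_c * random.uniform(0.5, 2.0))
    score = fitness(candidate)
    if score >= best_score:
        best_c, best_score = candidate, score

print(f"best C={best_c:.3f}, accuracy={best_score:.2f}")
```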

Other NLP tasks, whose precision is currently below the threshold required for widespread adoption, will also become mainstream.

Our Steps towards the Future

Presently MeaningCloud is delivering on its Deep Semantic Analytics vision using an advanced Semantic Rule approach:

  • It leverages the deep morphosyntactic and semantic analysis of the text performed by MeaningCloud’s core engines.
  • Building on that analysis, it applies advanced rules that combine the extracted semantic information with powerful operators.
  • The results include advanced pattern detection, fine-grained passage-level categorization, extraction of semantic relationships, etc. (see the sketch after this list).
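
MeaningCloud’s rule language itself is proprietary, but the following spaCy Matcher sketch (spaCy 3.x, invented pattern and sentence) conveys the general idea of rules that combine lexical and morphosyntactic conditions over an analysed text:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# A rule combining lemma and part-of-speech conditions: a form of
# "increase"/"raise" followed by optional modifiers and a price-like noun.
matcher.add("PRICE_INCREASE", [[
    {"LEMMA": {"IN": ["increase", "raise"]}},
    {"POS": {"IN": ["DET", "PRON", "ADJ", "NOUN", "NUM"]}, "OP": "*"},
    {"LEMMA": {"IN": ["price", "fee", "rate"]}},
]])

doc = nlp("The bank raised its mortgage rates again last month.")
for _, start, end in matcher(doc):
    print("matched passage:", doc[start:end].text)
```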

Into The Future

In addition, we are doing research on model generation and improvement:

  • Automatic training of top-performing machine learning classifiers using tagged data
  • Automatic generation of rule-based models for categorization or extraction using training data
  • Automatic generation of suggestions for model improvement based on training data and QA metrics
  • Automatic retraining of models using user feedback (sketched below)
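
A highly simplified sketch of the last point might look like this (toy data and workflow; a production system would retrain offline in batches and validate before deployment):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Initial model trained on the original tagged data (toy examples).
texts = ["package never arrived", "charged twice for one order",
         "love the new interface", "app keeps crashing"]
labels = ["logistics", "billing", "praise", "bug"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def retrain_with_feedback(text: str, corrected_label: str) -> None:
    """Fold a user correction back into the training set and rebuild the model."""
    texts.append(text)
    labels.append(corrected_label)
    model.fit(texts, labels)  # full refit; a real system would batch and validate this

# A reviewer corrects a wrong prediction and the model is rebuilt with it.
retrain_with_feedback("the payment was taken twice", "billing")
```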

We are using, among others:

Moreover, we are doing research on semantic representations that turn unstructured information into structured information:

  • Generation of a document’s semantic graph (e.g. in RDF) from unstructured text
  • Exploitation of the semantic graph (e.g. translating natural language queries into SPARQL; see the sketch after this list)
  • Discovery of relationships among documents:
    • Trending topic detection, for discovering the topics emerging from a collection of documents
    • Document clustering, for grouping similar documents
  • Text generation
    • Summary, for automatically generating a meaningful summary for a document
    • Automatic descriptions, for creating a title for a document (auto-title)
    • Chatbot conversations, for generating the responses of a conversational bot
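
To make the first two bullets concrete, here is a minimal sketch with rdflib: a few triples that could be extracted from a sentence are stored in an RDF graph and then queried with SPARQL (the namespace, schema and sentence are invented for illustration):

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")  # illustrative namespace, not a real schema
g = Graph()

# Triples that might be extracted from "ACME acquired Globex in 2019".
g.add((EX.ACME, RDF.type, EX.Company))
g.add((EX.Globex, RDF.type, EX.Company))
g.add((EX.ACME, EX.acquired, EX.Globex))
g.add((EX.ACME, EX.acquisitionYear, Literal(2019)))

# A natural-language question such as "Which companies did ACME acquire?"
# could be translated into a SPARQL query over that graph.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?target WHERE { ex:ACME ex:acquired ?target . }
""")
for row in results:
    print(row.target)  # -> http://example.org/Globex
```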

Using, among others:

  • Insight extraction models: our technology for extracting deep, composite insights from unstructured text
  • Entity disambiguation
  • Text understanding techniques
  • Machine/Deep learning

About Julio Villena

Technology enthusiast. Head of Innovation at @MeaningCloud: natural language processing, semantics, voice of the customer, text analytics, intelligent robotic process automation. Researcher and lecturer at @UC3M, in love with teaching and knowledge sharing.
