Category Archives: Research

Posts related to research

Machine Learning for NLP/Text Analytics, beyond Machine Learning

In the field of text analytics, aside from the development of categorization models, the application of machine learning (and more specifically, deep learning) has proved to be very helpful for supporting our teams in the process of building/improving rule-based models.

This post analyzes some of the applications of machine/deep learning for NLP tasks, beyond machine/deep learning itself, that are used to approach different scenarios in projects for our customers.

Continue reading


Accuracy measures in Sentiment Analysis: the Precision of MeaningCloud’s Technology

Accuracy Measures of Commercial Sentiment Analysis APIs

Our clients frequently ask, “what’s the precision of MeaningCloud technology?” How does it compare with other commercial competitors and with state-of-the-art technology? And they demand precise numbers.

That’s not an easy question to answer, even though there are myriad research studies on this issue. For the sake of simplicity, let’s concentrate on the well-studied scenario of accuracy measures in Sentiment Analysis.

Continue reading


Performance Metrics for Text Categorization

One of the most common and extensively studied knowledge extraction tasks is text categorization. Customers frequently ask how we evaluate the quality of the output of our categorization models, especially in scenarios where each document may belong to several categories.

The idea is to keep track of changes in the continuous improvement cycle of our models and to know whether each change has made things better or worse, so that we can commit or reject it.

This post answers this question by describing the metrics that we commonly adopt for model quality assessment, depending on the categorization scenario that we are facing.
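As an illustration of the multi-label scenario mentioned above, here is a minimal sketch of micro- and macro-averaged precision, recall and F1, which are common choices for this kind of evaluation. This excerpt does not name the metrics used in the post, so the choice of metrics, the function names and the toy data below are all assumptions.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def multilabel_scores(gold, predicted):
    """gold, predicted: lists of category sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        for c in g & p:   # categories correctly assigned
            tp[c] += 1
        for c in p - g:   # categories assigned but not expected
            fp[c] += 1
        for c in g - p:   # categories expected but missed
            fn[c] += 1
    categories = sorted(set(tp) | set(fp) | set(fn))
    # Micro-average: pool the counts over all categories, then compute.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-average: compute per category, then take the unweighted mean.
    per_cat = [prf(tp[c], fp[c], fn[c]) for c in categories]
    if per_cat:
        macro = tuple(sum(m[i] for m in per_cat) / len(per_cat) for i in range(3))
    else:
        macro = (0.0, 0.0, 0.0)
    return {"micro": micro, "macro": macro}


# Toy data (hypothetical): three documents, three categories.
gold = [{"sports", "politics"}, {"economy"}, {"sports"}]
predicted = [{"sports"}, {"economy", "politics"}, {"sports"}]
scores = multilabel_scores(gold, predicted)
```

Micro-averaging weights every category assignment equally, so frequent categories dominate; macro-averaging weights every category equally, so a poorly handled rare category pulls the score down more visibly.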

 

Continue reading


NLP technologies: state of the art, trends and challenges

This post presents MeaningCloud’s view of the state of Natural Language Processing technology as of the end of 2019, based on our work with customers and research projects.

NLP technology has practically achieved human quality (or even better) in many different tasks. This is mainly due to advances in machine learning/deep learning techniques, which make it possible to use large sets of training data to build language models, but also to improvements in core text processing engines and the availability of semantic knowledge databases.

Continue reading


Case Study: Text Analytics against Fake News

Everybody has heard about fake news. Fake news is a neologism that can be formally defined as a type of yellow journalism or propaganda that consists of deliberate disinformation or hoaxes spread via traditional print and broadcast news media or online social media. It is also commonly used to refer to fabricated or junk news, with no basis in fact, but presented as being factually accurate.

The main reason for putting effort into creating fake news is to cause financial, political or reputational damage to people, companies or organizations, using sensationalist, dishonest, or outright fabricated headlines to increase readership and viral dissemination. In addition, clickbait stories, a special type of fake news, earn direct advertising revenue from this activity.

Continue reading


TASS 2018: Fostering Research on Semantic Analysis in Spanish

In 2018, MeaningCloud and the University of Jaen once again organized TASS, the Workshop on Semantic Analysis in Spanish at SEPLN (International Conference of the Spanish Society for Natural Language Processing).

Over the years, the research has extended to other tasks related to processing the semantics of texts, with the aim of further improving natural language understanding systems. Apart from sentiment analysis, other tasks attracting the interest of the research community are stance classification, negation handling, rumor identification, fake news identification, open information extraction, argumentation mining, classification of semantic relations, and question answering of non-factoid questions, to name a few.

TASS 2018 was the 7th event of the series and was held in conjunction with the 34th International Conference of the Spanish Society for Natural Language Processing, in Seville (Spain), on September 18th, 2018. Four research tasks were proposed. MeaningCloud sponsored this edition with prizes for the best systems in each of the tasks. A comprehensive description paper is (to be) published in the Procesamiento del Lenguaje Natural journal, vol. 62: TASS 2018: The Strength of Deep Learning in Language Understanding Tasks.

Continue reading


MeaningCloud sponsors the award for Author Profiling Research at PAN also in 2018

Author Profiling and Text Forensics Research

Since 2009, the PAN Lab has organized shared tasks on digital text forensics in general, and on author profiling in particular. PAN Lab is part of CLEF, the European Conference and Evaluation Forum around Information Retrieval. CLEF consists of an independent peer-reviewed conference on a broad range of issues in the field of multilingual and multimodal information access evaluation, and a set of labs and workshops designed to test different aspects of mono and cross-language information retrieval systems. CLEF 2018 will be hosted by the University of Avignon, France, on 10-14 September 2018.

MeaningCloud has been sponsoring the award for the best-performing team in the author profiling task at CLEF since 2015.

Author profiling is the task of inferring the traits of an author from a given document.
In 2017 the task focused on gender and language variety identification on Twitter, addressing four languages and several of their varieties: English (Australia, Canada, Great Britain, Ireland, New Zealand, United States), Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela), Portuguese (Brazil, Portugal), and Arabic (Egypt, Gulf, Levantine, Maghrebi).

Paolo Rosso presents the 2017 PAN Author Profiling Prize to the team from the University of Groningen

Twenty-two teams from all over the world participated in 2017, and the best results were obtained by Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim, from the University of Groningen, The Netherlands.

This year the task will go multimodal: in addition to the textual information in tweets, images from URLs will be used as information sources in order to infer gender. Three languages will be addressed: English, Spanish and Arabic [http://pan.webis.de/clef18/pan18-web/author-profiling.html].

Paolo Rosso
Universitat Politècnica de València, Spain
Co-organizer of the author profiling task at PAN

References

Rangel F., Rosso P., Potthast M., Stein B. (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato L., Ferro N., Goeuriot L., Mandl T. (Eds.) CLEF 2017 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org, vol. 1866. [http://ceur-ws.org/Vol-1866/invited_paper_11.pdf]

Potthast M., Rangel F., Tschuggnall M., Stamatatos E., Rosso P., Stein B. (2017). Overview of PAN’17: Author Identification, Author Profiling, and Author Obfuscation. In: 8th Int. Conf. of CLEF on Experimental IR Meets Multilinguality, Multimodality, and Visualization, CLEF 2017, Springer-Verlag, LNCS(10456), pp. 275–290. [http://www.uni-weimar.de/medien/webis/publications/papers/stein_2017k.pdf]


Applying text analytics to financial compliance

In one of our previous posts we talked about Financial Compliance, FinTech and their relation to Text Analytics. We also showed the need for normalized facts when mining text in search of suspects of financial crimes, and proposed the SVO (subject, verb, object) form to represent them.

Financial crime

We defined a clause as the string within a sentence that is capable of conveying an autonomous fact. Finally, we explained how to integrate with the Lemmatization, PoS and Parsing API in order to get a JSON-formatted tree for the input text, fully enriched with syntactic and semantic information, from which the SVO clauses can be extracted.

In this post, we continue with the extraction process, looking in detail at how to extract those clauses from the response returned by the Parsing API.
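To make the idea concrete, here is a minimal sketch of pulling SVO triples out of a parsed tree. The node layout used here (dictionaries with `type`, `role`, `text` and `children` keys) is a hypothetical simplification invented for illustration; it is not the response schema of the Parsing API, and `find_role` and `extract_svo` are likewise illustrative names.

```python
def find_role(node, role):
    """Return the text of the first direct child with the given role, or None."""
    for child in node.get("children", []):
        if child.get("role") == role:
            return child.get("text")
    return None


def extract_svo(node):
    """Walk the tree depth-first and collect one (subject, verb, object) per clause."""
    triples = []
    if node.get("type") == "clause":
        triple = tuple(find_role(node, r) for r in ("subject", "verb", "object"))
        # Keep only clauses that have at least a subject and a verb;
        # the object may be None (e.g. intransitive verbs).
        if triple[0] and triple[1]:
            triples.append(triple)
    for child in node.get("children", []):
        triples.extend(extract_svo(child))
    return triples


# Hypothetical, hand-written tree for "The company bribed an official."
tree = {
    "type": "sentence",
    "children": [
        {
            "type": "clause",
            "children": [
                {"type": "phrase", "role": "subject", "text": "The company"},
                {"type": "phrase", "role": "verb", "text": "bribed"},
                {"type": "phrase", "role": "object", "text": "an official"},
            ],
        }
    ],
}

triples = extract_svo(tree)
```

A real response would need its own mapping from the API's syntactic annotations to these three roles; the recursive walk over clauses is the part that carries over.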

Continue reading


How to build a Financial Compliance model ready for FinTech

What is Financial Compliance and what is FinTech?

Financial crime

Financial crime has become an increasing concern for governments throughout the world. The emergence of vast regulatory environments has raised the degree of compliance expected even from non-governmental organizations that conduct financial transactions with consumers, including credit card companies, banks, credit unions, payday loan companies, and mortgage companies.

Technology has helped financial services address the increased burden of compliance in innovative ways which have also yielded other benefits, including improved decision-making, better risk management, and an enhanced user experience for the consumer or investor.

The rapid development and employment of AI (Artificial Intelligence) techniques within this specific domain have the potential to transform the financial services industry.

FinTech (Financial Technology) solutions have recently arisen as the new applications, processes, products, or business models in the financial services industry, composed of one or more complementary financial services and provided as an end-to-end process via the Internet. You can find additional interesting information in this article.

Continue reading


MeaningCloud sponsors prize for Author Profiling Research

Author Profiling Research

MeaningCloud sponsors the prize for the best team at the 5th International Competition on Author Profiling Research, PAN@CLEF 2017. This competition is part of PAN (Plagiarism, Authorship and Social Software Misuse), a series of scientific events and shared tasks on digital text forensics. The 17th evaluation lab on digital text forensics will be held as part of the CLEF conference in Dublin, Ireland, on September 11-14, 2017.

Continue reading