Entity recognition: the engineering problem
As in every engineering endeavor, when you face the problem of automating the identification of entities (proper names: people, places, organizations, etc.) mentioned in a particular text, you should look for the right balance between quality (in terms of precision and recall) and cost, from the perspective of your goals. You may be tempted to compile a simple list of such entities and apply straightforward pattern-matching techniques to identify a predefined set of entities appearing “literally” in a particular piece of news, in a tweet or in a (transcribed) phone call. If this solution is enough for your purposes (you can achieve high precision at the cost of low recall), it is clear that quality was not among your priorities. However… what if you could add a bit of excellence to your solution, without technological burden and for free? If you are interested in this proposition, skip the following detailed technological discussion and go directly to the final section.
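To make the simple list-and-match approach concrete, here is a minimal sketch in Python. The gazetteer contents and the function name are made up for illustration. It shows why precision is high (only exact names match) and recall is low (any other way of naming the same entity is missed).

```python
import re

# Hypothetical gazetteer: a hand-compiled list of entity names.
GAZETTEER = ["Nelson Mandela", "Eiffel Tower", "Washington"]

# Longest names first, so a longer name wins over a shorter one it contains.
_NAMES = sorted(GAZETTEER, key=len, reverse=True)
_PATTERN = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in _NAMES) + r")\b")

def find_literal_entities(text):
    """Return the gazetteer names that appear literally in the text, in order."""
    return [m.group(0) for m in _PATTERN.finditer(text)]

print(find_literal_entities("Nelson Mandela visited the Eiffel Tower."))
# ['Nelson Mandela', 'Eiffel Tower']
print(find_literal_entities("Madiba visited the Tour Eiffel."))
# []  (same entities, different surface forms: nothing is found)
```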
Where do the difficulties come from?
Now, I will summarize some of the difficulties that may arise when designing an automatic system for “Named Entity Recognition” (NER, for short, in the technical literature). Difficulties may come from several fronts:
- Do you deal with texts in several languages? Do you know the language of each text in advance?
- What is the source of the documents or items of text that you have to manage? Do they come from a professional newsroom? Did you ingest them from OCR (Optical Character Recognition) or ASR (Automatic Speech Recognition) systems? Did you catch them with the API of your favorite social network?
- Do your texts follow strict academic conventions regarding spelling and typography (i.e. do you always deal with well-written text)? Did users generate them with limited and error-prone devices (smartphones)? Did second-language speakers or learners produce them?
Designing the perfect NER system: the language nightmare
The previous questions end up in a set of complex challenges:
1. Translingual equivalence:
Problem: When you deal with multilingual content, you are interested in recognizing not language-dependent names, but the underlying entities, which are designated differently in different languages.
Example: Eiffel Tower (EN), Tour Eiffel (FR) and Torre Eiffel (ES) refer to the very same object.
Solution: You need to use semantic processing to identify meanings, relative to a consistent, language-independent world model (e.g. using ontologies or referring to linked data sources).
2. Intralingual or intratext equivalence:
Problem: For a particular language, texts usually refer to the same entities in different flavors (to avoid repetition, due to style considerations or communication purposes).
Example: Nelson Mandela, Dr. Mandela (depending on the context) and Madiba are recognized by English speakers as the same entity.
Solution: Again, in the general case, you need to link multiword strings (tokens) to meanings (representing real world objects or concepts).
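As a minimal sketch of what linking strings to meanings can look like, the Python snippet below maps several surface forms, across and within languages, to one language-independent identifier. The alias table and the identifiers are made-up placeholders, not entries from any real ontology or linked data source. A real system would also need context to decide when "Dr. Mandela" really means Nelson Mandela.

```python
# Hypothetical alias table: surface form (lowercased) -> language-independent entity ID.
# The IDs are illustrative placeholders, not identifiers from a real knowledge base.
ALIASES = {
    "eiffel tower": "ent:eiffel_tower",
    "tour eiffel": "ent:eiffel_tower",
    "torre eiffel": "ent:eiffel_tower",
    "nelson mandela": "ent:nelson_mandela",
    "dr. mandela": "ent:nelson_mandela",
    "madiba": "ent:nelson_mandela",
}

def link_entity(mention):
    """Map a mention to its entity ID, or None if the surface form is unknown."""
    normalized = " ".join(mention.lower().split())  # lowercase, collapse whitespace
    return ALIASES.get(normalized)

print(link_entity("Tour Eiffel") == link_entity("Torre Eiffel"))  # True
print(link_entity("Madiba"))  # ent:nelson_mandela
```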
3. Transliteration ambiguity:
Problem: Names can be transliterated between different alphabets in more than one way.
Example: Gaddafi, Qaddafi, Qadhdhafi can refer to the same person.
Solution: It is always difficult to decide on a strategy for attaching a sense to an unknown word. Should you apply phonetic rules to find equivalents from Arabic or from Chinese? Put another way: is the unknown word just a typo, a cognitive mistake, a spelling variant or even an intended transformation? Only when context information is available can you rely on specific disambiguation strategies. For example, if you know or can deduce that you are dealing with a well-written piece of news about Libya, you should surely try to find alternative transliterations from Arabic. This problem is usually treated at the dictionary level, by incorporating the most widespread variants of foreign names.
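To illustrate the idea of phonetic rules, here is a toy sketch in Python that reduces a name to a crude consonant skeleton, so that the three variants in the example collapse to the same key. The rules are ad hoc assumptions chosen for this one example, not a real transliteration scheme. A key this coarse will also merge unrelated names, which is one reason curated variant dictionaries are preferred in practice.

```python
import re

def translit_key(name):
    """Toy consonant-skeleton key; every rule below is an ad hoc assumption."""
    s = name.lower()
    s = s.replace("q", "k").replace("g", "k")  # treat q, g and k as one sound
    s = s.replace("h", "")                     # drop h, so "dh" becomes "d"
    s = re.sub(r"(.)\1+", r"\1", s)            # collapse doubled letters
    s = re.sub(r"[aeiou]", "", s)              # drop vowels
    return s

for variant in ["Gaddafi", "Qaddafi", "Qadhdhafi"]:
    print(variant, "->", translit_key(variant))  # each prints the key "kdf"
```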
4. Homonym disambiguation:
Problem: Proper names usually have more than one bearer.
Example: Washington may refer to more or less well-known people (starting with George Washington), the state on the Pacific coast of the USA, the capital of the USA (Washington, D.C.) and quite a few other cities, institutions and installations in the same and other countries. It can even be a metonym for the federal government of the United States.
Solution: Semantic and contextual clues are needed for proper disambiguation. Are there any other references to the same name (maybe in a more complete form) elsewhere in the piece of text under scrutiny? Can semantic analysis tell us whether we are dealing with a person (performing human actions) or a place (where things happen)? Can we establish with confidence a geographical context for the text? This could also lead us to favor particular interpretations.
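A very rough sketch of contextual disambiguation, in Python: each candidate sense carries a few clue words, and the sense whose clues overlap most with the text wins. The senses and clue words are made up for illustration, and the tokenization is deliberately naive (whitespace only). Real systems rely on much richer semantic analysis.

```python
# Hypothetical candidate senses for an ambiguous name, each with made-up clue words.
CANDIDATES = {
    "Washington": {
        "person": {"president", "general", "born", "army"},
        "state": {"seattle", "pacific", "state", "olympia"},
        "capital": {"capitol", "congress", "senate", "potomac"},
    }
}

def disambiguate(name, text):
    """Pick the sense whose clue words overlap most with the text; None if no clue fires."""
    words = set(text.lower().split())
    scores = {sense: len(clues & words) for sense, clues in CANDIDATES[name].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("Washington", "Rain fell in Seattle and across Washington state"))
# state
print(disambiguate("Washington", "Washington crossed the river"))
# None  (no contextual clue available)
```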
5. Fuzzy recognition and disambiguation:
Problem: in the general case, how to deal with unknown words when you rely on (maybe huge) multilingual dictionaries plus (maybe smart) tokenizers and morphological analyzers?
Example: If you find the word “Genva” in an English text, should you interpret it as Geneva (in French, Genève) or as Genoa (in Italian, Genova)?
Solution: The presence of unknown words is linked most of the time to the source of the piece of text that you are analyzing. When the text has been typed with a keyboard, the writer may have failed to hit the right keys. When the text comes from a scanned image through OCR, the result can be erroneous depending on image resolution, font type and size, etc. Something similar occurs when you get a text through ASR. The strategy for interpreting the unknown word correctly (identifying the meaning intended by the author) implies using a distance metric between the unknown word and other words that you can recognize as correct. In our example, if the text has been typed with a QWERTY keyboard, the distance between Genva and Geneva involves a single deletion operation (one missing letter), while the distance between Genva and Genoa involves a single substitution between two keys (v and o) that are quite far apart. So, using distance metrics, Geneva should be preferred. But contextual information is equally important for disambiguation. If our text includes mentions of places in Switzerland, or Switzerland can be established as the right geographical context, then Geneva becomes more likely. Otherwise, if the text is about Mediterranean cruises, Genoa seems to be the natural choice.
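The distance reasoning for Genva can be sketched as a weighted edit distance in Python. Plain edit distance would score Geneva and Genoa the same (one operation each), so this sketch makes substitutions between distant keys more expensive than a missing letter. The weights (1.0 for an insertion or deletion, 0.5 for a substitution between neighboring keys, 1.5 otherwise) and the row/column approximation of the QWERTY layout are assumptions for illustration only.

```python
# QWERTY letter rows, used to approximate how far apart two keys are.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def substitution_cost(a, b):
    """0 for equal letters, 0.5 for neighboring keys, 1.5 otherwise (assumed weights)."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.5

def keyboard_distance(source, target):
    """Weighted edit distance from source to target (dynamic programming, row by row)."""
    source, target = source.lower(), target.lower()
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        curr = [float(i)]
        for j, b in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # drop a letter of source
                curr[j - 1] + 1.0,                      # add a letter missing from source
                prev[j - 1] + substitution_cost(a, b),  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

print(keyboard_distance("Genva", "Geneva"))  # 1.0
print(keyboard_distance("Genva", "Genoa"))   # 1.5
print(min(["Geneva", "Genoa"], key=lambda c: keyboard_distance("Genva", c)))  # Geneva
```

With these weights Geneva wins on distance alone. Contextual evidence, such as other Swiss or Mediterranean place names in the text, would still have to be combined with this score.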
Systems or platforms for Content Management (CMS), Customer Relationship Management (CRM), Business Intelligence (BI) or Market Surveillance incorporate information retrieval functionality that allows searching for individual tokens (typically alphanumeric strings) or literals in unstructured data. However, they are very limited when it comes to recognizing semantic elements (entities, concepts, relationships, topics, etc.). This kind of text analytics is very useful not only for indexing and search purposes, but also for content enrichment. The final aim of these processes is to add value in terms of higher visibility and findability (e.g. for SEO purposes), content linkage and recommendation (related content), ad placement (contextual advertising), customer experience analysis (Voice of the Customer, VoC analytics), social media analysis (reputation analysis), etc.
To facilitate the integration of semantic functionality in any software application, Daedalus opened its multilingual semantic APIs to the community through the cloud-based service Textalytics. On the client side, you send a request to our service in order to process one item of text (a piece of news, a tweet, etc.); what you get back is the result of our processing in an interchange format (XML or JSON). Textalytics APIs offer natural language processing functionality in two flavors:
- Core APIs: one API call for each single process (extraction of entities, text classification, spell checking, sentiment analysis, content moderation, etc.). Fine tuning is achieved through multiple parameters. Besides core natural language processing, audio transcription to text is also available, as well as auxiliary functions. Auxiliary APIs are useful, for example, to link entities with open linked data repositories, such as DBpedia/Wikipedia, or to guess crucial demographic features (type, gender, age) for a given social media user.
- Vertical APIs (Media Analysis, Semantic Publishing): one API call provides highly aggregated results (e.g. extraction of entities and topics, plus classification, plus sentiment analysis…), convenient for standard use in a vertical market (media industry, publishing industry…)
To end this post, let me stress other benefits of selecting Textalytics for semantic processing:
- SDKs (Java, Python, PHP and Visual Basic) are offered for quick integration. Software developers need no more than half an hour to read the documentation and integrate our semantic capabilities in any environment.
- You can register in Textalytics, subscribe to the API or APIs of your choice, get your personal key and send as many requests as you want for free, up to a maximum of 500,000 words processed per month, whether for research, academic or commercial use.
- If you need to process higher volumes of text (exceeding the free basic plan) or you need to launch more than five API calls per second, you can subscribe at affordable prices. No long-term commitment. Pay per month. Check out our pricing plans.
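If you stay on the free plan, a small client-side throttle helps you keep under the limit of five calls per second mentioned above. The following Python sketch is independent of any SDK. The class name and its parameters are made up for illustration, and the actual API call is left as a placeholder comment because its endpoint and parameters are not described here.

```python
import time

class Throttle:
    """Spaces successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()  # the first call is never delayed

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        remaining = self.next_allowed - self.clock()
        if remaining > 0:
            self.sleep(remaining)
        self.next_allowed = self.clock() + self.min_interval

throttle = Throttle(max_per_second=5)
for text in ["first item of text", "second item of text", "third item of text"]:
    throttle.wait()
    # Placeholder: send `text` to the API of your choice here.
    print("would send:", text)
```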
José C. González (@jc_gonzalez)