Case Study: Text Analytics against Fake News

Everybody has heard about fake news. Fake news is a neologism that can be formally defined as a type of yellow journalism or propaganda consisting of deliberate disinformation or hoaxes spread via traditional print and broadcast news media or online social media. The term is also commonly used to refer to fabricated or junk news that has no basis in fact but is presented as being factually accurate.

The main motivation for creating fake news is to cause financial, political or reputational damage to people, companies or organizations, using sensationalist, dishonest, or outright fabricated headlines to increase readership and dissemination among readers through viralization. In addition, clickbait stories, a special type of fake news, earn direct advertising revenue from this activity.

Easy access to online advertising revenue, increased political polarization and the popularity of social media have contributed to the relevance of fake news. In this information society, where readers have limited attention and are saturated with content choices, fake information often seems more appealing or engaging than real news. It is known that online fake news spreads much more quickly and widely than real news: by 2022, people in developed economies could be encountering more fake news than real information.

Some people reject this term and prefer to speak of “mis-information” (false information disseminated without harmful intent), “dis-information” (false information spread with harmful intent) or “mal-information” (genuine information shared to cause harm).

From the point of view of business and regulation, some organizations, such as The Trust Project or The Credibility Coalition, are working on developing transparency standards that help assess the quality and credibility of journalism. From the point of view of technology, MeaningCloud has also invested some time in trying to provide a (probably partial) solution for detecting fake news. Below we briefly describe how Text Analytics can approach this problem and present a Proof-of-Concept API developed for this specific task.

First step: Learn how Humans Detect Fake News

People may have a hard time figuring out what is true or false, but some hints for detecting fake news are:

  1. Be suspicious if there are misspellings, grammatical errors, weird page layouts and more ads than news.
  2. Take a careful look at images and videos: Are they poor-quality? Do they look authentic? Are they date-stamped and credited to someone who can be verified?
  3. Where possible, trace data points back to the creator to verify authenticity.
  4. Search for cited individuals and organizations to validate that (a) they are a respected expert in that field and (b) they are the true source of that quote.
  5. Check if news stories contain timelines that make no sense, ambiguous references, or event dates that have been altered.
  6. Look at the methodology applied for data that’s been gathered from surveys.
  7. Check whether the source is known for parody, and whether the story’s details and tone suggest it may be just for fun.

Infographic: “How To Spot Fake News”, published by the International Federation of Library Associations and Institutions [source: IFLA]

Second Step: AI for Detecting Fake News

For decades, AI has been successful in fighting spam email, using natural language processing to analyze the text of messages and then machine learning algorithms to determine how likely it is that a particular message is a real communication from an actual person or a mass-distributed spam message.

Detecting fake news can be approached in a similar way. AI systems can evaluate the reliability (or falsehood) of a post’s text or headline, comparing different features with real (non-fake) news stories. Text processing can be used to analyze the author’s writing style and to determine whether a headline agrees with the article body (stance classification). Another method could examine similar articles on the Internet to check whether other news media mention different facts.
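As a hedged illustration of this general approach (not MeaningCloud’s engine), the sketch below trains a tiny bag-of-words classifier to estimate how likely a text is fake, in the same spirit as a spam filter. The training examples, labels and model choice are purely illustrative.

```python
# A minimal sketch (not MeaningCloud's engine): the same bag-of-words +
# classifier recipe used for spam filtering, applied to fake-news detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled corpus: 1 = fake, 0 = real.
texts = [
    "SHOCKING! Doctors HATE this one weird trick that cures everything",
    "You won't believe what this celebrity said about the election",
    "The central bank raised interest rates by 0.25 points on Thursday",
    "Researchers published a peer-reviewed study on flu vaccine efficacy",
]
labels = [1, 1, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),  # word + bigram features
    LogisticRegression(),                                  # probabilistic classifier
)
model.fit(texts, labels)

# predict_proba returns [P(real), P(fake)] for each input text.
print(model.predict_proba(["Miracle cure the government doesn't want you to know"])[0][1])
```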

Other clues for detecting fake news can be found beyond the article content itself: for instance, comparing the ratio of reactions versus shares on social media, or considering the credibility of the source.

Many research studies and papers have described sets of indicators that seem to be useful for detecting fake news, focusing on different aspects of the categorization problem. Those indicators can be divided into internal and external indicators.

1. Internal indicators

Internal indicators are those which can be determined by analyzing the title and text of the article without considering outside sources or metadata.

Some indicators commonly mentioned in papers are listed below (a small heuristic sketch of a few of them follows the list):

  • Title Representativeness: Article titles can be misleading about the content.
  • Clickbait Title: degree of “sensationalism”, score from 0.0 to 1.0.
  • Quotes from external sources: highlight where sources were quoted in the article (usually reveals a level of journalistic rigor).
  • Citations of organizations and studies: highlight where any scientific studies or any organizations were cited and whether the article was primarily about a single study.
  • Calibration of confidence of authors: highlight sections of an article where authors acknowledge their level of uncertainty (e.g. hedging, tentative, assertive language).
  • Logical Fallacies: poor but tempting arguments. There are several subtypes:
    • straw man fallacy: misrepresenting a counterargument as a weaker, more obviously wrong version so it is easier to refute.
    • false dilemma fallacy: treating an issue as binary when it is not.
    • slippery slope fallacy: assuming one small change will lead to a major change.
    • appeal to fear fallacy: exaggerating the dangers of a situation.
    • naturalistic fallacy: assuming that what is natural must be good.
  • Tone: readers can be misled by the emotional tone of articles: exaggerated claims or emotionally charged sections, especially for expressions of contempt, outrage, spite, or disgust.
  • Inference Consistency: wrongly equating correlation with causation, or generalizing a single fact into an incorrect conclusion (for instance: because the criminal was Latino, Latinos are to blame for all crimes).
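As announced above, the following sketch approximates a few of these internal indicators with crude keyword heuristics. The word lists, features and thresholds are illustrative assumptions, not the definitions used in the cited research.

```python
import re

# Illustrative word lists; a real system would use curated lexicons and NLP.
HEDGES = {"might", "may", "could", "reportedly", "allegedly", "suggests", "appears"}
CHARGED = {"outrage", "disgusting", "shocking", "traitor", "disaster", "evil"}

def internal_indicators(title: str, body: str) -> dict:
    words = re.findall(r"[a-z']+", body.lower())
    n = max(len(words), 1)
    return {
        # Quotes from external sources: count of quoted spans in the body.
        "quoted_spans": len(re.findall(r'"[^"]+"', body)),
        # Calibration of confidence: share of hedging / tentative words.
        "hedging_ratio": sum(w in HEDGES for w in words) / n,
        # Tone: share of emotionally charged words.
        "charged_ratio": sum(w in CHARGED for w in words) / n,
        # Clickbait-ish title cues: exclamation marks and ALL-CAPS words.
        "title_exclamations": title.count("!"),
        "title_caps_words": sum(w.isupper() and len(w) > 2 for w in title.split()),
    }

print(internal_indicators(
    "SHOCKING report!",
    'The senator "denied all charges". Experts suggest the claim may be false.',
))
```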

2. External indicators

External indicators (or context indicators) are those which require looking outside of the article text and researching external sources or examining the metadata surrounding the article, such as advertising and layout.

Some indicators are (a rough sketch for a couple of them follows the list):

  • Originality: whether the article was an original piece of writing or duplicated elsewhere. Duplication may be legitimate (e.g. licensing agreements with a wire service such as Reuters), or the article may simply have been stolen or reworded without attribution.
  • Fact-checked information: whether the central claim of the article was fact-checked by an approved fact-checking organization, such as Poynter’s International Fact-Checking Network (IFCN), emergent.info, snopes.com and politifact.com. Many of them use schema.org’s ClaimReview schema to mark up their results.
  • Representative Citations: how accurately the description in the article represents the original content cited.
  • Reputation of Citations: credibility of a cited source.
  • Number of Ads: number of ads as well as recommended content (sponsored or not) from services such as Taboola or Outbrain.
  • Number of Social Calls: number of calls to share on social media, email, or join a mailing list.
  • “Spamminess” of ads: ads with disturbing or titillating imagery, celebrities, or clickbait titles.
  • Placement of Ads and/or Social Calls: aggressiveness of the placement of ads e.g. in pop-up windows, covering up article content, or distracting through additional animation and audio.
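The sketch below approximates a couple of these external indicators by inspecting a page’s HTML: it counts recommendation widgets loaded from services such as Taboola or Outbrain, ad iframes and social calls. The selectors are simplistic assumptions for illustration, not a production detector.

```python
from bs4 import BeautifulSoup

def external_indicators(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    scripts = [s.get("src", "") for s in soup.find_all("script")]
    links = [a.get("href", "") for a in soup.find_all("a")]
    return {
        # Recommended-content widgets from services such as Taboola or Outbrain.
        "recommendation_widgets": sum(
            any(net in src for net in ("taboola", "outbrain")) for src in scripts
        ),
        # Crude proxy for the number of ads: embedded iframes.
        "ad_frames": len(soup.find_all("iframe")),
        # Social calls: share links to the usual platforms or mailing lists.
        "social_calls": sum(
            any(p in href for p in ("facebook.com/sharer", "twitter.com/intent", "mailto:"))
            for href in links
        ),
    }

print(external_indicators('<a href="https://twitter.com/intent/tweet?url=x">Share</a>'))
```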

Third Step: Our Fake News Analytics API

Our contribution to this problem is a REST API, currently working for English and Spanish, which, given a news article (title and content), analyzes it considering different features and returns a falsehood score ranging from 0.0 (the article seems to be real news) to 1.0 (the article is considered definitely fake news).
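As a hedged illustration, the request below sketches how such an API might be called from Python. The endpoint URL, parameter names and response field are hypothetical placeholders, not the actual Proof-of-Concept interface.

```python
import requests

# Hypothetical endpoint and parameter names -- the post does not publish them;
# adjust to the actual Proof-of-Concept API documentation.
API_URL = "https://api.example.com/fakenews-1.0/analyze"

payload = {
    "key": "YOUR_LICENSE_KEY",          # placeholder credential
    "lang": "en",                        # English or Spanish ("es")
    "title": "23 Places You Won't Believe Are In England",
    "content": "Full text of the article goes here...",
}

response = requests.post(API_URL, data=payload)
result = response.json()
print(result.get("falsehood_score"))     # expected to range from 0.0 (real) to 1.0 (fake)
```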

The detection engine considers the following internal indicators:

  • Title Representativeness: the title and the content of the article are compared and a similarity score is assigned. Texts are analyzed using natural language processing (tokenization, lemmatization, stopword removal and synonym normalization), represented as multidimensional vectors, and compared using a slightly modified version of the cosine distance, which returns values between 0.0 (completely different) and 1.0 (full match); a simplified sketch of this comparison is shown after this list.
  • Clickbait Classifier: two machine learning classifiers (one for each language) based on n-grams and logistic regression have been trained using a corpus of news titles tagged as clickbait or non-clickbait. For instance, “23 Places You Won’t Believe Are In England” would be tagged as probably clickbait (score between 0.5 and 0.9) and “US government stops Haiti evacuations” would be non-clickbait (score lower than 0.1).
  • Polarization and Stance: a sentiment analysis is performed using our sentiment analysis engine to detect highly polarized terms (positive or negative) and/or signals of conflicting polarities, and a score ranging from 0.0 (neutral, non-polarized) to 1.0 (polarized) is returned.
  • Citation Analysis: a list of identified quoted sources is returned. We use the classification of information sources from Melvin Mencher’s News Reporting and Writing (2010):
    • “On the record”: attributed to people (“Trump said…”) or organizations (“according to Microsoft…”).
    • “On background”: attributed in general terms (not direct quotes), such as mentions of roles (“the family spokesman declared…” or “personnel of the government…”).
    • “On deep background”: general mentions (“in the EU plans…”).
    • “Off the record”: mentions to hidden sources that cannot be quoted.

    Our Part-of-Speech and Parsing engine is used to analyze the text. Communicative verbs (such as declare, explain, assure) are used as anchors to detect quotes, which are extracted from the appropriate verb complements, usually the subject and the direct object, for instance: “<someone> said that <something>”. Finally, a score is generated with a heuristic formula that considers the number of citations of each type (from those described above) and the total length of the text, calibrated on the analysis of a corpus of real news.
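As a rough illustration of the Title Representativeness comparison, the sketch below uses only plain tokenization, stopword removal and standard cosine similarity; it omits the lemmatization, synonym normalization and the modified distance used by the engine, and its stopword list and example texts are illustrative.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a full one per language.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are", "for", "at"}

def bag_of_words(text: str) -> Counter:
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0  # 0.0 = completely different, 1.0 = full match

title = "Government announces new tax cuts for small businesses"
body = "The government announced on Monday a package of tax cuts aimed at small businesses."
print(cosine_similarity(bag_of_words(title), bag_of_words(body)))
```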

A final score (a real number) is calculated with a combination of these previous scores, indicating the estimated degree of falsehood.
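The exact combination formula is not published; as a hedged sketch, one plausible approach is a clamped weighted sum of the per-indicator scores, with purely illustrative weights.

```python
# Illustrative weights only -- not the engine's actual combination formula.
WEIGHTS = {
    "title_representativeness": -0.3,  # high title/body similarity lowers falsehood
    "clickbait": 0.3,
    "polarization": 0.2,
    "citation": -0.2,                  # well-sourced articles lower falsehood
}

def falsehood_score(scores: dict) -> float:
    raw = 0.5 + sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    return min(max(raw, 0.0), 1.0)     # clamp to the 0.0 .. 1.0 range

print(falsehood_score({"title_representativeness": 0.9, "clickbait": 0.8,
                       "polarization": 0.7, "citation": 0.1}))
```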

This API is in beta status and can be integrated as a validation tool in a media website or deployed as a browser plugin for online checking while browsing the Internet. Using our Text Analytics platform, it could quite easily be extended to other languages.

MeaningCloud has strong expertise in Natural Language Processing and Text Analytics, built over more than 20 years. Our team can tackle projects in any complex scenario with the maximum guarantees of success. If you have any question or need in these areas, please do not hesitate to contact us at support@meaningcloud.com: we will be happy to help you!


About Julio Villena

Technology enthusiast. Head of Innovation at @MeaningCloud: natural language processing, semantics, voice of the customer, text analytics, intelligent robotic process automation. Researcher and lecturer at @UC3M, in love with teaching and knowledge sharing.
