Dirty Social Media Data Is Misguiding Brands Tracking Consumer Behaviour [REPORT]

Brands today inevitably have to make social media data part of their marketing strategy. Every brand is aware that social listening is key to its success and profitability. Hence, brands are always listening to social media and processing the huge amount of data they get from their social channels to analyze sentiment, traction, loyalty and the many other factors that have come to define brands. While listening to social channels is key, pre-processing the big data that social media feeds into those channels is even more critical. Not all of this data is useful or valid, and if used without filtering it can pollute the sentiment analysis and lead to drastically misguided branding decisions. Let us take a look at the “dirt” that pollutes social data and at methods for cleansing it.

So where does the dirt come from? According to a recent analysis of social media data by Networked Insights, nearly 10% of the social media posts that brands analyze to understand their consumers’ behavior do not actually come from real consumers. They come from non-consumers, which include social bots, celebrities, brand handles and inactive accounts. Spam is a particularly major concern on forums, where up to 28% of all posts come from non-consumers.

[Figure: percentage of each non-consumer type]

Bots are scripts or programs that post on social media as if they were people, but a closer look at their posting frequency and at their repetitive, link-dominated message content reveals what they are. Celebrities are sometimes brand ambassadors who get paid to talk positively about brands on social media. Their accounts have massive followings and significant influence, but because they are paid to post, their posts cannot be counted as valid consumer data. Similarly, brand handles that belong to the company post for the brand, and competitors post against it. These posts are also considered spam.
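The bot signals described above (posting frequency, repeated messages, link-heavy content) can be turned into a rough filter. The following is a minimal sketch, not the method used in the report: the `Post` record, the `looks_like_bot` function and every threshold in it are hypothetical and would need tuning against real data.

```python
import re
from collections import Counter
from dataclasses import dataclass

URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Post:
    author: str
    text: str
    hour: int  # hypothetical field: index of the hour in which the post was made


def looks_like_bot(posts, max_posts_per_hour=20,
                   min_link_share=0.8, min_repeat_share=0.5):
    """Flag an account that posts very often AND mostly posts links or repeats.

    All thresholds are illustrative assumptions, not values from the report.
    """
    if not posts:
        return False

    # Signal 1: posting frequency (busiest single hour).
    peak_rate = max(Counter(p.hour for p in posts).values())

    # Signal 2: share of posts that contain at least one link.
    link_share = sum(1 for p in posts if URL_PATTERN.search(p.text)) / len(posts)

    # Signal 3: share of posts repeating the same message once links are removed.
    normalized = [URL_PATTERN.sub("", p.text).strip().lower() for p in posts]
    repeat_share = Counter(normalized).most_common(1)[0][1] / len(posts)

    return peak_rate > max_posts_per_hour and (
        link_share >= min_link_share or repeat_share >= min_repeat_share
    )


if __name__ == "__main__":
    bot_posts = [Post("promo_bot", f"Great deal http://example.com/{i}", 0)
                 for i in range(30)]
    human_posts = [Post("jane", "Loved the new phone", 1),
                   Post("jane", "Battery could be better", 5)]
    print(looks_like_bot(bot_posts), looks_like_bot(human_posts))
```

Requiring high frequency together with one of the content signals keeps busy human accounts from being flagged on volume alone. Celebrity and brand-handle accounts would need a separate check, for example against a maintained list of known handles.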

Social spam is a huge and considerably complicated problem when listening in on brand conversations. Social media spamming grew by 658% in the last year, and some brands have reported that more than 90% of their recorded social media posts can be classified as spam. This is a very high percentage, given the sheer frequency and size of conversations on social media. Brands today employ sophisticated methods and tools to analyze social media, discover consumer insights and turn them into actionable marketing and branding decisions. But if social data contains a large amount of spam, the brands’ analyses will be neither accurate nor actionable.

According to a recent New York Times article, cleaning data now takes up 50% to 80% of a data scientist’s time. Complex tools built on Artificial Intelligence and Natural Language Processing are at the forefront of the technologies brands employ to clean data, and machine learning algorithms are used to identify spam. Networked Insights’ models with NLP capability can identify social spam with an accuracy greater than 80% and can process millions of data points quickly.
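To make the idea of a machine learning spam filter concrete, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is a textbook technique chosen for illustration, not Networked Insights’ model; the class name, the `spam`/`ham` labels and the training sentences are all made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayesSpamFilter:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(token | label); both labels must be trained.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for token in tokens:
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def is_spam(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


if __name__ == "__main__":
    model = NaiveBayesSpamFilter()
    # Hypothetical training posts, invented for this example.
    model.train("win free coupon now", "spam")
    model.train("free giveaway click link", "spam")
    model.train("love my new phone", "ham")
    model.train("the battery life is great", "ham")
    print(model.is_spam("free coupon giveaway"))
    print(model.is_spam("my phone battery is great"))
```

A production filter would be trained on far more labeled posts and would usually add features beyond single words, such as account behavior and link patterns.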

[Figure: percentage of total spam on social media]

We also need to remember that social spam includes posts, reviews or blog comments containing:

  1. Coupons – coupons, product listings, contests and giveaways
  2. Adult Content – adult or pornographic content
  3. General Spam – posts which contain gibberish or nonsense

Shopping, Finance and Technology have been identified as the top categories for spam, which makes up 10% to 13% of all conversations in them, while Sports, Science and Religion each contain less than 1% spam. Although the overall spam percentage is less than 10% across social media platforms, conversations about some brands are dominated by non-consumer data, and these brands have to employ more complicated methods to filter out spam.

The conclusion is that spam and non-consumer-generated posts are problems brands cannot ignore. Ignoring them will skew the data and give erroneous results in sentiment analysis. Important brand-to-brand comparisons can become unreliable because the amount of spam differs from brand to brand, so the right combination of Machine Learning, Natural Language Processing and Networked Learning algorithms needs to be employed to clean the dirt out of the data. Sometimes data granularity and actionable trends improve greatly after cleaning. If you are a brand listening on social media, get a laundromat with the right algorithms before analyzing data.
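A small worked example shows how differing amounts of spam can distort a brand-to-brand comparison. The numbers below are invented purely for illustration and do not come from the report: two hypothetical brands look tied on positive sentiment until the spam posts are removed.

```python
def positive_share(posts):
    """Share of posts labeled positive. Each post is (sentiment, is_spam)."""
    if not posts:
        return 0.0
    return sum(1 for sentiment, _ in posts if sentiment == "positive") / len(posts)


def drop_spam(posts):
    """Keep only posts that were not flagged as spam."""
    return [post for post in posts if not post[1]]


# Hypothetical data: brand A has no spam, brand B has 20 positive spam posts
# (for example coupon or giveaway posts) mixed in with real consumer posts.
brand_a = [("positive", False)] * 60 + [("negative", False)] * 40
brand_b = ([("positive", False)] * 40
           + [("negative", False)] * 40
           + [("positive", True)] * 20)

if __name__ == "__main__":
    for name, posts in (("A", brand_a), ("B", brand_b)):
        raw = positive_share(posts)
        cleaned = positive_share(drop_spam(posts))
        print(f"Brand {name}: raw {raw:.2f}, cleaned {cleaned:.2f}")
    # Brand A: raw 0.60, cleaned 0.60
    # Brand B: raw 0.60, cleaned 0.50
```

Before cleaning, both brands show 60% positive posts. After cleaning, brand B drops to 50%, so the apparent tie was produced by spam rather than by consumers.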
