Why buzz monitoring tools have different results

Posted on August 29th, 2010 in Theory & Research

When people use social media monitoring tools such as Brandwatch, Radian6 or our own Smesh technology, they expect the different tools to produce very similar or identical results for any given term being monitored, and are a bit surprised when they don’t. They’re all looking at the same content, aren’t they? The stuff that’s out there on the web?

Perhaps surprisingly, though hopefully less so once you’ve read this, the answer is ‘no’: they’re not all looking at the same content, and they will frequently produce differing results.

There are many different aspects of how tools work that can lead to big differences in results. This posting tries to explain a few of the major ones in layman’s terms. Perhaps a web-obsessed layman, rather than your regular kind.

A bit of understanding about what goes on under the hood of monitoring tools will also help you to ask wholesomely difficult questions next time someone tries to sell you social media monitoring services.

Very broadly, the two key factors, which we’ll look at in more detail shortly, are:

  1. what data gets fed into the system in the first place?
  2. how do searches get run against the data in the system?

What data gets fed into the system in the first place?

Big search engines like Google, Yahoo and Bing devote, to adopt some non-industry-standard terminology, honking great dollops of infrastructure to finding (‘crawling’) and storing for search (‘indexing’) all of the publicly visible pages on the web, along with RSS feeds and some other types of data. These are ‘full web indexes’.

Most social media companies simply don’t have the large-scale infrastructure needed to run their own full web indexes, so they use other techniques instead, which has a huge impact on what data is available in their tools in the first place. A typical social media monitoring solution will have its own small-scale index (typically millions, as opposed to billions, of items) and will try to ensure that this contains as much relevant data for its users as possible.

Here are some of the common approaches to getting hold of relevant web data to put into a small-scale index without having to be Google- or Yahoo-sized. Because each technique surfaces different data, this variation in sourcing contributes massively to the variation in search results that end users of social media monitoring tools see.

Use someone else’s full web index

Yahoo has a full web index going back years that you can run queries against on a commercial basis; there aren’t many competing services, and it’s not clear exactly where the future of this particular service lies now that Yahoo are passing responsibility for search results to Microsoft. Note that Google generally doesn’t let commercial service providers use its index; mostly it just wants individual humans looking at its results, so that it can serve them ads.

Using someone else’s index, you’re stuck with their techniques for deciding just how relevant given pieces of content are; if their results suck, yours probably will too. You can get round this to some extent by first slurping results into your own small-scale search system; but still, if the incoming data is bad, you’ll struggle to provide quality data going out.
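
For the technically curious, ‘slurping’ someone else’s index might look roughly like the sketch below. This is a minimal illustration in Python: the endpoint, parameters and response fields are hypothetical stand-ins, not any real provider’s API.

    import requests  # widely-used third-party HTTP library

    # Hypothetical commercial search API -- the URL, parameters and
    # response fields are illustrative, not any real service's API.
    SEARCH_API = "https://search-provider.example.com/v1/search"
    API_KEY = "your-api-key"

    def slurp_results(query, count=50):
        """Fetch results from someone else's index and keep a local copy."""
        response = requests.get(
            SEARCH_API,
            params={"q": query, "count": count, "key": API_KEY},
            timeout=10,
        )
        response.raise_for_status()
        local_store = []
        for hit in response.json().get("results", []):
            # Keep whatever the provider returns; we can re-rank these
            # later using our own notion of relevance.
            local_store.append({
                "url": hit.get("url"),
                "title": hit.get("title"),
                "snippet": hit.get("snippet"),
            })
        return local_store

Even with a local copy like this, the point above stands: if the provider’s results are poor, your own re-ranking can only do so much.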

Piggyback on someone else’s search results

This is a variant of the above, a bit like the meta-search engines of days of yore. See what results other search engines supply, and either use these as ‘your’ results (legally dubious) or, less dubiously, use them as an indicator of which web sites to grab content from yourself, without having to crawl the whole web.

Buy a 3rd party data feed

Various providers (e.g. Moreover, Spinn3r) offer data feeds that tool vendors can consume in part or in full. These feeds draw on a wide range of sources and often add extra layers of data about the data (‘meta-data’), such as categorisation information for sites. Of course, a ‘wide range of sources’ isn’t ‘the entire web’, so this risks missing relevant content, and the various services typically have their own strengths and weaknesses when it comes to things like filtering out spam and adding or discovering new sources.
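
To make that concrete, consuming such a feed might look something like the sketch below. The feed URL and field names are invented for illustration; real providers each have their own formats.

    import requests

    # Hypothetical feed endpoint and field names, for illustration only.
    FEED_URL = "https://feed-provider.example.com/v1/items"

    def fetch_feed_items(since_id=None):
        """Pull a batch of items from a third-party data feed."""
        params = {"since_id": since_id} if since_id else {}
        items = requests.get(FEED_URL, params=params, timeout=10).json()["items"]
        kept = []
        for item in items:
            # The provider's meta-data lets us filter before indexing:
            # here we drop anything flagged as likely spam and keep the
            # site categorisation for later use.
            if item.get("spam_score", 0) > 0.8:
                continue
            kept.append({
                "url": item["url"],
                "text": item["content"],
                "site_category": item.get("site_category"),
                "published": item.get("published"),
            })
        return kept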

Constrained crawling

Find (‘crawl’) and index everything for yourself, but heavily constrain this process so that you don’t fall down the rabbit hole of trying to crawl the entire web. This approach can be used effectively in tandem with ‘piggybacking’ (above) — you might sneak a peek at search engine results for a query, then grab data from the web pages that are mentioned, plus all of the pages that those pages have links to, but no more than that.
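
A toy version of that ‘seed pages plus one hop of links’ idea is sketched below, where the seed URLs would come from the piggybacked search results. It deliberately omits everything a real crawler needs (politeness delays, robots.txt handling, deduplication beyond exact URLs, and so on).

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    import requests

    class LinkExtractor(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def constrained_crawl(seed_urls, max_depth=1):
        """Fetch seed pages plus the pages they link to, up to max_depth hops."""
        pages = {}
        frontier = [(url, 0) for url in seed_urls]
        while frontier:
            url, depth = frontier.pop()
            if url in pages:
                continue  # already fetched
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue  # skip pages we can't fetch
            pages[url] = html
            if depth < max_depth:
                extractor = LinkExtractor()
                extractor.feed(html)
                # Resolve relative links and queue them one hop deeper.
                frontier.extend(
                    (urljoin(url, link), depth + 1) for link in extractor.links
                )
        return pages

The max_depth parameter is the whole trick: it’s what stops the crawl disappearing down that rabbit hole.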

Managed crawling services

There are some new services (e.g. 80legs) that offer hosted crawling — instead of trying to manage your own servers for crawling and indexing, you pay someone else to do it for you. You just define what sources you want to look at, and then get results back. This still involves constraining what sources you’re looking at though, unless a) you have a very large budget and b) the provider can actually offer full web indexing.
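
The ‘define what sources you want’ step usually amounts to handing the service a job specification along these lines. The structure below is purely illustrative (it is not 80legs’ actual API); the point is simply that you declare sources and limits, and the provider does the fetching.

    # A purely hypothetical job specification for a managed crawling
    # service; real providers (80legs included) define their own formats.
    crawl_job = {
        "seed_urls": [
            "http://example-blog-network.com/",
            "http://example-forum.com/",
        ],
        "max_pages": 100000,   # budget cap: how many pages to fetch
        "max_depth": 2,        # how many link hops from the seeds
        "url_filter": r".*\.(html|php)$",  # only fetch matching URLs
        "deliver_to": "https://your-service.example.com/crawl-results",
    }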

It’s not hard to see that if different tools use very different approaches to getting hold of data, then they’re not going to be looking at the same content, and they won’t be able to show the same results to their users, even for identical queries.

How do searches get run against the data in the system?

All of the above epic nerdery gets your social media monitoring system a pile of data that you can search, analyse and generally rummage around in to find information to show to your users.

Variations in how different tools set about this rummaging around are another major source of variation in their results. Even if all tools used exactly the same approach to getting hold of content in the first place (the processes described above), they’d still end up showing different results, through using different technologies and techniques for deciding which items to subsequently show to users for their queries.

Typically, any given monitoring tool will be using some kind of search index, which is a special kind of database for handling search-engine-style queries. Differences between the various indexing tools used produce differences in results. Even if all of the tool vendors used the same method for getting hold of data and exactly the same indexing technology, you’d *still* see a lot of variation in results, because there are loads of ways you can ‘tweak’ the results you get out of a search index.

For example, one vendor might attach more significance to how recent query results are; another might decide that the search term appearing in a web page’s title is massively significant, whereas the first decides it is less so. Although these variations sound small, they can have a big impact on the quality of the results presented back to users; there is lots of scope for variation even within a single search technology.
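
To make that concrete, here is a toy scoring function of the kind a vendor might tune. The field weights and the recency decay are invented numbers; the point is just how much room there is to ‘tweak’.

    import time

    # Invented knobs -- exactly the sort of thing vendors tune differently.
    TITLE_BOOST = 3.0                  # matches in the title count triple
    RECENCY_HALF_LIFE = 7 * 24 * 3600  # a document's score halves every week

    def score(document, term, now=None):
        """Toy relevance score combining term matches and recency."""
        now = now or time.time()
        term = term.lower()
        body_hits = document["body"].lower().count(term)
        title_hits = document["title"].lower().count(term)
        match_score = body_hits + TITLE_BOOST * title_hits
        # Exponential decay: older documents score progressively lower.
        age_seconds = now - document["published_ts"]
        recency_weight = 0.5 ** (age_seconds / RECENCY_HALF_LIFE)
        return match_score * recency_weight

Nudge TITLE_BOOST or the half-life and the very same index returns a noticeably different ordering for the very same query.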

In fact, different vendors probably use different indexing systems, have different approaches to filtering out spam, keep data for varying periods of time, and so on.

To return to our original question: different tools can be given the same query, but despite the wider web being largely the same for each, the content that they’re actually using to find results for you, and the tools and techniques that they’re using to do so, differ wildly.

To resort to a terribly contrived analogy, imagine sitting two different artists down in front of the same sunflower and asking them to paint a picture of it: despite being given the same instructions and facing the same object, they are likely to produce wildly different results (especially if one of them is a depressive genius prone to hacking off parts of his own body). You might object that something like social media buzz monitoring must surely be scientific. However, the mere involvement of computers doesn’t make something scientific, and there is so much leeway for variation and creativity in how such systems are actually built that a large degree of artistry is involved.

Difficult questions

As for those ‘wholesomely difficult questions’ to ask next time someone tries to sell you social media monitoring services, draw on the above and see if you can get a detailed account of their data:

  • Where exactly does their data come from?
  • How often does it get updated?
  • Do they include Twitter and Facebook? If so, how much of them?
  • When you add a query, how do they decide what results to show you? Do you have any control over this?

If you’re told that they index the entire internet in real-time (which at time of writing even the mighty Google don’t do), then you can probably draw your own conclusions about their sales practices. Similarly if they haven’t a clue how to answer.