
Analyzing the news with RAW

Jeremy Posner

Apparently there are some 50 billion web pages. They might be pages from your own website, from a customer, competitor, or partner website, or even news feeds. That gives us plenty of data sources to choose from.

But how can we query this wealth of data? Many websites publish RSS feeds, a well-known format for presenting site updates in a machine-readable form. RSS is an XML-based format, and despite the reports of the death of XML, there’s still plenty of XML in the news and publishing spaces. If we can query these feeds, we can then analyze the news, live.

An RSS feed often doesn't carry the actual article content, and each page holds more metadata than the feed does. So we can use RSS as a nice index, but we then need to follow each link down to the page itself to process more data.
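To make the "RSS as an index" idea concrete, here is a minimal Python sketch (not RAW code) that parses a feed and lists its items newest first. The sample feed, the `latest_items` helper and the `example.com` links are all made up for illustration; a real feed would be downloaded from the publisher's RSS URL.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A tiny, invented RSS 2.0 document standing in for a real news feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Older story</title>
      <link>https://example.com/older</link>
      <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Newer story</title>
      <link>https://example.com/newer</link>
      <pubDate>Tue, 02 Jan 2024 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def latest_items(rss_text, limit=10):
    """Return up to `limit` feed items as dicts, newest first."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 style, which this parser handles.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    items.sort(key=lambda entry: entry["published"], reverse=True)
    return items[:limit]


for entry in latest_items(SAMPLE_RSS):
    print(entry["published"].isoformat(), entry["link"])
```

Each `link` is what we would then follow to reach the page and its richer metadata.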

Here's the plan to query this data:

  1. Query and order the XML, and extract metadata from each page;
  2. Pass the results to a text analysis API that returns structured, semantic data (entity extraction);
  3. Aggregate the results for presentation.
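The three steps above can be sketched in plain Python as follows. This is only an illustration of the pattern, not how RAW implements it: `analyze`, `fake_metadata`, `fake_entities` and the `FAKE_PAGES` data are hypothetical stand-ins, and the two fake functions replace the real metadata and entity-extraction services so that the example runs offline.

```python
from collections import Counter


def analyze(items, fetch_metadata, extract_entities, top_n=5):
    """Count how often each (entity name, entity type) pair appears."""
    counts = Counter()
    for item in items:
        # Step 1: look up page-level metadata for the feed item.
        meta = fetch_metadata(item["link"])
        text = meta.get("description") or item.get("title", "")
        # Step 2: turn free text into structured entities.
        for entity in extract_entities(text):
            counts[(entity["name"], entity["type"])] += 1
    # Step 3: aggregate for presentation.
    return counts.most_common(top_n)


# Invented page descriptions, keyed by URL.
FAKE_PAGES = {
    "https://example.com/a": "Paris hosts summit",
    "https://example.com/b": "Storm hits Paris",
}


def fake_metadata(url):
    """Stand-in for a page metadata API."""
    return {"description": FAKE_PAGES[url]}


def fake_entities(text):
    """Toy extractor: treats every capitalized word as an entity."""
    return [{"name": word, "type": "OTHER"}
            for word in text.split() if word[0].isupper()]


feed_items = [{"link": url, "title": ""} for url in FAKE_PAGES]
print(analyze(feed_items, fake_metadata, fake_entities))
```

Passing the two services in as functions keeps the aggregation logic independent of whichever APIs end up doing the real work.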

This is a fairly standard pattern. Here we will use the OpenGraph.io API to extract page metadata and Google’s Language Entity Analysis API for entity extraction, but there are plenty of alternatives out there depending on what you want to do.
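For a rough idea of what calls to these two services look like, the Python sketch below only builds the requests and sends nothing. The endpoint shapes reflect my reading of the providers' public documentation and should be checked against the current docs before use; the helper names and the `YOUR_APP_ID` placeholder are hypothetical.

```python
import json
from urllib.parse import quote

# Assumed endpoints; verify against each provider's current documentation.
OPENGRAPH_BASE = "https://opengraph.io/api/1.1/site"
GOOGLE_ENTITIES_URL = "https://language.googleapis.com/v1/documents:analyzeEntities"


def opengraph_url(page_url, app_id):
    """Build a metadata lookup URL; the target page URL must be fully encoded."""
    return f"{OPENGRAPH_BASE}/{quote(page_url, safe='')}?app_id={app_id}"


def entity_request_body(text):
    """Build the JSON body for an entity analysis request on plain text."""
    return json.dumps({
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    })


print(opengraph_url("https://example.com/a", "YOUR_APP_ID"))
print(entity_request_body("Storm hits Paris"))
```

Both services need credentials (an app ID for one, an API key or OAuth token for the other), which are deliberately left out here.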

RAW lets you achieve this in just a couple of lines! Want to learn how? Check our related demo data product here for a fully working example that queries the news live from CNN.