As a white hat marketer in the 21st century, you need to be proficient at using online resources to get crucial information from websites. Collected in bulk, this information gives you the arsenal you need to market and optimize your products efficiently.

If you’ve found this article, chances are that you’re familiar with the concept of web scraping. Perhaps you’ve even used software like ExifData to extract metadata (things like the page title and description) from various websites in bulk.

Now, grabbing the metadata is certainly a good start, but that only gives you the title, description, and a few other details about your competition. When the right tools are used, there is a lot more depth to what you can do with web-scraping and data blending. In a nutshell, it can be used to extract and catalog any data, anywhere on the page of the website.

So, let’s get started on what kind of data you would want to capture. As a rule of thumb, we try to scrape information under these general categories:

– Finding Content Evangelists

– Prospective data from expert opinions, studies, etc

– Trending competitor content on Reddit to build good relationships with your community

– Data to analyze blog performance

– Obtaining guest pitches and avoiding idle websites.

First off, what is web-scraping? What are the tools you do it with?

So here’s an example. One of your competitors owns a blog that’s doing really well. You notice that while your content is similar, their titles are much more attractive to viewers. This nets them a lot of revenue.

One way to analyze, understand and replicate their “success formula” would be to head on over to the HTML and copy-paste their “title” metadata over to, let’s say, an Excel sheet. But this would be tedious and, more importantly, unnecessary, because there are applications to do it for you.
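To give a feel for what such an application does under the hood, here is a minimal sketch in Python that pulls the title and meta description out of a page's HTML. It uses only the standard library. The class name, function name, and sample page are our own, made up for illustration, and a real tool would also need to download the page first.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that sits between <title> and </title> is kept.
        if self._in_title:
            self.title += data


def extract_title_and_description(html):
    parser = TitleMetaParser()
    parser.feed(html)
    return parser.title.strip(), parser.description.strip()


# Hypothetical page source standing in for a competitor's post.
SAMPLE_PAGE = """
<html><head>
<title>10 Headline Formulas That Win Clicks</title>
<meta name="description" content="A sample description.">
</head><body><p>Post body</p></body></html>
"""

title, description = extract_title_and_description(SAMPLE_PAGE)
print(title)
print(description)
```

Run over a list of competitor pages, the same function would fill one spreadsheet row per page instead of one copy-paste per page.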

Now, There Are Two “Techniques” to Scrape Data

1- Using a path-oriented algorithm in a system such as XPath or CSS Selectors

2- Using a search algorithm (i.e. Regex)

However, it’s important to note that path-based systems generally have the edge over search systems (like Regex) when working with HTML. Regex matches text patterns without any knowledge of the page structure, so it returns every matching instance within a document. On the other hand, something like XPath follows the structure of the page and is capable of finding specific data within the divs, metadata, an ordered or unordered list (ul/li), etc.

For example, the following command:

XPath: //ul[@class='books']/li

..would list all the books, that is, every “li” item inside the “ul” element whose class is “books”. However, it wouldn’t extract data from any other ul on the page, which keeps the results relatively clean and simple to use.
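The difference between the two techniques is easy to see side by side. The sketch below runs the XPath expression above and a naive Regex over the same made-up markup. It assumes Python's built-in ElementTree module, which supports only a limited subset of XPath and needs well-formed markup; real-world pages usually call for a dedicated HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup with two lists; only one has class="books".
SAMPLE = """
<html><body>
  <ul class="books">
    <li>Dune</li>
    <li>Neuromancer</li>
  </ul>
  <ul class="films">
    <li>Alien</li>
  </ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Path-based: ElementTree needs a leading "." in front of "//".
books = [li.text for li in root.findall(".//ul[@class='books']/li")]

# Search-based: this naive pattern matches every <li>, whichever list it is in.
all_items = re.findall(r"<li>(.*?)</li>", SAMPLE)

print(books)      # only the items from the "books" list
print(all_items)  # items from both lists
```

The Regex could be tightened to target the right list, but the pattern gets fragile quickly, while the XPath expression stays a single readable line.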

If you’re looking to familiarize yourself with XPath/CSS Selectors/Regex, there are several useful read-ups to get you started.

We’ll be using XPath (see An XPath Tutorial) to scrape all the information we listed as worth scraping earlier in this article.

1. Finding Content Evangelists

A simple way to find readers who are quick to comment on posts is to look at one of your previous articles on a similar theme that drew a lot of comments. You can then use XPath to scrape the names and websites of each of the commenters and reach out to them personally. If your scraping tool offers a “Scrape Similar” option, simply right-click a name, select it, and a list will come up with your commenters and their websites.
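If you would rather script this step, the same idea looks roughly like this in Python. The comment markup and the “comment-author” class name are assumptions for illustration; every blog platform names these differently, so inspect your own page first.

```python
import xml.etree.ElementTree as ET

# Hypothetical comment markup; class names vary by blog platform.
COMMENTS = """
<ol class="comment-list">
  <li><a class="comment-author" href="https://example.com">Ada</a><p>Great post!</p></li>
  <li><a class="comment-author" href="https://example.org">Grace</a><p>Very useful.</p></li>
  <li><span class="comment-author">Anonymous</span><p>Thanks.</p></li>
</ol>
"""

root = ET.fromstring(COMMENTS)

# Keep only authors who linked a website, since those are the ones we can reach.
commenters = [
    (a.text, a.get("href"))
    for a in root.findall(".//li/a[@class='comment-author']")
]

for name, website in commenters:
    print(name, website)
```

The anonymous commenter is skipped because the path only matches author names wrapped in a link.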

2. Removing Junk

Idle blogs, or those that haven’t been updated in a while, are unlikely to respond to guest pitches. Generally, one can easily avoid contacting these blogs by looking at the date of the latest post. However, over a large batch of blogs, that’s a lot of time spent. A simple way to automate this process is to look for “pubDate” (the publication date field in a blog’s RSS feed) and scrape it to a separate Excel sheet. This will allow you to prepare a concrete list of inactive blogs.
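As a rough sketch of that check, the Python below reads the “pubDate” values from an RSS feed and flags the blog as idle when its newest post is too old. The 180-day cutoff, the function names, and the two sample feeds are our own assumptions; pick whatever threshold suits your outreach.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def latest_post_date(rss_xml):
    """Return the newest <pubDate> among a feed's items, or None if there are none."""
    root = ET.fromstring(rss_xml)
    dates = [
        parsedate_to_datetime(el.text.strip())
        for el in root.findall("./channel/item/pubDate")
        if el.text
    ]
    return max(dates) if dates else None


def is_idle(rss_xml, now, max_age_days=180):
    """Treat a blog as idle if it has no dated posts or its newest one is too old."""
    latest = latest_post_date(rss_xml)
    return latest is None or now - latest > timedelta(days=max_age_days)


# Hypothetical feeds; a real script would download each blog's RSS feed.
ACTIVE_FEED = """<rss version="2.0"><channel>
<item><title>New post</title><pubDate>Mon, 20 May 2024 09:00:00 +0000</pubDate></item>
</channel></rss>"""

IDLE_FEED = """<rss version="2.0"><channel>
<item><title>Old post</title><pubDate>Mon, 06 Jan 2020 10:00:00 +0000</pubDate></item>
</channel></rss>"""

# A fixed "today" keeps the example repeatable; use datetime.now(timezone.utc) in practice.
NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)

print(is_idle(ACTIVE_FEED, NOW))
print(is_idle(IDLE_FEED, NOW))
```

Looping this over your prospect list and writing the idle ones to a sheet gives you the “do not pitch” list in one pass.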

3. Analyzing Blog Performance by Subcategories

Our last tip for today will be how to interpret the performance of your blogs automatically. Rather than guessing what kind of content aligns with your readers, web-scraping gives a lot of cold hard data to work with. The process is as simple as one-two-three.

1- Use any site explorer to look for the “top content” feed.

2- Scrape all the subcategories within the top content feed.

3- Export this information into an Excel sheet for analysis.
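Steps 2 and 3 above can be sketched in a few lines of Python. We assume here that the subcategory is the part of the URL right after “/blog/”, which will not hold for every site, and the URLs themselves are made up. The output is CSV text, which Excel opens directly.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URLs, as a site explorer's "top content" report might list them.
TOP_CONTENT = [
    "https://example.com/blog/seo/title-tags",
    "https://example.com/blog/seo/link-building",
    "https://example.com/blog/email/subject-lines",
    "https://example.com/blog/seo/site-speed",
]


def subcategory(url):
    """Assume the subcategory is the path segment right after /blog/."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[1] if len(parts) > 2 and parts[0] == "blog" else "uncategorized"


# Step 2: tally how many top posts fall into each subcategory.
counts = Counter(subcategory(url) for url in TOP_CONTENT)

# Step 3: export as CSV (written to memory here; swap in open("top.csv", "w", newline="")).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["subcategory", "top_posts"])
writer.writerows(counts.most_common())

print(buffer.getvalue())
```

A subcategory that keeps showing up at the top of this tally is a good hint about what your readers want more of.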


As they say, time is money. And automation sure saves you a hell of a lot of money. We hope the tips and information we’ve provided give you a good feel for how to improve your revenue with hard statistics.