Recently at Shareaholic we’ve been working on a lot of products that require showing a preview of some piece of content, usually a blog post. This preview usually consists of a title and, most importantly, a thumbnail image. A primary example is the Recommendations product, which shows thumbnails of related posts at the bottom of a blog post. Finding the correct thumbnail image is no trivial task, and I’ve spent a good percentage of my working hours over the past few months developing the library that performs this task for Shareaholic’s various products. I’ve found that there is not a lot of information on this topic floating around, and that what one might assume to be the state of the art (i.e. Facebook’s thumbnail algorithm) is not all that sophisticated compared to what you can roll yourself. So with my employer’s express permission, I would like to share with you what I’ve learned in the hopes of saving you some time if you too are attempting to build your own thumbnail scraper.
*There is an article I would very much like to give credit to for getting me started building the image scraper, but I can’t find it. If I do I will post it here.
Finding the right image
In this first part of the series I’m going to cover how to choose the right image on a given page to use as the thumbnail. Later part(s) will cover what technologies to use, the workflow, and the implementation. The problem I’m going to tackle here is: given a URL pointing to a piece of content (presumably a blog post), how do we select the best image from the page’s HTML to use as a thumbnail? Modern web pages do not share a common structure and are full of images in the form of ads, banners, and icons scattered throughout the page. It’s not like there’s a standard HTML tag called <featured_image> on every site that will tell you exactly which one to use…
Open Graph and other meta tags
Actually, the above is not entirely true. Facebook and Twitter have developed standard meta tags for exactly this purpose, and most popular blogs that have any desire to be shared on social media sites have these tags at the top of every page. The following is an example from the Shareaholic Blog:
<meta property='og:image' content='http://blog.shareaholic.com/wp-content/uploads/2012/09/DSC_0056-300x199.jpg' />
An alternate (and less common) version that includes the namespace directly, from a W3 spec example:
<meta property="http://ogp.me/ns#image" content="http://example.com/alice/bob-ugly.jpg" />
The above are part of the Open Graph Protocol and allow bloggers and other site owners to decorate their pages with standardized social metadata to make it easier for social media sites to index their content. This is a win-win for everyone as it is a direct expression of the content creator’s preference, rather than an algorithm’s best guess, giving them a degree of control over how their content is presented. On the other end, it makes thumbnail scraping extremely easy, pointing us to the correct image that is usually already the correct size and dimensions (as in, small and square). I should take this opportunity to mention that Shareaholic for WordPress automatically inserts og:image tags for you based on your Featured Image for a given blog post.
Slightly less popular are Twitter Cards, which follow a similar format to the Open Graph tags (as in, twitter:image). It should be noted, if not already obvious, that you do not have to choose between these two; you should in fact be searching for both of them within a page’s HTML. Finding the correct image is going to involve trying a composition of multiple methods in order of preference, with these two protocols likely being first. I should also mention that both of these protocols contain a variety of other useful metadata that may also be present on the page, such as the page title, site name, and description. Of course, you should also look for the small but growing number of sites using the shareaholic:image meta tags!
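To make this concrete, here is a minimal sketch of how you might check for these tags with Nokogiri. The fallback order and selectors are my own assumptions; note that some sites declare Twitter Card tags with property= rather than name= (or vice versa), so a robust scraper should check both attributes.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://example.com/some-post'))

# Check the social meta tags in order of preference.
tag = doc.at_css("meta[property='og:image']") ||
      doc.at_css("meta[name='twitter:image']") ||
      doc.at_css("meta[name='shareaholic:image']")
meta_image = tag && tag['content'] # nil if none of the tags are present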
There are some pitfalls to Open Graph tags, however, so be careful not to just grab the image they point to and assume you are done for a given page. Depending on how/where you are displaying your thumbnails, these images may not be of sufficient size or quality for your purposes. Remember that Open Graph/Twitter Cards tags are specifically designed with Facebook/Twitter sharing in mind, respectively. The thumbnails Facebook shows in your newsfeed are pretty small, and many sites format their Open Graph images accordingly. This might not be sufficient for applications requiring larger thumbnails or full-sized images, and these larger versions can usually be found on the page itself.
Sometimes Open Graph images are not even the intended image at all, but rather an icon or some other irrelevant image. Certain plugins/web frameworks will simply insert an Open Graph tag with the site’s favicon or worse, their own product’s icon without the site owner noticing. In these cases, and in the case above when an image is too small, it is often best to move on to another image selection method (see below).
This brings me to a hard-learned lesson I discovered during my work on this library: websites/site owners are not your allies in this battle. You would think they would make sure their preferred image for each post is clear by providing Open Graph tags, or having the largest image be the most relevant, or having sensibly named <div>s, or not having spaces in their image URLs, because it is in their best interest. You are MISTAKEN. In fact it is best to assume that you are competing against them, that they are doing everything in their power to hide the best image from you or trick you.
Finding the largest image
If the page in question doesn’t have any Open Graph tags, you’re going to have to do it the hard way: enumerate the image tags on the page and use some heuristic to select the one most likely to be relevant, or perhaps none at all. This is an important point to note, as many pages do not have an accompanying image. Showing a default thumbnail in your application is not ideal, but showing the wrong one is worse. It is important to have a minimum threshold and not just take the best image from a list of images that were clearly never meant to be thumbnails.
So the question now becomes: of all the images on a web page, which one did the author intend to be the representative image, or in the absence of that, which is the “most important” image that best represents the page, if any? As humans it is pretty easy for us to determine this, and if you think about it, our intuition boils down to finding the largest image on the page that is closest to the top of the article/post in question. This is conceptually pretty simple; the problem is that the page’s HTML obscures where images are actually located visually and how large they actually are. Determining this is the difficult part.
Finding the actual size
Unfortunately, finding the size of an image is not as easy as just looking at the width and height attributes, because most of the time they are not present. You’re probably accustomed to your browser telling you exactly how large each image is, but remember that your browser has already downloaded every image on the page and run the HTML through a renderer to get the final output, presumably something you do not want to have to do for obvious performance reasons. Sometimes the width and height will be specified via CSS; however, this is also applied by the browser during rendering and, unlike the width and height attributes, is not as easy for most HTML parsers to extract. Plus, like width and height, CSS dimensions are usually not present.
As it turns out, you pretty much do have to download images off the page to determine their size, but the performance hit does not have to be as bad as this implies (though it should be obvious by now that thumbnail scraping should not be done at request time, but rather in a background job or as part of an asynchronous call). There is a technique that involves starting to download an image and then cancelling it as soon as you have enough information to determine the image’s size (usually from the header). Usually this only requires you to download a small fraction of the image. I’m not going to get into the technical details in this article, but there are libraries that do this, notably FastImage for Ruby.
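As a quick sketch of what this looks like in practice (the URL is a placeholder):

require 'fastimage'

# FastImage requests the image but stops reading as soon as the
# dimensions can be determined from the file's header.
size = FastImage.size('http://example.com/images/photo.jpg')
# => [width, height], or nil if the image couldn't be fetched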
As with the Open Graph tags, however, you should take a composition approach. Attempt to determine the image’s size via the width/height attribute first, and if you can’t, use the partial image downloading technique. This will save on performance by allowing you to skip downloading some images, but it also avoids a potential gotcha I discovered when working with a particularly troublesome site: when we say we want the largest image, we mean the largest as displayed on the web page, not the largest source image. Some web pages may contain images that are 1000×1000 pixels, but are only displayed on the page (via width/height attributes) at 100×100 pixels. If you use only the image downloading approach, this image will be given undue importance simply because the source file is large. Most websites, in an effort to save bandwidth, will scale the source file down for these situations but again, websites are not your friend.
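Here is a rough sketch of that composition, assuming img_node is an element from an HTML parser such as Nokogiri (the helper name is my own):

require 'fastimage'

# Prefer the displayed size from the width/height attributes; only fall
# back to partial downloading when they are absent. (A production version
# would also need to handle percentage values and the like.)
def image_size(img_node)
  width  = img_node['width'].to_i
  height = img_node['height'].to_i
  return [width, height] if width > 0 && height > 0
  src = img_node['src']
  src && FastImage.size(src) # may return nil
end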
Banners, sprites, and icons
There are several classes of image that may technically be the largest but are clearly not suitable for thumbnails. Banners at the top of the page containing the site’s logo and title are the primary example. Visually we discount them because they tend to be short and wide, but if you’re calculating image size by multiplying width by height, the aspect ratio is discarded. The solution is just to translate this visual intuition into an algorithm. Before calculating the area of an image, divide its width by its height (and its height by its width) and make sure the larger of the two ratios is below some threshold. I’ve been using 3 so far, but it may need tweaking. Images with a width-to-height ratio greater than 3 can be classified as banners and discarded.
Another type of image that tends to be large but unsuitable is the sprite. Sprites are large images that contain a collection of discrete smaller images, which are displayed on the page by referencing specific coordinate offsets within the large image. In this way, a page can reduce load time by sending all the images it requires as one large sprite rather than as a bunch of individual downloads. When viewed as one image, however, sprites look pretty ridiculous and make poor thumbnails, so you’ll want to filter them out of your candidates. Sprites usually contain only collections of small images like icons, rather than large featured images, so they can be safely ignored. A simple string filter for the word “sprite” in the image file name should be sufficient to identify the vast majority of them.
Lastly, as I mentioned before, some pages do not have a suitable thumbnail image, and it is better to display none at all in your application than one that is completely wrong. Therefore it is important to establish a minimum image size (area) that you will accept to avoid pulling in things like icons or 1×1 advertising pixels. I would advise skipping anything less than 5000 pixels in area, even images gleaned from Open Graph tags.
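Putting these three filters together, a candidate check might look something like the sketch below (the 3:1 ratio and 5000-pixel floor are the thresholds from above; the helper name is my own):

MAX_ASPECT_RATIO = 3.0  # wider or taller than 3:1 is probably a banner
MIN_AREA         = 5000 # anything smaller is probably an icon or pixel

def suitable_thumbnail?(url, width, height)
  return false if width.zero? || height.zero?
  return false if url =~ /sprite/i          # likely a sprite sheet
  return false if width * height < MIN_AREA # icons, 1x1 advertising pixels
  ratio = [width.to_f / height, height.to_f / width].max
  ratio <= MAX_ASPECT_RATIO                 # discard banners and skyscrapers
end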
Content zones, sidebars, and comments
Sprites and banners are fairly easy to identify, but what about the other unsuitable images that litter your average webpage, such as advertisements, related-content thumbnails, user avatars in the comments section, and images in the header/footer like author photos? Fortunately, the big one on this list, advertisements, is usually Flash-based or inside an iframe and thus won’t be picked up by a parser looking for <img> tags. But more generally, our minds are able to categorize advertisements and these other images as irrelevant because they reside outside of the main content zone, in headers, footers, or sidebars.
As I discussed earlier, though, web pages don’t have a standard layout (yet; HTML5 is making some headway here with standard tags denoting sections of a page). Determining which zone a given image is in programmatically is difficult, but it turns out you can make pretty good guesses by searching for <div> tags with IDs/classes containing certain keywords, and then either excluding images contained within them (i.e. images that have them as ancestors) or giving preference to them, depending on the nature of the zone in question.
As the primary example, many CMSs, including WordPress, encapsulate the blog post content inside a <div> with an ID containing the word “content.” In fact, if you are using a WebKit-based browser and look at the source for this article, you’ll find a <div> element with a “content” ID; if you select it, you’ll notice that it encompasses the article text but excludes the header, footer, and comments section (though it may include the sidebar; again, this is an inexact science). Restricting your search for image tags to within this div, if it exists, will increase the chance of finding the right image while also allowing you to skip downloading most of the other images on the page, a big performance gain. Of course, if you don’t find a suitable image within these content divs, or if they don’t exist, you should consider expanding your search to the rest of the page; not all sites follow this pseudo-convention.
I don’t want to cover the specific implementation in this article, but I do think it is worth briefly going over how to find these content divs, as it can be a bit tricky (this also applies to finding sections you want to exclude, which is described below). If you are using an HTML parser that can search via XPaths, such as Nokogiri, you can use the query below to find img tags in a document that have, as their ancestors, tags whose IDs contain the word “content”:
//img[ancestor::*[contains(@id, 'content')]]
I make no claims about this being the best XPath query for this task, as I am not an expert on XML parsing, but it seems to do the trick. You may also want to consider other keywords in your query, such as “main”, or some of the new HTML5 semantic tags, such as <article>. You can add additional clauses to your query using “and” and “or” like so:
//img[ancestor::*[contains(@id, 'content') or contains(@id, 'main')]]
Similarly, we want to exclude images in zones we know are not going to contain suitable images or worse, seemingly suitable but incorrect images. We can use the same technique and specify in our query that we don’t want img tags whose ancestors have certain keywords in their ID, such as “sidebar”, “header”, “footer”, and “comment”. Below is an example of such a query combined with the query above:
//img[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and ancestor::*[contains(@id, 'content')]]
Notice the “not” clause. One thing to keep in mind is that while you may want to expand your search beyond content divs if you can’t find any or they contain no suitable images, you probably never want to expand your search into these known “bad” zones. If excluding these zones turns up no images, it is best to just assume there aren’t any on the page and move on. Another thing to be aware of is that on many webpages these bad zones will be contained within content zones, and vice versa; for example, the content div may include not only the blog post but also the sidebar. The above query avoids this by combining the clauses together, but if you are implementing your own custom query, it is something you need to be careful of. I have seen some pretty ridiculous layouts; as always, websites are not your friend.
The above query is not exhaustive, and which zones to exclude and how to identify them is entirely subjective. You may want to consider the HTML5 semantic tags such as <header>, <footer>, and the like; zones related to “nav” (for navigation) might also be good candidates. Most commenting engines use some form of the word “comment,” but it may be worth checking out the most popular ones to make sure; you definitely do not want to use some commenter’s Facebook photo as a thumbnail. Trickier than comments are the various related-content plugins, whose thumbnails are going to be tempting targets for your algorithm if you are not careful, resulting in mismatched and repeating thumbnails. Again, checking out the most popular plugins to see what they name their zones might be worth the effort.
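Tying the zone queries together, a sketch of the candidate search in Nokogiri might look like this (the fallback behavior follows the discussion above; the helper name is my own):

require 'nokogiri'

BAD_ZONES = "not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') " \
            "or contains(@id, 'footer') or contains(@id, 'header')])"

def candidate_images(doc)
  in_content = doc.xpath("//img[#{BAD_ZONES} and ancestor::*[contains(@id, 'content')]]")
  return in_content unless in_content.empty?
  # Not every site follows the content-div convention, so widen the
  # search to the whole page -- but never into the known bad zones.
  doc.xpath("//img[#{BAD_ZONES}]")
end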
Combining the above techniques, you should be able to develop a pretty effective algorithm for finding the largest image on a page (that is not a banner or sprite) from a list of candidates gleaned from particular zones. You could leave it at that, but if you really want to find the best thumbnail you’ll need to also account for video.
Video thumbnails
Many blog posts these days have a featured video instead of an image, and often that is all they will have, meaning your thumbnail scraper is going to come up empty-handed if it is just searching for the biggest picture on the page. Believe it or not, though, it’s actually quite easy to extract a nice thumbnail from an embedded video, provided that it’s from one of the major hosting sites like YouTube or Vimeo (which is the case the vast majority of the time). The key is a technology called oEmbed. Sites that implement it provide an API endpoint where you can pass in the URL of a piece of content (in this case, a video) and it will send back some metadata describing the video in a fashion similar to Open Graph, including a thumbnail image.
The first step in this process is to find the embedded video in the HTML and extract its ID. Again, I don’t want to dive too deep into the implementation, so I will just use YouTube as an example, as it is by far the most common case. The API endpoints and ID formats for the other major sites are fairly easy to find and follow a similar pattern.
There are two common ways to embed video on a page. The preferred way these days is to use an iframe, so you’ll want to enumerate the iframes on a given page and look for ones that contain “youtube” in the src attribute. Remember that videos fall victim to the same problems as images with regard to zones, so you should apply the same filtering from the previous section. Below is an example XPath:
.//iframe[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and contains(@src, 'youtube')]/@src
You’re not likely to get more than one result from this query, and if you do, it is probably best to just take the first one unless you have some other clever heuristics you’d like to apply. Once you have the iframe’s src attribute, you’ll need to extract the ID of the video. Here is where the science gets a little inexact, but the following should work the majority of the time. Most embedded YouTube video URLs take one of the following forms:
http://youtu.be/VIDEO_ID
http://www.youtube.com/embed/VIDEO_ID?fs=1&feature=oembed
The most straightforward way to extract the ID is to split on the “/” character and take the last token, though in the latter case you will also want to remove the query params (which are not always present). One gotcha is that YouTube URLs are case-sensitive (except for the domain name), so be careful not to downcase the URL.
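A sketch of that extraction, which handles both URL forms above (the helper name is my own):

# Split on "/" to get the last path segment, then strip any query params.
# Note: no downcasing -- YouTube video IDs are case-sensitive.
def youtube_video_id(src)
  src.split('/').last.split('?').first
end

youtube_video_id('http://www.youtube.com/embed/VIDEO_ID?fs=1&feature=oembed')
# => "VIDEO_ID"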
Once you have the video ID, you’ll need to query YouTube’s oEmbed API endpoint to get the metadata for that video, including the thumbnail. You can use a library such as ruby-oembed to do this for you, or you can make the request yourself to the endpoint below:
http://www.youtube.com/oembed?url=ENCODED_VIDEO_URL&format=json
where ENCODED_VIDEO_URL is equal to (url-encoded):
http://www.youtube.com/watch?v=VIDEO_ID
You might be asking yourself, “Why do I have to extract the video ID from one URL and insert it into another? Can’t I just pass the iframe’s src attribute directly?” The answer is “You would think so”, but sadly that only works for some URLs and not others (specifically, it does not seem to work for the second type of embed URL). The above format always seems to work though, so I find it safest just to convert all URLs to that format.
After calling the oEmbed API, you’ll get a blob of JSON with a variety of useful fields, but the one we are interested in is “thumbnail_url”, which will point you to a still image from the video that you can use as a thumbnail. Pretty easy, huh?
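As a minimal sketch using only Ruby’s standard library (ruby-oembed would wrap most of this for you):

require 'net/http'
require 'json'
require 'cgi'

video_id   = 'VIDEO_ID' # extracted as shown above
video_url  = "http://www.youtube.com/watch?v=#{video_id}"
oembed_url = "http://www.youtube.com/oembed?url=#{CGI.escape(video_url)}&format=json"

metadata  = JSON.parse(Net::HTTP.get(URI(oembed_url)))
thumbnail = metadata['thumbnail_url'] # a still frame from the video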
I mentioned that there was another, older method of embedding video, using the <embed> tag. You can use a similar XPath as above to extract the src attribute if you are unable to find a suitable iframe video. After a brief investigation, it appears that the src attributes for YouTube videos embedded via the <embed> tag all follow the “/watch?v=VIDEO_ID” format, so transforming the URL to play nice with the oEmbed endpoint shouldn’t be necessary.
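An analogous query for these older embeds, with the same zone filtering as before, might look like:

.//embed[not(ancestor::*[contains(@id, 'sidebar') or contains(@id, 'comment') or contains(@id, 'footer') or contains(@id, 'header')]) and contains(@src, 'youtube')]/@src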
The above YouTube example can be applied to most of the major video (and multimedia, like SlideShare) hosts; you just need to investigate their embed URL formats and their oEmbed endpoints and write cases for them in your algorithm. Unfortunately, oEmbed is not so standardized that this process can be abstracted across all hosts; however, there are services you can pay for (namely, Embed.ly) that will do this for you.
Putting it together
Combining the methods I outlined in this article, you get an algorithm that roughly flows like this (a minimal sketch follows the list):
- Check for Open Graph/Twitter Card tags
- Find the largest suitable image on the page
- Look for a video thumbnail if no image is found
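Here is that flow as a rough sketch; every helper name below is my own, standing in for the technique described in the corresponding section above:

def find_thumbnail(doc)
  # 1. The social meta tags are the most direct expression of intent,
  #    but still subject to the size/suitability sanity checks.
  image = social_meta_image(doc)
  return image if image

  # 2. Otherwise, take the largest suitable image from the content zones.
  image = largest_suitable_image(candidate_images(doc))
  return image if image

  # 3. Finally, try to pull a thumbnail from an embedded video.
  video_thumbnail(doc) # nil here means: show no thumbnail at all
end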
Using these techniques, I’ve found that you can get results at least in the ballpark of the bigger players. There are no doubt significant improvements that could be made; I am not an expert in this area. I’m simply relaying what I have learned after some significant time spent implementing my own solution, in the hope of filling a small void in the Google results for “thumbnail scraping” and saving someone else like myself a considerable amount of time. I may or may not write an article in the future with specific implementation details, such as how to actually crawl the web page, download and resize the image, and upload it somewhere such as S3. Good luck!