Common On-Page SEO Pitfalls

Author: Bastian Grimm

A couple of weeks ago, I spoke at Turkey’s first SEO conference, SEOZone. Since our agency, Ads2people, conducts a large number of on-page audits, from very large and often multilingual corporate sites to regular blogs, I thought it would be helpful to talk about some common on-page pitfalls we see over and over again. This is an exclusive write-up for SEMPO summarizing that presentation. I hope it helps you improve your on-page SEO.

#1 Crawl Budget

Given the fact that search engines such as Google assign a certain crawl budget per domain (and sub-domain), I’m always surprised at how often site owners simply try to push all of their content into the index. They also often seem to be completely careless about which pages are crawler-accessible at all.

To assess and fix these problems on your site, a good starting place is Google Webmaster Tools (go to: Crawl > Crawl Stats), which gives a first impression of how a site is doing. A healthy graph increases slightly over time – which usually reflects that Google picks up on content being added and therefore returns a bit more frequently. Conversely, if that graph is jumping around or dropping massively, you might have a problem.

There are two ways to control search engine crawlers: using a robots.txt directive, or implementing a robots meta tag in the HTML mark-up (or serving it as an HTTP X-Robots-Tag header) – both are sketched below. However, neither directive on its own solves your (potential) crawl-budget issues:

- Robots meta tag: Implementing a proper “noindex” does prevent a given page from showing up in search results, but that page will still be crawled – and therefore crawl budget still has to be spent on it.

- robots.txt: Blocking a URL (or folder, etc.) does prevent that page from being crawled (and therefore does not waste crawl budget); however, there are massive downsides. Pages might still (partially) show up in search results (mainly due to being linked from someplace else), and all inbound link juice is cut off. In other words, those links do not help your rankings.
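For reference, this is roughly what the two approaches look like in practice – a minimal sketch; the /filter/ path and the example directives are made up, so adjust them to your own setup:

    <!-- Robots meta tag in the <head> of a page you want kept out of the index
         (note: the page still gets crawled, so crawl budget is still spent): -->
    <meta name="robots" content="noindex, follow">

    # The same directive sent as an HTTP header, useful for PDFs and other non-HTML files:
    X-Robots-Tag: noindex, follow

    # robots.txt – blocks crawling entirely (saves crawl budget, but the URL may
    # still surface in results and inbound link equity is cut off):
    User-agent: *
    Disallow: /filter/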

Considering those points, you might think about combining the two… but please – don’t! It simply cannot work: if a page is blocked using robots.txt, that page won’t be crawled, and the meta robots tag therefore cannot be read at all!

Watch out for things like filters and sorting, pagination, and other potentially useless pages. We see these pushed into the index all the time, even though they never can and never will rank for anything. Don’t waste Google’s resources on that!

As a rule of thumb: if you want to be sure not to waste crawl budget, only have pages that really are useful (so don’t create others in the first place). If you do have pages you don’t want to show up, I’d go with the meta robots tag to at least preserve the inbound link equity.

#2 Duplicate Content

I assume everyone is familiar with duplicate content (DC) issues, but it turns out that’s not the case (if you’re not, please read this first). It always surprises me to see how many sites out there are still not performing well due to a lot of internal (partial) DC. Even though most sites these days handle session IDs and tracking parameters well, here are some “classics” I’d like to remind you of: HTTP vs. HTTPS is considered DC, products available in multiple categories (without a single product URL) cause DC as well, and sub-domains (like staging servers) might get you in trouble.

That said, the rel=”canonical” link element (or the equivalent HTTP Link header) can help you fix those issues, but I think this is only the third-best option for solving DC. In my mind, it’s really all about efficiency – so the best way to solve it is to make sure that you only ever serve content under one single (canonical) URL rather than multiple ones. It’s as simple as that.
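For completeness, if you do end up relying on it, this is roughly what the canonical annotation looks like – a sketch using a made-up product URL:

    <!-- In the <head> of every duplicate variant, point to the one canonical URL: -->
    <link rel="canonical" href="https://www.example.com/product/blue-widget">

    # Or as an HTTP header, e.g. for non-HTML resources:
    Link: <https://www.example.com/product/blue-widget>; rel="canonical"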

I’d generally not rely on something that Google calls “a strong hint” – because a hint is something they might or might not consider; it’s not a forcing directive like an HTTP 301 redirect (which they simply have to follow).

Again, it comes down to giving Google as few choices as possible. Enforce single, unique URLs with amazing content, 301 redirect previously existing ones (e.g., old or duplicate versions) to this (new) URL – see the sketch below – and you won’t suffer from DC issues.
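A sketch of such redirects – here via Apache’s mod_rewrite and mod_alias in an .htaccess file, assuming a typical www/HTTPS setup (nginx or application-level redirects work just as well; all URLs are placeholders):

    # Force every HTTP request onto the one canonical HTTPS host:
    RewriteEngine On
    RewriteCond %{HTTPS} off
    RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

    # Retire an old duplicate URL by permanently redirecting it to the new one:
    Redirect 301 /old-category/blue-widget /product/blue-widget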

#3 Proper Mark-Up

There are quite a few differing opinions on whether and why proper mark-up is important. I won’t really jump into that discussion, but I’m a strong believer that clean and simple mark-up helps. That’s mainly because I don’t want to take any chances on a crawler having “issues” when trying to extract information from a page. And that’s also why I think schema.org mark-up is a good thing: it helps engines (not only crawlers) to actually understand (parts of) your content and make sense of it – in short, to understand its meaning.

Obviously you have to consider which information you can and want to provide to Google (and others), but if you don’t give them your data, they’ll get it elsewhere. So generally speaking, don’t miss out on this. It’s about far more than just gaining CTR from more prominent results – which is great, by the way – and if you combine structured data with rel=”author” and/or rel=”publisher”, the benefits are even greater. It’s basically Google moving toward understanding and assigning verified entities to sets of queries, and you surely don’t want to miss out on that. In my opinion, Google is moving massively toward a point where you need to be a verified authority for a given entity and will therefore automatically benefit from all the long-tail traffic that belongs to this entity – which makes a lot of sense given that roughly 20% of the queries Google sees each day are new.
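To make that concrete, here’s a minimal sketch of what the combination could look like on an article page – schema.org’s Article type expressed as JSON-LD (microdata works just as well) plus the authorship and publisher links; the Google+ URLs are placeholders:

    <!-- schema.org Article mark-up as JSON-LD in the <head>: -->
    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Article",
      "headline": "Common On-Page SEO Pitfalls",
      "author": { "@type": "Person", "name": "Bastian Grimm" }
    }
    </script>

    <!-- Authorship and publisher links tied to Google+ profiles (placeholder URLs): -->
    <link rel="author" href="https://plus.google.com/1234567890">
    <link rel="publisher" href="https://plus.google.com/+Ads2people">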

So if you’ve not yet played around with rich snippet mark-up, I recommend you check out schema.org to see what’s in store for you, get it implemented, and verify your domain and author profile with Google+ to get things started. Good luck!

If you’re interested in the slide deck, feel free to check it out on SlideShare.

About the author:

Bastian Grimm co-runs Ads2people, a full-service performance marketing agency based in Berlin, Germany, where he heads the SEO department as VP of Search. Having a passion for software development and everything “tech,” he loves to challenge IT and marketing departments to come up with outstanding results in search marketing. Bastian is a widely cited authority in SEO, having spoken at almost every major search conference, including SMX, ISS, SES, SEOkomm, LAC, BAC, and many more events around the globe.

Find Bastian on Twitter and Google+ or contact him at bg@ads2people.de or +49 30 720209710.

 

Opinions expressed in the article are those of the guest author and not necessarily SEMPO.


    8 Comments

    1. Eric Hewitson

      Thanks Bastian, that’s a nice clear explanation of Crawl Budget, Duplicate Content and Proper Markup. A lot clearer for me now. I’ve certainly found schema.org to be incredibly useful this year in view of the new emphasis being placed on authorship.

    2. Peter

      Thanks Bastian,

It’s a great article you’ve posted. I have a question about crawl budget optimization: is it a good idea to “nofollow” images on a website? I mean images on landing pages, service pages, and so on. Your opinion on the topic would be much appreciated.

      Thanks,
      Peter

    3. Bastian Grimm

Peter, probably not. If images are “non-SEO relevant” (i.e., they don’t need to be found in Google image search), I’d use lazy loading to a) make the site faster and b) get them “off” the page.

      Generally speaking, a rel=”nofollow” does not seem to make any sense to me.

