How can I prevent duplicate pages being indexed because of load balancer (hosting)?

iam-sold

The site that I am optimising has a problem with duplicate pages being indexed as a result of the load balancer (which is required and set up by the hosting company).

The load balancer passes the site through to 2 different URLs:

Somehow, Google has indexed two copies of the same URLs (which I was obviously hoping it wouldn't): the first on www and the second on www2.

The two hosts (www and www2) are mirror images of each other, meaning I can't upload a robots.txt to the root of www2.domain.com disallowing all. Also, I can't add a canonical tag to the website header of www2.domain.com pointing the individual URLs through to www.domain.com.

Any suggestions as to how I can resolve this issue would be greatly appreciated!

customerparadigm.com

There are two ways to handle load balancing, and it appears that your hosting company / server company chose to use the DNS round-robin routing option.

According to the Wikipedia page on load balancing:
http://en.wikipedia.org/wiki/Load_balancing_(computing)

"Load balancing usually involves dedicated software or hardware, such as a multilayer switch or a Domain Name System server process."

Round Robin DNS Load Balancing: Basically, you use DNS itself to distribute requests. When someone visits your site, roughly 50% of visitors are routed to www.domain.com and 50% are routed to ww1.domain.com. Both sites contain identical content; only the URLs differ slightly. Sometimes the hostname is the same, and www.domain.com simply resolves to different IP addresses.

Advantages: you don't need a dedicated load balancing piece of software or hardware, so it's less expensive.
Disadvantages: this technique exposes the individual web servers to the end user. You can also suffer from duplicate content penalties. Finally, if you are relying on round robin DNS for load balancing and a DNS server or one of the Web servers goes down, there's no easy fail-over (as many DNS records are cached).

More about Round Robin DNS: http://en.wikipedia.org/wiki/Round-robin_DNS
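
To make the downside concrete, here is a rough Python sketch that simulates the rotation described above. It is only an illustration, not how a real DNS server works internally; the function names are made up for this example and the hostnames are the placeholder ones used in this answer.

```python
from itertools import cycle

# Placeholder hostnames from the description above; both serve the same content.
MIRRORS = ["www.domain.com", "ww1.domain.com"]


def route_visitors(mirrors, visitor_count):
    """Hand each visitor the next mirror in turn, round-robin style."""
    rotation = cycle(mirrors)
    return [next(rotation) for _ in range(visitor_count)]


def distinct_hosts_seen(assignments):
    """Return every hostname that visitors (and crawlers) end up seeing."""
    return sorted(set(assignments))


assignments = route_visitors(MIRRORS, 4)
print(assignments)
print(distinct_hosts_seen(assignments))
```

Because both hostnames are handed out, a crawler can reach the same page under two different URLs, which is where the duplicate indexing comes from.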

Hardware / Software Load Balancer:
In this case, your DNS zone file tells the end user to go to one IP address when they type in www.domain.com. The hardware or software load balancer then sees the request, and then hands off the content to one of the web servers in a cluster.

Advantages: No duplicate content penalty; the end user just sees one web server and not individual sub-domains (www.domain.com and ww1.domain.com). A load balancer can also cache specific items like a CSS file, so the load on the Web servers is even lower.

Disadvantages: You're introducing another piece of hardware or software (i.e. more cost), and this piece could itself become a single point of failure. You also need someone to figure out how to set this up and make sure it all works.

More on this type of Load Balancing: http://en.wikipedia.org/wiki/Load_balancing_(computing)#Internet-based_services
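
For comparison, here is a minimal Python sketch of the load balancer approach. It is a toy model under stated assumptions: the `LoadBalancer` class and the private backend IPs are hypothetical, and a real balancer would be a dedicated appliance or proxy rather than a few lines of code.

```python
from itertools import cycle

PUBLIC_HOST = "www.domain.com"          # the only name visitors ever see
BACKENDS = ["10.0.0.11", "10.0.0.12"]   # hypothetical private server IPs


class LoadBalancer:
    """Hands each incoming request to the next backend in turn."""

    def __init__(self, public_host, backends):
        self.public_host = public_host
        self._rotation = cycle(backends)

    def handle(self, path):
        backend = next(self._rotation)
        # The visitor-facing URL never includes the backend address.
        return {"url": f"http://{self.public_host}{path}", "served_by": backend}


balancer = LoadBalancer(PUBLIC_HOST, BACKENDS)
responses = [balancer.handle("/about") for _ in range(4)]
print({r["url"] for r in responses})
print([r["served_by"] for r in responses])
```

The requests are still spread across both servers, but every response is served under a single URL, so there is nothing extra for a search engine to index.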

Load balancing can get complicated as soon as you have databases involved, but with a good design, multiple front-end Web servers can talk to a single back-end database server. The goal is to cache as much content as possible as "static" elements, using caching systems like Varnish that essentially turn database-driven pages into static, old-school HTML pages. The system then only interacts with the database when someone needs to save something to it (i.e. making a purchase on an eCommerce site).
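
The caching idea can be sketched in a few lines of Python. This is not how Varnish is configured or how it works internally; it is just an in-memory stand-in, with a hypothetical `CachedSite` class and made-up product data, to show pages being served from cache until a write happens.

```python
class CachedSite:
    """Serves pages from memory and only touches the database on a miss or a write."""

    def __init__(self):
        self.database = {"/product/1": "Widget, 5 in stock"}  # hypothetical data
        self.cache = {}
        self.db_reads = 0

    def get_page(self, path):
        if path not in self.cache:
            # Cache miss: read from the database once and keep the rendered page.
            self.db_reads += 1
            self.cache[path] = f"<html>{self.database[path]}</html>"
        return self.cache[path]

    def save(self, path, content):
        # A write (e.g. a purchase) updates the database and drops the stale page.
        self.database[path] = content
        self.cache.pop(path, None)


site = CachedSite()
for _ in range(3):
    site.get_page("/product/1")
reads_before_write = site.db_reads
site.save("/product/1", "Widget, 4 in stock")
page_after_write = site.get_page("/product/1")
print(reads_before_write, site.db_reads)
print(page_after_write)
```

Three page views cost one database read here; the second read only happens because the purchase invalidated the cached page.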

My recommendation:
(1) Move from Round Robin DNS to a hardware or software load balancer.

(2) If that isn't an easy solution, set up Round Robin DNS so that the same hostname (www.domain.com) has an A record for each server, rather than giving each server its own sub-domain.

For example, you might have entries for the same hostname in the DNS zone files on both DNS servers:

NS1.domain.com:
www.domain.com A 69.94.15.10

NS2.domain.com:
www.domain.com A 75.64.18.12

This should at least eliminate your duplicate content issue, but you still have a few disadvantages (described above). This could also lead to server issues, as the servers might be confused about which of them is the authoritative one.
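
As a quick sanity check on records like the two above, here is a small Python sketch that groups A records by hostname. The "name type address" line format and the `a_records_by_host` helper are assumptions for illustration only; real zone files carry more fields (TTL, class, and so on).

```python
from collections import defaultdict

# The two records from the example above, written as "name type address".
ZONE_LINES = [
    "www.domain.com A 69.94.15.10",
    "www.domain.com A 75.64.18.12",
]


def a_records_by_host(lines):
    """Group A-record addresses under the hostname they belong to."""
    hosts = defaultdict(list)
    for line in lines:
        name, record_type, address = line.split()
        if record_type == "A":
            hosts[name].append(address)
    return dict(hosts)


records = a_records_by_host(ZONE_LINES)
print(records)
print(len(records))
```

Two servers, but only one hostname comes out the other end, so visitors and crawlers only ever see www.domain.com.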

And if both servers are sending email, pay special attention to your SPF record, to make sure that you are allowing both IP addresses to be able to send email. (This is often overlooked.)
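
If it helps, here is a small Python sketch that builds an SPF TXT value covering both servers. The IPs are the example addresses from above, the `build_spf` helper is hypothetical, and the `~all` soft-fail policy is just an assumption; pick whatever policy suits your mail setup.

```python
import ipaddress

# Example addresses from the zone records above.
SERVER_IPS = ["69.94.15.10", "75.64.18.12"]


def build_spf(ips, policy="~all"):
    """Build an SPF TXT value that authorises every listed IP to send mail."""
    mechanisms = []
    for ip in ips:
        parsed = ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        prefix = "ip4" if parsed.version == 4 else "ip6"
        mechanisms.append(f"{prefix}:{parsed}")
    return " ".join(["v=spf1", *mechanisms, policy])


spf_value = build_spf(SERVER_IPS)
print(spf_value)
```

The resulting string goes into a TXT record on the sending domain; if one of the two IPs is missing, mail from that server may fail SPF checks.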

Hope this is helpful!
-- Jeff
