Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
How is Google crawling and indexing this directory listing?
-
We have three Directory Listing pages that are being indexed by Google:
http://www.ccisolutions.com/StoreFront/jsp/
http://www.ccisolutions.com/StoreFront/jsp/html/
http://www.ccisolutions.com/StoreFront/jsp/pdf/
How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although /jsp/html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file, and I understand that this could be why.
If we disallow them in our robots.txt file, will that stop Googlebot from crawling and indexing those Directory Listing pages without also stopping it from crawling and indexing the content that resides there, which is used to populate pages on our site?
Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content.
For example, this file, CCI-SALES-STAFF.HTML (which appears on the Directory Listing referenced above, http://www.ccisolutions.com/StoreFront/jsp/html/), clicks through to this Web page:
http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML
This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff
As you can see, this results in duplicate content problems.
Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result?
For example:
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
Can we do this without risking blocking Googlebot from content we do want crawled and indexed?
Many thanks in advance for any and all help on this one!
-
Thanks so much to you all. This has gotten us closer to an answer. We are consulting with the folks who developed the Web store to make sure that these solutions won't break other things if implemented, particularly something mentioned to me by our IT Director called "Sim links" - I'll keep you posted!
-
I am referring to Web users. If a user or search engine tries to view those directory listing pages, they will get a Forbidden message, which is what you want to happen. The content in those directories will still be accessible to the pages on the site, since the files still exist in those directories, but the pages listing the files in those directories won't be accessible in the browser to users/search engines. In other words, turning off the Directory Indexes will not affect any of the content on the site.
-
He's got the right idea: you shouldn't be serving these pages (unless you have a specific reason to). The problem is that these index pages return a status code of 200 OK, so Google assumes it's fine to index them. These pages should instead come back with either a 404 or a 403 (Forbidden); users then wouldn't be able to browse your site through these directory pages.
Disallowing in robots.txt may not immediately remove these from search results; you may get that lovely description underneath the results that says, "A description for this result is not available because of this site's robots.txt".
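If it helps, here is a rough Python sketch for checking what status code those three directory URLs currently return. The helper names (fetch_status, describe_status) are my own, and the wording of each verdict is just my reading of the point above, not anything official from Google. It also assumes the server answers HEAD requests; if yours doesn't, drop the method argument so a normal GET is sent.

```python
import urllib.error
import urllib.request

# Directory listing URLs taken from the question above.
DIRECTORY_URLS = [
    "http://www.ccisolutions.com/StoreFront/jsp/",
    "http://www.ccisolutions.com/StoreFront/jsp/html/",
    "http://www.ccisolutions.com/StoreFront/jsp/pdf/",
]


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return error.code


def describe_status(code):
    """Rough reading of a status code from an indexing point of view."""
    if code == 200:
        return "served normally, so Google may index it"
    if code in (403, 404):
        return "not served, so it should drop out of the index over time"
    return "needs a closer look"
```

To use it, loop over DIRECTORY_URLS and print describe_status(fetch_status(url)) for each one, once before and once after you change the server configuration.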
-
Thanks much to you both for jumping in. (thumbs up!)
Streamline, I understand your suggestion regarding .htaccess. However, as I mentioned, the content in these directories is being used to populate content on our pages. In your response you mentioned that users/search engines wouldn't be able to access them. When you say "users," are you referring to Web visitors, and not site admins?
-
There are numerous ways Google could have found those pages and added them to the index, but there's really no way to determine exactly what caused it in the first place. All it takes is one visit by Google for a page to be crawled and indexed.
If you don't want these pages indexed, blocking those directories/pages in robots.txt would not be the solution on its own. The problem is that these pages are already in Google's index, and robots.txt only tells Google not to visit those pages from now on, so they would remain in the index. A better solution would be to add noindex (and, if you also want the cached copy dropped, noarchive) robots meta tags to those pages, so that the next time Google accesses them it knows to remove them from the index.
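To confirm that a page really carries the tag once you've added it, something like the following Python sketch could work. It only uses the standard library; the class name, the function name and the sample HTML are made up for illustration and are not taken from your site. Keep in mind that Google can only see the tag if it is still allowed to crawl the page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        # "googlebot" is the Google-specific variant of the robots meta name.
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attributes.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page source, for illustration only.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex, noarchive">
<title>Sales staff</title>
</head><body>...</body></html>"""

print(sorted(robots_directives(SAMPLE_PAGE)))  # ['noarchive', 'noindex']
```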
And now that I've read through your post again, I realize you are talking about file directories rather than normal webpages. What I've written above mainly still applies, but I think the quick and easy fix would be to turn off Directory Indexes altogether (unless you need them for some reason?). All you have to do is add the following line to your .htaccess file:
Options -Indexes
This will turn off these directory listings so users/search engines can't access them and they should eventually fall out of the Google index.
-
You can use robots.txt to disallow Google from even crawling those pages, whereas the meta noindex still allows crawling but prevents those pages from being indexed.
If you have any sensitive data that you don't want Google to read, then go ahead and use the robots.txt directives you wrote above. However, if you just want the pages deindexed, I'd suggest going with the meta noindex, as it will allow other (linked) pages to be indexed but leave that particular page out.
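If you do go the robots.txt route, you can dry-run the rules before deploying them. Below is a small Python sketch using the standard library's urllib.robotparser. The "User-agent: *" line is my assumption, since the rules in the question didn't include one, and Python's parser does simple prefix matching, so treat it as an approximation of how Googlebot reads the file rather than the real thing.

```python
from urllib.robotparser import RobotFileParser

# Rules from the question, plus an assumed "User-agent: *" line.
PROPOSED_RULES = """\
User-agent: *
Disallow: /StoreFront/jsp/
Disallow: /StoreFront/jsp/html/
Disallow: /StoreFront/jsp/pdf/
"""

parser = RobotFileParser()
parser.parse(PROPOSED_RULES.splitlines())

# URLs taken from the question above.
URLS_TO_CHECK = [
    "http://www.ccisolutions.com/StoreFront/jsp/",
    "http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML",
    "http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff",
]


def is_allowed(url, user_agent="Googlebot"):
    """True if the proposed rules let user_agent crawl url."""
    return parser.can_fetch(user_agent, url)


for url in URLS_TO_CHECK:
    print("allowed" if is_allowed(url) else "blocked", url)
```

With these rules the two /StoreFront/jsp/ URLs come back blocked and the /StoreFront/category/ URL comes back allowed. Because matching is by path prefix, the first Disallow line already covers /html/ and /pdf/, including the individual files inside them.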