Robots.txt: how to exclude sub-directories correctly?

fablau

Hello here,

I am trying to figure out the correct way to tell SEs to crawl this:

http://www.mysite.com/directory/

But not this:

http://www.mysite.com/directory/sub-directory/

or this:

http://www.mysite.com/directory/sub-directory2/sub-directory/...

But since I have thousands of sub-directories with almost infinite combinations, I can't list definitions like the following in a manageable way:

disallow: /directory/sub-directory/

disallow: /directory/sub-directory2/

disallow: /directory/sub-directory/sub-directory/

disallow: /directory/sub-directory2/subdirectory/

etc...

I would end up having thousands of definitions to disallow all the possible sub-directory combinations.

So, is the following a correct, better and shorter way to define what I want above?

allow: /directory/$

disallow: /directory/*

Would the above work?

Any thoughts are very welcome! Thank you in advance.

Best,

Fab.

MickEdwards

I mentioned both. You add a meta robots to noindex and remove from the sitemap.

sjunaidali

But Google is still free to index a link/page even if it is not included in the XML sitemap.

MickEdwards

Install the Yoast WordPress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.

sjunaidali

I am using WordPress with the Enfold theme (ThemeForest).

I want some files to be accessed by google, but those should not be indexed.

Here is an example: http://prntscr.com/h8918o

I have currently blocked some JS directories/files using robots.txt (check screenshot)

But because of this I am not able to pass the Mobile-Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)

Is it possible to allow access but use a tag like noindex in the robots.txt file? Or is there any other way out?

fablau

Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:

allow: /directory/$

disallow: /directory/*

Which allows this URL:

http://www.mysite.com/directory/

But doesn't allow the following one:

http://www.mysite.com/directory/sub-directory2/...

This page also gives an update similar to mine:

https://support.google.com/webmasters/answer/156449?hl=en

I think I am good! Thanks
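
For anyone who wants to reason about why these two directives give that result, here is a minimal sketch of Google-style rule matching as I understand it from Google's robots.txt documentation: "*" matches any run of characters, a trailing "$" anchors the end of the path, the longest matching pattern wins, and on a tie the less restrictive (allow) rule wins. The names pattern_to_regex and is_allowed are hypothetical helpers written for this illustration, not part of any library, and other crawlers may evaluate rules differently.

```python
import re

def pattern_to_regex(pattern):
    # Hypothetical helper: translate a robots.txt path pattern into a regex.
    # "*" matches any run of characters; a trailing "$" anchors the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # rules is a list of (directive, pattern) pairs, e.g. ("allow", "/directory/$").
    # Assumption: longest matching pattern wins; on equal length, allow wins.
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # A path that no rule matches is crawlable by default.
    return True if best is None else best[1]

rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]

for path in ["/directory/", "/directory/sub-directory/", "/directory/sub-directory2/sub-directory/"]:
    print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

Under these assumptions both patterns match /directory/ and have the same length, so the allow rule decides; for anything deeper only the disallow pattern matches. The tester in Webmaster Tools remains the authoritative check.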

fablau

Thank you Michael, it is my understanding then that my idea of doing this:

allow: /directory/$

disallow: /directory/*

Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.

In the meantime, if anyone else has more ideas about all this and can confirm, that would be great!

Thank you again.

MickEdwards

I've always stuck to Disallow and followed -

"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"

http://www.robotstxt.org/robotstxt.html

Compared with https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory:

| /* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |

I think this post will be very useful for you - http://cloudz.click/community/q/allow-or-disallow-first-in-robots-txt

fablau

Thank you Michael,

Google and other SEs actually recognize the "allow:" command:

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

The fact is: if I don't specify that, how can I be sure that the following single command:

disallow: /directory/*

doesn't prevent SEs from spidering the /directory/ index, which I'd like them to crawl?
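
To illustrate the concern, here is a rough sketch of the wildcard matching described in the Google page linked above, assuming "*" matches zero or more characters and a trailing "$" anchors the end of the path. The matches function is a hypothetical helper for illustration only; it is not how any particular crawler is implemented.

```python
import re

def matches(pattern, path):
    # Hypothetical helper: does a robots.txt path pattern match a URL path?
    # Assumption: patterns match from the start of the path (prefix match),
    # "*" matches zero or more characters, a trailing "$" anchors the end.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.match(body + ("$" if anchored else ""), path) is not None

# Compare the wildcard pattern with the plain directory prefix.
for path in ["/directory/", "/directory/sub-directory/", "/directory"]:
    print(path, matches("/directory/*", path), matches("/directory/", path))
```

Because "*" can match nothing at all, /directory/* behaves like the plain prefix /directory/ in this sketch and so also matches the /directory/ index itself. That is why a lone disallow line would not be enough to keep the index crawlable under these matching rules.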

MickEdwards

As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no allow, so you don't need the first line, just:

disallow: /directory/*

You can test it out here: https://support.google.com/webmasters/answer/156449?rd=1
