
e-Commerce Advent Calendar

The e-Commerce geek's favorite time of year
2023 Edition

In the age of AI, controlling robots is more important than ever

by Toni Anicic
ABOUT THE AUTHOR
Toni Anicic

Toni Anicic is the founder of Agency 418, which specializes in Magento 2 and Adobe Commerce website development and maintenance. Toni has spent the last 15 years in the Magento ecosystem in various roles, from leading a digital marketing team to project management.

What do Amazon, IKEA, Cisco, Airbnb, Reuters, Coursera, and Quora have in common?

They are all part of a growing list of major companies that are disallowing GPTBot from accessing their websites.

If you check their robots.txt files, you will find the following code there:

User-agent: GPTBot
Disallow: /

For those of you who are new to robots.txt syntax, what these two lines mean is: if a bot that identifies itself as GPTBot wants to see the content of any page on this domain, it is not allowed to do so.

Audit your e-commerce website’s robots.txt file and prepare it for 2024 and beyond.

When was your robots.txt file last updated? If it was a year ago or even earlier, then you certainly haven’t kept up with the latest trends, such as blocking GPTBot the way those major competitors of yours might have done.

Now I am not here to tell you if you should or should not block GPTBot. That is a business decision I leave up to you.

What I am here for is to explain several lesser-known facts about robots.txt syntax that will enable you to audit your file and spot potential issues like a true professional!

NOTE: Google used to offer a robots.txt testing tool in Search Console, but that tool is being deprecated, so you will no longer be able to test your rules there. That makes it all the more important to understand the syntax yourself!
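Until a replacement appears, you can at least do quick local checks with Python's standard library. The sketch below uses urllib.robotparser; keep in mind that this parser follows an older, simpler first-match interpretation of the rules, so it can disagree with Google's crawlers on the edge cases covered below, but it is fine for straightforward allow/disallow checks.

from urllib.robotparser import RobotFileParser

# Feed the parser the same two lines those major companies use
rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

print(rp.can_fetch("GPTBot", "https://example.com/any-page"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/any-page"))  # True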

1. robots.txt paths are case sensitive

That’s right.

Let’s say you have a folder within your website that you don’t want GPTBot to access.

Let’s say that folder is example.com/test/

If your robots.txt file reads:

User-agent: GPTBot
Disallow: /test/

Can GPTBot access the content of /Test/?

Yes, it can! Because robots.txt path matching is case sensitive, /test/ and /Test/ are different URLs to it!

Often your website will actually return the same content for any combination of uppercase and lowercase letters (tEst, test, Test, teSt, and tesT will all return the same content).

This means that if you really need to keep a bot such as GPTBot out of this folder, you’ll need a server-side solution that catches all requests and redirects them to the lowercase version, so that a single lowercase directive always applies.
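Purely as an illustration, here is a minimal sketch of that idea as Python/WSGI middleware (the class name LowercasePathRedirect is made up for this example); on a PHP stack such as Magento, the equivalent is usually done with your web server's rewrite rules.

from urllib.parse import quote

class LowercasePathRedirect:
    """WSGI middleware: 301-redirect any path containing uppercase letters
    to its lowercase version, so one lowercase robots.txt rule covers them all."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if any(c.isupper() for c in path):
            location = quote(path.lower(), safe="/")
            query = environ.get("QUERY_STRING", "")
            if query:
                location += "?" + query
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return self.app(environ, start_response)

# usage: wrap your existing WSGI app
# app = LowercasePathRedirect(app)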

2. Naming an agent makes it ignore all other directives

Let’s say you have a robots.txt file like this:

User-agent: *
Disallow: /secret/
Disallow: /private/
Disallow: /admin/

What this means is that no user agent may access anything in the folders /secret/, /private/, and /admin/.

Now you’d like to update your robots.txt file and say, in addition to these three folders, GPTBot specifically should not be allowed to access /content/.

So you go ahead and edit your robots.txt file into this:

User-agent: *
Disallow: /secret/
Disallow: /private/
Disallow: /admin/

User-agent: GPTBot
Disallow: /content/

Can GPTBot access /secret/?

Yes! Yes it can. Because you specifically named GPTBot in your directives, it no longer listens to anything in the group it was matching before, the User-agent: * (any) group.

The moment you name it, it will only listen to the directives under User-agent: GPTBot and nothing else.
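So if the intent is for GPTBot to respect the three shared folders as well as /content/, the GPTBot group has to repeat those rules explicitly, for example:

User-agent: *
Disallow: /secret/
Disallow: /private/
Disallow: /admin/

User-agent: GPTBot
Disallow: /secret/
Disallow: /private/
Disallow: /admin/
Disallow: /content/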

3. Allow beats disallow if they are equally precise

Let’s say you have the following in your robots.txt:

User-agent: GPTBot
Disallow: /test/
Allow: /test/

Is GPTBot allowed to access /test/?

Yes, because Allow is stronger than Disallow if they are equally precise.

Does it matter which one goes first?

No. The following would have exactly the same effect; the order of Allow and Disallow makes no difference whatsoever:

User-agent: GPTBot
Allow: /test/
Disallow: /test/

4. More precise means more characters matched

Let’s say you have the following robots.txt directive:

User-agent: GPTBot
Disallow: /test/playground/
Allow: /test/

Can GPTBot access /test/playground/?

No, it cannot.

Even though Allow trumps Disallow when they are equally precise, in this case the Disallow directive matched more characters than the Allow directive did, which makes it stronger than the Allow directive.
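And since the Search Console tester mentioned earlier is going away, here is a minimal Python sketch of the evaluation logic from the last two sections: most characters matched wins, and Allow wins ties. The function name is mine, and it deliberately ignores wildcards and user-agent matching, so treat it as an illustration rather than a complete parser.

def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", path_prefix) pairs for one user agent."""
    best = None  # (matched_length, allowed)
    for kind, prefix in rules:
        if path.startswith(prefix):  # note: matching is case sensitive
            length = len(prefix)
            allowed = (kind == "allow")
            # a longer match wins; on equal length, Allow beats Disallow
            if best is None or length > best[0] or (length == best[0] and allowed):
                best = (length, allowed)
    return True if best is None else best[1]

# Section 3: equally precise, so Allow wins
print(is_allowed("/test/", [("disallow", "/test/"), ("allow", "/test/")]))  # True

# Section 4: Disallow matches more characters, so Disallow wins
print(is_allowed("/test/playground/",
                 [("disallow", "/test/playground/"), ("allow", "/test/")]))  # False

# Section 1: case sensitivity, /Test/ is not covered by "Disallow: /test/"
print(is_allowed("/Test/", [("disallow", "/test/")]))  # True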

Final thoughts

I hope you enjoyed this crash course in the technical SEO intricacies of robots.txt syntax, and I hope it will help you audit your robots.txt file for 2024 and beyond.

Toni Anicic
