The robots.txt file is a standard text file placed at the root of a website (e.g., https://example.com/robots.txt) that tells automated web crawlers, spiders, and AI agents which pages or files they are allowed to request from the site.
How it Works
Robots.txt relies entirely on the honor system. It is a set of instructions and crawler etiquette, not a security control: nothing on the server enforces it. A “polite” bot (like Googlebot) will read the file and respect the rules. A malicious scraper or rogue AI agent can simply ignore it.
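Because the check happens on the crawler's side, a polite bot has to perform it itself before each request. A minimal sketch of that check, using Python's standard urllib.robotparser module; the helper name is_allowed, the bot name ExampleBot, and the sample rules are assumptions made for illustration:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /admin/
"""

# A polite crawler skips the URL when this prints False.
print(is_allowed(SAMPLE_RULES, "ExampleBot", "https://example.com/admin/users"))  # False
print(is_allowed(SAMPLE_RULES, "ExampleBot", "https://example.com/blog/"))        # True
```

Nothing stops a crawler from skipping this step, which is why robots.txt cannot protect sensitive content on its own.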
Syntax Example
User-agent: *
Disallow: /admin/
Disallow: /private-data/
User-agent: GPTBot
Disallow: /
User-agent: Googlebot
Allow: /
Crawl-delay: 5
Key Directives
User-agent: Specifies which bot the rule applies to. * means all bots. Specific names can be targeted (e.g., GPTBot for OpenAI’s crawler).
Disallow: Tells the specified agent not to crawl a specific URL path.
Allow: Overrides a Disallow directive for a more specific path, such as a subdirectory.
Crawl-delay: Requests that the bot wait a specified number of seconds between requests to prevent server overload. It is a non-standard directive, and some crawlers (Googlebot among them) ignore it.
Sitemap: Points the crawler to the XML sitemap to help it discover allowed pages efficiently.
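One way to see how these directives interact is to load the rules from the syntax example into Python's standard urllib.robotparser and query them. This is a sketch rather than a description of how any particular crawler behaves: the Sitemap line and the bot name ExampleBot are assumptions added for illustration, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Rules from the syntax example, plus an assumed Sitemap line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private-data/

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# User-agent: * applies to any bot without a group of its own.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/blog/"))   # True

# A named group replaces the catch-all group for that bot.
print(parser.can_fetch("GPTBot", "https://example.com/blog/"))       # False
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))   # True

# Crawl-delay and Sitemap as this parser reports them.
print(parser.crawl_delay("Googlebot"))  # 5
print(parser.site_maps())               # ['https://example.com/sitemap.xml']
```

Note that a bot with its own group follows only that group: in this file Googlebot may fetch /admin/ because the catch-all Disallow lines no longer apply to it.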
AI and Robots.txt
With the rapid rise of Generative AI, many publishers now update their robots.txt to explicitly block AI crawlers from collecting their content for model training without permission. For example, adding a User-agent: GPTBot line followed by a Disallow: / line asks OpenAI's crawler to stop crawling the site for model training purposes. As with every robots.txt rule, this is a request that the crawler must choose to honor.
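A publisher maintaining such a blocklist could generate the extra groups and confirm they parse as intended. The sketch below assumes a hypothetical helper, block_agents, and a made-up second crawler name, ExampleAIBot; only GPTBot comes from the text above. It needs Python 3.9 or later for the list[str] annotation.

```python
from urllib.robotparser import RobotFileParser


def block_agents(existing_rules: str, agents: list[str]) -> str:
    """Append one 'Disallow: /' group per agent to existing robots.txt text."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in agents]
    return existing_rules.rstrip("\n") + "\n\n" + "\n\n".join(groups) + "\n"


# Hypothetical existing file and crawler list, for illustration only.
EXISTING = "User-agent: *\nDisallow: /admin/\n"
updated = block_agents(EXISTING, ["GPTBot", "ExampleAIBot"])
print(updated)

# Verify the result the way a compliant crawler would read it.
parser = RobotFileParser()
parser.parse(updated.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/articles/1"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/1"))  # True
```

Bots that are not listed keep following the catch-all group, so the site stays reachable for ordinary search crawlers.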