feat(marketing): exempt public blueprints from noindex + fix / route collision

- add_no_crawl_headers now skips marketing.*, legal.*, billing.success,
  static, and robots_txt endpoints via the new _is_public_indexable_endpoint
  helper; all other routes keep the X-Robots-Tag noindex header (sketched
  after this list)
- recordings.index drops @login_required and instead redirects
  anonymous users to marketing.landing, resolving the URL-map
  collision between recordings_bp and marketing_bp at "/" (route
  sketch below)
- robots.txt rewritten: public marketing pages and /legal/* allowed,
  /api/, /admin, /account, /share/, /app/, /checkout, /login, /signup,
  /webhooks/ disallowed; Googlebot, Bingbot, ClaudeBot, GPTBot,
  PerplexityBot, Applebot explicitly allowed
- New tests/test_no_crawl_headers.py (14 tests) covers the exemption
  helper plus integration checks on /, /robots.txt, /static, /admin,
  and /login (illustrative excerpt at the end of this message)
- New tests/test_marketing_root_redirect.py (4 tests) verifies
  anonymous users at / never get a /login redirect
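
Minimal sketch of the header exemption, assuming a standard Flask
after_request hook; the endpoint names come from this commit, while the
constant names, helper internals, and the exact header value are
illustrative:

    # Sketch only, not the shipped code.
    from flask import Flask, request

    app = Flask(__name__)

    # Endpoints allowed to be indexed (public marketing/legal surface).
    PUBLIC_INDEXABLE_PREFIXES = ("marketing.", "legal.")
    PUBLIC_INDEXABLE_ENDPOINTS = {"billing.success", "static", "robots_txt"}

    def _is_public_indexable_endpoint(endpoint):
        """Return True when the matched endpoint serves public, indexable content."""
        if endpoint is None:  # unmatched routes and 404s stay noindexed
            return False
        if endpoint in PUBLIC_INDEXABLE_ENDPOINTS:
            return True
        return endpoint.startswith(PUBLIC_INDEXABLE_PREFIXES)

    @app.after_request
    def add_no_crawl_headers(response):
        # Everything outside the public surface keeps the noindex directive.
        if not _is_public_indexable_endpoint(request.endpoint):
            response.headers["X-Robots-Tag"] = "noindex"
        return response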

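The "/" route change, sketched under the assumption that Flask-Login's
current_user is in use; the template name is hypothetical:

    from flask import Blueprint, redirect, render_template, url_for
    from flask_login import current_user

    recordings_bp = Blueprint("recordings", __name__)

    @recordings_bp.route("/")
    def index():
        # No @login_required anymore: anonymous visitors are sent to the
        # public landing page rather than bounced to /login.
        if not current_user.is_authenticated:
            return redirect(url_for("marketing.landing"))
        return render_template("recordings/index.html")  # hypothetical template
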
Tests were verified via an AST and logic walkthrough; running pytest is
blocked on Windows by the pre-existing fcntl import in src/init_db.py
(B-1.2 limitation).
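
Illustrative excerpt of the kind of assertions the new header tests make
(not the actual test file; the create_app factory import is an assumption):

    import pytest
    from src.app import create_app  # assumed application factory

    @pytest.fixture
    def client():
        app = create_app()
        app.config["TESTING"] = True
        with app.test_client() as test_client:
            yield test_client

    def test_root_serves_indexable_marketing_page(client):
        # Anonymous "/" ends on the marketing surface, which carries no noindex.
        response = client.get("/", follow_redirects=True)
        assert "X-Robots-Tag" not in response.headers

    def test_admin_keeps_noindex(client):
        response = client.get("/admin", follow_redirects=False)
        assert "noindex" in response.headers.get("X-Robots-Tag", "")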
Author: Allison
Date: 2026-04-27 16:28:55 -04:00
Parent: 55ae09431d
Commit: 1071e56173
5 changed files with 299 additions and 54 deletions

robots.txt

@@ -1,65 +1,48 @@
-# DictIA - Block all web crawlers and search engines
-# This application contains private user data and should not be indexed
+# DictIA - robots.txt
+# Updated 2026-04-27 for marketing redesign (Task B-1.3)
+#
+# Public marketing pages (root, /tarifs, /fonctionnalites, /conformite,
+# /contact, /blog) and legal pages (/legal/*) are indexable.
+# Application routes (/api, /admin, /account, /share, /app, /checkout,
+# /login, /signup, /webhooks) remain blocked.
 User-agent: *
-Disallow: /
+Allow: /
+Allow: /tarifs
+Allow: /fonctionnalites
+Allow: /conformite
+Allow: /contact
+Allow: /blog
+Allow: /legal/
+Disallow: /api/
+Disallow: /admin
+Disallow: /account
+Disallow: /share/
+Disallow: /app/
+Disallow: /checkout
+Disallow: /login
+Disallow: /signup
+Disallow: /oublie
+Disallow: /verifier-email
+Disallow: /webhooks/
-# Specific directives for major search engines
+# Search/AI crawlers explicitly allowed on public marketing surface
 User-agent: Googlebot
-Disallow: /
-User-agent: Googlebot-Image
-Disallow: /
+Allow: /
 User-agent: Bingbot
-Disallow: /
+Allow: /
-User-agent: Slurp
-Disallow: /
+User-agent: ClaudeBot
+Allow: /
-User-agent: DuckDuckBot
-Disallow: /
-User-agent: Baiduspider
-Disallow: /
-User-agent: YandexBot
-Disallow: /
-User-agent: ia_archiver
-Disallow: /
-# AI Crawlers
 User-agent: GPTBot
-Disallow: /
+Allow: /
-User-agent: ChatGPT-User
-Disallow: /
+User-agent: PerplexityBot
+Allow: /
-User-agent: CCBot
-Disallow: /
+User-agent: Applebot
+Allow: /
-User-agent: anthropic-ai
-Disallow: /
-User-agent: Claude-Web
-Disallow: /
-User-agent: cohere-ai
-Disallow: /
-# Social Media Crawlers
-User-agent: facebookexternalhit
-Disallow: /
-User-agent: Twitterbot
-Disallow: /
-User-agent: LinkedInBot
-Disallow: /
-User-agent: Slackbot
-Disallow: /
-User-agent: Discordbot
-Disallow: /
 Sitemap: https://dictia.pages.dev/sitemap.xml
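
As a sanity check on the rules above, a small sketch with Python's standard
urllib.robotparser. The stdlib parser is first-match rather than the
longest-match precedence of RFC 9309, so the blanket "Allow: /" line is left
out and only a few of the Disallow rules are exercised; the URL paths are
illustrative:

    from urllib.robotparser import RobotFileParser

    # Subset of the new rules; urllib.robotparser returns the first matching
    # rule, so including "Allow: /" here would mask every Disallow below.
    rules = [
        "User-agent: *",
        "Disallow: /api/",
        "Disallow: /admin",
        "Disallow: /login",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    assert parser.can_fetch("*", "https://dictia.pages.dev/tarifs")          # public page
    assert not parser.can_fetch("*", "https://dictia.pages.dev/api/health")  # blocked
    assert not parser.can_fetch("*", "https://dictia.pages.dev/login")       # blocked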