feat(marketing): exempt public blueprints from noindex + fix / route collision

- add_no_crawl_headers now skips marketing.*, legal.*, billing.success,
  static, and robots_txt endpoints via the new _is_public_indexable_endpoint
  helper; all other routes keep the X-Robots-Tag noindex header (sketched
  after this list)
- recordings.index drops @login_required and instead redirects
  anonymous users to marketing.landing, resolving the URL-map
  collision between recordings_bp and marketing_bp at "/" (route
  sketch below)
- robots.txt rewritten: public marketing pages and /legal/* allowed,
  /api/, /admin, /account, /share/, /app/, /checkout, /login, /signup,
  /webhooks/ disallowed; Googlebot, Bingbot, ClaudeBot, GPTBot,
  PerplexityBot, Applebot explicitly allowed
- New tests/test_no_crawl_headers.py (14 tests) covers the exemption
  helper plus integration checks on /, /robots.txt, /static, /admin,
  and /login (illustrative excerpt at the end of this message)
- New tests/test_marketing_root_redirect.py (4 tests) verifies
  anonymous users at / never get a /login redirect
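
Minimal sketch of the header exemption, assuming a standard Flask
after_request hook; the endpoint names come from this commit, while the
constant names, helper internals, and the exact header value are
illustrative:

    # Sketch only, not the shipped code.
    from flask import Flask, request

    app = Flask(__name__)

    # Endpoints allowed to be indexed (public marketing/legal surface).
    PUBLIC_INDEXABLE_PREFIXES = ("marketing.", "legal.")
    PUBLIC_INDEXABLE_ENDPOINTS = {"billing.success", "static", "robots_txt"}

    def _is_public_indexable_endpoint(endpoint):
        """Return True when the matched endpoint serves public, indexable content."""
        if endpoint is None:  # unmatched routes and 404s stay noindexed
            return False
        if endpoint in PUBLIC_INDEXABLE_ENDPOINTS:
            return True
        return endpoint.startswith(PUBLIC_INDEXABLE_PREFIXES)

    @app.after_request
    def add_no_crawl_headers(response):
        # Everything outside the public surface keeps the noindex directive.
        if not _is_public_indexable_endpoint(request.endpoint):
            response.headers["X-Robots-Tag"] = "noindex"
        return response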

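The "/" route change, sketched under the assumption that Flask-Login's
current_user is in use; the template name is hypothetical:

    from flask import Blueprint, redirect, render_template, url_for
    from flask_login import current_user

    recordings_bp = Blueprint("recordings", __name__)

    @recordings_bp.route("/")
    def index():
        # No @login_required anymore: anonymous visitors are sent to the
        # public landing page rather than bounced to /login.
        if not current_user.is_authenticated:
            return redirect(url_for("marketing.landing"))
        return render_template("recordings/index.html")  # hypothetical template
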
Tests were verified via an AST and logic walkthrough; running pytest is
blocked on Windows by the pre-existing fcntl import in src/init_db.py
(B-1.2 limitation).
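
Illustrative excerpt of the kind of assertions the new header tests make
(not the actual test file; the create_app factory import is an assumption):

    import pytest
    from src.app import create_app  # assumed application factory

    @pytest.fixture
    def client():
        app = create_app()
        app.config["TESTING"] = True
        with app.test_client() as test_client:
            yield test_client

    def test_root_serves_indexable_marketing_page(client):
        # Anonymous "/" ends on the marketing surface, which carries no noindex.
        response = client.get("/", follow_redirects=True)
        assert "X-Robots-Tag" not in response.headers

    def test_admin_keeps_noindex(client):
        response = client.get("/admin", follow_redirects=False)
        assert "noindex" in response.headers.get("X-Robots-Tag", "")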
Author: Allison
Date: 2026-04-27 16:28:55 -04:00
Parent: 55ae09431d
Commit: 1071e56173
5 changed files with 299 additions and 54 deletions

robots.txt

@@ -1,65 +1,48 @@
-# DictIA - Block all web crawlers and search engines
-# This application contains private user data and should not be indexed
+# DictIA - robots.txt
+# Updated 2026-04-27 for marketing redesign (Task B-1.3)
+#
+# Public marketing pages (root, /tarifs, /fonctionnalites, /conformite,
+# /contact, /blog) and legal pages (/legal/*) are indexable.
+# Application routes (/api, /admin, /account, /share, /app, /checkout,
+# /login, /signup, /webhooks) remain blocked.
 User-agent: *
-Disallow: /
+Allow: /
+Allow: /tarifs
+Allow: /fonctionnalites
+Allow: /conformite
+Allow: /contact
+Allow: /blog
+Allow: /legal/
+Disallow: /api/
+Disallow: /admin
+Disallow: /account
+Disallow: /share/
+Disallow: /app/
+Disallow: /checkout
+Disallow: /login
+Disallow: /signup
+Disallow: /oublie
+Disallow: /verifier-email
+Disallow: /webhooks/
-# Specific directives for major search engines
+# Search/AI crawlers explicitly allowed on public marketing surface
 User-agent: Googlebot
-Disallow: /
-User-agent: Googlebot-Image
-Disallow: /
+Allow: /
 User-agent: Bingbot
-Disallow: /
+Allow: /
-User-agent: Slurp
-Disallow: /
+User-agent: ClaudeBot
+Allow: /
-User-agent: DuckDuckBot
-Disallow: /
-User-agent: Baiduspider
-Disallow: /
-User-agent: YandexBot
-Disallow: /
-User-agent: ia_archiver
-Disallow: /
-# AI Crawlers
 User-agent: GPTBot
-Disallow: /
+Allow: /
-User-agent: ChatGPT-User
-Disallow: /
+User-agent: PerplexityBot
+Allow: /
-User-agent: CCBot
-Disallow: /
+User-agent: Applebot
+Allow: /
-User-agent: anthropic-ai
-Disallow: /
-User-agent: Claude-Web
-Disallow: /
-User-agent: cohere-ai
-Disallow: /
-# Social Media Crawlers
-User-agent: facebookexternalhit
-Disallow: /
-User-agent: Twitterbot
-Disallow: /
-User-agent: LinkedInBot
-Disallow: /
-User-agent: Slackbot
-Disallow: /
-User-agent: Discordbot
-Disallow: /
 Sitemap: https://dictia.pages.dev/sitemap.xml
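
As a sanity check on the rules above, a small sketch with Python's standard
urllib.robotparser. The stdlib parser is first-match rather than the
longest-match precedence of RFC 9309, so the blanket "Allow: /" line is left
out and only a few of the Disallow rules are exercised; the URL paths are
illustrative:

    from urllib.robotparser import RobotFileParser

    # Subset of the new rules; urllib.robotparser returns the first matching
    # rule, so including "Allow: /" here would mask every Disallow below.
    rules = [
        "User-agent: *",
        "Disallow: /api/",
        "Disallow: /admin",
        "Disallow: /login",
    ]

    parser = RobotFileParser()
    parser.parse(rules)

    assert parser.can_fetch("*", "https://dictia.pages.dev/tarifs")          # public page
    assert not parser.can_fetch("*", "https://dictia.pages.dev/api/health")  # blocked
    assert not parser.can_fetch("*", "https://dictia.pages.dev/login")       # blocked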