Merged
2 changes: 1 addition & 1 deletion DEVELOPMENT.md
@@ -614,7 +614,7 @@ If you need to change the website URL:

#### Step 1: Update Critical Configuration
- [ ] Update `_config.yml` → `url:` field
- [ ] Update `robots.txt` → `Sitemap:` line
- [ ] Verify `robots.txt` → `Sitemap:` line (generated from `{{ site.url }}{{ site.baseurl }}`)
- [ ] Update or remove `CNAME` file if using custom domain
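
The "Verify `robots.txt`" item above can be checked mechanically after a build. The following is a minimal sketch, assuming the rendered file is available as text (for example from `_site/robots.txt`); the helper names `sitemap_lines` and `expected_sitemap` and the `example.org` values are hypothetical and not part of the repository.

```python
def sitemap_lines(robots_txt):
    """Return the values of all `Sitemap:` lines in a rendered robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def expected_sitemap(site_url, baseurl=""):
    """Mirror `{{ site.url }}{{ site.baseurl }}/sitemap.xml` from robots.txt."""
    return f"{site_url}{baseurl}/sitemap.xml"


# Hypothetical rendered output; in practice, read the built robots.txt instead.
rendered = """User-agent: *
Allow: /

Sitemap: https://example.org/lab/sitemap.xml
"""

found = sitemap_lines(rendered)
```

Comparing `found` against `[expected_sitemap(url, baseurl)]`, with `url` and `baseurl` taken from `_config.yml`, confirms that the generated line follows the new site URL.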

#### Step 2: Test Locally
49 changes: 33 additions & 16 deletions robots.txt
@@ -13,18 +13,39 @@ permalink: /robots.txt
# - Malicious crawlers may ignore this file
# - For GitHub Pages, this provides basic protection

# Allow major search engines with rate limiting
# Allow major search engines.
# Note: Googlebot ignores Crawl-delay directives, so we omit it to avoid Search Console warnings.
User-agent: Googlebot
Crawl-delay: 10
Allow: /
Disallow: /images/
Disallow: /assets/
Disallow: /_site/
Disallow: /bin/
Disallow: /CNAME
Disallow: /README.md
Disallow: /DEVELOPMENT.md
Disallow: /.htaccess

User-agent: Bingbot
Crawl-delay: 10
Allow: /
Disallow: /images/
Disallow: /assets/
Disallow: /_site/
Disallow: /bin/
Disallow: /CNAME
Disallow: /README.md
Disallow: /DEVELOPMENT.md
Disallow: /.htaccess

User-agent: Slurp
Crawl-delay: 10
Allow: /
Disallow: /images/
Disallow: /assets/
Disallow: /_site/
Disallow: /bin/
Disallow: /CNAME
Disallow: /README.md
Disallow: /DEVELOPMENT.md
Disallow: /.htaccess

# Block aggressive/problematic crawlers
User-agent: MJ12bot
@@ -64,18 +85,14 @@ Crawl-delay: 10
Disallow: /images/
Disallow: /assets/
Disallow: /_site/
Disallow: /bin/
Disallow: /CNAME
Disallow: /README.md
Disallow: /DEVELOPMENT.md
Disallow: /.htaccess

# Allow access to main pages
Allow: /$
Allow: /allnews
Allow: /allnews.html
Allow: /team
Allow: /publications
Allow: /contact
Allow: /funding
Allow: /gallery
Allow: /openings
Allow: /sitemap.xml
# Allow access to main pages (everything else is allowed by default)
Allow: /

# Sitemap location (helps good crawlers index efficiently)
Sitemap: {{ site.url }}{{ site.baseurl }}/sitemap.xml
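
Review note: each search-engine group above pairs `Allow: /` with several `Disallow:` lines. Under RFC 9309 the most specific (longest) matching rule wins, and `Allow` wins a tie, so the `Disallow` entries still take effect. The sketch below illustrates that precedence with plain prefix matching only (no `*` or `$` wildcards); `is_allowed` and `GOOGLEBOT_RULES` are hypothetical names, and this is not how any particular crawler is implemented.

```python
# Rules copied from the Googlebot group in this diff, in file order.
GOOGLEBOT_RULES = [
    ("allow", "/"),
    ("disallow", "/images/"),
    ("disallow", "/assets/"),
    ("disallow", "/_site/"),
    ("disallow", "/bin/"),
    ("disallow", "/CNAME"),
    ("disallow", "/README.md"),
    ("disallow", "/DEVELOPMENT.md"),
    ("disallow", "/.htaccess"),
]


def is_allowed(rules, path):
    """Longest-prefix match; on equal length, an allow rule wins."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_prefers_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_prefers_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed
```

With these rules, `is_allowed(GOOGLEBOT_RULES, "/team")` is true and `is_allowed(GOOGLEBOT_RULES, "/images/logo.png")` is false. Some parsers evaluate rules in file order instead of by length; for those, a leading `Allow: /` would shadow the `Disallow` lines, so placing `Allow: /` last in each group may be the safer ordering.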
24 changes: 8 additions & 16 deletions sitemap.xml
@@ -4,25 +4,17 @@ permalink: /sitemap.xml
---
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{% for page in site.pages %}
{% if page.url == nil %}
{% continue %}
{% endif %}

{% if page.exclude_from_sitemap == true %}
{% continue %}
{% endif %}

{% if page.url == "/404.html" or page.url == "/sitemap.xml" or page.url == "/robots.txt" %}
{% continue %}
{% endif %}

{% if page.url contains ".css" or page.url contains ".js" or page.url contains ".xml" or page.url contains ".txt" %}
{% continue %}
{% endif %}
{% assign pages_list = site.pages | where_exp: "p", "p.url != nil" %}
{% for page in pages_list %}
{% if page.exclude_from_sitemap == true %}{% continue %}{% endif %}
{% if page.url == "/404.html" or page.url == "/sitemap.xml" or page.url == "/robots.txt" %}{% continue %}{% endif %}
{% if page.url contains ".css" or page.url contains ".js" or page.url contains ".xml" or page.url contains ".txt" %}{% continue %}{% endif %}

<url>
<loc>{{ site.url }}{{ site.baseurl }}{{ page.url | replace: "index.html", "" }}</loc>
{% if page.last_modified_at %}
<lastmod>{{ page.last_modified_at | date_to_xmlschema }}</lastmod>
{% endif %}
</url>
{% endfor %}
</urlset>
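
Review note: the filtering in the template above can be mirrored outside Liquid to reason about which pages end up in the sitemap. This is a rough sketch under stated assumptions: pages are modelled as plain dictionaries with `url` and `exclude_from_sitemap` keys, and `sitemap_locs` and the `example.org` values are hypothetical; Jekyll itself does not expose such a function.

```python
EXCLUDED_URLS = {"/404.html", "/sitemap.xml", "/robots.txt"}
EXCLUDED_SUBSTRINGS = (".css", ".js", ".xml", ".txt")


def sitemap_locs(pages, site_url, baseurl=""):
    """Return the <loc> values the template would emit for `pages`."""
    locs = []
    for page in pages:
        url = page.get("url")
        if url is None:  # where_exp: "p.url != nil"
            continue
        if page.get("exclude_from_sitemap") is True:
            continue
        if url in EXCLUDED_URLS:
            continue
        # Liquid `contains` on a string is a substring test.
        if any(part in url for part in EXCLUDED_SUBSTRINGS):
            continue
        # Same effect as `replace: "index.html", ""`.
        locs.append(site_url + baseurl + url.replace("index.html", ""))
    return locs


pages = [
    {"url": "/"},
    {"url": "/team/index.html"},
    {"url": "/404.html"},
    {"url": "/assets/main.css"},
    {"url": "/data.json"},
    {"url": "/drafts/", "exclude_from_sitemap": True},
    {"url": None},
]

locs = sitemap_locs(pages, "https://example.org")
```

Only `/` and `/team/` survive here. Because the extension check is a substring test, `/data.json` is also dropped (it contains `.js`); that behavior is the same before and after this change.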