There is a predictable pattern with website downtime: the developer finds out from a customer, a tweet, or a Slack message from a colleague who happened to visit the site. The average detection time without monitoring is 3–4 hours. With monitoring, it is about one check interval: under 60 seconds with 1-minute checks.
This checklist gives you a concrete setup to run through before every launch — whether it is a new project, a major feature, or a new microservice. It takes about 20 minutes to set up and will save you hours the first time something breaks.
This checklist assumes you have access to an uptime monitoring tool. Uptime Assure's free plan covers all the monitor types described here: HTTP, SSL, DNS, and keyword monitors.
1. HTTP Uptime Monitor
This is the foundation. An HTTP monitor sends a GET request to your URL every 1–5 minutes from an external server and checks that it gets a 2xx response back within a timeout window.
What to configure:
- Monitor the canonical URL — if you redirect www to non-www (or vice versa), monitor the destination, not the redirect source.
- Set a reasonable timeout — 10–15 seconds is standard. Most sites should respond in under 2 seconds; 10 seconds catches infrastructure failures without too many false positives.
- Add your most critical pages individually — homepage, /login, /checkout, /api/health. A broken checkout page that still returns 200 on the homepage will go undetected otherwise.
- Consider check interval — free plans often offer 5-minute checks. For production SaaS or e-commerce, 1–2 minute checks are worth paying for.
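To make the pass/fail rule concrete, here is a minimal Python sketch of a single HTTP check using only the standard library. The function names (`is_healthy`, `check_url`) are illustrative and not part of any monitoring tool's API; a hosted monitor does the same thing from external servers on a schedule.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple


def is_healthy(status: int, elapsed_s: float, timeout_s: float = 10.0) -> bool:
    """A check passes only on a 2xx status returned within the timeout window."""
    return 200 <= status < 300 and elapsed_s <= timeout_s


def check_url(url: str, timeout_s: float = 10.0) -> Tuple[bool, Optional[int], float]:
    """Send one GET request and return (healthy, status, elapsed seconds)."""
    start = time.monotonic()
    try:
        # urlopen follows redirects, so the status belongs to the final URL.
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx and 5xx responses are raised as HTTPError
    except OSError:
        # Connection refused, DNS failure, TLS error, or socket timeout.
        return False, None, time.monotonic() - start
    elapsed = time.monotonic() - start
    # The socket timeout applies per operation, so total time is checked too.
    return is_healthy(status, elapsed, timeout_s), status, elapsed
```

Calling `check_url("https://example.com/")` returns a tuple such as `(True, 200, 0.31)`; the URL here is only a placeholder for your canonical URL.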
Add a dedicated /health endpoint to your application that checks database connectivity, cache availability, and any critical third-party integrations. Monitor that URL instead of (or in addition to) the homepage — it gives you much more useful signal.
2. SSL Certificate Monitor
An SSL monitor checks your certificate expiry date daily and alerts you well before it expires. The most common trigger for SSL failures is not expiry itself but a broken auto-renewal — and you will only know the renewal broke if you are watching the expiry date.
- Set alerts at 30 days, 14 days, and 7 days before expiry.
- Monitor every domain and subdomain that serves HTTPS traffic separately.
- If you use a wildcard certificate, still add individual monitors — they all share the same expiry date, but the monitor confirms the cert is being served correctly on each subdomain.
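As a rough illustration of what an SSL monitor does, the following Python sketch reads a certificate's expiry date and reports which of the 30/14/7-day thresholds have been reached. The function names are hypothetical, and the sketch assumes the certificate still validates: an already expired or misconfigured certificate makes the handshake itself fail, which should be treated as an alert in its own right.

```python
import socket
import ssl
from datetime import datetime, timezone

ALERT_DAYS = (30, 14, 7)  # alert thresholds from the list above


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days left, given the certificate's notAfter string."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def crossed_thresholds(days_left: int) -> list:
    """Return every alert threshold that has already been reached."""
    return [d for d in ALERT_DAYS if days_left <= d]


def fetch_not_after(hostname: str, port: int = 443, timeout_s: float = 10.0) -> str:
    """Fetch the notAfter field of the certificate served for hostname."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout_s) as sock:
        # server_hostname sends SNI, so each subdomain is checked on its own.
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

Running `fetch_not_after` once per subdomain mirrors the advice to monitor each hostname separately, even behind a shared wildcard certificate.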
3. DNS Monitor
DNS failures are insidious — your server can be completely healthy, but if your DNS resolution is broken, nobody can reach you. A DNS monitor checks that your domain resolves to the expected IP address.
- Add a DNS monitor for your apex domain and at least one subdomain.
- If you use Cloudflare or another CDN proxy, the resolved IP will be the CDN's IP — that is expected. What you are watching for is the domain failing to resolve at all.
- DNS alerts are especially valuable for catching domain expiry (the domain stops resolving once the registration lapses), accidental record deletion, and DNS propagation failures after a migration.
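The checks in this list can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any particular monitoring tool works: `resolve_ipv4` and `check_dns` are hypothetical names, and the sketch uses the local system resolver, whereas a real monitor queries from external locations. Passing `expected=None` covers the CDN case, where only a total failure to resolve matters.

```python
import socket


def resolve_ipv4(hostname: str) -> set:
    """Resolve hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def check_dns(hostname: str, expected=None, resolver=resolve_ipv4) -> str:
    """Return 'ok', 'unresolvable', or 'unexpected' for one hostname."""
    try:
        addresses = resolver(hostname)
    except socket.gaierror:
        return "unresolvable"  # the domain did not resolve at all
    if not addresses:
        return "unresolvable"
    # With expected=None (for example behind a CDN) any answer is accepted.
    if expected is not None and not addresses & set(expected):
        return "unexpected"
    return "ok"
```

For example, `check_dns("example.com", expected={"203.0.113.10"})` returns `"unexpected"` if the record was changed; the address shown is from a range reserved for documentation, so substitute your own.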
Domain expiry is a separate but related risk. If your domain registration lapses, your DNS stops resolving entirely — no website, no email, no API. Check your domain expiry date right now and set a calendar reminder at 60 days before.
4. Establish a Response Time Baseline
A site that is "up" but taking 8 seconds to load is functionally broken — conversion rates drop steeply after 3 seconds, and Google uses Core Web Vitals in ranking. Response time monitoring catches performance regressions before they become support tickets.
How to use this effectively:
1. Let your monitor run for 48 hours after launch to establish a baseline. Note the average and P95 response time.
2. Set a response time alert threshold at 2–3× your baseline. If your homepage normally loads in 400ms and it suddenly takes 1.2s, something changed.
3. Review response time trends weekly. A gradual slowdown over weeks often signals a growing database query, a memory leak, or accumulating technical debt.
4. Check response times from the geographic region your users are in. If your server is in US-East and most of your users are in India, add monitoring from an Asia-Pacific check location.
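The baseline arithmetic can be sketched as follows, assuming you can export raw response times in milliseconds from your monitoring tool. The function names are illustrative, and the P95 uses the simple nearest-rank method; tools may compute percentiles slightly differently.

```python
import statistics


def p95(samples_ms):
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def baseline(samples_ms):
    """Summarise the first 48 hours of samples as mean and P95."""
    return {"mean_ms": statistics.fmean(samples_ms), "p95_ms": p95(samples_ms)}


def alert_threshold_ms(baseline_ms, multiplier=3.0):
    """Alert threshold at 2-3x the baseline; the multiplier is configurable."""
    return baseline_ms * multiplier
```

With a 400ms baseline and the default multiplier, `alert_threshold_ms(400)` gives 1200.0, which matches the 400ms to 1.2s example above.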
5. Keyword Monitor for Critical Pages
A keyword monitor checks that a specific string of text is present in the page response. This catches the category of broken deployment that is hardest to detect: your server returns 200 OK, but the content is wrong.
Common scenarios a keyword monitor catches that HTTP monitoring misses:
- A bad deployment serves a generic "App Error" page or a blank screen with status 200.
- Your CDN serves a stale cached error page.
- A database failure causes your site to show a fallback template instead of real content.
- A third-party script injection replaces your content.
- Your login page loads but the password field is missing due to a JavaScript error.
Set up keyword monitors for: your homepage (monitor for your brand name or a unique phrase), your login page (monitor for "password" or "sign in"), and your most critical conversion page (monitor for "Add to cart", "Buy now", or a form field name).
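The core of a keyword check is small enough to show directly. This Python sketch uses a hypothetical function name and assumes you already have the status code and response body from an HTTP request. Note that it only sees the raw response body, so content rendered later by client-side JavaScript would need a browser-based check instead.

```python
def keyword_check(status: int, body: str, keyword: str, case_sensitive: bool = False) -> bool:
    """Pass only if the response is 2xx AND the expected text is present."""
    if not 200 <= status < 300:
        return False
    if case_sensitive:
        return keyword in body
    return keyword.lower() in body.lower()
```

A generic error page served with status 200 passes a plain HTTP check but fails here: `keyword_check(200, "<h1>App Error</h1>", "Add to cart")` returns `False`.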
6. API Endpoint Monitors
If your product has an API — whether public or consumed by your own frontend — each critical endpoint deserves its own monitor. API failures are often completely invisible to HTTP monitors watching the frontend.
- Add an HTTP monitor for your /api/health or /api/status endpoint. Return a 200 if all dependencies are healthy, 503 if not.
- Monitor authentication endpoints (/api/login, /api/auth/token) — broken auth means your entire product is inaccessible even if the UI loads fine.
- For webhook-consuming endpoints, add a simple GET handler at /webhooks/health that returns 200. This lets you confirm the route is registered and reachable.
- Consider adding synthetic transaction monitors for critical flows (sign up → verify email → first login) if your monitoring tool supports it.
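The /api/health behaviour described in the first bullet (200 if all dependencies are healthy, 503 if not) might look like this in Python's standard library. Everything here is a sketch under assumptions: `check_database` and `check_cache` are placeholders you would replace with real probes, and in practice you would register the route in your own web framework rather than use `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    """Placeholder: replace with a real probe such as a SELECT 1 query."""
    return True


def check_cache() -> bool:
    """Placeholder: replace with a real cache ping."""
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def health_report(checks=CHECKS):
    """Run every check; a check that raises counts as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/api/health":
            self.send_error(404)
            return
        status, results = health_report()
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To try it locally:
# HTTPServer(("127.0.0.1", 8000), HealthHandler).serve_forever()
```

Returning the per-dependency results in the body means the alert itself tells the responder which dependency failed.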
7. Public Status Page
A public status page does two things: it gives your users a place to check when they are having trouble (reducing support tickets), and it signals that you take reliability seriously.
- The status page must be hosted independently from your main application — if your site is down, the status page should still load.
- Use a subdomain like status.yoursite.com rather than yoursite.com/status. The subdomain can point to an external service; the path-based URL shares your server's fate.
- Show the current status of individual components (API, Dashboard, Email Delivery, Payments) rather than just a single overall status.
- Include historical uptime data — a 30-day or 90-day uptime history builds confidence in your reliability story.
- Link to your status page from your footer, your login page, and your error pages.
8. Configure Alert Routing
An alert that goes to the wrong place — or to nobody at 3 AM — is not an alert. Think through alert routing before launch, not during an incident.
- At minimum: email alerts to a shared team inbox, not just a personal address.
- For faster response: Slack or Discord webhook to your on-call channel.
- For critical production services: phone call or SMS escalation for incidents lasting more than 5 minutes.
- Set severity levels — a 5-minute outage warrants a Slack message; a 30-minute outage should wake someone up.
- Define on-call rotation if you have a team — a single person being always-on-call leads to burnout and slow response.
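One way to pin down these routing rules before an incident is to write them as a small function. The sketch below is an interpretation of the list above, not a feature of any specific tool: the channel names and the `alert_channels` function are hypothetical, and the 5-minute and 30-minute cutoffs are taken from the bullets.

```python
def alert_channels(outage_minutes: float, critical_service: bool = False) -> list:
    """Decide which channels to notify for an outage of the given length."""
    # Every outage goes to the shared inbox and the on-call chat channel.
    channels = ["team_email", "chat_webhook"]
    # Critical services escalate after 5 minutes; anything lasting
    # 30 minutes or more should wake someone up.
    escalate = (critical_service and outage_minutes > 5) or outage_minutes >= 30
    if escalate:
        channels.append("phone_or_sms")
    return channels
```

Writing the rule down this way makes the thresholds easy to review with the team and to adjust after the first real incident.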
9. Write an Incident Runbook
A runbook is a short document that tells whoever receives an alert exactly what to do. Writing it before launch forces you to think through failure scenarios when you are calm — not when you are panicking at 2 AM.
Your runbook should answer:
1. Who is the first responder? Who escalates if they are unavailable?
2. Where are the logs? (Server logs, application logs, error tracking like Sentry)
3. How do you restart the application server? The database? The queue worker?
4. What is the rollback procedure for a bad deployment?
5. Who are the contacts at your hosting provider and DNS registrar?
6. Where is the status page admin, and how do you post an incident update?
A runbook does not need to be long. A single page in Notion or a Markdown file in your repo is enough. The discipline of writing it — not the document itself — is what matters.
10. Test Everything Before You Go Live
This step is skipped more often than any other — and it is the most important. An alert configuration that has never fired may not fire when you need it. Verify every channel before launch:
1. Trigger a test alert from your monitoring tool and confirm you receive it at the configured email and Slack channel.
2. Force a monitor to fail (for example, by temporarily pointing it at an intentionally incorrect URL) and confirm the down alert fires within your expected window.
3. Restore the correct URL and confirm the recovery alert fires.
4. If you have SMS or phone call alerts, call the number to confirm it reaches the right person.
5. Open your status page in incognito mode and confirm it loads correctly from outside your network.
6. Walk through your incident runbook end-to-end as a dry run: can you actually find the logs? Can you restart the service?
Schedule a 30-minute "fire drill" on your calendar one week after launch. Simulate a real incident: someone pretends the site is down, the team follows the runbook, someone posts a status update. Find the gaps before a real incident does.
Uptime Assure Team
Monitoring experts · Based in India
Written by the team behind Uptime Assure — developers and reliability engineers who build and use uptime monitoring tools every day. We write about website reliability, performance, and the practical side of keeping services online.