I’m working with an Airtable base that contains around 24,000 website URLs. Many of these links have formatting issues like missing slashes or extra spaces that cause them to break. I need to identify which URLs are problematic so I can fix them manually.
My current approach
I’ve been using a fetch request to test each URL and check its status. Here’s my current code:
const config = input.config();
const websiteUrl = config.websiteUrl;
let responseStatus;
try {
    const result = await fetch(websiteUrl);
    responseStatus = result.status;
} catch (err) {
    responseStatus = 'failed';
}
output.set('responseStatus', responseStatus);
Problems I’m facing
- My script doesn’t handle redirects properly and returns ‘failed’ even when the URL works after redirecting
- I only get either ‘200’ for success or ‘failed’ for errors, but I want to see the actual HTTP status codes like 404, 500, etc.
Can anyone help me improve this script to handle redirects and capture specific error codes? Thanks!
Been through this exact scenario when auditing URLs for a client database. Your script looks fine, but there’s a catch - fetch() doesn’t throw errors for HTTP status codes like 404 or 500. It only throws for network failures or CORS issues. Even a 500 server error returns a response object with the status code.
Your URLs probably have formatting issues causing network-level failures before any HTTP response gets generated. Try trimming whitespace and validating URL format first with new URL(websiteUrl) wrapped in its own try-catch before the fetch. This’ll catch malformed URLs separately from actual HTTP responses.
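Something like this as a pre-check (rough sketch, reusing the websiteUrl variable from your script; the 'invalid_url' label is just a placeholder):

// trim stray whitespace, then catch malformed URLs before wasting a network call
const cleanedUrl = websiteUrl.trim();
let urlIsValid = true;
try {
    new URL(cleanedUrl); // throws a TypeError when the URL is malformed
} catch (err) {
    urlIsValid = false;
}

if (!urlIsValid) {
    output.set('responseStatus', 'invalid_url');
} else {
    // ...run your existing fetch check against cleanedUrl as usual
}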
Also, some servers block requests without proper user-agent headers, so adding headers to your fetch options might reduce false negatives. For 24k URLs, definitely add some delay between requests to avoid getting rate limited.
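The headers part is just an extra option on the fetch call. Sketch only: the user-agent string below is arbitrary, and some environments ignore or override this header entirely.

// some servers reject requests that look like bots, so send an identifiable user-agent
const result = await fetch(cleanedUrl, {
    headers: {
        'User-Agent': 'Mozilla/5.0 (compatible; link-checker)'
    }
});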
Here’s what worked for me when I had to validate 50k+ URLs for a cleanup project:
const config = input.config();
const websiteUrl = config.websiteUrl;
let responseStatus;

// fetch has no built-in `timeout` option; an AbortController does the job
// (assuming your scripting environment provides it)
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), 10000); // 10 second timeout

try {
    const result = await fetch(websiteUrl, {
        method: 'HEAD', // faster than GET, no body download
        signal: controller.signal
    });
    responseStatus = result.status;
} catch (err) {
    if (err.name === 'AbortError') {
        responseStatus = 'timeout';
    } else if (err.name === 'TypeError') {
        responseStatus = 'network_error';
    } else {
        responseStatus = 'failed';
    }
} finally {
    clearTimeout(timer);
}

output.set('responseStatus', responseStatus);
Key change: HEAD requests instead of GET. Way faster since you don’t download the entire page.
Fetch handles redirects automatically (up to 20 by default). If you’re getting ‘failed’ on redirects, you’re probably hitting the redirect limit or timing out.
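If you want to confirm where a redirect actually lands, the response object tells you. Minimal sketch, assuming your script environment exposes the standard response properties:

const result = await fetch(websiteUrl, { redirect: 'follow' }); // 'follow' is already the default
if (result.redirected) {
    // result.url holds the final URL after all redirects were followed
    console.log(`${websiteUrl} redirected to ${result.url}`);
}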
Don’t forget rate limiting between requests or servers will block you thinking you’re scraping. Learned that one the hard way.
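If you end up looping over all the records in one script run rather than one automation per record, a tiny delay helper is enough. Rough sketch: records and checkUrl are placeholders for your own table query and fetch logic, and it assumes setTimeout exists in your environment.

// crude rate limiting: pause between requests so target servers don't block you
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

for (const record of records) {      // `records` = whatever table query you're iterating
    await checkUrl(record);          // `checkUrl` = your fetch/status logic from above
    await sleep(500);                // ~2 requests per second; tune as needed
}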
your fetch code’s mostly good, but you’re handling network errors and http errors the same way. check result.ok after the fetch - if it’s false, you can still grab result.status for 404s, 500s, etc. save the catch block for actual network failures only.
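roughly this shape (untested sketch, same variable names as your script):

try {
    const result = await fetch(websiteUrl);
    // 404s, 500s, etc. resolve normally; result.ok just means "status in the 200-299 range"
    responseStatus = result.status;
    if (!result.ok) {
        console.log(`http error ${result.status} for ${websiteUrl}`);
    }
} catch (err) {
    // only real network failures (dns, malformed url, aborted request) land here
    responseStatus = 'failed';
}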
I encountered similar issues while working with a large set of URLs last year. The key problem in your current approach is that the catch block collapses everything into ‘failed’, which hides what actually went wrong. With fetch, HTTP errors like 404 or 500 don’t throw at all; they resolve normally, and the status code is available on the response object, so read result.status inside the try block. The catch should only fire for genuine network-level failures (DNS errors, malformed URLs, aborted requests), and you can record those separately instead of lumping them in with HTTP errors. Also note that fetch follows redirects automatically, so if you’re still getting ‘failed’ after a redirect, it likely means the final URL itself failed. Implementing some logging during the fetch would help clarify what’s occurring, and given the scale of your project, consider adding a timeout mechanism so a hanging request can’t stall the entire run.
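A minimal version of the logging idea (sketch only, reusing the same variables; the label format is arbitrary):

try {
    const result = await fetch(websiteUrl);
    responseStatus = result.status; // 200, 404, 500, ... all surface here
} catch (err) {
    // log the error class and message so DNS failures, malformed URLs, and timeouts are distinguishable
    console.log(`Fetch failed for ${websiteUrl}: ${err.name} - ${err.message}`);
    responseStatus = `failed: ${err.name}`;
}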