I’m working with the HtmlUnit library and trying to automate a request that passes through multiple page redirects. I’m connecting to a reporting server that requires several redirect steps before it serves the final PDF document.
The workflow goes like this: I call a REST API endpoint which gives me a URL. When I open this URL in a regular browser, it goes through about three redirects and finally shows the PDF report. I want to replicate this same behavior programmatically.
Here’s my current code setup:
try (final WebClient client = new WebClient()) {
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    client.getOptions().setRedirectEnabled(true);
    // resultPage is declared outside this block; targetUrl comes from the REST API call
    resultPage = client.getPage(targetUrl);
}
The problem is that my code only reaches the second redirect but doesn’t get to the actual document. What’s the best approach to handle this? Should I manually track the redirects and make additional requests, or is there a simpler way to ensure I reach the final page?
Check whether those intermediate pages contain forms you need to submit rather than plain redirects. I hit the same issue: one of my redirect steps actually needed a form submission to continue. Use resultPage.asXml() or resultPage.asText() (asNormalizedText() in newer HtmlUnit releases, where asText() is no longer available) to inspect what’s happening at each step. Also try client.getOptions().setCssEnabled(false) to speed things up, since you just want the PDF. If you need manual navigation, get the HtmlForm from the page, set its fields via HtmlForm.getInputByName(), and submit it by clicking the form’s submit button.
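Try setting the timeout explicitly with client.getOptions().setTimeout(10000) (the value is in milliseconds, so this is 10 seconds) and raise it if the third redirect is slow to respond. Note that client.getOptions().setMaxInMemory(0) is not a redirect limit: it controls how much of a downloaded response is kept in memory before being written to a temporary file, so it won’t change how many redirects are followed. Also check for any JavaScript delays between redirects, since those can stop the chain before the final page loads.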
I’ve faced similar challenges with complex redirect chains. One approach that worked for me was calling client.waitForBackgroundJavaScript(5000) after each page load, since some of the "redirects" are triggered by JavaScript that runs after the page arrives rather than by an HTTP 3xx response. This gives the script time to execute before you read the result. It can also help to follow the Location headers manually so you can see how each hop is handled: disable automatic following with client.getOptions().setRedirectEnabled(false), read the header with getWebResponse().getResponseHeaderValue("Location"), and request that URL yourself. This is particularly useful because meta-refresh and JavaScript-based redirects don’t show up as Location headers at all. Setting a custom User-Agent via client.addRequestHeader() can also help avoid being flagged as a bot by some reporting servers.
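To see exactly where the chain stops, a hop-by-hop trace can help. The sketch below is only a diagnostic aid and deliberately uses the JDK’s java.net.http.HttpClient (Java 11+) instead of HtmlUnit, so it runs no JavaScript and submits no forms. The class name RedirectTracer, the method trace, and the maxHops parameter are made up for illustration. It assumes the server accepts plain GET requests and that any session state travels in cookies.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: follows Location headers one hop at a time.
final class RedirectTracer {

    /** Returns one "status url" entry per request made, in order. */
    static List<String> trace(URI start, int maxHops)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER) // we follow manually
                .cookieHandler(new CookieManager())          // keep session cookies across hops
                .build();

        List<String> hops = new ArrayList<>();
        URI current = start;
        for (int i = 0; i < maxHops; i++) {
            HttpRequest request = HttpRequest.newBuilder(current).GET().build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());

            int status = response.statusCode();
            hops.add(status + " " + current);

            Optional<String> location = response.headers().firstValue("Location");
            boolean isRedirect = status >= 300 && status < 400 && status != 304;
            if (!isRedirect || location.isEmpty()) {
                break; // final response, or a redirect without a target
            }
            // Location may be relative, so resolve it against the current URL.
            current = current.resolve(location.get());
        }
        return hops;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: RedirectTracer <url>");
            return;
        }
        trace(URI.create(args[0]), 10).forEach(System.out::println);
    }
}
```

If the last entry is a 200 on a URL that still isn’t the PDF, the remaining step is probably a meta-refresh, a JavaScript redirect, or a form submission, which is where HtmlUnit’s waitForBackgroundJavaScript or a manual form submit comes in.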