I’ve been scraping a website with OkHttp3 for a while now, but the site recently upgraded some sections to use OpenID bearer auth, and my requests to those sections now fail. In Chrome dev tools I can see the bearer token being sent with requests to those specific parts.
When I manually add the bearer token from my browser to OkHttp3, it works fine. Without it, I get a 401 error.
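For reference, here’s roughly what the working manual version looks like (the endpoint is a placeholder, and token holds the string I copy out of dev tools):

import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

OkHttpClient client = new OkHttpClient();
String token = "..."; // pasted by hand from Chrome dev tools; expires after a while

Request request = new Request.Builder()
        .url("https://example.com/api/data") // placeholder endpoint
        .header("Authorization", "Bearer " + token)
        .build();

try (Response response = client.newCall(request).execute()) {
    System.out.println(response.code()); // 200 with the token, 401 without
}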
I thought my browser emulation wasn’t good enough, so I tried HtmlUnit (a headless browser for Java). It let me scrape some parts, but it still failed on the updated sections, and I couldn’t find the bearer token anywhere in the responses, headers, or cookies.
Is there any way to make this headless browser approach work? Or are there other methods I should try? I’m stuck and could really use some advice on how to get that OpenID bearer token automatically.
Here’s a simple code example of what I’ve tried:
import java.util.List;
import java.util.Set;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.Cookie;
import com.gargoylesoftware.htmlunit.util.NameValuePair;

try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("https://example.com");
    // Look for the token anywhere in the rendered page
    String pageContent = page.asXml();
    // Check the response headers and cookies for the token
    List<NameValuePair> responseHeaders = page.getWebResponse().getResponseHeaders();
    Set<Cookie> cookies = webClient.getCookieManager().getCookies();
}
I’ve faced similar challenges with OpenID bearer tokens. One approach that worked for me was to use Puppeteer, a Node.js library that provides a high-level API to control Chrome or Chromium. It can execute JavaScript, which is crucial when the token is generated dynamically.
The flow is: launch a headless Chrome instance, navigate to the login page and complete the authentication process, extract the bearer token from localStorage or sessionStorage, and then use that token in your subsequent requests. Puppeteer lets you interact with the page as a real user would, which can help bypass some anti-scraping measures. While this method may be slower than direct HTTP requests, it tends to be more reliable for scenarios involving complex authentication.
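Since Puppeteer is Node.js, here’s a minimal sketch of that flow in JavaScript, assuming a hypothetical login form and guessing that the app keeps the token in localStorage under a key like access_token (check dev tools for the real key and selectors):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Hypothetical login flow; the selectors depend on the actual site
  await page.goto('https://example.com/login');
  await page.type('#username', 'user');
  await page.type('#password', 'pass');
  await Promise.all([page.waitForNavigation(), page.click('#login-button')]);

  // Assumed storage key; the app may use sessionStorage instead
  const token = await page.evaluate(() => localStorage.getItem('access_token'));
  console.log(token);

  await browser.close();
})();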
Remember to respect the website’s terms of service and rate limits when scraping.
Have you tried Selenium instead? It’s pretty good for this kind of thing and might make grabbing that token easier. Also, check whether the site uses JavaScript to generate the token; if so, you’ll need something that can execute JS. Good luck!
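For example, if the token ends up in localStorage, Selenium’s JavascriptExecutor can read it after login. A rough Java sketch (the storage key is a guess, so verify it in dev tools):

import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless=new");
WebDriver driver = new ChromeDriver(options);
try {
    driver.get("https://example.com");
    // ... perform the site's login steps here and wait for the app to load ...

    // Assumed storage key; the app may use sessionStorage instead
    String token = (String) ((JavascriptExecutor) driver)
            .executeScript("return window.localStorage.getItem('access_token');");
    System.out.println("Bearer token: " + token);
} finally {
    driver.quit();
}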
I’ve had success using Playwright for similar situations. It’s a powerful tool that supports multiple browsers and languages, including Java. Unlike HtmlUnit, Playwright can fully emulate modern browsers, which is crucial when handling complex authentication like OpenID.
In my experience, the process involves setting up Playwright with your preferred browser engine (Chromium, Firefox, or WebKit), navigating to the login page to complete the authentication steps, and then extracting the bearer token from localStorage or sessionStorage via JavaScript evaluation. Once obtained, the token can be used with OkHttp3 for subsequent requests. Playwright handles dynamic content and JavaScript execution reliably and can often be faster than Selenium. Just ensure you implement proper error handling, and respect the website’s terms of service and rate limits when scraping.
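A hedged sketch of that flow with Playwright for Java, again assuming a hypothetical login form and a guessed localStorage key, with the extracted token handed back to OkHttp3 at the end:

import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

try (Playwright playwright = Playwright.create()) {
    Browser browser = playwright.chromium().launch(
            new BrowserType.LaunchOptions().setHeadless(true));
    Page page = browser.newPage();

    // Hypothetical login flow; the selectors depend on the actual site
    page.navigate("https://example.com/login");
    page.fill("#username", "user");
    page.fill("#password", "pass");
    page.click("#login-button");
    page.waitForLoadState();

    // Assumed storage key; verify the real one in dev tools
    String token = (String) page.evaluate("() => localStorage.getItem('access_token')");
    browser.close();

    // Hand the fresh token to OkHttp3 for the fast, direct requests
    OkHttpClient client = new OkHttpClient();
    Request request = new Request.Builder()
            .url("https://example.com/api/data") // placeholder endpoint
            .header("Authorization", "Bearer " + token)
            .build();
    try (Response response = client.newCall(request).execute()) {
        System.out.println(response.code());
    }
}

Since the browser launch is the slow part, it’s worth caching the token and only re-running the Playwright step when your OkHttp3 requests start coming back 401.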