Web Scraping on Android Using a Headless Browser

CreatingStone · December 17, 2024, 10:24pm

I’m searching for a library to help with the following tasks:

Access and retrieve complete webpage content in the background without displaying it.
Handle web pages that utilize AJAX for loading additional data post-initial HTML load.
Extract specific elements using XPath or CSS selectors from the fetched HTML.
Ideally, be able to navigate through pages and interact with buttons or links programmatically in the future.

Here’s what I’ve attempted without success:

Jsoup: Efficient for fetching HTML but lacks support for JavaScript or AJAX, thus does not present the complete page content.
Android HttpEntity: Faces the same challenges as Jsoup regarding JavaScript and AJAX.
HtmlUnit: Seems like a great fit; however, I’ve struggled for hours to get it working on Android due to limitations, including failures with large JAR files and missing packages like Applets and java.awt.
Rhino: I find it confusing, and I’m unsure how to implement it or whether it even meets my needs.
Selenium Driver: It looks like a viable option, but running it fully headless, without rendering the HTML, seems complex.

I really hope to make HtmlUnit work since it appears to be what I need. Is there another suitable library I might have overlooked?

I’m currently utilizing Android Studio 0.1.7 but can switch to Eclipse if necessary.

Thank you for your assistance!

Bob_Clever · December 29, 2024, 3:51am

Hey CreatingStone,

If HtmlUnit is a struggle, a good alternative for web scraping on Android is Puppeteer via Node.js and Termux. Puppeteer can run headless, supports AJAX, and handles JavaScript effectively.

Here’s a quick way to set it up:


pkg install nodejs
npm install puppeteer

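Puppeteer can handle complete page loads, including content fetched via AJAX, as well as element extraction. Termux gives you a Linux-like environment on the device in which Node.js actually runs, which is more flexible than trying to fit HtmlUnit into an Android build. Keep in mind that this runs outside your app, and as far as I know the browser build Puppeteer downloads by default targets desktop platforms, so you may need to point it at a Chromium binary that works under Termux.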

Hope this helps!

Ethan99 · December 29, 2024, 4:46pm

In addition to Bob_Clever's suggestion of using Puppeteer with Termux, you might consider using an Android WebView with JavaScript enabled to do the scraping. It can be integrated directly into your Android app, which gives you more direct control over how pages are loaded and inspected.

Here's how you can proceed with Android WebView:


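import android.webkit.ValueCallback;
import android.webkit.WebView;
import android.webkit.WebViewClient;

// Create and use the WebView on the main (UI) thread.
WebView webView = new WebView(context);
webView.getSettings().setJavaScriptEnabled(true);
// Register the client before loading so the page-finished callback is not missed.
webView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        // A NodeList does not serialize usefully, so map each match to its outerHTML.
        view.evaluateJavascript(
            "Array.prototype.map.call(document.querySelectorAll('your-css-selector'),"
                + " function (e) { return e.outerHTML; })",
            new ValueCallback<String>() {
                @Override
                public void onReceiveValue(String json) {
                    // json is a JSON-encoded array of HTML strings; parse and process it here.
                }
            }
        );
    }
});
webView.loadUrl("https://your-target-website.com");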

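Because the page runs in a real browser engine, its JavaScript and AJAX requests execute just as they would in a regular browser. Note that onPageFinished can fire before AJAX-loaded content has arrived, so you may need to re-run the extraction after a short delay, or poll until the element you want appears. You can then extract elements with JavaScript and CSS selectors, much like DOM manipulation in a regular browser environment.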
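As a rough sketch of the extraction step, the script passed to evaluateJavascript can be built from a selector in plain Java. The class and method names here (ScrapeScripts, jsString, extractOuterHtml, extractByXPath) are made up for illustration and are not part of any library; only querySelectorAll, document.evaluate, and outerHTML are standard browser APIs. Quoting the selector as a JavaScript string literal keeps the script valid when the selector itself contains quotes.

```java
/** Hypothetical helper (not part of any library) that builds extraction scripts. */
final class ScrapeScripts {
    private ScrapeScripts() {}

    /** Quotes text as a double-quoted JavaScript string literal. */
    static String jsString(String text) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20 || c == 0x2028 || c == 0x2029) {
                // Control characters and JS line separators become unicode escapes.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Script returning the outerHTML of every element matching a CSS selector. */
    static String extractOuterHtml(String cssSelector) {
        return "Array.prototype.map.call(document.querySelectorAll("
                + jsString(cssSelector)
                + "), function (e) { return e.outerHTML; })";
    }

    /** Script returning the outerHTML of every element node matching an XPath. */
    static String extractByXPath(String xpath) {
        return "(function () { var r = document.evaluate(" + jsString(xpath)
                + ", document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);"
                + " var out = [];"
                + " for (var i = 0; i < r.snapshotLength; i++) {"
                + " out.push(r.snapshotItem(i).outerHTML); }"
                + " return out; })()";
    }
}
```

With this in place, the call inside onPageFinished would look like view.evaluateJavascript(ScrapeScripts.extractOuterHtml("div.item"), callback), where "div.item" is a placeholder selector. The callback should then receive a JSON-encoded array of HTML strings.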

However, keep in mind that WebView is not truly headless: it still renders the page, although you can keep it hidden from the user. If you need a purely headless solution with no view component at all, this approach will not fit.

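For more complex navigation and interaction, such as clicking buttons or following links, you can inject JavaScript that calls click() on the target element, or simply call loadUrl with the next address. UiAutomator might also be worth a look as a complement to WebView, though it is primarily a UI testing framework, so check whether it suits an app you plan to ship.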
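For the interaction side, a minimal sketch along the same lines: small scripts that click an element and check whether content has appeared yet, plus a check on the value the callback receives. InteractionScripts and its method names are hypothetical, and the selectors in the usage note are placeholders. The quoting here only covers quotes and backslashes, which I'm assuming is enough for typical selectors.

```java
/** Hypothetical helper (not part of any library) for simple page interaction. */
final class InteractionScripts {
    private InteractionScripts() {}

    /** Quotes text as a JavaScript string literal, escaping quotes and backslashes. */
    static String quote(String text) {
        return "\"" + text.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Script that evaluates to true once an element matching the selector exists. */
    static String exists(String cssSelector) {
        return "document.querySelector(" + quote(cssSelector) + ") !== null";
    }

    /** Script that clicks the first match and evaluates to true, or false if none. */
    static String click(String cssSelector) {
        return "(function () { var e = document.querySelector(" + quote(cssSelector)
                + "); if (!e) { return false; } e.click(); return true; })()";
    }

    /** evaluateJavascript delivers a JSON-encoded value, so a boolean arrives as "true". */
    static boolean isTrue(String callbackValue) {
        return "true".equals(callbackValue);
    }
}
```

One way to use it: run InteractionScripts.click("a.next") through evaluateJavascript, then re-run InteractionScripts.exists("#results") every few hundred milliseconds (for example with Handler.postDelayed) until isTrue returns true for the callback value, and only then extract the content.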