Is there a way to run my web scraping app on Heroku without headless mode?

I’ve built a web scraping app that works fine on my computer but runs into issues when deployed on Heroku. The app uses Chrome browser without headless mode. Here’s a snippet of my code:

from flask import Flask, request
from splinter import Browser
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)

@app.route('/scrape', methods=['POST'])
def scrape():
    # URL to scrape, taken from the POST body
    target_url = request.form["url"]
    chrome_options = Options()
    chrome_options.add_argument("--start-maximized")
    browser = Browser('chrome', options=chrome_options)
    browser.visit(target_url)
    # ... rest of the scraping code

When I use headless mode, the app starts but can't scrape the data properly: the rendered HTML differs from what I see in a normal browser, and some JavaScript-heavy sites never finish loading. I've tried adding delays, but it doesn't help.

Does anyone know how to make this work on Heroku without headless mode? Or if I must use headless, how can I fix the loading issues? I’m scraping from two sites that need full browser functionality.

Should I consider scraping locally and setting up a database for the app to use instead? Any advice would be appreciated!

Running a full, non-headless Chrome browser on Heroku is difficult because standard dynos have no display server and tight memory limits. Have you considered a service like BrowserStack or Sauce Labs? These provide cloud-based browser instances that you can control remotely. You'd need to modify your code to connect to their remote WebDriver endpoint, but it could meet your non-headless browser needs without the Heroku constraints.
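A rough sketch of what that connection could look like with Selenium's Remote WebDriver, assuming a BrowserStack-style hub URL and placeholder credential environment variables (check your provider's docs for the exact endpoint and auth scheme):

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Placeholder credentials and hub URL -- replace with values from your provider's docs
BROWSERSTACK_USER = os.environ["BROWSERSTACK_USER"]
BROWSERSTACK_KEY = os.environ["BROWSERSTACK_KEY"]
hub_url = f"https://{BROWSERSTACK_USER}:{BROWSERSTACK_KEY}@hub-cloud.browserstack.com/wd/hub"

chrome_options = Options()
chrome_options.add_argument("--start-maximized")

# The browser runs on the provider's machines; your Heroku app only drives it remotely
driver = webdriver.Remote(command_executor=hub_url, options=chrome_options)
driver.get("https://example.com")
print(driver.title)
driver.quit()

The rest of your scraping logic should carry over, since the remote driver exposes the same WebDriver API you're already using through splinter/Selenium.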

Alternatively, if you're open to a different approach, you might look into Puppeteer, a Node.js library that drives headless Chrome. It's built with JavaScript-heavy sites in mind and gives you explicit controls for waiting until a page has fully rendered (for example, until the network goes idle). You'd need to rewrite some of your scraping logic, but it could be a more robust solution in the long run.
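Since your stack is Python, a closer-to-home option is pyppeteer, an unofficial Python port of Puppeteer. A minimal sketch, assuming a placeholder URL, might look like this:

import asyncio

from pyppeteer import launch

async def scrape(url):
    browser = await launch(headless=True)
    page = await browser.newPage()
    # Wait until the network has gone idle, so JavaScript-rendered content is in the DOM
    await page.goto(url, waitUntil="networkidle0")
    html = await page.content()
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(scrape("https://example.com"))

The waitUntil option is the key difference from fixed delays: it ties the wait to actual page activity rather than a guess about how long loading takes.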

Lastly, if these options don’t work, your idea of scraping locally and using a database might be the best compromise. It would allow you to use your current setup while keeping your Heroku app simple and scalable.

I’ve faced similar challenges with web scraping on Heroku. Running a full browser instance on Heroku can be tricky due to resource limitations and the ephemeral nature of dynos. Here’s what worked for me:

Instead of running the scraper directly on Heroku, I set up a separate EC2 instance on AWS to handle the scraping tasks. This allowed me to run Chrome in non-headless mode (with a virtual display such as Xvfb) without issues. I then used a message queue (RabbitMQ) to communicate between my Heroku app and the EC2 scraper.
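In rough terms, the Heroku side can publish jobs like this. This is only a sketch using pika, with a hypothetical scrape_jobs queue, a made-up job payload, and a RABBITMQ_URL environment variable standing in for your broker's connection string:

import json
import os

import pika

# Connection string for the RabbitMQ broker (e.g. from a CloudAMQP add-on)
params = pika.URLParameters(os.environ["RABBITMQ_URL"])
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Durable queue so jobs survive a broker restart; the name is arbitrary
channel.queue_declare(queue="scrape_jobs", durable=True)

# Publish one scraping request; the EC2 worker consumes it later
job = {"url": "https://example.com", "requested_by": "web-dyno"}
channel.basic_publish(
    exchange="",
    routing_key="scrape_jobs",
    body=json.dumps(job),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()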

The Heroku app would send scraping requests to the queue, and the EC2 instance would pick them up, perform the scraping, and store the results in a shared database (I used PostgreSQL on RDS). This approach gave me the full browser functionality I needed while keeping my Heroku app lightweight.
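The worker side could consume those jobs and write results to Postgres roughly like this. Again a sketch, not my exact code, with a hypothetical scrape_results table and placeholder environment variables:

import json
import os

import pika
import psycopg2
from splinter import Browser

conn = psycopg2.connect(os.environ["DATABASE_URL"])

def handle_job(channel, method, properties, body):
    job = json.loads(body)
    # Full (non-headless) Chrome works here because the EC2 box has a virtual display
    browser = Browser("chrome")
    browser.visit(job["url"])
    html = browser.html
    browser.quit()

    # Hypothetical results table: scrape_results(url text, html text, scraped_at timestamptz)
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO scrape_results (url, html, scraped_at) VALUES (%s, %s, now())",
            (job["url"], html),
        )
    channel.basic_ack(delivery_tag=method.delivery_tag)

params = pika.URLParameters(os.environ["RABBITMQ_URL"])
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue="scrape_jobs", durable=True)
channel.basic_consume(queue="scrape_jobs", on_message_callback=handle_job)
channel.start_consuming()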

It did require more setup and infrastructure management, but it solved the headless mode problems and made my scraping more reliable. Plus, it allowed for better scaling as I could add more EC2 instances for parallel scraping when needed.

Have you tried a cloud-based browser service like browserless.io? It runs full Chrome instances in the cloud for you, so you're not fighting Heroku's limits. It might solve your problem without changing much code; you'd mainly need to update the part that creates the browser connection. Could be worth a shot before going for more complex solutions.
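If you go that route, connecting is usually just pointing your automation library at their WebSocket endpoint. A rough sketch with pyppeteer, assuming a placeholder endpoint and a BROWSERLESS_TOKEN environment variable (the exact URL and auth come from their docs):

import asyncio
import os

from pyppeteer import connect

async def scrape(url):
    # Placeholder endpoint/token -- check browserless.io's docs for the real values
    ws_endpoint = f"wss://chrome.browserless.io?token={os.environ['BROWSERLESS_TOKEN']}"
    browser = await connect(browserWSEndpoint=ws_endpoint)
    page = await browser.newPage()
    await page.goto(url, waitUntil="networkidle0")
    html = await page.content()
    await browser.close()
    return html

html = asyncio.get_event_loop().run_until_complete(scrape("https://example.com"))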