How to modify Scrapy settings through API calls

I’m working with Scrapy and Scrapyd to run web crawlers. I need to pass configuration parameters through API requests, but I’m having trouble getting them applied.

I can successfully send basic parameters like start_urls through the API and they work fine. However, when I try to send settings like CONCURRENT_REQUESTS, they don’t get applied to my crawler.

Here’s my current code:

```python
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from myproject.extractors import CustomLinkExtractor as LinkExtractor
from myproject.items import WebPageItem

api_settings = {}

class WebCrawler(CrawlSpider):
    name = 'webcrawler'

    def __init__(self, *args, **kwargs):
        target_url = kwargs.get('target_url')
        self.start_urls = [target_url]  # must be a list, not a bare string
        self.allowed_domains = [urlparse(target_url).netloc]  # domain only, not the full URL
        api_settings['CONCURRENT_REQUESTS'] = int(kwargs.get('max_requests'))
        self.logger.info(f'Concurrent requests: {api_settings}')

        # Rules must be set before CrawlSpider.__init__, which compiles them
        self.rules = (
            Rule(LinkExtractor(allow=(target_url,), deny=(r'\.(jpg|png)',), unique=True),
                 callback='parse_page',
                 follow=True),
        )
        super().__init__(*args, **kwargs)

    @classmethod
    def update_settings(cls, settings):
        # merge the API-supplied settings into the spider's custom_settings
        cls.custom_settings = {**(cls.custom_settings or {}), **api_settings}
        settings.setdict(cls.custom_settings, priority='spider')

    def parse_page(self, response):
        loader = ItemLoader(item=WebPageItem(), response=response)
        loader.add_xpath('language', '//html/@lang')
        yield loader.load_item()
```

The problem is that CONCURRENT_REQUESTS and other settings sent through the API are not being applied. I need to be able to modify up to 10 different settings dynamically through API calls. How can I properly implement this?

Your problem is that you’re setting up api_settings in __init__, but Scrapy needs those settings defined at the class level before the crawler starts. You’re doing it too late in the process.

I ran into the same thing. What fixed it for me was using a custom settings handler that processes API parameters before the spider gets created. You’ve got two options: override the custom_settings class attribute directly, or use the from_crawler classmethod to grab the crawler’s settings.

Here’s what I’d try: override from_crawler in your spider class and do the settings work there. That classmethod receives the crawler instance, so you can read crawler.settings and modify them with crawler.settings.set() before the spider object exists. This way everything gets applied during crawler initialization instead of after.
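Since you have up to 10 settings to handle, you probably don’t want one if-block per setting. One way to keep that manageable is a small mapping from API kwarg names to Scrapy setting names. A sketch — the kwarg names and the `KWARG_TO_SETTING` table below are made up for illustration, so substitute whatever your API actually sends; the commented-out from_crawler wiring assumes a Scrapy spider class:

```python
# Map API kwarg names to (Scrapy setting name, converter).
# These kwarg names are examples -- use whatever your API actually sends.
KWARG_TO_SETTING = {
    'max_requests': ('CONCURRENT_REQUESTS', int),
    'delay': ('DOWNLOAD_DELAY', float),
    'obey_robots': ('ROBOTSTXT_OBEY', lambda v: v.lower() == 'true'),
}

def settings_from_kwargs(kwargs):
    """Extract Scrapy settings from spider kwargs; returns a plain dict.

    Kwargs that aren't in the mapping (target_url, etc.) are left alone
    and still reach the spider normally.
    """
    out = {}
    for name, (setting, convert) in KWARG_TO_SETTING.items():
        if name in kwargs:
            out[setting] = convert(kwargs[name])
    return out

# In the spider, apply them before the instance is created:
#
#     @classmethod
#     def from_crawler(cls, crawler, *args, **kwargs):
#         for key, value in settings_from_kwargs(kwargs).items():
#             crawler.settings.set(key, value, priority='spider')
#         return super().from_crawler(crawler, *args, **kwargs)
```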

I had the exact same issues with DOWNLOAD_DELAY and ROBOTSTXT_OBEY until I moved all the settings config to happen before the spider’s __init__ runs. Timing is everything with Scrapy settings.

This happens because Scrapy needs configuration values during spider instantiation, not after. Scrapy does call update_settings, but it runs before your __init__, so api_settings is still empty at that point. I ran into this exact issue when building a distributed crawling system.

The fix is overriding the from_crawler classmethod in your spider. This method gets the crawler instance and lets you modify settings before the spider object is created. Replace your current approach with this in your spider class:

```python
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    max_requests = kwargs.get('max_requests')
    if max_requests:
        crawler.settings.set('CONCURRENT_REQUESTS', int(max_requests))
    return super().from_crawler(crawler, *args, **kwargs)
```

This way your API parameters get applied to the actual crawler settings before the spider starts running. I’ve used this pattern to dynamically configure download delays, user agents, and concurrent request limits through API calls - works great.

yeah, your update_settings method runs before __init__ does, so api_settings is still empty when scrapy reads it and those api parameters just sit there doing nothing. I ran into the same issue - you need to move everything to from_crawler where you can actually modify the crawler settings before initialization. ditch the global api_settings dict and just use crawler.settings.set() directly in from_crawler. that’s what worked for me when I needed to pass custom headers and timeout values through the scrapyd api.
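also worth knowing: since you’re already going through scrapyd, its schedule.json endpoint accepts repeated `setting=NAME=value` form fields alongside spider arguments, so for simple cases you may not need any spider-side code at all. A sketch of building that request body — the project/spider names and URL are placeholders, and passing a list of pairs to `urlencode` is what produces the repeated `setting` fields:

```python
from urllib.parse import urlencode

def build_schedule_params(project, spider, settings=None, **spider_args):
    """Build the form body for scrapyd's schedule.json endpoint.

    Scrapy settings go in repeated setting=NAME=value fields;
    everything else is passed through to the spider as a kwarg.
    """
    params = [('project', project), ('spider', spider)]
    for name, value in (settings or {}).items():
        params.append(('setting', f'{name}={value}'))
    for name, value in spider_args.items():
        params.append((name, value))
    return urlencode(params)

body = build_schedule_params(
    'myproject', 'webcrawler',
    settings={'CONCURRENT_REQUESTS': 8, 'DOWNLOAD_DELAY': 0.5},
    target_url='https://example.com',
)
# POST this body to http://localhost:6800/schedule.json
# (or whatever host/port your scrapyd instance listens on)
```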

This topic was automatically closed 4 days after the last reply. New replies are no longer allowed.