I moved from the legacy mapred package to the newer mapreduce API in Hadoop. Now I’m having trouble controlling how many mappers get created for my jobs.
While I can easily control reducer count with job.setNumReduceTasks(), there doesn’t seem to be an equivalent method for mappers. I’ve also attempted setting configuration values directly:
Configuration config = new Configuration();
config.setInt("mapred.map.tasks", desiredMappers);
config.setInt("mapreduce.job.maps", desiredMappers);
But neither approach actually changes the mapper count when the job runs. What’s the correct way to specify mapper quantity in the current Hadoop API?
Those config properties you’re setting are deprecated hints, not hard limits. Hadoop derives the mapper count from the number of input splits, which depends on your data size and HDFS block size. What does work is tuning the split parameters: raise mapreduce.input.fileinputformat.split.minsize for fewer mappers, or lower mapreduce.input.fileinputformat.split.maxsize for more. The basic math is mappers ≈ total input size / split size. I’ve had good luck calling FileInputFormat.setMinInputSplitSize() and FileInputFormat.setMaxInputSplitSize() directly on the Job. Just remember: fighting Hadoop’s natural splitting can hurt performance, so test your specific workload hard before pushing to production.
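To see why raising minsize reduces the mapper count, FileInputFormat picks each split's size as max(minSize, min(maxSize, blockSize)). Here's a framework-free sketch of that arithmetic (the class and method names below are illustrative, not Hadoop APIs):

```java
// Mirrors the split-size formula used by
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat:
// splitSize = max(minSize, min(maxSize, blockSize))
public class SplitMath {
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Roughly one mapper per split (ceiling division).
    public static long estimateMappers(long totalInputBytes, long splitSize) {
        return (totalInputBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;        // 128 MB HDFS block
        long totalInput = 10L << 30;        // 10 GB of input

        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split = block size
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default mappers: "
                + estimateMappers(totalInput, defaultSplit));   // 80

        // Raising minsize to 512 MB forces bigger splits -> fewer mappers
        long bigSplit = computeSplitSize(blockSize, 512L << 20, Long.MAX_VALUE);
        System.out.println("with 512MB minsize: "
                + estimateMappers(totalInput, bigSplit));       // 20
    }
}
```

With defaults the split size collapses to the block size, so 10 GB of 128 MB blocks yields about 80 mappers; pushing minsize to 512 MB cuts that to about 20.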
Here’s the deal: Hadoop sets the mapper count from the input splits, not from whatever you configure, and the newer API enforces this more strictly. Don’t fight it; work with it. Take your total input size, divide by the number of mappers you want, and that’s your split size. Set mapreduce.input.fileinputformat.split.maxsize to that value. I’ve had much better luck with this than with the minimum split size, since it stops Hadoop from making splits bigger than you want. Just remember your HDFS block size still matters: if you go too small relative to the block size, you’ll pay network overhead from reading partial blocks.
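The divide-and-set recipe above is easy to sanity-check in isolation. A minimal sketch (class name and the 20 GB / 40 mappers figures are just example values):

```java
public class TargetSplit {
    // Split size needed so roughly `desiredMappers` splits cover the input
    // (ceiling division, so the last partial split is counted).
    public static long splitSizeFor(long totalInputBytes, int desiredMappers) {
        return (totalInputBytes + desiredMappers - 1) / desiredMappers;
    }

    public static void main(String[] args) {
        long totalInput = 20L << 30;   // 20 GB of input
        int desiredMappers = 40;

        long maxSplit = splitSizeFor(totalInput, desiredMappers);
        // Pass this value as
        //   -D mapreduce.input.fileinputformat.split.maxsize=<value>
        // when submitting the job.
        System.out.println("split.maxsize = " + maxSplit);  // 536870912 (512 MB)
    }
}
```

For 20 GB and 40 mappers that works out to 512 MB per split, well above a 128 MB block, so the block-size caveat at the end of the answer doesn’t bite here.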
yeah, exactly! You can’t set mappers directly in Hadoop now; just adjust the input split sizes. Setting mapreduce.input.fileinputformat.split.minsize to the total input size divided by the number of mappers you want should do it. That did the trick for me!