I am using the Python yelp/mrjob framework for my MapReduce jobs. There is only about 4 GB of data and I don't want to go through the trouble of setting up Hadoop or EMR. I have a 64-core machine and it takes about 2 hours to process the data with mrjob. I notice mrjob assigns 54 mappers to my job, but it seems to run only one at a time. Is there a way to make mrjob run all tasks in parallel, using all my CPU cores?
I manually changed the number of tasks, but it didn't help much:
--jobconf mapred.map.tasks=10 --jobconf mapred.reduce.tasks=10
EDIT:
I pass -r local when I execute the job; however, looking at the code, it seems to default to running one process at a time. Please tell me I am wrong.
The local job runner for mrjob just spawns one subprocess for each MR stage: one for the mapper, one for the combiner (optionally), and one for the reducer, and passes data between them via pipes. It is not designed to have any parallelism at all, so it will never take advantage of your 64 cores.
My suggestion would be to run Hadoop on your local machine and submit the job with the -r hadoop option. A Hadoop cluster running on your local machine in pseudo-distributed mode should be able to take advantage of your multiple cores.
See this question, which addresses that topic: Full utilization of all cores in Hadoop pseudo-distributed mode
The runner for a job can be specified via the command line with the -r option.
When you run an mrjob script from the command line, the default run mode is inline, which runs your job on your local machine in a single process. The other obvious options for running jobs are emr and hadoop.
You can make the job run in parallel on your local machine by setting the runner to local:
$ python myjob.py -r local
Those --jobconf options are only recognised by Hadoop (i.e. on EMR or a Hadoop cluster).
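For reference, here is a minimal mrjob word-count sketch (the job itself and the file name myjob.py are just illustrative) that can be launched with either runner:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit a count of 1 for every word on the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum the per-word counts emitted by the mappers
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

You would launch it as python myjob.py -r local input.txt, or with -r hadoop once a (pseudo-distributed) cluster is available.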
I have a program where each task is a call to an external C++ program through subprocess.Popen. The tasks are arranged in a graph, and everything is executed through the dask get command.
I have a single-node version of this program that works just fine with dask.threaded, and I am trying to extend it to a distributed setting. My goal is to run it on a Slurm cluster, but I am having trouble deploying the workers. When I run the following:
screen -d -m dask-scheduler --scheduler-file scheduler.json
screen -d -m srun dask-worker --scheduler-file scheduler.json
python3 myscript.py
only a single core gets used on every node (out of twenty cores per node).
I did suspect some issues with the GIL, but the script works just fine with dask.threaded, so I am not quite sure what is going on; any help would be appreciated.
I recommend looking at the dashboard to see how many tasks Dask is running at a time on each worker:
Documentation here: http://dask.pydata.org/en/latest/diagnostics-distributed.html
If you see that Dask is only running one task per worker, then it's probably a problem with how you've set up your workers (you might want to look at the worker page to get a sense of what Dask thinks you've asked for).
If you see that Dask is running many tasks per worker concurrently, then it's probably an issue with your function.
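As a quick sanity check, a sketch along these lines (assuming the same scheduler.json file) prints how many threads each worker was started with; if every worker reports 1, the dask-worker invocation is the likely culprit and passing something like --nthreads to dask-worker may be what you want:

from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')   # attach to the running scheduler
info = client.scheduler_info()                     # cluster metadata
for address, worker in info['workers'].items():
    # the key is 'nthreads' in recent releases, 'ncores' in older ones
    print(address, worker.get('nthreads', worker.get('ncores')))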
I have 4 Spark applications (each computing a word count over a text file), written in 4 different languages (R, Python, Java, Scala):
./wordcount.R
./wordcount.py
./wordcount.java
./wordcount.scala
Spark works in standalone mode with:
1. 4 worker nodes
2. 1 core for each worker node
3. 1 GB memory for each node
4. core_max set to 1
./conf/spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=4
I submitted the Spark applications using a pgm.sh file in the terminal:
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.py &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./project_2.jar
When each application is executed individually, it takes about 2 seconds.
When all of them are executed via the .sh file in the terminal, it takes 5 to 6 seconds.
How do I run the different Spark applications in parallel?
How do I assign each Spark application to an individual core?
Edit: In Standalone mode, per the documentation, you simply need to set spark.cores.max to something less than the size of your standalone cluster, and it is also advisable to set spark.deploy.defaultCores for the applications that don't explicitly set this setting.
Original Answer (assuming this was running in something like local or YARN): When you submit multiple spark applications, they should run in parallel automatically, provided that the cluster or server you are running on is configured to allow for that. A YARN cluster, for example, would run the applications in parallel by default. Note that the more applications you are running in parallel, the greater the risk of resource contention.
On "how to assign each spark application to individual core": you don't, Spark handles the scheduling of workers to cores. You can configure how many resources each of your worker executors uses, but the allocation of them is up to the scheduler (be that Spark, Yarn, or some other scheduler).
I have to run a script on several machines in a compute cluster using SSH. But before I run the script, I have to log into a node in the cluster using ssh and then use nvidia-smi to check which GPU is free (as there is no job scheduler in place at the moment). Each node has several GPUs. So I typically access, say, GPU1 by issuing ssh gpu1... followed by nvidia-smi, which just outputs a list of GPUs and the processes and utilization of each GPU.
I need to automate all this. That is, say we have 4 GPUs: gpu1...gpu4.
I want to be able to ssh into each of these, check their utilization, and then run a Python script run_test.py -arg1 on the GPU that is free.
How can I write a Python script that does all of this?
I'm new to Python, so I need some help please...
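A minimal sketch of one way to script this with subprocess and ssh (the node names, the idle threshold, and the use of CUDA_VISIBLE_DEVICES to pick the GPU are all assumptions):

import subprocess

NODES = ['gpu1', 'gpu2', 'gpu3', 'gpu4']   # hypothetical node names

def gpu_utilizations(node):
    # ask nvidia-smi on the remote node for per-GPU utilization (percent)
    out = subprocess.check_output(
        ['ssh', node, 'nvidia-smi',
         '--query-gpu=index,utilization.gpu',
         '--format=csv,noheader,nounits'],
        universal_newlines=True)
    for line in out.strip().splitlines():
        index, util = line.split(',')
        yield int(index), int(util)

for node in NODES:
    idle = [i for i, util in gpu_utilizations(node) if util == 0]
    if idle:
        # launch the test script on the first idle GPU found
        subprocess.Popen(
            ['ssh', node,
             'CUDA_VISIBLE_DEVICES=%d python run_test.py -arg1' % idle[0]])
        break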
Even when I give a parameter to the groupByKey function, for example groupByKey(4), when I check with the top command Spark is still using only one core. I run my script like this:
spark-submit --master local[4] program.py
So why does Spark only use one core when I tell it to use 4?
You're running this on Linux, if the tags on your question are to be trusted. Under Linux, top does not show every thread by default (it shows every process). local[4] tells Spark to work locally with 4 threads (not processes).
Run top -H to pick up the threads.
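If you also want to double-check from inside the job that there is enough work to spread over 4 threads, a quick sketch (the RDD here is just a stand-in):

from pyspark import SparkContext

sc = SparkContext("local[4]", "partition-check")   # 4 local threads
rdd = sc.parallelize(range(1000), 4)               # stand-in data with 4 partitions
print(sc.defaultParallelism)                       # expect 4
print(rdd.getNumPartitions())                      # expect 4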
I am attempting to launch a couple of external applications from a Jenkins build step on Windows 7 64-bit. They are essentially programs designed to interact with each other and perform a series of regression tests on some software. Jenkins runs as a Windows service under a user with admin privileges on the machine. I think that's full disclosure on any weirdness with my Jenkins installation.
I have written a Python 3 script that successfully does what I want when run from the Windows command line. When I run this script as a Jenkins build step, I can see that the applications have been spawned via the Task Manager, but there is no CPU activity associated with them and no other evidence that they are actually doing anything (they normally produce log files, etc., but none of these appear). One of the applications typically runs at 25% CPU during the course of the regression tests.
The Python script itself runs to completion as if everything is OK, and Jenkins correctly monitors the output of the script, which I can watch from the job's console output. I'm using os.spawnv(os.P_NOWAIT, ...) for each of the external applications. The subprocess module doesn't do what I want; I just want these programs to run externally.
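For what it's worth, each spawn call looks roughly like this (the executable path and arguments are placeholders):

import os

exe = r'C:\tools\regression\app.exe'          # placeholder path
os.spawnv(os.P_NOWAIT, exe, [exe, '--run'])   # fire and forget, no wait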
I've even run a bash script via Cygwin that functionally does the same thing as the Python script with the same results. Any idea why these applications spawn but don't execute?
Thanks!