Threading with Hadoop Streaming - Python

I am using Hadoop Streaming to write a Python-based HTML grabber. I find that running a single-threaded Python script is slow, and I want to modify it into a multithreaded version. Does anyone know a good value for the number of threads in the mapper? I am not sure of the specs of each node in the cluster, but I assume each would support at least two threads.

I tried using threading in Python, but ran into issues with the Global Interpreter Lock. I then ported the code to the multiprocessing module; however, Hadoop internally assigns as many mappers as there are cores in the cluster, so multiprocessing is not the way to go if you need a speedup. Multithreading, if done right, might give some speedup.

I have not used Hadoop Streaming for an HTML grabber, but here is a post that talks about how urllib2 works with multiple threads (plain threads, not the multiprocessing package).
Hope it helps.
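A minimal sketch of what that multithreaded grabbing can look like with the standard library (using `urllib.request` rather than the older `urllib2`; the worker count and `fetch_fn` parameter are assumptions for the example):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    # urlopen blocks on the socket and releases the GIL while it waits,
    # so several downloads can overlap despite the interpreter lock.
    with urlopen(url, timeout=timeout) as resp:
        return url, resp.read()

def fetch_all(urls, max_workers=8, fetch_fn=fetch):
    # One pooled thread per in-flight request, capped at max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))
```

Since the work is network-bound, the thread count can exceed the core count; starting around 8-16 per mapper and measuring is a reasonable approach.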

Related

Multithreading in python. Running the same program on multiple threads to make the program faster

So I have a quite time-intensive Python program, and I was wondering whether (since my CPU is multi-core) I can run the program on multiple threads at once. I always check Task Manager, and Python uses only one thread but pushes it to the max.
I tried searching, but I only found ways to run a function with different datasets on different threads, so I didn't try anything yet. I hope you can help!
Multi-threading won't help you.
But Python's multiprocessing might - however, parallelization is not automatic, and you have to adapt your program, knowing what you are doing, in order to see any gains from it.
Python's multi-threading is capped so that only a single thread runs actual Python code at once - you have some gains if parts of your workload are spent on I/O, but not for a CPU-intensive task.
multiprocessing is a module in Python's standard library which provides the same interface as threading and will actually run your code in parallel, in multiple processes, each with its own Python runtime. Its major drawback is that any data exchanged between processes has to be serialized and deserialized, and that adds some overhead.
In either case, you have to write your program so that certain functions (which can be entry points for big workloads) run in new threads or subprocesses. Since you posted no example code, there is no example we could create for you showing how the code could look - but look for tutorials on "python multiprocessing"; those should help you out.
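As a rough illustration of that adaptation (the function names and workload are invented for the example), a CPU-bound helper can be moved into subprocesses like this:

```python
from multiprocessing import Process, Queue

def heavy(n, out):
    # CPU-bound work; each call runs in its own process with its own GIL.
    out.put(sum(i * i for i in range(n)))

def run_parallel(sizes):
    out = Queue()
    procs = [Process(target=heavy, args=(n, out)) for n in sizes]
    for p in procs:
        p.start()
    # Drain results before join(): a child does not exit until its
    # queued data has been consumed.
    results = sorted(out.get() for _ in procs)
    for p in procs:
        p.join()
    return results
```

Each entry point must be a picklable top-level function; that is the kind of restructuring "adapting your program" refers to.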

Multiprocessing in python on Mac OS

Any basic information would be greatly appreciated.
I am almost done with my project; all I have to do now is run my code to get my data. However, it takes a very long time, and it has been suggested that I adapt my code (Python) to use multiprocessing. However, I am clueless about how to do this and have had a lot of trouble figuring out how. I use Mac OS X 10.8.2. I know that I need a semaphore.
I have looked at the multiprocessing module and the threading module, although I could not understand most of it. Do the Process() or Manager() functions have anything to do with this?
Lastly, I have 16 processors available for this.
You need to use the multiprocessing module.
Both modules enable concurrency, but only multiprocessing enables true parallelism. Due to Python's Global Interpreter Lock, multiple threads cannot execute simultaneously.
Keeping all 16 of your processors busy comes at the cost of some added programming difficulty, since separate processes do not execute in a shared memory space; if a spawned process needs to share data with its parent process, you will need to serialize it.
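A minimal sketch of the usual pattern, assuming the per-item work lives in a single function (`crunch` here is a stand-in for the real computation):

```python
from multiprocessing import Pool

def crunch(x):
    # Stand-in for the real per-item computation.
    return x * x

def run_jobs(data, workers=16):
    # Pool starts `workers` separate processes; the GIL does not limit
    # them because each process has its own interpreter.
    with Pool(processes=workers) as pool:
        return pool.map(crunch, data)

if __name__ == "__main__":   # good practice; required when processes are spawned
    print(run_jobs(range(8), workers=4))
```

`Pool` handles the semaphores and queues internally, which is usually simpler than wiring up `Process()` and `Manager()` by hand.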

Python multi threading Yay or nay?

I have been trying to write a simple Python application to implement a worker queue.
Every webpage I found about threading has some random guy commenting on it that you shouldn't use Python threading because of this or that. Can someone help me out? What is up with Python threading - can I use it or not? If yes, which lib? Is the standard one good enough?
Python's threads are perfectly viable and useful for many tasks. Since they're implemented with native OS threads, they allow executing blocking system calls while other threads keep "running" simultaneously - the blocking syscall runs in a separate thread. This is very useful for programs that have to do multiple things at the same time (e.g. GUIs and other event loops) and can even improve performance for I/O-bound tasks (such as web scraping).
However, due to the Global Interpreter Lock, which prevents the Python interpreter from actually running more than a single thread simultaneously, you're out of luck if you expect to distribute CPU-intensive code over several CPU cores with threads and improve performance that way. You can do it with the multiprocessing module, however, which provides an interface similar to threading and distributes work using processes rather than threads.
I should also add that C extensions are not required to be bound by the GIL and many do release it, so C extensions can employ multiple cores by using threads.
So, it all depends on what exactly you need to do.
You shouldn't need to use threading. 95% of code does not need threads.
Yes, Python threading is perfectly valid; it's implemented through the operating system's native threads.
Use the standard library threading module - it's excellent.
Reading up on the GIL should provide you with some information on that topic.
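For the worker-queue use case in the question, the standard library is indeed enough. A sketch with `threading` plus `queue` (the `item * 2` line is a placeholder for the real, presumably I/O-bound, work):

```python
import queue
import threading

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:            # sentinel: shut this worker down
            tasks.task_done()
            return
        results.put(item * 2)       # placeholder for the real work
        tasks.task_done()

def run(items, num_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    out = []
    while not results.empty():      # results arrive in completion order
        out.append(results.get())
    return sorted(out)
```

`queue.Queue` is thread-safe, so no extra locking is needed around `put`/`get`.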

Twisted Threading + MapReduce on a single node/server?

I'm confused about Twisted threading.
I've heard and read more than a few articles, books, and sat through a few presentations on the subject of threading vs processes in Python. It just seems to me that unless one is doing lots of IO or wanting to utilize shared memory across jobs, then the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses Threads (pThreads from the python threading module). And Twisted seems to perform really really well in processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long, in-depth explanation, the GIL is what allows only one thread at a time to execute Python code. In effect, this means you will not be able to take advantage of multiple cores with a single multi-threaded twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the associated Python code is still effectively single-threaded.
What twisted's threading is mainly useful for is combining it with blocking I/O based modules. A prime example is database APIs, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thus, to use PostgreSQL from a twisted app, one has to either block or use something like twisted.enterprise.adbapi, a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other things are going on. This can allow other Python code to run because the socket module (among most others involving operating system I/O) releases the GIL while in a system call.
That said, you can use twisted to write a network application talking to many twisted (or non-twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of twisted's asynchronous primitives. For example you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferred's complete. (thus allowing you to do your map call) If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol, and can be used very trivially to implement a network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that:
1. you lose out on simple process management, and
2. the subprocesses can't share memory as they would if they were forked on a unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses, and 1) can be solved by using a process management system like supervisord.
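If the Twisted machinery is more than you need, the same single-node map-reduce shape can also be sketched with only the standard library (the chunking and the mapper body here are assumptions for the example, not the asker's code):

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def mapper(chunk):
    # CPU-heavy transform; each chunk runs in a separate process,
    # so the GIL imposes no limit on parallelism.
    return sum(x * x for x in chunk)

def map_reduce(chunks, max_workers=None):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(mapper, chunks)          # map phase
    return reduce(lambda a, b: a + b, partials, 0)   # reduce phase
```

This mirrors the master/worker split described above, minus the network layer, which is fine when everything lives on one node.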
Edit: For more on Python and the GIL, you should watch Dave Beazley's talks on the subject (website, video, slides).

Python threading, queuing, asyncing... What does it all mean?

I've been experimenting with threading recently in Python and was curious when to use what.
For example, when should I use multithreading over multiprocessing? What would be a scenario when I should be using asynchronous IO rather than threading?
I mostly understand what each does (I think) but I can't see any benefits/downsides of using one over the other.
What should I use if I was creating a small HTTP server?
What should I use if I was creating a small HTTP client?
This baffles me...
What you want to talk about is not specific to Python; it's about multiprocessing vs. threading in general. I think you can find plenty of arguments on Google from both sides - those who prefer threading and those who prefer multiprocessing.
But in Python, multi-threading is limited (if you're using CPython) by the GIL (Global Interpreter Lock), so most Python programmers prefer multiprocessing over threading (it's Guido's recommendation).
Nevertheless, you're right, the GIL is not as bad as you would initially think: you just have to undo the brainwashing you got from Windows and Java proponents who seem to consider threads as the only way to approach concurrent activities. - Guido van Rossum
You can find some more info here.
Python multiprocessing makes sense when you have a machine with multiple cores and/or CPUs. The main difference between using threads and processes is that processes do not share an address space, and thus one process cannot easily access the data of another process. That is why the multiprocessing module provides managers and queues and stuff like that.
The issue with threading is Python's Global Interpreter Lock, which seriously messes with multithreaded applications.
Asynchronous IO is useful when you have long running IO operations (read large file, wait for response from network) and do not want your application to block. Many operating systems offer built-in implementations of that.
So, for your server you would probably use multiprocessing or multithreading, and for your client async IO is more fitting.
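To make the client case concrete, a minimal `asyncio` sketch (the `sleep` stands in for a real network round trip, and the host names are placeholders):

```python
import asyncio

async def fetch(host):
    # Awaiting hands control back to the event loop, which can service
    # the other requests while this one "waits on the network".
    await asyncio.sleep(0.01)
    return f"response from {host}"

async def gather_all(hosts):
    # All requests are in flight concurrently, on a single thread.
    return await asyncio.gather(*(fetch(h) for h in hosts))

def run(hosts):
    return asyncio.run(gather_all(hosts))
```

Because there is only one thread, no locks are needed; concurrency comes from interleaving at each `await` point.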
