Is this running truly parallel? - python

I'm using requests and threading in Python to do some stuff. My question is: is this code truly multithreaded, and is it safe to use? I'm experiencing some slowdown over time. Note: I'm not using this exact code, but mine does similar things.
import time
import threading
import requests

current_threads = 0
max_threads = 32

def doStuff():
    global current_threads
    r = requests.get('https://google.de')
    current_threads -= 1

while True:
    while current_threads >= max_threads:
        time.sleep(0.05)
    thread = threading.Thread(target=doStuff)
    thread.start()
    current_threads += 1

There could be a number of reasons for the issue you are facing. I'm not an expert in Python, but I can see several potential causes for the slowdown:
Depending on the size of the data you are pulling down, you could be saturating your bandwidth. That's hard to prove without seeing the exact code you are using, knowing what it is doing, and knowing your bandwidth.
Somewhat connected to the first one: if the files take a while to come down in each thread, your main loop may be getting stuck at:
while current_threads >= max_threads:
    time.sleep(0.05)
You could try reducing the maximum number of threads and see if that helps, though it may not if it's the downloads themselves that are slow. (A pool-based alternative to this busy-wait is sketched at the end of this answer.)
The problem may not be with your code or your bandwidth, but with the server you are pulling the files from; if that server is overloaded, it may be slowing down your transfers.
Firewalls, IPS/IDS, or policies on the server may be throttling your requests. If you make too many requests too quickly, all from the same IP, the server-side network equipment may mistake this for some sort of DoS attack and throttle your requests in response.
Unfortunately Python, compared to lower-level languages such as C# or C++, is not as good at multithreading. This is due to something called the GIL (Global Interpreter Lock), which comes into play when you are accessing/manipulating the same data in multiple threads. It's quite a sizeable subject in itself, but if you want to read up on it, have a look at this link:
https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
Sorry I can't be of any more assistance but this is as much as I can say on the subject given the provided information.
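
If you do want to keep a hard cap on the number of in-flight requests without the counter and sleep loop from the question, a thread pool does the bookkeeping for you. This is only a rough sketch of that alternative (ThreadPoolExecutor isn't mentioned in the question; it is a standard-library option from Python 3.2 onward), with a placeholder URL list:

import requests
from concurrent.futures import ThreadPoolExecutor

URLS = ['https://google.de'] * 100  # placeholder workload

def fetch(url):
    # Each worker thread runs this; the pool never runs more than
    # max_workers of them at the same time, so no manual counter is needed.
    r = requests.get(url, timeout=10)
    return r.status_code

with ThreadPoolExecutor(max_workers=32) as pool:
    for status in pool.map(fetch, URLS):
        print(status)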

Sure, you're running multiple threads and provided they're not accessing/mutating the same resources you're probably "safe".
Whenever I'm accessing external resources (i.e., using requests), I always recommend asyncio over vanilla threading, as it allows custom context switching (everywhere you have an "await" you switch contexts, whereas in vanilla threading, switching between threads is determined by the OS and might not be optimal) and reduced overhead (you're only using ONE thread).
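
To make that concrete: requests itself is blocking, so the usual companion for asyncio here is aiohttp (my assumption, not something this answer names). A minimal single-threaded sketch that bounds concurrency the same way the question's max_threads does:

import asyncio
import aiohttp

async def fetch(session, sem, url):
    async with sem:                      # cap in-flight requests, like max_threads
        async with session.get(url) as resp:
            return resp.status

async def main():
    sem = asyncio.Semaphore(32)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, 'https://google.de') for _ in range(100)]
        print(await asyncio.gather(*tasks))

asyncio.run(main())  # Python 3.7+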

Related

Multithreaded crawler in Python

Is it possible to create enough threads to use 100% of the CPU, and is it really efficient? I'm planning to create a crawler in Python and, to make the program efficient, I want to create as many threads as possible, where each thread downloads one website. I tried looking up some information online; unfortunately I didn't find much.
You are confusing your terminology, but that's OK. A very high-level overview would help.
Concurrency can involve IO-bound work (reading from and writing to disk, HTTP requests, etc.) and CPU-bound work (running a machine-learning optimization function over a big set of data).
With IO-bound work, which I assume is what you are referring to, your CPU is not actually working very hard; it is mostly waiting around for data to come back.
Contrast that with multiprocessing, where you can use multiple cores of your machine to do more intense CPU-bound work.
That said, multithreading could help you. I would advise using the asyncio and aiohttp modules for Python. These will make sure that, while you are waiting for a response to come back, the program can continue with other requests.
I use asyncio, aiohttp and bs4 when I need to do some web-scraping.
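
As a rough illustration of that combination (the URLs and the "grab the page title" step are stand-ins I've made up, not part of the question):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

URLS = ['https://example.com', 'https://example.org']  # hypothetical seed list

async def fetch_title(session, url):
    async with session.get(url) as resp:
        html = await resp.text()
    # Parsing is CPU work, but it's cheap enough here not to stall the event loop.
    soup = BeautifulSoup(html, 'html.parser')
    return url, soup.title.string if soup.title else None

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_title(session, u) for u in URLS))
        for url, title in results:
            print(url, '->', title)

asyncio.run(main())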

Fastest way to handle threading with callback in Python

I'm playing around with some measurement instruments through PyVisa (Python wrapper for VISA). Specifically I need to read the measurement value from four instruments, similar to this:
current1 = instrument1.ask("READ?")
current2 = instrument2.ask("READ?")
current3 = instrument3.ask("READ?")
current4 = instrument4.ask("READ?")
For my application, speed is a must. Individually, I can get between 50 and 200 measurements per second from the four instruments, but unfortunately my current code evaluates the four instruments serially.
From what I've read, there's some options with Threading and Multiprocessing in Python, but it's not obvious what the best, and fastest, option for me.
In the best-case scenario I will be spawning ~4x50 threads per second, so overhead is a bit of a concern.
The task is in no way CPU intensive, it is simply a matter of waiting for the readout from the instrument.
Any advice on what the proper course of action might be?
Many thanks in advance!
Try using the multiprocessing module in Python. Because of the way the Python interpreter is written, the threading module can actually slow down your application when using multiple threads on a computer with multiple processors. This is due to something called the Global Interpreter Lock (GIL), which only allows one Python thread per process to run at a time. This makes things like memory management and concurrency easier for the interpreter.
I would suggest trying to create a process for each instrument (producer process) and send the results to a shared queue that gets processed by a single process (consumer process).
I would also limit the number of processes that you create to the number of processors on your computer. Creating too many processes introduces a lot of overhead.
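
A bare-bones sketch of that producer/consumer layout with multiprocessing (read_instrument and handle_reading are hypothetical placeholders; each producer would have to open its own VISA session, since instrument handles generally can't be shared across processes):

import multiprocessing as mp

def producer(name, out_queue):
    # Stand-in for calling instrument.ask("READ?") in a loop.
    while True:
        reading = read_instrument(name)      # hypothetical helper
        out_queue.put((name, reading))

def consumer(in_queue):
    while True:
        name, reading = in_queue.get()
        handle_reading(name, reading)        # hypothetical processing step

if __name__ == '__main__':
    q = mp.Queue()
    for name in ('inst1', 'inst2', 'inst3', 'inst4'):
        p = mp.Process(target=producer, args=(name, q))
        p.daemon = True                      # exit along with the main process
        p.start()
    consumer(q)                              # consume in the main process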
If you really feel like you need the 200 threads you suggested in your post, I would program this in Java or C++.
I'm not sure if this fits your requirements, and I'm not too familiar with PyVISA, but have you looked into PyPy's stackless examples? The speed increase is pretty impressive, and you can spawn an almost infinite number of 'threadlets' without much hindrance to your system or program.
PyPy is somewhat limited in terms of supporting various packages, but it looks like PyVISA isn't one of them:
https://pypi.python.org/pypi/PyVISA

How to run a program and also execute code when data is received from the network?

I've written a little program in Python that basically does the following:
Gets a hotword as input from the user. If it matches the set keyword it continues.
After entering the correct hotword the user is asked to enter a command.
After the command is read, the program checks a command file to see if there is a command that matches that input.
If a matching command is found, execute whatever that command says.
I'd like to add the ability to execute commands over a network as follows (and learn to use Twisted on the way):
Client #1 enters a command targeted at client #2.
The command gets sent to a server which routes it to client #2.
Client #2 receives the command and executes it if it's valid.
Note: Entering commands locally (using the code below) and remotely should be possible.
After some thinking I couldn't come up with any other way to implement this other than:
Have the above program run as process #1 (the program that runs locally as I've written at the beginning).
A Twisted client will be run as process #2 and receive the commands from remote clients. Whenever a command is received, the Twisted client will initialize a thread that'll parse the command, check for its validity and execute it if it's valid.
Since I don't have that much experience with threads and none with network programming, I couldn't think of any other scheme that makes sense to me.
Is the scheme stated above overly complicated?
I would appreciate some insight before trying to implement it this way.
The code for the python program (without the client) is:
The main (which is the start() method):
class Controller:
    def __init__(self, listener, executor):
        self.listener = listener
        self.executor = executor

    def start(self):
        while True:
            text = self.listener.listen_for_hotword()
            if self.executor.is_hotword(text):
                text = self.listener.listen_for_command()
                if self.executor.has_matching_command(text):
                    self.executor.execute_command(text)
                else:
                    tts.say("No command found. Please try again")
The Listener (gets input from the user):
class TextListener(Listener):
    def listen_for_hotword(self):
        text = raw_input("Hotword: ")
        text = ' '.join(text.split()).lower()
        return text

    def listen_for_command(self):
        text = raw_input("What would you like me to do: ")
        text = ' '.join(text.split()).lower()
        return text
The executor (the class that executes the given command):
import os


class Executor:
    # TODO: Define default path
    def __init__(self, parser, audio_path='../Misc/audio.wav'):
        self.command_parser = parser
        self.audio_path = audio_path

    def is_hotword(self, hotword):
        return self.command_parser.is_hotword(hotword)

    def has_matching_command(self, command):
        return self.command_parser.has_matching_command(command)

    def execute_command(self, command):
        val = self.command_parser.getCommand(command)
        print val
        val = os.system(val)  # so we don't see the return value of the command
The command file parser:
import yaml
import jellyfish


class KeyNotFoundException(Exception):
    pass


class YAMLParser:
    THRESHOLD = 0.6

    def __init__(self, path='Configurations/commands.yaml'):
        with open(path, 'r') as f:
            self.parsed_yaml = yaml.load(f)

    def getCommand(self, key):
        try:
            matching_command = self.find_matching_command(key)
            return self.parsed_yaml["Commands"][matching_command]
        except KeyError:
            raise KeyNotFoundException("No key matching {}".format(key))

    def has_matching_command(self, key):
        try:
            for command in self.parsed_yaml["Commands"]:
                if jellyfish.jaro_distance(command, key) >= self.THRESHOLD:
                    return True
        except KeyError:
            return False

    def find_matching_command(self, key):
        for command in self.parsed_yaml["Commands"]:
            if jellyfish.jaro_distance(command, key) >= 0.5:
                return command

    def is_hotword(self, hotword):
        return jellyfish.jaro_distance(self.parsed_yaml["Hotword"], hotword) >= self.THRESHOLD
Example configuration file:
Commands:
    echo: echo hello
Hotword: start
I'm finding it extremely difficult to follow the background in your questions, but I'll take a stab at the questions themselves.
How to run a program and also execute code when data is received from the network?
As you noted in your question, the typical way to write a "walk and chew-gum" style app is to design your code in a threaded or an event-loop style.
Given you're talking about threading and Twisted (which is event-loop style), I'm worried that you may be thinking about mixing the two.
I view them as fundamentally different styles of programming (each with places where it excels), and mixing them is generally a path to hell.
Let me give you some background to explain.
Background: Threading vs Event programming
Threading
How to think of the concept:
I have multiple things I need to do at the same time, and I want my operating system to figure out how and when to run those separate tasks.
Pluses:
'The' way to let one program use multiple processor cores at the same time
In the POSIX world the only way to let one process run on multiple CPU cores at the same time is via threads (with the typical ideal number of threads being no more than the number of cores in a given machine)
Easier to start with
The same code that you were running inline can usually be tossed off into a thread without needing a redesign (without the GIL some locking would be required, but more on that later)
Much easier to use with tasks that will eat all the CPU you can give them
i.e., in most cases math is far easier to deal with using threading solutions than using event/async frameworks
Minuses:
Python has a special problem with threads
In CPython the global interpreter lock (GIL) can negate threads' ability to multitask (making threads nearly useless). Avoiding the GIL is messy and can undo all the ease of use of working in threads
As you add threads (and locks) things get complicated fast, see this SO: Threading Best Practices
Rarely optimal for IO/user-interacting tasks
While threads are very good at dealing with small numbers of tasks that want to use lots of CPU (Ideally one thread per core), they are far less optimal at dealing with large counts of tasks that spend most of their time waiting.
Best use:
Computationally expensive things.
If you have big chunks of math that you want to run concurrently, it's very unlikely that you're going to be able to schedule the CPU utilization more intelligently than the operating system.
(Given CPython's GIL problem, threading shouldn't be used manually for math; instead, a library that threads internally (like NumPy) should be used.)
Event-loop/Asynchronous programming
How to think of the concept:
I have multiple things I need to do at the same time, but I (the programmer) want direct control over how and when my sub-tasks are run
How you should be thinking about your code:
Think of all your sub-tasks as one big intertwined whole; you should always be asking yourself "will this code run fast enough that it doesn't goof up the other sub-tasks I'm managing?"
Pluses:
Extraordinarily efficient with network/IO/UI connections, including large counts of connections
Event-loop style programs are one of the key technologies that solved the c10k problem. Frameworks like Twisted can literally handle tens of thousands of connections in one Python process running on a small machine.
Predictable (small) increase in complexity as other connections/tasks are added
While there is a fairly steep learning curve (particularly in Twisted), once the basics are understood, new connection types can be added to projects with a minimal increase in the overall complexity. Moving from a program that offers a keyboard interface to one that offers keyboard/telnet/web/ssl/ssh connections may just be a few lines of glue code per interface (... this is highly variable by framework, but regardless of the framework, event-loop complexity is more predictable than threads as connection counts increase)
Minuses:
Harder to get started.
Event/async programming requires a different style of design from the first line of code (though you'll find that style of design is somewhat portable from language to language)
One event-loop, one core
While event-loops can let you handle a spectacular number of IO connections at the same time, they in-and-of-themselves can't run on multiple cores at the same time. The conventional way to deal with this is to write programs in such a way that multiple instances of the program can be run at the same time, one for each core (this technique bypasses the GIL problem)
Multiplexing high CPU tasks can be difficult
Event programming requires cutting high-CPU tasks into pieces such that each piece takes an (ideally predictable) small amount of CPU; otherwise the event system ceases to multitask whenever the high-CPU task is run.
Best use:
IO based things
TL; DR - Given your question, look at twisted and -not- threads
While your application doesn't seem to be exclusively IO-based, none of it seems to be CPU-bound (it looks like you're currently playing audio via a system call; system spins off an independent process every time it's called, so its work doesn't burn your process's CPU. However, system blocks, so it's a no-no in Twisted: you have to use different calls there).
Your question also doesn't suggest you're concerned about maxing out multiple cores.
Therefore, given you specifically talked about Twisted, and an event-loop solution seems to be the best match for your application, I would recommend looking at Twisted and -not- threads.
Given the 'Best use' notes above, you might be tempted to think that mixing Twisted and threads is the way to go, but if you do anything even slightly wrong you will lose the advantages of both the event loop (you'll block) and threading (the GIL won't let the threads multitask) and end up with something super complex that provides you no advantage.
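
To illustrate the "different calls" point: here is a minimal sketch (my own, not from the question) of running an external command from Twisted without blocking the reactor, using twisted.internet.utils.getProcessOutput in place of os.system:

from twisted.internet import reactor
from twisted.internet.utils import getProcessOutput

def run_command():
    # Returns a Deferred; the reactor keeps serving other events while
    # the child process runs.
    d = getProcessOutput('/bin/echo', ('hello',))
    d.addCallback(handle_output)

def handle_output(output):
    print(output)
    reactor.stop()

reactor.callWhenRunning(run_command)
reactor.run()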
Is the scheme stated above overly complicated? I would appreciate some insight before trying to implement it this way.
The 'scheme' you gave is:
After some thinking I couldn't come up with any other way to implement this other than:
Have the above program run as process #1 (the program that runs locally as I've written at the beginning).
A Twisted client will be run as process #2 and receive the commands from remote clients. Whenever a command is received, the Twisted client will initialize a thread that'll parse the command, check for its validity and execute it if it's valid.
Since I don't have that much experience with threads and none with network programming, I couldn't think of any other scheme that makes sense to me.
In answer to "Is the scheme ... overly complicated", I would say almost certainly yes, because you're talking about Twisted and threads (see the TL;DR above).
Given my certainly incomplete (and confused) understanding of what you want to build, I would imagine a Twisted solution for you would look like:
Carefully study the krondo Twisted Introduction (it really requires tracing the example code line by line, but if you do the work it's an AWESOME learning tool for Twisted - and event programming in general)
From the ground up, rewrite your 'hotword' program in Twisted using what you learned in the krondo guide, starting out by providing just whichever interface you currently have (keyboard?)
Add other communication interfaces to that code (telnet, web, etc.) that would let you reach the processing code you wrote for the keyboard(?) interface (see the sketch after this list).
If, as you state in your question, you really need a server, you could write a second Twisted program to provide that (you'll see examples of all of this in the krondo guide). Though I'm guessing that once you understand Twisted's library support, you'll realize you don't need to build any extra servers and can just include whichever protocols you need in your base code.
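
For the "other communication interfaces" step, a hedged sketch of what a simple TCP command interface might look like, reusing the Executor class from the question (the port number is arbitrary, and note that execute_command still calls the blocking os.system, which a real Twisted program would replace as discussed above):

from twisted.internet import protocol, reactor
from twisted.protocols.basic import LineReceiver

class CommandProtocol(LineReceiver):
    # One instance per connected client; lineReceived fires once per line.
    def lineReceived(self, line):
        command = line.decode('utf-8').strip().lower()
        if self.factory.executor.has_matching_command(command):
            self.factory.executor.execute_command(command)
            self.sendLine(b"ok")
        else:
            self.sendLine(b"no command found")

class CommandFactory(protocol.ServerFactory):
    protocol = CommandProtocol

    def __init__(self, executor):
        self.executor = executor  # the Executor from the question

factory = CommandFactory(executor)  # assumes an Executor instance is already built
reactor.listenTCP(8123, factory)    # 8123 is an arbitrary example port
reactor.run()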

Python - Waiting for input from lots of sockets

I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via a Unix socket to the master process. I would like the master process to be able to monitor all of the sockets for a response - but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for any of the sockets to become readable or writable, or to report an error. This mechanism lets you go to sleep until something interesting happens, handle however many sockets have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multithreading can often reduce the chances for errors, and this style of programming should get you into the hundreds of connections with no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
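
A stripped-down sketch of that single-threaded select loop (handle_message is a hypothetical callback; sockets is assumed to be the list of already-connected Unix-domain sockets to the workers):

import select

def serve(sockets, handle_message):
    while sockets:
        # Sleep until at least one socket is readable or has an error,
        # waking up at least once a second.
        readable, _, errored = select.select(sockets, [], sockets, 1.0)
        for sock in readable:
            data = sock.recv(4096)
            if data:
                handle_message(sock, data)
            else:                      # an empty read means the peer closed
                sockets.remove(sock)
                sock.close()
        for sock in errored:
            if sock in sockets:
                sockets.remove(sock)
                sock.close()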
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer module in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
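
A minimal sketch along those lines (the module is SocketServer on Python 2 and socketserver on Python 3; the echo handler and port are made-up placeholders). Swapping ThreadingMixIn for ForkingMixIn is the only change needed to compare the two approaches:

import SocketServer  # 'socketserver' on Python 3

class EchoHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # One handler runs per connection; echo every line back to the client.
        for line in self.rfile:
            self.wfile.write(line)

class ThreadedServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    allow_reuse_address = True

if __name__ == '__main__':
    server = ThreadedServer(('localhost', 9999), EchoHandler)
    server.serve_forever()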
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single-process, single-thread style.
The Tornado framework is similar (less evolved and less full-featured, but easier to understand).
And Gunicorn, which is a high-performance forking server.

Python - question regarding the concurrent use of `multiprocess`

I want to use Python's multiprocessing to do concurrent processing without using locks (locks to me are the opposite of multiprocessing) because I want to build up multiple reports from different resources at the exact same time during a web request (this normally takes about 3 seconds, but with multiprocessing I can do it in 0.5 seconds).
My problem is that, if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, then you can use the threading module's semaphore object:
from threading import Semaphore

sem = Semaphore(10)

with sem:
    make_multiprocessing_call()
I'm assuming that make_multiprocessing_call() will cleanup after itself.
This way only 10 "extra" instances of Python will ever be open at once; if another request comes along, it will just have to wait until the previous ones have completed. Unfortunately this won't be in "queue" order ... or any order in particular.
Hope that helps
You are barking up the wrong tree if you are trying to use multiprocess to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocess is not what you want (at least as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a resource. If they are only reading, locks are not needed (and, as you said, defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
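
A quick way to manufacture those simultaneous accesses, as a sketch (the report URL is a made-up placeholder):

import threading
import requests

REPORT_URL = 'http://localhost:8000/report'  # hypothetical endpoint

def hit():
    # Each thread fires one request; ten threads give ten concurrent accesses.
    print(requests.get(REPORT_URL).status_code)

threads = [threading.Thread(target=hit) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()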
