I'm serving requests from several XMLRPC clients over a WAN. The thing works great for, let's say, a day (sometimes two), then freezes in socket.py at:
data = self._sock.recv(self._rbufsize)
_sock.timeout is -1, _sock.gettimeout is None
There is nothing special I do in the main thread (just receiving XMLRPC calls); there are two other threads talking to the DB. Both of those threads work fine and survive the block (I checked with WinPdb). Clients send requests no longer than 1 KB, and there isn't any special content: just nice, clean strings in a dictionary. Between two freezes I serve tens of thousands of requests without problems.
Firewall is off, no strange software on the same machine, etc...
I use Windows XP and Python 2.6.4. I've checked the differences between 2.6.4 and 2.6.5 and didn't find anything important (or am I mistaken?). Version 2.7 is not an option, as I would miss binaries for MySQLdb.
The only thing that happens from time to time, caused by clients with poor internet connections, is that sockets break. This happens every 5-10 minutes (there are just five clients, each accessing the server every 2 seconds).
I've spent a great deal of time on this issue and I'm beginning to run out of ideas. Any hint or thought would be highly appreciated.
What exactly is happening in your OS's TCP/IP stack (possibly in the Python layers on top, but that's less likely) to cause this is a mystery. As a practical workaround, I'd set a timeout longer than the delays you expect between requests (10 seconds should be plenty if you expect a request every 2 seconds) and, if one occurs, close and reopen the socket. (Calibrate the delay needed to work around freezes without interrupting normal traffic by trial and error.) It's unpleasant to hack in a fix without understanding the problem, I know, but being pragmatic about such things is a necessary survival trait in the world of writing, deploying and operating actual server systems. Be sure to comment the workaround accurately for future maintainers!
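A minimal sketch of that workaround, assuming a plain SimpleXMLRPCServer on Python 2.x; the port and the handle_call function are hypothetical, and the only real change is the default timeout set before the server is created:

import socket
from SimpleXMLRPCServer import SimpleXMLRPCServer   # Python 2.x module name

# Any socket that sits in recv() longer than this is treated as stuck.
# 10 seconds is far above the ~2 second gap between client requests.
socket.setdefaulttimeout(10)

server = SimpleXMLRPCServer(("0.0.0.0", 8000), logRequests=False)   # port is a placeholder
server.register_function(handle_call, "handle_call")                # hypothetical handler
server.serve_forever()

When the timeout fires, that request fails and the server drops the connection and keeps serving; the client should simply reconnect on its next call, which is the close-and-reopen behaviour described above.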
Thanks so much for the fast response. Right after I received it I increased the timeout to 10 seconds. Now it is all running without problems, but of course I'll need to wait another day or two for some sort of confirmation; only after 5 days will I be sure, and I'll come back with the results. I see that 140K requests have already gone through fine, but having had such a hard time with this one I'd rather wait for at least another 200K.
What you were proposing about auto-adaptation of timeouts (without taking the system down) also sounds reasonable. Would the right way to go be to create a small class (e.g. AutoTimeoutCalibrator) and embed it directly into serial.py?
Yes - being pragmatic is the only way to go without losing another 10 days trying to figure out the real reason behind it.
Thanks again, I'll be back with the results.
(sorry, but for some reason I was not able to post it as a reply to your post)
Related
I have three Raspberry Pis, all on the same LAN, doing stuff that is monitored by Python, and I want them to talk to each other and to my PC. Sockets seem like the way to go, but the examples are so simplistic. Here's the issue I am stuck on: the listen and receive calls are all blocking, unless you set a timeout, in which case they still block, just for less time.
So, if I set up a round-robin, each Pi will only be listened to (or received from) for a third of the time, or less if there is stuff to transmit as well.
What I'd like to understand better is what happens to the data (or connection requests) when I am not listening/receiving - is it buffered by the OS, or lost? And what happens to the socket when there is no method being called on it - is it happy to be ignored for a while, or will the socket itself be dumped by the OS?
I am starting to split these into separate processes now, which is getting messy and seems inefficient, but I can't think of another way except to run this as 3 (currently), maybe 6 (transmit/receive) or even 9 (listen/transmit/receive) separate processes..?
Sorry I don't have a code example, but it is already way too big, and it doesn't work. Plus, a lot of the issue seems to me to lie in the murky part of sockets - the part between the socket and the OS. I feel I need to understand this better to get to the right architecture for my bit of code before I really start debugging the various exceptions and communication failures.
You can handle multiple sockets in a single process using I/O multiplexing. This is usually done using calls such as epoll(), poll() or select(). These calls monitor multiple sockets and return when one or more of them have data available for reading or are ready to accept data for writing. In many cases this is more convenient than using multiple processes and/or threads.
These calls are pretty low level OS calls. Python seems to have higher level functionality that might be easier to use but I haven't tried this myself.
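The higher-level functionality in the standard library is the selectors module (Python 3.4+). A minimal sketch of one process watching several listening sockets at once; the port numbers are placeholders:

import selectors
import socket

sel = selectors.DefaultSelector()

# One listening socket per peer (the other Pis, the PC); the ports are placeholders.
for port in (5001, 5002, 5003):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(5)
    srv.setblocking(False)
    sel.register(srv, selectors.EVENT_READ, data="listener")

while True:
    # Blocks until at least one registered socket is ready, so nothing is
    # starved and there is no busy polling.
    for key, _ in sel.select():
        if key.data == "listener":
            conn, addr = key.fileobj.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data="peer")
        else:
            msg = key.fileobj.recv(4096)
            if msg:
                print("received:", msg)
            else:                        # an empty read means the peer closed
                sel.unregister(key.fileobj)
                key.fileobj.close()

As for the buffering question: while you are not calling recv(), incoming data sits in the kernel's receive buffer for that socket, and pending connection requests queue up to the listen() backlog, so nothing is lost just because your code is momentarily busy elsewhere.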
I'm using requests and threading in Python to do some stuff. My question is: is this code truly multithreaded, and is it safe to use? I'm experiencing some slowdown over time. Note: I'm not using this exact code, but mine does similar things.
import time
import threading
import requests

current_threads = 0
max_threads = 32

def doStuff():
    global current_threads
    r = requests.get('https://google.de')
    # note: this decrement is not protected by a lock
    current_threads -= 1

while True:
    while current_threads >= max_threads:
        time.sleep(0.05)
    thread = threading.Thread(target=doStuff)
    thread.start()
    current_threads += 1
There could be a number of reasons for the issue you are facing. I'm not an expert in Python, but I can see several potential causes for the slowdown:
Depending on the size of the data you are pulling down, you could be overloading your bandwidth. This is a hard one to prove without seeing the exact code you are using, knowing what it does, and knowing your bandwidth.
Kind of connected to the first one, but if your files are taking some time to come down per thread, things may be getting clogged up at:
while current_threads >= max_threads:
    time.sleep(0.05)
You could try reducing the maximum number of threads and see if that helps, though it may not if it's the files that are taking time to download.
The problem may not be with your code or your bandwidth but with the server you are pulling the files from; if that server is overloaded, it may be slowing down your transfers.
Firewalls, IPS, IDS, or policies on the server may be throttling your requests. If you make too many requests too quickly, all from the same IP, the server-side network equipment may mistake this for some sort of DoS attack and throttle your requests in response.
Unfortunately Python, compared to lower-level languages such as C# or C++, is not as good at multithreading. This is due to something called the GIL (Global Interpreter Lock), which only lets one thread execute Python bytecode at a time, so threads doing CPU work don't truly run in parallel. This is quite a sizeable subject in itself, but if you want to read up on it, have a look at this link.
https://medium.com/practo-engineering/threading-vs-multiprocessing-in-python-7b57f224eadb
Sorry I can't be of any more assistance but this is as much as I can say on the subject given the provided information.
Sure, you're running multiple threads, and provided they're not accessing or mutating the same resources, you're probably "safe".
Whenever I'm accessing external resources (i.e., using requests), I always recommend asyncio over vanilla threading, as it allows custom context switching (you switch contexts at every "await", whereas in vanilla threading the OS decides when to switch between threads, which might not be optimal) and has reduced overhead (you're only using ONE thread).
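A minimal sketch of that approach, assuming aiohttp as the async HTTP client (requests itself is blocking, so it doesn't mix well with asyncio); the URL, the request count, and the concurrency cap are placeholders mirroring the code above:

import asyncio
import aiohttp   # assumption: aiohttp as the async HTTP client

MAX_CONCURRENT = 32

async def fetch(session, sem, url):
    async with sem:                      # caps the number of in-flight requests
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, "https://google.de") for _ in range(1000)]
        results = await asyncio.gather(*tasks)
        print(len(results), "responses fetched")

asyncio.run(main())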
I'm working on a project to learn Python, SQL, Javascript, and running servers -- basically getting a grip on full-stack development. Right now my basic goal is this:
I want to run a Python script indefinitely that constantly makes API calls to different services with different rate limits (e.g. 200/hr, 1000/hr, etc.) and stores the results (ints) in a database (PostgreSQL). I want to store these results over a period of time and then begin working with that data to display fun stuff on the front end. I need this to run 24/7. I'm trying to understand the general architecture here, and searching around has proven surprisingly difficult. My basic idea in rough pseudocode is this:
database.connect()

def function1(serviceA):
    while True:
        result = makeAPIcallA()
        database.execute("INSERT INTO tableA VALUES (%s)", (result,))
        if hitRateLimitA():
            sleep(limitTimeA)

def function2(serviceB):
    # same thing, different limits, etc.
    ...
And I would ssh into my server, run python myScript.py &, shut my laptop down, and wait for the data to roll in. Here are my questions:
Does this approach make sense, or should I be doing something completely different?
Is it considered "bad" or dangerous to open a database connection indefinitely like this? If so, how else do I manage the DB?
I considered using a scheduler like cron, but the rate limits are variable. I can't run the script every hour when my limit is hit, say, 5 minutes into the start time and has a wait time of 60 minutes after that. Even running it on minute intervals seems messy: I need to sleep for persistent rate-limit wait times which keep varying. Am I correct in assuming a scheduler is not the way to go here?
How do I gracefully handle any unexpected potentially fatal errors (namely, logging and restarting)? What about manually killing the script, or editing it?
I'm interested in learning different approaches and best practices here -- any and all advice would be much appreciated!
I actually do exactly what you do for one of my personal applications and I can explain how I do it.
I use Celery instead of cron because it allows finer adjustments in scheduling, and it is Python rather than bash, so it's easier to use. I have different tasks (basically groups of API calls and DB updates) for different sites running at different intervals to account for the various rate limits.
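A rough sketch of what that scheduling can look like, assuming Celery 4+ with a Redis broker; the module name (poller.py), broker URL, task names, and intervals are all placeholders, with the intervals chosen to stay under the example rate limits from the question:

from celery import Celery

app = Celery("poller", broker="redis://localhost:6379/0")   # broker URL is an assumption

app.conf.beat_schedule = {
    "poll-service-a": {
        "task": "tasks.poll_service_a",   # hypothetical task, ~200/hr limit
        "schedule": 30.0,                 # every 30 s -> 120/hr, safely under
    },
    "poll-service-b": {
        "task": "tasks.poll_service_b",   # hypothetical task, ~1000/hr limit
        "schedule": 5.0,                  # every 5 s -> 720/hr, safely under
    },
}

@app.task(name="tasks.poll_service_a")
def poll_service_a():
    pass   # make the API call and write the result to Postgres here

You would then run a worker plus the beat scheduler, e.g. celery -A poller worker and celery -A poller beat.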
I have the Celery app run as a service so that even if the system restarts it's trivial to restart the app.
I use the logging library extensively in my application, because it is difficult to debug something when all you have is one hard-to-read stack trace. I have INFO-level and DEBUG-level logs spread throughout my application, and any WARNING-level and above log gets printed to the console AND sent to my email.
For exception handling, the majority of what I prepare for are rate limit issues and random connectivity issues. Make sure to surround whatever HTTP request you send to your API endpoints in try-except statements and possibly just implement a retry mechanism.
As far as the DB connection goes, it shouldn't matter how long the connection stays open, but you need to make sure to surround your main application loop in a try-except statement and make sure it fails gracefully by closing the connection in the case of an exception. Otherwise you might end up with a lot of ghost connections and your application not being able to reconnect until those connections are gone.
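To illustrate that last point, a rough sketch assuming psycopg2 as the PostgreSQL driver; the DSN and the run_one_cycle helper are hypothetical:

import logging
import psycopg2   # assumption: psycopg2 as the PostgreSQL driver

conn = psycopg2.connect("dbname=results user=poller")   # hypothetical DSN
try:
    while True:
        run_one_cycle(conn)      # hypothetical: one round of API calls + INSERTs
except Exception:
    logging.exception("Fatal error in main loop, shutting down")
    raise
finally:
    conn.close()                 # avoid leaving ghost connections behind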
Recently I've been working on a program which can download manga from an online manga website. It works, but it's a bit slow, so I decided to use multithreading/multiprocessing to speed up the downloading. Here are my questions:
Which one is better? (This is a Python 3 program.)
Multiprocessing, I think, will definitely work. If I use multiprocessing, what is a suitable number of processes? Does it relate to the number of cores in my CPU?
Multithreading will probably work. This download job obviously spends much of its time waiting for pics to be downloaded, so I think that when a thread starts waiting, Python will let another thread work. Am I correct?
I've read "Inside the New GIL" by David M. Beazley. What's the influence of the GIL if I use multithreading?
You're probably going to be bound by either the server's upload pipe (if you have a faster connection) or your download pipe (if you have a slower connection).
There's significant startup latency associated with TCP connections. To avoid this, HTTP servers can recycle connections for requesting multiple resources. So there are two ways for your client to avoid this latency hit:
(a) Download several resources over a single TCP connection so your program only suffers the latency once, when downloading the first file
(b) Download a single resource per TCP connection, and use multiple connections so that hopefully at every point in time, at least one of them will be downloading at full speed
With option (a), you want to look into how to reuse connections with whatever HTTP library you're using. Any good one will have a way to recycle connections. http://python-requests.org/ is a good Python HTTP library.
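With requests, connection reuse is what requests.Session gives you; a minimal sketch of option (a), with placeholder URLs:

import requests

urls = ["http://example.com/chapter1/%02d.jpg" % i for i in range(1, 11)]   # placeholders

# A Session keeps the underlying TCP connection alive between requests to the
# same host (HTTP keep-alive), so the connection latency is only paid once.
with requests.Session() as session:
    for url in urls:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        with open(url.rsplit("/", 1)[-1], "wb") as f:
            f.write(resp.content)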
For option (b), you probably do want a multithreaded/multiprocess route. I'd suggest only 2-3 simultaneous threads, since any more will likely just split the bandwidth among the connections and raise the risk of getting banned for making multiple simultaneous downloads.
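And a minimal sketch of option (b) using the standard library's thread pool, capped at 3 workers as suggested; the download helper and URLs are placeholders:

from concurrent.futures import ThreadPoolExecutor
import requests

def download(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    filename = url.rsplit("/", 1)[-1]
    with open(filename, "wb") as f:
        f.write(resp.content)
    return filename

urls = ["http://example.com/chapter1/%02d.jpg" % i for i in range(1, 11)]   # placeholders

# 2-3 simultaneous connections, as suggested above.
with ThreadPoolExecutor(max_workers=3) as pool:
    for name in pool.map(download, urls):
        print("finished", name)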
The GIL doesn't really matter for this use case, since your code will be doing almost no processing, spending most of its time waiting for bytes to arrive over the network.
The lazy way to do this is to avoid Python entirely, because most UNIX-like environments have good building blocks for it. (If you're on Windows, your best choices for this approach would be msys, Cygwin, or a VirtualBox VM running some flavor of Linux; I personally like Linux Mint.) If you have a list of the URLs you want to download, one per line, in a text file, try this:
cat myfile.txt | xargs -n 1 --max-procs 3 --verbose wget
The "xargs" command with these parameters will take whitespace-delimited URLs on stdin (in this case coming from myfile.txt) and run "wget" on each of them. It will allow up to 3 "wget" subprocesses to run at a time; when one of them completes (or errors out), it will read another line and launch another subprocess, until all the input URLs are exhausted. If you need cookies or other complicated stuff, curl might be a better choice than wget.
It doesn't really matter. It is indeed true that threads waiting on I/O won't get in the way of other threads running, and since downloading over the Internet is an I/O-bound task, there's no real reason to try to spread your execution threads over multiple CPUs. Given that, and the fact that threads are more lightweight than processes, it might be better to use threads, but you honestly aren't going to notice the difference.
How many threads you should use depends on how hard you want to hit the website. Be courteous and take care that your scraping isn't viewed as a DOS attack.
You don't really need multithreading for this kind of task. You could try single-threaded async programming using something like Twisted.
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multithreaded so that it can crawl several pages at a time. I figured I would make a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite DB, and then pops the next URL off the queue.
You will probably find your application is bandwidth-limited, not CPU- or I/O-limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard and fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted with either a deferred semaphore or a task cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.
It's extremely unlikely you'll be able to make a multithreaded crawler that's faster or smaller than a Twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
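A minimal sketch of that, assuming the crawler can be pointed at its own slice of the work via command-line arguments (crawler.py and its --slice/--of options are hypothetical):

import subprocess

NUM_PROCESSES = 4                 # pick a number and measure, as described below

# Launch independent crawler processes, each told which slice of the queue is its own.
procs = [
    subprocess.Popen(["python", "crawler.py", "--slice", str(i), "--of", str(NUM_PROCESSES)])
    for i in range(NUM_PROCESSES)
]

for p in procs:                   # wait for all of them to finish
    p.wait()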
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you'll see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes against your desired elapsed time.
Think of it this way.
One crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might still take an hour. Let's say 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternative solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different option is pycurl, which is a thin wrapper around libcurl, a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing happens after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
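A minimal sketch of that, written against Python 2.6 as mentioned above; fetch_and_parse, the URL list, and what the worker returns are all placeholders. Note that each process should either do its own SQLite writes or hand results back to the parent, since a single SQLite connection shouldn't be shared across processes:

from multiprocessing import Pool
import urllib2                    # Python 2 standard library, per the question's era

def fetch_and_parse(url):
    # Hypothetical worker: download the page and return something about it.
    html = urllib2.urlopen(url).read()
    return url, len(html)

if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(100)]   # placeholder queue
    pool = Pool(processes=2)      # one worker per core on a dual-core machine
    for url, size in pool.map(fetch_and_parse, urls):
        print "%s: %d bytes" % (url, size)
    pool.close()
    pool.join()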
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
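One way to enforce that cap for a single target server, sketched with a semaphore; fetch is a hypothetical download function, and for multiple servers you would keep one semaphore per host:

import threading

# Allow at most 5 requests in flight against the target server at any time.
per_server_limit = threading.Semaphore(5)

def polite_fetch(url):
    with per_server_limit:        # blocks while 5 requests are already running
        return fetch(url)         # hypothetical download function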
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets, which reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would let you specify the URLs you wish to download and register a callback for each; when each download finishes, the callback is called and the page can be processed. To allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
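The "worker thread plus queue" part of that, sketched with the standard library for the Python 2 era of this question; parse_and_store and the way pages arrive are placeholders, and the download side could be Twisted callbacks or anything else that puts pages on the queue:

import threading
import Queue                      # named 'queue' on Python 3

pages = Queue.Queue()

def worker():
    while True:
        url, html = pages.get()   # blocks until the download side queues a page
        if html is None:          # (None, None) is the shutdown sentinel
            break
        parse_and_store(url, html)   # hypothetical: regex searches + SQLite insert
        pages.task_done()

threading.Thread(target=worker).start()

# The download side (e.g. a Twisted callback) just does pages.put((url, html)),
# and pages.put((None, None)) once everything has been fetched.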
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.