BioPython submit multiple online blasts - python

Is it possible to submit multiple sequences to the Bio.Blast.NCBIWWW module at the same time? I've tried to create a function that runs my blast and have several of them run using multiprocessing, but I think the NCBI server boots me after a while and the connection stops working.

I don't know what sort of limits NCBI has on their service, but you may want to look into installing BLAST locally and running your queries that way. Biopython has support for local BLAST: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec96

Here they detail how to properly use it:
http://www.ncbi.nlm.nih.gov/BLAST/Doc/node60.html
Do not launch more than 50 threads.
Wait for the RID of a BLAST before starting the next one.
Flooding the server can lead to many problems and eventually we may be forced to block access from sites which flood the servers with no warning. We strongly suggest you limit your scripts to not send a request until you receive a RID from the server. Alternatively please introduce a "sleep" command in your script which sends requests no more often than once every three seconds.
Biopython waits for the RID for you, but if you launch several queries at the same time you are almost certainly going to be banned.
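A minimal sketch of the "one request at a time, at least three seconds apart" advice. The function name submit_sequentially and its parameters are hypothetical; submit stands for whatever blocking call sends one query and returns its result.

```python
import time


def submit_sequentially(sequences, submit, min_interval=3.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Submit queries one at a time, starting at most one per min_interval seconds."""
    results = []
    last_start = None
    for seq in sequences:
        if last_start is not None:
            # Wait out whatever is left of the interval since the last request.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        # submit() blocks until the server has answered, so the next
        # request never starts before the previous one has finished.
        results.append(submit(seq))
    return results
```

With Biopython, submit could be something like lambda seq: NCBIWWW.qblast("blastn", "nt", seq); the program and database names here are only placeholders for your own.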

Related

Running an infinite Python script while connected to database

I'm working on a project to learn Python, SQL, JavaScript, running servers -- basically getting a grip of full-stack. Right now my basic goal is this:
I want to run a Python script infinitely, which is constantly making API calls to different services, which have different rate limits (e.g. 200/hr, 1000/hr, etc.) and storing the results (ints) in a database (PostgreSQL). I want to store these results over a period of time and then begin working with that data to display fun stuff on the front. I need this to run 24/7. I'm trying to understand the general architecture here, and searching around has proven surprisingly difficult. My basic idea in rough pseudocode is this:
database.connect()

def function1(serviceA):
    while(True):
        result = makeAPIcallA()
        INSERT INTO tableA result;
        if(hitRateLimitA):
            sleep(limitTimeA)

def function2(serviceB):
    //same thing, different limits, etc.
And I would ssh into my server, run python myScript.py &, shut my laptop down, and wait for the data to roll in. Here are my questions:
Does this approach make sense, or should I be doing something completely different?
Is it considered "bad" or dangerous to open a database connection indefinitely like this? If so, how else do I manage the DB?
I considered using a scheduler like cron, but the rate limits are variable. I can't run the script every hour if my limit is hit, say, 5 minutes after the start and is followed by a 60-minute wait. Even running it at one-minute intervals seems messy: I need to sleep for rate-limit wait times that keep varying. Am I correct in assuming a scheduler is not the way to go here?
How do I gracefully handle any unexpected potentially fatal errors (namely, logging and restarting)? What about manually killing the script, or editing it?
I'm interested in learning different approaches and best practices here -- any and all advice would be much appreciated!
I actually do exactly what you do for one of my personal applications and I can explain how I do it.
I use Celery instead of cron because it allows for finer adjustments in scheduling and it is Python and not bash, so it's easier to use. I have different tasks (basically a group of API calls and DB updates) to different sites running at different intervals to account for the various different rate limits.
I have the Celery app run as a service so that even if the system restarts it's trivial to restart the app.
I use the logging library in my application extensively because it is difficult to debug something when all you have is one difficult to read stack trace. I have INFO-level and DEBUG-level logs spread throughout my application, and any WARNING-level and above log gets printed to the console AND gets sent to my email.
For exception handling, the majority of what I prepare for are rate limit issues and random connectivity issues. Make sure to surround whatever HTTP request you send to your API endpoints in try-except statements and possibly just implement a retry mechanism.
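A retry mechanism of the kind mentioned above could look roughly like this. It is only a sketch: call_with_retry and its parameters are made-up names, and the doubling delay is one possible policy, not something the original setup prescribes.

```python
import time


def call_with_retry(func, retries=3, base_delay=1.0,
                    exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of retries: let the caller log and handle it
            sleep(base_delay * 2 ** attempt)
```

Pass the exception types your HTTP library raises for connectivity problems as exceptions, so that genuine bugs are not silently retried.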
As far as the DB connection goes, it shouldn't matter how long your connection stays open, but you need to surround your main application loop in a try-except statement and make sure it fails gracefully by closing the connection in the case of an exception. Otherwise you might end up with a lot of ghost connections and your application not being able to reconnect until those connections are gone.
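Here is a small sketch of closing the connection no matter how the main loop ends. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a server; run_collector, fetch_values and the results table are hypothetical names.

```python
import sqlite3
from contextlib import closing


def run_collector(db_path, fetch_values):
    """Store every value from fetch_values(), closing the connection on exit."""
    # closing() guarantees conn.close() runs even if the loop raises.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
        for value in fetch_values():
            conn.execute("INSERT INTO results (value) VALUES (?)", (value,))
            conn.commit()  # commit each row so earlier rows survive a later failure
```

The exception still propagates to the caller, which is where logging and restarting would happen; the same pattern should apply to any DB-API connection object that has a close() method.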

Multiple rs.exe invocations

I have a Python-based SSRS report generation utility that I use to generate multiple reports (often 100+). It is set up as follows:
Multiple threads are started using threading.Thread, and each of them is given a dictionary.
Each thread parses the dictionary and calls rs.exe, passing in the relevant arguments via Python's subprocess.call.
Reports get generated, with the following caveats:
With around 20-30 reports, everything works without much trouble.
Once the number of reports goes beyond 40-50 (for reasons unknown to me so far), some of the reports don't get rendered and subprocess.call returns a non-zero status (the error message from subprocess.call does not point to any real error).
But there is no error in those rs.exe commands, as they render when I run them from the Windows command prompt.
Additionally, when I re-run all the failed reports, they render. Neither the commands nor the data change between runs.
To work around this, I added retry logic for 2 iterations, which fixes the issue at times. However, when the number of reports goes beyond 100-150, even the retry doesn't work. I could extend the retry logic to keep retrying until all the reports are rendered and the only remaining failures are genuine ones (RDL not found, corrupted, and so on). But before I do that, I want to know whether there is any limit on how many rs.exe instances can be launched simultaneously, or any limitation of Python's subprocess.call when invoked in a multi-threaded context.
Can someone please share their expertise if they've faced this kind of issue and resolved it?
Thanks.
I suspect the limit you are hitting is not rs.exe itself but the target Report Server. The Report Server will use as much physical memory as is available, but when that is exhausted, further requests will start to fail. This is described in the SSRS documentation:
https://msdn.microsoft.com/en-us/library/ms159206.aspx
To avoid this issue and leave some server resources for other users, I would reduce your thread limit as low as you can stand - ideally to 1.
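One way to cap the number of simultaneous rs.exe invocations is a fixed-size thread pool instead of one thread per report. This is a sketch under assumptions: run_commands and failed are hypothetical helpers, and each command is taken to be the argument list already being passed to subprocess.call.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_workers=2, runner=subprocess.call):
    """Run each command with at most max_workers running at once.

    Returns a list of (command, return_code) pairs in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(runner, commands))
    return list(zip(commands, codes))


def failed(results):
    """Return the commands whose return code was non-zero."""
    return [cmd for cmd, code in results if code != 0]
```

Setting max_workers=1 gives the fully serial behaviour suggested above; anything returned by failed() can be fed back into run_commands for a retry pass.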

Jira python runs very slowly, any ideas on why?

I'm using jira-python to automate a bunch of tasks in Jira. One thing that I find weird is that jira-python takes a long time to run. It seems like it's loading or something before sending the requests. I'm new to python, so I'm a little confused as to what's actually going on. Before finding jira-python, I was sending requests to the Jira REST API using the requests library, and it was blazing fast (and still is, if I compare the two). Whenever I run the scripts that use jira-python, there's a good 15 second delay while 'loading' the library, and sometimes also a good 10-15 second delay sending each request.
Is there something I'm missing with Python that could be causing this issue? Is there any way to keep a Python script running as a service so it doesn't need to 'load' the library each time it's run?
@ThePavoIC, you seem to be correct. I notice MASSIVE changes in speed if Jira has been restarted and re-indexed recently. Scripts that would take a couple minutes to run would complete in seconds. Basically, you need to make sure Jira is tuned for performance and keep your indexes up to date.

When downloading with Python, should I use multithreading or multiprocessing?

I've recently been working on a program that downloads manga from an online manga website. It works, but it is a bit slow, so I've decided to use multithreading or multiprocessing to speed up the downloads. Here are my questions:
Which one is better? (This is a Python 3 program.)
Multiprocessing, I think, will definitely work. If I use multiprocessing, what is a suitable number of processes? Does it relate to the number of cores in my CPU?
Multithreading will probably work. Downloading obviously spends a lot of time waiting for images to arrive, so I think that when one thread starts waiting, Python will let another thread work. Am I correct?
I've read "Inside the New GIL" by David M. Beazley. How does the GIL affect me if I use multithreading?
You're probably going to be bound by either the server's upload pipe (if you have a faster connection) or your download pipe (if you have a slower connection).
There's significant startup latency associated with TCP connections. To avoid this, HTTP servers can recycle connections for requesting multiple resources. So there are two ways for your client to avoid this latency hit:
(a) Download several resources over a single TCP connection so your program only suffers the latency once, when downloading the first file
(b) Download a single resource per TCP connection, and use multiple connections so that hopefully at every point in time, at least one of them will be downloading at full speed
With option (a), you want to look into how to recycle requests with whatever HTTP library you're using. Any good one will have a way to recycle connections. http://python-requests.org/ is a good Python HTTP library.
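As a rough illustration of option (a), here is a sketch that requests several paths from one host over a single connection object. It uses the standard-library http.client rather than requests so that it has no dependencies; fetch_many and its parameters are hypothetical names, and whether the socket really stays open between requests depends on the server allowing keep-alive.

```python
import http.client


def fetch_many(host, paths, port=80,
               connection_cls=http.client.HTTPConnection):
    """Fetch several paths from one host over a single reused connection."""
    conn = connection_cls(host, port, timeout=30)
    bodies = {}
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read fully before the connection can be reused.
            bodies[path] = response.read()
    finally:
        conn.close()
    return bodies
```

With the requests library, the equivalent idea is to create one requests.Session and call its get() method for every URL instead of calling requests.get() each time.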
For option (b), you probably do want a multithread/multiprocess route. I'd suggest only 2-3 simultaneous threads, since any more will likely just result in sharing bandwidth among the connections, and raise the risk of getting banned for multiple downloads.
The GIL doesn't really matter for this use case, since your code will be doing almost no processing, spending most of its time waiting for bytes to arrive over the network.
The lazy way to do this is to avoid Python entirely, because most UNIX-like environments have good building blocks for this. (If you're on Windows, your best choices for this approach would be msys, cygwin, or a VirtualBox running some flavor of Linux; I personally like Linux Mint.) If you have a list of URLs you want to download, one per line, in a text file, try this:
cat myfile.txt | xargs -n 1 --max-procs 3 --verbose wget
The "xargs" command with these parameters will take whitespace-delimited URLs on stdin (in this case coming from myfile.txt) and run "wget" on each of them. It will allow up to 3 "wget" subprocesses to run at a time; when one of them completes (or errors out), it will read another line and launch another subprocess, until all the input URLs are exhausted. If you need cookies or other complicated stuff, curl might be a better choice than wget.
It doesn't really matter. It is indeed true that threads waiting on IO won't get in the way of other threads running, and since downloading over the Internet is an IO-bound task, there's no real reason to try to spread your execution threads over multiple CPUs. Given that and the fact that threads are more light-weight than processes, it might be better to use threads, but you honestly aren't going to notice the difference.
How many threads you should use depends on how hard you want to hit the website. Be courteous and take care that your scraping isn't viewed as a DOS attack.
You don't really need multithreading for this kind of task. You could try single-threaded async programming using something like Twisted.
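To show the single-threaded async idea without installing anything, here is a sketch that uses the standard-library asyncio module in place of Twisted. download_all is a hypothetical helper, and fetch stands for any coroutine function that downloads one URL; the limit of 3 is just an example.

```python
import asyncio


async def download_all(urls, fetch, limit=3):
    """Run fetch(url) for every URL with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore keeps the number of concurrent downloads polite.
        async with semaphore:
            return await fetch(url)

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

It would be started with asyncio.run(download_all(urls, fetch)); everything runs in one thread, and the event loop switches between downloads whenever one of them is waiting on the network.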

Run a repeating task for a web app

This seems like a simple question, but I am having trouble finding the answer.
I am making a web app which would require the constant running of a task.
I'll use sites like Pingdom or Twitterfeed as an analogy. As you may know, Pingdom checks uptime, so it is constantly checking websites to see if they are up, and Twitterfeed checks RSS feeds to see if they've changed and then tweets the changes. I too need to run a simple script to cycle through URLs in a database and perform an action on them.
My question is: how should I implement this? I am familiar with cron, currently using it to do my server backups. Would this be the way to go?
I know how to make a Python script which runs indefinitely, starting back at the beginning with the next URL in the database when I'm done. Should I just run that on the server? How will I know it is always running and doesn't crash or something?
I hope this question makes sense and I hope I am not repeating someone else or anything.
Thank you,
Sam
Edit: To be clear, I need the task to run constantly. As in, check URL 1 in the database, check URL 2 in the database, check URL 3 and, when it reaches the last one, go right back to the beginning. Thanks!
If you need a task to run repeatedly and it can be run from the command line, that's what cron is ideal for.
I don't see any drawbacks to this approach.
Update:
Okay, I understood the issue somewhat differently. Now I see several solutions:
a) run the cron task at set intervals and let it process the data once per run; it will process the data again on the next run; use PIDs/database/semaphores to avoid parallel processes;
b) update the processes that insert/update data in the database, so that the information is processed when it is inserted/updated;
c) write a daemon process which will reside in memory and check the data in real time.
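For the cron variant, avoiding parallel runs can be sketched with a simple lock file that records the PID. The helper names acquire_lock, release_lock and run_once are hypothetical, and the lock path is whatever location you choose.

```python
import os


def acquire_lock(path):
    """Try to create the lock file; return True if this process now owns it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))
    return True


def release_lock(path):
    """Remove the lock file so the next run can start."""
    os.remove(path)


def run_once(path, job):
    """Run job() unless another run still holds the lock."""
    if not acquire_lock(path):
        return False
    try:
        job()
    finally:
        release_lock(path)
    return True
```

One caveat of this sketch: if the process is killed outright, the lock file is left behind, so a real script would probably also check whether the PID stored in the file is still alive.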
cron would definitely be a way to go with this, as well as any other task scheduler you may prefer.
The main point is found in the title to your question:
Run a repeating task for a web app
The background task and the web application should be kept separate. They can share code, they can share access to a database, but they should be separate and discrete application contexts. (Consider them as separate UIs accessing the same back-end logic.)
The main reason for this is that web applications and background processes are architecturally very different and aren't meant to be mixed. Consider the structure of a web application being held within a web server (Apache, IIS, etc.). When is the application "running"? When is it "on"? It's not really a running task. It's a service waiting for input (requests) to handle and generate output (responses) and then go back to waiting.
Web applications are for responding to requests. Scheduled tasks or daemon jobs are for running repeated processes in the background. Keeping the two separate will make your management of the two a lot easier.
