Python, Solr and massive amounts of queries: need some suggestions

I'm facing a design problem in my project.
PROBLEM
I need to query Solr with all the possible combinations (roughly 20 million) of some parameters extracted from our lists, to test whether they return at least one result. If they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation).
HOW I'M DOING IT NOW
Nested for loops combine the parameters (extracted from Python lists) and pass them to a method (the same one I use in the production environment to query the database from the website) that tests for zero results. If the count is zero, another method inserts the combination into the blacklist.
No threading is involved.
HOW I'D LIKE TO DO THIS
I'd like to put all the combinations into a queue and let worker threads pull them, query Solr and insert into the blacklist, for better performance.
WHAT PROBLEMS I'M EXPERIENCING
Slowness: being single-threaded, it currently takes a very long time to complete (when and if it completes at all).
Connection reset by peer [104]: an error thrown by Solr after it has been queried for a while (I increased the pool size, but nothing changed). This is the most recurrent (and annoying) error at the moment.
Python hanging: I worked around this with a timeout decorator (not a correct solution, but it at least lets me get through the whole run and produce a quick test output for now; I'll drop it as soon as I come up with a smarter solution).
Queue max size: a queue object can contain up to 32k elements, so it won't fit my numbers.
WHAT I'M USING
Python 2.7
MySQL
Apache Solr
sunburnt (a Python interface to Solr)
a Linux box
I don't need any code debugging, since I'd rather throw away what I did and make a fresh start than patch it over and over. Trial and error is not what I like.
I'd welcome any suggestion that comes to mind on how to design this correctly. Links, websites and guides are also very welcome, since my experience with this kind of script is still building as I work.
Thanks in advance for your help! If something isn't clear, just ask; I'll answer and update the post if needed.
EDIT BASED ON SOME ANSWERS (I will keep this updated)
I'll probably drop Python threads for the multiprocessing library: this could solve my performance issues.
A divide-and-conquer construction method: this should add some logic to how I build the parameter combinations, without needing a brute-force approach.
What I still need to know: where can I store my combinations to feed the worker threads? Maybe this is no longer an issue, since the divide-and-conquer approach may let me generate the combinations at runtime and split them among the workers.
NB: I won't accept any answer for now, since I'd like to keep this post alive for a while, just to gather more ideas (not only for me, but maybe for future reference for others, given its generic nature).
Thanks all again!

Instead of brute force, switch to a divide-and-conquer approach while keeping track of the number of hits for each search. If you first query on partial combinations of parameters, some of those sets will be empty, so you can eliminate whole subtrees of full combinations at once. Add the missing parameters to the remaining searches and repeat until you are done. It takes more bookkeeping but many fewer searches.
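To make the idea concrete, here is a minimal sketch of that pruning strategy, assuming a hypothetical count_results(partial_combo) helper that wraps the existing Solr query method and returns the hit count for a partial combination; all names here are illustrative, not taken from the original code.

    from itertools import product

    def find_empty_combinations(param_lists, count_results, prefix=()):
        # count_results(prefix) is assumed to return the Solr hit count for a
        # query filtered only on the parameters chosen so far
        if count_results(prefix) == 0:
            # no hits for the partial combination: every full combination that
            # extends it is empty too, so blacklist the whole subtree at once
            return [prefix + rest for rest in product(*param_lists)]
        if not param_lists:
            return []          # a full combination with at least one hit
        empties = []
        for value in param_lists[0]:
            empties.extend(find_empty_combinations(param_lists[1:], count_results,
                                                   prefix + (value,)))
        return empties

With, say, three parameter lists of sizes a, b and c, a first-level value that already has zero hits eliminates b * c full combinations with a single query.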

You can use the stdlib "multiprocessing" module to have several subprocesses working on your combinations. This works better than Python's threads and allows each logical CPU core in your machine to work at the same time.
Here is a minimalist example of how it works:
import random
from multiprocessing import Pool

def check(n):
    # stand-in test: succeeds roughly once in 100,000 calls
    return random.randint(0, 100000) == 0

if __name__ == '__main__':
    # the number below should be equal to your number of processor cores:
    p = Pool(4)
    x = any(p.map(check, xrange(1000000)))
    print x
So this runs one million tests, divided among 4 "worker" processes, with no scaling issues.
However, given the nature of the error messages you are getting, though you don't explicitly say so, you seem to be running an application with a web interface, and you wait for all the processing to finish before rendering a result to the browser. This typically won't work for long-running calculations: you'd better perform all your calculations in a process separate from the server process serving your web interface, and update the web interface via asynchronous requests, using a little JavaScript. That way you will avoid any "connection reset by peer" errors.
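On the queue-size worry from the question: with a Pool you don't need to materialise the 20 million combinations up front, because itertools.product generates them lazily and imap_unordered consumes them in chunks. A rough sketch, assuming hypothetical has_results() and insert_into_blacklist() functions wrapping the existing Solr query and MySQL insert, and placeholder parameter lists list_a, list_b, list_c:

    from itertools import product
    from multiprocessing import Pool

    def check(combo):
        # return the combination itself when it yields no hits, else None
        return None if has_results(combo) else combo

    if __name__ == '__main__':
        pool = Pool(4)                                # roughly one worker per core
        combos = product(list_a, list_b, list_c)      # lazy: never all in memory at once
        for combo in pool.imap_unordered(check, combos, chunksize=1000):
            if combo is not None:
                insert_into_blacklist(combo)          # done in the parent process
        pool.close()
        pool.join()

Doing the blacklist insert in the parent keeps the MySQL connection in a single process, which avoids sharing database connections across forked workers.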


Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.
I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads that to DynamoDB with a batch write. It all works as advertised. I then tried to break up the uploading part of this pipeline into threads, with the hope of more efficiently utilizing DynamoDB's write capacity. However, the multithreaded version is slower by about 50%. Since the code is very long I have included pseudocode.
NUM_THREADS = 4
for every line in the file:
    Add line to list of lines
    if we've read enough lines for a single thread:
        Create thread that uploads list of lines
        thread.start()
        clear list of lines
for every thread started:
    thread.join()
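For reference, a minimal concrete version of that pseudocode, using concurrent.futures.ThreadPoolExecutor and boto3's batch_writer (both stand-ins for whatever the original code used; the table name, chunk size and item format are placeholders):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    TABLE_NAME = 'my-table'          # placeholder
    NUM_THREADS = 4
    CHUNK_SIZE = 500                 # lines handed to each upload job

    def upload_chunk(items):
        # give each thread its own resource/table handle rather than sharing one
        table = boto3.resource('dynamodb').Table(TABLE_NAME)
        with table.batch_writer() as batch:   # batches the puts and retries unprocessed items
            for item in items:                # items are dicts already formatted for the table
                batch.put_item(Item=item)

    def upload_all(items):
        chunks = [items[i:i + CHUNK_SIZE] for i in range(0, len(items), CHUNK_SIZE)]
        with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
            list(pool.map(upload_chunk, chunks))   # block until every chunk is written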
Important notes and possible sources of the problem I've checked so far:
When testing this locally using DynamoDB Local, threading does make my program run faster.
If instead I use only 1 thread, or even if I use multiple threads but I join the thread right after I start it (effectively single threaded), the program completes much quicker. With 1 thread ~30s, multi thread ~45s.
I have no shared memory between threads, no locks, etc.
I have tried creating new DynamoDB connections for each thread and sharing one connection instead, with no effect.
I have confirmed that adding more threads does not overwhelm the write capacity of DynamoDB, since it makes the same number of batch write requests and I don't have more unprocessed items throughout execution than with a single thread.
Threading should improve the execution time since the program is network bound, even though Python threads do not really run on multiple cores.
I have tried reading the entire file first, and then spawning all the threads, thinking that perhaps it's better to not interrupt the disk IO, but to no effect.
I have tried both the Thread library as well as the Process library.
Again, I know this question is very theoretical so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.
Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second would be different with multiple threads.
The console has some useful graphs to show if your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again and your problem gets worse.
One other thing, which might have been obvious to you (but not me!). I was under the impression that batch_writes saved you money on the capacity planning front. (That 200 writes in batches of 20 would only cost you 10 write units, for example. I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)
In fact the batch_writes save you some time, but nothing economically.
One last thought: I'd bet that Lambda processing time is cheaper than upping your DynamoDB write capacity. If you're in no particular rush for Lambda to finish, why not let it run its course on a single thread?
Good luck!
Turns out that the threading is faster, but only when the file reaches a certain size. I was originally working on a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file, maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.
As a backdrop, I have good experience with Python and DynamoDB along with using Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use Python multiprocessing pools, and use map or imap depending on your use case if you need to communicate any data back to the main process. Using a pool is the darn simplest way to run multiple processes in Python. If you need your application to run faster as a priority, you may want to look into using Go's concurrency, and you could always build that code into a binary to use from within Python. Cheers.

Beginner question about Python multiprocessing?

I have a number of records in the database I want to process. Basically, I want to run several regex substitutions over tokens of the text string rows and, at the end, write them back to the database.
I wish to know whether multiprocessing speeds up the time required to do such tasks.
I did a
multiprocessing.cpu_count()
and it returns 8. I have tried something like
from multiprocessing import Process

# resultsSize, division, offset and sub_table are assumed to be defined earlier:
# division is presumably resultsSize split into 4 chunks, with offset starting at 0
process = []
for i in range(4):
    if i == 3:
        limit = resultsSize - (3 * division)
    else:
        limit = division
    # limit and offset indicate the subset of records the function fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i,))
    p.start()
    process.append(p)
    offset += division + 1
for po in process:
    po.join()
but apparently, the time taken is higher than the time required to run it single-threaded.
Why is this so? Is this a suitable case for multiprocessing, or am I doing something wrong here?
Can someone please enlighten me as to in what cases multiprocessing gives better performance?
Here's one trick.
Multiprocessing only helps when your bottleneck is a resource that's not shared.
A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.
To find a non-shared resource, you must have independent objects. Like a list that's already in memory.
If you want to work from a database, you need to get 8 things started which then do no more database work. So, a central query that distributes work to separate processors can sometimes be beneficial.
Or 8 different files. Note that the file system, as a whole, is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.
Or a pipeline of 8 smaller steps. The standard unix pipeline trick query | process1 | process2 | process3 >file works better than almost anything else because each stage in the pipeline is completely independent.
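To illustrate the pipeline idea for this particular task, a small filter script can read rows on stdin, apply the regex substitutions, and write the results to stdout, so the export, the processing and the re-import each run as independent stages; the query, pattern and file names below are only examples:

    # process_rows.py -- one stage of a shell pipeline such as:
    #   mysql -e "SELECT id, body FROM articles" mydb | python process_rows.py > cleaned.tsv
    import re
    import sys

    PATTERN = re.compile(r'\s+')    # example substitution: collapse runs of whitespace

    for line in sys.stdin:
        sys.stdout.write(PATTERN.sub(' ', line.rstrip('\n')) + '\n')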
Here's the other trick.
Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
A question like "in what cases does multiprocessing give better performance?" doesn't have a simple answer.
In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.
Here are a couple of questions:
In your processR function, does it slurp a large number of records from the database at one time, or is it fetching 1 row at a time? (Each row fetch will be very costly, performance wise.)
It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
Hope this helps.
In general, multicpu or multicore processing help most when your problem is CPU bound (i.e., spends most of its time with the CPU running as fast as it can).
From your description, you have an I/O-bound problem: it takes forever to get data from disk to the CPU (which is idle), and then the CPU operation is very fast (because it is so simple).
Thus, accelerating the CPU operation does not make a very big difference overall.

Writing a parallel programming framework, what have I missed?

Clarification: As per some of the comments, I should clarify that this is intended as a simple framework to allow execution of programs that are naturally parallel (so-called embarrassingly parallel programs). It isn't, and never will be, a solution for tasks which require communication or synchronisation between processes.
I've been looking for a simple process-based parallel programming environment in Python that can execute a function on multiple CPUs on a cluster, with the major criterion being that it needs to be able to execute unmodified Python code. The closest I found was Parallel Python, but pp does some pretty funky things, which can cause the code to not be executed in the correct context (with the appropriate modules imported etc).
I finally got tired of searching, so I decided to write my own. What I came up with is actually quite simple. The problem is, I'm not sure if what I've come up with is simple because I've failed to think of a lot of things. Here's what my program does:
I have a job server which hands out jobs to nodes in the cluster.
The jobs are handed out to servers listening on nodes by passing a dictionary that looks like this:
{
    'moduleName': 'some_module',
    'funcName': 'someFunction',
    'localVars': {'someVar': someVal, ...},
    'globalVars': {'someOtherVar': someOtherVal, ...},
    'modulePath': '/a/path/to/a/directory',
    'customPathHasPriority': aBoolean,
    'args': (arg1, arg2, ...),
    'kwargs': {'kw1': val1, 'kw2': val2, ...}
}
moduleName and funcName are mandatory, and the others are optional.
A node server takes this dictionary and does:
sys.path.append(modulePath)
globals()[moduleName]=__import__(moduleName, localVars, globalVars)
returnVal = globals()[moduleName].__dict__[funcName](*args, **kwargs)
On getting the return value, the server then sends it back to the job server which puts it into a thread-safe queue.
When the last job returns, the job server writes the output to a file and quits.
I'm sure there are niggles that need to be worked out, but is there anything obvious wrong with this approach? On first glance, it seems robust, requiring only that the nodes have access to the filesystem(s) containing the .py file and the dependencies. Using __import__ has the advantage that the code in the module is automatically run, and so the function should execute in the correct context.
Any suggestions or criticism would be greatly appreciated.
EDIT: I should mention that I've got the code-execution bit working, but the server and job server have yet to be written.
I have actually written something that probably satisfies your needs: jug. If it does not solve your problems, I promise you I'll fix any bugs you find.
The architecture is slightly different: workers all run the same code, but they effectively generate a similar dictionary and ask the central backend "has this been run?". If not, they run it (there is a locking mechanism too). The backend can simply be the filesystem if you are on an NFS system.
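For a feel of what that looks like in practice, here is a tiny sketch in the style of a jug script, based on its TaskGenerator decorator (check the jug documentation for the exact, current API; the function body is a placeholder):

    # jugfile.py -- run "jug execute jugfile.py" on each node that shares the backend
    from jug import TaskGenerator

    @TaskGenerator
    def process(item):
        return item * 2                  # placeholder for the real work

    results = [process(i) for i in range(100)]   # each call is recorded as a task in the backend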
I myself have been tinkering with batch image manipulation across my computers, and my biggest problem was the fact that some things don't easily or natively pickle and transmit across the network.
For example, pygame's surfaces don't pickle; these I have to convert to strings by saving them into StringIO objects and then sending that across the network.
If the data you are transmitting (e.g. your arguments) can be transmitted without fear, you should not have that many problems with network data.
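A sketch of one way to do that conversion, using pygame.image.tostring and fromstring rather than pickling the Surface directly (the 'RGB' format is just an example; pick whatever matches your surfaces):

    import pickle
    import pygame

    def surface_to_blob(surface):
        # Surfaces themselves don't pickle, but their raw pixel data does
        data = pygame.image.tostring(surface, 'RGB')
        return pickle.dumps((data, surface.get_size(), 'RGB'))

    def surface_from_blob(blob):
        data, size, fmt = pickle.loads(blob)
        return pygame.image.fromstring(data, size, fmt)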
Another thing comes to mind: what do you plan to do if a computer suddenly "disappears" while doing a task, or while returning the data? Do you have a plan for re-sending tasks?

Question comparing multiprocessing vs Twisted

I've got a situation where I'm going to be parsing websites. Each site has to have its own "parser" and possibly its own way of dealing with cookies, etc.
I'm trying to get in my head which would be a better choice.
Choice I:
I can create a multiprocessing function, where the (masterspawn) app gets an input URL, and in turn it spawns a process/function within the masterspawn app that then handles all the setup/fetching/parsing of the page/URL.
This approach would have one master app running, and it in turn creates multiple instances of the internal function. Should be fast, yes/no?
Choice II:
I could create a "Twisted" kind of server that would essentially do the same thing as Choice I. The difference is that using "Twisted" would also impose some overhead. I'm trying to evaluate Twisted with regard to it being a "server", but I don't need it to perform the fetching of the URL.
Choice III:
I could use Scrapy. I'm inclined not to go this route as I don't want/need the overhead that Scrapy appears to have. As I stated, each of the targeted URLs needs its own parse function, as well as its own way of dealing with cookies...
My goal is basically to have the "architected" solution spread across multiple boxes, where each client box interfaces with a master server that allocates the URLs to be parsed.
Thanks for any comments on this.
-tom
There are two dimensions to this question: concurrency and distribution.
Concurrency: either Twisted or multiprocessing will do the job of concurrently handling fetching/parsing jobs. I'm not sure, though, where your premise of the "Twisted overhead" comes from. On the contrary, the multiprocessing path would incur much more overhead, since a (relatively heavyweight) OS process would have to be spawned. Twisted's way of handling concurrency is much more lightweight.
Distribution: multiprocessing won't distribute your fetch/parse jobs to different boxes. Twisted can do this, e.g. using its AMP protocol-building facilities.
I cannot comment on scrapy, never having used it.
For this particular question I'd go with multiprocessing - it's simple to use and simple to understand. You don't particularly need Twisted, so why take on the extra complication?
One other option you might want to consider: use a message queue. Have the master drop URLs onto a queue (e.g. beanstalkd, resque, 0mq) and have worker processes pick up the URLs and process them. You'll get both concurrency and distribution: you can run workers on as many machines as you want.
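A minimal sketch of that pattern with 0mq (pyzmq), where the master PUSHes URLs and any number of workers, on any box, PULL them; the port, hostname and parse() handler are placeholders:

    # master.py
    import zmq

    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind('tcp://*:5557')                 # workers connect to this port
    for url in urls_to_parse:                   # placeholder iterable of URLs
        sender.send_string(url)

    # worker.py -- run as many copies as you like, on any machine
    import zmq

    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect('tcp://master-host:5557')  # placeholder hostname
    while True:
        url = receiver.recv_string()
        parse(url)                              # placeholder: the site-specific parser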

For my app, how many threads would be optimal?

I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make it a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth-limited, not CPU- or I/O-limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may impact making too many HTTP requests at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard-and-fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted, with either a deferred semaphore or a task cooperator, if you already have an easy way to feed an arbitrarily long list of URLs in.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a Twisted-based crawler.
It's usually simpler to make multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
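A sketch of that approach, assuming a hypothetical per-URL script called crawl_one.py that takes the URL as its command-line argument:

    import subprocess
    import sys

    urls = ['http://example.com/a', 'http://example.com/b']     # placeholder URL list
    procs = [subprocess.Popen([sys.executable, 'crawl_one.py', url])
             for url in urls]
    for p in procs:
        p.wait()                                                 # wait for every crawler to finish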
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
One crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take an hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper to libcurl, which is a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example of how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing is going on after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
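One simple way to enforce that kind of per-host cap in a threaded crawler is to keep a semaphore per hostname; the limit of 5 follows the suggestion above, and download() is a placeholder for the actual HTTP fetch:

    import threading
    import urlparse                 # Python 2; use urllib.parse on Python 3

    MAX_PER_HOST = 5
    _host_semaphores = {}
    _registry_lock = threading.Lock()

    def _semaphore_for(url):
        host = urlparse.urlparse(url).netloc
        with _registry_lock:
            if host not in _host_semaphores:
                _host_semaphores[host] = threading.Semaphore(MAX_PER_HOST)
            return _host_semaphores[host]

    def polite_fetch(url):
        with _semaphore_for(url):   # at most MAX_PER_HOST concurrent requests per host
            return download(url)    # placeholder for the actual fetch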
Threading isn't necessary in this case. Your program is I/O bound rather than CPU bound. The networking part would probably be better done using select() on the sockets. This reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I heard it has really good support for asynchronous networking. This would allow you to specify the URLs you wish to download and register a callback for each. When each is downloaded, the callback will be called, and the page can be processed. In order to allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue. The "worker" thread would do the actual processing.
As already stated in some answers, the optimal amount of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.
