I am trying to come up with the best way to minimize lost time in a data harvesting application I am building. Here are some of the constraints/factors:
I can only query for data every 12 seconds on a specific channel
I can connect to as many channels simultaneously as I want.
I want to keep the number of channels in use to a minimum
With these factors in mind, I have thought of a solution, but I would like more input.
I have decided to, in a way, load balance this data collection. My thinking is this:
Main Program utilizes m processes (for now I am thinking 4).
Each process uses n threads, where each thread listens on a channel (for now I am thinking 12).
There is a variable thread_start_time_factor = 12 seconds / n threads
There is a variable process_start_time_factor = thread_start_time_factor / m processes
Each thread queries data every 12 seconds; however, the threads start one after another, offset by the thread_start_time_factor. So if I am using 12 threads: thread 1 starts, (1 second pause), thread 2 starts, ... This way data collection now happens every 1 second.
Each process then starts one after the other, based on the process_start_time_factor.
In theory, data collection SHOULD be happening every process_start_time_factor. With the configuration above, the process_start_time_factor should be 0.25 seconds. (If my logic is wrong here, please let me know.)
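For concreteness, here is the arithmetic with the numbers above (12 threads per process, 4 processes):

# quick sanity check of the factors above
n_threads = 12
m_processes = 4
query_interval = 12.0                                                # seconds per channel
thread_start_time_factor = query_interval / n_threads                # 1.0 second
process_start_time_factor = thread_start_time_factor / m_processes   # 0.25 seconds
print(thread_start_time_factor, process_start_time_factor)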
Now here is my question: is this a good way to do this? My reason for using multiple processes is essentially to capture data whenever the other processes are not. The program will be written in Python (not that it matters). Has anyone dealt with (weird) data collection restrictions like this, where they had to think outside the box? Thanks in advance to all who reply. I am definitely open to other solutions.
The fact that you're using proxies, haven't linked to the site, and are being somewhat obscure about the question suggests this is bordering on illegal.
That said, some numbers you haven't given are how long each request takes (e.g. TTFB, total duration, total data transferred) and what it takes to process the responses.
Assuming you're not doing much processing on ingress, I'd just go with an asyncio approach (i.e. no process/thread parallelism), as it's much easier to get the coordination straight; multithreading/multiprocess coordination is much more awkward to reason about.
You should be able to saturate a 1GB connection with HTTP requests from a single thread, perhaps using multiple processes just for post-processing so that it doesn't get in the way.
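As a rough sketch of what I mean (fetch_channel is a stand-in for your actual query; the offsets stagger the first request on each channel the same way your thread_start_time_factor does, and the loop runs until interrupted):

import asyncio
import time

async def fetch_channel(channel_id):
    # stand-in for the real 12-second-limited query on one channel
    await asyncio.sleep(0.1)

async def poll_channel(channel_id, offset, interval=12.0):
    await asyncio.sleep(offset)               # stagger the first request
    while True:
        started = time.monotonic()
        await fetch_channel(channel_id)
        elapsed = time.monotonic() - started
        await asyncio.sleep(max(0.0, interval - elapsed))

async def main(n_channels=48):                # 4 x 12 channels from the question
    offsets = [i * 12.0 / n_channels for i in range(n_channels)]
    await asyncio.gather(*(poll_channel(i, off) for i, off in enumerate(offsets)))

asyncio.run(main())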
I have a computational workload that I originally ran with concurrent.futures.ProcessPoolExecutor, which I converted to use dask so that I could make use of dask's integrations with distributed computing systems for scaling beyond one machine. The workload consists of two task types:
Task A: takes string/float inputs and produces a matrix (around 2000 x 2000). Task duration is usually 60 seconds or less.
Task B: takes the matrix from task A and uses it and some other small inputs to solve an ordinary differential equation. The solution is written to disk (so no return value). Task duration can be up to fifteen minutes.
There can be multiple B tasks for each A task.
Originally, my code looked like this:
a_results = client.map(calc_a, a_inputs)
all_b_inputs = [(a_result, b_input) for b_input in b_inputs for a_result in a_results]
b_results = client.map(calc_b, all_b_inputs)
dask.distributed.wait(b_results)
because that was the clean translation from the concurrent.futures code (I actually kept the code so that it could be run either with dask or concurrent.futures so I could compare). client here is a distributed.Client instance.
I have been experiencing some stability issues with this code, especially for large numbers of tasks, and I think I might not be using dask in the best way. Recently, I changed my code to use Delayed instead like this:
a_results = [dask.delayed(calc_a)(a) for a in a_inputs]
b_results = [dask.delayed(calc_b)(a, b) for a in a_inputs for b in b_inputs]
client.compute(b_results)
I did this because I thought perhaps the scheduler could work through the tasks more efficiently if it examined the entire graph before starting anything rather than beginning to schedule the A tasks before knowing about the B tasks. This change seems to help some but I still see some stability issues.
I can create separate questions for the stability problems, but I first wanted to find out if I am using dask in the best way for this use case or if I should modify how I am submitting the tasks. Just to describe the problems briefly, the worst problem to me is that over time my workers drop to 0% CPU and tasks stop completing. Other problems include things like getting KilledWorker exceptions and seeing log messages about an unresponsive loop and timeouts. Usually the scheduler runs fine for at least a few hours, completing thousands of tasks before these issues show up (which makes debugging difficult since the feedback loop is so long).
Some questions I have been wondering about:
I can have thousands of tasks to run. Can I submit them all to dask up front, or do I need to submit them in batches? My thought was that the dask scheduler would be better at scheduling tasks than my batching code.
If I do need to batch things myself, can I query the scheduler to find out the maximum number of workers so I can write something that will submit batches of the right size? Or do I need to make the batch size an input to my batching code?
In the end, my results all get written to disk and nothing gets returned. With the way I am running tasks, are resources getting held onto longer than necessary?
My B tasks are long, but they could be split by scheduling tasks that solve for intermediate time steps and feeding those results in as the inputs to subsequent solving tasks. I think I need to do this anyway, because I would like to use an HPC cluster with a timed queue: I think I need the lifetime parameter to retire workers so they don't run over the time limit, and that works best with short-lived tasks (to avoid losing work when a worker is shut down early). Is there an optimal way to split the B tasks?
There are lots of questions here. With regard to the code snippets you provided, both look correct, but the futures version will scale better in my experience. The reason is that, by default, whenever one of the delayed tasks fails, the computation of all delayed tasks halts, whereas futures can proceed as long as they are not directly affected by the failure.
Another observation is that delayed values tend to hold on to resources after completion, while with futures you can at least .release() them once they have completed (or use fire_and_forget).
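For example, reusing the names from your question (an untested sketch of the futures-plus-fire_and_forget pattern, not a drop-in replacement):

from dask.distributed import fire_and_forget

# calc_a, calc_b, a_inputs, b_inputs and client are the names from the question;
# because calc_b writes to disk and returns nothing, the scheduler is free to
# forget each B result as soon as it completes.
a_futures = client.map(calc_a, a_inputs)
for a_future in a_futures:
    for b_input in b_inputs:
        fire_and_forget(client.submit(calc_b, a_future, b_input))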
Finally, with very large task lists, it might be worth making them a bit more resilient to restarts. One basic option is to create simple text files after the successful completion of each task, and then on restart check which tasks need to be re-computed. Fancier options include prefect and joblib.Memory, but if you don't need all the bells and whistles, the text-file route is often fastest.
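A minimal sketch of the text-file route, assuming each task has a stable, unique task_id and solve_and_write stands in for the real task body:

from pathlib import Path

DONE_DIR = Path("done_markers")
DONE_DIR.mkdir(exist_ok=True)

def solve_and_write(task_id, *inputs):
    # placeholder for the real B task body, which writes its solution to disk
    pass

def run_task(task_id, *inputs):
    marker = DONE_DIR / f"{task_id}.done"
    if marker.exists():
        return                        # completed in a previous run, skip it
    solve_and_write(task_id, *inputs)
    marker.touch()                    # only created after a successful run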
I am making a webcrawler, and I have some "sleep" functions that make the crawl quite long.
For now I am doing :
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            deal_with(driver, year, quarter, speciality, ok)
The deal_with function opens several web pages and waits a few seconds for the HTML to finish downloading before moving on. The execution time is therefore very long: there are 24 * 20 * 2 = 960 iterations, each taking no less than a minute.
I would like to use my 4 physical Cores (8 threads) to enjoy parallelism.
I have read about tornado, multiprocessing, joblib... and can't really make up my mind about an easy solution to adapt to my code.
Any insight welcome :-)
tl;dr Investing in any choice without fully understanding the bottlenecks you are facing will not help you.
At the end of the day, there are only two fundamental approaches to scaling out a task like this:
Multiprocessing
You launch a number of Python processes, and distribute tasks to each of them. This is the approach you think will help you right now.
Some sample code for how this works, though you could use any appropriate wrapper:
import multiprocessing

# general rule of thumb: launch twice as many processes as cores
process_pool = multiprocessing.Pool(8)  # launches 8 processes

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# feed your list of inputs to your process_pool and print the results when done
# (starmap unpacks each tuple into deal_with's arguments)
print(process_pool.starmap(deal_with, inputs))
If this is all you wanted, you can stop reading now.
Asynchronous Execution
Here, you are content with a single thread or process, but you don't want it to be sitting idle waiting for stuff like network reads or disk seeks to come back - you want it to go on and do other, more important things while it's waiting.
True native asynchronous I/O support is provided in Python 3 and does not exist in Python 2.7 outside of the Twisted networking library.
import concurrent.futures

# generate a list of all inputs you wish to feed to this pool
inputs = []
for speciality in range(1, 25):
    for year in range(1997, 2017):
        for quarter in [1, 2]:
            inputs.append((driver, year, quarter, speciality, ok))

# produce a pool of processes, and make sure they don't block each other -
# submit() returns a future: an object representing something yet to be
# resolved, which is only updated when the result comes in
with concurrent.futures.ProcessPoolExecutor() as executor:
    outputs = [executor.submit(deal_with, *input_tuple) for input_tuple in inputs]
    # wait for all of them to finish - not ideal, since it defeats the purpose
    # in production, but sufficient for an example
    for future_object in concurrent.futures.as_completed(outputs):
        print(future_object.result())  # do something with each result
So What's the Difference?
My main point here is to emphasise that choosing from a list of technologies isn't as hard as figuring out where the real bottleneck is.
In the examples above, there isn't any difference. Both follow a simple pattern:
Have a lot of workers
Allow these workers to pick something from a queue of tasks right away
When one is free, set them to work on the next one right away.
Thus, there is no conceptual difference at all if you follow these examples verbatim, even though they use entirely different technologies and claim to use entirely different techniques.
Any technology you pick will be for naught if you write it in this pattern - even though you'll get some speedup, you will be sorely disappointed if you expected a massive performance boost.
Why is this pattern bad? Because it doesn't solve your problem.
Your problem is simple: you have to wait. While your process is waiting for something to come back, it can't do anything else! It can't fetch more pages for you. It can't process an incoming task. All it can do is wait.
Having more processes that ultimately wait is not the true solution. An army of troops that has to march to Waterloo will not be faster if you split it into regiments - each regiment eventually has to sleep, though they may sleep at different times and for different lengths of time, and what will happen is that all of them will arrive at roughly the same time.
What you need is an army that never sleeps.
So What Should You Do?
Abstract all I/O bound tasks into something non-blocking. This is your true bottleneck. If you're waiting for a network response, don't let the poor process just sit there - give it something to do.
Your task is made somewhat difficult in that, by default, reading from a socket is blocking. That's just the way operating systems are. Thankfully, you don't need Python 3 to solve it (though that is always the preferred solution) - the asyncore library (though Twisted is superior to it in every way) already exists in Python 2.7 to make network reads and writes happen truly in the background.
There is one and only one case where true multiprocessing needs to be used in Python, and that's if you are doing CPU-bound or CPU-intensive work. From your description, it doesn't sound like that's the case.
In short, you should edit your deal_with function to avoid the built-in wait. Make that wait happen in the background, if needed, using a suitable abstraction from Twisted or asyncore. But don't let it consume your process completely.
If you're using Python 3, I would check out the asyncio module. I believe you can just decorate deal_with with @asyncio.coroutine. You will likely have to adjust what deal_with does to work properly with the event loop as well.
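On newer Python 3 (3.7+) the same idea with async def might look roughly like this; deal_with_async is a hypothetical rewrite of deal_with in which the blocking waits become awaits (the page fetching itself would also need a non-blocking client rather than a blocking driver):

import asyncio

async def deal_with_async(year, quarter, speciality):
    # hypothetical async version of deal_with: the blocking waits become awaits,
    # so other pages can be fetched while this one is "sleeping"
    await asyncio.sleep(1)

async def main():
    tasks = [
        deal_with_async(year, quarter, speciality)
        for speciality in range(1, 25)
        for year in range(1997, 2017)
        for quarter in [1, 2]
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())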
I'm facing a design problem in my project.
PROBLEM
I need to query Solr with all the possible combinations (roughly 20 million) of some parameters extracted from our lists, to test whether they return at least 1 result. If they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation).
HOW I'M DOING IT NOW
Nested for loops combine the parameters (extracted from Python lists) and pass them to a method (the same one I use in the production environment to query the db from the website) that tests for 0 results. If it's 0, another method inserts the combination into the blacklist.
No threading involved.
HOW I'D LIKE TO DO THIS
I'd like to put all the combinations into a queue and let thread objects pull them, query, and insert, for better performance.
WHAT PROBLEMS I'M EXPERIENCING
Slowness: being single-threaded, it currently takes a very long time to complete (when and if it completes).
Connection reset by peer [104]: an error thrown by Solr after it has been queried for a while (I increased the pool size, but nothing changes). This is the most frequent (and annoying) error at the moment.
Python hanging: I resolved this with a timeout decorator (which isn't a proper solution, but at least it lets me get through the whole processing and have quick test output for now; I'll drop it as soon as I come up with something smarter).
Queue max size: a queue object can contain up to 32k elements, so it won't hold my numbers.
WHAT I'M USING
Python 2.7
MySQL
Apache Solr
Sunburnt (a Python interface to Solr)
a Linux box
I don't need any code debugging, since I'd rather throw away what I did and start fresh rather than patch it over and over and over... trial and error is not what I like.
I'd welcome any suggestion that comes to mind for designing this the right way. Links, websites, and guides are also very welcome, since my experience with this kind of script is building as I work.
Thanks in advance for your help! If something isn't clear, just ask; I'll answer/update the post as needed!
EDIT BASED ON SOME ANSWERS (will keep this updated)
I'll probably drop Python threads in favour of the multiprocessing lib: this could solve my performance issues.
Divide-and-conquer construction method: this should add some logic to how I construct the parameters, without needing a brute-force approach.
What I still need to know: where can I store my combinations to feed the worker threads? Maybe this is no longer an issue, since the divide-and-conquer approach may let me generate the combinations at runtime and split them between the worker threads.
NB: I won't accept any answer for now, since I'd like to keep this post alive for a while to gather more ideas (not only for me, but maybe for others' future reference, given its generic nature).
Thanks all again!
Instead of brute force, change to using a divide-and-conquer approach while keeping track of the number of hits for each search. If you subdivide into certain combinations, some of those sets will be empty so you eliminate many subtrees at once. Add missing parameters into remaining searches and repeat until you are done. It takes more bookkeeping but many fewer searches.
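A rough sketch of that idea, where count_hits stands in for a Solr query that only checks numFound for a partial combination, and blacklist_subtree stands in for recording every combination under that prefix:

import random

def count_hits(fixed_params):
    # stand-in for a Solr query that only reads numFound for a partial combination
    return random.randint(0, 3)

def blacklist_subtree(fixed_params):
    # stand-in for blacklisting every combination that extends this prefix
    print("blacklisted prefix:", fixed_params)

def find_empty(fixed, remaining):
    if count_hits(fixed) == 0:
        blacklist_subtree(fixed)    # one query eliminates the whole subtree
        return
    if not remaining:
        return                      # fully specified and non-empty, nothing to do
    name, values = remaining[0]
    for value in values:
        find_empty(fixed + [(name, value)], remaining[1:])

find_empty([], [("colour", ["red", "green"]), ("size", ["S", "M", "L"])])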
You can use the stdlib "multiprocessing" module to have several subprocesses working on your combinations - this works better than Python's threads and allows each logical CPU core in your machine to work at the same time.
Here is a minimalist example of how it works:
import random
from multiprocessing import Pool

def a(a):
    if random.randint(0, 100000) == 0:
        return True
    return False

# the number below should be equal to your number of processor cores:
p = Pool(4)
x = any(p.map(a, xrange(1000000)))
print x
So this runs a million tests, divided among 4 "worker" processes, with no scaling issues.
However, given the nature of the error messages you are getting, and though you don't explicitly say so, you seem to be running an application with a web interface - and waiting for all the processing to finish before rendering a result to the browser. This typically won't work with long-running calculations: you are better off performing all your calculations in a process separate from the one serving your web interface, and updating the web interface via asynchronous requests using a little JavaScript. That way you will avoid the "connection reset by peer" errors.
I have a number of records in the database I want to process. Basically, I want to run several regex substitutions over the tokens of the text string rows and, at the end, write them back to the database.
I wish to know whether multiprocessing speeds up such tasks.
I called multiprocessing.cpu_count() and it returns 8. I have tried something like
process = []
for i in range(4):
    if i == 3:
        limit = resultsSize - (3 * division)
    else:
        limit = division
    # limit and offset indicate the subset of records the function would fetch from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i,))
    p.start()
    process.append(p)
    offset += division + 1

for po in process:
    po.join()
but apparently the time taken is higher than the time required when running single-threaded. Why is this so? Can someone please enlighten me: is this a suitable case for multiprocessing, or am I doing something wrong here?
Why is this so?
Can someone please explain in what cases multiprocessing gives better performance?
Here's one trick.
Multiprocessing only helps when your bottleneck is a resource that's not shared.
A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.
To find a non-shared resource, you must have independent objects. Like a list that's already in memory.
If you want to work from a database, you need to start 8 things that then do no more database work. So a central query that distributes work to separate processors can sometimes be beneficial.
Or 8 different files. Note that the file system -- as a whole -- is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.
Or a pipeline of 8 smaller steps. The standard unix pipeline trick query | process1 | process2 | process3 >file works better than almost anything else because each stage in the pipeline is completely independent.
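To make the "central query, then no more database work" idea concrete, here is a minimal sketch; the rows, column layout, and substitution are made up, and in practice the SELECT and the final UPDATE would each happen once, outside the worker pool:

import re
from multiprocessing import Pool

PATTERN = re.compile(r"\bfoo\b")

def transform(row):
    # workers only touch in-memory data; no database access here
    row_id, text = row
    return row_id, PATTERN.sub("bar", text)

if __name__ == "__main__":
    # stand-in for a single central SELECT that pulls all rows up front
    rows = [(1, "foo baz"), (2, "baz foo qux")]
    with Pool(8) as pool:
        results = pool.map(transform, rows, chunksize=100)
    # stand-in for one batched UPDATE back to the database
    print(results)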
Here's the other trick.
Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
A question like "in what cases does multiprocessing give better performance?" doesn't have a simple answer.
In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.
Here are a couple of questions:
Does your processR function slurp a large number of records from the database at one time, or is it fetching 1 row at a time? (Each individual row fetch is very costly, performance-wise.)
It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results?
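For example, assuming you have exported a two-column (id, text) CSV; the substitution itself is illustrative:

import csv
import re

PATTERN = re.compile(r"\bfoo\b")

with open("export.csv", newline="") as src, open("processed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row_id, text in csv.reader(src):
        writer.writerow([row_id, PATTERN.sub("bar", text)])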
Hope this helps.
In general, multi-CPU or multicore processing helps most when your problem is CPU-bound (i.e., spends most of its time with the CPU running as fast as it can).
From your description, you have an I/O-bound problem: it takes forever to get data from disk to the CPU (which sits idle), and then the CPU operation itself is very fast (because it is so simple).
Thus, accelerating the CPU operation does not make a very big difference overall.
I have a simple Python web crawler. It uses SQLite to store its output and also to keep a queue. I want to make the crawler multi-threaded so that it can crawl several pages at a time. I figured I would make the crawler a thread and just run several instances of the class at once, so they all run concurrently. But the question is, how many should I run at once? Should I stick to two? Can I go higher? What would be a reasonable limit for the number of threads? Keep in mind that each thread goes out to a web page, downloads the HTML, runs a few regex searches through it, stores the info it finds in a SQLite db, and then pops the next URL off the queue.
You will probably find your application is bandwidth limited not CPU or I/O limited.
As such, add as many as you like until performance begins to degrade.
You may come up against other limits depending on your network setup. For example, if you're behind an ADSL router, there will be a limit on the number of concurrent NAT sessions, which may limit how many HTTP requests you can make at once. Make too many and your provider may treat you as being infected by a virus or the like.
There's also the issue of how many requests the server you're crawling can handle and how much of a load you want to put on it.
I wrote a crawler once that used just one thread. It took about a day to process all the information I wanted at about one page every two seconds. I could've done it faster but I figured this was less of a burden for the server.
So really there's no hard and fast answer. Assuming a 1-5 megabit connection, I'd say you could easily have up to 20-30 threads without any problems.
I would use one thread and Twisted, with either a deferred semaphore or a task cooperator, if you already have an easy way to feed in an arbitrarily long list of URLs.
It's extremely unlikely you'll be able to make a multi-threaded crawler that's faster or smaller than a Twisted-based crawler.
It's usually simpler to run multiple concurrent processes. Simply use subprocess to create as many Popens as you feel are necessary to run concurrently.
There's no "optimal" number. Generally, when you run just one crawler, your PC spends a lot of time waiting. How much? Hard to say.
When you're running some small number of concurrent crawlers, you'll see that they take about the same amount of time as one. Your CPU switches among the various processes, filling up the wait time on one with work on the others.
When you run some larger number, you see that the overall elapsed time is longer because there's now more to do than your CPU can manage. So the overall process takes longer.
You can create a graph that shows how the process scales. Based on this you can balance the number of processes and your desirable elapsed time.
Think of it this way.
1 crawler does its job in 1 minute. 100 pages done serially could take 100 minutes. 100 crawlers running concurrently might take an hour. Let's say that 25 crawlers finish the job in 50 minutes.
You don't know what's optimal until you run various combinations and compare the results.
cletus's answer is the one you want.
A couple of people proposed an alternate solution using asynchronous I/O, especially looking at Twisted. If you decide to go that route, a different solution is pycurl, which is a thin wrapper around libcurl, a widely used URL transfer library. PyCurl's home page has a 'retriever-multi.py' example showing how to fetch multiple pages in parallel, in about 120 lines of code.
You can go higher than two. How much higher depends entirely on the hardware of the system you're running this on, how much processing happens after the network operations, and what else is running on the machine at the time.
Since it's being written in Python (and being called "simple") I'm going to assume you're not exactly concerned with squeezing every ounce of performance out of the thing. In that case, I'd suggest just running some tests under common working conditions and seeing how it performs. I'd guess around 5-10 is probably reasonable, but that's a complete stab in the dark.
Since you're using a dual-core machine, I'd highly recommend checking out the Python multiprocessing module (in Python 2.6). It will let you take advantage of multiple processors on your machine, which would be a significant performance boost.
One thing you should keep in mind is that some servers may interpret too many concurrent requests from the same IP address as a DoS attack and abort connections or return error pages for requests that would otherwise succeed.
So it might be a good idea to limit the number of concurrent requests to the same server to a relatively low number (5 should be on the safe side).
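One way to enforce such a cap from threaded code (the limit of 5 follows the suggestion above; everything else is illustrative):

import threading
import urllib.request
from urllib.parse import urlparse

_lock = threading.Lock()
_per_host = {}

def _semaphore_for(host, max_per_host=5):
    with _lock:
        if host not in _per_host:
            _per_host[host] = threading.Semaphore(max_per_host)
        return _per_host[host]

def fetch_with_limit(url):
    # at most max_per_host threads may be downloading from the same host at once
    with _semaphore_for(urlparse(url).netloc):
        with urllib.request.urlopen(url) as resp:
            return resp.read()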
Threading isn't necessary in this case. Your program is I/O-bound rather than CPU-bound. The networking part would probably be better done using select() on the sockets; this reduces the overhead of creating and maintaining threads. I haven't used Twisted, but I hear it has really good support for asynchronous networking. It would allow you to specify the URLs you wish to download and register a callback for each. When each one is downloaded, the callback is called and the page can be processed. To allow multiple sites to be downloaded without waiting for each to be processed, a second "worker" thread can be created with a queue. The callback would add the site's contents to the queue, and the "worker" thread would do the actual processing.
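Setting Twisted aside, the downloader/worker split described above looks roughly like this; fetch and process are placeholders for the real download and the regex/SQLite work:

import queue
import threading

page_queue = queue.Queue()

def fetch(url):
    return "<html>...</html>"          # placeholder for the actual download

def process(url, html):
    pass                               # placeholder: regex searches, SQLite insert

def worker():
    while True:
        url, html = page_queue.get()
        if url is None:
            break                      # sentinel: no more pages to process
        process(url, html)

t = threading.Thread(target=worker)
t.start()
for url in ["http://example.com/a", "http://example.com/b"]:
    page_queue.put((url, fetch(url)))
page_queue.put((None, None))           # tell the worker to stop
t.join()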
As already stated in some answers, the optimal number of simultaneous downloads depends on your bandwidth.
I'd use one or two threads - one for the actual crawling and the other (with a queue) for processing.