How to efficiently iterate over multiple generators?

I've got three different generators, each of which yields data from the web, so each iteration may take a while to complete.
I want to interleave the calls to the generators, and thought about the roundrobin recipe (found here).
The problem is that every call blocks until it's done.
Is there a way to loop through all the generators at the same time, without blocking?

You can do this with the iter() method on my ThreadPool class.
pool.iter() yields threaded function return values until all of the decorated+called functions finish executing. Decorate all of your async functions, call them, then loop through pool.iter() to catch the values as they happen.
Example:
import time
from threadpool import ThreadPool
pool = ThreadPool(max_threads=25, catch_returns=True)
# decorate any functions you need to aggregate
# if you're pulling a function from an outside source
# you can still say 'func = pool(func)' or 'pool(func)()'
@pool
def data(ID, start):
    for i in xrange(start, start+4):
        yield ID, i
        time.sleep(1)
# each of these calls will spawn a thread and return immediately
# make sure you do either pool.finish() or pool.iter()
# otherwise your program will exit before the threads finish
data("generator 1", 5)
data("generator 2", 10)
data("generator 3", 64)
for value in pool.iter():
    # this will print the generators' return values as they yield
    print value

In short, no: there's no good way to do this without threads.
Sometimes ORMs are augmented with some kind of peek function or callback that will signal when data is available. Otherwise, you'll need to spawn threads in order to do this. If threads are not an option, you might try switching out your database library for an asynchronous one.
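As a rough illustration of the thread-based approach using only the standard library, here is a minimal sketch. The names merge_generators, drain and countdown are made up for this example, and countdown merely stands in for a slow, web-backed generator. Each generator is drained in its own thread, and the items are handed to the caller through a queue.Queue in whatever order they arrive.

```python
import queue
import threading

def merge_generators(*generators):
    """Yield items from several generators as soon as any of them produces one."""
    results = queue.Queue()
    DONE = object()  # sentinel: one generator has been exhausted

    def drain(gen):
        try:
            for item in gen:
                results.put(item)
        finally:
            # always signal completion, even if the generator raised
            results.put(DONE)

    for gen in generators:
        threading.Thread(target=drain, args=(gen,), daemon=True).start()

    remaining = len(generators)
    while remaining:
        item = results.get()
        if item is DONE:
            remaining -= 1
        else:
            yield item

def countdown(label, n):
    # hypothetical stand-in for a generator that fetches data from the web
    for i in range(n):
        yield label, i

# arrival order depends on thread scheduling, so sort before inspecting
merged = sorted(merge_generators(countdown("a", 2), countdown("b", 3)))
```

The order in which items arrive is not deterministic, and an exception inside one generator only ends that generator's thread in this sketch; real code would probably want to forward the exception to the consumer.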

Related

Is it right to append item into the same list by multi-thread without lock?

Here's the question in detail:
I want to use multiple threads to perform a batch of HTTP requests, then gather all the results into a list and sort the items.
So I define an empty list origin_list in the main process first, pass it to every thread, and have each thread simply append its result to that list.
It seems that I get the expected results in the end, so I think I can build the result list correctly without a thread lock, since the list is a mutable object. Am I right?
My main code is below:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import requests
def do_request_work(final_item_list, request_url):
    request_results = requests.get(request_url).text
    # do request work
    final_item_list.append(request_results)
def do_sort_work(final_item_list):
    # do sort work
    return final_item_list
def main():
    f_item_list = []
    request_list = [url1, url2, ...]
    with ThreadPoolExecutor(max_workers=20) as executor:
        executor.map(
            partial(
                do_request_work,
                f_item_list
            ),
            request_list)
    sorted_list = do_sort_work(f_item_list)
Any comments are very welcome. Many thanks.

I think this is a rather questionable solution, even without taking thread safety into account.
First of all, Python has the GIL:
In CPython, the global interpreter lock, or GIL, is a mutex that
protects access to Python objects, preventing multiple threads from
executing Python bytecodes at once.
Thus, I doubt there is much performance benefit here, even noting that
potentially blocking or long-running operations, such as I/O, image
processing, and NumPy number crunching, happen outside the GIL.
All pure-Python work will still be executed one thread at a time.
On the other hand, the same lock may help you with thread safety here, so that only one thread modifies final_item_list at a time, but I am not sure.
Anyway, I would use the multiprocessing module here, with its built-in parallel map:
from multiprocessing import Pool
import requests
def do_request_work(request_url):
    request_results = requests.get(request_url).text
    # do request work
    return request_results
if __name__ == '__main__':
    request_list = [url1, url2, ...]
    with Pool(20) as p:
        f_item_list = p.map(do_request_work, request_list)
This guarantees parallel, lock-free execution of the requests, since every process receives only its own part of the work and simply returns the result when ready.
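The same "return the result instead of appending to a shared list" idea can also be applied with threads, which may be sufficient when the work is mostly waiting on the network. A minimal sketch follows; fetch and fetch_all are hypothetical names, and fetch returns a fake body so that the example runs without network access (in real code it would call something like requests.get(url).text).

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # hypothetical placeholder for an HTTP request; returns a fake body
    return "body of " + url

def fetch_all(urls, max_workers=20):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map() yields results in the order of the inputs,
        # so no shared list and no lock are needed
        return list(executor.map(fetch, urls))

pages = fetch_all(["http://example.com/a", "http://example.com/b"])
```

Because each worker only returns a value, the question of whether list.append is safe without a lock never arises.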

Look at this thread: I'm seeking advise on multi-tasking on Python36 platform, Procedure setup.
Relevant to Python 3.5+
Running Tasks Concurrently
awaitable asyncio.gather(*aws, loop=None, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
I use this very often; just be aware that it's not thread-safe, so do not change values inside, otherwise you will have to use deepcopy.
Other things to look at:
https://github.com/kennethreitz/grequests
https://github.com/jreese/aiomultiprocess
aiohttp
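For reference, a minimal sketch of asyncio.gather in use. The coroutine fetch is made up for this example, and asyncio.sleep stands in for a non-blocking HTTP call (for instance one made with aiohttp). asyncio.run requires Python 3.7 or later.

```python
import asyncio

async def fetch(name, delay):
    # asyncio.sleep stands in for a non-blocking HTTP request
    await asyncio.sleep(delay)
    return name

async def main():
    # gather() runs the awaitables concurrently and returns their results
    # in the order the awaitables were passed, not in completion order
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))

results = asyncio.run(main())
```

Even though "b" finishes first here, the result list follows the argument order.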

How do I make my ThreadPool work better with requests

I currently have this function, which makes API calls, each requesting different data. I can do up to 300 concurrent API calls at a time.
This does not seem to go fast, since it is just waiting for the reply. I was wondering how I could make this function faster?
from multiprocessing.pool import ThreadPool
import requests
pool = ThreadPool(processes=500)
variables = VariableBaseDict
for item in variables:
    async_result = pool.apply_async(requests.get(url.json()))
    result = async_result.get()
    #do stuff with result

Your current code is not actually farming any real work off to a worker thread. You are calling requests.get(url.json()) right in the main thread, and then passing the object that returns to pool.apply_async. You should be doing pool.apply_async(requests.get, (url.json(),)) instead. That said, even if you corrected this problem, you are then immediately waiting for the reply to the call, which means you never actually run any calls concurrently. You farm one item off to a thread, wait for it to be done, then wait for the next item.
You need to:
Fix the issue where you're accidentally calling requests.get(...) in the main thread.
Either use pool.map to farm the list of work off to the worker threads concurrently, or continue using pool.apply_async, but instead of immediately calling async_result.get(), store all the async_result objects in a list, and once you've iterated over variables, iterate over the async_result list and call .get() on each item. That way you actually end up running all the calls concurrently.
So, if you used apply_async, you'd do something like this:
async_results = [pool.apply_async(requests.get, (build_url(item),)) for item in variables]
for ar in async_results:
    result = ar.get()
    # do stuff with result
With pool.map it would be:
results = pool.map(requests.get, [build_url(item) for item in variables])
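To see the "submit everything first, collect afterwards" pattern run end to end without touching the network, here is a small sketch. fake_get is a made-up stand-in for requests.get that just sleeps briefly, and the URLs are placeholders.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_get(url):
    # hypothetical stand-in for requests.get: pretend to wait for a reply
    time.sleep(0.05)
    return "response for " + url

urls = ["u%d" % i for i in range(10)]
pool = ThreadPool(processes=10)

# submit all calls first, so that they run concurrently...
async_results = [pool.apply_async(fake_get, (u,)) for u in urls]
# ...and only then wait for each reply
responses = [ar.get() for ar in async_results]

pool.close()
pool.join()
```

With ten worker threads the ten calls overlap, so the total time should be close to that of a single call rather than the sum of all of them.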

How to execute two "aggregate" functions (like sum) concurrently, feeding them from the same iterator?

Imagine we have an iterator, say iter(range(1, 1000)). And we have two functions, each accepting an iterator as the only parameter, say sum() and max(). In SQL world we would call them aggregate functions.
Is there any way to obtain results of both without buffering the iterator output?
To do so, we would need to pause and resume the execution of the aggregate functions, in order to feed both of them the same values without storing them. Is there perhaps a way to express this using async features, without sleeps?

Let's consider how to apply two aggregate functions to the same iterator, which we can only exhaust once. The initial attempt (which hardcodes sum and max for brevity, but is trivially generalizable to an arbitrary number of aggregate functions) might look like this:
def max_and_sum_buffer(it):
    content = list(it)
    p = sum(content)
    m = max(content)
    return p, m
This implementation has the downside that it stores all the generated elements in memory at once, despite both functions being perfectly capable of stream processing. The question anticipates this cop-out and explicitly requests the result to be produced without buffering the iterator output. Is it possible to do this?
Serial execution: itertools.tee
It certainly seems possible. After all, Python iterators are external, so every iterator is already capable of suspending itself. How hard can it be to provide an adapter that splits an iterator into two new iterators that provide the same content? Indeed, this is exactly the description of itertools.tee, which appears perfectly suited to parallel iteration:
import itertools
def max_and_sum_tee(it):
    it1, it2 = itertools.tee(it)
    p = sum(it1)  # XXX
    m = max(it2)
    return p, m
The above produces the correct result, but doesn't work the way we'd like it to. The trouble is that we're not iterating in parallel. Aggregate functions like sum and max never suspend - each insists on consuming all of the iterator content before producing the result. So sum will exhaust it1 before max has had a chance to run at all. Exhausting elements of it1 while leaving it2 alone will cause those elements to be accumulated inside an internal FIFO shared between the two iterators. That's unavoidable here - since max(it2) must see the same elements, tee has no choice but to accumulate them. (For more interesting details on tee, refer to this post.)
In other words, there is no difference between this implementation and the first one, except that the first one at least makes the buffering explicit. To eliminate buffering, sum and max must run in parallel, not one after the other.
Threads: concurrent.futures
Let's see what happens if we run the aggregate functions in separate threads, still using tee to duplicate the original iterator:
import concurrent.futures
import itertools
def max_and_sum_threads_simple(it):
    it1, it2 = itertools.tee(it)
    with concurrent.futures.ThreadPoolExecutor(2) as executor:
        sum_future = executor.submit(lambda: sum(it1))
        max_future = executor.submit(lambda: max(it2))
    return sum_future.result(), max_future.result()
Now sum and max actually run in parallel (as much as the GIL permits), threads being managed by the excellent concurrent.futures module. It has a fatal flaw, however: for tee not to buffer the data, sum and max must process their items at exactly the same rate. If one is even a little bit faster than the other, they will drift apart, and tee will buffer all intermediate elements. Since there is no way to predict how fast each will run, the amount of buffering is both unpredictable and has the nasty worst case of buffering everything.
To ensure that no buffering occurs, tee must be replaced with a custom generator that buffers nothing and blocks until all the consumers have observed the previous value before proceeding to the next one. As before, each consumer runs in its own thread, but now the calling thread is busy running a producer, a loop that actually iterates over the source iterator and signals that a new value is available. Here is an implementation:
import concurrent.futures
import threading
def max_and_sum_threads(it):
    STOP = object()
    next_val = None
    consumed = threading.Barrier(2 + 1)  # 2 consumers + 1 producer
    val_id = 0
    got_val = threading.Condition()
    def send(val):
        nonlocal next_val, val_id
        consumed.wait()
        with got_val:
            next_val = val
            val_id += 1
            got_val.notify_all()
    def produce():
        for elem in it:
            send(elem)
        send(STOP)
    def consume():
        last_val_id = -1
        while True:
            consumed.wait()
            with got_val:
                got_val.wait_for(lambda: val_id != last_val_id)
            if next_val is STOP:
                return
            yield next_val
            last_val_id = val_id
    with concurrent.futures.ThreadPoolExecutor(2) as executor:
        sum_future = executor.submit(lambda: sum(consume()))
        max_future = executor.submit(lambda: max(consume()))
        produce()
    return sum_future.result(), max_future.result()
This is quite some amount of code for something so simple conceptually, but it is necessary for correct operation.
produce() loops over the outside iterator and sends the items to the consumers, one value at a time. It uses a barrier, a convenient synchronization primitive added in Python 3.2, to wait until all consumers are done with the old value before overwriting it with the new one in next_val. Once the new value is actually ready, a condition is broadcast. consume() is a generator that transmits the produced values as they arrive, until detecting STOP. The code can be generalized to run any number of aggregate functions in parallel by creating consumers in a loop, and adjusting their number when creating the barrier.
The downside of this implementation is that it requires creation of threads (possibly alleviated by making the thread pool global) and a lot of very careful synchronization at each iteration pass. This synchronization destroys performance - this version is almost 2000 times slower than the single-threaded tee, and 475 times slower than the simple but non-deterministic threaded version.
Still, as long as threads are used, there is no avoiding synchronization in some form. To completely eliminate synchronization, we must abandon threads and switch to cooperative multi-tasking. The question is whether it is possible to suspend execution of ordinary synchronous functions like sum and max in order to switch between them.
Fibers: greenlet
It turns out that the greenlet third-party extension module enables exactly that. Greenlets are an implementation of fibers, lightweight micro-threads that switch between each other explicitly. This is sort of like Python generators, which use yield to suspend, except greenlets offer a much more flexible suspension mechanism, allowing one to choose who to suspend to.
This makes it fairly easy to port the threaded version of max_and_sum to greenlets:
import greenlet
def max_and_sum_greenlet(it):
    STOP = object()
    consumers = None
    def send(val):
        for g in consumers:
            g.switch(val)
    def produce():
        for elem in it:
            send(elem)
        send(STOP)
    def consume():
        g_produce = greenlet.getcurrent().parent
        while True:
            val = g_produce.switch()
            if val is STOP:
                return
            yield val
    sum_result = []
    max_result = []
    gsum = greenlet.greenlet(lambda: sum_result.append(sum(consume())))
    gsum.switch()
    gmax = greenlet.greenlet(lambda: max_result.append(max(consume())))
    gmax.switch()
    consumers = (gsum, gmax)
    produce()
    return sum_result[0], max_result[0]
The logic is the same, but with less code. As before, produce produces values retrieved from the source iterator, but its send doesn't bother with synchronization, as it doesn't need to when everything is single-threaded. Instead, it explicitly switches to every consumer in turn to do its thing, with the consumer dutifully switching right back. After going through all consumers, the producer is ready for the next iteration pass.
Results are retrieved using an intermediate single-element list because greenlet doesn't provide access to the return value of the target function (and neither does threading.Thread, which is why we opted for concurrent.futures above).
There are downsides to using greenlets, though. First, they don't come with the standard library; you need to install the greenlet extension. Then, greenlet is inherently non-portable because the stack-switching code is not supported by the OS and the compiler and can be considered somewhat of a hack (although an extremely clever one). A Python targeting WebAssembly or JVM or GraalVM would be very unlikely to support greenlet. This is not a pressing issue, but it's definitely something to keep in mind for the long haul.
Coroutines: asyncio
As of Python 3.5, Python provides native coroutines. Unlike greenlets, and similar to generators, coroutines are distinct from regular functions and must be defined using async def. Coroutines can't be easily executed from synchronous code; they must instead be processed by a scheduler which drives them to completion. The scheduler is also known as an event loop because its other job is to receive IO events and pass them to appropriate callbacks and coroutines. In the standard library, this is the role of the asyncio module.
Before implementing an asyncio-based max_and_sum, we must first resolve a hurdle. Unlike greenlet, asyncio is only able to suspend execution of coroutines, not of arbitrary functions. So we need to replace sum and max with coroutines that do essentially the same thing. This is as simple as implementing them in the obvious way, only replacing for with async for, enabling the async iterator to suspend the coroutine while waiting for the next value to arrive:
async def asum(it):
    s = 0
    async for elem in it:
        s += elem
    return s
async def amax(it):
    NONE_YET = object()
    largest = NONE_YET
    async for elem in it:
        if largest is NONE_YET or elem > largest:
            largest = elem
    if largest is NONE_YET:
        raise ValueError("amax() arg is an empty sequence")
    return largest
# or, using https://github.com/vxgmichel/aiostream
#
#from aiostream.stream import accumulate
#def asum(it):
#    return accumulate(it, initializer=0)
#def amax(it):
#    return accumulate(it, max)
One could reasonably ask if providing a new pair of aggregate functions is cheating; after all, the previous solutions were careful to use existing sum and max built-ins. The answer will depend on the exact interpretation of the question, but I would argue that the new functions are allowed because they are in no way specific to the task at hand. They do the exact same thing the built-ins do, but consuming async iterators. I suspect that the only reason such functions don't already exist somewhere in the standard library is due to coroutines and async iterators being a relatively new feature.
With that out of the way, we can proceed to write max_and_sum as a coroutine:
import asyncio
async def max_and_sum_asyncio(it):
    loop = asyncio.get_event_loop()
    STOP = object()
    next_val = loop.create_future()
    consumed = loop.create_future()
    used_cnt = 2  # number of consumers
    async def produce():
        for elem in it:
            next_val.set_result(elem)
            await consumed
        next_val.set_result(STOP)
    async def consume():
        nonlocal next_val, consumed, used_cnt
        while True:
            val = await next_val
            if val is STOP:
                return
            yield val
            used_cnt -= 1
            if not used_cnt:
                consumed.set_result(None)
                consumed = loop.create_future()
                next_val = loop.create_future()
                used_cnt = 2
            else:
                await consumed
    s, m, _ = await asyncio.gather(asum(consume()), amax(consume()),
                                   produce())
    return s, m
Although this version is based on switching between coroutines inside a single thread, just like the one using greenlet, it looks different. asyncio doesn't provide explicit switching of coroutines, it bases task switching on the await suspension/resumption primitive. The target of await can be another coroutine, but also an abstract "future", a value placeholder which will be filled in later by some other coroutine. Once the awaited value becomes available, the event loop automatically resumes execution of the coroutine, with the await expression evaluating to the provided value. So instead of produce switching to consumers, it suspends itself by awaiting a future that will arrive once all the consumers have observed the produced value.
consume() is an asynchronous generator, which is like an ordinary generator, except it creates an async iterator, which our aggregate coroutines are already prepared to accept by using async for. An async iterator's equivalent of __next__ is called __anext__ and is a coroutine, allowing the coroutine that exhausts the async iterator to suspend while waiting for the new value to arrive. When a running async generator suspends on an await, that is observed by async for as a suspension of the implicit __anext__ invocation. consume() does exactly that when it waits for the values provided by produce and, as they become available, transmits them to aggregate coroutines like asum and amax. Waiting is realized using the next_val future, which carries the next element from it. Awaiting that future inside consume() suspends the async generator, and with it the aggregate coroutine.
The advantage of this approach compared to greenlets' explicit switching is that it makes it much easier to combine coroutines that don't know of each other into the same event loop. For example, one could have two instances of max_and_sum running in parallel (in the same thread), or run a more complex aggregate function that invoked further async code to do calculations.
The following convenience function shows how to run the above from non-asyncio code:
def max_and_sum_asyncio_sync(it):
    # trivially instantiate the coroutine and execute it in the
    # default event loop
    coro = max_and_sum_asyncio(it)
    return asyncio.get_event_loop().run_until_complete(coro)
Performance
Measuring and comparing performance of these approaches to parallel execution can be misleading because sum and max do almost no processing, which over-stresses the overhead of parallelization. Treat these as you would treat any microbenchmarks, with a large grain of salt. Having said that, let's look at the numbers anyway!
Measurements were produced using Python 3.6. The functions were run only once and given range(10000), with their time measured by subtracting time.time() before and after the execution. Here are the results:
max_and_sum_buffer and max_and_sum_tee: 0.66 ms - almost exact same time for both, with the tee version being a bit faster.
max_and_sum_threads_simple: 2.7 ms. This timing means very little because of non-deterministic buffering, so this might be measuring the time to start two threads and the synchronization internally performed by Python.
max_and_sum_threads: 1.29 seconds, by far the slowest option, ~2000 times slower than the fastest one. This horrible result is likely caused by a combination of the multiple synchronizations performed at each step of the iteration and their interaction with the GIL.
max_and_sum_greenlet: 25.5 ms, slow compared to the initial version, but much faster than the threaded version. With a sufficiently complex aggregate function, one can imagine using this version in production.
max_and_sum_asyncio: 351 ms, almost 14 times slower than the greenlet version. This is a disappointing result because asyncio coroutines are more lightweight than greenlets, and switching between them should be much faster than switching between fibers. It is likely that the overhead of running the coroutine scheduler and the event loop (which in this case is overkill given that the code does no IO) is destroying the performance on this micro-benchmark.
max_and_sum_asyncio using uvloop: 125 ms. This is more than twice the speed of regular asyncio, but still almost 5x slower than greenlet.
Running the examples under PyPy doesn't bring significant speedup, in fact most of the examples run slightly slower, even after running them several times to ensure JIT warmup. The asyncio function requires a rewrite not to use async generators (since PyPy as of this writing implements Python 3.5), and executes in somewhat under 100ms. This is comparable to CPython+uvloop performance, i.e. better, but not dramatic compared to greenlet.

If it holds for your aggregate functions that f(a, b, c, ...) == f(a, f(b, f(c, ...))), then you could just cycle through your functions and feed them one element at a time, each time combining them with the result of the previous application, like reduce would do, e.g. like this:
def aggregate(iterator, *functions):
    first = next(iterator)
    result = [first] * len(functions)
    for item in iterator:
        for i, f in enumerate(functions):
            result[i] = f((result[i], item))
    return result
This is considerably slower (about 10-20 times) than just materializing the iterator in a list and applying the aggregate function on the list as a whole, or using itertools.tee (which basically does the same thing, internally), but it has the benefit of using no additional memory.
Note, however, that while this works well for functions like sum, min or max, it does not work for other aggregating functions, e.g. finding the mean or median element of an iterator, as mean(a, b, c) != mean(a, mean(b, c)). (For mean, you could of course just get the sum and divide it by the number of elements, but computing e.g. the median taking just one element at a time will be more challenging.)
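To make the remark about the mean concrete, here is a minimal single-pass sketch that keeps only running state (a total, a maximum and a count) and derives the mean at the end. The function name sum_max_mean is made up for this illustration; the approach hardcodes its aggregates rather than accepting arbitrary functions.

```python
def sum_max_mean(iterator):
    """Compute sum, max and mean in one pass, storing only running state."""
    total = 0
    largest = None
    count = 0
    for item in iterator:
        total += item
        if largest is None or item > largest:
            largest = item
        count += 1
    if count == 0:
        raise ValueError("sum_max_mean() arg is an empty iterator")
    # the mean is derived from two running values rather than combined pairwise
    return total, largest, total / count

result = sum_max_mean(iter(range(1, 1000)))
```

This works for the mean because it can be reconstructed from the sum and the count; a median has no such small running state, which is why it is harder.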

multithreading check membership in Queue and stop the threads

I want to iterate over a list using two threads, one from the front and the other from the back, and put the elements in a Queue on each iteration. But before putting a value in the Queue, I need to check whether it is already there (that is, whether one of the threads has already put it in the Queue). When that happens, I need to stop the thread and return the list of traversed values for each thread.
This is what I have tried so far :
from Queue import Queue
from threading import Thread, Event
class ThreadWithReturnValue(Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs={}, Verbose=None):
        Thread.__init__(self, group, target, name, args, kwargs, Verbose)
        self._return = None
    def run(self):
        if self._Thread__target is not None:
            self._return = self._Thread__target(*self._Thread__args,
                                                **self._Thread__kwargs)
    def join(self):
        Thread.join(self)
        return self._return
main_path = Queue()
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
def a(main_path,g,l=[]):
    for i in g:
        l.append(i)
        print 'a'
        if is_in_queue(i,main_path):
            return l
        main_path.put(i)
def b(main_path,g,l=[]):
    for i in g:
        l.append(i)
        print 'b'
        if is_in_queue(i,main_path):
            return l
        main_path.put(i)
g=['a','b','c','d','e','f','g','h','i','j','k','l']
t1 = ThreadWithReturnValue(target=a, args=(main_path,g))
t2 = ThreadWithReturnValue(target=b, args=(main_path,g[::-1]))
t2.start()
t1.start()
# Wait for all produced items to be consumed
print main_path.join()
I used ThreadWithReturnValue, which creates a custom thread that returns a value.
And for membership checking I used the following function:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
Now if I start t1 first and then t2, I get 12 a's and then one b, after which nothing happens and I have to terminate Python manually!
But if I start t2 first and then t1, I get the following result:
b
b
b
b
ab
ab
b
b
b
b
a
a
So my question is: why does Python behave differently in these two cases? And how can I terminate the threads and make them communicate with each other?

Before we get into bigger problems, you're not using Queue.join right.
The whole point of this function is that a producer who adds a bunch of items to a queue can wait until the consumer or consumers have finished working on all of those items. This works by having the consumer call task_done after they finish working on each item that they pulled off with get. Once there have been as many task_done calls as put calls, the queue is done. You're not doing a get anywhere, much less a task_done, so there's no way the queue can ever be finished. So, that's why you block forever after the two threads finish.
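As a small illustration of the put/get/task_done/join protocol described above, here is a sketch in Python 3 (where the module is named queue rather than Queue). The names work, processed and consumer are made up for the example.

```python
import queue
import threading

work = queue.Queue()
processed = []

def consumer():
    while True:
        item = work.get()
        try:
            processed.append(item * 2)  # "work on" the item
        finally:
            work.task_done()  # exactly one task_done() per get()

# daemon thread, so it does not keep the program alive once the work is done
threading.Thread(target=consumer, daemon=True).start()

for n in range(5):
    work.put(n)

# blocks until task_done() has been called once for every put()
work.join()
```

Without the task_done() calls, work.join() would block forever, which matches the hang described in the question.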
The first problem here is that your threads are doing almost no work outside of the actual synchronization. If the only thing they do is fight over a queue, only one of them is going to be able to run at a time.
Of course that's common in toy problems, but you have to think through your real problem:
If you're doing a lot of I/O work (listening on sockets, waiting for user input, etc.), threads work great.
If you're doing a lot of CPU work (calculating primes), threads don't work in Python because of the GIL, but processes do.
If you're actually primarily dealing with synchronizing separate tasks, neither one is going to work well (and processes will be worse). It may still be simpler to think in terms of threads, but it'll be the slowest way to do things. You may want to look into coroutines; Greg Ewing has a great demonstration of how to use yield from to use coroutines to build things like schedulers or many-actor simulations.
Next, as I alluded to in your previous question, making threads (or processes) work efficiently with shared state requires holding locks for as short a time as possible.
So, if you have to search a whole queue under a lock, that had better be a constant-time search, not a linear-time search. That's why I suggested using something like an OrderedSet recipe rather than a list, like the one inside the stdlib's Queue.Queue. Then this function:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
… is only blocking the queue for a tiny fraction of a second—just long enough to look up a hash value in a table, instead of long enough to compare every element in the queue against x.
Finally, I tried to explain about race conditions on your other question, but let me try again.
You need a lock around every complete "transaction" in your code, not just around the individual operations.
For example, if you do this:
with queue locked:
    see if x is in the queue
if x was not in the queue:
    with queue locked:
        add x to the queue
… then it's always possible that x was not in the queue when you checked, but in the time between when you unlocked it and relocked it, someone added it. This is exactly why it's possible for both threads to stop early.
To fix this, you need to put a lock around the whole thing:
with queue locked:
    if x is not in the queue:
        add x to the queue
Of course this goes directly against what I said before about locking the queue for as short a time as possible. Really, that's what makes multithreading hard in a nutshell. It's easy to write safe code that just locks everything for as long as might conceivably be necessary, but then your code ends up only using a single core, while all the other threads are blocked waiting for the lock. And it's easy to write fast code that just locks everything as briefly as possible, but then it's unsafe and you get garbage values or even crashes all over the place. Figuring out what needs to be a transaction, and how to minimize the work inside those transactions, and how to deal with the multiple locks you'll probably need to make that work without deadlocking them… that's not so easy.
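The "lock around the whole transaction" idea can be sketched as follows. This is only an illustration under stated assumptions: SeenSet, add_if_absent and walk are made-up names, and a plain set guarded by a threading.Lock is swapped in for the queue, so that the membership check is constant-time and the check and the insert happen under the same lock.

```python
import threading

class SeenSet:
    """A set whose check-and-add is a single atomic transaction."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = set()

    def add_if_absent(self, x):
        """Return True if x was added, False if it was already present."""
        with self._lock:
            if x in self._items:
                return False
            self._items.add(x)
            return True

seen = SeenSet()
letters = list("abcdefghijkl")
visited = {"forward": [], "backward": []}

def walk(name, items):
    for x in items:
        if not seen.add_if_absent(x):
            return  # the other thread got here first: stop
        visited[name].append(x)

t1 = threading.Thread(target=walk, args=("forward", letters))
t2 = threading.Thread(target=walk, args=("backward", letters[::-1]))
t1.start()
t2.start()
t1.join()
t2.join()
```

How the letters are split between the two threads depends on scheduling, but every letter is claimed by exactly one of them, because no thread can slip in between the check and the add.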

A couple of things that I think can be improved:
Due to the GIL, you might want to use the multiprocessing (rather than threading) module. In general, CPython threading will not cause CPU intensive work to speed up. (Depending on what exactly is the context of your question, it's also possible that multiprocessing won't, but threading almost certainly won't.)
A function like your is_in_queue would likely lead to high contention.
The locked time seems linear in the number of items that need to be traversed:
def is_in_queue(x, q):
    with q.mutex:
        return x in q.queue
So, instead, you could possibly do the following.
Use multiprocessing with a shared dict:
from multiprocessing import Process, Manager
manager = Manager()
d = manager.dict()
# Fn definitions and such
p1 = Process(target=p1, args=(d,))
p2 = Process(target=p2, args=(d,))
Within each function, check for the item like this:
def p1(d):
    # Stuff
    if 'foo' in d:
        return

How to manage python threads results?

I am using this code:
def startThreads(arrayofkeywords):
    global i
    i = 0
    while len(arrayofkeywords):
        try:
            if i<maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i+1
                thread = doStuffWith(keyword)
                thread.start()
        except KeyboardInterrupt:
            sys.exit()
    thread.join()
for threading in Python. I have almost everything done, but I don't know how to manage the results of each thread. Each thread produces an array of strings as its result; how can I join all those arrays into one safely? If I try writing into a global array, two threads could be writing at the same time.

First, you actually need to save all those thread objects to call join() on them. As written, you're saving only the last one of them, and then only if there isn't an exception.
An easy way to do multithreaded programming is to give each thread all the data it needs to run, and then have it not write to anything outside that working set. If all threads follow that guideline, their writes will not interfere with each other. Then, once a thread has finished, have only the main thread aggregate the results into a global array. This is known as "fork/join parallelism."
If you subclass the Thread object, you can give it space to store that return value without interfering with other threads. Then you can do something like this:
class MyThread(threading.Thread):
    def __init__(self, ...):
        threading.Thread.__init__(self)
        self.result = []
        ...
def main():
    # doStuffWith() returns a MyThread instance
    # (keep the thread objects themselves: start() returns None)
    threads = [ doStuffWith(k) for k in arrayofkeywords[:maxThreads] ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
        ret = t.result
        # process return value here
Edit:
After looking around a bit, it seems like the above method isn't the preferred way to do threads in Python. The above is more of a Java-esque pattern for threads. Instead you could do something like:
from threading import Thread
def handler(outList):
    ...
    # Modify existing object (important!)
    outList.append(1)
    ...
def doStuffWith(keyword):
    ...
    result = []
    thread = Thread(target=handler, args=(result,))
    return (thread, result)
def main():
    threads = [ doStuffWith(k) for k in arrayofkeywords[:maxThreads] ]
    for t in threads:
        t[0].start()
    for t in threads:
        t[0].join()
        ret = t[1]
        # process return value here

Use a Queue.Queue instance, which is intrinsically thread-safe. Each thread can .put its results to that global instance when it's done, and the main thread (when it knows all working threads are done, by .joining them for example as in @unholysampler's answer) can loop .getting each result from it, and use each result to .extend the "overall result" list, until the queue is emptied.
Edit: there are other big problems with your code -- if the maximum number of threads is less than the number of keywords, it will never terminate (you're trying to start a thread per keyword -- never less -- but once you've started the maximum number you loop forever to no further purpose).
Consider instead using a threading pool, kind of like the one in this recipe, except that in lieu of queueing callables you'll queue the keywords -- since the callable you want to run in the thread is the same in each thread, just varying the argument. Of course that callable will be changed to peel something from the incoming-tasks queue (with .get) and .put the list of results to the outgoing-results queue when done.
To terminate the N threads you could, after all keywords, .put N "sentinels" (e.g. None, assuming no keyword can be None): a thread's callable will exit if the "keyword" it just pulled is None.
More often than not, Queue.Queue offers the best way to organize threading (and multiprocessing!) architectures in Python, be they generic like in the recipe I pointed you to, or more specialized like I'm suggesting for your use case in the last two paragraphs.
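A minimal sketch of the two-queue, sentinel-terminated layout described in the last paragraphs, written for Python 3 (where the module is named queue rather than Queue). The names worker, process_keyword and run_pool are hypothetical, and process_keyword is only a stand-in for whatever each thread really does with a keyword:

```python
import queue
import threading

def process_keyword(keyword):
    # Hypothetical stand-in for the real per-keyword work;
    # returns a list of strings, as in the question.
    return [keyword.upper(), keyword[::-1]]

def worker(tasks, results):
    # Pull keywords until the sentinel (None) arrives, then exit.
    while True:
        keyword = tasks.get()
        if keyword is None:
            break
        results.put(process_keyword(keyword))

def run_pool(keywords, num_threads=3):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for kw in keywords:
        tasks.put(kw)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker thread
    for t in threads:
        t.join()
    # All workers are done, so draining with empty() is safe here.
    overall = []
    while not results.empty():
        overall.extend(results.get())
    return overall
```

Calling run_pool(["ab", "cd"], num_threads=2) returns the four result strings in whatever order the workers finished, so sort the list afterwards if order matters.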

You need to keep a reference to each thread you create. As is, your code only ensures the last created thread finishes. This does not imply that all the ones you started before it have also finished.
def startThreads(arrayofkeywords):
    global i
    i = 0
    threads = []
    while len(arrayofkeywords):
        try:
            if i<maxThreads:
                keyword = arrayofkeywords.pop(0)
                i = i+1
                thread = doStuffWith(keyword)
                thread.start()
                threads.append(thread)
        except KeyboardInterrupt:
            sys.exit()
    for t in threads:
        t.join()
        # process results stored in each thread
This also solves the problem of write access because each thread will store its data locally. Then, after all of them are done, you can do the work to combine each thread's local data.

I know that this question is a little bit old, but the best way to do this is to spare yourself the manual thread bookkeeping proposed in the other answers :)
Please read the reference on Pool. This way you will fork-join your work:
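from multiprocessing import Pool  # assumed: the Pool referred to above
def doStuffWith(keyword):
    return keyword + ' processed in thread'
def startThreads(arrayofkeywords):
    pool = Pool(processes=maxThreads)
    # map() blocks until every keyword is processed and keeps input order
    result = pool.map(doStuffWith, arrayofkeywords)
    print result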
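If you want the same map-style call but with threads instead of processes, something like the following should work. It is only a sketch in Python 3 that swaps in multiprocessing.pool.ThreadPool; do_stuff_with, start_threads and max_threads are hypothetical names, and the worker returns a list of strings to match the question:

```python
from multiprocessing.pool import ThreadPool

def do_stuff_with(keyword):
    # Hypothetical worker: returns a list of strings per keyword.
    return [keyword + ' processed in thread']

def start_threads(keywords, max_threads=4):
    pool = ThreadPool(processes=max_threads)
    try:
        # map() blocks until all keywords are done; results keep input order.
        per_keyword = pool.map(do_stuff_with, keywords)
    finally:
        pool.close()
        pool.join()
    # Only the calling thread touches the merged list, so no lock is needed.
    merged = []
    for chunk in per_keyword:
        merged.extend(chunk)
    return merged
```

The pool caps the number of concurrent threads at max_threads no matter how many keywords there are, so the "more keywords than threads" case needs no special handling.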

Writing into a global array is fine if you use a semaphore to protect the critical section. You 'acquire' the lock when you want to append to the global array, then 'release' it when you are done. This way, only one thread is ever appending to the array.
Check out http://docs.python.org/library/threading.html and search for semaphore for more info.
sem = threading.Semaphore()
...
sem.acquire()
# do dangerous stuff
sem.release()
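A slightly fuller sketch of the same idea in Python 3. It swaps in threading.Lock with a with block instead of explicit acquire/release calls on a Semaphore, so the lock is released even if the worker raises. The names collect and worker, and the fake per-keyword results, are made up for illustration:

```python
import threading

def collect(keywords):
    results = []                    # shared list, written by all threads
    results_lock = threading.Lock()

    def worker(keyword):
        # Build the thread's own list first, outside the lock.
        local = [keyword + "-1", keyword + "-2"]
        # Critical section: only one thread extends the shared list at a time.
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker, args=(k,)) for k in keywords]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Keeping the work outside the with block and only the extend inside it keeps the time spent holding the lock short.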

Try a semaphore's methods, such as acquire and release:
http://docs.python.org/library/threading.html
