import sqlite3

conn = sqlite3.connect('output.db')
count = 0
items = []
for item in InfStream:  # assume I have an infinite stream
    items.append((item,))
    count += 1
    if count == 10000:
        conn.executemany("INSERT INTO table VALUES (?)", items)
        conn.commit()
        items = []
        count = 0  # reset for the next batch
In this Python code, I have a stream of unknown length called InfStream from an API, and I would like to insert each item from the stream into a table in a SQLite database. I first collect a list of 10,000 items and then insert them into the db using executemany. Gathering 10,000 items takes around 1 hour. However, the code has a problem: while executemany is running, I have to wait around 15 seconds for it to finish. This is not acceptable in my case, because I need to keep getting items from the stream, otherwise the connection will drop if I delay too long.
I would like the loop to continue while executemany is running at the same time. Is it possible to do so?
N.B. The input is far slower than the write: 10,000 items from the input take around 1 hour, while the output takes only 15 seconds.
This is a classic producer-consumer problem that is best handled with a Queue.
The producer in this case is your InfStream, and the consumer is everything inside your for block.
It is straightforward to convert your sequential code into a multi-threaded producer-consumer model, using a Queue to dispatch data between the threads.
Consider your code:
import sqlite3

conn = sqlite3.connect('output.db')
count = 0
items = []
for item in InfStream:  # assume I have an infinite stream
    items.append((item,))
    count += 1
    if count == 10000:
        conn.executemany("INSERT INTO table VALUES (?)", items)
        conn.commit()
        items = []
        count = 0  # reset for the next batch
Create a consumer function to consume the data:
def consumer(q):
    def helper():
        # open the connection inside the consumer thread: by default a sqlite3
        # connection can only be used from the thread that created it
        conn = sqlite3.connect('output.db')
        while True:
            items = [(q.get(),) for _ in range(10000)]
            conn.executemany("INSERT INTO table VALUES (?)", items)
            conn.commit()
    return helper
And a producer function to feed it ad infinitum:
from queue import Queue
from threading import Thread

def producer():
    q = Queue()
    t = Thread(target=consumer(q))
    t.daemon = True
    t.start()
    for item in InfStream:
        q.put(item)
Additional notes in response to the comments:
Theoretically, the queue can grow to an arbitrary size, limited only by system resources.
If the consumer cannot keep pace with the producer:
Spawn multiple consumers.
Cache the data on a faster I/O device and flush it to the database later.
Make the batch count configurable and dynamic (a brief sketch follows below).
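As a rough illustration of that last point only (a sketch under assumptions, not part of the original answer: the table name, database path, batch size, and timeout are placeholders), the batch count can simply be passed in as a parameter:
import sqlite3
from queue import Queue
from threading import Thread

def make_consumer(q, batch_size=10000, db_path='output.db'):
    # batch_size is the configurable "count"; tune it to trade latency
    # against the number of commits. Names here are illustrative.
    def helper():
        conn = sqlite3.connect(db_path, timeout=60)
        while True:
            batch = [(q.get(),) for _ in range(batch_size)]
            conn.executemany("INSERT INTO stream_items VALUES (?)", batch)
            conn.commit()
    return helper

q = Queue()
Thread(target=make_consumer(q, batch_size=2000), daemon=True).start()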
It sounds like executemany is blocked on I/O, so threading might actually help here; I would try that first. In particular, create a separate thread which simply calls executemany on data that the first thread puts onto a shared queue. Then the first thread can keep reading while the second thread does the executemany. As the other answer pointed out, this is a producer-consumer problem.
If that does not solve the problem, switch to multiprocessing.
Note that if your input is flowing in more quickly than you can write in the second thread or process, then neither solution will work, because you will fill up memory faster than you can empty it. In that case, you will have to throttle the input reading rate, regardless.
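One simple way to apply that throttling, sketched here under the assumption that a plain queue.Queue hands items to the writer thread (all names are illustrative), is to bound the queue so the reading loop blocks whenever the writer falls too far behind:
from queue import Queue
from threading import Thread

# A bounded queue: put() blocks once maxsize items are waiting, which
# throttles the reader instead of letting memory grow without limit.
work_queue = Queue(maxsize=50000)

def writer():
    while True:
        item = work_queue.get()
        # ... executemany/commit a batch to the database here ...
        work_queue.task_done()

Thread(target=writer, daemon=True).start()

# reader side: this blocks whenever the queue is full
# for item in InfStream:
#     work_queue.put(item)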
Related
I have a task which is sort of I/O bound and CPU bound at the same time.
Basically, I get a list of queries from a user, Google-search each of them (via a custom search API), store each query's results in a .txt file, and store all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped in an object which has 2 member fields that I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e.:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name":list().
_my_list is a list of queries to search on Google. It is safe to assume that it is not written to.
For each query: I search it on Google, grab the top results, and store them in _my_dict.
I want to do this in parallel. I thought that threading might be good, but it seems to slow the work down...
Here is how I attempted to do it (this is the method which does the entire job per query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
This is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work; I get corrupted results...
When I use multithreading, however, the results are fine, but very slow:
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query spawns threads too quickly and becomes a bottleneck). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread-pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:
    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        list_data : list
            List of data to multi-thread over
        thread_cap : int
            Maximum number of threads in the pool
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data
        self._lock = threading.Lock()  # protects current_index across threads

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool += [thread]
            thread.start()

    def _wrapper(self):
        while not self.complete:
            # claim the next index under the lock so two threads never
            # process the same item
            with self._lock:
                if self.current_index < self.total_index:
                    self.current_index += 1
                    index = self.current_index
                else:
                    self.complete = True
                    return
            self.func(self.list_data[index])

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests  #, time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

#start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()
# output queries to file
#print("Time:{:2f}".format(time.time()-start_time))
You could also open the file and output whatever you need as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's solid boilerplate with a lightweight class I made that should greatly reduce the time it takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed 100 queries in ~2 seconds. I could definitely play with the thread cap and push the timings lower before finding the bottleneck.
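For comparison only (a sketch, not part of the original answer), the standard library's concurrent.futures.ThreadPoolExecutor gives the same capped-thread behavior with less custom code, reusing the example_callback_func and query list defined above:
from concurrent.futures import ThreadPoolExecutor

queries = ["examplequery" + str(n) for n in range(100)]

# at most max_workers threads make requests at any one time,
# mirroring the thread_cap above
with ThreadPoolExecutor(max_workers=30) as pool:
    list(pool.map(example_callback_func, queries))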
I have three base stations that have to work in parallel, and every 10 seconds they receive a list containing information about their cluster; I want to run this code for about 10 minutes. So, every 10 seconds my three threads have to call the target method with new arguments, and this process should last for 10 minutes. I don't know how to do this, but I came up with the idea below, which doesn't seem to be a good one! I'd appreciate any help.
I have a list named base_centroid_assign whose items I want to pass to distinct threads. The list content is updated frequently (say every 10 seconds), so I wish to reuse my previous threads and give the updated items to them.
In the code below, the list contains three items, each of which contains multiple items (it's nested). I want the three threads to stop after executing the quite simple target function and then be recalled with the updated items; however, when I run the code, I end up with 30 threads! (The run_time variable is 10 and the list's length is 3.)
How can I implement the idea mentioned above?
run_time = 10

def cluster_status_broadcasting(info_base_cent_avr):
    print(threading.current_thread().name)
    info_base_cent_avr.sort(key=lambda item: item[2], reverse=True)

start = time.time()
while run_time > 0:
    for item in base_centroid_assign:
        t = threading.Thread(target=cluster_status_broadcasting, args=(item,))
        t.daemon = True
        t.start()
    print('Entire job took:', time.time() - start)
    run_time -= 1
Welcome to Stack Overflow.
Problems with thread synchronisation can be so tricky to handle that Python already has some very useful libraries specifically for such tasks. The primary one is queue.Queue in Python 3. The idea is to have a queue for each "worker" thread. The main thread collects new data and puts it onto a queue, and the subsidiary threads get the data from that queue.
When you call a Queue's get method, its normal action is to block the thread until something is available, but presumably you want the threads to continue working on their current inputs until new ones are available, in which case it makes more sense to poll the queue and continue with the current data if there is nothing new from the main thread.
I outline such an approach in my answer to this question, though in that case the worker threads are actually sending return values back on another queue.
The structure of your worker threads' run method would then need to be something like the following pseudo-code:
def run(self):
    request_data = self.inq.get()  # wait for the first item
    while True:
        process_with(request_data)
        try:
            request_data = self.inq.get(block=False)
        except queue.Empty:
            continue  # nothing new yet: keep working with the current data
You might like to add logic to terminate the thread cleanly when a sentinel value such as None is received.
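A rough sketch of that sentinel handling (process_with is the placeholder from the pseudo-code above; the class and queue wiring are assumptions, not from the question) might look like:
import queue
import threading

class Worker(threading.Thread):
    def __init__(self, inq):
        super().__init__(daemon=True)
        self.inq = inq  # queue.Queue that the main thread fills

    def run(self):
        request_data = self.inq.get()       # wait for the first item
        while True:
            if request_data is None:        # sentinel: shut down cleanly
                return
            process_with(request_data)      # placeholder for the real work
            try:
                # keep reusing the current data until something new arrives
                request_data = self.inq.get(block=False)
            except queue.Empty:
                pass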
I have a dataframe, several thousand rows in length, that contains two pairs of GPS coordinates in one of its columns, from which I am trying to calculate the drive time between those coordinates. I have a function that takes in those coordinates and returns the drive time; it takes maybe 3-8 seconds per entry, so the total process can take quite a while. What I'd like to be able to do is: using maybe 3-5 threads, iterate through the list, calculate the drive time, and move on to the next entry while the other threads are completing, without creating more than 5 threads in the process. Independently, I have everything working: I can run multiple threads, I can track the thread count and wait until the number of active threads drops below the limit before the next one starts, and I can iterate the dataframe and calculate the drive time. However, I'm having trouble piecing it all together. Here's an edited, slimmed-down version of what I have.
import pandas
import threading
import arcgis
class MassFunction:
    #This is intended to keep track of the active threads
    threadCount = 0

    def startThread(functionName, params=None):
        #This kicks off a new thread and should count up to keep track of the threads
        MassFunction.threadCount += 1
        if params is None:
            t = threading.Thread(target=functionName)
        else:
            t = threading.Thread(target=functionName, args=[params])
        t.daemon = True
        t.start()

class GeoAnalysis:
    #This class handles the connection to the ArcGIS services
    def __init__(self):
        super(GeoAnalysis, self).__init__()
        self.my_gis = arcgis.gis.GIS("https://www.arcgis.com", username, pw)

    def drivetimeCalc(self, coordsString):
        #The coords come in as a string, formatted as 'lat_1,long_1,lat_2,long_2'
        #This is the bottleneck of the process, as this calculation/response
        #below takes a few seconds to get a response
        points = coordsString.split(", ")
        route_service_url = self.my_gis.properties.helperServices.route.url
        self.route_layer = arcgis.network.RouteLayer(route_service_url, gis=self.my_gis)
        point_a_to_point_b = "{0}, {1}; {2}, {3}".format(points[1], points[0], points[3], points[2])
        result = self.route_layer.solve(stops=point_a_to_point_b, return_directions=False, return_routes=True, output_lines='esriNAOutputLineNone', return_barriers=False, return_polygon_barriers=False, return_polyline_barriers=False)
        travel_time = result['routes']['features'][0]['attributes']['Total_TravelTime']
        #This is intended to 'remove' one of the active threads
        MassFunction.threadCount -= 1
        return travel_time

class MainFunction:
    #This is to give access to the GeoAnalysis class from this class
    GA = GeoAnalysis()

    def closureDriveTimeCalc(self, coordsList):
        #This is intended to loop in the event that a fifth thread gets started
        #and will prevent additional threads from starting
        while MassFunction.threadCount > 4:
            pass
        MassFunction.startThread(MainFunction.GA.drivetimeCalc, coordsList)

    def driveTimeAnalysis(self, location):
        #This reads a csv file containing a few thousand entries.
        #Each entry/row contains gps coordinates, which need to be
        #iterated over to calculate the drivetimes
        locationMemberFile = pandas.read_csv(someFileName)
        #The built-in apply() method in pandas seems to be the
        #fastest way to iterate through the rows
        locationMemberFile['DRIVETIME'] = locationMemberFile['COORDS_COL'].apply(self.closureDriveTimeCalc)
When I run this right now in VS Code, I can see the thread count go up into the thousands in the call stack, so I feel like it is not waiting for threads to finish and is not adding/subtracting from the threadCount value correctly. Any ideas/suggestions/tips would be much appreciated.
EDIT: Essentially my problem is how to get the travel_time value back so that it can be placed into the dataframe. I currently have no return statement in the closureDriveTimeCalc function, so while the function runs correctly, it doesn't send any information back to the apply() method.
Rather than doing this in an apply, I'd use multiprocessing's Pool.map:
from multiprocessing import Pool

with Pool(processes=4) as pool:
    locationMemberFile['DRIVETIME'] = pool.map(self.closureDriveTimeCalc, locationMemberFile['COORDS_COL'])
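Note that for this to fill the DRIVETIME column, the mapped function has to return the travel time for each row, which the question's closureDriveTimeCalc currently does not do. A rough sketch of that shape (compute_drive_time is a hypothetical stand-in for the ArcGIS call, and the process count is just an example) might be:
from multiprocessing import Pool

def drive_time_for_row(coords_string):
    # compute_drive_time is a hypothetical helper wrapping the routing call;
    # the key point is that the value is RETURNED, not discarded
    return compute_drive_time(coords_string)

def add_drive_times(location_member_file):
    with Pool(processes=4) as pool:
        # map() preserves input order, so results line up with the rows
        location_member_file['DRIVETIME'] = pool.map(
            drive_time_for_row, location_member_file['COORDS_COL'])
    return location_member_file
Since the work is mostly waiting on a network response, multiprocessing.dummy.Pool (a thread-based pool with the same map API) is a drop-in alternative that avoids pickling the ArcGIS objects.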
I am a beginner in Python and can't figure out how to do this:
I am running a Python script that puts a new value into a list every 5-10 seconds. I want to consume these elements from the list in another, multithreaded Python script, but with one value per thread, so a value shouldn't be reused; if there's no next value, it should wait until the next value is present. I have some code where I tried to do it, but with no success:
Script that creates values:
import random
import time

values = ['a','b','c','d','e','f']
cap = []
while True:
    cap.append(random.choice(values))
    print cap
    time.sleep(5)
Script that needs these values:
def adding(self):
    p = cap.pop()
    print (p)
However, in a multithreaded environment, each thread gives me the same value, even though I want the value for each thread to be different (e.g. remove a value already used by a thread). What are my options here?
If I understood correctly, you want to use one thread (a producer) to fill a list with values, and then a few different threads (consumers) to remove values from that same list, resulting in a series of consumers that hold mutually exclusive subsets of the values added by the producer.
A possible outcome might be:
Producer
cap.append('a')
cap.append('c')
cap.append('b')
cap.append('f')
Consumer 1
cap.pop() # a
cap.pop() # f
Consumer 2
cap.pop() # c
cap.pop() # b
If this is the behavior you want, I recommend using a thread-safe object like Queue (Python 2.*) or queue (Python 3.*).
Here is one possible implementation
Producer
import Queue
import random
import time

values = ['a','b','c','d','e','f']
q = Queue.Queue()
while True:
    q.put(random.choice(values))
    print q
    time.sleep(5)
Consumer
val = q.get() # this call will block (aka wait) for something to be available
print(val)
It's also very important that both the producer and the consumer have access to the same instance of q.
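As an assumption-laden sketch of the consumer side (written for Python 3, where the module is queue, and with made-up names), several consumer threads can share one queue instance, and each get() hands a distinct value to exactly one thread:
import queue
import random
import threading
import time

values = ['a', 'b', 'c', 'd', 'e', 'f']
q = queue.Queue()

def consumer(name):
    while True:
        val = q.get()      # blocks until a value is available
        print(name, val)   # each value is delivered to exactly one consumer
        q.task_done()

# a few consumer threads sharing the same queue instance
for n in range(3):
    threading.Thread(target=consumer, args=("consumer-%d" % n,), daemon=True).start()

# producer: same loop as above, feeding the shared queue
for _ in range(10):
    q.put(random.choice(values))
    time.sleep(1)
q.join()  # wait until every queued value has been processed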
I have a small function (see below) that returns a list of names mapped from a list of integers (e.g. [1,2,3,4]) which can be up to a thousand elements long.
This function can potentially get called tens of thousands of times at a time, and I want to know if I can do anything to make it run faster.
The graph_hash is a large hash that maps keys to sets of length 1000 or less. I am iterating over a set, mapping the values to names, and returning a list. The u.get_name_from_id() call queries an SQLite database.
Any thoughts on optimizing any part of this function?
def get_neighbors(pid):
    names = []
    for p in graph_hash[pid]:
        names.append(u.get_name_from_id(p))
    return names
Caching and multithreading are only going to get you so far, you should create a new method that uses executemany under the hood to retrieve multiple names from the database in bulk.
Something like names = u.get_names_from_ids(graph_hash[pid]).
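The answer leaves the body of that bulk method open; one hedged sketch of it, using a single parameterized SELECT ... IN query (since sqlite3's executemany is geared toward bulk writes rather than reads) and made-up table/column names, might look like this:
import sqlite3

def get_names_from_ids(conn, ids):
    """Fetch names for many ids in one round trip.

    `conn` is an open sqlite3 connection; the 'people', 'id', and 'name'
    identifiers are illustrative placeholders, not from the question.
    """
    ids = list(ids)
    if not ids:
        return []
    # note: very large id lists may need chunking to stay under
    # SQLite's bound-parameter limit
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        "SELECT id, name FROM people WHERE id IN (%s)" % placeholders,
        ids,
    ).fetchall()
    by_id = dict(rows)
    # preserve the order of the requested ids
    return [by_id.get(i) for i in ids]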
You're hitting the database sequentially here:
for p in graph_hash[pid]:
    names.append(u.get_name_from_id(p))
I would recommend doing it concurrently using threads. Something like this should get you started:
def load_stuff(q, p):
    q.put(u.get_name_from_id(p))

def get_neighbors(pid):
    names = Queue.Queue()
    # we'll keep track of the threads with this list
    threads = []
    for p in graph_hash[pid]:
        thread = threading.Thread(target=load_stuff, args=(names, p))
        threads.append(thread)
        # start the thread
        thread.start()
    # wait for them to finish before you return your Queue
    for thread in threads:
        thread.join()
    return names
You can turn the Queue back into a list with [item for item in names.queue] if needed.
The idea is that the database calls are blocking until they're done, but you can make multiple SELECT statements on a database without locking. So, you should use threads or some other concurrency method to avoid waiting unnecessarily.
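Purely as an alternative sketch (not from the answer above), the same concurrent lookups can be written more compactly with concurrent.futures, which also returns the names as an ordinary list in input order; max_workers is an assumed tuning knob:
from concurrent.futures import ThreadPoolExecutor

def get_neighbors(pid, max_workers=8):
    # run the per-id lookups concurrently; executor.map preserves input order
    ids = list(graph_hash[pid])
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(u.get_name_from_id, ids))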
I would recommend using a deque instead of a list if you are doing thousands of appends; names would then become names = deque().
A list comprehension is a start (similar to #cricket_007's generator suggestion), but you are limited by function calls:
def get_neighbors(pid):
    return [u.get_name_from_id(p) for p in graph_hash[pid]]
As #salparadise suggested, consider memoization to speed up get_name_from_id().
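A minimal sketch of that memoization, assuming get_name_from_id is a pure id-to-name lookup so cached values never go stale, could wrap the existing call with functools.lru_cache:
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_name_from_id(p):
    # delegate to the existing lookup; repeated ids skip the database entirely
    return u.get_name_from_id(p)

def get_neighbors(pid):
    return [cached_name_from_id(p) for p in graph_hash[pid]]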