After searching Google, Stack Overflow, and other sites, I'm still confused about how to apply a queue and threading to my code:
import psycopg2
import sys
import re
# for threading and queue
import multiprocessing
from multiprocessing import Queue
import time
from datetime import datetime

class Database_connection():
    def db_call(self, query, dbHost, dbName, dbUser, dbPass):
        try:
            con = None
            con = psycopg2.connect(host=dbHost, database=dbName,
                                   user=dbUser, password=dbPass)
            cur = con.cursor()
            cur.execute(query)
            data = cur.fetchall()
            resultList = []
            for data_out in data:
                resultList.append(data_out)
            return resultList
        except psycopg2.DatabaseError, e:
            print 'Error %s' % e
            sys.exit(1)
        finally:
            if con:
                con.close()

w = Database_connection()
sql = "select stars from galaxy"
startTime = datetime.now()
for result in w.db_call(sql, "x", "x", "x", "x"):
    print result[0]
print "Runtime: " + str(datetime.now()-startTime)
Let's suppose the query returns 100+ values. How can I put those results on a queue and process them (print them, for example) five at a time using a queue and the multiprocessing module?
What do you want this code to do?
You get no output from this code because get() returns the next item from the queue (doc). You are putting the letters from the SQL response into the queue one letter at a time: the i in for i... loops over the list returned by w.db_call. Those items are (I assume) strings, which you then iterate over, adding one character at a time to the queue. The next thing you do is remove the element you just added, which leaves the queue unchanged on each pass through the loop. If you put a print statement in the loop, it prints the letter it just got from the queue.
Queues are used to pass information between processes. I think you are trying to set up a producer/consumer pattern, where one process adds things to the queue and multiple other processes consume things from it. See the working example of multiprocessing.Queue and the links contained therein (example, main documentation).
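For what it's worth, here is a minimal Python 3 sketch of that producer/consumer pattern. The names handle, consumer and run, the doubling "work", and the None sentinel are all made up for illustration; none of them come from the question's code.

```python
import multiprocessing as mp

SENTINEL = None  # hypothetical marker telling a consumer to stop


def handle(item):
    """Placeholder for the real per-item work."""
    return item * 2


def consumer(jobs, results):
    """Take items from `jobs` until the sentinel arrives."""
    while True:
        item = jobs.get()          # blocks until an item is available
        if item is SENTINEL:
            break
        results.put(handle(item))


def run(items, n_workers=5):
    jobs, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=consumer, args=(jobs, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in items:             # the producer side
        jobs.put(item)
    for _ in workers:              # one sentinel per consumer
        jobs.put(SENTINEL)
    # Read the results before joining, so no worker is left blocked
    # trying to flush data into a queue nobody is reading.
    out = [results.get() for _ in items]
    for w in workers:
        w.join()
    return sorted(out)


if __name__ == "__main__":
    print(run(list(range(10))))
```

The sentinel is one common way to tell consumers that no more work is coming; checking whether the queue is empty is less reliable, because the producer may simply not have caught up yet.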
Probably the simplest way to get this working, as long as you don't need it to run in an interactive shell, is to use Pool (lifted almost verbatim from the multiprocessing documentation):
from multiprocessing import Pool

def f(res):
    # put whatever you want to do with each of the query results in here
    return res

if __name__ == '__main__':
    # create the pool after f is defined so the worker processes can find it
    p = Pool(5)  # sets the number of worker processes you want
    result_lst = w.db_call(sql, "x", "x", "x", "x")
    proced_results = p.map(f, result_lst)
This applies whatever you want to do to each result (written into the function f) and returns the results of that manipulation as a list. The number of subprocesses to use is set by the argument to Pool.
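A self-contained Python 3 version of the same idea, as a rough sketch: the rows list stands in for whatever w.db_call(...) returns, and process_row and process_all are illustrative names, not part of the original code.

```python
from multiprocessing import Pool


def process_row(row):
    """Hypothetical per-row work: take the first column and upper-case it."""
    return str(row[0]).upper()


def process_all(rows, workers=5):
    # The pool is created after process_row is defined, so the workers can find it.
    with Pool(workers) as pool:
        return pool.map(process_row, rows)   # results keep the input order


if __name__ == "__main__":
    rows = [("sirius",), ("vega",), ("rigel",)]   # stand-in for w.db_call(...)
    print(process_all(rows))
```

Pool.map blocks until every row has been handled, so there is no queue to manage by hand.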
This is my suggestion...
import Queue
from threading import Thread

class Database_connection:
    def db_call(self, query, dbHost, dbName, dbUser, dbPass):
        # your code here
        return

# in this example each thread will execute this function
def processFtpAddrMt(queue):
    # loop will continue until queue containing FTP addresses is empty
    while True:
        # get an ftp address without blocking; Queue.Empty is raised
        # when the queue is empty and the loop will break
        try:
            ftp_addr = queue.get_nowait()
        except Queue.Empty:
            break
        # put code to process the ftp address here
        # let queue know this task is done
        queue.task_done()

w = Database_connection()
sql = "select stars from galaxy"
ftp_addresses = w.db_call(sql, "x", "x", "x", "x")

# put each result of the SQL call in a Queue class
ftp_addr_queue = Queue.Queue()
for addr in ftp_addresses:
    ftp_addr_queue.put(addr)

# create five threads where each one will run processFtpAddrMt
# pass the queue to the processFtpAddrMt function
for x in range(0, 5):
    t = Thread(target=processFtpAddrMt, args=(ftp_addr_queue,))
    t.setDaemon(True)
    t.start()

# blocks further execution of the script until all queue items have been processed
ftp_addr_queue.join()
It uses the Queue class to store your SQL results and the Thread class to process the queue. Five threads are created, and each one runs the processFtpAddrMt function, which takes FTP addresses from the queue until the queue is empty. All you have to do is add the code for processing each FTP address. Hope this helps.
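In Python 3 the module is named queue rather than Queue. The same approach might look roughly like this; it is only a sketch, the squaring is placeholder work, and worker and process_in_threads are illustrative names.

```python
import queue
import threading


def worker(jobs, results):
    """Process items until the queue is empty."""
    while True:
        try:
            item = jobs.get_nowait()   # raises queue.Empty instead of blocking
        except queue.Empty:
            break
        results.append(item * item)    # placeholder work; list.append is atomic in CPython
        jobs.task_done()


def process_in_threads(items, n_threads=5):
    jobs = queue.Queue()
    for item in items:
        jobs.put(item)                 # fill the queue before starting the workers
    results = []
    threads = [threading.Thread(target=worker, args=(jobs, results), daemon=True)
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    jobs.join()                        # returns once every item has been marked done
    return sorted(results)


if __name__ == "__main__":
    print(process_in_threads(range(10)))
```

Because the queue is filled completely before any thread starts, an empty queue really does mean "no more work" here. If items were still being added while the threads run, a sentinel value would be the safer stop signal.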
I was able to solve the problem with the following:
def worker():
w = Database_connection()
sql = "select stars from galaxy"
for result in w.db_call(sql, "x", "x", "x", "x"):
if result:
jobs = []
startTime = datetime.now()
for i in range(1):
p = multiprocessing.Process(target=worker)
jobs.append(p)
p.start()
print "Runtime: " + str(datetime.now()-startTime)
I believe it is not the best way to do it, but for now it solved my problem :)
Note: this question is different from that question, notably in when the jobs are dispatched to the workers and when the results are gathered.
So I have this code:
import asyncio
import multiprocessing as MP
import time
from typing import List

mp_jobqueue = MP.Queue()
mp_mgr = MP.Manager()
mp_state = mp_mgr.dict()
mp_faileds = mp_mgr.list()

# the processing in process_data_worker is very CPU-intensive,
# thus totally not suitable for async.
workers: List[MP.Process] = []
for ident in range(0, WORKER_COUNT):
    print(ident, end=" ", flush=True)
    mp_state[ident] = None
    w = MP.Process(
        target=process_data_worker,
        args=(mp_jobqueue, mp_state, mp_faileds),
    )
    w.start()
    workers.append(w)

# fetch_data asynchronously fetches chunks of data,
# each chunk will be directly fed into the job queue to be processed
# by the workers
asyncio.run(fetch_data(mp_jobqueue))

# when we reach here, all data-fetching should have been finished
# and submitted to the workers' job queue

# wait until mp_jobqueue is empty AND all workers are IDLE
safed_workers = 0
while not mp_jobqueue.empty() or safed_workers < WORKER_COUNT:
    time.sleep(1.0)
    safed_workers = sum(1 for state in mp_state.values() if state == "IDLE")

# gather failed results
faileds = list(mp_faileds)

# close manager first to prevent GetOverlappedResult error
mp_mgr.shutdown()
mp_mgr.join()

# disband the workers
for _ in workers:
    mp_jobqueue.put("DIE")
time.sleep(1.0)
mp_jobqueue.close()
for w in workers:
    w.join()
So as you can see, I cannot use pool.map() to gather the "faileds".
This got me thinking, though:
Would it be better (performance-wise) to use another Queue for mp_faileds instead of a list like it is now? I only need an object that can handle "add into bag" and "take out from bag until bag is empty".
Edit: Just found out about multiprocessing.queues.SimpleQueue. The answers to this question, notably this particular answer, seem to hint that SimpleQueue might be even faster. Can someone confirm?
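To make the "bag" idea concrete, here is a rough sketch of collecting failures through a plain multiprocessing.Queue. It says nothing about speed, which would need measuring. The DONE marker, process_chunk, worker and collect_failures are all hypothetical names; only the "DIE" message is borrowed from the code above.

```python
import multiprocessing as MP

DONE = "DONE"  # hypothetical marker a worker sends when it has finished


def process_chunk(chunk):
    """Placeholder for the CPU-heavy work; raises on 'bad' input."""
    if chunk < 0:
        raise ValueError("negative chunk")
    return chunk * 2


def worker(jobs, failed):
    while True:
        chunk = jobs.get()
        if chunk == "DIE":
            failed.put(DONE)          # tell the parent this worker is finished
            break
        try:
            process_chunk(chunk)
        except ValueError:
            failed.put(chunk)         # "add into bag"


def collect_failures(failed, n_workers):
    """'Take out from bag until bag is empty': stop after one DONE per worker."""
    failures, finished = [], 0
    while finished < n_workers:
        item = failed.get()
        if item == DONE:
            finished += 1
        else:
            failures.append(item)
    return failures


if __name__ == "__main__":
    jobs, failed = MP.Queue(), MP.Queue()
    procs = [MP.Process(target=worker, args=(jobs, failed)) for _ in range(3)]
    for p in procs:
        p.start()
    for chunk in [1, -2, 3, -4]:
        jobs.put(chunk)
    for _ in procs:
        jobs.put("DIE")
    print(sorted(collect_failures(failed, len(procs))))
    for p in procs:
        p.join()
```

With a per-worker DONE marker the parent knows when the bag is really empty, so it would not need to poll a shared state dict in this sketch.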
Goal:
Accelerate the random walk generation by using multiple processes.
Get the list of vertex ids from which I want random walks to be generated in an input queue
Start as many processes as possible with the correct parameters
Make them put the random walks into an output queue
Wait for completion
Read the output queue
What I am doing:
# Libraries imports
from multiprocessing import cpu_count, Process, Queue
import queue
import configparser
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import AnonymousTraversalSource, traversal
from gremlin_python.process.graph_traversal import __

# Function the processes are supposed to execute
def job(proc_id:int, siq:Queue, rwq:Queue, g:AnonymousTraversalSource, length:int):
    while True:
        try:
            # Get next element in ids queue
            start_id = siq.get_nowait()
        except queue.Empty:
            # If the ids queue is empty, then terminate
            break
        else:
            # Do a random walk of length <length> from the vertex with id <start_id>
            random_walk = g.V(start_id).repeat(
                __.local(__.both().sample(1))
            ).times(length).path().next()
            print(f"{proc_id}: rw obtained")
            # Transform the list of vertices into a comma-separated string of ids
            rwq.put(",".join(
                [str(v.id) for v in random_walk]
            ))
            print(f"{proc_id}: rw handled")

if __name__ == "__main__":
    # Get the parameters from the <config.ini> configuration file
    config = configparser.RawConfigParser()
    config.read("config.ini")
    jg_uri = config["JANUSGRAPH"]["URI"]
    file_random_walks = config["FILES"]["RANDOM_WALKS"]
    walks_nb_per_node = int(config["WALKS"]["NB_PER_NODE"])
    walks_length = int(config["WALKS"]["LENGTH"])

    # Connect to Janus Graph
    connection = DriverRemoteConnection(jg_uri, "g")
    g_main = traversal().withRemote(connection)

    # Instantiate the queues and populate the ids one
    start_ids_queue = Queue()
    random_walks_queue = Queue()
    for vertex in g_main.V().has("vertex_label", "<label>").fold().next():
        start_ids_queue.put(vertex.id)

    # Create and start the processes
    nb_processes = cpu_count()
    processes = []
    for i in range(nb_processes):
        p = Process(target=job, args=(
            i,
            start_ids_queue,
            random_walks_queue,
            g_main,
            walks_length
        ))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()

    # Once the processes are terminated, read the random walks queue
    random_walks = []
    while not random_walks_queue.empty():
        random_walks.append(random_walks_queue.get())

    # Do something with the random walks
    ...
Issue:
Once the processes are started, nothing seems to happen. I never get the X: rw obtained/X: rw handled messages. With a bit more logging, I can see that the queries have been sent but never finish.
In the logs, when performing the first g_main.V().has("vertex_label", "<label>").fold().next() in the main process (when I populate the ids queue), I have the following message:
DEBUG:gremlinpython:submit with bytecode '[['V'], ['has', 'vertex_label', 'movie'], ['fold']]'
DEBUG:gremlinpython:message '[['V'], ['has', 'vertex_label', '<label>'], ['fold']]'
DEBUG:gremlinpython:processor='traversal', op='bytecode', args='{'gremlin': [['V'], ['has', 'vertex_label', '<label>'], ['fold']], 'aliases': {'g': 'g'}}'
DEBUG:asyncio:Using selector: EpollSelector
When the other processes send their queries, I have similar logs:
DEBUG:gremlinpython:submit with bytecode '[['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']]'
DEBUG:gremlinpython:message '[['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']]'
DEBUG:gremlinpython:processor='traversal', op='bytecode', args='{'gremlin': [['V', 16456], ['repeat', [['local', [['both'], ['sample', 1]]]]], ['times', 10], ['path']], 'aliases': {'g': 'g'}}'
DEBUG:asyncio:Using selector: EpollSelector
The issue seems not to reside in the query sent, but instead in the indefinite wait that ensues.
If you know of an issue with gremlinpython and multiprocessing, if there is a problem in my multi-processing code, or if you have any explanation that I may have overlooked, please explain to me! Thanks a lot to everyone reading this!
Solutions:
The first partial solution that I found is to use multi-threading instead of multiprocessing:
import configparser
import logging
import threading
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import AnonymousTraversalSource, traversal
from gremlin_python.process.graph_traversal import __

class myThread(threading.Thread):
    def __init__(self, thread_id, g, length):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.thread_count = 0
        self.gtraversal = g
        self.walk_length = length

    def run(self):
        while True:
            # take the next start id, or stop when the list is exhausted
            start_ids_list_lock.acquire()
            try:
                start_id = start_ids_list.pop(0)
                start_ids_list_lock.release()
            except IndexError:
                start_ids_list_lock.release()
                break
            else:
                self.thread_count += 1
                random_walk = job(
                    vertex_id=start_id,
                    g=self.gtraversal,
                    length=self.walk_length
                )
                random_walks_list_lock.acquire()
                random_walks_list.append(random_walk)
                random_walks_list_lock.release()
        logging.info(f"Thread {self.thread_id}: {self.thread_count} done")

def job(vertex_id:int, g:AnonymousTraversalSource, length:int) -> str:
    random_walk = g.V(vertex_id).repeat(
        __.local(__.both().sample(1))
    ).times(length).path().next()
    # join the vertex ids, as in the multiprocessing version
    return ",".join([str(v.id) for v in random_walk])

config = configparser.RawConfigParser()
config.read("config.ini")
jg_uri = config["JANUSGRAPH"]["URI"]
file_random_walks = config["FILES"]["RANDOM_WALKS"]
walks_length = int(config["WALKS"]["LENGTH"])

connection = DriverRemoteConnection(jg_uri, "g")
g_main = traversal().withRemote(connection)

threads = []
start_ids_list = []
random_walks_list = []
random_walks_list_lock = threading.Lock()
start_ids_list_lock = threading.Lock()

start_ids_list = [vertex.id for vertex in g_main.V().has("vertex_label", "<label>").fold().next()]
nb_vertices = len(start_ids_list)
nb_threads = 6

for i in range(nb_threads):
    thread = myThread(
        thread_id=i,
        g=g_main,
        length=walks_length
    )
    thread.start()
    threads.append(thread)
for t in threads:
    t.join()

# Do something with the random walks
...
This solution works and improves the execution time of the program. It isn't a full answer though, as it doesn't explain why the multiprocessing version does not perform as I expected.
I see a lot of tutorials on how to use queues, but they always show them implemented in a single file. I'm trying to organize my code files well from the beginning because I expect the project to become very large. How do I make the queue that I initialize in my main file available to the functions in my other files?
Here is my main file:
import multiprocessing
import queue
from data_handler import data_handler
from get_info import get_memory_info
from get_info import get_cpu_info

if __name__ == '__main__':
    q = queue.Queue()
    getDataHandlerProcess = multiprocessing.Process(target=data_handler(q))
    getMemoryInfoProcess = multiprocessing.Process(target=get_memory_info(q))
    getCPUInfoProcess = multiprocessing.Process(target=get_cpu_info(q))
    getDataHandlerProcess.start()
    getMemoryInfoProcess.start()
    getCPUInfoProcess.start()
    print("DEBUG: All tasks successfully started.")
Here is my producer:
import psutil
import struct
import time
from data_frame import build_frame

def get_cpu_info(q):
    while True:
        cpu_string_data = bytes('', 'utf-8')
        cpu_times = psutil.cpu_percent(interval=0.0, percpu=True)
        for item in cpu_times:
            cpu_string_data = cpu_string_data + struct.pack('<d',item)
        cpu_frame = build_frame(cpu_string_data, 0, 0, -1, -1)
        q.put(cpu_frame)
        print(cpu_frame)
        time.sleep(1.000)

def get_memory_info(q):
    while True:
        memory_string_data = bytes('', 'utf-8')
        virtual_memory = psutil.virtual_memory()
        swap_memory = psutil.swap_memory()
        memory_info = list(virtual_memory+swap_memory)
        for item in memory_info:
            memory_string_data = memory_string_data + struct.pack('<d',item)
        memory_frame = build_frame(memory_string_data, 0, 1, -1, -1)
        q.put(memory_frame)
        print(memory_frame)
        time.sleep(1.000)

def get_disk_info(q):
    while True:
        disk_usage = psutil.disk_usage("/")
        disk_io_counters = psutil.disk_io_counters()
        time.sleep(1.000)
        print(disk_usage)
        print(disk_io_counters)

def get_network_info(q):
    while True:
        net_io_counters = psutil.net_io_counters()
        time.sleep(1.000)
        print(net_io_counters)
And here is my consumer:
def data_handler(q):
    while True:
        next_element = q.get()
        print(next_element)
        print('Item received at data handler queue.')
It is not entirely clear to me what you mean by "How do I get the queue that I initialize in my main file to import into the other function files?".
Normally you pass a queue as an argument to a function and use it within the function's scope, regardless of the file structure. You can also use any other variable-sharing technique that works for other data types.
Your code seems to have a few errors, however. Firstly, you shouldn't be using queue.Queue with multiprocessing. The multiprocessing module has its own version of that class.
q = multiprocessing.Queue()
It is slower than queue.Queue, but it works for sharing data across processes.
Secondly, the proper way to create process objects is:
getDataHandlerProcess = multiprocessing.Process(target=data_handler, args=(q,))
Otherwise you are actually calling data_handler(q) in the main thread and trying to assign its return value to the target argument of multiprocessing.Process. Your data_handler function never returns, so the program hangs at this point before multiprocessing even begins. Edit: more precisely, it waits forever trying to get an element from an empty queue that will never be filled.
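Putting both fixes together, a stripped-down sketch could look like this. Everything is in one file for brevity; producer and data_handler could just as well be imported from separate modules, since each only uses the queue it receives as an argument. The frame strings and the None sentinel are made up for illustration.

```python
import multiprocessing


def producer(q, n):
    """Could live in its own module (e.g. a hypothetical get_info.py)."""
    for i in range(n):
        q.put(f"frame-{i}")
    q.put(None)                      # sentinel: nothing more to send


def data_handler(q):
    """Could live in another module; it only needs the queue it is given."""
    received = []
    while True:
        item = q.get()
        if item is None:
            break
        received.append(item)
    print(f"handler received {len(received)} items")
    return received


if __name__ == "__main__":
    q = multiprocessing.Queue()
    # Pass the function itself as target and its arguments separately.
    p1 = multiprocessing.Process(target=producer, args=(q, 3))
    p2 = multiprocessing.Process(target=data_handler, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
```

Note that target=producer has no parentheses: the child process calls the function, not the parent.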
For all the active campaigns, I have to query the TSDB API over a date range to fetch data for each campaign ID, so I get all the campaign IDs from the DB and put them in a queue. The DB has 430 active campaign IDs.
But the Python code terminates after about 100 entries and I don't know why. Can somebody guide me here? If I remove the API-fetching code and just print the value returned by q.get(), the IDs to fetch come through.
Below is the code:
import mysql.connector
from datetime import datetime,timedelta
from datetime import date
import requests
import json
from collections import OrderedDict
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

db = mysql.connector.connect(
    host='HOSTNAME',
    database='DB',
    user='ROOT',
    password='PASSWORD',
    port='PORT'
)
print("Connection ID:", db.connection_id)
MAX_WORKERS=10

class Testing_mp(object):
    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If queue is full, put the message in a temporary buffer.
        If the queue is not full, adding the message to the queue.
        If the buffer is not empty and that the message queue is not full,
        putting back messages from the buffer to the queue.
        """
        if self.q.full():
            print("QISFULL",msg)
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        mycursor = db.cursor()
        mycursor.execute("select Id from Campaign where Status='ACTIVE' order by Id desc")
        myresult = mycursor.fetchall()
        for x in myresult:
            # the query selects a single column, so the Id is at index 0
            self.add_to_queue(x[0])
            sleep(random()*2)
        db.close() # close the connection

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print("Process {0} started".format(getpid()))
        while True:
            # If queue is not empty, pop the next element and do the work.
            # If queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True,timeout=None)
            start_date=datetime.today()
            start_date=start_date.date()
            end_date = start_date - timedelta(days=8)
            start_date = start_date - timedelta(days=1)
            print("{0} retrieved: {1}".format(getpid(), item))
            #print("STARTDATE",type(start_date))
            start_date_ft=start_date.strftime('%Y/%m/%d')
            ##print("ENDDATE",end_date)
            end_date_ft=end_date.strftime('%Y/%m/%d')
            url = "http://tsdb.metrics.com:4343/api/query"
            if item is not None:
                querystring = {"start":end_date_ft,"end":start_date_ft,"m":"avg:1d-avg:percentization{campaign="+str(item)+",type=seen}"}
                print(querystring)
                response = requests.request("GET", url,params=querystring)
                print(response.text)
                if response and response.text is not None:
                    loaded_json = json.loads(response.text,object_pairs_hook=OrderedDict)
                    for x in loaded_json:
                        for attribute, value in x.items():
                            if attribute is not None and attribute=="dps":
                                dps_data=loaded_json[0][attribute]
                                perValue=[]
                                if len(dps_data)>0:
                                    for key,val in dps_data.items():
                                        perValue.append(str(val))
                                        print(str(item)+"==ITEM=="+key+"="+str(val))
                                    print(perValue)
            # simulate some random length operations
            sleep(random()*1)

# Warning from Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Waits a bit for the child processes to do some work
    # because when the parent exits, childs are terminated.
    sleep(5)
I currently have two methods: one constantly reads RF data from another device, and the other sends that data every so often.
How could I do this? I need the incoming RF data to be constantly received and updated, while the sendData() method just grabs the data from the global variable whenever it can.
Here's the code so far, but it's not working...
import httplib, urllib
import time, sys
import serial
from multiprocessing import Process

key = 'MY API KEY'
rfWaterLevelVal = 0
ser = serial.Serial('/dev/ttyUSB0',9600)

def rfWaterLevel():
    global rfWaterLevelVal
    rfDataArray = ser.readline().strip().split()
    print 'incoming: %s' %rfDataArray
    if len(rfDataArray) == 5:
        rfWaterLevelVal = float(rfDataArray[4])
        print 'RFWater Level1: %.3f cm' % (rfWaterLevelVal)
    #rfWaterLevel = 0

def sendData():
    global rfWaterLevelVal
    params = urllib.urlencode({'field1':rfWaterLevelVal, 'key':key})
    headers = {"Content-type" : "application/x-www-form-urlencoded","Accept": "text/plain"}
    conn = httplib.HTTPConnection("api.thingspeak.com:80", timeout = 5)
    conn.request("POST", "/update", params, headers)
    #print 'RFWater Level2: %.3f cm' % (rfWaterLevelVal)
    response = conn.getresponse()
    print response.status, response.reason
    data = response.read()
    conn.close()

while True:
    try:
        rfWaterLevel()
        p = Process(target=sendData(), args())
        p.start()
        p.join()
        #Also tried threading...did not work..
        #t1 = threading.Thread(target=rfWaterLevel())
        #t2 = threading.Thread(target=sendData())
        #t1.start()
        #t1.join()
        #t2.join()
    except KeyboardInterrupt:
        print "caught keyboard interrupt"
        sys.exit()
Please help!
Just to clarify, I need rfWaterLevel() method to run constantly as the rf data is incoming constantly, and I need sendData() to just be called as soon as it's ready to send again (roughly every 5 seconds or so). But it seems as if, if there is any sort of delay to the incoming rf data then rf data stops updating itself (the received end) and thus the data being sent is not accurate to what is being sent from the rf transmitter.
Thanks in advance!
I can't give you a full solution, but I can point you in the right direction.
Your code has three problems.
1. Process starts (as the name suggests) a new process, not a new thread. A new process does not share global variables with the old process. You should use multithreading instead. Have a look at threading as explained here.
2. You are calling rfWaterLevel() inside the main thread. You need to start the second thread before entering the while loop.
3. You are creating the second thread again and again inside the while loop. Create it only once and put the while loop inside the function.
Your basic program structure should be like this:
import time
from threading import Thread

def thread_function_1():
    while True:
        rfWaterLevel()

def thread_function_2():
    while True:
        sendData()
        time.sleep(5)

# start thread 1
thread1 = Thread(target=thread_function_1)
thread1.start()

# start thread 2
thread2 = Thread(target=thread_function_2)
thread2.start()

# wait for both threads to finish
thread1.join()
thread2.join()
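To try this structure without the serial port or the HTTP call, here is a runnable Python 3 sketch. The fake sensor, the SharedLevel class, the stop event and the timings are all assumptions made for the demo; in the real program read_sensor would wrap the ser.readline() parsing and send would wrap the POST request.

```python
import threading
import time


class SharedLevel:
    """Latest reading, guarded by a lock so both threads see a consistent value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0.0

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def reader(level, stop, read_sensor):
    """Keep the shared value up to date until asked to stop."""
    while not stop.is_set():
        level.set(read_sensor())


def sender(level, stop, send, interval):
    """Send the most recent value every `interval` seconds."""
    while not stop.wait(interval):     # wait() returns True once stop is set
        send(level.get())


def run_demo(duration=0.5, interval=0.05):
    level, stop, sent = SharedLevel(), threading.Event(), []
    counter = iter(range(1, 10**9))

    def fake_sensor():                 # stand-in for the ser.readline() parsing
        time.sleep(0.01)
        return float(next(counter))

    threads = [
        threading.Thread(target=reader, args=(level, stop, fake_sensor)),
        threading.Thread(target=sender, args=(level, stop, sent.append, interval)),
    ]
    for t in threads:
        t.start()
    time.sleep(duration)               # let both threads run for a while
    stop.set()
    for t in threads:
        t.join()
    return sent


if __name__ == "__main__":
    print(run_demo())
```

The reader never waits for the sender, so a slow upload cannot delay incoming readings; the sender simply picks up whatever value is current when its interval elapses.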