Multithread or multiprocess - python

So, currently, I am using multiprocessing to run these 3 functions together.
As only the tokens change, is it recommended to switch to multi-threading? (If yes, will it really help performance, e.g. a speed-up? I assume memory usage will certainly be lower.)
This is my code:
from database_function import *
from kiteconnect import KiteTicker
import pandas as pd
from datetime import datetime, timedelta
import schedule
import time
from multiprocessing import Process

def tick_A():
    # credentials code here
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC LIMIT 50")]  # FETCHING FIRST 50 SCRIPTS TOKEN
    #print(tokens)

    ##### TO MAKE SURE THE TASK STARTS AFTER 8:59 ONLY ###########
    t = datetime.today()
    future = datetime(t.year, t.month, t.day, 8, 59)
    if ((future - t).total_seconds()) < 0:
        future = datetime(t.year, t.month, t.day, t.hour, t.minute, (t.second + 2))
    time.sleep((future - t).total_seconds())
    ##### TO MAKE SURE THE TASK STARTS AFTER 8:59 ONLY ###########

    def on_ticks(ws, ticks):
        global ltp
        ltp = ticks[0]["last_price"]
        for tick in ticks:
            print(f"{tick['instrument_token']}A")
            db_runquery(f'UPDATE SCRIPT SET ltp = {tick["last_price"]} WHERE zerodha = {tick["instrument_token"]}')  # UPDATING LTP IN DATABASE
            #print(f"{tick['last_price']}")

    def on_connect(ws, response):
        #print(f"response from connect :: {response}")
        # Subscribe to a list of instrument_tokens (TOKENS FETCHED ABOVE WILL BE SUBSCRIBED HERE).
        # logging.debug("on connect: {}".format(response))
        ws.subscribe(tokens)
        ws.set_mode(ws.MODE_LTP, tokens)  # SETTING TOKEN TO TICK MODE (LTP / FULL / QUOTE)

    kws.on_ticks = on_ticks
    kws.on_connect = on_connect
    kws.connect(threaded=True)

    ##### TO STOP THE TASK AFTER 15:32 #######
    end_time = datetime(t.year, t.month, t.day, 15, 32)
    while True:
        schedule.run_pending()
        #time.sleep(1)
        if datetime.now() > end_time:
            break
    ##### TO STOP THE TASK AFTER 15:32 #######

def tick_B():
    # everything remains the same; only the tokens value changes
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC OFFSET (50) ROWS FETCH NEXT (50) ROWS ONLY")]

def tick_C():
    # everything remains the same; only the tokens value changes
    tokens = [x[0] for x in db_fetchquery("SELECT zerodha FROM script ORDER BY id ASC OFFSET (100) ROWS FETCH NEXT (50) ROWS ONLY")]

if __name__ == '__main__':
    def runInParallel(*fns):
        proc = []
        for fn in fns:
            p = Process(target=fn)
            p.start()
            proc.append(p)
        for p in proc:
            p.join()
    runInParallel(tick_A, tick_B, tick_C)

Most Python implementations do not have true multi-threading, because they use a global interpreter lock (GIL), so only one thread runs Python code at a time.
For I/O-heavy applications it should not make a difference. But if you need CPU-heavy operations done in parallel (and I see that you use pandas, so the answer must be yes), you will be better off staying with a multi-process app.
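If you do decide to try threads, the swap is small. A minimal sketch, assuming tick_A, tick_B and tick_C are the functions from the question (they mostly wait on the websocket, so they are I/O bound and the GIL is not a practical bottleneck):

from threading import Thread

def run_in_parallel(*fns):
    # Same idea as runInParallel, but with threads sharing one process
    # (and therefore one interpreter and one copy of the imported modules).
    threads = []
    for fn in fns:
        t = Thread(target=fn)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

if __name__ == '__main__':
    run_in_parallel(tick_A, tick_B, tick_C)

Whether this actually reduces memory or wall-clock time is something to measure on your setup rather than assume.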

Related

Fetching the order depth with python-binance, but the code does not complete

I am running a Python script to fetch the current order books for all symbols that end with USDT.
Whenever I try to run it, it fetches the order book for the first three symbols (in this case BTCUSDT, ETHUSDT and BNBUSDT) and then stops. Any takers on what I am messing up here?
I am using this logic to get a list of the symbols and the order book:
import asyncio
import config as c  # from config.py
import infinity as inf  # user-defined function for infinity (probably not needed)
from binance import AsyncClient, DepthCacheManager, Client

client = Client(c.API_KEY, c.API_SECRET, tld='com')
info = client.get_exchange_info()
symbols = info['symbols']
ls = []
for s in symbols:
    if 'USDT' in s['symbol']:
        #if 'BUSD' not in s['symbol']:
        ls.append(s['symbol'])

async def main():
    # initialise the client
    client = await AsyncClient.create()
    for i in ls:
        async with DepthCacheManager(client, symbol=i, limit=10000) as dcm_socket:
            depth_cache = await dcm_socket.recv()
            symbol = i
            asks = depth_cache.get_asks()[:5]
            bids = depth_cache.get_bids()[:5]
            full = [symbol, asks, bids]
            print(full)

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
It wouldn't complete, because it's not supposed to.
DepthCacheManager is designed to establish a connection (WebSockets), get a snapshot of the order information, and then subscribe to a stream of updates to the current outstanding orders, which it applies locally in its "DepthCache". Each time that cache gets updated, it delivers the updated set of current asks/bids, as you can see.
The trading and orders never stop, so why would it stop?
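For reference, this is roughly how DepthCacheManager is meant to be consumed: as a stream you keep looping over for as long as you want updates. A sketch along the lines of the python-binance documentation (BTCUSDT is just an example symbol):

import asyncio
from binance import AsyncClient, DepthCacheManager

async def main():
    client = await AsyncClient.create()
    async with DepthCacheManager(client, symbol='BTCUSDT') as dcm:
        while True:  # the stream never ends on its own; you decide when to stop
            depth_cache = await dcm.recv()
            print("top 5 asks:", depth_cache.get_asks()[:5])
            print("top 5 bids:", depth_cache.get_bids()[:5])

if __name__ == "__main__":
    asyncio.run(main())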
Maybe you want to try: https://github.com/LUCIT-Systems-and-Development/unicorn-binance-local-depth-cache
import unicorn_binance_local_depth_cache
ubldc = unicorn_binance_local_depth_cache.BinanceLocalDepthCacheManager(exchange="binance.com")
ubldc.create_depth_cache("LUNABTC")
asks = ubldc.get_asks("LUNABTC")
bids = ubldc.get_bids("LUNABTC")
That's it :)

Python multiprocessing finish the work correctly, but the processes still alive (Linux)

I use Python multiprocessing to compute some sort of scores on DNA sequences from a large file.
For that I wrote and use the script below.
I use a Linux machine with 48 CPUs in a Python 3.8 environment.
The code works fine, terminates correctly, and prints the processing time at the end.
Problem: when I use the htop command, I find that all 48 processes are still alive.
I don't know why, and I don't know what to add to my script to avoid this.
import csv
import sys
import datetime  # needed for datetime.timedelta below
import concurrent.futures
from itertools import combinations
import psutil
import time

nb_cpu = psutil.cpu_count(logical=False)

def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    score_dist = compute_score_dist(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time

def help_fun_job(nested_pair):
    return fun_job(nested_pair[0], nested_pair[1])

def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor(max_workers=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
        save_results_to_csv(results)
    finish = time.perf_counter()
    proccessing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {proccessing_time} hh:mm:ss')

def main():
    print("nb_cpu in this machine : ", nb_cpu)
    file_path = sys.argv[1]
    dict_ids_seqs = get_dict_ids_seqs(file_path)
    list_ids = list(dict_ids_seqs)  # This will convert the dict_keys to a list
    list_combined_ids = list(combinations(list_ids, 2))
    compute_using_multi_processing(list_combined_ids, dict_ids_seqs)

if __name__ == '__main__':
    main()
Thank you for your help.
Edit: added the complete code for fun_job (after @Booboo's answer)
from Bio import Align

def fun_job(seq_1, seq_2):  # seq_i : (id, string)
    start = time.time()
    aligner = Align.PairwiseAligner()
    aligner.mode = 'global'
    score_dist = aligner.score(seq_1[1], seq_2[1])
    end = time.time()
    return seq_1[0], seq_2[0], score_dist, end - start  # id seq1, id seq2, score, time
When the with ... as executor: block exits, there is an implicit call to executor.shutdown(wait=True). This will wait for all pending futures to be done executing "and the resources associated with the executor have been freed", which presumably includes terminating the processes in the pool (if possible?). Why your program terminates (or does it?), or at least why you say all the futures have completed executing while the processes have not terminated, is a bit of a mystery. But you haven't provided the code for fun_job, so who can say why this is so?
One thing you might try is to switch to using the multiprocessing.pool.Pool class from the multiprocessing module. It supports a terminate method, which is implicitly called when its context-manager with block exits, and which explicitly attempts to terminate all processes in the pool:
#import concurrent.futures
import multiprocessing
... # etc.

def compute_using_multi_processing(list_comb_ids, dict_ids_seqs):
    start = time.perf_counter()
    with multiprocessing.Pool(processes=nb_cpu) as executor:
        results = executor.map(help_fun_job,
                               [((pair_ids[0], dict_ids_seqs[pair_ids[0]]), (pair_ids[1], dict_ids_seqs[pair_ids[1]]))
                                for pair_ids in list_comb_ids])
        save_results_to_csv(results)
    finish = time.perf_counter()
    proccessing_time = str(datetime.timedelta(seconds=round(finish - start, 2)))
    print(f' Processing time Finished in {proccessing_time} hh:mm:ss')
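If you want to check whether the workers really do go away once the with block exits, a small self-contained sketch (unrelated to your DNA-scoring code) is to ask multiprocessing for its live children afterwards:

import multiprocessing
import time

def square(x):
    time.sleep(0.1)
    return x * x

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(20))
    # Exiting the with block has called terminate(); give the workers a
    # moment to exit, then confirm none are left behind.
    time.sleep(0.5)
    print("live children:", multiprocessing.active_children())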

Stop multiprocess pool when a condition is met and continue with program

I've been trying to wrap my head around multiprocessing using an old python bitcoin mining program. Although relatively useless for mining, I figured this would be a great way to explore multiprocessing. However, I've hit a wall when it comes to stopping the processes when one of them achieves the goal they are all working towards.
I want to kill all multiprocessing pools when one of them finds the solution. Then allow the program to continue. I have tried terminate() and join(). I've attempted to include an Event(). I've tried using Process instead of Pool with the direction of a similar issue here: Killing a multiprocessing process when condition is met. However, same problem. How can I stop all processes after a condition is met without exiting the program with something like sys.exit() that would kill the entire program?
I also tried apply_async with the direction from this post: Python Multiprocess Pool. How to exit the script when one of the worker process determines no more work needs to be done? However, it did not solve the problem of needing to continue executing the final functions of the program. In fact, it actually slowed the program significantly.
For clarity, I've included the code I tried based on the above mentioned link here:
from multiprocessing import Pool
from hashlib import sha256
import time

def SHA256(text):
    return sha256(text.encode("ascii")).hexdigest()

def solution_helper(args):
    solution, nonce = do_job(args)
    if solution:
        print(f"\nNonce Found: {nonce}\n")
        return True
    else:
        return False

class Mining():
    def __init__(self, workers, initargs):
        self.pool = Pool(processes=workers, initargs=initargs)

    def callback(self, result):
        if result:
            print('Solution Found...Terminating Processes...')
            self.pool.terminate()

    def do_job(self):
        for args in values:
            start_nonce = args[0]
            end_nonce = args[1]
            prefix_str = '0'*difficulty
            self.pool.apply_async(solution_helper, args=args, callback=self.callback)
            start = time.time()
            for nonce in range(start_nonce, end_nonce):
                text = str(block_number) + transactions + previous_hash + str(nonce)
                new_hash = SHA256(text)
                if new_hash.startswith(prefix_str):
                    print(f"Hashing: {text}")
                    print(f"\nSuccessfully mined bitcoin with nonce value: {nonce}\n")
                    print(f"New hash: {new_hash}")
                    total_time = str((time.time()-start))
                    print(f"\nEnd mining... Mining took {total_time} seconds\n")
                    return new_hash, nonce
        self.pool.close()
        self.pool.join()
        print('.Goodbye.')

block_number = 5
transactions = """
bill->steve->20,
jan->phillis->45
"""
previous_hash = '0000000b7c7723e4d3a8654c975fe4dd23d4d37f22d0ea7e5abde2225d1567dc6'
values = [(20000, 100000), (100000, 1000000), (1000000, 10000000), (10000000, 100000000)]
difficulty = 4

m = Mining(5, values)
m.do_job()
Here's the basic concept. It works great to start the processes, but I cannot figure out how to stop them:
from multiprocessing import Pool
from hashlib import sha256
import functools

MAX_NONCE = 1000000000

def SHA256(text):
    return sha256(text.encode("ascii")).hexdigest()

def nonce(block_number, transactions, previous_hash, prefix_str):
    import time
    start = time.time()
    for nonce in range(MAX_NONCE):
        text = str(block_number) + transactions + previous_hash + str(nonce)
        new_hash = SHA256(text)
        if new_hash.startswith(prefix_str):
            print(f"\nYay! Successfully mined bitcoins with nonce value:{nonce}")
            total_time = str((time.time()-start))
            print(f"\nend mining. Mining took: {total_time} seconds\n")
            print(new_hash + "\n")

def mine(block_number, transactions, previous_hash, prefix_zeros):
    from multiprocessing import Pool
    with Pool(4) as p:
        prefix_str = '0'*prefix_zeros
        p.map(nonce(block_number, transactions, previous_hash, prefix_str), [20000, 40000, 60000, 80000, 100000])

if __name__=='__main__':
    transactions="""
bill->steve->20,
jan->phillis->45
"""
    difficulty=7
    print("\nstart mining\n")
    new_hash = mine(5, transactions, '0000000b7c7723e4d3a8654c975fe4dd23d4d37f22d0ea7e5abde2225d1567dc6', difficulty)
    # Do some other things... Here is where I'd like to get to after the multiprocesses are killed
    print(f"\nMission Complete...{new_hash}\n")  # <--- This never gets a chance to happen

Multiprocessing function not writing to file or printing

I'm working on a Raspberry Pi (3 B+) making a data collection device and I'm
trying to spawn a process to record the data coming in and write it to a file. I have a function for the writing that works fine when I call it directly.
When I call it using the multiprocessing approach, however, nothing seems to happen. I can see in task monitors in Linux that the process does in fact get spawned, but no file gets written, and when I try to pass a flag to it to shut down it doesn't work, meaning I end up terminating the process and nothing seems to have happened.
I've been over this every which way and can't see what I'm doing wrong; does anyone else? In case it's relevant, these are functions inside a parent class, and one of the functions is meant to spawn another as a thread.
Code I'm using:
from datetime import datetime, timedelta
import csv
from drivers.IMU_SEN0 import IMU_SEN0
import multiprocessing, os

class IMU_data_logger:
    _output_filename = ''
    _csv_headers = []
    _accelerometer_headers = ['Accelerometer X','Accelerometer Y','Accelerometer Z']
    _gyroscope_headers = ['Gyroscope X','Gyroscope Y','Gyroscope Z']
    _magnetometer_headers = ['Bearing']
    _log_accelerometer = False
    _log_gyroscope = False
    _log_magnetometer = False
    IMU = None
    _writer = []
    _run_underway = False
    _process = []
    _stop_value = 0

    def __init__(self, output_filename='/home/pi/blah.csv', log_accelerometer=True, log_gyroscope=True, log_magnetometer=True):
        """data logging device
        NOTE! Multiple instances of this class should not use the same IMU devices simultaneously!"""
        self._output_filename = output_filename
        self._log_accelerometer = log_accelerometer
        self._log_gyroscope = log_gyroscope
        self._log_magnetometer = log_magnetometer

    def __del__(self):
        # TODO Update this
        if self._run_underway:  # If there's still a run underway, end it first
            self.end_recording()

    def _set_up(self):
        self.IMU = IMU_SEN0(self._log_accelerometer, self._log_gyroscope, self._log_magnetometer)
        self._set_up_headers()

    def _set_up_headers(self):
        """Set up the headers of the CSV file based on the header substrings at top and the input flags on what will be measured"""
        self._csv_headers = []
        if self._log_accelerometer is not None:
            self._csv_headers += self._accelerometer_headers
        if self._log_gyroscope is not None:
            self._csv_headers += self._gyroscope_headers
        if self._log_magnetometer is not None:
            self._csv_headers += self._magnetometer_headers

    def _record_data(self, frequency, stop_value):
        self._set_up()  # Run setup in thread
        """Record data function, which takes a recording frequency, in hertz, as an input"""
        previous_read_time = datetime.now() - timedelta(1, 0, 0)
        self._run_underway = True  # Note that a run is now going
        Period = 1 / frequency  # Period, in seconds, of a recording based on the input frequency
        print("Writing output data to", self._output_filename)
        with open(self._output_filename, 'w', newline='') as outcsv:
            self._writer = csv.writer(outcsv)
            self._writer.writerow(self._csv_headers)  # Write headers to file
            while stop_value.value == 0:  # While a run continues
                if datetime.now() - previous_read_time >= timedelta(0, 1, 0):  # If we've waited a period, collect the data; otherwise keep looping
                    print("run underway value", self._run_underway)
                if datetime.now() - previous_read_time >= timedelta(0, Period, 0):  # If we've waited a period, collect the data; otherwise keep looping
                    previous_read_time = datetime.now()  # Update previous readtime
                    next_row = []
                    if self._log_accelerometer:
                        # Get values in m/s^2
                        axes = self.IMU.read_accelerometer_values()
                        next_row += [axes['x'], axes['y'], axes['z']]
                    if self._log_gyroscope:
                        # Read gyro values
                        gyro = self.IMU.read_gyroscope_values()
                        next_row += [gyro['x'], gyro['y'], gyro['z']]
                    if self._log_magnetometer:
                        # Read magnetometer value
                        b = self.IMU.read_magnetometer_bearing()
                        next_row += b
                    self._writer.writerow(next_row)
            # Close the csv when done
            outcsv.close()

    def start_recording(self, frequency_in_hz):
        # Create recording process
        self._stop_value = multiprocessing.Value('i', 0)
        self._process = multiprocessing.Process(target=self._record_data, args=(frequency_in_hz, self._stop_value))
        # Start recording process
        self._process.start()
        print(datetime.now().strftime("%H:%M:%S.%f"), "Data logging process spawned")
        print("Logging Accelerometer:", self._log_accelerometer)
        print("Logging Gyroscope:", self._log_gyroscope)
        print("Logging Magnetometer:", self._log_magnetometer)
        print("ID of data logging process: {}".format(self._process.pid))

    def end_recording(self, terminate_wait=2):
        """Function to end the recording multithread that's been spawned.
        Args: terminate_wait: This is the time, in seconds, to wait after attempting to shut down the process before terminating it."""
        # Get process id
        id = self._process.pid
        # Set stop event for process
        self._stop_value.value = 1
        self._process.join(terminate_wait)  # Wait two seconds for the process to terminate
        if self._process.is_alive():  # If it's still alive after waiting
            self._process.terminate()
            print(datetime.now().strftime("%H:%M:%S.%f"), "Process", id, "needed to be terminated.")
        else:
            print(datetime.now().strftime("%H:%M:%S.%f"), "Process", id, "successfully ended itself.")
====================================================================
ANSWER: For anyone following up here, it turns out the problem was my use of the VS Code debugger which apparently doesn't work with multiprocessing and was somehow preventing the success of the spawned process. Many thanks to Tomasz Swider below for helping me work through issues and, eventually, find my idiocy. The help was very deeply appreciated!!
I can see a few things wrong in your code:
First, stop_value == 0 will not work, as multiprocessing.Value('i', 0) != 0; change that line to
while stop_value.value == 0
Second, you never update previous_read_time, so it will write readings as fast as it can, and you will run out of disk space quickly.
Third, try using time.sleep(); what you are doing is called busy looping, and it is bad because it wastes CPU cycles needlessly (see the short sketch after these points).
Fourth, terminating with self._stop_value = 1 probably will not work; there must be another way to set that value, maybe self._stop_value.value = 1.
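A minimal, self-contained sketch of points three and four together (a stand-in print instead of your sensor reads, purely for illustration):

import multiprocessing
import time

def record_data(frequency, stop_value):
    period = 1 / frequency
    while stop_value.value == 0:
        # Sleep for roughly one sampling period instead of spinning on
        # datetime.now(); this frees the CPU between samples.
        time.sleep(period)
        print("take a reading here")  # stand-in for the sensor read + CSV write

if __name__ == '__main__':
    stop = multiprocessing.Value('i', 0)
    p = multiprocessing.Process(target=record_data, args=(20, stop))
    p.start()
    time.sleep(2)
    stop.value = 1  # signal the child through the shared Value
    p.join()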
Well, here is a piece of example code, based on the code that you have provided, that is working just fine:
import csv
import multiprocessing
import time
from datetime import datetime, timedelta
from random import randint

class IMU(object):
    @staticmethod
    def read_accelerometer_values():
        return dict(x=randint(0, 100), y=randint(0, 100), z=randint(0, 10))

class Foo(object):
    def __init__(self, output_filename):
        self._output_filename = output_filename
        self._csv_headers = ['xxxx', 'y', 'z']
        self._log_accelerometer = True
        self.IMU = IMU()

    def _record_data(self, frequency, stop_value):
        #self._set_up()  # Run setup functions for the data collection device and store it in the self.IMU variable
        """Record data function, which takes a recording frequency, in hertz, as an input"""
        previous_read_time = datetime.now() - timedelta(1, 0, 0)
        self._run_underway = True  # Note that a run is now going
        Period = 1 / frequency  # Period, in seconds, of a recording based on the input frequency
        print("Writing output data to", self._output_filename)
        with open(self._output_filename, 'w', newline='') as outcsv:
            self._writer = csv.writer(outcsv)
            self._writer.writerow(self._csv_headers)  # Write headers to file
            while stop_value.value == 0:  # While a run continues
                if datetime.now() - previous_read_time >= timedelta(0, 1, 0):  # If we've waited a period, collect the data; otherwise keep looping
                    print("run underway value", self._run_underway)
                if datetime.now() - previous_read_time >= timedelta(0, Period, 0):  # If we've waited a period, collect the data; otherwise keep looping
                    next_row = []
                    if self._log_accelerometer:
                        # Get values in m/s^2
                        axes = self.IMU.read_accelerometer_values()
                        next_row += [axes['x'], axes['y'], axes['z']]
                    previous_read_time = datetime.now()
                    self._writer.writerow(next_row)
            # Close the csv when done
            outcsv.close()

    def start_recording(self, frequency_in_hz):
        # Create recording process
        self._stop_value = multiprocessing.Value('i', 0)
        self._process = multiprocessing.Process(target=self._record_data, args=(frequency_in_hz, self._stop_value))
        # Start recording process
        self._process.start()
        print(datetime.now().strftime("%H:%M:%S.%f"), "Data logging process spawned")
        print("ID of data logging process: {}".format(self._process.pid))

    def end_recording(self, terminate_wait=2):
        """Function to end the recording multithread that's been spawned.
        Args: terminate_wait: This is the time, in seconds, to wait after attempting to shut down the process before terminating it."""
        # Get process id
        id = self._process.pid
        # Set stop event for process
        self._stop_value.value = 1
        self._process.join(terminate_wait)  # Wait two seconds for the process to terminate
        if self._process.is_alive():  # If it's still alive after waiting
            self._process.terminate()
            print(datetime.now().strftime("%H:%M:%S.%f"), "Process", id, "needed to be terminated.")
        else:
            print(datetime.now().strftime("%H:%M:%S.%f"), "Process", id, "successfully ended itself.")

if __name__ == '__main__':
    foo = Foo('/tmp/foometer.csv')
    foo.start_recording(20)
    time.sleep(5)
    print('Ending recording')
    foo.end_recording()

Python Multiple requests

I have a situation where I need to make multiple requests in a scheduler job to check live user status for 1000 users at a time. But the server limits each API request to a maximum of 50 users. So, using the following approach with a for loop, it takes around 66 seconds for 1000 users (i.e. for 20 API calls).
import requests  # needed for requests.get below
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

def shcdulerjob():
    """
    """
    uidlist = todays_userslist()  # Get around 1000 users from table
    #-- DIVIDE LIST BY GIVEN SIZE (here 50)
    split_list = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
    idlists = split_list(uidlist, 50)  # SERVER MAX LIMIT - 50 ids/request
    for idlist in idlists:
        apiurl = some_server_url + "&ids=" + str(idlist)
        resp = requests.get(apiurl)
        save_status(resp.json())  #-- Save status to db

if __name__ == "__main__":
    sched.add_job(shcdulerjob, 'interval', minutes=10)
    sched.start()
So,
Is there any workaround to optimize the time required to fetch from the API?
Does Python's APScheduler provide any multiprocessing option to process such API requests in a single job?
You could try to apply Python's thread pool from the concurrent.futures module, if the server allows concurrent requests. That way you would parallelise the processing, instead of the scheduling itself.
There are some good examples provided in the documentation here (if you're using Python 2, there is a sort of an equivalent module).
e.g.
import concurrent.futures
import multiprocessing
import requests
import time
import json

cpu_start_time = time.process_time()
clock_start_time = time.time()
queue = multiprocessing.Queue()
uri = "http://localhost:5000/data.json"
users = [str(user) for user in range(1, 50)]

with concurrent.futures.ThreadPoolExecutor(multiprocessing.cpu_count()) as executor:
    for user_id, result in zip(
            [str(user) for user in range(1, 50)],
            executor.map(lambda x: requests.get(uri, params={id: x}).content, users)
    ):
        queue.put((user_id, result))

while not queue.empty():
    user_id, rs = queue.get()
    print("User ", user_id, json.loads(rs.decode()))

cpu_end_time = time.process_time()
clock_end_time = time.time()
print("Took {0:.03}s [{1:.03}s]".format(cpu_end_time - cpu_start_time, clock_end_time - clock_start_time))
If you want to use a process pool instead, just make sure you don't use shared resources, e.g. the queue, and write your data out independently.
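Applied to the job from the question, a hedged sketch might look like the following (todays_userslist, some_server_url and save_status are assumed to be the same names as in your code; max_workers=10 is an arbitrary choice so you don't hammer the server):

from concurrent.futures import ThreadPoolExecutor
import requests

def shcdulerjob():
    uidlist = todays_userslist()  # ~1000 user ids from the table
    split_list = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
    idlists = split_list(uidlist, 50)  # server max limit: 50 ids/request

    def fetch(idlist):
        return requests.get(some_server_url + "&ids=" + str(idlist)).json()

    # Issue the ~20 batched requests concurrently instead of one after another;
    # the results are still saved sequentially in the scheduler thread.
    with ThreadPoolExecutor(max_workers=10) as executor:
        for payload in executor.map(fetch, idlists):
            save_status(payload)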
