My friend and I have been trying to debug this code for about a week now with no success, and we would appreciate some feedback from experienced programmers.
We wrote the code below to connect to a web socket. The script runs smoothly for about 7 hours, but then it crashes; we have received a "Too many open files" error a couple of times. I searched Stack Overflow for a similar mistake for a while, but nothing I found seemed to match our actual problem.
We also closely watched /proc/<pid of our python script>/fd for open pipes. Whenever the count reached 1024 the websocket connection died. We raised the limit with ulimit -n, but the script still dies.
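For reference, this is roughly how we check the descriptor count from inside the script as well (a minimal sketch, assuming Linux, where /proc/<pid>/fd is available):

import os

def open_fd_count():
    # Count the entries in /proc/<pid>/fd for the current process (Linux only).
    return len(os.listdir('/proc/%d/fd' % os.getpid()))

print('Open file descriptors: %d' % open_fd_count())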
I am sharing the code below; I would really appreciate any feedback that helps us solve this long-lasting headache.
import time
import datetime
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
from bt import Bt
from app import db
from app.models import (LOG_HISTORY, BOOKING_ORDERS, BOOKING_CANCELLING,
                        add_row, delete_rows)
import config
logger = config.Logger('bt log websocket.log')
def get_authenticateWss():
    authenticated_wss = Bt(key=config.socket_api_key,
                           secret=config.socket_api_secret)
    authenticated_wss.start()
    while not authenticated_wss.conn.connected.is_set():
        time.sleep(1)
    authenticated_wss.authenticate()
    time.sleep(5)
    return authenticated_wss
def main(authenticated_wss):
    while authenticated_wss.conn.connected.is_set():
        booking_orders = BOOKING_ORDERS.query.all()
        for booking_order in booking_orders:
            payload = {
                'cid': booking_order.cid,
                'symbol': 't%s' % booking_order.symbol.split(":")[-1].strip(),
                'type': "EXCHANGE LIMIT",
                'amount': str(booking_order.amount),
                'price': str(booking_order.price),
                'hidden': 1
            }
            authenticated_wss.new_order(**payload)
            logger.info("Creating the Order: %s" % str(payload))
            db.session.delete(booking_order)
            if float(booking_order.amount) >= 0:
                add_row(LOG_HISTORY, [datetime.datetime.now(),
                                      booking_order.symbol, "Buy Order", str(payload)])
            else:
                add_row(LOG_HISTORY, [datetime.datetime.now(),
                                      booking_order.symbol, "Selling Order", str(payload)])
        time.sleep(5)

        booking_cancels = BOOKING_CANCELLING.query.all()
        for booking_cancel in booking_cancels:
            payload = {
                'id': booking_cancel.order_id,
                'cid': booking_cancel.order_cid,
                'cid_date': booking_cancel.create_mts
            }
            authenticated_wss.cancel_order(**payload)
            logger.info("Cancelling the Order: %s" % str(payload))
            db.session.delete(booking_cancel)
            add_row(LOG_HISTORY, [datetime.datetime.now(),
                                  booking_cancel.symbol, "Cancelling Order", str(payload)])
        time.sleep(5)
        # time.sleep(10)
if __name__ == "__main__":
    delete_rows(BOOKING_ORDERS)
    delete_rows(BOOKING_CANCELLING)
    while True:
        logger.info("-------------- START ------------------")
        authenticated_wss = get_authenticateWss()
        try:
            main(authenticated_wss)
        except Exception as e:
            logger.error(e)
        finally:
            logger.info("---------- STOP -----------------")
            authenticated_wss.stop()
We have solved the issue: it turned out to be a web-socket compatibility problem, and updating the module to a newer version fixed it.
Issue in Brief
I have recently started using an Azure server running Ubuntu 20.04. My workflow involves running around 50 Python scripts 24/7, and they are operationally very important to my team. When I first start those scripts, RAM usage is nominal: about 12 of the 16 GB remain free after all of them are running.
But the RAM usage of those scripts slowly increases, to the point where the system starts killing them to free up main memory.
I have no idea what the issue is here. My scripts are pretty simple, and I really don't know where or how to resolve this. Can anyone show me some guidelines on how to approach this issue?
Comments
I am using Python 3.10. The script's job is to download data from a server and upload it to my MySQL database. I can provide the code if anyone asks for it.
Let me know if I can provide anything else to make this easier for you.
Code files
I am uploading the script that uses the most memory according to htop.
dcx_trades.py
import json
import time
import datetime
from mysql_connector import SQLConnector
import pandas as pd
import sys
import os
import signal
from contextlib import contextmanager
def raise_timeout(signum, frame):
    print("timeout")
    raise Exception("timeout")

@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        # exit()
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)
from_db = {'user': 'db_user_name', 'password': 'password', 'host': 'host_url', 'database': 'crypto'}
s = SQLConnector('crypto', from_db)
dict_ = {'timestamp': '', "exchange": "coindcx", "symbol":"", 'error_msg':''}
df = pd.DataFrame(columns = ["exchange_id","timestamp","symbol","price","quantity","exchange","turnover"])
df.set_index('symbol')
while True:
    try:
        data = pd.read_csv('dcx_trades.csv')
        trades = data.to_dict(orient='records')
        data = data.iloc[0:0]
        if len(trades):
            for trade in trades:
                utc_time = datetime.datetime.fromtimestamp(trade['T']/1000, datetime.timezone.utc)
                local_time = utc_time.astimezone()
                datetime_formatted = local_time.strftime("%Y-%m-%d %H:%M:%S")
                dict_['timestamp'] = datetime_formatted
                dict_["exchange_id"] = 12345
                dict_["symbol"] = trade['s']
                dict_['price'] = trade['p']
                dict_['quantity'] = trade['q']
                dict_['turnover'] = float(trade['p'])*float(trade['q'])
                dict_['error'] = '0'
                df = df.append(dict_, ignore_index=True)
                print(df)
            df_new = df
            df_new = df_new.to_dict(orient='records')
            df = df.iloc[0:0]
            data.to_csv('dcx_trades.csv', mode='w', index=False)
            if len(df_new):
                with timeout(60):
                    try:
                        print(datetime.datetime.now())
                        s.add_multipletrades(df_new)
                        print(datetime.datetime.now())
                    except Exception as e:
                        print(e)
                        os.execv(sys.executable, ['python'] + sys.argv)
                        print("error_time:", datetime.datetime.now())
    except Exception as e:
        data = pd.read_csv('dcx_trades.csv')
        data = data.loc[1:]
        data.to_csv('dcx_trades.csv', index=False)
        pass
Objective of the file:
First, s = SQLConnector('crypto', from_db) makes the connection to the DB. All of the database-related functions are defined in another file named mysql_connector.py, which is imported at the top.
The code then reads from the CSV file named dcx_trades.csv and preprocesses the data to match the database table. Before uploading the data to the DB, it clears the CSV file so as to avoid duplicates. The timeout(60) function is used because the script sometimes gets stuck while writing to the DB and then needs to be restarted, which is what timeout() handles.
All of those transforms can easily be done in SQL (a sketch follows these two steps):
LOAD DATA the file into a temp table, with columns that match the columns and datatypes in the file.
Run a single INSERT .. SELECT .. to copy the values over, applying whatever expressions are needed (such as p * q).
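A minimal sketch of that approach in Python, using MySQL Connector/Python directly; the staging and target table names, column layout, and CSV header are assumptions about your schema, and LOAD DATA LOCAL INFILE has to be enabled on both the client and the server:

import mysql.connector

# Connection details reuse the ones from the question; allow_local_infile is needed
# for LOAD DATA LOCAL INFILE.
conn = mysql.connector.connect(user='db_user_name', password='password',
                               host='host_url', database='crypto',
                               allow_local_infile=True)
cur = conn.cursor()

# Staging table matching the raw CSV columns (names and types are assumptions).
cur.execute("""
    CREATE TEMPORARY TABLE trades_staging (
        T BIGINT, s VARCHAR(32), p DOUBLE, q DOUBLE
    )
""")

# Bulk-load the CSV straight into the staging table.
cur.execute("""
    LOAD DATA LOCAL INFILE 'dcx_trades.csv'
    INTO TABLE trades_staging
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
""")

# Copy into the real table, computing the timestamp and turnover (p * q) in SQL.
# 'trades' and its column names are assumptions about the target schema.
cur.execute("""
    INSERT INTO trades (exchange_id, timestamp, symbol, price, quantity, exchange, turnover)
    SELECT 12345, FROM_UNIXTIME(T / 1000), s, p, q, 'coindcx', p * q
    FROM trades_staging
""")

conn.commit()
cur.close()
conn.close()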
I'm having a problem related to proxies in my Python script. When I run the script below, which accesses NCBI BLAST through Biopython, the network at the company where I work blocks the access for security reasons. Talking to the IT guys, they gave me a proxy for this kind of situation that has to be incorporated into the script. I've tried a lot of potential solutions, but nothing seems to work. Am I missing something here?
def main(seq):
    import os
    from Bio.Blast import NCBIWWW
    import time

    start_time = time.time()
    try:
        print('Connecting to NCBI...')
        blast_handle = NCBIWWW.qblast('blastn', 'nt', sequence=seq, format_type='Text', megablast=True)
        text = blast_handle.read()
        print(text)
        print("--- %s seconds ---" % (time.time() - start_time))
    except Exception as e:
        print(e)

if __name__ == '__main__':
    import os
    os.environ['http_proxy'] = 'http://123.456.78.90:80'  # The proxy the IT guys gave me
    seq = 'CAACTTTTTTTTTTATTACAGACAATCAAGAAATTTTCTATTGAAATAAAATATTTTAAA\
ACTTTCAACAACGGATCTCTTGGTTCTCGCATCGATGAAGAACGTAGCGAATTGCGATAA\
GTAATGTGAATTGCAGATTCTCGTGAATCATTGAATTTTTGAACGCACATTGCGCCCTCT\
GGTATTCCAGAGGGCATGCCTGTTTGAGCGTCATTTCCTTCTCAAAAACCCAGTTTTTGG\
TTGTGAGTGATACTCTGCTTCAGGGTTAACTTGAAAATGCTATGCCCCTTTGGCTGCCCT\
TCTTTGAGGGGACTGCGCGTCTGTGCAGGATGTAACCAATGTATTTAGGTATTCATACCA\
ACTTTCATTGTGCGCGTCTTATGCAGTTGTAGTCCACCCAACCTCAGACACACAGGCTGG\
CTGGGCCAACAGTATTCATAAAGTTTGACCTCA'
    main(seq)
Thank you very much.
It seems that NCBIWWW.qblast doesn't have support for proxies, so you will need to adapt the code yourself.
In your local BioPython installation, go find the biopython/Bio/Blast/NCBIWWW.py file and add your proxy settings at line 203:
request = Request(url_base, message, {"User-Agent": "BiopythonClient"})
request.set_proxy('http://123.456.78.90:80', 'http') # <-- Add this line
handle = urlopen(request)
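Alternatively, if you would rather not edit the installed library, you could try installing a global urllib opener before calling qblast. This is only a sketch and assumes NCBIWWW ultimately goes through urllib.request.urlopen; replace the address with the proxy your IT team gave you:

import urllib.request
from Bio.Blast import NCBIWWW

# Route both HTTP and HTTPS through the proxy (the address here is a placeholder).
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://123.456.78.90:80',
    'https': 'http://123.456.78.90:80',
})
# install_opener() makes every later urlopen() call in this process use the proxy.
urllib.request.install_opener(urllib.request.build_opener(proxy_handler))

seq = 'CAACTTTTTTTTTTATTACAGACAATCAAGAAATTTTCTATTGAAATAAAATATTTTAAA'  # first line of the query, for illustration
blast_handle = NCBIWWW.qblast('blastn', 'nt', sequence=seq, format_type='Text', megablast=True)
print(blast_handle.read())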
I am writing this post because I have not found a solution for my specific case. I am referring to this article, which, however, did not work for me on Windows 10 version 1909.
I wrote a "python_code_a.py" script whose task is to upload, one at a time, all the images contained in a local folder to a converter server, and to download them, again one at a time, from the server to another folder on my PC. How the script behaves depends on the server, which is public and not owned by me, so roughly every two and a half hours the script crashes due to an unexpected connection error. Obviously it is not practical for someone to sit all day watching the Python shell and intervene whenever the script stops.
As described in the article above, I wrote a second file named "python_code_b.py", whose task is to restart "python_code_a.py" whenever the latter has stopped. However, when I try to run it from the "python.exe" CMD prompt, it only responds to the input with "...", nothing else.
I attach a general example of "python_code_a.py":
processnumber = 0
photosindex = 100000
photo = 0
path = 0

while photosindex < "number of photos in folder":
    photo = str('your_path' + str(photosindex) + '.png')
    path = str('your_path' + str(photosindex) + '.jpg')
    print('It\'s converting: ' + photo)

    import requests
    r = requests.post(
        "converter_site",
        files={
            'image': open(photo, 'rb'),
        },
        headers={'api-key': 'your_api_key'}
    )
    file = r.json()
    json_output = file['output_url']

    import urllib.request
    while photosindex < 'number of photos in folder':
        urllib.request.urlretrieve(json_output, path)
        print('Finished process number: ' + str(processnumber))
        break

    photosindex = photosindex + 1
    processnumber = processnumber + 1
    print()

print('---------------------------------------------------')
print('Every pending job has been completed.')
print()
How can I solve it?
You can use error handling:
while photosindex < "number of photos in folder":
    try:
        pass  # your code goes here
    except:
        print("Something else went wrong")
https://www.w3schools.com/python/python_try_except.asp
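If you also want to keep python_code_b.py as a separate watchdog that restarts python_code_a.py whenever it stops, a minimal sketch could look like this (it assumes both scripts live in the same folder; the 10-second pause is arbitrary):

# python_code_b.py -- restart python_code_a.py whenever it exits
import subprocess
import sys
import time

while True:
    print('Starting python_code_a.py ...')
    # Run the worker with the same Python interpreter; this call blocks until it exits.
    result = subprocess.run([sys.executable, 'python_code_a.py'])
    print('python_code_a.py exited with code %s, restarting shortly' % result.returncode)
    time.sleep(10)  # short pause so a crash loop does not spin at full speed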
I have a Python script that I'd like to run every day, and I'd prefer that it only take 1-2 hours to run. It's currently set up to hit 4 different APIs for a given URL, capture the results, and then save the data into a PostgreSQL database. The problem is I have over 160,000 URLs to go through, and the script ends up taking a really long time: some preliminary tests showed it would take over 36 hours to go through every URL in its current form. So, my question boils down to: should I optimize my script to run multiple threads at the same time, or should I scale out the number of servers I'm using? Obviously the second approach will be more costly, so I'd prefer to have multiple threads running on the same instance.
I'm using a library I created (SocialAnalytics) which provides methods to hit the different API endpoints and parse the results. Here's how I have my script configured:
import psycopg2
from socialanalytics import pinterest
from socialanalytics import facebook
from socialanalytics import twitter
from socialanalytics import google_plus
from time import strftime, sleep
conn = psycopg2.connect("dbname='***' user='***' host='***' password='***'")
cur = conn.cursor()
# Select all URLs
cur.execute("SELECT * FROM urls;")
urls = cur.fetchall()
for url in urls:

    # Pinterest
    try:
        p = pinterest.getPins(url[2])
    except:
        p = { 'pin_count': 0 }

    # Facebook
    try:
        f = facebook.getObject(url[2])
    except:
        f = { 'comment_count': 0, 'like_count': 0, 'share_count': 0 }

    # Twitter
    try:
        t = twitter.getShares(url[2])
    except:
        t = { 'share_count': 0 }

    # Google
    try:
        g = google_plus.getPlusOnes(url[2])
    except:
        g = { 'plus_count': 0 }

    # Save results
    try:
        now = strftime("%Y-%m-%d %H:%M:%S")
        cur.execute("INSERT INTO social_stats (fetched_at, pinterest_pins, facebook_likes, facebook_shares, facebook_comments, twitter_shares, google_plus_ones) VALUES(%s, %s, %s, %s, %s, %s, %s);", (now, p['pin_count'], f['like_count'], f['share_count'], f['comment_count'], t['share_count'], g['plus_count']))
        conn.commit()
    except:
        conn.rollback()
You can see that each call to the API is using the Requests library, which is a synchronous, blocking affair. After some preliminary research I discovered Treq, which is an API on top of Twisted. The asynchronous, non-blocking nature of Twisted seems like a good candidate for improving my approach, but I've never worked with it and I'm not sure how exactly (and if) it'll help me achieve my goal.
Any guidance is much appreciated!
First, you should measure how much time your script spends on each step; you may discover something interesting :)
Second, you can split your URLs into chunks (see the sketch below):
chunk_size = len(urls) // cpu_core_count  # don't forget about the remainder of the division
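A minimal sketch of the chunking itself (urls here is the list fetched from the database in the question; the remainder simply ends up in the last chunk):

import multiprocessing as mp

cpu_core_count = mp.cpu_count()
chunk_size = max(1, len(urls) // cpu_core_count)
# Slicing past the end of the list is safe, so the remainder falls into the last chunk.
urls_chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]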
After these steps you can use multiprocessing to process every chunk in parallel. Here is an example for you:
import multiprocessing as mp

p = mp.Pool(5)

# first solution
for urls_chunk in urls_chunks:  # urls_chunks = [(url1...url6), (url7...url12), ...]
    res = p.map(get_social_stat, urls_chunk)
    for record in res:
        save_to_db(record)

# or, more simply
res = p.map(get_social_stat, urls)
for record in res:
    save_to_db(record)
Also, gevent can help you, because it can reduce the time spent processing a sequence of synchronous blocking requests.
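For completeness, here is a rough sketch of what get_social_stat from the example above could look like, reusing the socialanalytics calls from the question; the zero fallbacks mirror the original loop, and save_to_db would wrap the INSERT:

from socialanalytics import pinterest, facebook, twitter, google_plus

def get_social_stat(url_row):
    # url_row is one row from the urls table; column 2 holds the URL, as in the question.
    url = url_row[2]
    result = {'url': url}
    try:
        result['pin_count'] = pinterest.getPins(url)['pin_count']
    except Exception:
        result['pin_count'] = 0
    try:
        f = facebook.getObject(url)
        result.update({'like_count': f['like_count'],
                       'share_count': f['share_count'],
                       'comment_count': f['comment_count']})
    except Exception:
        result.update({'like_count': 0, 'share_count': 0, 'comment_count': 0})
    try:
        result['tweet_count'] = twitter.getShares(url)['share_count']
    except Exception:
        result['tweet_count'] = 0
    try:
        result['plus_count'] = google_plus.getPlusOnes(url)['plus_count']
    except Exception:
        result['plus_count'] = 0
    return result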
I have a pretty annoying issue at the moment. When I make an httplib2 request for a page that is far too large, I would like to be able to stop it cleanly.
For example:
from httplib2 import Http
url = 'http://media.blubrry.com/podacademy/p/content.blubrry.com/podacademy/Neuroscience_and_Society_1.mp3'
h = Http(timeout=5)
h.request(url, 'GET')
In this example, the URL is a podcast, and it will keep downloading forever, so my main process hangs indefinitely in this situation.
I have tried running the request in a separate thread using the code below and then deleting the thread object directly.
def http_worker(url, q):
    h = Http()
    print 'Http worker getting %s' % url
    q.put(h.request(url, 'GET'))

def process(url):
    q = Queue.Queue()
    t = Thread(target=http_worker, args=(url, q))
    t.start()
    tid = t.ident
    t.join(3)
    if t.isAlive():
        try:
            del t
            print 'deleting t'
        except:
            print 'error deleting t'
    else:
        print q.get()
    check_thread(tid)

process(url)
Unfortunately, the thread is still active and will continue to consume cpu / memory.
def check_thread(tid):
    import sys
    print 'Thread id %s is still active ? %s' % (tid, tid in sys._current_frames().keys())
Thank you.
OK, I found a hack to deal with this issue.
The best solution so far is to set a maximum amount of data to read and then stop reading from the socket. The data is read by the _safe_read method of the httplib module. To override this method, I used this lib: http://blog.rabidgeek.com/?tag=wraptools
And voilà:
from httplib import HTTPResponse, IncompleteRead, MAXAMOUNT
from wraptools import wraps

@wraps(HTTPResponse._safe_read)
def _safe_read(original_method, self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    # NOTE(gps): As of svn r74426 socket._fileobject.read(x) will never
    # return less than x bytes unless EOF is encountered. It now handles
    # signal interruptions (socket.error EINTR) internally. This code
    # never caught that exception anyways. It seems largely pointless.
    # self.fp.read(amt) will work fine.
    s = []
    total = 0
    MAX_FILE_SIZE = 3*10**6
    while amt > 0 and total < MAX_FILE_SIZE:
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        if not chunk:
            raise IncompleteRead(''.join(s), amt)
        total = total + len(chunk)
        s.append(chunk)
        amt -= len(chunk)
    return ''.join(s)
In this case, MAX_FILE_SIZE is set to 3 MB.
Hopefully, this will help others.
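A quick usage sketch, under the assumption (based on the wraptools blog linked above) that decorating with @wraps patches HTTPResponse._safe_read as a side effect of importing the module; save the override above as, say, safe_read_patch.py and import it before making the request:

import safe_read_patch   # hypothetical module name; importing it applies the override
from httplib2 import Http

url = 'http://media.blubrry.com/podacademy/p/content.blubrry.com/podacademy/Neuroscience_and_Society_1.mp3'
h = Http(timeout=5)
response, content = h.request(url, 'GET')   # body is now capped at roughly MAX_FILE_SIZE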