Issue in Brief
I have recently started using an Azure server running Ubuntu 20.04. My workflow involves running around 50 Python scripts 24/7, and they are operationally very important to my team. The issue is that right after I start those scripts, RAM usage is nominal: about 12 of 16 GB remains free after all of them are running.
But the RAM used by those scripts slowly increases to the point where the system starts killing them to free up main memory.
I have no idea what the issue is here. My scripts are pretty simple, and I don't know where to start resolving this. Can anyone give me some guidelines on how to approach this problem?
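As a hedged starting point for narrowing this down, the standard-library tracemalloc module can be dropped into one of the scripts to show which lines keep accumulating allocations between snapshots; the 300-second interval and the top-10 cutoff below are arbitrary choices, not part of the original scripts:

# Hedged sketch: periodic tracemalloc snapshots to see which lines keep growing.
import time
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

while True:
    time.sleep(300)
    snapshot = tracemalloc.take_snapshot()
    # Lines whose allocation size grew the most since the baseline.
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)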
Comments
I am using Python 3.10. Each script downloads data from some server and uploads it to my MySQL database. I can provide the code if anyone asks for it.
Let me know if I can provide anything else to make this easier for you.
Code files
I am uploading the script that uses the most memory according to htop.
dcx_trades.py
import json
import time
import datetime
from mysql_connector import SQLConnector
import pandas as pd
import sys
import os
import signal
from contextlib import contextmanager
def raise_timeout(signum, frame):
    print("timeout")
    raise Exception("timouttt")

@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        # exit()
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)
from_db = {'user': 'db_user_name', 'password': 'password', 'host': 'host_url', 'database': 'crypto'}
s = SQLConnector('crypto', from_db)
dict_ = {'timestamp': '', "exchange": "coindcx", "symbol":"", 'error_msg':''}
df = pd.DataFrame(columns = ["exchange_id","timestamp","symbol","price","quantity","exchange","turnover"])
df.set_index('symbol')
while True:
    try:
        data = pd.read_csv('dcx_trades.csv')
        trades = data.to_dict(orient='records')
        data = data.iloc[0:0]
        if len(trades):
            for trade in trades:
                utc_time = datetime.datetime.fromtimestamp(trade['T']/1000, datetime.timezone.utc)
                local_time = utc_time.astimezone()
                datetime_formatted = local_time.strftime("%Y-%m-%d %H:%M:%S")
                dict_['timestamp'] = datetime_formatted
                dict_["exchange_id"] = 12345
                dict_["symbol"] = trade['s']
                dict_['price'] = trade['p']
                dict_['quantity'] = trade['q']
                dict_['turnover'] = float(trade['p'])*float(trade['q'])
                dict_['error'] = '0'
                df = df.append(dict_, ignore_index=True)
            print(df)
            df_new = df
            df_new = df_new.to_dict(orient='records')
            df = df.iloc[0:0]
            data.to_csv('dcx_trades.csv', mode='w', index=False)
            if len(df_new):
                with timeout(60):
                    try:
                        print(datetime.datetime.now())
                        s.add_multipletrades(df_new)
                        print(datetime.datetime.now())
                    except Exception as e:
                        print(e)
                        os.execv(sys.executable, ['python'] + sys.argv)
                        print("error_time:", datetime.datetime.now())
    except Exception as e:
        data = pd.read_csv('dcx_trades.csv')
        data = data.loc[1:]
        data.to_csv('dcx_trades.csv', index=False)
        pass
Objective of the file:
First, s = SQLConnector('crypto', from_db) makes the connection to the DB. All the database-related functions are defined in another file named mysql_connector.py, which I import at the beginning.
Then the code reads the CSV file dcx_trades.csv and preprocesses the data to match the database table. Before uploading the data to the DB it clears the CSV file to avoid duplicates. timeout(60) is used because the script sometimes gets stuck while writing to the DB and then needs to be restarted, which is what the timeout() block triggers.
All of those transforms can easily be done in SQL (a sketch follows below):
LOAD DATA into a temp table whose columns match the columns and datatypes in the file.
Run a single INSERT .. SELECT .. to copy the values over, applying whatever expressions are needed (such as p * q).
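A rough sketch of that approach from the Python side, assuming mysql-connector-python is installed, local_infile is enabled on the server, and a target table named trades exists; the staging DDL and table names are assumptions, while the column names (T, s, p, q) mirror the CSV fields used in dcx_trades.py:

# Hedged sketch: bulk-load the CSV into a temp table, then one set-based INSERT .. SELECT.
# Table/column names other than the CSV fields (T, s, p, q) are assumptions.
import mysql.connector

conn = mysql.connector.connect(user='db_user_name', password='password',
                               host='host_url', database='crypto',
                               allow_local_infile=True)
cur = conn.cursor()

cur.execute("""
    CREATE TEMPORARY TABLE trades_stage (
        T BIGINT, s VARCHAR(32), p DECIMAL(18, 8), q DECIMAL(18, 8)
    )
""")

# Bulk-load the raw file; no row-by-row Python loop and no growing DataFrame.
cur.execute("""
    LOAD DATA LOCAL INFILE 'dcx_trades.csv'
    INTO TABLE trades_stage
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
    (T, s, p, q)
""")

# One insert does the timestamp conversion and the p * q expression in the database.
cur.execute("""
    INSERT INTO trades (exchange_id, timestamp, symbol, price, quantity, exchange, turnover)
    SELECT 12345, FROM_UNIXTIME(T / 1000), s, p, q, 'coindcx', p * q
    FROM trades_stage
""")
conn.commit()
cur.close()
conn.close()

This keeps the per-row work inside MySQL and avoids holding a growing DataFrame in the long-running Python process.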
Related
I have a Python application that writes some data to an AWS ElastiCache cluster at a regular interval.
Here's an example script that simulates the functionality of my application. It also replicates the error that I have been facing.
from datetime import datetime
import redis
import time
import sys
import random
class Config:
    default_host = "localhost"
    master_host = "xxx.xx.0001.xxxx.cache.amazonaws.com"
    replica_host = "xxx.xx.0001.xxxx.cache.amazonaws.com"
    redis_db = 8
    socket_conn_timeout = 10
    request_delay_sec = 0.1

def get_redis_client():
    return redis.Redis(
        host=Config.master_host,
        db=Config.redis_db,
        socket_connect_timeout=Config.socket_conn_timeout,
    )

def get_random_key_value():
    val = time.time()
    key = "test_key_" + str(random.randint(0, 100))
    return key, val

r = get_redis_client()
r.flushdb()

flag = False
while True:
    try:
        if flag:
            print("beat:", time.time())
        r.set(*get_random_key_value())
        time.sleep(Config.request_delay_sec)
    except redis.RedisError as re:
        print(datetime.now(), "Error:", type(re), re)
        flag = True
        # sys.exit()
    except KeyboardInterrupt:
        print("Stopping loop execution")
        sys.exit()
Here are the environment details of my application:
Python (v3.7.0)
redis-py (v3.5.3)
AWS ElastiCache (cluster mode disabled, 1 master node, 1 read replica)
When I scale my AWS ElastiCache cluster vertically while the above script is running, I get the following error for a few seconds during the scale-up, and then it goes away.
<class 'redis.exceptions.ReadOnlyError'> You can't write against a read only replica.
The AWS docs also state that during the vertical scaling process some inconsistencies may occur because of data syncing (https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/redis-cluster-vertical-scaling.html).
Has anyone faced a similar issue, or can anyone explain why this error occurs during the scale-up process? How can it be fixed?
EDIT:
I tried the same thing with a Go script and it works perfectly fine.
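One hedged way to ride out the window where writes land on a replica is to treat ReadOnlyError as transient, rebuild the client (so the primary endpoint is re-resolved), and retry; write_with_retry and the retry parameters below are illustrative, not a vetted fix:

# Hedged sketch: retry writes around a transient ReadOnlyError during scale-up.
import time
import redis

def write_with_retry(client_factory, key, value, retries=5, delay_sec=1.0):
    client = client_factory()
    for attempt in range(retries):
        try:
            client.set(key, value)
            return client
        except redis.exceptions.ReadOnlyError:
            # The endpoint is temporarily pointing at a replica; back off,
            # rebuild the client so the address is re-resolved, and try again.
            time.sleep(delay_sec)
            client = client_factory()
    raise redis.exceptions.ReadOnlyError("still read-only after retries")

# usage: write_with_retry(get_redis_client, *get_random_key_value())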
I am using the sample program from the Snowflake documentation on using Python to ingest data into the destination table.
So basically, I have to execute a PUT command to load data into the internal stage and then run the Python program to notify Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;
create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
select
t.*
from
@exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
put command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging
logging.basicConfig(
    filename='/tmp/ingest.log',
    level=logging.DEBUG)
logger = getLogger(__name__)

# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
    return '<private_key_passphrase>'

with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
    pemlines = pem_in.read()
    private_key_obj = load_pem_private_key(pemlines,
                                           get_private_key_passphrase().encode(),
                                           default_backend())

private_key_text = private_key_obj.private_bytes(
    Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')

# Assume the public key has been registered in Snowflake:
# private key in PEM format

# List of files in the stage specified in the pipe definition
file_list = ['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
                                     host='<account_identifier>.snowflakecomputing.com',
                                     user='<user_login_name>',
                                     pipe='exampledb.dbschema.example_pipe',
                                     private_key=private_key_text)

# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
    staged_file_list.append(StagedFile(file_name, None))

try:
    resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
    # HTTP error, may need to retry
    logger.error(e)
    exit(1)

# This means Snowflake has received the file and will start loading
assert(resp['responseCode'] == 'SUCCESS')

# Needs to wait for a while to get the result in the history
while True:
    history_resp = ingest_manager.get_history()
    if len(history_resp['files']) > 0:
        print('Ingest Report:\n')
        print(history_resp)
        break
    else:
        # wait for 20 seconds
        time.sleep(20)

hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')
print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate that table and run the code again, the table in Snowflake doesn't get any data.
Am I missing something here? How can I make this code work when run multiple times?
Update:
I found that if I use a file with a different name each time I run the program, the data does load into the Snowflake table.
So how can I run this code without changing the data filename?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information have a look here
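A hedged sketch of the recreate-the-pipe route described above, using snowflake-connector-python with placeholder credentials; after this, the same a.csv.gz can be PUT and ingested again with the existing SimpleIngestManager script:

# Hedged sketch: force a reload of an already-seen filename by recreating the pipe.
# Account/user/password are placeholders; assumes snowflake-connector-python is installed.
import snowflake.connector

conn = snowflake.connector.connect(account='<account_identifier>',
                                   user='<user_login_name>',
                                   password='<password>')
cur = conn.cursor()
# Recreating the pipe drops its load-history metadata, so the same filename can be ingested again.
cur.execute("""
    create or replace pipe exampledb.dbschema.example_pipe
    as copy into exampledb.dbschema.example_table
    from (select t.* from @exampledb.dbschema.example_stage t)
    file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE
""")
cur.close()
conn.close()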
I'm trying to create a CAN log converter from .asc files to .csv files (in human-readable form). I'm somewhat successful. My code works fine with almost any database except j1939.dbc.
The thing is that if I print out the messages read from the dbc file, I can see that the messages from j1939.dbc are read into the database. But the script fails to find any of those messages in the processed log file. At the same time I can read the same file using Vector CANalyzer with no issues.
I wonder why this happens and why it only affects j1939.dbc and not the others.
I suspect that maybe the way I convert those messages is wrong, because execution never gets past the if msg_id in database: line (and, as mentioned above, those messages are certainly there because Vector CANalyzer works fine with them).
EDIT: I realized that maybe the problem is not cantools but the python-can package; maybe can.ASCReader() doesn't handle J1939 frames well and omits them? I'm going to investigate myself, but I hope someone better at coding will help.
import pandas as pd
import can
import cantools
import time as t
from tqdm import tqdm
import re
import os
from binascii import unhexlify
dbcs = [filename.split('.')[0] for filename in os.listdir('./dbc/') if filename.endswith('.dbc')]
files = [filename.split('.')[0] for filename in os.listdir('./asc/') if filename.endswith('.asc')]
start = t.time()
db = cantools.database.Database()
for dbc in dbcs:
    with open(f'./dbc/{dbc}.dbc', 'r') as f:
        db.add_dbc(f)

f_num = 1
for fname in files:
    print(f'[{f_num}/{len(files)}] Parsing data from file: {fname}')
    log = can.ASCReader(f'./asc/{fname}.asc')
    entries = []
    all_msgs = []
    message = {'Time [s]': ''}
    database = list(db._frame_id_to_message.keys())
    print(database)
    lines = sum(1 for line in open(f'./asc/{fname}.asc'))
    msgs = iter(log)
    try:
        for msg, i in zip(msgs, tqdm(range(lines))):
            msg = re.split("\\s+", str(msg))
            timestamp = round(float(msg[1]), 0)
            msg_id = int(msg[3], 16)
            try:
                data = unhexlify(''.join(msg[7:15]))
            except:
                continue
            if msg_id in database:
                if timestamp != message['Time [s]']:
                    entries.append(message.copy())
                    message.update({'Time [s]': timestamp})
                message.update(db.decode_message(msg_id, data))
    except ValueError:
        print('ValueError')
    df = pd.DataFrame(entries[1:])
    duration = t.time() - start
    df.to_csv(f'./csv/{fname}.csv', index=False)
    print(f'DONE IN {int(round(duration, 2)//60)}min{round(duration % 60, 2)}s!\n{len(df.columns)} signals extracted!')
    f_num += 1
class can.ASCReader(file, base='hex')
Bases: can.io.generic.BaseIOHandler
Iterator of CAN messages from an ASC logging file. Meta data (comments, bus statistics, J1939 Transport Protocol messages) is ignored.
Might answer your question...
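As a diagnostic (my own sketch, not a confirmed fix), matching on can.Message.arbitration_id instead of re-parsing str(msg) removes the string-format guesswork from the ID lookup; the ./asc/example.asc path is made up, and the dbc path mirrors the question:

# Hedged diagnostic sketch: check ID matching via message attributes rather than str(msg).
import can
import cantools

db = cantools.database.Database()
with open('./dbc/j1939.dbc', 'r') as f:
    db.add_dbc(f)

known_ids = {m.frame_id for m in db.messages}

for msg in can.ASCReader('./asc/example.asc'):
    if msg.arbitration_id in known_ids:
        decoded = db.decode_message(msg.arbitration_id, bytes(msg.data))
        print(msg.timestamp, hex(msg.arbitration_id), decoded)
    else:
        # Extended IDs landing here would point at an ID-format mismatch
        # rather than at ASCReader dropping the frames entirely.
        print('unmatched id:', hex(msg.arbitration_id), 'extended:', msg.is_extended_id)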
I have a function view that creates a report using xlsxwriter; it is built on the fly using a StringIO buffer and finally sent through an HttpResponse. It works well on a local server.
The problem is that on Heroku, after some seconds (the documentation mentions a 30-second timeout that cannot be modified), the server hangs and reboots the web process, returning an error as the response.
What is the best way to...?:
create an xlsx file on the fly (dynamically) in memory
serve the entire file to the client.
prevent server to hang out because of the long process running
This is a piece of the code I am using:
def reporte_usuarios(request):
    from xlsxwriter.workbook import Workbook
    try:
        import cStringIO as StringIO
    except ImportError:
        import StringIO
    # create a workbook in memory
    output = StringIO.StringIO()
    workbook = Workbook(output)
    bold = workbook.add_format({'bold': True})
    # get the data
    from django.db.models import Count
    usuarios = User.objects.filter(....... # all filter stuff
    for usr in usuarios:
        if usr.activos > 0:
            # create a workbook sheet for every User registered
            ws = workbook.add_worksheet(u'%s' % usr.username)
            # some relevant user data
            ws.write(1, 1, u'USUARIO: %s' % usr.username)
            ...
            # get rows for user
            log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
            # write headers
            ws.write(3, 0, u'FECHA', bold)
            ...
            sig_fila = 4  # starting row for data (after headers)
            for l in log:
                # write all data
                ws.write(sig_fila, 0, u'%s' % l.fecha)
                ...
                sig_fila += 1
    # close the workbook
    workbook.close()
    # go to the beginning of the buffer
    output.seek(0)
    # response using the buffer
    response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
    return response
Notes: I am using Gunicorn on Heroku, Django 1.9.13 and Python 2.7.11.
IMHO you should follow a totally different approach in this case.
As you are generating a rather big file, it's normal for the request to hit the timeout.
What you could do instead is deploy a background task queue, like Celery or django-rq. With that, a background task creates the file using your user's data, and then you can let your user know that it's ready by any means, like a notification or an email.
If you need more details regarding how you can do something like this, let me know and I can help :)
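A minimal sketch of that idea, assuming a configured Celery app and Django's default storage; build_usuarios_report and the filename are made up for illustration:

# Hedged sketch: build the workbook in a Celery task and save it to storage,
# so the web request can return immediately. Task/file names are illustrative.
import io
from celery import shared_task
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage
from xlsxwriter.workbook import Workbook

@shared_task
def build_usuarios_report(filename):
    output = io.BytesIO()
    workbook = Workbook(output, {'in_memory': True})
    ws = workbook.add_worksheet('reporte')
    # ... same per-user writing logic as in reporte_usuarios ...
    workbook.close()
    output.seek(0)
    # Persist the finished file; the view (or an email) can link to it later.
    return default_storage.save(filename, ContentFile(output.read()))

# In the view: build_usuarios_report.delay('ACTIVOS_USUARIOS.xlsx') and respond right away.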
I have a function that runs over a number of tables in an SQLite database. It reads the data, does some stuff and then saves the result in a CSV file.
from __future__ import division
import sqlalchemy as sql
import pandas as pd
import os
import multiprocessing as mp
dst = r'H:\Results'
eng = sql.create_engine('sqlite:///Y:/Database/some.db') # database on external drive
con = eng.connect()
def get_res(tab_name, lock):
    query_tr = """SELECT t, p, size, event, direction \
                  FROM {tb} WHERE event IN (4, 5)""".format(tb=tab_name)
    df_tr = pd.read_sql_query(query_tr, con)
    # do some stuff with df_tr ...
    with lock:
        df_tr.to_csv(os.path.join(dst, 'my_res.csv'), mode='a')
    return 1
I do this in parallel like so
if __name__ == '__main__':
    workers = mp.cpu_count()
    tables = sql.inspect(eng).get_table_names()
    man = mp.Manager()
    pool = mp.Pool(workers)
    lock = man.Lock()
    res = {tab_name: pool.apply_async(get_res, (tab_name, lock)) for tab_name in tables}
    pool.close()
    pool.join()
    man.shutdown()
The strange thing is that the call to man.shutdown() fails with WindowsError 5: Access Denied when the function reads the data from a database on an external hard drive, but works absolutely fine when the database is on the computer's internal drive. The function get_res runs through correctly without any error and does what it should do.
I know that this is not much to go on, but are there any suggestions as to why that could be the case?
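One thing that may be worth ruling out (a hedged sketch, not a confirmed cause): the module-level eng/con is shared across the worker processes. A variant of get_res that opens its own engine and connection per call keeps every database handle process-local; worker_get_res is just an illustrative name:

# Hedged sketch: give each worker its own engine/connection instead of sharing
# the module-level `con` across processes.
import os
import pandas as pd
import sqlalchemy as sql

def worker_get_res(tab_name, lock, dst=r'H:\Results'):
    # Engine and connection created inside the worker process.
    eng = sql.create_engine('sqlite:///Y:/Database/some.db')
    with eng.connect() as con:
        query_tr = """SELECT t, p, size, event, direction
                      FROM {tb} WHERE event IN (4, 5)""".format(tb=tab_name)
        df_tr = pd.read_sql_query(query_tr, con)
    # do some stuff with df_tr ...
    with lock:
        df_tr.to_csv(os.path.join(dst, 'my_res.csv'), mode='a')
    return 1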