I have a function view that creates a report with xlsxwriter. The file is built on the fly in memory using a StringIO buffer and finally sent back through HttpResponse. It works well on the local development server.
The problem is that on Heroku, after some seconds (the documentation mentions a 30-second timeout that cannot be modified), the request times out and the web process is restarted, giving an error as the response.
What is the best way to:
create an xlsx file on the fly (dynamically) in memory,
serve the entire file to the client,
and prevent the server from timing out because of the long-running process?
This is a piece of the code I am using:
from datetime import datetime

from django.http import HttpResponse
# User and LogActivos are the project's own models, imported elsewhere in the module


def reporte_usuarios(request):
    from xlsxwriter.workbook import Workbook
    try:
        import cStringIO as StringIO
    except ImportError:
        import StringIO

    # create a workbook in memory
    output = StringIO.StringIO()
    workbook = Workbook(output)
    bold = workbook.add_format({'bold': True})

    # get the data
    from django.db.models import Count
    usuarios = User.objects.filter(....... # all filter stuff
    for usr in usuarios:
        if usr.activos > 0:
            # create a workbook sheet for every registered User
            ws = workbook.add_worksheet(u'%s' % usr.username)
            # some relevant user data
            ws.write(1, 1, u'USUARIO: %s' % usr.username)
            ...
            # get rows for user
            log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
            # write headers
            ws.write(3, 0, u'FECHA', bold)
            ...
            sig_fila = 4  # starting row for data (after headers)
            for l in log:
                # write all data
                ws.write(sig_fila, 0, u'%s' % l.fecha)
                ...
                sig_fila += 1

    # close the workbook
    workbook.close()
    # go to the beginning of the buffer
    output.seek(0)

    # response using the buffer
    response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
    return response
Notes: I am using Gunicorn on Heroku, Django 1.9.13 and Python 2.7.11.
IMHO you should follow a totally different approach in this case.
As you are generating a rather large file, it's normal for the request to run into that timeout.
What you could do instead is deploy a background task queue, like Celery or DjangoRQ. With that, a background task creates the file from your users' data, and you can then let the user know it's ready by any means, such as a notification or an email.
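For concreteness, here is a minimal sketch of that approach with Celery (not your code: a Celery app is assumed to be configured for the project, build_workbook() stands for the existing xlsxwriter logic factored out of the view so it returns the in-memory buffer, and Reporte is a hypothetical model with a FileField used to store the finished report):

from datetime import datetime

from celery import shared_task
from django.core.files.base import ContentFile
from django.core.mail import send_mail
from django.http import HttpResponse


@shared_task
def generar_reporte_usuarios(user_email):
    # the heavy work happens here, outside the request/response cycle
    output = build_workbook()  # same xlsxwriter code as in the view, returning the buffer
    filename = 'ACTIVOS_USUARIOS__%s.xlsx' % datetime.now().strftime('%Y%m%d_%H%M')
    reporte = Reporte()  # hypothetical model with a FileField named `archivo`
    reporte.archivo.save(filename, ContentFile(output.getvalue()))
    # notify the user by any means; an email is used here as an example
    send_mail('Reporte listo', 'Download: %s' % reporte.archivo.url,
              'noreply@example.com', [user_email])


def reporte_usuarios(request):
    # the view only enqueues the task and returns immediately,
    # well within Heroku's 30-second request window
    generar_reporte_usuarios.delay(request.user.email)
    return HttpResponse('Report generation started; you will be notified when it is ready.')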
If you need more details regarding how you can do something like this, let me know and I can help :)
Related
Issue in Brief
I have recently started using an Azure server running Ubuntu 20.04. My workflow involves running around 50 Python scripts 24/7, and they are operationally very important to my team. The issue: right after I start those scripts, RAM usage is nominal, with around 12 of 16 GB still free after all of them are running.
But slowly the RAM usage of those scripts keeps increasing, to the point where the system starts killing them to free up main memory.
I have no idea what the issue is here. My scripts are pretty simple, and I really don't know where or how to resolve this. Can anyone give me some guidelines on how to approach solving it?
Comments
I am using Python 3.10. The scripts' function is to download data from some server and upload it to my MySQL database. I can provide the code if anyone asks for it.
Let me know if I can provide anything else to make this easier for you.
Code files
I am posting the code that is taking up the most memory according to htop.
dcx_trades.py
import json
import time
import datetime
from mysql_connector import SQLConnector
import pandas as pd
import sys
import os
import signal
from contextlib import contextmanager


def raise_timeout(signum, frame):
    # note: this raises a plain Exception, so the `except TimeoutError` in
    # timeout() below will not catch it; it propagates to the outer try/except
    print("timeout")
    raise Exception("timouttt")


@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)
    try:
        yield
    except TimeoutError:
        # exit()
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


from_db = {'user': 'db_user_name', 'password': 'password', 'host': 'host_url', 'database': 'crypto'}
s = SQLConnector('crypto', from_db)
dict_ = {'timestamp': '', "exchange": "coindcx", "symbol": "", 'error_msg': ''}
df = pd.DataFrame(columns=["exchange_id", "timestamp", "symbol", "price", "quantity", "exchange", "turnover"])
df.set_index('symbol')

while True:
    try:
        data = pd.read_csv('dcx_trades.csv')
        trades = data.to_dict(orient='records')
        data = data.iloc[0:0]
        if len(trades):
            for trade in trades:
                utc_time = datetime.datetime.fromtimestamp(trade['T']/1000, datetime.timezone.utc)
                local_time = utc_time.astimezone()
                datetime_formatted = local_time.strftime("%Y-%m-%d %H:%M:%S")
                dict_['timestamp'] = datetime_formatted
                dict_["exchange_id"] = 12345
                dict_["symbol"] = trade['s']
                dict_['price'] = trade['p']
                dict_['quantity'] = trade['q']
                dict_['turnover'] = float(trade['p'])*float(trade['q'])
                dict_['error'] = '0'
                df = df.append(dict_, ignore_index=True)
                print(df)
            df_new = df
            df_new = df_new.to_dict(orient='records')
            df = df.iloc[0:0]
            data.to_csv('dcx_trades.csv', mode='w', index=False)
            if len(df_new):
                with timeout(60):
                    try:
                        print(datetime.datetime.now())
                        s.add_multipletrades(df_new)
                        print(datetime.datetime.now())
                    except Exception as e:
                        print(e)
                        os.execv(sys.executable, ['python'] + sys.argv)
                        print("error_time:", datetime.datetime.now())
    except Exception as e:
        data = pd.read_csv('dcx_trades.csv')
        data = data.loc[1:]
        data.to_csv('dcx_trades.csv', index=False)
        pass
Objective of the file:
Firstly, s = SQLConnector('crypto', from_db) makes the connection with the DB. All the database-related functions are defined in another file named mysql_connector.py, which I import at the beginning.
Then the code reads from the CSV file named dcx_trades.csv and preprocesses the data to match the database table. Before uploading the data into the DB it clears the CSV file so as to avoid duplicates. timeout(60) is used because sometimes the script gets stuck while writing into the DB and needs to be restarted, which is what the timeout() context manager triggers.
All of those transforms can easily be done in SQL (a sketch follows below this list):
LOAD DATA into a temp table with whatever columns match the columns and datatypes in the file.
Run a single INSERT .. SELECT .. to copy the values over, applying whatever expressions are needed (such as p * q).
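A hedged sketch of that idea in Python with mysql-connector-python (the staging table trades_staging, the target table trades and their column names are invented for illustration, and local_infile must be enabled on the server):

import mysql.connector

# connection details reused from the question's from_db dict
conn = mysql.connector.connect(user='db_user_name', password='password',
                               host='host_url', database='crypto',
                               allow_local_infile=True)
cur = conn.cursor()

# 1) bulk-load the raw CSV into a staging table that mirrors the file layout
cur.execute("""
    LOAD DATA LOCAL INFILE 'dcx_trades.csv'
    INTO TABLE trades_staging
    FIELDS TERMINATED BY ','
    IGNORE 1 LINES
    (T, s, p, q)
""")

# 2) copy the rows over in one statement, doing the transforms in SQL
cur.execute("""
    INSERT INTO trades (exchange_id, `timestamp`, symbol, price, quantity, exchange, turnover)
    SELECT 12345, FROM_UNIXTIME(T / 1000), s, p, q, 'coindcx', p * q
    FROM trades_staging
""")

conn.commit()
cur.close()
conn.close()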
I am using the sample program from the Snowflake documentation on using Python to ingest data into the destination table.
So basically, I have to execute the PUT command to load data into the internal stage and then run the Python program to notify Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;

create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
    select
        t.*
    from
        @exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
The PUT command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging

logging.basicConfig(
    filename='/tmp/ingest.log',
    level=logging.DEBUG)
logger = getLogger(__name__)


# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
    return '<private_key_passphrase>'


with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
    pemlines = pem_in.read()
    private_key_obj = load_pem_private_key(pemlines,
                                           get_private_key_passphrase().encode(),
                                           default_backend())

private_key_text = private_key_obj.private_bytes(
    Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')

# Assume the public key has been registered in Snowflake:
# private key in PEM format

# List of files in the stage specified in the pipe definition
file_list = ['a.csv.gz']

ingest_manager = SimpleIngestManager(account='<account_identifier>',
                                     host='<account_identifier>.snowflakecomputing.com',
                                     user='<user_login_name>',
                                     pipe='exampledb.dbschema.example_pipe',
                                     private_key=private_key_text)

# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
    staged_file_list.append(StagedFile(file_name, None))

try:
    resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
    # HTTP error, may need to retry
    logger.error(e)
    exit(1)

# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')

# Needs to wait for a while to get result in history
while True:
    history_resp = ingest_manager.get_history()

    if len(history_resp['files']) > 0:
        print('Ingest Report:\n')
        print(history_resp)
        break
    else:
        # wait for 20 seconds
        time.sleep(20)

hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')

print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate the table and run the code again, the table in Snowflake doesn't get any new data.
Am I missing something here? How can I make this code run multiple times?
Update:
I found that if I use a file with a different name each time I run, the data does load into the Snowflake table.
So how can I run this code without changing the data filename?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information, have a look here.
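If the same filename really has to be reused, a hedged sketch of the recreate-the-pipe option (using snowflake-connector-python, with placeholder credentials) would look roughly like this; run it before the PUT so the pipe's load history is discarded:

import snowflake.connector

# placeholders; any connection with privileges to recreate the pipe will do
conn = snowflake.connector.connect(account='<account_identifier>',
                                   user='<user_login_name>',
                                   password='<password>')
conn.cursor().execute("""
    create or replace pipe exampledb.dbschema.example_pipe
    as copy into exampledb.dbschema.example_table
    from (select t.* from @exampledb.dbschema.example_stage t)
    file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE
""")
conn.close()

Otherwise, the simpler route is the one your update already hints at: give the staged file a unique name (for example, append a timestamp before the PUT), so Snowpipe never sees a repeated filename.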
I'm using Django 1.8.1 with Python 3.4 and I'm trying to use requests to download a processed file. The following code works perfectly for a normal requests.get call that downloads the exact file at the server location, i.e. the unprocessed file.
The file needs to be processed based on the passed data (shown below as "data"). This data needs to reach the Django backend, which, based on that text, should pass variables to run an internal program on the server and output a .gcode file instead of the .stl filetype.
Python file:
import requests, os, json

SERVER = 'http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'

# data not implemented yet
##############################################
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
############################################

category = requests.get(SERVER + '/media/uploads/9128342/141303729.stl', auth=(authuser, authpass))

# download to path file
path = "/home/bradman/Downloads/requestdata/newfile.stl"
if category.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in category:
            f.write(chunk)
I'm very confused about this, but I think the best course of action is to pass the data along with requests.get and somehow write a function to grab it inside my views.py in Django. Does anyone have any ideas?
To send data with a request you can use
get(..., params=data)
(and the data goes as query parameters in the URL)
or
post(..., data=data)
(and the data is sent in the request body, like an HTML form).
BTW, some APIs need both params= and data= in a single GET or POST request to send all the needed information.
Read the requests documentation.
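A short hedged sketch of both sides (reusing SERVER, authuser, authpass and the data dict from the question; the Django view name download_processed is made up):

# client side: send the values as query parameters along with the GET
import requests

data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
category = requests.get(SERVER + '/media/uploads/9128342/141303729.stl',
                        params=data, auth=(authuser, authpass))

# server side: a Django view can read them back from request.GET
def download_processed(request):  # hypothetical view
    first = request.GET.get('FirstName')
    last = request.GET.get('Lastname')
    # ... run the internal program with these values and return the .gcode file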
I have built a Flask application hosted on Heroku, with Celery as the worker and Redis both as the broker and as the result backend. It has the following code:
# the Celery task decorator (e.g. @celery.task) is assumed to be present in the
# real code, since apply_async is called on this function below
def create_csv_group(orgi, mx):
    # Write a csv file with filename 'group'
    cols = []
    maxx = int(mx) + 1
    cols.append(['SID', 'First', 'Last', 'Email'])
    for i in range(0, int(mx)):
        cols.append(['SID' + str(i), 'First' + str(i), 'Last' + str(i), 'Email' + str(i)])
    with open(os.path.join('uploads/', 'groupfile_t.csv'), 'wb') as f:
        writer = csv.writer(f)
        for i in range(len(max(cols, key=len))):
            writer.writerow([(c[i] if i < len(c) else '') for c in cols])


@app.route('/mark', methods=['POST'])
def mark():
    task = create_csv_group.apply_async(args=[orig, mx])
    tsk_id = task.id
If I try to access the variable tsk_id, sometimes it gives the error:
variable used before being initialized.
I thought the reason was that the task was not being sent to the queue before I accessed tsk_id, so I moved the call to after two form-filling pages.
But now it is not updating/saving the file correctly; it shows weird output in the file (it seems to be the old file data, which should get updated when the new form is filled). When I run the same code locally, it runs perfectly fine. I logged the worker: it enters the task function and runs properly too.
Why is this weird output being displayed? How can I fix both issues, so that the task writes to the file properly and I can check on the task id?
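What I am aiming for, roughly, is the usual Celery pattern sketched below (sketch only; the /status route and the JSON responses are not part of my current code):

from flask import jsonify

@app.route('/mark', methods=['POST'])
def mark():
    task = create_csv_group.apply_async(args=[orig, mx])
    # return the id right away so the client can poll for the result
    return jsonify({'task_id': task.id}), 202

@app.route('/status/<task_id>')  # hypothetical route
def status(task_id):
    result = create_csv_group.AsyncResult(task_id)
    return jsonify({'state': result.state})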
I am retrieving data files from an FTP server in a loop with the following code:
response = urllib.request.urlopen(url)
data = response.read()
response.close()
compressed_file = io.BytesIO(data)
gin = gzip.GzipFile(fileobj=compressed_file)
Retrieving and processing the first few works fine, but after a few requests I am getting the following error:
530 Maximum number of connections exceeded.
I tried closing the connection (see code above) and using a sleep() timer, but neither worked. What am I doing wrong here?
Trying to make urllib do FTP properly makes my brain hurt. By default, it creates a new connection for each file, apparently without really ensuring the connections are closed.
ftplib is more appropriate I think.
Since I happen to be working on the same data you are (were)... here is a very specific answer that decompresses the .gz files and passes them into ish_parser (https://github.com/haydenth/ish_parser).
I think it is also clear enough to serve as a general answer.
import ftplib
import io
import gzip
import ish_parser  # from: https://github.com/haydenth/ish_parser

ftp_host = "ftp.ncdc.noaa.gov"
parser = ish_parser.ish_parser()

# identifies what data to get
USAF_ID = '722950'
WBAN_ID = '23174'
YEARS = range(1975, 1980)

with ftplib.FTP(host=ftp_host) as ftpconn:
    ftpconn.login()

    for year in YEARS:
        ftp_file = "pub/data/noaa/{YEAR}/{USAF}-{WBAN}-{YEAR}.gz".format(USAF=USAF_ID, WBAN=WBAN_ID, YEAR=year)
        print(ftp_file)

        # read the whole file and save it to a BytesIO (stream)
        response = io.BytesIO()
        try:
            ftpconn.retrbinary('RETR '+ftp_file, response.write)
        except ftplib.error_perm as err:
            if str(err).startswith('550 '):
                print('ERROR:', err)
            else:
                raise

        # decompress and parse each line
        response.seek(0)  # jump back to the beginning of the stream
        with gzip.open(response, mode='rb') as gzstream:
            for line in gzstream:
                parser.loads(line.decode('latin-1'))
This does read the whole file into memory, which could probably be avoided using some clever wrappers and/or yield or something... but it works fine for a year's worth of hourly weather observations.
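If the buffering ever becomes a problem, one possible variant (untested sketch; transfercmd and voidcmd are documented ftplib calls, while voidresp is the long-standing helper retrbinary itself uses to consume the final reply) is to stream the data connection straight into gzip instead of collecting it in a BytesIO first:

# inside the same `for year in YEARS:` loop
ftpconn.voidcmd('TYPE I')  # make sure the transfer is binary
conn = ftpconn.transfercmd('RETR ' + ftp_file)
with conn.makefile('rb') as raw, gzip.GzipFile(fileobj=raw) as gzstream:
    for line in gzstream:
        parser.loads(line.decode('latin-1'))
conn.close()
ftpconn.voidresp()  # consume the end-of-transfer reply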
Probably a pretty nasty workaround, but this worked for me. I made a script (here called test.py) which does the request (see the code above). The code below is used in the loop I mentioned and calls test.py:
from subprocess import call

with open('log.txt', 'a') as f:
    call(['python', 'test.py', args[0], args[1]], stdout=f)