I have built a Flask application hosted on Heroku, with Celery as the worker and Redis as both the broker and the result backend. It has the following code:
import csv
import os

@celery.task  # assuming the Celery app instance is named `celery`
def create_csv_group(orig, mx):
    # Write a csv file named 'groupfile_t.csv' under uploads/
    cols = [['SID', 'First', 'Last', 'Email']]
    for i in range(int(mx)):
        cols.append(['SID' + str(i), 'First' + str(i), 'Last' + str(i), 'Email' + str(i)])
    with open(os.path.join('uploads/', 'groupfile_t.csv'), 'wb') as f:
        writer = csv.writer(f)
        # write the columns out row by row, padding short columns with ''
        for i in range(len(max(cols, key=len))):
            writer.writerow([(c[i] if i < len(c) else '') for c in cols])

@app.route('/mark', methods=['POST'])
def mark():
    # orig and mx come from the submitted form (elided here)
    task = create_csv_group.apply_async(args=[orig, mx])
    tsk_id = task.id
If I try to access the variable tsk_id, it sometimes gives the error:
variable used before being initialized.
I thought the reason was that the task had not yet been sent to the queue when I accessed tsk_id, so I moved the call so that it happens after two form-filling pages.
But now it is not updating/saving the file correctly; the file shows strange output (it seems to be the old file's data, which should be replaced when the new form is submitted). When I run the same code locally, it runs perfectly fine. I logged the worker: it enters the task function and runs properly too.
Why is this stale output being displayed? How can I fix both issues, so that the task writes the file properly and I can check on the task id?
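For reference, this is roughly how I check on the task later (a minimal sketch; the /status route is illustrative, not my exact code):
from flask import jsonify

@app.route('/status/<tsk_id>')
def task_status(tsk_id):
    # look the task up by id in the Redis result backend
    result = create_csv_group.AsyncResult(tsk_id)
    # state is one of PENDING, STARTED, SUCCESS, FAILURE, ...
    return jsonify({'state': result.state})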
I'm trying to scrape some sites with OpenWPM and wrote a custom function get_pp_links() that appends an item to a global list. But when I use it with OpenWPM's run_custom_function(), the items I append to the list disappear.
My script looks like this:
from csv import reader

from automation import CommandSequence, TaskManager

NUM_BROWSERS = 2
alist = [1]

def get_pp_links(**kwargs):
    alist.append(1)

def main():
    # The list of sites that we wish to crawl
    with open("top-100.csv", "r") as file:
        csv_reader = reader(file)
        sites = list(csv_reader)

    # Load the default manager preferences
    manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

    # Update browser configuration
    for i in range(NUM_BROWSERS):
        browser_params[i]['headless'] = True

    # Instantiate the measurement platform
    manager = TaskManager.TaskManager(manager_params, browser_params)

    for site in sites:
        command_sequence = CommandSequence.CommandSequence("https://" + site[1], reset=True)
        command_sequence.get(sleep=0, timeout=60)
        command_sequence.run_custom_function(get_pp_links, ())
        manager.execute_command_sequence(command_sequence)

    # Shut down the browsers and wait for the data to finish logging
    manager.close()
    print(alist)

if __name__ == "__main__":
    main()
And top-100.csv contains a domain list:
92,twitch.tv
93,forbes.com
94,bbc.com
I'm expecting the list to grow with every scanned site, so the result would look like [1, 1, 1, 1], but instead the printed list is only [1].
I think this is somehow connected to run_custom_function(), because when I call get_pp_links() directly, the problem does not appear.
This is because OpenWPM creates a new process for each browser it spawns.
Since each process is isolated from the parent process and from each other, the following happens:
1. alist gets created in the main process.
2. alist gets copied into the browser process.
3. alist gets changed by get_pp_links in the browser process.
4. The changes stay in the browser process, and you can't observe them in the parent process.
You might be able to get around this by using a multiprocessing.SyncManager and syncing the list between the processes, roughly as in the sketch below.
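A minimal sketch of that idea, assuming the list proxy can be handed to the worker processes (how OpenWPM forwards arguments to custom functions may differ):
from multiprocessing import Manager, Process

def get_pp_links(shared_list):
    # appends are forwarded to the manager's server process,
    # so the parent sees them
    shared_list.append(1)

if __name__ == "__main__":
    manager = Manager()        # starts a SyncManager server process
    alist = manager.list([1])  # a proxy to a list living in that process

    workers = [Process(target=get_pp_links, args=(alist,)) for _ in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(list(alist))  # [1, 1, 1, 1]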
I have a problem: I cannot find the reason why my function in Django's views.py sometimes runs twice. When I go to the URL that calls the create_db view, the function reads JSON files from a directory, parses them, and writes the data to the database. Most of the time it works perfectly, but sometimes, for no apparent reason, it runs twice and writes the same data to the database twice. Does anyone know what could make the code run twice, and how I can solve the problem?
Here is my create_db function:
import json
import os
import time

from django.http import HttpResponse

def create_db(request):
    response_data = {}
    try:
        start = time.time()
        files = os.listdir()
        print(files)
        for filename in files:
            if filename.endswith('.json'):
                print(filename)
                with open(f'{filename.strip()}', encoding='utf-8') as f:
                    data = json.load(f)
                    for item in data["CVE_Items"]:
                        import_item(item)
        response_data['result'] = 'Success'
        response_data['message'] = 'Baza podatkov je ustvarjena.'  # "The database has been created."
    except KeyError:
        response_data['result'] = 'Error'
        response_data['message'] = 'Prislo je do napake! Podatki niso bili uvozeni!'  # "An error occurred! The data was not imported!"
    return HttpResponse(json.dumps(response_data), content_type='application/json')
The console output that I expect:
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
nvdcve-1.0-2003.json
nvdcve-1.0-2004.json
nvdcve-1.0-2005.json
nvdcve-1.0-2006.json
nvdcve-1.0-2007.json
nvdcve-1.0-2008.json
nvdcve-1.0-2009.json
nvdcve-1.0-2010.json
nvdcve-1.0-2011.json
nvdcve-1.0-2012.json
nvdcve-1.0-2013.json
nvdcve-1.0-2014.json
nvdcve-1.0-2015.json
nvdcve-1.0-2016.json
nvdcve-1.0-2017.json
Console output when the error happens:
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
nvdcve-1.0-2003.json
nvdcve-1.0-2003.json
nvdcve-1.0-2004.json
nvdcve-1.0-2004.json
nvdcve-1.0-2005.json
nvdcve-1.0-2005.json
nvdcve-1.0-2006.json
nvdcve-1.0-2006.json
nvdcve-1.0-2007.json
nvdcve-1.0-2007.json
nvdcve-1.0-2008.json
nvdcve-1.0-2008.json
nvdcve-1.0-2009.json
nvdcve-1.0-2009.json
nvdcve-1.0-2010.json
nvdcve-1.0-2010.json
nvdcve-1.0-2011.json
nvdcve-1.0-2011.json
nvdcve-1.0-2012.json
nvdcve-1.0-2012.json
nvdcve-1.0-2013.json
nvdcve-1.0-2013.json
nvdcve-1.0-2014.json
nvdcve-1.0-2014.json
nvdcve-1.0-2015.json
nvdcve-1.0-2015.json
nvdcve-1.0-2016.json
nvdcve-1.0-2016.json
nvdcve-1.0-2017.json
nvdcve-1.0-2017.json
The problem is not in the code you have shown us. Enable logging for the HTTP requests your application receives, to make sure the browser really sends just a single request. If you see two requests, check whether they come from the same session (maybe another user is clicking at the same time); a sketch of such request logging follows below.
If it's the same user, maybe you're clicking the button twice; it could even be a hardware problem with the mouse. To prevent this, use JavaScript to disable the button after the first click.
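A minimal sketch of such request logging for Django, assuming a standard LOGGING dict in settings.py (handler and level choices are illustrative; behind gunicorn/uwsgi the web server's access log serves the same purpose):
# settings.py
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        # 'django.server' logs every request under runserver;
        # 'django.request' logs 4xx/5xx responses in production
        'django.server': {'handlers': ['console'], 'level': 'INFO'},
        'django.request': {'handlers': ['console'], 'level': 'DEBUG'},
    },
}
With this in place, two log lines for one click mean the browser really did send two requests.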
I have an API based on Flask + uWSGI (running in a Docker container).
Its input is a text, and its output is a simple JSON with data about that text (something like {"score": 1}).
I request this API in 5 threads. In my uwsgi.ini I have processes = 8 and threads = 2.
Recently I noticed that some results are not reproducible, though the source code of the API didn't change and there is no code in it that could produce random answers.
So I took the same set of queries and fed it to my API, first in the original order, then in reverse. About 1% of the responses differed!
When I did the same locally (without Docker, in just one thread), the results were identical. So my hypothesis is that Flask occasionally mixes up responses to different threads.
Has anyone dealt with this? I found only https://github.com/getsentry/raven-python/issues/923, but if that's the issue then the problem remains unsolved as far as I understand...
So here are the relevant code pieces:
uwsgi.ini
[uwsgi]
socket = :8000
processes = 8
threads = 2
master = true
module = web:app
requests
import json
from multiprocessing import Pool
import requests
def fetch_score(query):
    r = requests.post("http://url/api/score", data=query)
    score = json.loads(r.text)
    query["score"] = score["score"]
    return query

def calculateParallel(arr_of_data):
    pool = Pool(processes=5)
    results = pool.map(fetch_score, arr_of_data)
    pool.close()
    pool.join()
    return results

results = calculateParallel(final_data)
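For what it's worth, the usual culprit in this kind of setup is not Flask itself but shared mutable state inside the app, which breaks as soon as a worker serves two requests concurrently (threads = 2). A minimal sketch of the pattern, with illustrative names:
from flask import Flask, jsonify, request

app = Flask(__name__)

state = {}  # module-level, shared by both threads of a uwsgi worker

@app.route('/api/score', methods=['POST'])
def score():
    # BROKEN: thread B can overwrite state['text'] between these two
    # lines in thread A, so thread A may score thread B's input
    state['text'] = request.form['text']
    return jsonify({'score': len(state['text'])})

@app.route('/api/score_safe', methods=['POST'])
def score_safe():
    # SAFE: per-request data stays in local variables
    text = request.form['text']
    return jsonify({'score': len(text)})
If the API keeps any module-level cache, model object, or buffer that is written to per request, that is the first place to look.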
I have a function view that creates a report using xlsxwriter; the file is built on the fly using a StringIO buffer and finally sent through HttpResponse. It works well on the local server.
The problem is that on Heroku, after some seconds (the documentation mentions a non-modifiable 30-second timeout), the server gives up and reboots the web process, returning an error as the response.
What is the best way to:
create an xlsx file on the fly (dynamically) in memory,
serve the entire file to the client, and
prevent the server from timing out because of the long-running process?
This is a piece of the code I am using:
def reporte_usuarios(request):
    from xlsxwriter.workbook import Workbook
    try:
        import cStringIO as StringIO
    except ImportError:
        import StringIO

    # create a workbook in memory
    output = StringIO.StringIO()
    workbook = Workbook(output)
    bold = workbook.add_format({'bold': True})

    # get the data
    from django.db.models import Count
    usuarios = User.objects.filter(.......  # all filter stuff
    for usr in usuarios:
        if usr.activos > 0:
            # create a workbook sheet for every registered User
            ws = workbook.add_worksheet(u'%s' % usr.username)
            # some relevant user data
            ws.write(1, 1, u'USUARIO: %s' % usr.username)
            ...
            # get rows for user
            log = LogActivos.objects.filter(usuario=usr).select_related('activo__unidad__id', 'activo__unidad__nombre', 'activo__nombre')
            # write headers
            ws.write(3, 0, u'FECHA', bold)
            ...
            sig_fila = 4  # starting row for data (after headers)
            for l in log:
                # write all data
                ws.write(sig_fila, 0, u'%s' % l.fecha)
                ...
                sig_fila += 1

    # close the workbook
    workbook.close()
    # go to the beginning of the buffer
    output.seek(0)
    # response using the buffer
    response = HttpResponse(output.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    response['Content-Disposition'] = 'attachment; filename="ACTIVOS_USUARIOS__%s.xlsx"' % datetime.now().strftime("%Y%m%d_%H%M")
    return response
Notes: I am using Gunicorn on Heroku, Django 1.9.13 and Python 2.7.11.
IMHO you should follow a totally different approach in this case.
As you are generating a rather big file, it's normal for the request to run into the timeout.
What you could do instead is deploy a background task queue, like Celery or django-rq. With that, a background task creates the file from your user's data, and you can then let the user know it's ready by any means you like, such as a notification or an email; see the sketch below.
If you need more details on how to do something like this, let me know and I can help :)
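A minimal sketch of the Celery variant, assuming a configured Celery app; build_workbook_bytes() is a hypothetical helper wrapping the question's xlsxwriter code and returning output.getvalue():
# tasks.py
from celery import shared_task
from django.core.mail import EmailMessage

@shared_task
def build_report(user_email):
    xlsx_bytes = build_workbook_bytes()  # hypothetical helper, see lead-in
    msg = EmailMessage('Your report is ready', 'Report attached.',
                       to=[user_email])
    msg.attach('ACTIVOS_USUARIOS.xlsx', xlsx_bytes,
               'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
    msg.send()

# views.py -- the view now returns immediately instead of blocking
from django.http import HttpResponse

def reporte_usuarios(request):
    build_report.delay(request.user.email)
    return HttpResponse('{"status": "queued"}',
                        content_type='application/json')
This keeps every web request well under Heroku's 30-second limit, because the slow work happens on a worker dyno.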
I am using the Python logging mechanism to keep a record of my logs. I have two types of logs:
one is a rotating log (log1, log2, log3...) and the other a non-rotating log called json.log (which, as the name suggests, contains JSON logs).
The log files are created when the server starts and closed when the app shuts down.
What I am trying to do, in general, is: when I press the import button on my page, save all JSON logs to the SQLite db.
The problem I am facing is this:
When I try to rename the json.log file like this:
import os

source_file = "./logs/json.log"
snapshot_file = "./logs/json.snapshot.log"
try:
    os.rename(source_file, snapshot_file)
except OSError:
    # surfaces on Windows as: WindowsError: [Error 32]
    raise
I get the error: WindowsError: [Error 32] The process cannot access the file because it is being used by another process.
This happens because the logger holds the file open continuously, so I need to "close" the file somehow for my I/O operation to succeed.
The catch is that closing it is undesirable, because logs might be lost between closing, renaming, and "re-creating" the file.
I was wondering if anyone has come across this scenario and found a practical solution.
I have tried something that works, but it does not seem clean, and I am not sure no logs get lost.
My code is this:
source_file = "./logs/json.log"
snapshot_file = "./logs/json.snapshot.log"
try:
    logger = get_logger()
    # some hackish way to remove the handler for json.log
    if len(logger.handlers) > 2:
        logger.removeHandler(logger.handlers[2])
    if not os.path.exists(snapshot_file):
        os.rename(source_file, snapshot_file)
    try:
        if type(logger.handlers[2]) == RequestLoggerHandler:
            del logger.handlers[2]
    except IndexError:
        pass
    # re-add the json.log file handler so the logger keeps writing logs
    json_file_name = configuration["brew.log_file_dir"] + os.sep + "json.log"
    json_log_level = logging.DEBUG
    json_file_handler = logging.FileHandler(json_file_name)
    json_file_handler.setLevel(json_log_level)
    json_file_handler.addFilter(JSONLoggerFiltering())
    json_file_handler.setFormatter(JSONFormatter())
    logger.addHandler(json_file_handler)
... the code then continues by writing the logs to the db and deleting json.snapshot.log,
until the next time the import button is pressed; then the snapshot is created again,
used only for writing the logs to the db. (An alternative I am considering is sketched below.)
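The alternative: attach json.log to a RotatingFileHandler with a huge maxBytes (so it never rotates on its own) and call its doRollover() when the import button is pressed; doRollover() closes the stream, renames json.log to json.log.1 and reopens a fresh json.log, so no handler juggling is needed. A minimal sketch, with illustrative names:
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("json_logger")
json_handler = RotatingFileHandler("./logs/json.log",
                                   maxBytes=10**12,  # effectively never by size
                                   backupCount=1)    # keep one snapshot: json.log.1
logger.addHandler(json_handler)

def snapshot_for_import():
    # take the handler's I/O lock so no record is written mid-rename
    json_handler.acquire()
    try:
        json_handler.doRollover()  # close, rename to json.log.1, reopen
    finally:
        json_handler.release()
    return "./logs/json.log.1"  # the importer reads from here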
Also, for reference, my log entries have this format:
{'status': 200, 'actual_user': 1, 'resource_name': '/core/logs/process', 'log_level': 'INFO', 'request_body': None, ... }
Thanks in advance :)