I built a Flask web app that manipulates big dataframes.
Each time I make a call to the URL, the server's RAM usage increases by 2 GB.
But if I kill the browser session, the server's RAM usage does not decrease (the memory is never freed), so it eventually ends with a memory error.
The structure of the code is the following :
import pandas as pd
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/index", methods=['GET', 'POST'])
def index():
    global df
    df2 = df.copy()
    # ...data treatment on df2 depending on POST parameters...
    return render_template('template.html', df2=df2.to_html())

df = pd.read_csv("path_to_huge_dataframe")
app.run(debug=True, threaded=True)
I cannot load the dataframe inside the index() function because it takes several minutes due to the size of the file, so I use a 'global' variable instead. But I think it is the cause of my problem.
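For the copy itself, one way to shrink the per-request footprint (a hedged sketch with a small stand-in frame, not the original data) is to filter before copying, so each request duplicates only the rows it actually renders:

```python
import pandas as pd

# Hypothetical sketch: a small stand-in for the huge CSV.
df = pd.DataFrame({"a": range(10), "b": range(10)})

# Filter first, then copy: only the rendered rows are duplicated,
# instead of the entire multi-GB frame on every request.
subset = df.loc[df["a"] < 3, ["a", "b"]].copy()
html = subset.to_html()
```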
I'm making a multi-page dash application that I plan to host on a server using Gunicorn and Nginx. It will access a PostgreSQL database on an external server over the network.
The data on one of the pages is obtained by a query to the database and should be updated every 30 seconds. I trigger the update @callback through a dcc.Interval component.
My code (simplified version):
from dash import Dash, html, dash_table, dcc, Input, Output, callback
import dash_bootstrap_components as dbc
from flask import Flask
import pandas as pd
from random import random

server = Flask(__name__)
app = Dash(__name__, server=server, suppress_callback_exceptions=True,
           external_stylesheets=[dbc.themes.BOOTSTRAP])

app.layout = html.Div([
    dcc.Interval(
        id='interval-component-time',
        interval=1000,  # 1 second
        n_intervals=0
    ),
    html.Br(),
    html.H6(id='time_update'),
    dcc.Interval(
        id='interval-component-table',
        interval=30 * 1000,  # 30 seconds
        n_intervals=0
    ),
    html.Br(),
    html.H6(id='table_update')
])
@callback(
    Output('time_update', 'children'),
    Input('interval-component-time', 'n_intervals')
)
def time_update(n_intervals):
    time_show = 30
    text = "Next update in {} sec".format(time_show - (n_intervals % 30))
    return text
@callback(
    Output('table_update', 'children'),
    Input('interval-component-table', 'n_intervals')
)
def data_update(n_intervals):
    # here, in a separate file, a query is made to the database and a dataframe is returned
    # below is a simplified stand-in df
    col = ["Col1", "Col2", "Col3"]
    data = [[random(), random(), random()]]
    df = pd.DataFrame(data, columns=col)
    return dash_table.DataTable(df.to_dict('records'),
                                style_cell={'text-align': 'center', 'margin-bottom': '0'},
                                style_table={'width': '500px'})

if __name__ == '__main__':
    server.run(port=5000, debug=True)
Locally, everything works fine for me and the load on the database is small: one such query loads one of the 8 processor cores at about 30% for 3 seconds.
But if I open my application in several browser windows, the same data is displayed on two pages via two separate queries to the database at different times, so the load doubles. I am worried that with more than 10 people connected, my database server will not cope or will freeze badly, and the database on it must keep working without delays.
Question:
Is it possible to make the page refresh the same for all connections? That is, so that the data is updated at the same time for every user, with only one query to the database.
I studied everything about callbacks in the documentation and did not find an answer.
Solution
Thanks for the advice, @Epsi95! I studied the Dash Performance page and added this to my code:
from flask_caching import Cache

cache = Cache(app.server, config={
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': 'cache-directory',
    'CACHE_THRESHOLD': 50
})

@cache.memoize(timeout=30)
def query_data():
    # here I make a query to the database and save the result in a dataframe
    return df

def dataframe():
    df = query_data()
    return df
And in the @callback function I call the dataframe() function.
Everything works the way I needed it. Thank you!
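To see why this helps, here is a hedged, self-contained sketch of what `cache.memoize(timeout=30)` gives you (a simplified stand-in, not Flask-Caching's actual implementation): every caller inside the timeout window shares one cached result, so only the first call actually hits the database.

```python
import functools
import time

call_count = {"n": 0}  # counts how often the "database" is really queried

def memoize(timeout):
    # Simplified stand-in for Flask-Caching's cache.memoize decorator.
    def decorator(fn):
        state = {"at": None, "val": None}

        @functools.wraps(fn)
        def wrapper():
            now = time.monotonic()
            if state["at"] is None or now - state["at"] >= timeout:
                state["val"] = fn()  # cache miss: run the real query
                state["at"] = now
            return state["val"]      # cache hit: reuse the stored result
        return wrapper
    return decorator

@memoize(timeout=30)
def query_data():
    call_count["n"] += 1  # stands in for the expensive database query
    return {"score": 1}

a = query_data()
b = query_data()  # second caller is served from the cache
```

Because every Dash callback invocation goes through the same memoized function, ten users trigger one query per 30-second window, not ten.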
I am stuck on the problem of how to organize the code / the proper way to get an xlsx file as output from a Flask app.
I have a function.py file where for now the xlsx file generates.
The idea is that the Flask app receives some settings in JSON format; these settings are processed by a function that must return the xlsx(?) to the app.
The function does some calculations depending on the settings.
The file has the following structure:
import pandas as pd
from pandas import ExcelWriter

def function(settings):
    df = pd.read_csv(settings['df'])
    writer = ExcelWriter('file.xlsx')
    if settings['somefeature1'] == 1:
        dosmth.to_excel(writer, "feature 1")  # dosmth: placeholder for the real computation
    if settings['somefeature2'] == 1:
        dosmth.to_excel(writer, "feature 2")
    ...
    writer.save()
But if the file is already generated inside the function, what should I pass back to Flask? What should the Flask view function look like then (especially if I want to return the xlsx as JSON)?
@app.route('/Function', methods=['POST'])
def Function():
    settings = request.get_json(force=True)
    return function(settings)  # ???
You should never forget that Flask is a framework for creating web applications, and a web application is a piece of software that receives a web request and generates a web response.
To make it super simple: your Flask function should return something that a common web browser will be able to handle.
In this case your response should return the file, plus some metadata telling the receiver what is inside the response and how to handle it.
I think that something like this could work:
return send_file(filename, mimetype='application/vnd.ms-excel')
I am trying to create a machine learning application with Flask. I have created a POST API route that will take the data and transform it into a pandas dataframe. Here is the Flask code that calls the Python function to transform the data.
from flask import Flask, abort, request
import json
import mlFlask as ml
import pandas as pd

app = Flask(__name__)

@app.route('/test', methods=['POST'])
def test():
    if not request.json:
        abort(400)
    print(type(request.json))
    result = ml.classification(request.json)
    return json.dumps(result)
This is the file that contains the helper function.
import pandas as pd

def jsonToDataFrame(data):
    print(type(data))
    df = pd.DataFrame.from_dict(data, orient='columns')
But I am getting a ValueError. Also, when I print the type of the data it is dict, so I don't know why it would cause an issue. It works when I orient the dataframe by index, but it doesn't work by columns.
ValueError: If using all scalar values, you must pass an index
Here is the body of the request in JSON format.
{
"updatedDate":"2012-09-30T23:51:45.778Z",
"createdDate":"2012-09-30T23:51:45.778Z",
"date":"2012-06-30T00:00:00.000Z",
"name":"Mad Max",
"Type":"SBC",
"Org":"Private",
"month":"Feb"
}
What am I doing wrong here ?
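The ValueError comes from pandas itself: when every value in the dict is a scalar, the constructor cannot infer how many rows to build. A short sketch of the two standard fixes, using the request body above:

```python
import pandas as pd

# The posted JSON body: every value is a scalar, so pandas cannot
# infer a row count on its own.
data = {"name": "Mad Max", "Type": "SBC", "Org": "Private"}

# Fix 1: pass an explicit index for the single row.
df1 = pd.DataFrame(data, index=[0])

# Fix 2: wrap each scalar in a list so the row count is inferable.
df2 = pd.DataFrame({k: [v] for k, v in data.items()})
```

`orient='index'` works without this because each dict key becomes a row label, which is why the index orientation succeeded while columns failed.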
I have an API based on Flask + uWSGI (running in a Docker container).
Its input is a text, and its output is a simple JSON with data about this text (something like {"score": 1}).
I request this API in 5 threads. In my uwsgi.ini I have processes = 8 and threads = 2.
Recently I noticed that some results are not reproducible, though the source code of API didn't change and there is no code provoking random answers inside it.
So I took the same set of queries and fed it to my API, first in the original order, then in reverse. About 1% of the responses differed!
When I did the same locally (without Docker, in just one thread), the results were identical. So my hypothesis is that Flask sometimes mixes up responses between threads.
Has anyone dealt with this? I found only https://github.com/getsentry/raven-python/issues/923, but if that's the case then the problem remains unsolved as far as I understand...
So here are relevant code pieces:
uwsgi.ini
[uwsgi]
socket = :8000
processes = 8
threads = 2
master = true
module = web:app
requests
import json
from multiprocessing import Pool

import requests

def fetch_score(query):
    r = requests.post("http://url/api/score", data=query)
    score = json.loads(r.text)
    query["score"] = score["score"]
    return query

def calculateParallel(arr_of_data):
    pool = Pool(processes=5)
    results = pool.map(fetch_score, arr_of_data)
    pool.close()
    pool.join()
    return results

results = calculateParallel(final_data)
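For what it's worth, a common culprit for cross-request "confusion" under `threads = 2` is module-level mutable state shared by worker threads inside the API process. A minimal hedged sketch of the safe pattern (names are illustrative): guard any shared state with a lock, or avoid sharing it across requests entirely.

```python
import threading

# Hypothetical sketch: shared module-level state, as a Flask app might
# accidentally keep between requests. Without the lock, concurrent
# handlers can interleave updates and return each other's values.
counter = {"n": 0}
lock = threading.Lock()

def handle_request():
    # Stand-in for a request handler that touches shared state.
    with lock:
        counter["n"] += 1
        return counter["n"]

threads = [threading.Thread(target=handle_request) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```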
First off, a disclaimer: I'm not well versed in Python or Flask, so bear with me.
I'm trying to put together a minimal API using Flask. I was planning to dynamically generate routes and their associated handlers from the contents of a subdirectory.
The code looks something like this:
from flask import Flask, jsonify
import os

app = Flask(__name__)
configs = os.getcwd() + "/configs"

for i in os.listdir(configs):
    if i.endswith(".json"):
        call = "/" + os.path.splitext(i)[0]

        @app.route(call, methods=['POST'])
        def call():
            return jsonify({"status": call + "Success"}), 200
The plan was to iterate over a bunch of config files and use their names to define the routes. This works for a single config file, but won't work for multiple files, as I end up trying to overwrite the function `call` that is used by each route.
I can factor out most of the code into a separate function as long as I can pass in the call name. However it seems that, whichever way I go about this, I need to dynamically name the function generated and mapped to the route.
So, my question is: how can I use the contents of a variable, such as 'call', as the function name?
i.e. something like
call = "getinfo"

def call():  # effectively being evaluated as: def getinfo():
Everything I've tried hasn't worked, and I'm not confident enough in my Python syntax to know if it's because I'm just doing something silly.
Alternatively, is there another way to do what I'm trying to achieve?
Thanks for all and any feedback!
Thanks for the help. I've moved to one route and one handler, building up the file list and the handling of the request paths separately.
This is a sanitized version of the model I now have:
from flask import Flask, abort, jsonify
import os

calls = []
cfgs = {}
app = Flask(__name__)
configs = os.getcwd() + "/configs"

for i in os.listdir(configs):
    if i.endswith(".json"):
        call = os.path.splitext(i)[0]
        cfgs[call] = os.path.join(configs, i)
        calls.append(call)

@app.route('/<call>', methods=['POST'])
def do(call):
    if call not in calls:
        abort(400, "invalid call")
    # Do stuff
    return jsonify({"status": call + "Success"}), 200

if __name__ == '__main__':
    app.run(debug=True)
So, thanks to the above comments, this is doing what I'm after. I'm still curious to know if there is any way to use variables in function names, though.
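On that follow-up: you can't interpolate a variable into a `def` statement, but you can rename a function object through its `__name__` attribute and register it with Flask's `add_url_rule`, giving each route its own closure and a unique endpoint. A hedged sketch (the route names are illustrative):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def make_handler(name):
    # Build one closure per route; each captures its own `name`.
    def handler():
        return jsonify({"status": name + "Success"}), 200
    handler.__name__ = name  # rename the function object itself
    return handler

for name in ["getinfo", "getstatus"]:
    app.add_url_rule("/" + name, endpoint=name,
                     view_func=make_handler(name), methods=["POST"])
```

Passing `endpoint=name` explicitly is what avoids the "overwriting an existing endpoint function" collision that the decorator loop ran into.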