pandas csv error 'TextFileReader' object has no attribute 'to_html' - python

I am reading a large csv file using Pandas and then serving it using Flask. I am getting the error 'TextFileReader' object has no attribute 'to_html'. I think chunk size is causing the issue, but I can't open a file above 4GB without it.
from flask import Flask, session, request, json, Response, stream_with_context, send_from_directory, render_template
import pandas as pd

app = Flask(__name__)

@app.route('/readcsv')
def host_data():
    csvname = request.args.get('csvname')
    df = pd.read_csv(csvname, chunksize=5000)
    return df.to_html(header="true")

When you pass chunksize, pd.read_csv returns an iterator of chunks (a TextFileReader) rather than a DataFrame, which is why to_html is not available. You can concatenate the chunks back into a single DataFrame, for example:
df = pd.concat(chunk for chunk in pd.read_csv(csvname, chunksize=5000))
Be aware that serving a big file like this without implementing some sort of pagination will produce a completely blocking response from your server: the user has to wait until the whole file is read and rendered as HTML.
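As a minimal sketch of such pagination (the page query parameter and the 404 response are assumptions, not part of the original question), you could read and render only the requested chunk instead of the whole file:
from itertools import islice

@app.route('/readcsv')
def host_data():
    csvname = request.args.get('csvname')
    page = int(request.args.get('page', 0))  # hypothetical page parameter
    reader = pd.read_csv(csvname, chunksize=5000)
    # advance the iterator to the requested chunk and render only that one
    chunk = next(islice(reader, page, page + 1), None)
    if chunk is None:
        return "Page out of range", 404
    return chunk.to_html(header="true")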

Related

How to properly return dataframe as JSON using FastAPI?

I created an API using FastAPI that returns JSON. At first I converted the DataFrame to JSON using Pandas' .to_json() method, which let me choose the correct "orient" parameter. This saved a .json file, which I then opened so that FastAPI could return its contents, as follows:
DATA2.to_json("json_records.json", orient="records")
with open('json_records.json', 'r') as f:
    data = json.load(f)
return data
This worked perfectly, but I was told that my script shouldn't save any files, since it would be running on my company's server, so I had to turn the DataFrame directly into JSON and return it. I tried doing this:
data = DATA2.to_json(orient="records")
return data
But now the API's output is a JSON string full of "\" escape characters. I guess there is a problem with the parsing, but I can't find a way to do it properly.
The output now looks like this:
"[{\"ExtraccionHora\":\"12:53:00\",\"MiembroCompensadorCodigo\":117,\"MiembroCompensadorDescripcion\":\"OMEGA CAPITAL S.A.\",\"CuentaCompensacionCodigo\":\"1143517\",\"CuentaNeteoCodigo\":\"160234117\",\"CuentaNeteoDescripcion\":\"UNION FERRO SRA A\",\"ActivoDescripcion\":\"X17F3\",\"ActivoID\":8,\"FinalidadID\":2,\"FinalidadDescripcion\":\"Margenes\",\"Cantidad\":11441952,\"Monto\":-16924935.3999999985,\"Saldo\":-11379200.0,\"IngresosVerificados\":11538288.0,\"IngresosNoVerificado\":0.0,\"MargenDelDia\":0.0,\"SaldoConsolidadoFinal\":-16765847.3999999985,\"CuentaCompensacionCodigoPropia\":\"80500\",\"SaldoCuentaPropia\":-7411284.3200000003,\"Resultado\":\"0\",\"MiembroCompensadorID\":859,\"CuentaCompensacionID\":15161,\"CuentaNeteoID\":7315285}.....
What would be a proper way of turning my dataframe into a JSON using the "records" orient, and then returning it as the FastAPI output?
Thanks!
Update: I changed the to_json() method to to_dict() with the same parameters and it seems to work; I don't know whether that's the correct approach.
data = DATA2.to_dict(orient="records")
return data
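A likely explanation (this note is a sketch, not from the original post): .to_json() already returns a JSON string, and when FastAPI serializes that string again the inner quotes get escaped, which produces the "\" characters. Returning Python objects via .to_dict() avoids this, as does returning the raw JSON string in a Response so it is not encoded a second time, roughly like this:
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/records")
def get_records():
    # DATA2 is assumed to be the DataFrame loaded elsewhere in the script.
    # to_json() already produces a JSON string, so return it as-is.
    return Response(DATA2.to_json(orient="records"), media_type="application/json")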

How to stream DataFrame using FastAPI without saving the data to csv file?

I would like to know how to stream a DataFrame using FastAPI without having to save the DataFrame to a csv file on disk. Currently, what I have managed to do is stream data from the csv file, but the speed is not very fast compared to returning a FileResponse. The /option7 below is what I'm trying to do.
My goal is to stream data from FastAPI backend without saving the DataFrame to a csv file.
Thank you.
from fastapi import FastAPI, Response, Query
from fastapi.responses import FileResponse, HTMLResponse, StreamingResponse
import pandas as pd

app = FastAPI()
df = pd.read_csv("data.csv")

@app.get("/option4")
def load_questions():
    return FileResponse(path="C:Downloads/data.csv", filename="data.csv")

@app.get("/option5")
def load_questions():
    def iterfile():
        with open('data.csv', mode="rb") as file_like:
            yield from file_like
    return StreamingResponse(iterfile(), media_type="text/csv")

@app.get("/option7")
def load_questions():
    def iterfile():
        # with open(df, mode="rb") as file_like:
        yield from df
    return StreamingResponse(iterfile(), media_type="application/json")
Approach 1 (recommended)
As mentioned in this answer, as well as here and here, when the entire data (a DataFrame in your case) is already loaded into memory, there is no need to use StreamingResponse. StreamingResponse makes sense when you want to transfer real-time data and when you don't know the size of your output ahead of time, and you don't want to wait to collect it all to find out before you start sending it to the client, as well as when a file that you would like to return is too large to fit into memory—for instance, if you have 8GB of RAM, you can't load a 50GB file—and hence, you would rather load the file into memory in chunks.
In your case, as the DataFrame is already loaded into memory, you should instead return a custom Response directly, after using .to_json() method to convert the DataFrame into a JSON string, as described in this answer (see related posts here and here as well). Example:
from fastapi import Response

@app.get("/")
def main():
    return Response(df.to_json(orient="records"), media_type="application/json")
If you find the browser taking a while to display the data, you may want to have the data downloaded as a .json file to the user's device (which would be completed much faster), rather than waiting for the browser to display a large amount of data. You can do that by setting the Content-Disposition header in the Response using the attachment parameter (see this answer for more details):
#app.get("/")
def main():
headers = {'Content-Disposition': 'attachment; filename="data.json"'}
return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')
You could also return the data as a .csv file, using the .to_csv() method without specifying the path parameter. Since using return df.to_csv() would result in displaying the data in the browser with \r\n characters included, you might find it better to put the csv data in a Response instead, and specify the Content-Disposition header, so that the data will be downloaded as a .csv file. Example:
#app.get("/")
def main():
headers = {'Content-Disposition': 'attachment; filename="data.csv"'}
return Response(df.to_csv(), headers=headers, media_type="text/csv")
Approach 2
To use a StreamingResponse, you would need to iterate over the rows in a DataFrame, convert each row into a dictionary and subsequently into a JSON string, using either the standard json library, or other faster JSON encoders, as described in this answer (the JSON string will be later encoded into byte format internally by FastAPI/Starlette, as shown in the source code here). Example:
#app.get("/")
def main():
def iter_df():
for _, row in df.iterrows():
yield json.dumps(row.to_dict()) + '\n'
return StreamingResponse(iter_df(), media_type="application/json")
Iterating through Pandas objects is generally slow and not recommended. As described in this answer:
Iteration in Pandas is an anti-pattern and is something you should
only do when you have exhausted every other option. You should
not use any function with "iter" in its name for more than a few
thousand rows or you will have to get used to a lot of waiting.
Update
As @Panagiotis Kanavos noted in the comments section below, using either .to_json() or .to_csv() on a DataFrame that is already loaded into memory would result in allocating the entire output string in memory, thus doubling the RAM usage, or even worse. Hence, if the amount of data is so large that either of the methods above could cause your system to slow down or crash from running out of memory, you should rather use StreamingResponse, as described earlier. You may find faster alternatives to iterrows() in this post, as well as faster JSON encoders, such as orjson and ujson, as described in this answer and this answer.
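As a rough sketch of that combination (assuming orjson is installed and the column names are valid Python identifiers; itertuples() is one of the faster alternatives to iterrows()):
import orjson

@app.get("/stream")
def stream_records():
    def iter_df():
        # itertuples() is considerably faster than iterrows()
        for row in df.itertuples(index=False):
            # OPT_SERIALIZE_NUMPY lets orjson handle numpy scalars in the row
            yield orjson.dumps(row._asdict(), option=orjson.OPT_SERIALIZE_NUMPY) + b'\n'
    return StreamingResponse(iter_df(), media_type="application/json")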
Alternatively, you could save the data to disk and then delete the DataFrame to release the memory (you could even manually trigger garbage collection using gc.collect(), as shown in this answer; however, frequent calls to the garbage collector are discouraged, as it is a costly operation and may affect performance), return a FileResponse (assuming the data can fit into RAM; otherwise, you should use StreamingResponse, see this answer, as well as this answer), and finally have a BackgroundTask delete the file from disk after returning the response. An example is given below.
Regardless, the solution you choose should be based on your application's requirements (e.g., the number of users you expect to serve simultaneously, the size of the data, the response time) as well as your system's specifications (e.g., available memory for allocation). Additionally, since all calls to the DataFrame's methods are synchronous, you should remember to define your endpoint with a normal def, so that it is run in an external threadpool; otherwise, it would block the server. Alternatively, you could use Starlette's run_in_threadpool() from the concurrency module, which will run the to_csv() or to_json() function in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked. Please have a look at this answer for more details on def vs async def.
from fastapi import BackgroundTasks
from fastapi.responses import FileResponse
import uuid
import os

@app.get("/")
def main(background_tasks: BackgroundTasks):
    global df  # needed so that 'del df' below refers to the module-level DataFrame
    filename = str(uuid.uuid4()) + ".csv"
    df.to_csv(filename)
    del df  # release the memory
    background_tasks.add_task(os.remove, filename)
    return FileResponse(filename, filename="data.csv", media_type="text/csv")
    # or return StreamingResponse, if the file can't fit into RAM; see linked answers above
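Regarding the run_in_threadpool() option mentioned above, a minimal sketch (assuming an async def endpoint and the same module-level df; the /csv path is just an example) could look like this:
from fastapi import Response
from fastapi.concurrency import run_in_threadpool

@app.get("/csv")
async def get_csv():
    # run the blocking to_csv() call in a worker thread so the event loop is not blocked
    csv_data = await run_in_threadpool(df.to_csv)
    headers = {'Content-Disposition': 'attachment; filename="data.csv"'}
    return Response(csv_data, headers=headers, media_type="text/csv")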

Deploying dash plotly app on heroku gives r15 error

I built a dash plotly web app with Python and am trying to deploy it on Heroku, but I get R15 errors (memory vastly exceeded). I have tried unsuccessfully to identify and fix the problem. There are a few similar questions on SO, all seemingly unresolved.
My app works as follows.
1. Using requests I download the data, which is in csv format.
2. I read it in chunks using pandas and then concat it together.
3. I do some other necessary data transformations, like adding columns.
4. I then write it to a feather file.
All of steps 1-4 I relegate to a background process that runs only once a day.
Then, I import the feather file in my dash app (saved in my project), do a few more necessary operations that are needed to be able to work with the data, and generate the layout and callbacks.
Everything works well locally, although the local server takes around 6 mins to run. So it seems to point to either a memory leak or inefficient code. I used memory_profiler to try to identify what is taking up lots of memory. This allowed me to identify a few problematic areas, which I fixed, but I'm still getting R15 errors.
I realise that the way I'm approaching this whole thing might be wrong, particularly saving the data as a feather file and then uploading that. I guess it would be better to store the data in a database and then make queries on that? I don't have much experience with this, so I was hoping I could get by without it. But if there are no other solutions, then I'll give that a go. However, some people seem to suggest that Dash and Heroku don't work well together, so I don't want to go through all the trouble if it won't work anyway.
I am using plotly==5.8.0, dash==2.4.1, Django==3.2.3, django-plotly-dash==1.6.5, and Python==3.9.5.
Procfile:
web: gunicorn datasite.wsgi --max-requests 1200 --timeout 120 --preload
data_update: python fetchData/updater.py
updater.py:
from datetime import datetime
from apscheduler.schedulers.background import BackgroundScheduler
from fetchData import fetch

def start():
    scheduler = BackgroundScheduler(timezone="Pacific/Auckland")
    scheduler.add_job(fetch.get_df_lab, 'cron', day_of_week='mon-fri', hour=10, minute=47)
    print("Labour data updated")
    scheduler.start()
fetch.py:
import pandas as pd
import requests, zipfile, io
import gc

### LABOUR MARKET ###
col_list = ["Series_reference", "Period", "Data_value"]

def get_df_lab():
    url_lab = "https://www.stats.govt.nz/assets/Uploads/Labour-market-statistics/Labour-market-statistics-December-2021-quarter/Download-data/labour-market-statistics-december-2021-quarter-csv.zip"
    file_lab = "labour-market-statistics-december-2021-quarter-csv/hlfs-dec21qtr-csv.csv"
    r = requests.get(url_lab, stream=True)
    r.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(r.content))
    if file_lab in z.namelist():
        temp = pd.read_csv(z.open(file_lab), dtype={'a': str, 'b': str, 'c': float}, usecols=col_list,
                           parse_dates=['Period'], encoding="ISO-8859-1", iterator=True,
                           chunksize=100000, infer_datetime_format=True)
        df_lab = pd.concat(temp, ignore_index=True)
        df_lab['qpc'] = df_lab.groupby('Series_reference').Data_value.pct_change()
        data_value = df_lab.groupby('Series_reference')['Data_value']
        df_lab['apc'] = data_value.transform(lambda x: (x / x.shift(4)) - 1)
        df_lab['aapc'] = data_value.transform(lambda x: (x.rolling(window=4).mean() / x.rolling(window=4).mean().shift(4)) - 1)
        df_lab.to_feather('df_lab.feather')
        del df_lab
        gc.collect()
    else:
        df_lab = "Something went wrong"
    return 'df_lab.feather'
dash_lab.py:
df_lab = pd.read_feather('df_lab.feather')
df_lab.set_index('Period', inplace=True, drop=True)
###other operations, then layout and callbacks###
Any suggestions of what I could do?

How to use numpy-stl with file uploaded with Flask request

I am writing an app that takes an STL file as input. I want to get the volume of the STL object without saving the file to disk, use the volume to calculate a quote, and post it back to the browser. Right now I am using the numpy-stl package, but I am stuck on how to create a mesh object for numpy-stl from the file I get with request.files['file'].read(). Any help is appreciated.
My code, the value I get for filedata, and the error I get were posted as screenshots (not included here).
You can try the following code:
import io
from stl import mesh

filedata = request.files['file'].read()
data = io.BytesIO(filedata)
tmp_mesh = mesh.Mesh.from_file("tmp.stl", fh=data)
You can then use the tmp_mesh object for whatever operation you are interested in. I would also suggest adding error handling for unexpected input, for example when request.files does not contain a 'file' key.
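For the volume asked about in the question, numpy-stl provides get_mass_properties(); a rough sketch (building on the tmp_mesh object above):
# volume is in the cubic units of the STL file; the centre of gravity and inertia matrix are also returned
volume, cog, inertia = tmp_mesh.get_mass_properties()
print("Volume:", volume)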

use csv file and plot data django2

I have a simple app that imports a csv file and makes a plot, but it doesn't show the plot and there is no error message.
This is part of my module:
...
def plot_data(self):
    df = pd.read_csv("file.csv")
    return plotly.express.line(df)
...
and this is part of my app file:
import panel as pn

def app(doc):
    gspec = pn.GridSpec()
    gspec[0, 1] = pn.Pane(instance_class.plot_data())
    gspec.server_doc(doc)
Update:
After searching more, I found HttpResponse, and with it I could write data to a csv file using the csv module, but I still have no idea how to read from a csv.
I also saw HttpRequest and thought maybe I could use it for reading the csv, but I couldn't find any sample code and couldn't understand the documentation about using it as a reader.
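As a minimal sketch of the reading side (this assumes the csv is uploaded through a form field named 'file', which is not stated in the original post), a Django view can pass the uploaded file straight to pandas:
import pandas as pd
from django.http import HttpResponse

def upload_csv(request):
    if request.method == "POST" and "file" in request.FILES:
        # request.FILES['file'] is a file-like object, so pandas can read it directly
        df = pd.read_csv(request.FILES["file"])
        return HttpResponse(f"Read {len(df)} rows")
    return HttpResponse("No csv file uploaded", status=400)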
