I have a Dash application that takes in multiple CSV files and creates a combined DataFrame for analysis and visualization. Usually, this computation takes around 30-35 seconds for datasets of 600-650 MB. I'm using a Flask filesystem cache to store this DataFrame once, so that on every subsequent request the data comes from the cache.
I used the code from Dash's example here
I'm having two problems here:
Since the cache is on the filesystem, it takes twice the amount of time (nearly 70 seconds) to get the DataFrame on the first try; afterwards it comes quickly on subsequent requests. Can I use any other cache type to avoid this overhead?
I tried automatically clearing my cache by setting CACHE_THRESHOLD (for example, I set it to 1), but it's not working and I keep seeing files being added to the directory.
Sample Code:
import dash
from flask_caching import Cache

app = dash.Dash(__name__)
cache = Cache(app.server, config={
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': 'my-cache-directory',
    'CACHE_THRESHOLD': 1
})

app.layout = app_layout

@cache.memoize()
def getDataFrame():
    df = createLargeDataFrame()
    return df

@app.callback(...)  # Callback that uses DataFrame
def useDataFrame():
    df = getDataFrame()
    # Using Dataframe here
    return value
Can someone help me with this? Thanks.
I would like to know how to stream a DataFrame using FastAPI without having to save the DataFrame to a csv file on disk. Currently, what I managed to do is to stream data from the csv file, but the speed was not very fast compared to returning a FileResponse. The /option7 below is what I'm trying to do.
My goal is to stream data from FastAPI backend without saving the DataFrame to a csv file.
Thank you.
import pandas as pd
from fastapi import FastAPI, Response, Query
from fastapi.responses import FileResponse, HTMLResponse, StreamingResponse

app = FastAPI()
df = pd.read_csv("data.csv")

@app.get("/option4")
def load_questions():
    return FileResponse(path="C:Downloads/data.csv", filename="data.csv")

@app.get("/option5")
def load_questions():
    def iterfile():
        with open('data.csv', mode="rb") as file_like:
            yield from file_like
    return StreamingResponse(iterfile(), media_type="text/csv")

@app.get("/option7")
def load_questions():
    def iterfile():
        # with open(df, mode="rb") as file_like:
        yield from df
    return StreamingResponse(iterfile(), media_type="application/json")
Approach 1 (recommended)
As mentioned in this answer, as well as here and here, when the entire data (a DataFrame in your case) is already loaded into memory, there is no need to use StreamingResponse. StreamingResponse makes sense when you want to transfer real-time data, when you don't know the size of your output ahead of time and don't want to wait to collect it all before sending it to the client, or when the file you would like to return is too large to fit into memory (for instance, with 8GB of RAM you can't load a 50GB file), and hence you would rather load it into memory in chunks.
In your case, as the DataFrame is already loaded into memory, you should instead return a custom Response directly, after using the .to_json() method to convert the DataFrame into a JSON string, as described in this answer (see related posts here and here as well). Example:
from fastapi import Response

@app.get("/")
def main():
    return Response(df.to_json(orient="records"), media_type="application/json")
If you find the browser taking a while to display the data, you may want to have the data downloaded as a .json file to the user's device (which would be completed much faster), rather than waiting for the browser to display a large amount of data. You can do that by setting the Content-Disposition header in the Response using the attachment parameter (see this answer for more details):
#app.get("/")
def main():
headers = {'Content-Disposition': 'attachment; filename="data.json"'}
return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')
You could also return the data as a .csv file, using the .to_csv() method without specifying the path parameter. Since using return df.to_csv() would result in displaying the data in the browser with \r\n characters included, you might find it better to put the csv data in a Response instead, and specify the Content-Disposition header, so that the data will be downloaded as a .csv file. Example:
#app.get("/")
def main():
headers = {'Content-Disposition': 'attachment; filename="data.csv"'}
return Response(df.to_csv(), headers=headers, media_type="text/csv")
Approach 2
To use a StreamingResponse, you would need to iterate over the rows in a DataFrame, convert each row into a dictionary and subsequently into a JSON string, using either the standard json library, or other faster JSON encoders, as described in this answer (the JSON string will be later encoded into byte format internally by FastAPI/Starlette, as shown in the source code here). Example:
#app.get("/")
def main():
def iter_df():
for _, row in df.iterrows():
yield json.dumps(row.to_dict()) + '\n'
return StreamingResponse(iter_df(), media_type="application/json")
Iterating through Pandas objects is generally slow and not recommended. As described in this answer:
Iteration in Pandas is an anti-pattern and is something you should
only do when you have exhausted every other option. You should
not use any function with "iter" in its name for more than a few
thousand rows or you will have to get used to a lot of waiting.
Update
As @Panagiotis Kanavos noted in the comments section below, using either .to_json() or .to_csv() on a DataFrame that is already loaded into memory would result in allocating the entire output string in memory, thus doubling the RAM usage or even worse. Hence, if you have such a huge amount of data that either of the methods above may cause your system to slow down or crash (because of running out of memory), you should rather use StreamingResponse, as described earlier. You may find faster alternatives to iterrows() in this post, as well as faster JSON encoders, such as orjson and ujson, as described in this answer and this answer.
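For illustration, a rough sketch of such a chunked StreamingResponse, using orjson and DataFrame slicing instead of iterrows() (the chunk size of 10000 is an arbitrary assumption), could look like this:
import orjson
from fastapi.responses import StreamingResponse

@app.get("/")
def main():
    def iter_chunks(chunk_size=10000):
        for start in range(0, len(df), chunk_size):
            records = df.iloc[start:start + chunk_size].to_dict(orient="records")
            # orjson.dumps() returns bytes, which StreamingResponse accepts as-is;
            # OPT_SERIALIZE_NUMPY handles any leftover NumPy scalar types
            yield orjson.dumps(records, option=orjson.OPT_SERIALIZE_NUMPY) + b"\n"
    return StreamingResponse(iter_chunks(), media_type="application/json")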
Alternatively, you could save the data to disk, then delete the DataFrame to release the memory (you could even manually trigger garbage collection using gc.collect(), as shown in this answer; however, frequent calls to garbage collection are discouraged, as it is a costly operation and may affect performance), and return a FileResponse (assuming the data can fit into RAM; otherwise, you should use StreamingResponse, see this answer, as well as this answer), and finally have a BackgroundTask delete the file from disk after returning the response. An example is given below.
Regardless, the solution you choose should be based on your application's requirements (e.g., the number of users you expect to serve simultaneously, the size of the data, the response time), as well as your system's specifications (e.g., available memory for allocation). Additionally, since all calls to the DataFrame's methods are synchronous, you should remember to define your endpoint with a normal def, so that it is run in an external threadpool; otherwise, it would block the server. Alternatively, you could use Starlette's run_in_threadpool() from the concurrency module, which will run the to_csv() or to_json() call in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked (a short sketch of this option follows the example below). Please have a look at this answer for more details on def vs async def.
from fastapi import BackgroundTasks
from fastapi.responses import FileResponse
import uuid
import os

@app.get("/")
def main(background_tasks: BackgroundTasks):
    global df  # so that `del df` below removes the module-level DataFrame
    filename = str(uuid.uuid4()) + ".csv"
    df.to_csv(filename)
    del df  # release the memory
    background_tasks.add_task(os.remove, filename)
    return FileResponse(filename, filename="data.csv", media_type="text/csv")
    # or return StreamingResponse, if the file can't fit into RAM; see linked answers above
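As for the run_in_threadpool() option mentioned above, a minimal sketch (assuming an async endpoint and the same module-level df) might look like this:
from fastapi import Response
from fastapi.concurrency import run_in_threadpool

@app.get("/")
async def main():
    # run the blocking to_json() call in a worker thread so the event loop is not blocked
    json_str = await run_in_threadpool(df.to_json, orient="records")
    return Response(json_str, media_type="application/json")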
I built a dash plotly web app with Python and am trying to deploy it on Heroku, but I get R15 errors (memory vastly exceeded). I have tried unsuccessfully to identify and fix the problem. There are a few similar questions on SO, all seemingly unresolved.
My app works as follows.
1. Using requests, I download the data, which is in csv format.
2. I read it in chunks using pandas and then concat it together.
3. I do some other necessary data transformations, like adding columns.
4. I then write it to a feather file.
All of steps 1-4 I relegate to a background process that runs only once a day.
Then, I import the feather file in my dash app (saved in my project), do a few more necessary operations that are needed to be able to work with the data, and generate the layout and callbacks.
Everything works well locally, although the local server takes around 6 mins to run. So it seems to point to either a memory leak or inefficient code. I used memory_profiler to try to identify what is taking up lots of memory. This allowed me to identify a few problematic areas, which I fixed, but I'm still getting R15 errors.
I realise that the way I'm approaching this whole thing might be wrong, particularly saving the data as a feather file and then uploading that. I guess it would be better to store the data in a database and then run queries on that? I don't have much experience with this, so I was hoping I could get by without it. But if there are no other solutions, then I'll give that a go. However, some people seem to suggest that Dash and Heroku don't work well together, so I don't want to go through all the trouble if it won't work anyway.
I am using plotly==5.8.0, dash==2.4.1, Django==3.2.3, django-plotly-dash==1.6.5, and Python==3.9.5.
Procfile:
web: gunicorn datasite.wsgi --max-requests 1200 --timeout 120 --preload
data_update: python fetchData/updater.py
updater.py:
from datetime import datetime
from apscheduler.schedulers.background import BackgroundScheduler
from fetchData import fetch
def start():
scheduler = BackgroundScheduler(timezone="Pacific/Auckland")
scheduler.add_job(fetch.get_df_lab, 'cron', day_of_week='mon-fri', hour=10, minute=47)
print("Labour data updated")
scheduler.start()
fetch.py:
import pandas as pd
import requests, zipfile, io
import gc
### LABOUR MARKET ###
col_list = ["Series_reference", "Period", "Data_value"]
def get_df_lab():
url_lab = "https://www.stats.govt.nz/assets/Uploads/Labour-market-statistics/Labour-market-statistics-December-2021-quarter/Download-data/labour-market-statistics-december-2021-quarter-csv.zip"
file_lab = "labour-market-statistics-december-2021-quarter-csv/hlfs-dec21qtr-csv.csv"
r = requests.get(url_lab, stream=True)
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))
if file_lab in z.namelist():
temp = pd.read_csv(z.open(file_lab), dtype={'a': str, 'b': str, 'c': float}, usecols=col_list, parse_dates=['Period'], \
encoding = "ISO-8859-1", iterator=True, chunksize=100000, infer_datetime_format = True)
df_lab = pd.concat(temp, ignore_index=True)
df_lab['qpc'] = df_lab.groupby('Series_reference').Data_value.pct_change()
data_value = df_lab.groupby('Series_reference')['Data_value']
df_lab['apc'] = data_value.transform(lambda x: (x/x.shift(4))-1)
df_lab['aapc'] = data_value.transform(lambda x: ((x.rolling(window=4).mean()/x.rolling(window=4).mean().shift(4))-1))
df_lab.to_feather('df_lab.feather')
del df_lab
gc.collect()
else:
df_lab = "Something went wrong"
return 'df_lab.feather'
dash_lab.py:
df_lab = pd.read_feather('df_lab.feather')
df_lab.set_index('Period', inplace=True, drop=True)
###other operations, then layout and callbacks###
Any suggestions of what I could do?
I am doing a simulation where I compute some stuff for several time steps. For each time step I want to save a parquet file where each line corresponds to a simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as numpy arrays), I transform it into dask.DataFrame objects and then save them to file using the .to_parquet() method, as follows:
def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow", append=True, ignore_divisions=True)
When I use this code only once, it works perfectly; the struggle arises when I try to run it in parallel. Using dask I do:
from dask import delayed, compute
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to the fact that two processes try to write to the same parquet file at the same time, and this is not handled well (as it can be with a txt file). I already tried switching to pySpark / Koalas without much success. Is there a better way to save the results along a simulation (in case of a crash / wall time on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The errors indicate that things are happening in parallel (which is what dask does!) to files that are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do, is use concat on the dataframe pieces and then a single call to to_parquet.
Note that it seems all of your data is actually held in the client, and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when needed. You should, instead, load your data inside delayed functions or dask dataframe API calls.
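For illustration, a rough sketch of that pattern (simulation() is the function from the question; the output folder and the per-simulation DataFrame layout are assumptions) could look like this:
import pandas as pd
import dask.dataframe as dd
from dask import delayed
from dask.distributed import Client

client = Client(n_workers=10, processes=True)

@delayed
def run_one(n):
    # run one simulation and return a plain pandas DataFrame;
    # no dask API calls happen inside this delayed function
    return pd.DataFrame(simulation())

pieces = [run_one(n) for n in range(20)]
ddf = dd.from_delayed(pieces)                       # lazily concatenates the pieces
ddf.to_parquet("datafolder/", engine="pyarrow")     # one parquet write for everything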
When inserting a huge pandas DataFrame into sqlite via sqlalchemy and pandas to_sql with a specified chunksize, I would get memory errors.
At first I thought it was an issue with to_sql, but I tried a workaround where instead of using chunksize I used for i in range(100): df.iloc[i * 100000:(i+1) * 100000].to_sql(...) and that still resulted in an error.
It seems under certain conditions, that there is a memory leak with repeated insertions to sqlite via sqlalchemy.
I had a hard time replicating, in a minimal example, the memory leak that occurred when converting my data. But this gets pretty close.
import string
import numpy as np
import pandas as pd
from random import randint
import random
def make_random_str_array(size=10, num_rows=100, chars=string.ascii_uppercase + string.digits):
return (np.random.choice(list(chars), num_rows*size)
.view('|U{}'.format(size)))
def alt(size, num_rows):
data = make_random_str_array(size, num_rows=2*num_rows).reshape(-1, 2)
dfAll = pd.DataFrame(data)
return dfAll
dfAll = alt(randint(1000, 2000), 10000)
for i in range(330):
print('step ', i)
data = alt(randint(1000, 2000), 10000)
df = pd.DataFrame(data)
dfAll = pd.concat([ df, dfAll ])
import sqlalchemy
from sqlalchemy import create_engine
engine = sqlalchemy.create_engine('sqlite:///testtt.db')
for i in range(500):
print('step', i)
dfAll.iloc[(i%330)*10000:((i%330)+1)*10000].to_sql('test_table22', engine, index = False, if_exists= 'append')
This was run in a Google Colab CPU environment.
The database itself isn't causing the memory leak, because I can restart my environment, the previously inserted data is still there, and connecting to that database doesn't cause an increase in memory. The issue seems to be, under certain conditions, repeated insertions via a looping to_sql or one to_sql call with chunksize specified.
Is there a way that this code could be run without causing an eventual increase in memory usage?
Edit:
To fully reproduce the error, run this notebook
https://drive.google.com/open?id=1ZijvI1jU66xOHkcmERO4wMwe-9HpT5OS
The notebook requires you to import this folder into the main directory of your Google Drive
https://drive.google.com/open?id=1m6JfoIEIcX74CFSIQArZmSd0A8d0IRG8
The notebook will also mount your Google drive, you need to give it authorization to access your Google drive. Since the data is hosted on my Google drive, importing the data should not take up any of your allocated data.
The Google Colab instance starts with about 12.72GB of RAM available.
After creating the DataFrame, theBigList, about 9.99GB of RAM have been used.
Already this is a rather uncomfortable situation to be in, since it is not unusual for
Pandas operations to require as much additional space as the DataFrame it is operating on.
So we should strive to avoid using even this much RAM if possible, and fortunately there is an easy way to do this: simply load each .npy file and store its data in the sqlite database one at a time without ever creating theBigList (see below).
However, if we use the code you posted, we can see that the RAM usage slowly increases
as chunks of theBigList are stored in the database iteratively.
theBigList DataFrame stores the strings in a NumPy array. But in the process
of transferring the strings to the sqlite database, the NumPy strings are
converted into Python strings. This takes additional memory.
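A rough way to see the difference (the exact numbers are approximate and depend on the Python/NumPy build):
import sys
import numpy as np

arr = np.array(['abcdefghij'] * 1000)          # fixed-width '<U10' storage
print(arr.nbytes)                              # 40000 bytes (4 bytes per character)

as_py = arr.tolist()                           # 1000 separate Python str objects
print(sum(sys.getsizeof(s) for s in as_py))    # roughly 59000 bytes, plus list and allocator overhead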
Per this Theano tutorial which discusses Python internal memory management,
To speed-up memory allocation (and reuse) Python uses a number of lists for
small objects. Each list will contain objects of similar size: there will be a
list for objects 1 to 8 bytes in size, one for 9 to 16, etc. When a small object
needs to be created, either we reuse a free block in the list, or we allocate a
new one.
... The important point is that those lists never shrink.
Indeed: if an item (of size x) is deallocated (freed by lack of reference) its
location is not returned to Python’s global memory pool (and even less to the
system), but merely marked as free and added to the free list of items of size
x. The dead object’s location will be reused if another object of compatible
size is needed. If there are no dead objects available, new ones are created.
If small objects memory is never freed, then the inescapable conclusion is that,
like goldfishes, these small object lists only keep growing, never shrinking,
and that the memory footprint of your application is dominated by the largest
number of small objects allocated at any given point.
I believe this accurately describes the behavior you are seeing as this loop executes:
for i in range(0, 588):
theBigList.iloc[i*10000:(i+1)*10000].to_sql(
'CS_table', engine, index=False, if_exists='append')
Even though many dead objects' locations are being reused for new strings, it is
not implausible with essentially random strings such as those in theBigList that extra space will occasionally be
needed and so the memory footprint keeps growing.
The process eventually hits Google Colab's 12.72GB RAM limit and the kernel is killed with a memory error.
In this case, the easiest way to avoid large memory usage is to never instantiate the entire DataFrame -- instead, just load and process small chunks of the DataFrame one at a time:
import numpy as np
import pandas as pd
import matplotlib.cbook as mc
import sqlalchemy as SA
def load_and_store(dbpath):
engine = SA.create_engine("sqlite:///{}".format(dbpath))
for i in range(0, 47):
print('step {}: {}'.format(i, mc.report_memory()))
for letter in list('ABCDEF'):
path = '/content/gdrive/My Drive/SummarizationTempData/CS2Part{}{:02}.npy'.format(letter, i)
comb = np.load(path, allow_pickle=True)
toPD = pd.DataFrame(comb).drop([0, 2, 3], 1).astype(str)
toPD.columns = ['title', 'abstract']
toPD = toPD.loc[toPD['abstract'] != '']
toPD.to_sql('CS_table', engine, index=False, if_exists='append')
dbpath = '/content/gdrive/My Drive/dbfile/CSSummaries.db'
load_and_store(dbpath)
which prints
step 0: 132545
step 1: 176983
step 2: 178967
step 3: 181527
...
step 43: 190551
step 44: 190423
step 45: 190103
step 46: 190551
The last number on each line is the amount of memory consumed by the process as reported by
matplotlib.cbook.report_memory. There are a number of different measures of memory usage. On Linux, mc.report_memory() is reporting
the size of the physical pages of the core image of the process (including text, data, and stack space).
By the way, another basic trick you can use to manage memory is to use functions.
Local variables inside the function are deallocated when the function terminates.
This relieves you of the burden of manually calling del and gc.collect().
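A minimal sketch of that idea (list_of_npy_files is a hypothetical stand-in for the actual file list, and the database path is shortened):
import numpy as np
import pandas as pd
import sqlalchemy as SA

def insert_one(path, engine):
    # comb and toPD are local, so their memory becomes reclaimable
    # as soon as this function returns -- no manual del or gc.collect() needed
    comb = np.load(path, allow_pickle=True)
    toPD = pd.DataFrame(comb).astype(str)
    toPD.to_sql('CS_table', engine, index=False, if_exists='append')

engine = SA.create_engine("sqlite:///CSSummaries.db")
for path in list_of_npy_files:   # assumed to be defined elsewhere
    insert_one(path, engine)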
I am using pyspark to process 50Gb data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to be saved in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to one csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method, like this?
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
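For illustration, a sketch of how the single aggregation, the shuffle-partition setting, and the single-file write might fit together (replacing hourFilter with the built-in hour() function is an assumption about what the original filter was doing):
from pyspark.sql import SparkSession
from pyspark.sql.functions import hour, col, mean, sum as sum_

spark = (SparkSession.builder
         .config("spark.sql.shuffle.partitions", "50")   # tune for your data/cluster
         .getOrCreate())

daily_df = (df.groupBy(hour(col("Time")).alias("hour"), col("Animal"))
              .agg(mean("weights"), sum_("is_male")))

daily_df.coalesce(1).write.csv("some_local.csv", header=True)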
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimum shuffling of data over the network), repartition(1), or collect may work for small data-sets, but for large data-sets it may not perform as expected, since all data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available than the driver.
Option 2:
Another option would be FileUtil.copyMerge(), which merges the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files, you can use the hdfs getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at Dataframe save after join is creating numerous part files.