I built a Plotly Dash web app in Python and am trying to deploy it on Heroku, but I keep getting R15 errors (memory quota vastly exceeded). I have tried, unsuccessfully, to identify and fix the problem. There are a few similar questions on SO, all seemingly unresolved.
My app works as follows:
1. Using requests, I download the data, which is in CSV format.
2. I read it in chunks using pandas and then concatenate the chunks together.
3. I do some other necessary data transformations, like adding columns.
4. I then write it to a feather file.
All of steps 1-4 are relegated to a background process that runs only once a day.
Then I read the feather file in my Dash app (it is saved in my project), do a few more operations needed to work with the data, and generate the layout and callbacks.
Everything works well locally, although the local server takes around 6 minutes to start, which seems to point to either a memory leak or inefficient code. I used memory_profiler to try to identify what is taking up lots of memory. That let me find and fix a few problematic areas, but I'm still getting R15 errors.
I realise that my whole approach might be wrong, particularly saving the data to a feather file and then loading that in the app. I guess it would be better to store the data in a database and query that instead? I don't have much experience with databases, so I was hoping to get by without one, but if there is no other solution I'll give it a go. However, some people seem to suggest that Dash and Heroku don't work well together, so I don't want to go through all that trouble if it won't work anyway.
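For what it's worth, here is a hedged sketch of what the database route could look like, assuming a Heroku Postgres add-on (which exposes a DATABASE_URL config var); the table and column names are only illustrative:

import os

import pandas as pd
from sqlalchemy import create_engine

# Assumes a Heroku Postgres add-on exposing the connection string as DATABASE_URL.
# SQLAlchemy expects the "postgresql://" scheme rather than Heroku's legacy "postgres://".
db_url = os.environ["DATABASE_URL"].replace("postgres://", "postgresql://", 1)
engine = create_engine(db_url)

def store_labour_data(df_lab):
    # Background job: overwrite the table once a day with the prepared frame.
    df_lab.to_sql("labour_market", engine, if_exists="replace", index=False)

def load_labour_data():
    # Dash app: pull only the columns the layout/callbacks actually use,
    # instead of reading the whole feather file into memory.
    query = 'SELECT "Series_reference", "Period", "Data_value", "qpc", "apc", "aapc" FROM labour_market'
    return pd.read_sql(query, engine, parse_dates=["Period"])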
I am using plotly==5.8.0, dash==2.4.1, Django==3.2.3, django-plotly-dash==1.6.5, and Python==3.9.5.
Procfile:
web: gunicorn datasite.wsgi --max-requests 1200 --timeout 120 --preload
data_update: python fetchData/updater.py
updater.py:
from datetime import datetime

from apscheduler.schedulers.background import BackgroundScheduler

from fetchData import fetch


def start():
    scheduler = BackgroundScheduler(timezone="Pacific/Auckland")
    scheduler.add_job(fetch.get_df_lab, 'cron', day_of_week='mon-fri', hour=10, minute=47)
    print("Labour data updated")  # note: this prints when the job is scheduled, not when it runs
    scheduler.start()
fetch.py:
import pandas as pd
import requests, zipfile, io
import gc

### LABOUR MARKET ###
col_list = ["Series_reference", "Period", "Data_value"]


def get_df_lab():
    url_lab = "https://www.stats.govt.nz/assets/Uploads/Labour-market-statistics/Labour-market-statistics-December-2021-quarter/Download-data/labour-market-statistics-december-2021-quarter-csv.zip"
    file_lab = "labour-market-statistics-december-2021-quarter-csv/hlfs-dec21qtr-csv.csv"
    r = requests.get(url_lab, stream=True)
    r.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(r.content))
    if file_lab in z.namelist():
        temp = pd.read_csv(z.open(file_lab), dtype={'a': str, 'b': str, 'c': float}, usecols=col_list,
                           parse_dates=['Period'], encoding="ISO-8859-1", iterator=True,
                           chunksize=100000, infer_datetime_format=True)
        df_lab = pd.concat(temp, ignore_index=True)
        df_lab['qpc'] = df_lab.groupby('Series_reference').Data_value.pct_change()
        data_value = df_lab.groupby('Series_reference')['Data_value']
        df_lab['apc'] = data_value.transform(lambda x: (x / x.shift(4)) - 1)
        df_lab['aapc'] = data_value.transform(lambda x: (x.rolling(window=4).mean() / x.rolling(window=4).mean().shift(4)) - 1)
        df_lab.to_feather('df_lab.feather')
        del df_lab
        gc.collect()
    else:
        df_lab = "Something went wrong"
    return 'df_lab.feather'
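As an aside, a hedged sketch that is not part of the original code: Series_reference is highly repetitive, so converting it to a pandas categorical before the to_feather call shrinks both the feather file and the in-memory frame, and downcasting the numeric columns to float32 helps further if the reduced precision is acceptable:

# Sketch only: shrink df_lab just before df_lab.to_feather('df_lab.feather').
# Assumes float32 precision is acceptable for these series.
df_lab['Series_reference'] = df_lab['Series_reference'].astype('category')
for col in ['Data_value', 'qpc', 'apc', 'aapc']:
    df_lab[col] = df_lab[col].astype('float32')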
dash_lab.py:
df_lab = pd.read_feather('df_lab.feather')
df_lab.set_index('Period', inplace=True, drop=True)
###other operations, then layout and callbacks###
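Relatedly (again a hedged sketch rather than the original code), pd.read_feather accepts a columns argument, so the web process can avoid materialising columns the layout and callbacks never use; the column list below is purely illustrative:

# Illustrative column list: load only what the layout/callbacks need.
df_lab = pd.read_feather('df_lab.feather',
                         columns=['Period', 'Series_reference', 'Data_value', 'apc'])
df_lab.set_index('Period', inplace=True, drop=True)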
Any suggestions of what I could do?
Related
I have a local Linux server with 4 cores. I am running a PySpark job on it locally which basically reads two tables from a database and saves the data into two dataframes. I then use these two dataframes for some processing, and save the resultant processed df into Elasticsearch. Below is the code
import time

from pyspark.sql import SparkSession, SQLContext

# merged_python_file, logic1_python_file, ... are the poster's own modules
# containing the processing logic referred to below.


def save_to_es(df):
    df.write.format('es').option('es.nodes', 'es_node').option('es.port', some_port_no) \
        .option('es.resource', index_name).option('es.mapping', es_mappings).save()


def coreFun():
    start_time = int(time.time())

    spark = SparkSession.builder.master("local[1]").appName('test').getOrCreate()
    spark.catalog.clearCache()
    spark.sparkContext.setLogLevel("ERROR")
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)

    select_sql = """(select * from db."master_table")"""
    df_master = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    select_sql_child = """(select * from db."child_table")"""
    df_child = spark.read.format("jdbc").option("url", "jdbcurl").option("dbtable", select_sql_child).option("user", "username").option("password", "password").option("driver", "database_driver").load()

    merged_df = merged_python_file.merged_function(df_master, df_child, sqlContext)

    logic1_df = logic1_python_file.logic1_function(df_master, sqlContext)
    logic2_df = logic2_python_file.logic2_function(df_master, sqlContext)
    logic3_df = logic3_python_file.logic3_function(df_master, sqlContext)
    logic4_df = logic4_python_file.logic4_function(df_master, sqlContext)
    logic5_df = logic5_python_file.logic5_function(df_master, sqlContext)

    save_to_es(merged_df)
    save_to_es(logic1_df)
    save_to_es(logic2_df)
    save_to_es(logic3_df)
    save_to_es(logic4_df)
    save_to_es(logic5_df)

    end_time = int(time.time())
    print(end_time - start_time)

    sc.stop()


if __name__ == "__main__":
    coreFun()
The different processing logic lives in separate Python files, e.g. logic1 in logic1_python_file, and so on. I pass df_master to these separate functions and they return the resultant processed df back to the driver, which I then save into Elasticsearch.
It works fine, but the problem is that everything happens sequentially: first merged_df gets processed while the others simply wait, even though they don't depend on merged_df's output; then logic1_df gets processed while the others wait, and so on. This is not an ideal design, considering the output of one logic is not dependent on another.
I am sure asynchronous processing can help me here, but I'm not sure how to implement it in my use case. I know I may have to use some kind of queue (JMS, Kafka, etc.) to accomplish this, but I don't have a complete picture.
Please let me know how I can use asynchronous processing here. Any other input that can help improve the performance of the job is welcome.
If, during the processing of one single step like merged_python_file.merged_function, only one CPU core is heavily utilised and the others are nearly idle, multiprocessing can speed things up. It can be achieved with Python's multiprocessing module. For more details, see the answers to How to do parallel programming in Python?
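As an illustration of what the answer points to, here is a minimal sketch with hypothetical CPU-bound functions, not the poster's Spark code (sharing a SparkSession across processes is not straightforward); it only shows the general multiprocessing pattern for independent steps:

from multiprocessing import Pool

def logic1(data):
    # Stand-in for one independent, CPU-heavy processing step.
    return sum(x * x for x in data)

def logic2(data):
    # Another independent step that would otherwise wait for logic1.
    return sum(x ** 3 for x in data)

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(processes=4) as pool:
        # Run the independent steps on separate cores instead of sequentially.
        async_results = [pool.apply_async(fn, (data,)) for fn in (logic1, logic2)]
        print([r.get() for r in async_results])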
I have a Dash application which takes in multiple CSV files and creates a combined dataframe for analysis and visualization. Usually this computation takes around 30-35 seconds for datasets of 600-650 MB. I'm using a Flask filesystem cache to store the dataframe once, so that every subsequent time I request the data it comes from the cache.
I used the code from Dash's example here
I have two problems here:
Since the cache is on the filesystem, it seems to take twice the amount of time (nearly 70 seconds) to get the dataframe on the first try; after that, it comes back quickly on subsequent requests. Can I use any other cache type to avoid this overhead?
I tried automatically clearing my cache by setting CACHE_THRESHOLD (for example, I set it to 1), but it's not working and I see files keep getting added to the directory.
Sample code:
import dash
from flask_caching import Cache

app = dash.Dash(__name__)
cache = Cache(app.server, config={
    'CACHE_TYPE': 'filesystem',
    'CACHE_DIR': 'my-cache-directory',
    'CACHE_THRESHOLD': 1
})

app.layout = app_layout


@cache.memoize()
def getDataFrame():
    df = createLargeDataFrame()
    return df


@app.callback(...)  # Callback that uses the dataframe
def useDataFrame():
    df = getDataFrame()
    # Using the dataframe here
    return value
Can someone help me with this? Thanks.
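For reference, flask-caching supports backends other than 'filesystem'; a hedged sketch of a Redis-backed configuration (assuming a Redis instance at the conventional local URL shown):

from flask_caching import Cache

# Same memoization pattern, but backed by Redis: avoids filesystem I/O,
# though the dataframe is still serialized (pickled) into the store.
cache = Cache(app.server, config={
    'CACHE_TYPE': 'redis',                          # assumed Redis backend
    'CACHE_REDIS_URL': 'redis://localhost:6379/0',  # assumed local default; adjust for your deployment
    'CACHE_DEFAULT_TIMEOUT': 300,
})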
References:
https://examples.dask.org/applications/forecasting-with-prophet.html?highlight=prophet
https://facebook.github.io/prophet/
A few things to note:
I've got a total of 48 GB of RAM.
Here are the versions of the libraries I'm using:
Python 3.7.7
dask==2.18.0
fbprophet==0.6
pandas==1.0.3
The only reason I import pandas is for this one line: pd.options.mode.chained_assignment = None
This helps with Dask erroring when I'm using dask.distributed.
So, I have a 21 GB CSV file that I am reading using Dask in a Jupyter notebook...
I've also tried to read it from my MySQL database table; however, the kernel eventually crashes.
I've tried multiple combinations of local workers, threads, available memory and available storage_memory, and I have even tried not using distributed at all. I have also tried chunking with pandas (not the line mentioned above related to pandas); however, even with chunking, the kernel still crashes...
I can now load the CSV with Dask and apply a few transformations, such as setting the index and adding the column names that fbprophet requires, but I am still not able to compute the dataframe with df.compute(), which is why I think I am receiving the fbprophet error. After I have added the columns y and ds, with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported. I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected it might be trying to load the data twice; this still failed, so most likely that wasn't the case / isn't causing the problem.
Alongside this, fbprophet is also mentioned in the Dask documentation for applying machine learning to dataframes, yet I really don't understand why this isn't working... I've also tried modin with Ray and with Dask, with basically the same result.
Another question... regarding memory usage
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB
I am getting this warning when assigning the client, reading the CSV file, and applying operations/transformations to the dataframe; however, the allotted size is larger than the CSV file itself, so this confuses me...
What I have done to try and solve this myself:
- Googling, of course; did not find anything :-/
- Asking a Discord help channel, on multiple occasions
- Asking an IRC help channel, on multiple occasions
Anyways, would really appreciate any help on this problem!!!
Thank you in advance :)
MCVE
from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet
pd.options.mode.chained_assignment = None
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a','b']].mean(axis=1)
m = Prophet(daily_seasonality=True)
m.fit(df)
# ERROR: Truth of Delayed objects is not supported
Unfortunately Prophet doesn't support Dask dataframes today.
The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.
You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.
Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE); every step in the pipeline will be delayed.
Imports
import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet
Generate some data in a .csv, with column names Time (UTC), a and b
def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)
First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator
might be good to use read_csv with nrows to see how the pipeline performs on a subset of the data, rather than loading it all
this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)
Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame
@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df
Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object
@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast
Run the pipeline (if running from a Python script, requires __name__ == "__main__")
the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
when this output is computed, this will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
horizon = 8
num_rows_data = 40
num_rows_to_load = 35
csv_fname = "my_file.csv"
generate_csv(num_rows_data, csv_fname)
client = Client() # modify this as required
df = load_data(csv_fname, nrows=num_rows_to_load)
df = process_data(df)
result = analyze(df, horizon)
forecast = result.compute()
client.close()
assert len(forecast) == num_rows_to_load + horizon
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())
Output
ds yhat yhat_lower yhat_upper
0 1850-01-01 0.330649 0.095788 0.573378
1 1850-01-02 0.493025 0.266692 0.724632
2 1850-01-03 0.573344 0.348953 0.822692
3 1850-01-04 0.491388 0.246458 0.712400
4 1850-01-05 0.307939 0.066030 0.548981
I'm trying to read a bunch of large CSV files (multiple files) from Google Storage. I use the Dask distributed library for parallel computation, but the problem I'm facing is that, although I specify the blocksize (100 MB), I'm not sure how to read partition by partition and save each to my Postgres database without overloading my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
    with ProgressBar():
        df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
        pd = df.compute(scheduler='threads')
        return pd


def write_df_to_db(df):
    try:
        from sqlalchemy import create_engine
        engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')
        df.to_sql('sampletable', engine, if_exists='replace', index=False)
    except Exception as e:
        print(e)
        pass


pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but, as said, I would like to read the files in chunks and update the db. Something like
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
    write_it_to_db(chunk)
Is it possible to do this in Dask, or should I go for pandas's chunksize, iterate, and then save to the DB (but then I lose the parallel computation)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe, df. This is not what you wanted. You wanted to insert the chunks as they come and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
Note that, depending on your SQL driver, you might not be able to get parallelism this way, and if not, the pandas iter-chunk method would have worked just as well.
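A hedged sketch of the to_delayed() route (the connection string and table name below are simply carried over from the question and are illustrative): each partition materialises as a plain pandas DataFrame, gets appended to the table, and can then be dropped from memory.

import dask
from sqlalchemy import create_engine

def write_it_to_db(partition):
    # Each delayed chunk arrives as a plain pandas DataFrame.
    engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')  # illustrative URL
    partition.to_sql('sampletable', engine, if_exists='append', index=False)
    return len(partition)

# One task per partition; compute() runs the inserts and the chunks are then freed.
tasks = [dask.delayed(write_it_to_db)(part) for part in df.to_delayed()]
rows_written = dask.compute(*tasks)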
Today I began using the Dask and Paramiko packages, partly as a learning exercise and partly because I'm beginning a project that will require dealing with large datasets (tens of GB) that must be accessed from a remote VM only (i.e. they cannot be stored locally).
The following piece of code belongs to a short helper program that will make a Dask dataframe from a large CSV file hosted on the VM. I want to later pass its output (a reference to the Dask dataframe) to a second function that will perform some overview analysis on it.
import dask.dataframe as dd
import paramiko as pm
import pandas as pd
import sys


def remote_file_to_dask_dataframe(remote_path):
    if isinstance(remote_path, (str)):
        try:
            client = pm.SSHClient()
            client.load_system_host_keys()
            client.connect('#myserver', username='my_username', password='my_password')
            sftp_client = client.open_sftp()
            remote_file = sftp_client.open(remote_path)
            df = dd.read_csv(remote_file)
            remote_file.close()
            sftp_client.close()
            return df
        except:
            print("An error occurred.")
            sftp_client.close()
            remote_file.close()
    else:
        raise ValueError("Path to remote file as string required")
The code is neither nice nor complete, and I will replace the username and password with SSH keys in time, but this is not the issue. In a Jupyter notebook, I've previously opened the sftp connection with a path to a file on the server and read it into a dataframe with a regular Pandas read_csv call. However, here the equivalent line using Dask is the source of the problem: df = dd.read_csv(remote_file).
I've looked at the documentation online (here), but I can't tell whether what I'm trying above is possible. It seems that for networked options Dask wants a URL. The parameter-passing options for, e.g., S3 appear to depend on that infrastructure's backend. I unfortunately cannot make any sense of the dask-ssh documentation (here).
I've poked around with print statements, and the only line that fails to execute is the one stated. The error raised is:
raise TypeError('url type not understood: %s' % urlpath)
TypeError: url type not understood:
Can anybody point me in the right direction for achieving what I'm trying to do? I'd expected Dask's read_csv to behave like Pandas', since it's built on the same function.
I'd appreciate any help, thanks.
p.s. I'm aware of Pandas' read_csv chunksize option, but I would like to achieve this through Dask, if possible.
In the master version of Dask, file-system operations now use fsspec, which, along with the previous implementations (s3, gcs, hdfs), supports some additional file systems; see the mapping of protocol identifiers in fsspec.registry.known_implementations.
In short, a URL like "sftp://user:pw@host:port/path" should now work for you, if you install fsspec and Dask from master.
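A hedged usage sketch of that URL form (host, credentials, port, and path are placeholders; fsspec's sftp:// protocol is backed by paramiko, which also needs to be installed):

import dask.dataframe as dd

# Placeholder credentials/host/path; fsspec routes the sftp:// protocol to its SFTP backend.
df = dd.read_csv("sftp://my_username:my_password@myserver:22/data/large_file.csv")
print(df.head())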
It seems that you would have to implement their "file system" interface.
I'm not sure what the minimal set of methods is that you need to implement to allow read_csv, but you definitely have to implement open.
class SftpFileSystem(object):
    def open(self, path, mode='rb', **kwargs):
        return sftp_client.open(path, mode)

dask.bytes.core._filesystems['sftp'] = SftpFileSystem

df = dd.read_csv('sftp://remote/path/file.csv')