matplotlib loads memory and does not show plot - python

I want to plot a large (>100k rows) file in matplotlib. The first time I do it, I get the result I need. However, after I restart the kernel and rerun, plt.show() loads memory endlessly and does not show the graph.
I have tried restarting Jupyter Notebook and Anaconda, but the problem remains.
import pandas as pd
import matplotlib.pyplot as plt

dataset = 'data/data_name.csv'
df = pd.read_csv(dataset)
pd.options.display.float_format = '{:.2f}'.format
df.set_index('time', inplace=True)

plt.figure(figsize=(18, 6))
plt.plot(df['some_column'])
plt.show()
From that moment, a Python instance appears in the process list and starts consuming memory without end.
Thank you in advance.

It appears the size of the plot is overwhelming the memory on your machine and crashing your kernel. I'd suggest plotting fewer data points, e.g. using df.sample(n=10**4, random_state=1). If your data is large and nicely distributed, taking a sample should reduce the memory use and allow for more rapid plotting.
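A minimal sketch of that approach, reusing the hypothetical file and column names from the question:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/data_name.csv')  # hypothetical file from the question
df.set_index('time', inplace=True)

# Plot a 10k-point random sample instead of all >100k rows;
# sorting by the index keeps the line in time order.
sample = df.sample(n=10**4, random_state=1).sort_index()

plt.figure(figsize=(18, 6))
plt.plot(sample['some_column'])
plt.show()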

Related

clear memory used by mplfinance

I'm using the mplfinance module to plot candlesticks. The problem is mplfinance uses too much memory when it generates plots. I have tried the instructions mentioned in free up the memory used by matplotlib but nothing changed and my code is still filling up my computer's memory. Here is my code:
fig, axlist = mpf.plot(hloc, hlines=hlines,
                       ylabel='Price(USDT)', type='candle',
                       style='binance', title=my_title,
                       closefig=True, returnfig=True)
Any suggestion is highly appreciated.
It would be helpful to see the rest of your code, to see how you are displaying plots and how many. That said, given the above code, when you are done with each plot you might try:
for ax in axlist:
    del ax
del fig
This will save memory, but at the expense of some time (which will anyway not be noticeable unless you are making thousands of plots).
If you are saving your plots to image files (instead of displaying to the screen) then matplotlib.use("Agg") may help as well.
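For example, a minimal sketch of the file-based route; the ohlc.csv file is a stand-in for your data, while savefig and closefig are documented mpf.plot() keywords:
import matplotlib
matplotlib.use("Agg")  # select the non-interactive backend before any figure is created

import mplfinance as mpf
import pandas as pd

# Hypothetical OHLC data; mplfinance expects a DatetimeIndex and
# Open/High/Low/Close columns.
hloc = pd.read_csv('ohlc.csv', index_col=0, parse_dates=True)

# savefig renders straight to a file and never opens a window;
# closefig=True releases the figure once the file is written.
mpf.plot(hloc, type='candle', style='binance',
         ylabel='Price(USDT)', closefig=True,
         savefig='chart.png')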

How can I avoid memory leaks with real-time plotting (matplotlib) in Jupyter Notebook?

I'm training a large DQN in a Jupyter notebook. I'm having some trouble finding a way to update this plot in real time without causing a memory leak. I currently have a dirty implementation that uses ~1GB of RAM per episode (14,000 steps). After getting through about 7 episodes, I'm roughly halfway out of memory on my system.
From what I've read in other posts, attempting to plot in the same thread will cause a memory leak regardless of gc.collect() or del fig, fig.clear(), etc. How can I update this plot within a loop without causing a memory leak?
I found a similar question here, but couldn't quite figure out how to apply it in my case with multiple figures and data that is updated dynamically.
Current implementation looks like this (it runs once per loop iteration):
clear_output(wait=True)
plt.close()
plt.ion()
fig, axs = plt.subplots(2, figsize=(10, 7))
fig.tight_layout()
color = [int((item + 1) * 255 / 2) for item in p_reward_history]
axs[0].scatter(tindex, p_reward_history[-plot_len:], c=color[-plot_len:], cmap='RdYlGn', linewidth=3)
axs[0].set_title('P&L Individual Transactions')
axs[0].plot(zero_line, color="black", linewidth=3)
axs[0].set_facecolor('#2c303c')
axs[1].set_title('P&L Running Total')
axs[1].set_facecolor('#2c303c')
axs[1].plot(running_rewards_history, color="#94c273", linewidth=3)
The variables that are dynamic are running_rewards_history and p_reward_history. These are both lists that get new values appended each loop.
I prefer to work in Jupyter notebook, but if I need to train in a regular shell in order to update asynchronously, that is okay with me.
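One way to avoid the leak is to build the figure once and only update the artists' data inside the loop, re-displaying the same figure object each iteration. A minimal sketch under that assumption (the two-panel layout and history lists mirror the question; the episode loop and data are placeholders):
import random
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

fig, axs = plt.subplots(2, figsize=(10, 7))
line0, = axs[0].plot([], [], linewidth=3)  # reused artist, panel 1
line1, = axs[1].plot([], [], linewidth=3)  # reused artist, panel 2
plt.close(fig)  # prevent an extra copy from rendering via the normal display hook

p_reward_history, running_rewards_history = [], []
for episode in range(100):  # placeholder training loop
    p_reward_history.append(random.uniform(-1, 1))
    running_rewards_history.append(sum(p_reward_history))

    # Update the existing artists instead of rebuilding the figure.
    line0.set_data(range(len(p_reward_history)), p_reward_history)
    line1.set_data(range(len(running_rewards_history)), running_rewards_history)
    for ax in axs:
        ax.relim()
        ax.autoscale_view()

    clear_output(wait=True)
    display(fig)  # re-show the same figure; no new figures accumulate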

Bokeh: Repeated plotting in jupyter lab increases (browser) memory usage

I am using Bokeh to plot many time-series (>100) with many points (~20,000) within a Jupyter Lab Notebook.
When executing the cell multiple times in Jupyter, the memory consumption of Chrome increases by over 400 MB per run. After several cell executions Chrome tends to crash, usually once several GB of RAM usage have accumulated. Further, the plotting tends to get slower with each execution.
A "Clear [All] Outputs" or "Restart Kernel and Clear All Outputs..." in Jupyter also does not free any memory. The issue also occurs in a classic Jupyter Notebook and with Firefox or Edge.
Minimal version of my .ipynb:
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import bokeh
output_notebook() # See e.g.: https://github.com/bokeh/bokeh-notebooks/blob/master/tutorial/01%20-%20Basic%20Plotting.ipynb
# Just create a list of numpy arrays with random-walks as dataset
ts_length = 20000
n_lines = 100
np.random.seed(0)
dataset = [np.cumsum(np.random.randn(ts_length)) + i*100 for i in range(n_lines)]
# Plot exactly the same linechart every time
plot = figure(x_axis_type="linear")
for data in dataset:
    plot.line(x=range(ts_length), y=data)
show(plot)
This 'memory-leak' behavior continues, even if I execute the following cell every time before re-executing the (plot) cell above:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
output_notebook() # has to be done again because output was reset
Are there any additional mechanisms in Bokeh that I might have overlooked which would allow me to clean up the plot and free the memory (in the browser/JS client)?
Do I have to plot (or show the plot) somehow else within a Jupyter Notebook to avoid this issue? Or is this simply a bug of Bokeh/Jupyter?
Installed versions on my system (Windows 10):
Python 3.6.6: Anaconda custom (64-bit)
bokeh: 1.4.0
Chrome: 78.0.3904.108
jupyter:
  core: 4.6.1
  lab: 1.1.4
  ipywidgets: 7.5.1
labextensions:
  @bokeh/jupyter_bokeh: v1.1.1
  @jupyter-widgets/jupyterlab-manager: v1.0.*
TL;DR: This is probably worth making an issue (or two) for.
Memory usage
Just some notes about different aspects:
Clear/Reset functions
First to note, these:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
Only affect data structures in the Python process (e.g. the Jupyter Kernel). They will never have any effect on the browser memory usage or footprint.
One-time Memory footprint
Based on just the data, I'd expect somewhere in the neighborhood of ~64MB:
20000 * 100 * 2 * 2 * 8 = 64MB
That's 100 lines with 20k (x, y) points each, which will also be converted to (sx, sy) screen coordinates, all in float64 (8-byte) arrays. However, Bokeh also constructs a spatial index over all the data to support things like hover tools, and I expect you are blowing up this index with this data. It is probably worth making that feature configurable so that folks who do not need hit testing do not have to pay for it; a feature-request issue to discuss this would be appropriate.
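Spelled out, the estimate works out as follows (a back-of-the-envelope check, not a measurement):
ts_length = 20000       # points per line
n_lines = 100
coord_arrays = 2        # x and y
copies = 2              # data coordinates plus (sx, sy) screen coordinates
bytes_per_value = 8     # float64
total = ts_length * n_lines * coord_arrays * copies * bytes_per_value
print(total / 1e6)      # 64.0 (MB)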
Repeated Execution
There are supposed to be DOM event triggers that will clean up when a notebook cell is re-executed. Perhaps these have become broken? Maintaining integrations between three large hybrid Python/JS tools (including classic Notebook) with a tiny team is unfortunately an ongoing challenge. A bug-report issue would be appropriate so that this can be tracked and investigated.
Other options
What can you do right now?
More optimal usage
At least for the specific case you have here, with timeseries all of the same length, the above code is structured in a very suboptimal way. You should try putting everything in a single ColumnDataSource instead:
import numpy as np
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

ts_length = 20000
n_lines = 100
np.random.seed(0)
source = ColumnDataSource(data=dict(x=np.arange(ts_length)))
for i in range(n_lines):
    source.data[f"y{i}"] = np.cumsum(np.random.randn(ts_length)) + i*100
plot = figure()
for i in range(n_lines):
    plot.line(x='x', y=f"y{i}", source=source)
show(plot)
By passing sequence literals to line, your code creates 99 unnecessary CDS objects (one per extra line call). It also does not re-use the x data, sending 99 * 20k extra points to BokehJS unnecessarily. And by sending a plain sequence instead of a NumPy array, those points all get encoded with the less efficient (in time and space) default JSON encoding, instead of the efficient binary encoding available for NumPy arrays.
That said, this is not causing all the issues here, and is probably not a solution on its own. But I wanted to make sure to point it out.
Datashader
For this many points, you might consider using Datashader in conjunction with Bokeh. The Holoviews library also integrates Bokeh and Datashader automatically at a high level. By pre-rendering images on the Python side, Datashader is effectively a bandwidth-compression tool (among other things).
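A sketch of that route, assuming Holoviews and Datashader are installed; the random-walk data matches the question's example:
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension('bokeh')

ts_length = 20000
n_lines = 100
np.random.seed(0)

# One Curve per walk, overlaid and rasterized on the Python side:
# the browser receives a single image instead of two million points.
overlay = hv.NdOverlay({i: hv.Curve(np.cumsum(np.random.randn(ts_length)) + i*100)
                        for i in range(n_lines)})
datashade(overlay)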
PNG export
Bokeh tilts its trade-offs towards affording various kinds of interactivity. But if you don't actually need that interactivity, then you are paying some extra costs. If that's your situation, you could consider generating static PNGs instead:
from bokeh.io.export import get_screenshot_as_png
p = get_screenshot_as_png(plot)
You'll need to install the additional optional dependencies listed in Exporting Plots, and if you are making many plots you might want to consider saving and re-using a webdriver explicitly across calls.
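A sketch of re-using one webdriver, assuming Selenium with headless Chrome and chromedriver on the PATH; the stand-in figures are placeholders for your own plots:
from selenium import webdriver
from bokeh.io.export import get_screenshot_as_png
from bokeh.plotting import figure

plots = [figure(title=f"plot {i}") for i in range(3)]  # stand-in figures

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    for i, plot in enumerate(plots):
        image = get_screenshot_as_png(plot, driver=driver)  # re-use one browser
        image.save(f"plot_{i}.png")
finally:
    driver.quit()  # one browser start-up and tear-down in total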

Seaborn Plot doesn't show up

I am creating a bar chart with seaborn, and it's not generating any sort of error, but nothing happens either.
This is the code I have:
import pandas
import numpy
import matplotlib.pyplot as plt
import seaborn
data = pandas.read_csv('fy15crime.csv', low_memory = False)
seaborn.countplot(x="primary_type", data=data)
plt.xlabel('crime')
plt.ylabel('amount')
seaborn.plt.show()
I added "seaborn.plt.show() in an effort to have it show up, but it isn't working still.
You should place this line somewhere in the top cell in Jupyter to enable inline plotting:
%matplotlib inline
It's simply plt.show(); you were close. No need for seaborn.
I was using PyCharm with a standard Python file, and I had the best luck with the following: move the code to a Jupyter notebook (which you can do inside of PyCharm by right-clicking on the project and choosing New - Jupyter Notebook).
If a chart takes a lot of processing time, it might not have been obvious before, but in Jupyter you can easily see when the cell has finished processing.

Memory overflow when saving Matplotlib plots in a loop

I am using an iterative loop to plot some data using Matplotlib. After the code has saved around 768 plots, it throws the following exception:
RuntimeError: Could not allocate memory for image
My computer has around 3.5 GB RAM.
Is there any method to free the memory in parallel so that the memory does not get exhausted?
Are you remembering to close your figures when you are done with them? e.g.:
import matplotlib.pyplot as plt

fig = plt.figure()
# generate figure here
# ...
plt.close(fig)  # release resources associated with fig
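Applied to the loop from the question, that looks something like this (the data and filenames are placeholders):
import matplotlib.pyplot as plt
import numpy as np

for i in range(768):               # the loop that previously exhausted memory
    fig, ax = plt.subplots()
    ax.plot(np.random.randn(100))  # placeholder data
    fig.savefig(f"plot_{i}.png")
    plt.close(fig)                 # free the figure before the next iteration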
As a slightly different answer, remember that you can re-use figures. Something like:
fig = plt.figure()
ax = plt.gca()
im = ax.imshow(data_list[0], ...)
for new_data in data_list:
    im.set_data(new_data)
    fig.savefig(..)
This will make your code run much faster, as it will not need to set up and tear down the figure 700+ times.
