Plotting in a loop hangs matplotlib: strange memory leak?

I have a bunch of timestamp-value pandas.DataFrames in a dict, like this:
dfS[k1] = df1
dfS[k2] = df2
...
While plotting to the same axis like this:
dfS[k1].plot(ax=ax1)
dfS[k2].plot(ax=ax1)
...
works, the same in a loop:
for k in dfS.keys():
    dfS[k].plot(ax=ax1)
crashes matplotlib after about 20 seconds with the message:
Traceback (most recent call last):
File "testDataDisplay.py", line 66, in <module>
dfS[k].plot(ax=ax)
File "/usr/lib/python3/dist-packages/pandas/plotting/_core.py", line 847, in __call__
return plot_backend.plot(data, kind=kind, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/__init__.py", line 61, in plot
plot_obj.generate()
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 269, in generate
self._post_plot_logic_common(ax, self.data)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 437, in _post_plot_logic_common
self._apply_axis_properties(ax.xaxis, rot=self.rot, fontsize=self.fontsize)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 520, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1207, in get_majorticklabels
ticks = self.get_major_ticks()
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1378, in get_major_ticks
numticks = len(self.get_majorticklocs())
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1283, in get_majorticklocs
return self.major.locator()
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 988, in __call__
locs = self._get_default_locs(vmin, vmax)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 968, in _get_default_locs
self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 588, in _daily_finder
info = np.zeros(
MemoryError: Unable to allocate 45.2 GiB for an array with shape (1617786887,) and data type [('val', '<i8'), ('maj', '?'), ('min', '?'), ('fmt', 'S20')]
It looks like matplotlib is interpreting a timestamp as an array length (1617786887 seconds since the epoch is 2021-04-07, the date in the data), while the total number of datapoints is just 608. Here is some of it for reference:
dfS['pRp:1:avg5min'].head(4)
pRp:1:avg5min
2021-04-07 14:14:30 64.6226
2021-04-07 14:14:35 64.1258
2021-04-07 14:14:40 64.5340
2021-04-07 14:14:45 66.2782
for key in dfS.keys():
    print(key, end=' ')
    print(dfS[key].shape)
pRp:0:avg5min (5, 1)
pRp:0:raw (299, 1)
pRp:1:avg5min (5, 1)
pRp:1:raw (299, 1)
matplotlib.__version__
'3.3.0'
python3 --version
Python 3.8.6
pd.__version__
'1.0.5'
Any suggestions?

This should really be a comment rather than an answer - but because Stack Overflow requires 50 reputation points to comment, I will formulate it as an answer:
It seems that you are loading multiple data frames, each with its own time series and multiple data columns, and thus hitting the memory limit.
When you perform
for k in dfS.keys():
    dfS[k].plot(ax=ax1)
you plot the whole dataframe.
Maybe something like this will help you:
for k in dfS.keys():
    dfS[k].plot(x="columnName", ax=ax1)
Where columnName represents the name of a specific column of the dataframe.
It is really hard to guess the exact problem here, because we don't know what the input data looks like - maybe you could post a minimal or header version of the data content here.
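Given that the failing allocation in _daily_finder has a length equal to a Unix timestamp, pandas' period-based date locator is the code path to suspect. A hedged sketch of a possible workaround, untested against this data: pandas' x_compat=True plotting flag hands date ticking over to matplotlib and skips the converter path that fails in the traceback above.
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
# x_compat=True bypasses pandas' period-based tick locator (the
# _daily_finder code path in the traceback) in favour of matplotlib's
# own date handling.
for k in dfS.keys():
    dfS[k].plot(ax=ax1, x_compat=True)
plt.show()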

Related

ValueError: x and y must have same first dimension, but have shapes (13500,) and (13476,) Despite working with other Dataset?

I was trying to plot an Excel dataset; one dataset plots fine, but with the second one I get an error message.
My code:
from matplotlib import pyplot as plt
from xlrd import open_workbook
x_time = list()
y_absorbance = list()
book = open_workbook("SP1.xls")
sheet = book.sheet_by_index(0) #data is on sheet 1
column1 = sheet.col_values(0)
column2 = sheet.col_values(1)
for x in column1:
    try:
        num = float(x)
        time = num/1000000
        x_time.append(time)
    except:
        continue
for y in column2:
    try:
        num = float(y)
        absorbance = num/1000000
        y_absorbance.append(absorbance)
    except:
        continue
plt.plot(x_time, y_absorbance)
plt.title("Final Analysis native R5 Main Pool", fontname="Times New Roman",fontweight="bold", size=20)
plt.xlabel("run time [min]", fontname="Times New Roman")
plt.ylabel("Absorbance [mAU]", fontname="Times New Roman")
plt.legend(("UV-VIS 214 nm",), loc="upper right")
plt.show()
Error message:
Traceback (most recent call last):
File "/Users/nico/PycharmProjects/Exercise/HPLC_plot.py", line 29, in <module>
plt.plot(x_time, y_absorbance)
File "/Users/nico/PycharmProjects/venv/lib/python3.9/site-packages/matplotlib/pyplot.py", line 2840, in plot
return gca().plot(
File "/Users/nico/PycharmProjects/venv/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 1743, in plot
lines = [*self._get_lines(*args, data=data, **kwargs)]
File "/Users/nico/PycharmProjects/venv/lib/python3.9/site-packages/matplotlib/axes/_base.py", line 273, in __call__
yield from self._plot_args(this, kwargs)
File "/Users/nico/PycharmProjects/venv/lib/python3.9/site-packages/matplotlib/axes/_base.py", line 399, in _plot_args
raise ValueError(f"x and y must have same first dimension, but "
ValueError: x and y must have same first dimension, but have shapes (13500,) and (13476,)
The data is an excel sheet and looks like this:
https://i.stack.imgur.com/npJDy.png
The other dataset plots without problems, so I know how the plot should look.
I do not know what I am doing wrong, since the first dataset works and the second dataset comes from the same HPLC software, just from another run.
Any suggestions are appreciated in advance! :)
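One likely cause, judging from the code alone: the two bare try/except loops skip unparseable cells independently, so if column 2 contains a few more non-numeric cells (headers, blanks) than column 1, the lists end up with different lengths - here 13500 vs 13476. A sketch that keeps the pairs aligned, reusing the file name and column layout from the question:
from matplotlib import pyplot as plt
from xlrd import open_workbook

book = open_workbook("SP1.xls")
sheet = book.sheet_by_index(0)  # data is on sheet 1

x_time, y_absorbance = [], []
# Walk both columns together and keep a row only when *both* cells
# parse as numbers, so the two lists can never drift apart.
for x, y in zip(sheet.col_values(0), sheet.col_values(1)):
    try:
        t = float(x) / 1000000
        a = float(y) / 1000000
    except (TypeError, ValueError):
        continue
    x_time.append(t)
    y_absorbance.append(a)

plt.plot(x_time, y_absorbance)
plt.show()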

Memory/xarray/dask bug when trying to load an xarray dataset composed of dask array chunks into memory?

First time posting a question, so don't hesitate to point out additions/corrections if needed. Also, I am not sure whether this is really a bug; if you think it is, I will redirect my question to the xarray/dask GitHub. I am also quite a newbie with xarray - I come from MATLAB but am trying to transition to Python. So here it is: I am opening two years of hourly data with xarray and subsetting it as per the code below:
ds=xr.open_mfdataset('F:/supdude/datatest/*', combine='by_coords')
ds=ds.sel(longitude=slice(-145+360,-52+360), latitude=slice(70,41))
Once done, the dataset looks like this:
<xarray.Dataset>
Dimensions: (latitude: 117, longitude: 373, time: 17544)
Coordinates:
* longitude (longitude) float32 215.0 215.25 215.5 ... 307.5 307.75 308.0
* latitude (latitude) float32 70.0 69.75 69.5 69.25 ... 41.5 41.25 41.0
* time (time) datetime64[ns] 1979-01-01 ... 1980-12-31T23:00:00
Data variables:
d2m (time, latitude, longitude) float32 dask.array<chunksize=(8760, 117, 373), meta=np.ndarray>
I then proceed with simple resampling methods to bring it down to a daily basis, and it comes down to this (I've got two xarray datasets like this):
<xarray.Dataset>
Dimensions: (latitude: 117, longitude: 373, time: 731)
Coordinates:
* time (time) datetime64[ns] 1979-01-01 1979-01-02 ... 1980-12-31
* longitude (longitude) float32 215.0 215.25 215.5 ... 307.5 307.75 308.0
* latitude (latitude) float32 70.0 69.75 69.5 69.25 ... 41.5 41.25 41.0
Data variables:
d2m (time, latitude, longitude) float32 dask.array<chunksize=(1, 117, 373), meta=np.ndarray>
The problem comes here. I changed my resampling workflow to make it faster, and I want to compare the results of the new method against the previous one to make sure it yields the right values. To do that I need to load my xarray datasets into memory so I can access the data in the variable "d2m". To do so I use:
ds.load() / ds.compute()
When running that, I get the following memory error (full traceback at the end):
MemoryError: Unable to allocate 17.0 GiB for an array with shape (8784, 721, 1440) and data type >i2
This seems quite off, since that is the size of a yearly file for the whole world. After resampling my datasets should be about 500 MB per year (1 GB total), and as you can see above, the full shape after resampling is (117, 373, 731), so I am not sure why ".load()" fails with the shape (8784, 721, 1440).
My computer has only 16 GB of RAM, so I tried a computer with 64 GB (where I should technically be able to open both years, at 17 GB per year, directly in memory without dask), but when I tried ".load()" there, it filled up the whole 64 GB of RAM and crashed that computer too. For now this is only a test; in the near future I'll have to deal with MUCH larger datasets, so I can't simply open these without dask on the 64 GB machine.
Further tests showed that this problem occurs only when I have already used ".load()" on an array (i.e. I want to load both arrays to compare them; in a fresh kernel the first one loads and adds ~1 GB to my RAM, and when I try to load the second one I get the memory error). I don't know a lot about dask, and none of the issues I found online were directly related to what I am getting here - maybe something about the scheduler, but I'm not sure. Any ideas??
Full traceback:
Traceback (most recent call last):
File "<ipython-input-42-88724a7ac66e>", line 1, in <module>
new.compute()
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\dataset.py", line 807, in compute
return new.load(**kwargs)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\dataset.py", line 651, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\base.py", line 437, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\threaded.py", line 84, in get
**kwargs
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 486, in get_async
raise_exception(exc, tb)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 316, in reraise
raise exc
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\local.py", line 222, in execute_task
result = _execute_task(task, data)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 118, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\core.py", line 119, in _execute_task
return func(*args2)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\dask\array\core.py", line 106, in getter
c = np.asarray(c)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 491, in __array__
return np.asarray(self.array, dtype=dtype)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 653, in __array__
return np.asarray(self.array, dtype=dtype)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
return np.asarray(array[self.key], dtype=None)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
return self.func(self.array)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 218, in _scale_offset_decoding
data = np.array(data, dtype=dtype, copy=True)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 72, in __array__
return self.func(self.array)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\coding\variables.py", line 138, in _apply_mask
data = np.asarray(data, dtype=dtype)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 557, in __array__
return np.asarray(array[self.key], dtype=None)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 73, in __getitem__
key, self.shape, indexing.IndexingSupport.OUTER, self._getitem
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\core\indexing.py", line 837, in explicit_indexing_adapter
result = raw_indexing_method(raw_key.tuple)
File "C:\Users\Psybot\Anaconda3\lib\site-packages\xarray\backends\netCDF4_.py", line 85, in _getitem
array = getitem(original_array, key)
File "netCDF4\_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__
File "netCDF4\_netCDF4.pyx", line 5335, in netCDF4._netCDF4.Variable._get
MemoryError: Unable to allocate 17.0 GiB for an array with shape (8784, 721, 1440) and data type >i2
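One possible avenue, sketched under assumptions and not verified against these files: the traceback ends inside netCDF4 while decoding a whole (8784, 721, 1440) variable, which suggests each dask task is reading and unpacking an entire yearly file rather than a slab of it. Passing explicit chunks to open_mfdataset forces smaller reads; the chunk size below is a guess to be tuned.
import xarray as xr

# Explicit time chunks so each dask task decodes only a slab of the
# netCDF variable instead of a whole (time, 721, 1440) year at once.
ds = xr.open_mfdataset('F:/supdude/datatest/*',
                       combine='by_coords',
                       chunks={'time': 744})  # ~1 month of hourly steps
ds = ds.sel(longitude=slice(-145 + 360, -52 + 360),
            latitude=slice(70, 41))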

How to use `seaborn` to distplot an array of double values in Python 3.6?

I tried to use distplot to plot an array of double values but failed. Below is my source code:
>>> import seaborn as sns, numpy as np
>>> sns.set(); np.random.seed(0)
>>> x = np.random.randn(100)
>>> ax = sns.distplot(x)
Below is the error I got. I don't know what's wrong with my code. Does anyone know the issue?
>>> ax = sns.distplot(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 221, in distplot
kdeplot(a, vertical=vertical, ax=ax, color=kde_color, **kde_kws)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 604, in kdeplot
cumulative=cumulative, **kwargs)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 270, in _univariate_
kdeplot
cumulative=cumulative)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 328, in _statsmodels
_univariate_kde
kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py", line 146, in fit
clip=clip, cut=cut)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py", line 506, in kden
sityfft
f = revrt(zstar)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kdetools.py", line 20, in
revrt
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
TypeError: slice indices must be integers or None or have an __index__ method
BTW, I am using python3.6.
This is caused by an old version of statsmodels; the problem is fixed in version 0.8.0. Upgrade it as described in https://github.com/mwaskom/seaborn/issues/1092:
conda update statsmodels
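For context, the failing statsmodels line slices with m/2, and under Python 3 the / operator returns a float, which is not a valid slice index; the fixed version uses floor division. A minimal demonstration of the difference:
# Python 3 rejects float slice bounds; Python 2's m/2 was an int.
X = list(range(10))
m = len(X)
try:
    X[:m / 2 + 1]
except TypeError as e:
    print(e)  # slice indices must be integers or None or ...

# Floor division returns an int and works on both versions:
print(X[:m // 2 + 1])  # [0, 1, 2, 3, 4, 5]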

Pandas/Sklearn gives incorrect MemoryError

I'm working in Python 2.7 (Anaconda 4.0) on a Jupyter notebook on an EC2 instance with plenty of memory (60 GB, 48 GB free according to free). I've loaded a Pandas (v0.18) dataframe that is large (150K rows, ~30 KB per row) but is nowhere near the memory capacity of the instance, even if many copies were made. Certain Pandas and Scikit-learn (v0.17) calls trigger a MemoryError instantly, e.g.:
#X is a subset of the original df with 60 columns instead of the 3000
#Y is a float column
X.add(Y)
#And for sklearn...
pca = decomposition.KernelPCA(n_components=5)
pca.fit(X,Y)
Meanwhile, these work fine:
Z = X.copy(deep=True)
pca = decomposition.PCA(n_components=5)
Most perplexingly, I can do this and it finishes in a few seconds:
huge = range(1000000000)
I've rebooted the notebook, the kernel, and the instance, but the same calls keep giving the MemoryError. I've also verified that I'm using 64-bit Python. Any suggestions?
Update: adding the traceback errors:
Traceback (most recent call last):
File "<ipython-input-9-ae71777140e2>", line 2, in <module>
Z = X.add(Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/ops.py", line 1057, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3500, in _combine_series
fill_value=fill_value)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3528, in _combine_match_columns
copy=False)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2730, in align
broadcast_axis=broadcast_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4152, in align
fill_axis=fill_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4234, in _align_series
fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3528, in reindex_indexer
fill_tuple=(fill_value,))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3591, in _slice_take_blocks_ax0
fill_value=fill_value))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3621, in _make_na_block
block_values = np.empty(block_shape, dtype=dtype)
MemoryError
and
Traceback (most recent call last):
File "<ipython-input-13-d510bc16443e>", line 3, in <module>
pca.fit(X,Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 202, in fit
K = self._get_kernel(X)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 135, in _get_kernel
filter_params=True, **params)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1347, in pairwise_kernels
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
return func(X, Y, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 716, in linear_kernel
return safe_sparse_dot(X, Y.T, dense_output=True)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
MemoryError
Figured out the Pandas side of the issue. I had a DF and a Series with matching indexes, X and Y. I figured I could add Y as another column like this:
X.add(Y)
But doing this tries to match Y on the columns, not on the index, thus creating a 150Kx150K array. I needed to supply the axis:
X.add(Y, axis='index')
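A tiny, self-contained illustration of that alignment behaviour (made-up frames standing in for the 150K-row originals):
import numpy as np
import pandas as pd

X = pd.DataFrame(np.ones((4, 3)), columns=list('abc'))
Y = pd.Series([10, 20, 30, 40])

# X.add(Y) aligns Y's index against X's *columns*: the union of
# {'a','b','c'} and {0,1,2,3} gives an all-NaN 4x7 result, and with
# 150K row labels on both sides it becomes a 150K x 150K allocation.
print(X.add(Y).shape)          # (4, 7)

# Aligning on the index broadcasts Y down the rows as intended.
print(X.add(Y, axis='index'))  # 4x3 frame: rows of 11, 21, 31, 41
The sklearn error looks like a separate issue: that traceback ends in linear_kernel, and KernelPCA builds an n_samples x n_samples kernel matrix, which for 150K rows is roughly 150,000^2 * 8 bytes, about 168 GiB, regardless of how few columns X has.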

How to make grouper and axis the same length?

For my assignment I'm supposed to plot the tracks of 20 hurricanes on a map using matplotlib. However, when I run my code I get the error: AssertionError: Grouper and axis must be same length
Here's the code I have:
import numpy as np
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from PIL import *
fig = plt.figure(figsize=(12,12))
ax = fig.add_axes([0.1,0.1,0.8,0.8])
m = Basemap(llcrnrlon=-100.,llcrnrlat=0.,urcrnrlon=-20.,urcrnrlat=57.,
            projection='lcc',lat_1=20.,lat_2=40.,lon_0=-60.,
            resolution ='l',area_thresh=1000.)
m.bluemarble()
m.drawcoastlines(linewidth=0.5)
m.drawcountries(linewidth=0.5)
m.drawstates(linewidth=0.5)
# Creates parallels and meridians
m.drawparallels(np.arange(10.,35.,5.),labels=[1,0,0,1])
m.drawmeridians(np.arange(-120.,-80.,5.),labels=[1,0,0,1])
m.drawmapboundary(fill_color='aqua')
# Opens data file
import pandas as pd
name = [ ]
df = pd.read_csv('louisianastormb.csv')
for name, group in df.groupby([name]):
    latitude = group.lat.values
    longitude = group.lon.values
    x,y = m(longitude, latitude)
    plt.plot(x,y,'y-',linewidth=2 )
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('20 Hurricanes with Landfall in Louisiana')
plt.savefig('20hurpaths.jpg', dpi=100)
Here's the full error output:
Traceback (most recent call last):
File "/home/darealmzd/lstorms.py", line 31, in <module>
for name, group in df.groupby([name]):
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 186, in groupby
squeeze=squeeze)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 533, in groupby
return klass(obj, by, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 197, in __init__
level=level, sort=sort)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1325, in _get_grouper
ping = Grouping(group_axis, gpr, name=name, level=level, sort=sort)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1129, in __init__
self.grouper = _convert_grouper(index, grouper)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.py", line 1350, in _convert_grouper
raise AssertionError('Grouper and axis must be same length')
AssertionError: Grouper and axis must be same length
ValueError: Grouper and axis must be same length
This can occur if you are using double brackets in the groupby argument.
(I posted this since it is the top result on Google).
The problem is that you're grouping by (effectively) a list containing an empty list ([[]]), because you have name = [] earlier and then wrap it in another list.
If you want to group on a single column (called 'HurricaneName'), you should do something like:
for name, group in df.groupby('HurricaneName'):
However, if you want to group on multiple columns, then you need to pass a list:
for name, group in df.groupby(['HurricaneName', 'Year']):
If you want to put it in a variable like you have, you can do it like this:
col_name = 'State'
for name, group in df.groupby([col_name]):
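A minimal, self-contained repro of the working pattern (made-up hurricane rows, purely for illustration):
import pandas as pd

df = pd.DataFrame({
    'HurricaneName': ['Katrina', 'Katrina', 'Rita'],
    'lat': [25.0, 26.1, 24.3],
    'lon': [-80.1, -81.5, -82.0],
})

# Grouping by a column *name* (a string, or a list of strings) works;
# df.groupby([name]) with name = [] - i.e. df.groupby([[]]) - is what
# raised "Grouper and axis must be same length".
for name, group in df.groupby('HurricaneName'):
    print(name, group[['lat', 'lon']].values.tolist())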
Try iloc to make the grouper the same length as the axis.
Example, for the case where the axis has length 3:
sns.boxplot(x=df['pH-binned'].iloc[0:3], y=v_count, data=df)
