Pandas/Sklearn gives incorrect MemoryError - python

I'm working in Python 2.7 (Anaconda 4.0) in a Jupyter notebook on an EC2 instance with plenty of memory (60GB, 48GB free according to free). I've loaded a large Pandas (v0.18) dataframe (150K rows, ~30KB per row), but it is nowhere near the memory capacity of the instance, even if many copies are made. Certain Pandas and Scikit-learn (v0.17) calls trigger a MemoryError instantly, e.g.:
#X is a subset of the original df with 60 columns instead of the 3000
#Y is a float column
X.add(Y)
#And for sklearn...
pca = decomposition.KernelPCA(n_components=5)
pca.fit(X,Y)
Meanwhile, these work fine:
Z = X.copy(deep=True)
pca = decomposition.PCA(n_components=5)
Most perplexingly, I can do this and it finishes in a few seconds:
huge = range(1000000000)
I've rebooted the notebook, the kernel, and the instance, but the same calls keep giving the MemoryError. I've also verified that I'm using 64-bit Python. Any suggestions?
Update: adding the traceback errors:
Traceback (most recent call last):
File "<ipython-input-9-ae71777140e2>", line 2, in <module>
Z = X.add(Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/ops.py", line 1057, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3500, in _combine_series
fill_value=fill_value)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3528, in _combine_match_columns
copy=False)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2730, in align
broadcast_axis=broadcast_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4152, in align
fill_axis=fill_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4234, in _align_series
fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3528, in reindex_indexer
fill_tuple=(fill_value,))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3591, in _slice_take_blocks_ax0
fill_value=fill_value))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3621, in _make_na_block
block_values = np.empty(block_shape, dtype=dtype)
MemoryError
and
Traceback (most recent call last):
File "<ipython-input-13-d510bc16443e>", line 3, in <module>
pca.fit(X,Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 202, in fit
K = self._get_kernel(X)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 135, in _get_kernel
filter_params=True, **params)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1347, in pairwise_kernels
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
return func(X, Y, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 716, in linear_kernel
return safe_sparse_dot(X, Y.T, dense_output=True)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
MemoryError

Figured out the Pandas side of the issue. I had a DF and a Series with matching indexes, X and Y. I figured I could add Y as another column like this:
X.add(Y)
But by default this matches Y's index against X's columns, not X's index, thus creating a 150Kx150K array. I needed to supply the axis:
X.add(Y, axis='index')
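The alignment behaviour described above can be reproduced on a small frame. This is a minimal sketch with made-up data; the names `by_columns` and `by_rows` and the row labels are illustrative and not from the original post:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X and Y: 4 rows and 3 columns instead of 150K x 60.
rows = ["r0", "r1", "r2", "r3"]
X = pd.DataFrame(np.ones((4, 3)), index=rows, columns=["a", "b", "c"])
Y = pd.Series([10.0, 20.0, 30.0, 40.0], index=rows)

# Default: Y's index is aligned with X's columns. No labels match, so the
# result has one column per label in the union (3 + 4 = 7), all NaN.
by_columns = X.add(Y)

# axis='index': Y's index is aligned with X's index, one value per row.
by_rows = X.add(Y, axis="index")

print(by_columns.shape)
print(by_rows.shape)
```

With 150K row labels in Y, the default call asks for a frame of roughly 150K rows by 150K columns, which is consistent with the failing `np.empty` allocation in the traceback.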

Related

Plotting in loop hangs matplotlib: strange memory leak?

I have a bunch of timestamp-value pandas.DataFrame objects in a dict, like this:
dfS[k1] = df1
dfS[k2] = df2
...
While plotting to the same axis like this:
dfS[k1].plot(ax=ax1)
dfS[k2].plot(ax=ax1)
...
works, the same in a loop:
for k in dfS.keys():
    dfS[k].plot(ax=ax1)
crashes matplotlib after about 20 secs with the message:
Traceback (most recent call last):
File "testDataDisplay.py", line 66, in <module>
dfS[k].plot(ax=ax)
File "/usr/lib/python3/dist-packages/pandas/plotting/_core.py", line 847, in __call__
return plot_backend.plot(data, kind=kind, **kwargs)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/__init__.py", line 61, in plot
plot_obj.generate()
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 269, in generate
self._post_plot_logic_common(ax, self.data)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 437, in _post_plot_logic_common
self._apply_axis_properties(ax.xaxis, rot=self.rot, fontsize=self.fontsize)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/core.py", line 520, in _apply_axis_properties
labels = axis.get_majorticklabels() + axis.get_minorticklabels()
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1207, in get_majorticklabels
ticks = self.get_major_ticks()
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1378, in get_major_ticks
numticks = len(self.get_majorticklocs())
File "/usr/lib/python3/dist-packages/matplotlib/axis.py", line 1283, in get_majorticklocs
return self.major.locator()
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 988, in __call__
locs = self._get_default_locs(vmin, vmax)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 968, in _get_default_locs
self.plot_obj.date_axis_info = self.finder(vmin, vmax, self.freq)
File "/usr/lib/python3/dist-packages/pandas/plotting/_matplotlib/converter.py", line 588, in _daily_finder
info = np.zeros(
MemoryError: Unable to allocate 45.2 GiB for an array with shape (1617786887,) and data type [('val', '<i8'), ('maj', '?'), ('min', '?'), ('fmt', 'S20')]
Looks like matplotlib interprets a timestamp as a shape, since the total number of datapoints is just 608. Here is some of it for reference:
dfS['pRp:1:avg5min'].head(4)
pRp:1:avg5min
2021-04-07 14:14:30 64.6226
2021-04-07 14:14:35 64.1258
2021-04-07 14:14:40 64.5340
2021-04-07 14:14:45 66.2782
for key in dfS.keys():
    print(key, end=' ')
    print(dfS[key].shape)
pRp:0:avg5min (5, 1)
pRp:0:raw (299, 1)
pRp:1:avg5min (5, 1)
pRp:1:raw (299, 1)
matplotlib.__version__
'3.3.0'
python3 --version
Python 3.8.6
pd.__version__
'1.0.5'
Any suggestion?
This should be a comment rather than an answer, but because Stack Overflow only lets me comment once I have 50 reputation points, I will formulate it as an answer:
It seems that you are loading multiple data frames, each with its own time series and multiple data columns, and thus hitting the memory limit.
When you perform
for k in dfS.keys():
    dfS[k].plot(ax=ax1)
you plot the whole dataframe.
Maybe something like this will help you:
for k in dfS.keys():
    dfS[k].plot(x="columnName", ax=ax1)
Where columnName represents the name of a specific column of the dataframe.
It is really hard to guess the exact problem here, because we don't know what the input data looks like. Could you post a minimal or header version of the data here?

ValueError: There aren't any elements to reflect in axis 0 of `array` when calling librosa.feature.melspectrogram

I'm trying to extract the mel spectrograms for different audio files and for some of them I get the following error:
Traceback (most recent call last):
File "", line 25, in script_process_file
bounds, events, features, RMS = process_file(_file,version=None,output=True,reading=True,cython=False,corr_FIR=None,features_list = features_list_all, tree_function =Tree_4_0_0,data=data, amplify=None)
File "pcm_algorithm/process_file.py", line 111, in process_file
mel_spec = librosa.feature.melspectrogram(sound,n_fft=256,hop_length=128,n_mels=n_mels).T
File "/usr/local/lib/python2.7/dist-packages/librosa/feature/spectral.py", line 1388, in melspectrogram
power=power)
File "/usr/local/lib/python2.7/dist-packages/librosa/core/spectrum.py", line 1179, in _spectrogram
S = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**power
File "/usr/local/lib/python2.7/dist-packages/librosa/core/spectrum.py", line 160, in stft
y = np.pad(y, int(n_fft // 2), mode=pad_mode)
File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraypad.py", line 1420, in pad
" in axis {} of array".format(axis))
ValueError: There aren't any elements to reflect in axis 0 of array
I use 256 FFT points with 128 samples of overlap and 40 mel bands. Any advice would be really helpful.
Here is the exact line of code that gives me an error:
n_mels = 40
mel_spec = librosa.feature.melspectrogram(sound,n_fft=256,hop_length=128,n_mels=n_mels).T
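The traceback ends in `np.pad` with a reflect-style mode, which NumPy refuses to apply to an array with no elements. One possible reading, offered as an assumption because the failing files are not shown, is that `sound` is empty for those files. A minimal sketch using only NumPy (the helper `has_samples` is hypothetical):

```python
import numpy as np

def has_samples(y):
    """Return True if the signal contains at least one sample."""
    return np.asarray(y).size > 0

n_fft = 256
empty = np.array([], dtype=float)      # stand-in for a file that decoded to nothing
signal = np.zeros(1000, dtype=float)   # stand-in for a normal file

# Reflect-padding an empty array raises ValueError, as in the traceback.
try:
    np.pad(empty, n_fft // 2, mode="reflect")
    empty_pad_failed = False
except ValueError:
    empty_pad_failed = True

# A non-empty signal pads without error: n_fft // 2 samples on each side.
padded = np.pad(signal, n_fft // 2, mode="reflect")
print(empty_pad_failed, has_samples(empty), has_samples(signal), padded.shape)
```

Skipping or logging files for which `has_samples(sound)` is false before calling `librosa.feature.melspectrogram` would at least identify which inputs trigger the error.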

Fipy Grid3D 'an index can only have a single ellipsis' error

I am interested in solving differential equations using FiPy.
The following code works correctly when I use Grid2D.
from fipy import *
mesh = Grid2D(nx=3, ny=3)
#mesh = Grid3D(nx=3, ny=3, nz=3)
phi = CellVariable(name='solution variable', mesh=mesh, value=0.)
phi.constrain(0, mesh.facesLeft)
phi.constrain(10, mesh.facesRight)
coeff = CellVariable(mesh=mesh, value=1.)
eq = DiffusionTerm(coeff) == 0
eq.solve(var=phi)
When I use Grid3D instead of Grid2D (commented line), I get the following error:
Traceback (most recent call last):
File "/home/user/Programming/python/fdms/forSo.py", line 11, in <module>
eq.solve(var=phi)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/terms/term.py", line 211, in solve
solver = self._prepareLinearSystem(var, solver, boundaryConditions, dt)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/terms/term.py", line 169, in _prepareLinearSystem
diffusionGeomCoeff=self._getDiffusionGeomCoeff(var),
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/terms/abstractDiffusionTerm.py", line 458, in _getDiffusionGeomCoeff
return self._getGeomCoeff(var)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/terms/term.py", line 465, in _getGeomCoeff
self.geomCoeff = self._calcGeomCoeff(var)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/terms/abstractDiffusionTerm.py", line 177, in _calcGeomCoeff
tmpBop = (coeff * FaceVariable(mesh=mesh, value=mesh._faceAreas) / mesh._cellDistances)[numerix.newaxis, :]
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/variables/variable.py", line 1151, in __mul__
return self._BinaryOperatorVariable(lambda a,b: a*b, other)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/variables/variable.py", line 1116, in _BinaryOperatorVariable
if not v.unit.isDimensionless() or len(v.shape) > 3:
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/variables/variable.py", line 255, in _getUnit
return self._extractUnit(self.value)
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/variables/variable.py", line 538, in _getValue
value = self._calcValue()
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/variables/cellToFaceVariable.py", line 48, in _calcValue
alpha = self.mesh._faceToCellDistanceRatio
File "/home/user/Programs/miniconda2/envs/FipyEnv2/lib/python3.6/site-packages/fipy/meshes/uniformGrid3D.py", line 269, in _faceToCellDistanceRatio
XZdis[..., 0,...] = 1
IndexError: an index can only have a single ellipsis ('...')
I installed FiPy using the «Recommended method» from https://www.ctcms.nist.gov/fipy/INSTALLATION.html. I tried installing with Miniconda for both Python 3.6 and Python 2.7 and got the same error.
How can I solve equations using Grid3D?
This is because newer versions of numpy are less tolerant of our sloppy syntax. You can either check out our develop source branch or make this change to your code.
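The failing line in the traceback, `XZdis[..., 0,...] = 1`, uses two ellipses in one index, which current NumPy rejects. The sketch below reproduces that with a plain NumPy array and shows one unambiguous spelling. The array shape is made up, and the replacement index is an assumption about the intent rather than the actual FiPy patch:

```python
import numpy as np

# Hypothetical 3-D array standing in for FiPy's internal XZdis.
XZdis = np.zeros((3, 4, 3))

# Two ellipses in a single index raise IndexError in current NumPy.
try:
    XZdis[..., 0, ...] = 1
    double_ellipsis_failed = False
except IndexError:
    double_ellipsis_failed = True

# For a 3-D array, explicit slices state which axis is being indexed.
XZdis[:, 0, :] = 1
print(double_ellipsis_failed, XZdis[:, 0, :].sum())
```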

How to use `seaborn` to distplot an array of double value in Python3.6?

I tried to use distplot to plot an array of double values but failed. Below is my source code:
>>> import seaborn as sns, numpy as np
>>> sns.set(); np.random.seed(0)
>>> x = np.random.randn(100)
>>> ax = sns.distplot(x)
Below is the error I got. I don't know what's wrong with my code. Does anyone know the issue?
>>> ax = sns.distplot(x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 221, in distplot
kdeplot(a, vertical=vertical, ax=ax, color=kde_color, **kde_kws)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 604, in kdeplot
cumulative=cumulative, **kwargs)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 270, in _univariate_kdeplot
cumulative=cumulative)
File "/anaconda/lib/python3.6/site-packages/seaborn/distributions.py", line 328, in _statsmodels_univariate_kde
kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py", line 146, in fit
clip=clip, cut=cut)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kde.py", line 506, in kdensityfft
f = revrt(zstar)
File "/anaconda/lib/python3.6/site-packages/statsmodels/nonparametric/kdetools.py", line 20, in revrt
y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
TypeError: slice indices must be integers or None or have an __index__ method
BTW, I am using python3.6.
This is caused by an old version of statsmodels and the problem is fixed in version 0.8.0. Upgrade it as described in https://github.com/mwaskom/seaborn/issues/1092
conda update statsmodels
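The last frame of the traceback shows why an older statsmodels fails on Python 3: `m/2` is true division there and yields a float, which cannot be used as a slice index. A minimal sketch of the difference, using a made-up array rather than the statsmodels internals:

```python
import numpy as np

X = np.arange(8.0)
m = len(X)

# Python 3: m / 2 is 4.0 (a float), so slicing with it raises TypeError.
try:
    X[:m / 2 + 1]
    float_slice_failed = False
except TypeError:
    float_slice_failed = True

# Floor division keeps the index an integer on both Python 2 and 3.
head = X[:m // 2 + 1]
print(float_slice_failed, head)
```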

matplotlib.scatter color argument not accepting numpy array

I have written the following python plotting script using matplotlib:
import pynbody as pyn
import numpy as np
import matplotlib.pyplot as plt
import glob
s = pyn.load('./ballsV2.00001')
sl = s.g[np.where((s.g['z'] < 0.005) & (s.g['z']>-0.005))]
sx = s.s['x'][0]
sy = s.s['y'][0]
sz = s.s['z'][0]
r2 = ((s.g['x']-sx)**2+(s.g['y']-sy)**2+(s.g['z']-sz)**2)
Flux = np.array(1./(4*np.pi*r2)*np.exp(-1*7.00114988051*np.sqrt(r2)))
print(type(np.log10(sl['radFlux'])))
print(type(np.log10(Flux)))
plt.figure(figsize = (15,12))
#plt.scatter(sl['x'],sl['y'],c=np.log10(sl['radFlux']),s=75,edgecolors='none', marker = '.',vmin=-6,vmax=1)
plt.scatter(sl['x'],sl['y'],c=np.log10(Flux),s=75,edgecolors='none', marker = '.',vmin=-8,vmax=4)
plt.xlim([-0.5,0.5])
plt.ylim([-0.5,0.5])
plt.xlabel("x")
plt.ylabel("y")
plt.colorbar(label="log(Code Flux)")
plt.savefig('./ballsV2_0.1.pdf')
plt.savefig('./ballsV2_0.1.png')
plt.show()
plt.close()
When I run the script I get the following error:
foo#bar ~/Data/RadTransfer/Scaling_Tests/ballsV2 $ py balls.py
balls.py:15: RuntimeWarning: divide by zero encountered in log10
print(type(np.log10(sl['radFlux'])))
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
Traceback (most recent call last):
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/colors.py", line 141, in to_rgba
rgba = _colors_full_map.cache[c, alpha]
KeyError: (-4.1574455411341349, None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/colors.py", line 192, in _to_rgba_no_colorcycle
c = tuple(map(float, c))
TypeError: 'numpy.float64' object is not iterable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "balls.py", line 17, in <module>
plt.scatter(sl['x'],sl['y'],c=np.log10(Flux),s=75,edgecolors='none', marker = '.',vmin=-8,vmax=4)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/pyplot.py", line 3435, in scatter
edgecolors=edgecolors, data=data, **kwargs)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1892, in inner
return func(ax, *args, **kwargs)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/axes/_axes.py", line 4028, in scatter
alpha=alpha
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/collections.py", line 890, in __init__
Collection.__init__(self, **kwargs)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/collections.py", line 139, in __init__
self.set_facecolor(facecolors)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/collections.py", line 674, in set_facecolor
self._set_facecolor(c)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/collections.py", line 659, in _set_facecolor
self._facecolors = mcolors.to_rgba_array(c, self._alpha)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/colors.py", line 237, in to_rgba_array
result[i] = to_rgba(cc, alpha)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/colors.py", line 143, in to_rgba
rgba = _to_rgba_no_colorcycle(c, alpha)
File "/home/grondjj/anaconda3/lib/python3.6/site-packages/matplotlib/colors.py", line 194, in _to_rgba_no_colorcycle
raise ValueError("Invalid RGBA argument: {!r}".format(orig_c))
ValueError: Invalid RGBA argument: -4.1574455411341349
Ignore the divide-by-zero warning; the issue is that the scatter plot function isn't taking my array of values to map colour to. What is strange is that the commented-out scatter plot command above it runs fine. The only difference is the array of values I am passing it. I made sure to cast them to the same type (they are both <class 'numpy.ndarray'>). Also, the values themselves are more sane, ranging between ~4000 and 1E-7 in the Flux array; it is only np.log10(sl['radFlux']) that has the divide-by-zero errors, and that one works. Any suggestions?
Flux and np.log10(sl['radFlux']) ended up being different lengths. sl (a slice of s) was not used to compute r2, so Flux ended up being too big. It would be nice if matplotlib checked that the color array was the same length as the scatter x and y arrays and raised an error like it does when the x and y arrays are different lengths.
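The mismatch described above can be illustrated with synthetic arrays in place of the pynbody snapshot. In this sketch all names and data are made up; the point is only that the colour values must be computed from the same selection as the plotted coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic particle coordinates standing in for s.g['x'], s.g['y'], s.g['z'].
x, y, z = rng.uniform(-0.5, 0.5, size=(3, 1000))

# Selection of a thin slab around z = 0, analogous to `sl` in the script.
mask = np.abs(z) < 0.005

# Wrong: r2 from the full set gives one colour value per particle, not per
# selected particle, so its length differs from x[mask].
r2_full = x**2 + y**2 + z**2

# Right: apply the same mask before computing the colour values.
r2_slab = x[mask]**2 + y[mask]**2 + z[mask]**2
flux = 1.0 / (4 * np.pi * r2_slab)

# A cheap guard before calling plt.scatter(x[mask], y[mask], c=...).
assert len(flux) == len(x[mask]), "colour array must match the x/y length"
print(len(r2_full), len(flux), mask.sum())
```

In the original script the equivalent change would be to compute r2 from sl rather than from s.g, so that Flux has one entry per plotted point.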
