Cubic spline memory error

Cubic spline memory error - python

On a computer with 4GB of memory this simple interpolation leads to a memory error:
(based on: http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html)
import numpy as np
from scipy.interpolate import interp1d
x = np.linspace(0, 10, 80000)
y = np.cos(-x**2/8.0)
f2 = interp1d(x, y, kind='cubic')
I thought about cutting the data into chunks, but is there a way I can perform this cubic spline interpolation without requiring so much memory?
Why does it even get in trouble?

If you look at the traceback when the error occurs, you'll see something like:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-4-1e538e8d766e> in <module>()
----> 1 f2 = interp1d(x, y, kind='cubic')
/home/warren/local_scipy/lib/python2.7/site-packages/scipy/interpolate/interpolate.py in __init__(self, x, y, kind, axis, copy, bounds_error, fill_value)
390 else:
391 minval = order + 1
--> 392 self._spline = splmake(x, y, order=order)
393 self._call = self.__class__._call_spline
394
/home/warren/local_scipy/lib/python2.7/site-packages/scipy/interpolate/interpolate.py in splmake(xk, yk, order, kind, conds)
1754
1755 # the constraint matrix
-> 1756 B = _fitpack._bsplmat(order, xk)
1757 coefs = func(xk, yk, order, conds, B)
1758 return xk, coefs, order
MemoryError:
The function that is failing is scipy.interpolate._fitpack._bsplmat(order, xk). This function creates a 2-d array of 64-bit floats with shape (len(xk), len(xk) + order - 1). In your case, this is over 51GB.
Instead of interp1d, see if InterpolatedUnivariateSpline works for you. For example,
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline
x = np.linspace(0, 10, 80000)
y = np.cos(-x**2/8.0)
f2 = InterpolatedUnivariateSpline(x, y, k=3)
I don't get a memory error with this.

Related

try to interpolate data from a csv file, getting error 'x and y arrays must be equal in length along interpolation axis'. found solution doesn't work

I'm completely new to python and data science with it, so please bear with me here. I appreciate every help and try to understand it as much as possible. Following code is what I got so far.
import pandas as pd
import numpy as np
from pandas import read_csv
from matplotlib import pyplot
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.interpolate import interp1d
from scipy.interpolate import interp2d
dataset = pd.read_csv(r"C:\Users\...\csv\Test1K.csv", sep=';', skiprows=0)
x = dataset.iloc[0:1420, 0:1].values
y = dataset.iloc[0:1420, 3:4].values
f = interpolate.interp1d(x, y, kind = 'cubic')
xn =270
t_on = f(xn)
print(t_on)
first rows of output from the csv file looks like this:
0 [s] [Celsius] [Celsius] [Celsius] [Celsius] [Celsius]
1 0 22.747 22.893 0.334 22.898 22.413
2 60 22.769 22.902 22.957 22.907 -0.187
3 120 22.78 22.895 25.519 22.911 -2.739
4 180 22.794 22.956 33.62 22.918 -10.827
short thing about what I try to do and where the problem is. I have this csv file, where there is a alot of data in it, with temperature readings every 60 seconds, for like 1400 readings. Now I want to interpolate that, so I can get a specific data point between each 60 seconds and possible even further than the 1400 iterations. (maybe up to 1600)
The first dataset I want, is the third celsius one. The code above is how far I got so far. Now I get the error code
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8008/3128405253.py in <module>
7
8
----> 9 f = interpolate.interp1d(x, y, kind = 'cubic')
10 yn =270 # Wert auf den Interpoliert werden soll
11 t_on = f(yn)
~\AppData\Roaming\Python\Python38\site-packages\scipy\interpolate\interpolate.py in __init__(self, x, y, kind, axis, copy, bounds_error, fill_value, assume_sorted)
434 assume_sorted=False):
435 """ Initialize a 1-D linear interpolation class."""
--> 436 _Interpolator1D.__init__(self, x, y, axis=axis)
437
438 self.bounds_error = bounds_error # used by fill_value setter
~\AppData\Roaming\Python\Python38\site-packages\scipy\interpolate\polyint.py in __init__(self, xi, yi, axis)
52 self.dtype = None
53 if yi is not None:
---> 54 self._set_yi(yi, xi=xi, axis=axis)
55
56 def __call__(self, x):
~\AppData\Roaming\Python\Python38\site-packages\scipy\interpolate\polyint.py in _set_yi(self, yi, xi, axis)
122 shape = (1,)
123 if xi is not None and shape[axis] != len(xi):
--> 124 raise ValueError("x and y arrays must be equal in length along "
125 "interpolation axis.")
126
ValueError: x and y arrays must be equal in length along interpolation axis.
I searched for solutions and got this for example:
x = np.linspace(0, 4, 13)
y = np.linspace(0, 4, 13)
X, Y = np.meshgrid(x, y)
z = np.arccos(-np.cos(2*X) * np.cos(2*Y))
f = interpolate.interp2d(x, y, z, kind = 'cubic')
I read at other problems, that the 2d solution should help, but when I put it like this I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8008/1890924245.py in <module>
14 f = interpolate.interp2d(x, y, z, kind = 'cubic')
15 yn =270 # Wert auf den Interpoliert werden soll
---> 16 t_on = f(yn)
17 print(t_on)
18
TypeError: __call__() missing 1 required positional argument: 'y'
Now I get it, I need something for y, but wouldnt that ruin the whole thing I try to do? My result should be my y and not one of my inputs.
If anyone can help me with my code or even has completely different other solution, I appreciate everthing. Thank you all for the help
Edit: the whole error message

I think your issue is here:
x = dataset.iloc[0:1420, 0:1].values
y = dataset.iloc[0:1420, 3:4].values
The result of a 0:1 slice will be a DataFrame, and you will end up with a two-dimensional array with the shape (1420, 1).
You need a 1d array for interp1d, so you should just do
x = dataset.iloc[0:1420, 0].values
y = dataset.iloc[0:1420, 3].values
The shape of x and y will be (1420,) (1d array).

Interpolate_to_grid returns all nans

Practicing with MetPy Monday interpolate_to_grid for metar data and I successfully got the mslp grid to work.
Moving on to Potential temperature and the result has been all nan. When it "works". When it doesnt work, I get a set of errors that dont appear to help...
import numpy as np
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import matplotlib.pyplot as plt
from siphon.catalog import TDSCatalog
from metpy.io import parse_metar_file
from metpy.interpolate import interpolate_to_grid, remove_nan_observations
from metpy.plots import add_metpy_logo, current_weather, sky_cover, StationPlot
from metpy.calc import wind_components, wet_bulb_temperature, altimeter_to_station_pressure,potential_temperature,gradient
from metpy.units import units
from datetime import datetime,timedelta
import pandas as pd
mapcrs = ccrs.LambertConformal(central_longitude=-100.,central_latitude=35.,standard_parallels=(30.,60.))
datacrs = ccrs.PlateCarree()
cat = TDSCatalog('https://thredds-test.unidata.ucar.edu/thredds/catalog/noaaport/text/metar/catalog.xml')
ds = cat.datasets[-4]
dattim = ds.name[6:14]+' '+ds.name[15:19]
ds.download()
df = parse_metar_file(ds.name)
#pandas dataframe
#df.head()
df.columns.values
extent = [-120,-72,24,50]
df = df.dropna(subset=['latitude','longitude','elevation','altimeter','air_temperature','eastward_wind','northward_wind','air_pressure_at_sea_level','dew_point_temperature'])
lon = df['longitude'].values
lat = df['latitude'].values
stn_ids = df['station_id'].values
elev = df['elevation'].values
altimeter = df['altimeter'].values
t2 = df['air_temperature'].values
mslp = df['air_pressure_at_sea_level'].values
#projected coords
xp, yp, _ = mapcrs.transform_points(datacrs,lon,lat).T # x,y returned
#mslp WORKS
x_masked, y_masked, mslp = remove_nan_observations(xp,yp,mslp)
#altgridx,altgridy,alt = interpolate_to_grid(x_masked,y_masked,alt, interp_type='cressman')
altgridx,altgridy,mslp = interpolate_to_grid(x_masked,y_masked,mslp, interp_type='barnes',gamma=.5,kappa_star=10, hres=25000)
#Potential Temperature doesnt work
pres = altimeter_to_station_pressure(altimeter * units('mbar'), elev * units('m'))*33.8639
print(pres)
# theta
x_masked, y_masked, temp = remove_nan_observations(xp,yp,t2*units('degC'))
x_masked, y_masked, pres = remove_nan_observations(xp,yp,pres)
print(np.size(temp))
potemp = potential_temperature(pres, temp)
print(np.size(potemp))
print(np.unique(np.array(potemp)))
grdx = 75000.
thgridx,thgridy,theta = interpolate_to_grid(x_masked,y_masked, potemp, interp_type='barnes',kappa_star=6, gamma=0.5,hres=grdx)
print(np.shape(thgridx))
print(np.unique(theta))
Here is what is returned from the last section:
[949.361081708803 993.4468013877739 987.2845093729651 ... 1029.0930108008558 1016.002484792407 930.3708063382303] millibar
5837
5837
[236.32885315 237.21299941 239.04372591 ... 368.37047837 369.20079652
370.76269267]
---------------------------------------------------------------------------
DimensionalityError Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pint/quantity.py in __float__(self)
896 return float(self._convert_magnitude_not_inplace(UnitsContainer()))
--> 897 raise DimensionalityError(self._units, "dimensionless")
898
DimensionalityError: Cannot convert from 'kelvin' to 'dimensionless'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
/var/folders/5n/sg5k98bx6gg4flb4fskykh4m0000gn/T/ipykernel_41626/379842406.py in <module>
11
12 grdx = 75000.
---> 13 thgridx,thgridy,theta = interpolate_to_grid(x_masked,y_masked, potemp, interp_type='barnes',kappa_star=6, gamma=0.5,hres=grdx)
14 print(np.shape(thgridx))
15 print(np.unique(theta))
~/miniconda3/lib/python3.7/site-packages/metpy/pandas.py in wrapper(*args, **kwargs)
19 kwargs = {name: (v.values if isinstance(v, pd.Series) else v)
20 for name, v in kwargs.items()}
---> 21 return func(*args, **kwargs)
22 return wrapper
~/miniconda3/lib/python3.7/site-packages/metpy/interpolate/grid.py in interpolate_to_grid(x, y, z, interp_type, hres, minimum_neighbors, gamma, kappa_star, search_radius, rbf_func, rbf_smooth, boundary_coords)
301 minimum_neighbors=minimum_neighbors, gamma=gamma,
302 kappa_star=kappa_star, search_radius=search_radius,
--> 303 rbf_func=rbf_func, rbf_smooth=rbf_smooth)
304
305 return grid_x, grid_y, img.reshape(grid_x.shape)
~/miniconda3/lib/python3.7/site-packages/metpy/interpolate/points.py in interpolate_to_points(points, values, xi, interp_type, minimum_neighbors, gamma, kappa_star, search_radius, rbf_func, rbf_smooth)
365 return inverse_distance_to_points(points, values, xi, search_radius, gamma, kappa,
366 min_neighbors=minimum_neighbors,
--> 367 kind=interp_type)
368
369 # If this is radial basis function, make the interpolator and apply it
~/miniconda3/lib/python3.7/site-packages/metpy/interpolate/points.py in inverse_distance_to_points(points, values, xi, r, gamma, kappa, min_neighbors, kind)
268 img[idx] = cressman_point(dists, values_subset, r)
269 elif kind == 'barnes':
--> 270 img[idx] = barnes_point(dists, values_subset, kappa, gamma)
271
272 else:
ValueError: setting an array element with a sequence.
I struggled with Units, but I think the units are correct now. What could be causing this?
I tried cressman, I tried a larger Barnes grid, and I tried making sure search_radius was large. Still nan, when it worked.

The problem is caused by interpolate_to_grid choking on units when using Cressman or Barnes--which we definitely need to fix. For now the solution is to either use a different interpolation method (like interp_type='linear', the default), or to strip units before calling:
thgridx, thgridy, theta = interpolate_to_grid(x_masked, y_masked, potemp.magnitude,
interp_type='barnes', kappa_star=6, gamma=0.5, hres=grdx)
theta = units.Quantity(theta, 'K')
As far as your problems with NaNs is concerned, you may want to look at the search_radius parameter, which controls the maximum distance from a target point observations are considered. In some data-sparse areas, this could cause you to have some drop-outs. By default, it uses a guess of 5 times the average distance from one ob point to its nearest neighbor.

Python: I keep getting a broadcasting error, but I'm not sure how to fix it

I am attempting to adapt the accepted answer code from this link for my purpose:
Plot gradient arrows over heatmap with plt
I am working on a project that requires me to take a thermal image in the form of a .csv file and then take the data from the .csv file to make arrows (via quiverplot streamplot etc.) that show the direction of the heat flow from the hottest point (highest pixel value) on the image. I think that this could be achieved using the gradient of the image but I am unsure how to implement that.
Here is my code:
import matplotlib.pyplot as plt
import numpy as np
import math
directory = os.chdir(r'user_directory') #Set folder to look in
file = 'data.csv'
data = np.genfromtxt(file, delimiter = ',')
horizontal_min, horizontal_max, horizontal_stepsize = 0, 100, 0.3
vertical_min, vertical_max, vertical_stepsize = 0, 100, 0.5
horizontal_dist = horizontal_max-horizontal_min
vertical_dist = vertical_max-vertical_min
horizontal_stepsize = horizontal_dist / float(math.ceil(horizontal_dist/float(horizontal_stepsize)))
vertical_stepsize = vertical_dist / float(math.ceil(vertical_dist/float(vertical_stepsize)))
xv, yv = np.meshgrid(np.arange(horizontal_min, horizontal_max, horizontal_stepsize),
np.arange(vertical_min, vertical_max, vertical_stepsize))
xv+=horizontal_stepsize/2.0
yv+=vertical_stepsize/2.0
result_matrix = np.asmatrix(data)
yd, xd = np.gradient(result_matrix)
def func_to_vectorize(x, y, dx, dy, scaling=0.01):
plt.arrow(x, y, dx*scaling, dy*scaling), fc="k", ec="k", head_width=0.06,
head_length=0.1)
vectorized_arrow_drawing = np.vectorize(func_to_vectorize)
plt.imshow(np.flip(result_matrix,0), extent=[horizontal_min, horizontal_max, vertical_min,
vertical_max])
vectorized_arrow_drawing(xv, yv, xd, yd, 0.1)
plt.colorbar()
plt.show()
This is the error I'm getting:
ValueError: operands could not be broadcast together with shapes (200,335) (200,335) (100,100) (100,100) ()
EDIT: Traceback Error
ValueError Traceback (most recent call last)
<ipython-input-95-25a8b7e2dff8> in <module>
46
47 plt.imshow(np.flip(result_matrix,0), extent=[horizontal_min,
horizontal_max, vertical_min, vertical_max])
---> 48 vectorized_arrow_drawing(xv, yv, xd, yd, 0.1)
49 plt.colorbar()
50 plt.show()
~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in __call__(self,
*args, **kwargs)
1970 vargs.extend([kwargs[_n] for _n in names])
1971
-> 1972 return self._vectorize_call(func=func, args=vargs)
1973
1974 def _get_ufunc_and_otypes(self, func, args):
~\Anaconda3\lib\site-packages\numpy\lib\function_base.py in
_vectorize_call(self, func, args)
2046 for a in args]
2047
-> 2048 outputs = ufunc(*inputs)
2049
2050 if ufunc.nout == 1:
ValueError: operands could not be broadcast together with shapes (100,168)
(100,168) (100,100) (100,100) ()

You really need xd, yd, xv, yv all to have the same shape (or all broadcastable to the same shape, but functionally that's the same thing here) for vectorize to work. Easiest way to do this is:
xv, yv = np.meshgrid(np.linspace(horizontal_min, horizontal_max, data.shape[0]),
np.linspace(vertical_min, vertical_max, data.shape[1]))
The alternative if you really want more resolution in one direction than another is to use scipy.interpolate.interp2d to interpolate the xd and yd to the dimensions of xv and yv. But that's a lot more complicated.

I'm guessing the error occurs due to this pair of statements:
vectorized_arrow_drawing = np.vectorize(func_to_vectorize)
...
vectorized_arrow_drawing(xv, yv, xd, yd, 0.1)
Even my guess is right, you should post more of the error message.
np.vectorize uses broadcasting to combine values from the inputs, and sends a set of scalar values to func_to_vectorize for each combination.
According to the error, the 5 arguments have shapes:
(200,335) (200,335) (100,100) (100,100) ()
The () array is the scalar value 0.1. That should be ok. But it can't use the (200,335) arrays along with the (100,100) ones. The xv and yv arrays are not compatible with the xd and yd ones.

Result from function not a proper array of floats using optimize.curve_fit

I have recorded experimental temperatures at five locations from the surface of a solid. At every time step, I want to fit these readings to a theoretical curve defined by my function: Temp_Function_JLT(X,h).
X is a multi-dimensional array that includes the x_coordinates as well as time, initial temperature and material properties (all independent variables). "h" is the heat transfer coefficient, which for the purpose of this exercise I'm trying to optimize (leaving the physics aside for a moment.)
This is the definition of my temperature function:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import scipy.optimize as opt
from scipy.special import erfc
def Temp_Function_JLT(X,ht):
# Work around the fact that only one independent variable can be passed to optimize.curve_fit
x,t,T0,q,alpha,rho,c,k = X
term_a = q/ht
term_b = erfc(x/np.sqrt(4*alpha*t))
term_c = np.exp(((ht*x)/(np.sqrt(alpha)*np.sqrt(k*rho*c)))+((ht**2)/(k*rho*c)))
term_d = erfc((ht*np.sqrt(t))/(np.sqrt(k*rho*c)) + (x/np.sqrt(4*alpha*t)))
Temperature = (term_a * (term_b - term_c * term_d)) + T0 - 273
return Temperature
The function works. I can run it with some initial parameters and obtain sensible values. More importantly for this question, if I call it with the following data:
t = 1
x_test = np.linspace(0.004,0.02,5) # TC locations
time_test = range(1,180,30)
T0_test = 25 + 273
q_test = 20000
h_test = 10
I will obtain a numpy array as a solution of shape (1,) which gives an answer to np.ndim of 1 (This has been mentioned in the following previous questions:
Least Linear Squares: scipy.optimize.curve_fit() throws "Result from function call is not a proper array of floats."
Fitting a vector function with curve_fit in Scipy
Fitting a 2D Gaussian function using scipy.optimize.curve_fit - ValueError and minpack.error
The problem arises when I call opt.curve_fit(). indepth_temperatures is a list that contains each test as an array. I iterate over it (to iterate over each test) and then I perform the fit on each row (each time step), according to the following code:
for i,test in enumerate(indepth_temperatures):
# Iterate over every row
for j,row in enumerate(test):
# Define tuple that contains all independent variables
X = (TC_depth,
times[i][j],
T0_temperatures[i] + 273,
20000,
pmma_alpha,
pmma_rho,
pmma_c,
pmma_k)
print(Temp_Function_JLT(X,h0))
print(row)
print('---')
# Call function to optimize curve fit on h
popt, pcov = opt.curve_fit(Temp_Function_JLT,X,row,h0)
print(popt)
For the first iteration, I obtain the following result:
[23.2034 23.2034 23.2034 23.2034 23.2034] # comes from print(Temp_Function_JLT(X,h0))
[23.937 22.619 22.59 24.884 21.987000000000002] # comes from print(row)
Followed by this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-67-9c4545fd257b> in <module>()
22 print('---')
23 # Call function to optimize curve fit on h
---> 24 popt, pcov = opt.curve_fit(Temp_Function_JLT,X,row,h0)
25 print(popt)
~\AppData\Local\Continuum\anaconda2\envs\py36\lib\site-packages\scipy\optimize\minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
749 # Remove full_output from kwargs, otherwise we're passing it in twice.
750 return_full = kwargs.pop('full_output', False)
--> 751 res = leastsq(func, p0, Dfun=jac, full_output=1, **kwargs)
752 popt, pcov, infodict, errmsg, ier = res
753 cost = np.sum(infodict['fvec'] ** 2)
~\AppData\Local\Continuum\anaconda2\envs\py36\lib\site-packages\scipy\optimize\minpack.py in leastsq(func, x0, args, Dfun, full_output, col_deriv, ftol, xtol, gtol, maxfev, epsfcn, factor, diag)
392 with _MINPACK_LOCK:
393 retval = _minpack._lmdif(func, x0, args, full_output, ftol, xtol,
--> 394 gtol, maxfev, epsfcn, factor, diag)
395 else:
396 if col_deriv:
error: Result from function call is not a proper array of floats.
I have tried returning from my function np.ravel(Temperature) or Temperature.flatten() with no luck. The error remains, and I can't figure out why it's there. As I mentioned, I have checked the dimensions of the return of my function and it is a 1D array.
Any help will be greatly appreciated!
UPDATE: I realized it was hard to replicate this code, so this is a simplified version:
Temp_Function_JLT(X,h0): stays the same.
pmma_rho = 1200 # kg/m3
pmma_c = 1500 # J/kgK
pmma_k = 0.16 # W/mK
pmma_alpha = pmma_k/(pmma_rho*pmma_c)
x_test = np.linspace(0.004,0.02,5) # TC locations
t = 1
T0_test = 25 + 273
q_test = 20000
h_test = 10
X = (x_test,t,T0_test,q_test,pmma_alpha,pmma_rho,pmma_c,pmma_k)
y_data = [23.937 22.619 22.59 24.884 21.987000000000002]
opt.curve_fit(Temp_Function_JLT, X, y_data, h_test)

I realized what was wrong with my code. Even though my y_data (row) was defined as a 1-D numpy array, its data type was object. I don't yet understand why this was the cause, but by forcing the data type with np.astype(np.float), opt.curve_fit worked.

ValueError: negative dimensions are not allowed in scikit linear regression CV model with sparse matrices

I recently competed in a kaggle competition and ran into problems trying to run linear CV models from scikit learn. I am aware of a similar question on stack overflow but I can't see how the accepted reply relates to my issue. Any assistance would be greatly appreciated. My code is given below:
train=pd.read_csv(".../train.csv")
test=pd.read_csv(".../test.csv")
data=pd.read_csv(".../sampleSubmission.csv")
from sklearn.feature_extraction.text import TfidfVectorizer
transformer = TfidfVectorizer(max_features=None)
Y=transformer.fit_transform(train.tweet)
Z=transformer.transform(test.tweet)
from sklearn import linear_model
clf = linear_model.RidgeCV()
a=4
b=1
while (a<28):
clf.fit(Y, train.ix[:,a])
pred=clf.predict(Z)
linpred=pd.DataFrame(pred)
data[data.columns[b]]=linpred
b=b+1
a=a+1
print b
The error that I receive is pasted in total below:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-41c31233c15c> in <module>()
1 blah=train.ix[:,a]
----> 2 clf.fit(Y, blah)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
815 gcv_mode=self.gcv_mode,
816 store_cv_values=self.store_cv_values)
--> 817 estimator.fit(X, y, sample_weight=sample_weight)
818 self.alpha_ = estimator.alpha_
819 if self.store_cv_values:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in fit(self, X, y, sample_weight)
722 raise ValueError('bad gcv_mode "%s"' % gcv_mode)
723
--> 724 v, Q, QT_y = _pre_compute(X, y)
725 n_y = 1 if len(y.shape) == 1 else y.shape[1]
726 cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\linear_model\ridge.pyc in _pre_compute(self, X, y)
607 def _pre_compute(self, X, y):
608 # even if X is very sparse, K is usually very dense
--> 609 K = safe_sparse_dot(X, X.T, dense_output=True)
610 v, Q = linalg.eigh(K)
611 QT_y = np.dot(Q.T, y)
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\sklearn\utils\extmath.pyc in safe_sparse_dot(a, b, dense_output)
76 from scipy import sparse
77 if sparse.issparse(a) or sparse.issparse(b):
---> 78 ret = a * b
79 if dense_output and hasattr(ret, "toarray"):
80 ret = ret.toarray()
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site-packages\scipy\sparse\base.pyc in __mul__(self, other)
301 if self.shape[1] != other.shape[0]:
302 raise ValueError('dimension mismatch')
--> 303 return self._mul_sparse_matrix(other)
304
305 try:
D:\Users\soates\AppData\Local\Enthought\Canopy\User\lib\site- packages\scipy\sparse\compressed.pyc in _mul_sparse_matrix(self, other)
518
519 nnz = indptr[-1]
--> 520 indices = np.empty(nnz, dtype=np.intc)
521 data = np.empty(nnz, dtype=upcast(self.dtype,other.dtype))
522
ValueError: negative dimensions are not allowed

It looks like this problem occurs without using sklearn. Its in scipy.sparse matrix multiplication. There is this issue on a scipy-users board: sparse matrix multiplication problem. The crux of the problem is that scipy uses a 32-bit int for non-zero indices during sparse matrix multiplication. That's the marked line at the bottom of the traceback above. That can overflow if there are too many non-zero elements. That overflow causes the variable nnz to become negative. Then the code at the last arrow creates an empty array of size nnz, resulting in a ValueError due to a negative dimension.
You can generate the tail end of the traceback above without sklearn as follows:
import scipy.sparse as ss
X = ss.rand(75000, 42000, format='csr', density=0.01)
X * X.T
For this problem, the input is probably quite sparse, but RidgeCV looks like its multiplying X and X.T in the last part of the traceback within sklearn. That product might not be sparse enough.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cubic spline memory error - python

Related

try to interpolate data from a csv file, getting error 'x and y arrays must be equal in length along interpolation axis'. found solution doesn't work

Interpolate_to_grid returns all nans

Python: I keep getting a broadcasting error, but I'm not sure how to fix it

Result from function not a proper array of floats using optimize.curve_fit

ValueError: negative dimensions are not allowed in scikit linear regression CV model with sparse matrices

Categories

Resources