I am trying to quickly calculate the correlation matrix of a large dataset (745 rows x 18,048 columns) in Python 3. The dataset was originally read from a 50 MB netCDF file, and after some manipulation it came to this size. All of the data is stored as float32s, so I calculated that the final correlation matrix should be around 1.2 GB, which should easily fit into my 8 GB of RAM. Using a pandas DataFrame and its methods, the entire correlation matrix can be calculated in around 100 minutes, so the calculation itself is feasible.
I read up on the dask module and decided to implement it. However, when I try to calculate using the same method, it almost immediately runs into a MemoryError, even though it should fit into memory. After some fiddling I realized it even fails on a relatively small 1000 x 1000 dataset. Is there something else that's going on underneath that is causing this error? I have posted my code below:
import dask.dataframe as ddf
# prepare data in dataframe called df
daskdf = ddf.from_pandas(df, chunksize=1000)
correlation = daskdf.corr()
correlation.compute()
And here's the error trace:
Traceback (most recent call last):
File "C:/Users/mughi/Documents/College Stuff/Project DIVA/Preprocessing Data/DaskCorr.py", line 36, in <module>
correlation.compute()
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 94, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 201, in compute
results = get(dsk, keys, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\threaded.py", line 76, in get
**kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 500, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 266, in execute_task
result = _execute_task(task, data)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 247, in _execute_task
return func(*args2)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\dataframe\core.py", line 3283, in cov_corr_chunk
keep = np.bitwise_and(mask[:, None, :], mask[:, :, None])
Thank you!
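A rough way to see why memory might run out: the last traceback frame broadcasts mask[:, None, :] against mask[:, :, None], which yields a boolean array of shape (rows, cols, cols) per chunk. The sketch below only estimates that size. The helper name broadcast_mask_bytes is made up for illustration, and whether this intermediate is the sole cause of the MemoryError is an assumption based on the traceback.

```python
import numpy as np

def broadcast_mask_bytes(n_rows, n_cols):
    """Bytes needed for a boolean array of shape (n_rows, n_cols, n_cols)."""
    return n_rows * n_cols * n_cols * np.dtype(bool).itemsize

# Tiny demonstration that the broadcast has shape (rows, cols, cols).
mask = np.ones((4, 3), dtype=bool)
keep = np.bitwise_and(mask[:, None, :], mask[:, :, None])
print(keep.shape)  # (4, 3, 3)

# Estimated size of the same intermediate for the shapes in the question.
print(broadcast_mask_bytes(1000, 1000) / 1e9, "GB for a 1000 x 1000 chunk")
print(broadcast_mask_bytes(745, 18048) / 1e9, "GB for 745 x 18,048 in one chunk")
```

Under this reading, the cost grows with the square of the column count, so a smaller chunksize (which only splits rows) reduces the intermediate only linearly.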
I downloaded data from NOAA and wanted to calculate vertical velocity using the function vertical_velocity = mpcalc.vertical_velocity(omega, pressure, temperature), but something goes wrong when dealing with the units of the variables.
import xarray as xr
import metpy.calc as mpcalc
omega=xr.open_dataset('D:\\data_english\\jwk\\omega.mon.mean.nc')
temperature=xr.open_dataset('D:\\data_english\\jwk\\air.mon.mean.nc')
height=xr.open_dataset('D:\\data_english\\jwk\\hgt.mon.mean.nc')
pressure=mpcalc.height_to_pressure_std(height['hgt'])
verticalwind=mpcalc.vertical_velocity(omega['omega'], pressure, temperature['air'])
Traceback (most recent call last):
File "<ipython-input-194-da22b63a1943>", line 1, in <module>
verticalwind=mpcalc.vertical_velocity(omega['omega'], pressure, temperature['air'])
File "D:\anaconda\lib\site-packages\metpy\xarray.py", line 1199, in wrapper
_mutate_arguments(bound_args, xr.DataArray, lambda arg, _: arg.metpy.unit_array)
File "D:\anaconda\lib\site-packages\metpy\xarray.py", line 1233, in _mutate_arguments
bound_args.arguments[arg_name] = mutate_arg(arg_val, arg_name)
File "D:\anaconda\lib\site-packages\metpy\xarray.py", line 1199, in <lambda>
_mutate_arguments(bound_args, xr.DataArray, lambda arg, _: arg.metpy.unit_array)
File "D:\anaconda\lib\site-packages\metpy\xarray.py", line 157, in unit_array
return units.Quantity(self._data_array.data, self.units)
File "D:\anaconda\lib\site-packages\metpy\xarray.py", line 134, in units
return units.parse_units(self._data_array.attrs.get('units', 'dimensionless'))
File "D:\anaconda\lib\site-packages\pint\registry.py", line 1084, in parse_units
units = self._parse_units(input_string, as_delta)
File "D:\anaconda\lib\site-packages\pint\registry.py", line 1298, in _parse_units
return super()._parse_units(input_string, as_delta)
File "D:\anaconda\lib\site-packages\pint\registry.py", line 1112, in _parse_units
cname = self.get_name(name)
File "D:\anaconda\lib\site-packages\pint\registry.py", line 636, in get_name
raise UndefinedUnitError(name_or_alias)
UndefinedUnitError: 'Pascal' is not defined in the unit registry
The units of omega, height and temperature are 'Pascal/s', 'm' and 'degC', respectively. The variable pressure was calculated with the function mpcalc.height_to_pressure_std, and this function didn't give the unit of pressure. But the values of pressure range from 1000 to 0, so I think its unit is 'hPa'.
The error reports that "'Pascal' is not defined in the unit registry". Maybe 'Pascal/s' is not the default unit of omega? How can I find out which units are defined in the unit registry? Can anyone help me? Thanks!
This is a problem where the unit library MetPy uses (Pint) does not have the same rules about capitalization/case sensitivity as the UDUnits format used by the netCDF Climate and Forecasting Conventions for metadata. Fixing this is on MetPy's todo list, but some roadblocks have been encountered.
The work-around right now is to change your units to something that Pint understands, like:
omega['omega'].attrs['units'] = 'pascal / s'
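If several variables carry the same kind of unit string, the rename can be wrapped in a small helper. This is only a sketch: fix_units_attr and the UNIT_FIXES table are hypothetical names, the table contains just the spelling from the question, and a plain dict stands in for omega['omega'].attrs so the example runs without the data files.

```python
# Hypothetical lookup table: unit spellings Pint rejects -> spellings it accepts.
UNIT_FIXES = {"Pascal/s": "pascal / s", "Pascal": "pascal"}

def fix_units_attr(attrs, fixes=UNIT_FIXES):
    """Return a copy of an attrs mapping with a known-bad 'units' string replaced."""
    fixed = dict(attrs)
    units = fixed.get("units")
    if units in fixes:
        fixed["units"] = fixes[units]
    return fixed

# A plain dict stands in for omega['omega'].attrs here.
attrs = {"units": "Pascal/s", "long_name": "omega"}
print(fix_units_attr(attrs))
```

With the real dataset, the result could presumably be written back with omega['omega'].attrs.update(fix_units_attr(omega['omega'].attrs)).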
I'm running a SARIMAX model but running into problems with specifying the exogenous variables. In the first block of code (below) I specify one exogenous variable lesdata['LESpost'] and the model runs without a problem. However, when I add in another exogenous variable I end up with an error message (see stack trace).
ar = (1,0,1) # AR(1 3)
ma = (0) # No MA terms
mod1 = sm.tsa.statespace.SARIMAX(lesdata['emadm'], exog= (lesdata['LESpost'],lesdata['QOF']), trend='c', order=(ar,0,ma), mle_regression=True)
Traceback (most recent call last):
File "<ipython-input-129-d1300aeaeffc>", line 4, in <module>
mle_regression=True)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\statespace\sarimax.py", line 510, in __init__
endog, exog=exog, k_states=k_states, k_posdef=k_posdef, **kwargs
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\statespace\mlemodel.py", line 84, in __init__
missing='none')
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\tsa\base\tsa_model.py", line 43, in __init__
super(TimeSeriesModel, self).__init__(endog, exog, missing=missing)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 212, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 63, in __init__
**kwargs)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 88, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\data.py", line 630, in handle_data
**kwargs)
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\data.py", line 80, in __init__
self._check_integrity()
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\data.py", line 496, in _check_integrity
super(PandasData, self)._check_integrity()
File "C:\Users\danie\Anaconda2\lib\site-packages\statsmodels\base\data.py", line 403, in _check_integrity
raise ValueError("endog and exog matrices are different sizes")
ValueError: endog and exog matrices are different sizes
Is there something obvious I am missing here? The variables are all of the same length and there are no missing data.
Thanks for reading and hope you can help !
Two-dimensional data needs to have observations in rows and variables in columns after applying numpy.asarray.
exog = (lesdata['LESpost'],lesdata['QOF'])
Applying asarray to this tuple puts the variables in rows, which is the numpy default inherited from its C origin and is not what statsmodels wants.
DataFrames are already shaped in the appropriate way, so one option is to use a DataFrame with the desired columns:
exog = lesdata[['LESpost', 'QOF']]
Another option for lists or tuples of array_likes is to use numpy.column_stack, e.g.
exog = np.column_stack((lesdata['LESpost'].values,lesdata['QOF'].values))
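The shape difference between the three forms can be checked directly. The values below are made up and the DataFrame only mimics the column names from the question. It is a sketch of the shapes, not of the model fit.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's data: 5 observations, 2 exogenous variables.
lesdata = pd.DataFrame({"LESpost": [0, 0, 1, 1, 1], "QOF": [0, 1, 1, 0, 1]})

# Tuple of Series: asarray puts each variable in a row.
as_tuple = np.asarray((lesdata["LESpost"], lesdata["QOF"]))
# DataFrame with two columns: observations in rows, variables in columns.
as_frame = np.asarray(lesdata[["LESpost", "QOF"]])
# column_stack: same layout as the DataFrame.
stacked = np.column_stack((lesdata["LESpost"].values, lesdata["QOF"].values))

print(as_tuple.shape)  # (2, 5)
print(as_frame.shape)  # (5, 2)
print(stacked.shape)   # (5, 2)
```

Only the last two have as many rows as the endog series, which is what the size check in the traceback compares.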
I am trying to read a csv file and apply k-means algorithm to identify the groups of the elements.
My code is this:
import csv
import numpy as np
import scipy as sp
from sklearn import cluster as sk
print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
I use genfromtxt because there are some missing values and with this statement I can bypass these.
For the moment I would like to see the full return of the k_means function but I get
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
warnings.warn("Mean of empty slice.", RuntimeWarning)
/anaconda/lib/python3.6/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
File "ejercicio2.py", line 6, in <module>
print(sk.k_means(np.genfromtxt('keywords.csv', delimiter=' ')[:,:0],3))
File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 345, in k_means
x_squared_norms=x_squared_norms, random_state=random_state)
File "/anaconda/lib/python3.6/site-packages/sklearn/cluster/k_means_.py", line 388, in _kmeans_single_elkan
X = check_array(X, order="C")
File "/anaconda/lib/python3.6/site-packages/sklearn/utils/validation.py", line 424, in check_array
context))
ValueError: Found array with 0 feature(s) (shape=(3312, 0)) while a minimum of 1 is required.
You are passing all the rows but no columns by writing [:, :0], hence the error. You probably want to pass all the rows and all the columns; in that case, just remove the slice from that line. In general the syntax is:
data[x:y, a:b]
which means rows from x up to y (exclusive) and columns from a up to b (exclusive).
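A small made-up array shows the effect of these slices. The name data and its values are for illustration only.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns, values 0..11

empty = data[:, :0]      # all rows, columns 0 up to 0 (exclusive): no columns
block = data[1:3, 0:2]   # rows 1 and 2, columns 0 and 1

print(empty.shape)  # (4, 0)
print(block)        # [[3 4]
                    #  [6 7]]
print(data.shape)   # (4, 3)
```

The (4, 0) shape here corresponds to the (3312, 0) shape reported in the error message.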
My regression model using statsmodels in Python works with 48,065 lines of data, but while adding new data I have tracked down one line of data that produces a singular matrix error. Answers to similar questions seem to suggest missing data, but I have checked and there is nothing visibly irregular in the error-prone row. Does anyone know whether this is an error in my code, or know of a fix? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error with Problematic line within the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal, because all observations where this is 1 also have the dependent variable is_success equal to 1. That means there is no variation in the outcome, because is_own_goal already specifies that it is a success.
As a consequence, we cannot estimate a coefficient for is_own_goal; the coefficient is not identified by the data. The variance of the coefficient would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating point precision, with some computational noise the Hessian might be invertible and the Singular Matrix exception would not show up. That, I guess, is the reason it works with some but not all observations.
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and be just a consequence of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM which has Binomial (i.e. Logit) as a family.
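The separation described above can be looked for before fitting. The sketch below flags dummy variables whose value 1 always coincides with a single outcome value. The function name separated_dummies and the eight rows of data are made up for illustration, and the check only covers this one-sided case, not every form of separation.

```python
import pandas as pd

# Made-up data mimicking the column names from the question.
data = pd.DataFrame({
    "is_success":  [0, 1, 0, 1, 1, 0, 1, 0],
    "is_own_goal": [0, 0, 0, 1, 1, 0, 0, 0],
    "is_header":   [1, 0, 1, 0, 1, 0, 1, 0],
})

def separated_dummies(df, outcome, dummies):
    """Return dummy columns whose value 1 always goes with one outcome value."""
    flagged = []
    for col in dummies:
        outcomes_when_set = df.loc[df[col] == 1, outcome]
        # No variation in the outcome whenever the dummy is switched on.
        if len(outcomes_when_set) > 0 and outcomes_when_set.nunique() == 1:
            flagged.append(col)
    return flagged

print(separated_dummies(data, "is_success", ["is_own_goal", "is_header"]))
# ['is_own_goal']
```

A flagged column is a candidate for being dropped from the formula or handled with a penalized fit.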
I'm getting a ZeroDivisionError from the following code:
#stacking the array into a complex array allows np.unique to choose
#truly unique points. We also keep a handle on the unique indices
#to allow us to index `self` in the same order.
unique_points,index = np.unique(xdata[mask]+1j*ydata[mask],
return_index=True)
#Now we break it into the data structure we need.
points = np.column_stack((unique_points.real,unique_points.imag))
xx1,xx2 = self.meta['rcm_xx1'],self.meta['rcm_xx2']
yy1 = self.meta['rcm_yy2']
gx = np.arange(xx1,xx2+dx,dx)
gy = np.arange(-yy1,yy1+dy,dy)
GX,GY = np.meshgrid(gx,gy)
xi = np.column_stack((GX.ravel(),GY.ravel()))
gdata = griddata(points,self[mask][index],xi,method='linear',
fill_value=np.nan)
Here, xdata,ydata and self are all 2D numpy.ndarrays (or subclasses thereof) with the same shape and dtype=np.float32. mask is a 2d ndarray with the same shape and dtype=bool. Here's a link for those wanting to peruse the scipy.interpolate.griddata documentation.
Originally, xdata and ydata are derived from a non-uniform cylindrical grid that has a 4 point stencil -- I thought that the error might be coming from the fact that the same point was defined multiple times, so I made the set of input points unique as suggested in this question. Unfortunately, that hasn't seemed to help. The full traceback is:
Traceback (most recent call last):
File "/xxxxxxx/rcm.py", line 428, in <module>
x[...,1].to_pz0()
File "/xxxxxxx/rcm.py", line 285, in to_pz0
fill_value=fill_value)
File "/usr/local/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.py", line 183, in griddata
ip = LinearNDInterpolator(points, values, fill_value=fill_value)
File "interpnd.pyx", line 192, in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:2935)
File "qhull.pyx", line 996, in scipy.spatial.qhull.Delaunay.__init__ (scipy/spatial/qhull.c:6607)
File "qhull.pyx", line 183, in scipy.spatial.qhull._construct_delaunay (scipy/spatial/qhull.c:1919)
ZeroDivisionError: float division
For what it's worth, the code "works" (No exception) if I use the "nearest" method.
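For readers unfamiliar with the complex-stacking trick used in the snippet above, here is a minimal standalone sketch of that deduplication step with made-up coordinates. It only illustrates how np.unique with return_index behaves on complex input. It does not reproduce or explain the ZeroDivisionError.

```python
import numpy as np

# Made-up coordinates; the point (0, 0) appears twice.
x = np.array([0.0, 1.0, 0.0, 2.0], dtype=np.float32)
y = np.array([0.0, 1.0, 0.0, 3.0], dtype=np.float32)

# Encode each (x, y) pair as one complex number so np.unique compares pairs.
unique_points, index = np.unique(x + 1j * y, return_index=True)

# Split back into an (n_points, 2) array as expected by griddata.
points = np.column_stack((unique_points.real, unique_points.imag))

print(points.shape)  # (3, 2)
print(index)         # [0 1 3]
```

The index array holds the position of the first occurrence of each unique point, which is what allows the values array to be subset in the same order.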