I have a dictionary called Rsize that has numbers as keys and lists as values. The dictionary looks like this:
{10: [0.6621485767296484, 0.6610747762560114, 0.659607022086639, 0.6567761845867727, 0.6535392433801197, 0.6485977028504701, 0.6393024556394106, 0.6223866436257335, 0.5999232392636733, 0.5418403536642005, 0.4961461379219235, 0.4280278015788386, 0.35462315989740956, 0.2863017237662875, 0.2312185739351389, 0.18306363413831017], 12: [0.6638977494825118, 0.663295576452323, 0.662262804664348, 0.6610413916318628, 0.6590939627030634, 0.655212304186114, 0.6492141689834672, 0.6380632834031537, 0.6096663492242224, 0.5647498006858608, 0.4983281599318278, 0.3961350546063216, 0.32119092575707087, 0.2257230704567207, 0.1816695139119151, 0.14363448808684576], 14: [0.6649598494971014, 0.6644370245269158, 0.6638578972784479, 0.6630511299276417, 0.6615070373022596, 0.6596206155163766, 0.6560628158033714, 0.6487119276511941, 0.6343385358239866, 0.5792725000508062, 0.49799837531709923, 0.42482204326408324, 0.26633662071414366, 0.2028085235063155, 0.12411214668987203, 0.09336935548451253]}
The keys are 10, 12, and 14. I have plotted each list and want to find their pairwise intersection points. I have written the following script for that, using shapely's intersection function to detect the intersection points.
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import LineString
Rsize={10: [0.6621485767296484, 0.6610747762560114, 0.659607022086639, 0.6567761845867727, 0.6535392433801197, 0.6485977028504701, 0.6393024556394106, 0.6223866436257335, 0.5999232392636733, 0.5418403536642005, 0.4961461379219235, 0.4280278015788386, 0.35462315989740956, 0.2863017237662875, 0.2312185739351389, 0.18306363413831017], 12: [0.6638977494825118, 0.663295576452323, 0.662262804664348, 0.6610413916318628, 0.6590939627030634, 0.655212304186114, 0.6492141689834672, 0.6380632834031537, 0.6096663492242224, 0.5647498006858608, 0.4983281599318278, 0.3961350546063216, 0.32119092575707087, 0.2257230704567207, 0.1816695139119151, 0.14363448808684576], 14: [0.6649598494971014, 0.6644370245269158, 0.6638578972784479, 0.6630511299276417, 0.6615070373022596, 0.6596206155163766, 0.6560628158033714, 0.6487119276511941, 0.6343385358239866, 0.5792725000508062, 0.49799837531709923, 0.42482204326408324, 0.26633662071414366, 0.2028085235063155, 0.12411214668987203, 0.09336935548451253]}
listkT = np.arange(4.0,4.8,0.05)
print(Rsize[10])
plt.figure(figsize=(18, 10))
plt.title('Binder cumulant for critical point')
plt.plot(listkT, Rsize[10], '-', label='Lattice size 10')
plt.plot(listkT, Rsize[12], '-', label='Lattice size 12')
plt.plot(listkT, Rsize[14], '-', label='Lattice size 14')
plt.legend()
plt.show()
curve_10 = LineString(np.column_stack((listkT, Rsize[10])))
curve_12 = LineString(np.column_stack((listkT, Rsize[12])))
curve_14 = LineString(np.column_stack((listkT, Rsize[14])))
intersection12 = curve_10.intersection(curve_12)
intersection14 = curve_10.intersection(curve_14)
plt.plot(*LineString(intersection12).xy, 'o')
plt.plot(*LineString(intersection14).xy, 'o')
x12, y = LineString(intersection12).xy
x14, y = LineString(intersection14).xy
print(np.intersect1d(x12, x14))
print(x12,x14)
But shapely throws an AssertionError.
File "C:\Users\Endeavour\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\Endeavour\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "E:/Project/Codes/3D.py", line 118, in <module>
plt.plot(*LineString(intersection12).xy, 'o')
File "C:\Users\Endeavour\Anaconda3\lib\site-packages\shapely\geometry\linestring.py", line 48, in __init__
self._set_coords(coordinates)
File "C:\Users\Endeavour\Anaconda3\lib\site-packages\shapely\geometry\linestring.py", line 97, in _set_coords
ret = geos_linestring_from_py(coordinates)
File "shapely/speedups/_speedups.pyx", line 87, in shapely.speedups._speedups.geos_linestring_from_py
AssertionError
The plots are drawn correctly by matplotlib though.
I am using shapely for the first time and have no prior experience with it. Any help will be much appreciated. Thank you.
Note: The final goal is to get the intersection of the three curves. If no intersection is found, the point where they come closest is good enough. Any suggestion or library function to find that would be of great help.
Thank you in advance.
Following the assertion error, I checked shapely/speedups/_speedups.pyx, line 87. The geos_linestring_from_py function expects you to pass either a LineString or a LinearRing. When I print your intersection12 and intersection14, I get:
POINT (4.503201814825258 0.4917840919384173)
POINT (4.51830999373466 0.4712012116887737)
So you are passing a Point instance to create a LineString, which raises the AssertionError.
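A minimal way to avoid the error itself, assuming each pairwise intersection really is a single Point as printed above, is to read the coordinates off the Point directly instead of wrapping it in a LineString:
# intersection12 and intersection14 are Point objects here,
# so use their x/y attributes rather than building a LineString from them.
plt.plot(intersection12.x, intersection12.y, 'o')
plt.plot(intersection14.x, intersection14.y, 'o')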
Aside from the error you have, your approach is also wrong because it assumes that (1) there will be multiple intersections between two curves, and (2) there will be one exact point where all three curves intersect. If you zoom into your plot, you can see that neither is the case.
The red circle corresponds to your intersection12 and the purple one to intersection14. If you are looking for an approximate solution, taking the mean of these points may help in this situation, but for more complex curves with multiple intersections per pair it is not recommended.
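As a rough sketch of that idea (assuming both pairwise intersections are single Points, as they are here), the mean of the two crossings gives an approximate common point:
# Approximate the common crossing as the mean of the pairwise intersections.
approx_x = (intersection12.x + intersection14.x) / 2.0
approx_y = (intersection12.y + intersection14.y) / 2.0
print(approx_x, approx_y)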
I am trying to quickly calculate the correlation matrix of a large dataset (745 rows x 18,048 columns) in Python 3. The dataset was originally read from a 50 MB netCDF file and after some manipulation it came to this size. All of the data is stored as float32. I therefore calculated that the final correlation matrix should be around 1.2 GB, which should easily fit into my 8 GB of RAM. Using pandas' DataFrame and its methods, the entire correlation matrix can be calculated in around 100 minutes, so the computation is at least feasible.
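For reference, a quick back-of-the-envelope check of that size estimate (my own arithmetic, not from the original post):
n_cols = 18048
bytes_per_value = 4                      # float32, as in the estimate above
corr_bytes = n_cols * n_cols * bytes_per_value
print(corr_bytes / 2**30)                # roughly 1.2 GiB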
I read up on the dask module and decided to implement it. However, when I try to calculate using the same method, it almost immediately runs into a MemoryError, even though it should fit into memory. After some fiddling I realized it even fails on a relatively small 1000 x 1000 dataset. Is there something else that's going on underneath that is causing this error? I have posted my code below:
import dask.dataframe as ddf
# prepare data in dataframe called df
daskdf = ddf.from_pandas(df, chunksize=1000)
correlation = daskdf.corr()
correlation.compute()
And here's the error trace:
Traceback (most recent call last):
File "C:/Users/mughi/Documents/College Stuff/Project DIVA/Preprocessing Data/DaskCorr.py", line 36, in <module>
correlation.compute()
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 94, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\base.py", line 201, in compute
results = get(dsk, keys, **kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\threaded.py", line 76, in get
**kwargs)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 500, in get_async
raise(remote_exception(res, tb))
dask.async.MemoryError:
Traceback
---------
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 266, in execute_task
result = _execute_task(task, data)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\async.py", line 247, in _execute_task
return func(*args2)
File "C:\Users\mughi\AppData\Local\Programs\Python\Python36\lib\site-packages\dask\dataframe\core.py", line 3283, in cov_corr_chunk
keep = np.bitwise_and(mask[:, None, :], mask[:, :, None])
Thank you!
My regression model using statsmodels in Python works with 48,065 lines of data, but while adding new data I have tracked down one line of data that produces a singular matrix error. Answers to similar questions seem to suggest missing data, but I have checked and there is nothing visibly irregular about the error-prone row, which is causing me major issues. Does anyone know if this is an error in my code, or know a solution to fix it? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error produced when the problematic line is included in the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal, because all observations where this is 1 also have the dependent variable is_success equal to 1. That means there is no variation in the outcome, because is_own_goal already specifies that it is a success.
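A quick way to verify this (my own check, not part of the original answer) is to cross-tabulate the two columns:
import pandas as pd

data = pd.read_csv("Data2.csv")
# If every row with is_own_goal == 1 also has is_success == 1,
# the outcome is perfectly predicted for that group.
print(pd.crosstab(data['is_own_goal'], data['is_success']))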
As a consequence, we cannot estimate a coefficient for is_own_goal; the coefficient is not identified by the data. The variance of the coefficient would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating point precision, with some computational noise the Hessian might be invertible and the singular matrix exception would not show up, which, I guess, is the reason it works with some but not all observations.
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and be just a consequence of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM which has Binomial (i.e. Logit) as a family.
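As a sketch of that elastic net alternative (assuming statsmodels >= 0.8; the penalty settings here are only illustrative):
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Penalized GLM with a Binomial family, i.e. a penalized Logit.
mod = smf.glm(formula, data=data, family=sm.families.Binomial())
res = mod.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=0.5)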
I'm getting a ZeroDivisionError from the following code:
# Imports assumed by this excerpt (it comes from inside a larger module):
import numpy as np
from scipy.interpolate import griddata

#stacking the array into a complex array allows np.unique to choose
#truly unique points. We also keep a handle on the unique indices
#to allow us to index `self` in the same order.
unique_points,index = np.unique(xdata[mask]+1j*ydata[mask],
return_index=True)
#Now we break it into the data structure we need.
points = np.column_stack((unique_points.real,unique_points.imag))
xx1,xx2 = self.meta['rcm_xx1'],self.meta['rcm_xx2']
yy1 = self.meta['rcm_yy2']
gx = np.arange(xx1,xx2+dx,dx)
gy = np.arange(-yy1,yy1+dy,dy)
GX,GY = np.meshgrid(gx,gy)
xi = np.column_stack((GX.ravel(),GY.ravel()))
gdata = griddata(points,self[mask][index],xi,method='linear',
fill_value=np.nan)
Here, xdata,ydata and self are all 2D numpy.ndarrays (or subclasses thereof) with the same shape and dtype=np.float32. mask is a 2d ndarray with the same shape and dtype=bool. Here's a link for those wanting to peruse the scipy.interpolate.griddata documentation.
Originally, xdata and ydata are derived from a non-uniform cylindrical grid that has a 4 point stencil -- I thought that the error might be coming from the fact that the same point was defined multiple times, so I made the set of input points unique as suggested in this question. Unfortunately, that hasn't seemed to help. The full traceback is:
Traceback (most recent call last):
File "/xxxxxxx/rcm.py", line 428, in <module>
x[...,1].to_pz0()
File "/xxxxxxx/rcm.py", line 285, in to_pz0
fill_value=fill_value)
File "/usr/local/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.py", line 183, in griddata
ip = LinearNDInterpolator(points, values, fill_value=fill_value)
File "interpnd.pyx", line 192, in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:2935)
File "qhull.pyx", line 996, in scipy.spatial.qhull.Delaunay.__init__ (scipy/spatial/qhull.c:6607)
File "qhull.pyx", line 183, in scipy.spatial.qhull._construct_delaunay (scipy/spatial/qhull.c:1919)
ZeroDivisionError: float division
For what it's worth, the code "works" (No exception) if I use the "nearest" method.
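For readers unfamiliar with the complex-stacking trick used in the snippet above, here is a small standalone sketch of the same idea (made-up sample points, not the original data):
import numpy as np

# Two coordinate arrays with a repeated (x, y) pair.
xdata = np.array([0.0, 1.0, 1.0, 2.0])
ydata = np.array([0.0, 3.0, 3.0, 4.0])

# Packing x + 1j*y lets np.unique deduplicate (x, y) pairs in one call.
unique_points, index = np.unique(xdata + 1j*ydata, return_index=True)
points = np.column_stack((unique_points.real, unique_points.imag))
print(points)   # the repeated (1.0, 3.0) appears only once
print(index)    # index of the first occurrence of each unique point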
For a subplot (self.intensity), I want to shade the area under the graph.
I tried this, hoping it was the correct syntax:
self.intensity.fill_between(arange(l,r), 0, projection)
My intent is to shade the area under the projection numpy array between the integer limits (l, r).
But it gives me an error. How do I do it correctly?
Here's the traceback:
Traceback (most recent call last):
File "/usr/lib/pymodules/python2.7/matplotlib/backends/backend_wx.py", line 1289, in _onLeftButtonDown
FigureCanvasBase.button_press_event(self, x, y, 1, guiEvent=evt)
File "/usr/lib/pymodules/python2.7/matplotlib/backend_bases.py", line 1576, in button_press_event
self.callbacks.process(s, mouseevent)
File "/usr/lib/pymodules/python2.7/matplotlib/cbook.py", line 265, in process
proxy(*args, **kwargs)
File "/usr/lib/pymodules/python2.7/matplotlib/cbook.py", line 191, in __call__
return mtd(*args, **kwargs)
File "/root/dev/spectrum/spectrum/plot_handler.py", line 55, in _onclick
self._call_click_callback(event.xdata)
File "/root/dev/spectrum/spectrum/plot_handler.py", line 66, in _call_click_callback
self.__click_callback(data)
File "/root/dev/spectrum/spectrum/plot_handler.py", line 186, in _on_plot_click
band_data = self._band_data)
File "/root/dev/spectrum/spectrum/plot_handler.py", line 95, in draw
self.intensity.fill_between(arange(l,r), 0, projection)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 6457, in fill_between
raise ValueError("Argument dimensions are incompatible")
ValueError: Argument dimensions are incompatible
It seems like you are trying to fill only the part of the projection from l to r. fill_between expects the x and y arrays to be of equal length, so you cannot fill only part of the curve this way.
To get what you want, you can do either of the following:
1. Send only the part of the projection that needs to be filled to the command, and draw the rest of the projection separately.
2. Send a separate boolean array as the where argument that defines the sections to fill in (see the documentation and the second sketch below).
For the former method, see the example code below:
from pylab import *
a = subplot(111)
t = arange(1, 100)/50.
projection = sin(2*pi*t)
# Draw the original curve
a.plot(t, projection)
# Define areas to fill in
l, r = 10, 50
# Fill the areas
a.fill_between(t[l:r], projection[l:r])
show()
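For the latter method, here is a sketch using fill_between's where argument on the same data (the boolean mask below simply reproduces the l:r range used above):
from pylab import *

a = subplot(111)
t = arange(1, 100)/50.
projection = sin(2*pi*t)
# Draw the original curve
a.plot(t, projection)
# Fill only where the boolean mask is True
l, r = 10, 50
idx = arange(len(t))
a.fill_between(t, 0, projection, where=(idx >= l) & (idx < r))
show()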