Issue with csv2rec and polyfit - python

I am doing what I thought would be a simple regression on my data, but something is wrong. I use csv2rec to read my data, but when I print the regression parameters m and b I get nan nan.
In case you want to preview the csv file here is some of it:
"Oxide","ooh","oh",
"MoO",3.06,0.01,
"IrO",2.79,-0.23,
What I want is a regression on the two columns, with x = a.oh and y = a.ooh.
Here is the script I am using:
import matplotlib
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from pylab import polyfit
a = mlab.csv2rec('rutilecsv.csv')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('E_OH / eV', fontsize=12)
ax.set_ylabel('E_OOH / eV', fontsize=12)
(m, b) = polyfit(a.oh, a.ooh, 1)
print m, b
ax.plot(a.oh, a.ooh, 'go')
plt.axis([-2, 3, 1, 6])
plt.show()

Okay, just to put this to bed, this is exactly the symptom you'd get if there were missing data:
"Oxide","ooh","oh",
"MoO",3.06,0.01,
"IrO",2.79,-0.23,
"ZZ",2.79,,
results in
In [7]: a.ooh
Out[7]: array([ 3.06, 2.79, 2.79])
In [8]: a.oh
Out[8]: array([ 0.01, -0.23, nan])
In [9]: polyfit(a.oh, a.ooh, 1)
Out[9]: array([ nan, nan])
If you want to ignore the missing data, you can simply pass polyfit only the points where both values exist:
In [15]: good_data = ~(numpy.isnan(a.oh) | numpy.isnan(a.ooh))
In [16]: good_data
Out[16]: array([ True, True, False], dtype=bool)
In [17]: a.oh[good_data]
Out[17]: array([ 0.01, -0.23])
In [18]: a.ooh[good_data]
Out[18]: array([ 3.06, 2.79])
In [19]: polyfit(a.oh[good_data], a.ooh[good_data], 1)
Out[19]: array([ 1.125 , 3.04875])
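For reference, here is the same fix as a self-contained script (the small arrays below are a hypothetical stand-in for the values read from the CSV, so it runs without the original file):
import numpy as np
# stand-in for the two CSV columns; the trailing nan simulates a missing cell
ooh = np.array([3.06, 2.79, 2.79])
oh = np.array([0.01, -0.23, np.nan])
# keep only the rows where both columns have real values
good = ~(np.isnan(oh) | np.isnan(ooh))
m, b = np.polyfit(oh[good], ooh[good], 1)
print(m, b)  # 1.125 3.04875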

Two things to check:
Are the values converted properly?
Try a['oh'] and a['ooh'] to access the vectors, and maybe use the names option to specify the column names when reading the file in.
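As a quick sanity check of both points (using numpy.genfromtxt here as a stand-in for csv2rec, so the details are illustrative rather than the original call), you can confirm that the columns came through as floats and are accessible by name:
import io
import numpy as np
# simplified in-memory copy of the data, without the quotes and trailing commas
text = "Oxide,ooh,oh\nMoO,3.06,0.01\nIrO,2.79,-0.23\n"
a = np.genfromtxt(io.StringIO(text), delimiter=',', names=True, dtype=None, encoding=None)
print(a.dtype.names)       # ('Oxide', 'ooh', 'oh')
print(a['ooh'], a['oh'])   # both float arrays; a missing cell would show up as nan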

Related

How to create a function that loops through a numpy matrix to z-scale each data point, returning the standardized data

How do I create a function that loops through a numpy matrix to z-scale each data point, returning the standardized data, just like sklearn.preprocessing.StandardScaler does? I have got this far with no success. Can somebody help me with this?
def stand_scaler(data):
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    for i in range(len(data)):
        data[i] = (data[i] - mean)/std
    return data

stand_scaler(data)
You shouldn't need a for-loop for this; numpy's array operations are intended for exactly this case. For a one dimensional array it's straightforward:
In [1]: import numpy as np
In [2]: x = np.random.normal(size=10)
In [3]: nx = (x - x.mean()) / x.std()
In [4]: x
Out[4]:
array([ 0.52700345, -0.57358563, -0.16925383, 2.14401554, 1.05223331,
0.72659482, 1.06816826, 0.31194848, 0.04004589, 1.09046925])
In [5]: nx
Out[5]:
array([-0.12859083, -1.62209992, -1.0734181 , 2.06570881, 0.58415071,
0.14225641, 0.60577458, -0.42042233, -0.78939654, 0.63603721])
In [6]: nx.mean()
Out[6]: 5.551115123125783e-17
In [7]: nx.std()
Out[7]: 1.0000000000000002
For higher dimensions, you can choose an axis to work over, and scale by using numpy's broadcasting; e.g., in this case, imagine each column is a different variable:
In [8]: y = np.array([10,1]) * np.random.normal(size=(5,2)) - np.array([5,-10])
In [9]: ny = (y - y.mean(axis=0)) / y.std(axis=0)
In [10]: ny
Out[10]:
array([[ 0.78076062, -0.26971997],
[-1.59591909, -1.2409338 ],
[-0.55740483, -0.81901609],
[ 1.22978416, 1.12697814],
[ 0.14277914, 1.20269171]])
In [11]: ny.mean(axis=0), ny.std(axis=0)
Out[11]: (array([-3.33066907e-17, 8.43769499e-16]), array([1., 1.]))
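Putting that together, a vectorized replacement for the original function might look like the sketch below (the axis argument and the float copy via asarray are additions of mine; axis=0 scales each column, which is what StandardScaler does):
import numpy as np

def stand_scaler(data, axis=0):
    # z-scale along the given axis without a Python loop
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=axis, keepdims=True)
    std = data.std(axis=axis, keepdims=True)
    return (data - mean) / std

# example: standardize each column of a 5x2 matrix
scaled = stand_scaler(np.random.normal(size=(5, 2)))
print(scaled.mean(axis=0), scaled.std(axis=0))  # ~0 and ~1 per column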

efficient numpy array creation

Given x, I want to produce x, log(x) as a numpy array, where x has shape s and the result has shape (*s, 2). What's the neatest way to do this? x may just be a float, in which case I want a result with shape (2,).
An ugly way to do this is:
import numpy as np
x = np.asarray(x)
result = np.empty((*x.shape, 2))
result[..., 0] = x
result[..., 1] = np.log(x)
It's important to separate aesthetics from performance. Sometimes ugly code is
fast. In fact, that's the case here. Although creating an empty array and then
assigning values to slices may not look beautiful, it is fast.
import numpy as np
import timeit
import itertools as IT
import pandas as pd

def using_empty(x):
    x = np.asarray(x)
    result = np.empty(x.shape + (2,))
    result[..., 0] = x
    result[..., 1] = np.log(x)
    return result

def using_concat(x):
    x = np.asarray(x)
    return np.concatenate([x, np.log(x)], axis=-1).reshape(x.shape+(2,), order='F')

def using_stack(x):
    x = np.asarray(x)
    return np.stack([x, np.log(x)], axis=x.ndim)

def using_ufunc(x):
    return np.array([x, np.log(x)])
using_ufunc = np.vectorize(using_ufunc, otypes=[np.ndarray])

tests = [np.arange(600),
         np.arange(600).reshape(20,30),
         np.arange(960).reshape(8,15,8)]

# check that all implementations return the same result
for x in tests:
    assert np.allclose(using_empty(x), using_concat(x))
    assert np.allclose(using_empty(x), using_stack(x))

timing = []
funcs = ['using_empty', 'using_concat', 'using_stack', 'using_ufunc']
for test, func in IT.product(tests, funcs):
    timing.append(timeit.timeit(
        '{}(test)'.format(func),
        setup='from __main__ import test, {}'.format(func), number=1000))

timing = pd.DataFrame(np.array(timing).reshape(-1, len(funcs)), columns=funcs)
print(timing)
yields the following timeit results on my machine:
using_empty using_concat using_stack using_ufunc
0 0.024754 0.025182 0.030244 2.414580
1 0.025766 0.027692 0.031970 2.408344
2 0.037502 0.039644 0.044032 3.907487
So using_empty is the fastest of the options tested on these inputs.
Note that np.stack does exactly what you want, so
np.stack([x, np.log(x)], axis=x.ndim)
looks reasonably pretty, but it is also the slowest of the three options tested.
Note that along with being much slower, using_ufunc returns an array of object dtype:
In [236]: x = np.arange(6)
In [237]: using_ufunc(x)
Out[237]:
array([array([ 0., -inf]), array([ 1., 0.]),
array([ 2. , 0.69314718]),
array([ 3. , 1.09861229]),
array([ 4. , 1.38629436]), array([ 5. , 1.60943791])], dtype=object)
which is not the same as the desired result:
In [240]: using_empty(x)
Out[240]:
array([[ 0. , -inf],
[ 1. , 0. ],
[ 2. , 0.69314718],
[ 3. , 1.09861229],
[ 4. , 1.38629436],
[ 5. , 1.60943791]])
In [238]: using_ufunc(x).shape
Out[238]: (6,)
In [239]: using_empty(x).shape
Out[239]: (6, 2)
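The scalar case mentioned in the question also comes out right with the preallocation approach, because np.asarray turns a plain float into a 0-d array and x.shape + (2,) is then just (2,); a quick check (values shown rounded):
import numpy as np

def using_empty(x):
    # same function as in the benchmark above
    x = np.asarray(x)
    result = np.empty(x.shape + (2,))
    result[..., 0] = x
    result[..., 1] = np.log(x)
    return result

print(using_empty(2.0))        # roughly [2.     0.6931]
print(using_empty(2.0).shape)  # (2,)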

Efficient way of merging two numpy masked arrays

I have two numpy masked arrays which I want to merge. I'm using the following code:
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((10000, 10000), dtype=np.int16)
a[:5000, :5000] = 1
am = np.ma.masked_equal(a, 0)
b = np.zeros((10000, 10000), dtype=np.int16)
b[2500:7500, 2500:7500] = 2
bm = np.ma.masked_equal(b, 0)
arr = np.ma.array(np.dstack((am, bm)), mask=np.dstack((am.mask, bm.mask)))
arr = np.prod(arr, axis=2)
plt.imshow(arr)
The problem is that the np.prod() operation is very slow (4 seconds on my computer). Is there an alternative, more efficient way of getting a merged array?
Instead of your last two lines using dstack() and prod(), try this:
arr = np.ma.array(am.filled(1) * bm.filled(1), mask=(am.mask * bm.mask))
Now you don't need prod() at all, and you avoid allocating the 3D array entirely.
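The reason this works: filling masked cells with 1, the multiplicative identity, leaves the product of the unmasked values untouched, and a cell stays masked only where both inputs are masked, which matches what prod() over the stacked arrays produces. A small check on toy arrays (my own example, using the same masked_equal setup as the question):
import numpy as np

am = np.ma.masked_equal(np.array([[1, 0], [2, 3]]), 0)
bm = np.ma.masked_equal(np.array([[4, 5], [0, 6]]), 0)

via_prod = np.prod(np.ma.array(np.dstack((am, bm)),
                               mask=np.dstack((am.mask, bm.mask))), axis=2)
via_filled = np.ma.array(am.filled(1) * bm.filled(1), mask=(am.mask * bm.mask))

print(np.ma.allequal(via_prod, via_filled))  # True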
I took another approach that may not be particularly efficient, but is reasonably easy to extend and implement.
(I know I'm answering a question that is over 3 years old with functionality that has been around in numpy a long time, but bear with me)
The np.where function in numpy has two main purposes (it is a bit weird), the first is to give you indices for a boolean array:
>>> import numpy as np
>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> m = (a % 3 == 0)
>>> m
array([[ True, False, False, True],
[False, False, True, False],
[False, True, False, False]], dtype=bool)
>>> row_ind, col_ind = np.where(m)
>>> row_ind
array([0, 0, 1, 2])
>>> col_ind
array([0, 3, 2, 1])
The other purpose of the np.where function is to pick from two arrays based on whether the given boolean array is True/False:
>>> np.where(m, a, np.zeros(a.shape))
array([[ 0., 0., 0., 3.],
[ 0., 0., 6., 0.],
[ 0., 9., 0., 0.]])
Turns out, there is also a numpy.ma.where which deals with masked arrays...
Given a list of masked arrays of the same shape, my code then looks like:
merged = masked_arrays[0]
for ma in masked_arrays[1:]:
    merged = np.ma.where(ma.mask, merged, ma)
As I say, not particularly efficient, but certainly easy enough to implement.
HTH
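As a quick illustration of what the loop does (later arrays take precedence wherever they are unmasked, and earlier values show through where they are masked):
import numpy as np

masked_arrays = [np.ma.masked_equal(np.array([1, 1, 0, 0]), 0),
                 np.ma.masked_equal(np.array([0, 2, 2, 0]), 0)]

merged = masked_arrays[0]
for ma in masked_arrays[1:]:
    # keep the running result where ma is masked, otherwise take ma's value
    merged = np.ma.where(ma.mask, merged, ma)

print(merged)  # [1 2 2 --]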
Inspired by the accepted answer, I've found a simple way of merging masked arrays. It works by doing some logical operations on the masks and simply adding the 0-filled arrays.
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((1000, 1000), dtype=np.int16)
a[:500, :500] = 2
am = np.ma.masked_equal(a, 0)
b = np.zeros((1000, 1000), dtype=np.int16)
b[250:750, 250:750] = 3
bm = np.ma.masked_equal(b, 0)
c = np.zeros((1000, 1000), dtype=np.int16)
c[500:1000, 500:1000] = 5
cm = np.ma.masked_equal(c, 0)
bm.mask = np.logical_or(np.logical_and(am.mask, bm.mask), np.logical_not(am.mask))
am = np.ma.array(am.filled(0) + bm.filled(0), mask=(am.mask * bm.mask))
cm.mask = np.logical_or(np.logical_and(am.mask, cm.mask), np.logical_not(am.mask))
am = np.ma.array(am.filled(0) + cm.filled(0), mask=(am.mask * cm.mask))
plt.imshow(am)
I hope someone finds this helpful sometime. Masked arrays don't seem to be very efficient though, so if someone finds an alternative way to merge arrays I'd be happy to know.
Update: based on @morningsun's comment, this implementation is 30% faster and much simpler. Assigning through am.mask copies values from the next array into the currently masked positions of am, and only the positions that are masked in both arrays stay masked:
import numpy as np
import matplotlib.pyplot as plt
a = np.zeros((1000, 1000), dtype=np.int16)
a[:500, :500] = 2
am = np.ma.masked_equal(a, 0)
b = np.zeros((1000, 1000), dtype=np.int16)
b[250:750, 250:750] = 3
bm = np.ma.masked_equal(b, 0)
c = np.zeros((1000, 1000), dtype=np.int16)
c[500:1000, 500:1000] = 5
cm = np.ma.masked_equal(c, 0)
am[am.mask] = bm[am.mask]
am[am.mask] = cm[am.mask]
plt.imshow(am)

calculation of residuals with numpy lstsq

I have x,y data:
import numpy as np
x = np.array([ 2.5, 1.25, 0.625, 0.3125, 0.15625, 0.078125])
y = np.array([ 2448636.,1232116.,617889.,310678.,154454.,78338.])
X = np.vstack((x, np.zeros(len(x))))
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
Gives me:
(array([ 981270.29919414, 0. ]),
array([], dtype=float64),
1,
array([ 2.88639894, 0. ]))
Why are the residuals empty? If I add ones instead of zeros, the residuals are calculated:
X = np.vstack((x, np.ones(len(x)))) # added ones instead of zeros
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
(array([ 978897.28500355, 4016.82089552]),
array([ 42727293.12864216]),
2,
array([ 3.49623683, 1.45176681]))
Additionally, if I calculate the sum of squared residuals in Excel I get 9261214 if the intercept is set to zero and 5478137 if ones are added to x.
lstsq is going to have a tough time fitting to that column of zeros: any value of the corresponding parameter (presumably the intercept) will do. Because of that the matrix is rank-deficient, and np.linalg.lstsq returns an empty residuals array whenever the rank is less than the number of columns.
To fix the intercept to 0, if that's what you need to do, just send the x array on its own, but make sure it has the right shape for lstsq:
In [214]: popt,res,rank,val = np.linalg.lstsq(np.atleast_2d(x).T,y)
In [215]: popt
Out[215]: array([ 981270.29919414])
In [216]: res
Out[216]: array([ 92621214.2278382])
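If you ever need the residual sum of squares regardless of what lstsq returns, you can always recompute it from the fitted parameters; a short sketch using the question's data (rcond=None just silences the deprecation warning in newer numpy):
import numpy as np

x = np.array([2.5, 1.25, 0.625, 0.3125, 0.15625, 0.078125])
y = np.array([2448636., 1232116., 617889., 310678., 154454., 78338.])

X = np.atleast_2d(x).T                  # a single column: slope only, intercept fixed at 0
popt, res, rank, sval = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X.dot(popt)) ** 2)    # residual sum of squares computed by hand
print(popt, rss)                        # rss agrees with the res returned above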

linspace that would always include the final point?

For an arbitrary pair of 2D points in the plane, I want to break the connecting vector into parts specified by a precision factor, but I want it to always include the start and end point. As an extra feature, I expect that segmenting the vector from the end to the beginning gives me the same segmentation as from the beginning to the end (after flipping it, of course). As far as I can see, numpy.linspace naturally satisfies this condition except when the precision is so large that the result consists of only one point. Is there any built-in function to take care of this situation, or any hints on how I could correct this behaviour?
import numpy as np
alpha = np.array([0,0])
beta = np.array([1,1])
alpha_beta_dist = np.linalg.norm(beta - alpha)
for i in range(10):
    precision = np.random.random(1)
    traversal = np.linspace(0.0, 1.0, num = alpha_beta_dist / float(precision))
    traversal2 = np.fliplr([np.linspace(1.0, 0.0, num = alpha_beta_dist / float(precision))])
    traversal2 = traversal2[0]
    if (traversal != traversal2).all():
        print 'precision: ', precision
        print 'traversal: ', traversal
        print 'traversal2: ', traversal2
Make sure num is at least 2:
traversal = np.linspace(0.0, 1.0,
                        num=max(alpha_beta_dist/float(precision), 2))
np.linspace will return both endpoints (by default) unless num is less than 2:
In [23]: np.linspace(0, 1, num=0)
Out[23]: array([], dtype=float64)
In [24]: np.linspace(0, 1, num=1)
Out[24]: array([ 0.])
In [25]: np.linspace(0, 1, num=2)
Out[25]: array([ 0., 1.])
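One practical caveat, separate from the original answer: newer numpy versions require num to be an integer, so it is safest to round the computed count before clamping it to 2; a small sketch:
import numpy as np

alpha = np.array([0, 0])
beta = np.array([1, 1])
alpha_beta_dist = np.linalg.norm(beta - alpha)
precision = 2.0  # deliberately larger than the distance between the points

# round the requested number of points to an int, then clamp it to at least 2
num = max(int(np.ceil(alpha_beta_dist / precision)), 2)
traversal = np.linspace(0.0, 1.0, num=num)
print(traversal)  # [0. 1.] -- both endpoints are always included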
