I have a pandas DataFrame whose column names are the strings '0' through '9':
import numpy as np
import pandas as pd

working_df = pd.DataFrame(np.random.rand(5, 10), index=range(5), columns=[str(x) for x in range(10)])
working_df.loc[:, 'outcome'] = [0, 1, 1, 0, 1]
I then wanted to gather each row of these numbers into a single column, so I did:
array_list = [Y for Y in working_df[[str(num) for num in range(10)]].values]
which gave me:
[array([ 0.0793451 , 0.3288617 , 0.75887129, 0.01128641, 0.64105905,
0.78789297, 0.69673768, 0.20354558, 0.48976411, 0.72848541]),
array([ 0.53511388, 0.08896322, 0.10302786, 0.08008444, 0.18218731,
0.2342337 , 0.52622153, 0.65607384, 0.86069294, 0.8864577 ]),
array([ 0.82878026, 0.33986175, 0.25707122, 0.96525733, 0.5897311 ,
0.3884232 , 0.10943644, 0.26944414, 0.85491211, 0.15801284]),
array([ 0.31818888, 0.0525836 , 0.49150727, 0.53682492, 0.78692193,
0.97945708, 0.53181293, 0.74330327, 0.91364064, 0.49085287]),
array([ 0.14909577, 0.33959452, 0.20607263, 0.78789116, 0.41780657,
0.0437907 , 0.67697385, 0.98579928, 0.1487507 , 0.41682309])]
I then attached it to my dataframe using:
working_df.loc[:,'array_list'] = pd.Series(array_list)
I then set up my classifier with rf_clf = RandomForestClassifier() and tried rf_clf.fit(working_df['array_list'][1:].values, working_df['outcome'][1:].values), which results in ValueError: setting an array element with a sequence.
Is it a problem with the array of arrays in the fitting? Thanks for any insight.
The problem is that scikit-learn expects a two-dimensional array of values as input. You're passing a one-dimensional array of objects (each object itself being a one-dimensional array).
A quick fix would be to do this:
X = np.array(list(working_df['array_list'][1:]))
y = working_df['outcome'][1:].values
rf_clf.fit(X, y)
A better fix would be not to store your two-dimensional feature array inside a one-dimensional pandas column in the first place.
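For instance, since the features already live in the columns '0' through '9', a minimal sketch of that better fix is to select those columns directly, which yields a proper 2-D float array:
from sklearn.ensemble import RandomForestClassifier

feature_cols = [str(i) for i in range(10)]
X = working_df[feature_cols].values  # shape (5, 10), a 2-D float array
y = working_df['outcome'].values     # shape (5,)

rf_clf = RandomForestClassifier()
rf_clf.fit(X, y)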
I have a function that reads in a 2D array and outputs a 2D array. I want the output to be a constant (pi in this case) at every index where the input equals 0; otherwise I perform some maths on it. E.g.:
import numpy as np
import numpy.ma as ma
def my_func(x):
    mask = ma.where(x==0, x)
    # make an array of pi's the same size and shape as the input
    y = np.pi * np.ones(x)
    # pseudo-code bit I can't figure out
    y.not_masked = y**2
    return y
my_array = [[0,1,2],[1,0,2],[1,2,0]]
result_array = my_func(my_array)
This should give me the following:
result_array = [[3.14, 1, 4],[1, 3.14, 4], [1, 4, 3.14]]
I.e. it squares each element of the 2D list that doesn't equal zero, and replaces all the zeros with pi.
I need this because my function will include division, and I don't know the indexes beforehand. I'm trying to convert a MATLAB tutorial from a textbook into Python, and this function is stumping me!
Thanks
Just use np.where() directly:
y = np.where(x, x**2, np.pi)
Example:
>>> x = np.asarray([[0,1,2],[1,0,2],[1,2,0]])
>>> y = np.where(x, x**2, np.pi)
>>> print(y)
[[ 3.14159265 1. 4. ]
[ 1. 3.14159265 4. ]
[ 1. 4. 3.14159265]]
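If you want to keep the function wrapper from the question, a minimal sketch (assuming you also want it to accept plain nested lists) could be:
import numpy as np

def my_func(x):
    x = np.asarray(x, dtype=float)  # also handles plain nested lists
    # pi where the input is zero, x**2 everywhere else
    return np.where(x == 0, np.pi, x**2)

my_array = [[0, 1, 2], [1, 0, 2], [1, 2, 0]]
result_array = my_func(my_array)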
Try this:
my_array = np.array([[0,1,2],[1,0,2],[1,2,0]]).astype(float)

def my_func(x):
    mask = x == 0
    x[mask] = np.pi
    x[~mask] = x[~mask]**2  # or some other operation on x...
    return x
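Calling it on the float array defined above modifies it in place and returns it:
result_array = my_func(my_array)
# array([[ 3.14159265,  1.        ,  4.        ],
#        [ 1.        ,  3.14159265,  4.        ],
#        [ 1.        ,  4.        ,  3.14159265]])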
Rather than using masked arrays, I would suggest using a boolean index array to achieve what you want.
def my_func(x):
    # create a boolean matrix, a, that is True where x==0
    # and False where x!=0
    a = x == 0
    x[a] = np.pi
    # use ~a to flip True and False so we can
    # operate on the non-zero values of the array
    x[~a] = x[~a]**2
    return x  # return the transformed array
my_array = np.array([[0.,1.,2.],[1.,0.,2.],[1.,2.,0.]])
result_array = my_func(my_array)
this gives the output:
array([[ 3.14159265, 1. , 4. ],
[ 1. , 3.14159265, 4. ],
[ 1. , 4. , 3.14159265]])
Notice that I passed the function a NumPy array specifically; originally you passed a list, and that will cause problems when you attempt to do mathematical operations. Also notice that I defined the array with 1. rather than just 1, to make sure it is an array of floats rather than integers: if it is an integer array, setting values equal to pi will truncate them to 3.
It might also be good to add a check at the top of the function that the input actually is a NumPy array of floats (rather than a list, or an array of integers), and to convert it if not.
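A sketch of such a check, assuming coercion (rather than raising an error) is the behaviour you want:
def my_func(x):
    x = np.array(x, dtype=float)  # copies lists / integer arrays into a float array
    a = x == 0
    x[a] = np.pi
    x[~a] = x[~a]**2
    return x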
EDIT:
Changed to using ~a rather than invert(a), as per Scotty1's suggestion.
I am working with the scipy.signal library in Python, and I have the following code:
from scipy import signal
b = [0.001016, 0.00507999, 0.01015998, 0.01015998, 0.00507999, 0.001016]
a = [1., -3.0820186, 4.04351697, -2.76126457, 0.97291013, -0.14063199]
data = [[1.],
        [1.],
        [1.],
        ...]
# length = 264
y = signal.filtfilt(b, a, data)
But when I execute the code I get the following error message:
The length of the input vector x must be at least padlen, which is 18.
What could I do?
It appears that data is a two-dimensional array with shape (264, 1). By default, filtfilt filters along the last axis of the input array, so in your case it is trying to filter along an axis where the length of the data is 1, which is not long enough for the default padding method.
I assume you meant to interpret data as a one-dimensional array. You can add the argument axis=0:
y = signal.filtfilt(b, a, data, axis=0)
to filter along the first dimension (i.e. down the column), in which case the output y will also have shape (264, 1). Alternatively, you can convert the input to a one-dimensional array by flattening it with np.ravel(data) or by using indexing to select the first (and only) column, data[:, 0]. (The latter will only work if data is, in fact, a numpy array and not a list of lists.) E.g.
y = signal.filtfilt(b, a, np.ravel(data))
In that case, the output y will also be a one-dimensional array, with shape (264,).
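Putting that together, a minimal runnable sketch (reusing the b and a coefficients from the question; the all-ones data is just a stand-in for the real (264, 1) input) might look like:
import numpy as np
from scipy import signal

b = [0.001016, 0.00507999, 0.01015998, 0.01015998, 0.00507999, 0.001016]
a = [1., -3.0820186, 4.04351697, -2.76126457, 0.97291013, -0.14063199]
data = np.ones((264, 1))  # stand-in for the (264, 1) input

y1 = signal.filtfilt(b, a, data, axis=0)    # filter down the column; shape (264, 1)
y2 = signal.filtfilt(b, a, np.ravel(data))  # filter the flattened data; shape (264,)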
If instead you had a two-dimensional array with shape (264, 2), you could also use np.hsplit() to split data into two separate (264, 1) arrays, like so:
import numpy as np
arr1, arr2 = np.hsplit(data,2)
You can view the shape of each individual array, for example:
print(arr1.shape)
Your code will then look something like this (note that axis=0 is still needed, since each split array has shape (264, 1)):
y1 = signal.filtfilt(b, a, arr1, axis=0)
y2 = signal.filtfilt(b, a, arr2, axis=0)
I'm trying to get the hang of normalizing my data, doing some work on it, and then changing it back. When doing an inverse_transform, do I always have to pass in exactly the same shape as when I did the fit_transform? The code below gives me "non-broadcastable output operand with shape (3,1) doesn't match the broadcast shape (3,3)":
import numpy as np
from sklearn.preprocessing import MinMaxScaler
first = np.array([[ 1.2345, 1.220000,1.26245],
[ 1.234,1.220000,7.0901],
[ 1.23450,1.22000,1.14795]])
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(first)
new_dataset = dataset[:,:1]
trainPredict2 = scaler.inverse_transform(new_dataset)
You don't have to pass a data set with exactly the same shape, but the number of columns must match the original data set, since each row is interpreted as a record and each column as a feature; you can't omit features from the data you transform back. So, for instance, slicing rows still works:
new_dataset = dataset[:1,:]
trainPredict2 = scaler.inverse_transform(new_dataset)
This gives back the first row of your original data set:
trainPredict2
# array([[ 1.2345 , 1.22 , 1.26245]])
If you really want to invert just one or two features, you can undo the min-max transformation x' = (x - x_min) / (x_max - x_min) by hand, i.e. x = x' * (x_max - x_min) + x_min, which is exactly what data_range_ and data_min_ provide:
scaler.data_range_[:1] * dataset[:,:1] + scaler.data_min_[:1]
# array([[ 1.2345],
# [ 1.234 ],
# [ 1.2345]])
This gives back the first column of your original data set.
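As a generalization (the helper name inverse_columns is mine, not part of scikit-learn), the same formula can be wrapped for any subset of feature indices:
import numpy as np

def inverse_columns(scaler, scaled, cols):
    # undo x' = (x - x_min) / (x_max - x_min) for the selected feature indices
    return scaler.data_range_[cols] * scaled + scaler.data_min_[cols]

inverse_columns(scaler, dataset[:, :1], [0])  # the first column of the original data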
I have the following problem with shape of ndarray:
out.shape = (20,)
reference.shape = (20,0)
norm = [out[i] / np.sum(out[i]) for i in range(len(out))]
# norm is a list now so I convert it to ndarray:
norm_array = np.array((norm))
norm_array.shape = (20,30)
# error: operands could not be broadcast together with shapes (20,30) (20,)
diff = np.fabs(norm_array - reference)
How can I change the shape of norm_array from (20,30) to (20,), or reference to (20,30), so I can subtract them?
EDIT: Can someone explain to me why they have different shapes, if I can access single elements of both with norm_array[0][0] and reference[0][0]?
I am not sure what you are trying to do exactly, but here is some information on numpy arrays.
A 1-d numpy array is a row vector with a shape that is a single-valued tuple:
>>> np.array([1,2,3]).shape
(3,)
You can create multidimensional arrays by passing in nested lists. Here each sub-list is a 1-d row vector of length 1, and there are 3 of them:
>>> np.array([[1],[2],[3]]).shape
(3,1)
Here is the weird part. You can create a similar array but leave the inner lists empty; you end up with 3 row vectors of length 0:
>>> np.array([[],[],[]]).shape
(3,0)
This is what you have for your reference array: an array with structure but no values. Which brings me to the main point:
You can't subtract an empty array.
If I make 2 arrays with the shapes you describe, I get an error
In [1856]: norm_array=np.ones((20,30))
In [1857]: reference=np.ones((20,0))
In [1858]: norm_array-reference
...
ValueError: operands could not be broadcast together with shapes (20,30) (20,0)
That error is different from yours, though. If I change the shape of reference to (20,), the error messages match.
In [1859]: reference=np.ones((20,))
In [1860]: norm_array-reference
...
ValueError: operands could not be broadcast together with shapes (20,30) (20,)
So your (20,0) is wrong. I don't know if you mistyped something or not.
But if I make reference 2d with 1 in the last dimension, broadcasting works, producing a difference that matches (20,30) in shape:
In [1861]: reference=np.ones((20,1))
In [1862]: norm_array-reference
If reference = np.zeros((20,)), then I could use reference[:,None] to add that singleton last dimension.
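For example, with a (20,) reference, subtracting with the added dimension broadcasts cleanly:
reference = np.ones((20,))
diff = norm_array - reference[:, None]  # (20, 30) - (20, 1) broadcasts to (20, 30)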
If reference is (20,), you can't do reference[0][0]. reference[0][0] only works with 2d arrays with at least 1 in the last dim. reference[0,0] is the preferred way of indexing a single element of a 2d array.
So far this is normal array dimensions and broadcasting; something you'll learn with use.
===============
I'm puzzled about the shape of out. If it is (20,), how does norm_array end up as (20,30)? out must consist of 20 arrays or lists, each with 30 elements.
If out were a 2d array, we could normalize without iteration:
In [1869]: out=np.arange(12).reshape(3,4)
with the list comprehension:
In [1872]: [out[i]/np.sum(out[i]) for i in range(out.shape[0])]
Out[1872]:
[array([ 0. , 0.16666667, 0.33333333, 0.5 ]),
array([ 0.18181818, 0.22727273, 0.27272727, 0.31818182]),
array([ 0.21052632, 0.23684211, 0.26315789, 0.28947368])]
In [1873]: np.array(_) # and to array
Out[1873]:
array([[ 0. , 0.16666667, 0.33333333, 0.5 ],
[ 0.18181818, 0.22727273, 0.27272727, 0.31818182],
[ 0.21052632, 0.23684211, 0.26315789, 0.28947368]])
Instead, take row sums, telling sum to keep the result 2d for ease of further use:
In [1876]: out.sum(axis=1,keepdims=True)
Out[1876]:
array([[ 6],
[22],
[38]])
now divide
In [1877]: out/out.sum(axis=1,keepdims=True)
Out[1877]:
array([[ 0. , 0.16666667, 0.33333333, 0.5 ],
[ 0.18181818, 0.22727273, 0.27272727, 0.31818182],
[ 0.21052632, 0.23684211, 0.26315789, 0.28947368]])
I have a dataset array A. A is n×2; it can be plotted on the x and y axes.
A[:,1] gets me all of the y values and A[:,0] gets me all the x values.
Now, I have a few other dataset arrays that are similar to A. The x values are the same across these similar arrays. How do I calculate the standard deviation of the datasets? There should be one std value for each x, so in the end my result should have length n.
I can do this the manual way with loops, but I'm not sure how to do it with NumPy in a pythonic and simple manner.
Here is some sample data:
A=[[0,2.54],[1,254.5],[2,-43]]
B=[[0,3.34],[1,154.5],[2,-93]]
std_Array=[std(2.54,3.34),std(254.5,154.5),std(-43,-93)]
Suppose your arrays are all the same shape and they are in a list. Then to get the standard deviation of the first column of each you can do
arrays = [np.random.rand(10, 2) for _ in range(8)]
np.dstack(arrays).std(axis=0)[0]
This stacks the 2-D arrays into a 3-D array and then takes the std along the first axis, giving a 2 × 8 result (8 being the number of arrays). The first row of that result holds the std devs of the 8 sets of x-values.
If you post some sample data perhaps we could help more.
Is this pythonic enough?
std_Array = numpy.std((A, B), axis=0)[:, 1]
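For the sample data above, numpy stacks A and B into a (2, 3, 2) array, takes the elementwise std across the two datasets, and [:, 1] keeps just the y column:
import numpy
A = [[0, 2.54], [1, 254.5], [2, -43]]
B = [[0, 3.34], [1, 154.5], [2, -93]]
numpy.std((A, B), axis=0)[:, 1]
# array([  0.4,  50. ,  25. ])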
li_arr = [np.array(x)[:, 1] for x in [A, B]]
This produces a list of 1-D arrays holding the specific column you want; the result will be
[array([ 2.54, 254.5 , -43. ]), array([ 3.34, 154.5 , -93. ])]
then you stack the values using column_stack
arr = np.column_stack(li_arr)
The result of the stacking will be
array([[ 2.54, 3.34],
[ 254.5 , 154.5 ],
[ -43. , -93. ]])
and then finally
np.std(arr, axis=1)
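which gives the per-x standard deviations:
# array([  0.4,  50. ,  25. ])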