After splitting my data, I'm trying to rank features, but when I access X_train.columns I get this error: 'numpy.ndarray' object has no attribute 'columns'.
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
y=df['DIED'].values
x=df.drop('DIED',axis=1).values
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('y_train',y_train.shape)
print('y_test',y_test.shape)
bestfeatures = SelectKBest(score_func=chi2, k="all")
fit = bestfeatures.fit(X_train,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)
I know that train_test_split returns a NumPy array here, but how should I deal with it?
Maybe this code makes it clear:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# here I imitate your example data
df = pd.DataFrame(data = np.random.randint(100, size = (50,5)), columns = ['DIED']+[f'col_{i}' for i in range(4)])
df.head()
Out[1]:
DIED col_0 col_1 col_2 col_3
0 36 0 23 43 55
1 81 59 83 37 31
2 32 86 94 50 87
3 10 69 4 69 27
4 1 16 76 98 74
# df here is a DataFrame, with all its attributes, like df.columns
y=df['DIED'].values
x=df.drop('DIED',axis=1).values # <- .values returns a plain 2-D NumPy array (not a DataFrame), so it has no column names
x
Out[2]:
array([[ 0, 23, 43, 55],
[59, 83, 37, 31],
[86, 94, 50, 87],
[69, 4, 69, 27],
[16, 76, 98, 74],
[17, 50, 52, 31],
[95, 4, 56, 68],
[82, 35, 67, 76],
.....
# now you can access columns by index, like this:
x[:,2] # <- gives you access to the 3rd column
Out[3]:
array([43, 37, 50, 69, 98, 52, 56, 67, 81, 64, 48, 68, 14, 41, 78, 65, 11,
86, 80, 1, 11, 32, 93, 82, 93, 81, 63, 64, 47, 81, 79, 85, 60, 45,
80, 21, 27, 37, 87, 31, 97, 16, 59, 91, 20, 66, 66, 3, 9, 88])
# or you can convert the 2-D array back to a DataFrame
pd.DataFrame(data = x, columns = df.columns[1:])
Out[4]:
col_0 col_1 col_2 col_3
0 0 23 43 55
1 59 83 37 31
2 86 94 50 87
3 69 4 69 27
....
The same approach works for all of your variables: X_train, X_test, y_train, y_test.
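As a minimal sketch of the idea above: save the column names before calling .values, then pair them with the per-column scores afterwards. The per-column variance and the simple row slice used here are only stand-ins (assumptions for illustration) for fit.scores_ and train_test_split from the question, so that the snippet needs nothing beyond numpy and pandas.

```python
import numpy as np
import pandas as pd

# Imitation of the question's data (random values, fixed seed).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(50, 5)),
                  columns=['DIED'] + [f'col_{i}' for i in range(4)])

features = df.drop('DIED', axis=1)
feature_names = features.columns   # keep the names before dropping to NumPy
x = features.values                # plain ndarray, no .columns attribute

# Stand-in for the training part of a split (hypothetical, not train_test_split).
X_train = x[:35]

# Stand-in score per column; in the question this would be fit.scores_.
scores = X_train.var(axis=0)

# Pair each saved column name with its score.
feature_scores = pd.DataFrame({'feature': feature_names, 'score': scores})
print(feature_scores.sort_values('score', ascending=False))
```

Because the order of the columns in the array matches the order of feature_names, the scores line up with the names without any further bookkeeping.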
I have a numpy array (of an image) whose 3rd dimension has length 3. An example of my array is below. I am trying to iterate over it so that I access/print each length-3 entry along the last dimension. But the techniques below give me either each individual value or a whole row, not the length-3 entries.
How can I iterate over this numpy array at that level?
My array:
src = cv2.imread('./myimage.jpg')
# naive/shortened example of src contents (shape=(1, 3, 3))
[[[117 108 99]
[115 105 98]
[ 90 79 75]]]
When iterating, my objective is to print the following values, one per iteration:
[117 108 99] # iteration 1
[115 105 98] # iteration 2
[ 90 79 75] # iteration 3
# Attempt 1 to iterate
for index,value in np.ndenumerate(src):
    print(src[index]) # src[index] and value = 117 when I was hoping it equals [117 108 99]
# Attempt 2 to iterate
for index,value in enumerate(src):
    print(src[index]) # value = is the entire row
Solution
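You could use either of the following two methods, but note that they walk different axes. Method-1 yields one pixel per iteration, which is what the question asks for. Method-2 iterates over the last axis and is explained in the section: Detailed Solution below.
import numpy as np
src = [[117, 108, 99], [115, 105, 98], [ 90, 79, 75]]
src = np.array(src).reshape((1,3,3))
Method-1
for row in src[0,:]:
    print(row)
Output:
[117 108 99]
[115 105 98]
[90 79 75]
Method-2
Iterates along the last axis, so each item is a (1, 3) slice holding one channel of every pixel.
for e in np.transpose(src, [2,0,1]):
    print(e)
Output:
[[117 115 90]]
[[108 105 79]]
[[99 98 75]]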
Detailed Solution
Let us make an array of shape (3,4,5). So, if we iterate over the 3rd dimension, we should find 5 items, each with a shape of (3,4). You could achieve this by using numpy.transpose as shown below:
src = np.arange(3*4*5).reshape((3,4,5))
for e in np.transpose(src, [2,0,1]):
    print(e)
Output:
[[ 0 5 10 15]
[20 25 30 35]
[40 45 50 55]]
[[ 1 6 11 16]
[21 26 31 36]
[41 46 51 56]]
[[ 2 7 12 17]
[22 27 32 37]
[42 47 52 57]]
[[ 3 8 13 18]
[23 28 33 38]
[43 48 53 58]]
[[ 4 9 14 19]
[24 29 34 39]
[44 49 54 59]]
Here the array src is:
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]],
[[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]],
[[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59]]])
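As a small sketch of the same idea, np.moveaxis can express "bring the last axis to the front" without spelling out the full permutation. This is an alternative spelling, not what the answer above uses; the variable name moved is chosen here for illustration.

```python
import numpy as np

src = np.arange(3 * 4 * 5).reshape((3, 4, 5))

# Move the last axis to position 0; the remaining axes keep their order.
moved = np.moveaxis(src, -1, 0)

# Same result as the explicit permutation used above.
same = np.array_equal(moved, np.transpose(src, [2, 0, 1]))

for e in moved:
    print(e.shape)  # each item has shape (3, 4)
```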
General advice: When working with numpy, explicit python loops should be a last resort. Numpy is an extremely powerful tool which covers most use cases. Learn how to use it properly! If it helps, you can think of numpy as almost its own mini-language within a language.
Now, onto the code. I chose here to keep only the subarrays whose values are all below 100, but of course this is completely arbitrary and serves only to demonstrate the code.
import numpy as np
arr = np.array([[[117, 108, 99], [115, 105, 98], [90, 79, 75]], [[20, 3, 99], [101, 250, 30], [75, 89, 83]]])
cond_mask = np.all(a=arr < 100, axis=2)
arr_result = arr[cond_mask]
Let me know if you have any questions about the code :)
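If a loop over the length-3 entries is still wanted (for printing, say), one possible sketch is to flatten every leading dimension with reshape and iterate over the resulting rows. The name pixels is an assumption made for this example.

```python
import numpy as np

arr = np.array([[[117, 108, 99], [115, 105, 98], [90, 79, 75]],
                [[20, 3, 99], [101, 250, 30], [75, 89, 83]]])

# Collapse all leading axes; each row is one length-3 entry of the last axis.
pixels = arr.reshape(-1, arr.shape[-1])

for pixel in pixels:
    print(pixel)
```

Unlike src[0,:], this does not depend on the first dimension having length 1.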
I have a dataset that consists of columns 0 to 10, and I would like to extract only the information in columns 1 to 5 and 7 to 9 (that is, skipping column 6 and the last column). So far, I have done the following:
A = B[:, [[1:5], [7:-1]]]
but I got a syntax error. How can I retrieve that data?
Advanced indexing doesn't take a list of lists of slices. Instead, you can use numpy.r_. This function doesn't take negative indices, but you can get round this by using np.ndarray.shape:
A = B[:, np.r_[1:6, 7:B.shape[1]-1]]
Remember that the stop value of each slice is exclusive: a:b does not include b, in the same way slice(a, b) does not include b, so 1:6 is needed to cover columns 1 to 5. Also note that indexing begins at 0.
Here's a demo:
import numpy as np
B = np.random.randint(0, 10, (3, 11))
print(B)
[[5 8 8 8 3 0 7 2 1 6 7]
[4 3 8 7 3 7 5 6 0 5 7]
[1 0 4 0 2 2 5 1 4 2 3]]
A = B[:,np.r_[1:6, 7:B.shape[1]-1]]
print(A)
[[8 8 8 3 0 2 1 6]
[3 8 7 3 7 6 0 5]
[0 4 0 2 2 1 4 2]]
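To see why the negative stop has to be replaced, it can help to print what np.r_ actually builds. A short sketch (the variable names are chosen here for illustration): inside np.r_ a slice is expanded like np.arange, so 7:-1 produces nothing at all.

```python
import numpy as np

# Explicit stop: columns 1..5 and 7..9.
idx_ok = np.r_[1:6, 7:10]

# Negative stop: 7:-1 behaves like np.arange(7, -1), which is empty.
idx_neg = np.r_[1:6, 7:-1]

print(idx_ok)
print(idx_neg)
```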
Another way would be to get your slices independently, and then concatenate:
A = np.concatenate([B[:, 1:6], B[:, 7:-1]], axis=1)
Using example data similar to @jpp's:
B = np.random.randint(0, 10, (3, 10))
>>> B
array([[0, 5, 0, 6, 8, 5, 9, 3, 2, 0],
[8, 8, 1, 7, 3, 5, 7, 7, 4, 8],
[5, 5, 5, 2, 3, 1, 6, 4, 9, 6]])
A = np.concatenate([B[:, 1:6], B[:, 7:-1]], axis=1)
>>> A
array([[5, 0, 6, 8, 5, 3, 2],
[8, 1, 7, 3, 5, 7, 4],
[5, 5, 2, 3, 1, 4, 9]])
How about taking the union of the ranges?
B[:, np.union1d(range(1,6), range(7,10))]
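Since only a few columns are being left out, another option is to name the columns to drop rather than the ones to keep. A minimal sketch using np.delete, assuming an 11-column array as in the question (the list name drop is hypothetical):

```python
import numpy as np

B = np.arange(33).reshape(3, 11)

# Columns to exclude: the first, column 6, and the last one.
drop = [0, 6, B.shape[1] - 1]

# np.delete returns a new array without those columns; B is unchanged.
A = np.delete(B, drop, axis=1)
print(A)
```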
Just to add some of my thoughts. There are two approaches one can take using either numpy or pandas. So I will demonstrate with some data, and assume that the data is the grades for a student in different courses he/she is enrolled in.
import pandas as pd
import numpy as np
data = {'Course A': [84, 82, 81, 89, 73, 94, 92, 70, 88, 95],
'Course B': [85, 82, 72, 77, 75, 89, 95, 84, 77, 94],
'Course C': [97, 94, 93, 95, 88, 82, 78, 84, 69, 78],
'Course D': [84, 82, 81, 89, 73, 94, 92, 70, 88, 95],
'Course E': [85, 82, 72, 77, 75, 89, 95, 84, 77, 94],
'Course F': [97, 94, 93, 95, 88, 82, 78, 84, 69, 78]
}
df = pd.DataFrame(data=data)
df.head()
CA CB CC CD CE CF
0 84 85 97 84 85 97
1 82 82 94 82 82 94
2 81 72 93 81 72 93
3 89 77 95 89 77 95
4 73 75 88 73 75 88
NOTE: CA through CF represent Course A through Course F.
To help us remember column names and their associated indexes, we can build a list of columns and their indexes via list comprehension.
map_cols = [f"{c[0]}:{c[1]}" for c in enumerate(df.columns)]
['0:Course A',
'1:Course B',
'2:Course C',
'3:Course D',
'4:Course E',
'5:Course F']
Now, to select say Course A, and Course D through Course F using indexing in numpy, you can do the following:
df.iloc[:, np.r_[0, 3:df.shape[1]]]
CA CD CE CF
0 84 84 85 97
1 82 82 82 94
2 81 81 72 93
3 89 89 77 95
4 73 73 75 88
You can also use pandas to the same effect.
df[[df.columns[0], *df.columns[3:]]]
CA CD CE CF
0 84 84 85 97
1 82 82 82 94
2 81 81 72 93
3 89 89 77 95
4 73 73 75 88
One can solve that by concatenating two ranges:
[In]: columns = list(range(1,6)) + list(range(7,10))
[Out]:
[1, 2, 3, 4, 5, 7, 8, 9]
Then, assuming your DataFrame is called df, use iloc to select those columns:
newdf = df.iloc[:, columns]
For example, if a question/answer you encounter posts an array like this:
[[ 0 1 2 3 4 5 6 7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]]
How would you load it into a variable in a REPL session without having to add commas everywhere?
For a one-time occasion, I might do this:
Copy the text containing the array to the clipboard.
In an ipython shell, enter s = """, but do not hit return.
Paste the text from the clipboard.
Type the closing triple quote.
That gives me:
In [16]: s = """[[ 0 1 2 3 4 5 6 7]
...: [ 8 9 10 11 12 13 14 15]
...: [16 17 18 19 20 21 22 23]
...: [24 25 26 27 28 29 30 31]
...: [32 33 34 35 36 37 38 39]
...: [40 41 42 43 44 45 46 47]
...: [48 49 50 51 52 53 54 55]
...: [56 57 58 59 60 61 62 63]]"""
Then use np.loadtxt() as follows:
In [17]: a = np.loadtxt([line.lstrip(' [').rstrip(']') for line in s.splitlines()], dtype=int)
In [18]: a
Out[18]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63]])
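A variation on the same idea, sketched here with plain string methods instead of np.loadtxt: strip the brackets, split on whitespace, and reshape using the number of lines. It assumes an integer 2-D array in which every row sits on a single line; the names tokens and n_rows are chosen for this example.

```python
import numpy as np

s = """[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]"""

# One text line per row (assumption: rows are not wrapped across lines).
n_rows = len(s.splitlines())

# Replace the brackets with spaces, then split on any whitespace.
tokens = s.replace('[', ' ').replace(']', ' ').split()

a = np.array([int(t) for t in tokens]).reshape(n_rows, -1)
print(a)
```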
If you have Pandas, pyperclip or something else to read from the clipboard you could use something like this:
from pandas.io.clipboard import clipboard_get
# import pyperclip
import numpy as np
import re
import ast
def numpy_from_clipboard():
    inp = clipboard_get()
    # inp = pyperclip.paste()
    inp = inp.strip()
    # if it starts with "array(" we just need to remove the
    # leading "array(" and remove the optional ", dtype=xxx)"
    if inp.startswith('array('):
        inp = re.sub(r'^array\(', '', inp)
        dtype = re.search(r', dtype=(\w+)\)$', inp)
        if dtype:
            return np.array(ast.literal_eval(inp[:dtype.start()]), dtype=dtype.group(1))
        else:
            return np.array(ast.literal_eval(inp[:-1]))
    else:
        # In case it's the string representation it's a bit harder.
        # We need to remove all spaces between closing and opening brackets
        inp = re.sub(r'\]\s+\[', '],[', inp)
        # We need to remove all whitespaces following an opening bracket
        inp = re.sub(r'\[\s+', '[', inp)
        # and all leading whitespaces before closing brackets
        inp = re.sub(r'\s+\]', ']', inp)
        # replace all remaining whitespaces with ","
        inp = re.sub(r'\s+', ',', inp)
        return np.array(ast.literal_eval(inp))
And then read what you saved in the clipboard:
>>> numpy_from_clipboard()
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63]])
This should be able to parse (most) arrays (str as well as repr of arrays) from your clipboard. It should even work for multi-line arrays (where np.loadtxt fails):
[[ 0.34866207 0.38494993 0.7053722 0.64586156 0.27607369 0.34850162
0.20530567 0.46583039 0.52982216 0.92062115]
[ 0.06973858 0.13249867 0.52419149 0.94707951 0.868956 0.72904737
0.51666421 0.95239542 0.98487436 0.40597835]
[ 0.66246734 0.85333546 0.072423 0.76936201 0.40067016 0.83163118
0.45404714 0.0151064 0.14140024 0.12029861]
[ 0.2189936 0.36662076 0.90078913 0.39249484 0.82844509 0.63609079
0.18102383 0.05339892 0.3243505 0.64685352]
[ 0.803504 0.57531309 0.0372428 0.8308381 0.89134864 0.39525473
0.84138386 0.32848746 0.76247531 0.99299639]]
>>> numpy_from_clipboard()
array([[ 0.34866207, 0.38494993, 0.7053722 , 0.64586156, 0.27607369,
0.34850162, 0.20530567, 0.46583039, 0.52982216, 0.92062115],
[ 0.06973858, 0.13249867, 0.52419149, 0.94707951, 0.868956 ,
0.72904737, 0.51666421, 0.95239542, 0.98487436, 0.40597835],
[ 0.66246734, 0.85333546, 0.072423 , 0.76936201, 0.40067016,
0.83163118, 0.45404714, 0.0151064 , 0.14140024, 0.12029861],
[ 0.2189936 , 0.36662076, 0.90078913, 0.39249484, 0.82844509,
0.63609079, 0.18102383, 0.05339892, 0.3243505 , 0.64685352],
[ 0.803504 , 0.57531309, 0.0372428 , 0.8308381 , 0.89134864,
0.39525473, 0.84138386, 0.32848746, 0.76247531, 0.99299639]])
However, I'm not too good with regexes, so this probably isn't foolproof, and using ast.literal_eval feels a bit awkward (but it avoids doing the parsing yourself).
Feel free to suggest improvements.
I have a Pandas DataFrame as follows:
data = pd.DataFrame({'A':[1,2,3,1,23,3,76,2,45,76],'B':[12,56,22,45,1,3,98,79,77,67]})
To get the unique values across both columns, I have done this:
set(data['A'].unique()).union(set(data['B'].unique()))
which results in:
set([1, 2, 3, 12, 76, 77, 79, 67, 22, 23, 98, 45, 56])
Is there a better way of doing this? Is there a way of achieving this by using drop_duplicates?
Edit:
Also, what if I had two more columns 'C' and 'D' but needed to drop duplicates only from 'A' and 'B'?
If you are intent on collapsing this into a single array of unique values:
In [10]: np.unique(data.values.ravel())
Out[10]: array([ 1, 2, 3, 12, 22, 23, 45, 56, 67, 76, 77, 79, 98])
This will work as well
In [12]: data.unstack().drop_duplicates()
Out[12]:
A 0 1
1 2
2 3
4 23
6 76
8 45
B 0 12
1 56
2 22
6 98
7 79
8 77
9 67
dtype: int64
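For the follow-up in the edit (extra columns 'C' and 'D' that should be ignored), one possible sketch is to select 'A' and 'B' first and then collapse as above. The contents of 'C' and 'D' below are made up purely for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 1, 23, 3, 76, 2, 45, 76],
                     'B': [12, 56, 22, 45, 1, 3, 98, 79, 77, 67],
                     'C': list(range(100, 110)),   # hypothetical extra column
                     'D': list(range(200, 210))})  # hypothetical extra column

# Restrict to the columns of interest, flatten, and keep the sorted unique values.
unique_ab = np.unique(data[['A', 'B']].to_numpy().ravel())
print(unique_ab)
```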