How to change a string matrix to an integer matrix - Python

I have a voting dataset like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
But the values are all strings, so I want to convert them to an integer matrix and compute statistics:
hou_dat = pd.read_csv("house.data", header=None)
for i in range(0, hou_dat.shape[0]):
    for j in range(0, hou_dat.shape[1]):
        if hou_dat[i, j] == "republican":
            hou_dat[i, j] = 2
        if hou_dat[i, j] == "democrat":
            hou_dat[i, j] = 3
        if hou_dat[i, j] == "y":
            hou_dat[i, j] = 1
        if hou_dat[i, j] == "n":
            hou_dat[i, j] = 0
        if hou_dat[i, j] == "?":
            hou_dat[i, j] = -1
hou_sta = hou_dat.apply(pd.value_counts)
print(hou_sta)
However, it raises an error. How can I solve it?
Exception has occurred: KeyError
(0, 0)

IIUC, you need map and stack (df here stands for your hou_dat):
map_dict = {'republican': 2,
            'democrat': 3,
            'y': 1,
            'n': 0,
            '?': -1}
df1 = df.stack().map(map_dict).unstack()
print(df1)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 2 0 1 0 1 1 1 0 0 0 1 -1 1 1 1 0 1
1 2 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 -1
2 3 -1 1 1 -1 1 1 0 0 0 0 1 0 1 1 0 0
3 3 0 1 1 0 -1 1 0 0 0 0 1 0 1 0 0 1

If you're dealing with data from a CSV file, it is better to use pandas' own methods.
In this case, the replace method does exactly what you asked for:
hou_dat.replace(to_replace={'republican':2, 'democrat':3, 'y':1, 'n':0, '?':-1}, inplace=True)
You can read more about it in the pandas documentation for DataFrame.replace.
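As a side note on the original error: hou_dat[i, j] asks pandas for a column labelled with the tuple (i, j), which is why the lookup fails with KeyError: (0, 0); positional access to a cell goes through .iat or .iloc instead. The sketch below is a minimal illustration under a few assumptions: it reads the four sample rows from the question out of an in-memory string rather than house.data, and it uses a per-column Series.map as one possible variant of the mapping shown above. The names sample, raw, encoded and counts are made up for this example.

```python
import io

import pandas as pd

# Four sample rows copied from the question; the real data lives in house.data.
sample = """republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
"""
raw = pd.read_csv(io.StringIO(sample), header=None)

# raw[0, 0] looks for a column labelled (0, 0) and raises KeyError.
try:
    raw[0, 0]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

# Positional access to a single cell uses .iat (or .iloc).
first_cell = raw.iat[0, 0]

map_dict = {"republican": 2, "democrat": 3, "y": 1, "n": 0, "?": -1}

# Map every column through the dictionary.
encoded = raw.apply(lambda col: col.map(map_dict))

# Count how often each code occurs per column; codes that never occur become 0.
counts = encoded.apply(lambda col: col.value_counts()).fillna(0).astype(int)

print(encoded)
print(counts)
```

Calling value_counts on each column avoids the top-level pd.value_counts helper, which newer pandas releases deprecate.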

Related

while loop that is equivalent to for loop

I am trying to experiment with how the while loop works.
docs = ['123867', '256789', '3aa', '4gg', '5yy', '6abc']
for i in range(0, len(docs)):
    for j in range(i, len(docs[i])):
        print(i, j)
My output for the above code is
0 0
0 1
0 2
0 3
0 4
0 5
1 1
1 2
1 3
1 4
1 5
2 2
I attempted to reproduce this with a while loop:
docs = ['123867', '256789', '3aa', '4gg', '5yy', '6abc']
i = 0
j = i
while i < len(docs):
    while j < len(docs[i]):
        print(i, j)
        j += 1
    i += 1
but the output is
0 0
0 1
0 2
0 3
0 4
0 5
How can I fix my while loop to match the for loop? Thanks!
docs = ['123867', '256789', '3aa', '4gg', '5yy', '6abc']
i = 0
while i < len(docs):
    j = i  # moved here: j must be reset for every i
    while j < len(docs[i]):
        print(i, j)
        j += 1
    i += 1
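One way to convince yourself that the two loops are equivalent is to collect the (i, j) pairs instead of printing them and compare the results. This is only a sketch; the helper names pairs_with_for and pairs_with_while are made up for the comparison.

```python
docs = ['123867', '256789', '3aa', '4gg', '5yy', '6abc']


def pairs_with_for(items):
    """Collect (i, j) pairs using the nested for loops from the question."""
    result = []
    for i in range(len(items)):
        for j in range(i, len(items[i])):
            result.append((i, j))
    return result


def pairs_with_while(items):
    """Collect the same pairs using nested while loops."""
    result = []
    i = 0
    while i < len(items):
        j = i  # reset j at the start of every outer iteration
        while j < len(items[i]):
            result.append((i, j))
            j += 1
        i += 1
    return result


for_pairs = pairs_with_for(docs)
while_pairs = pairs_with_while(docs)
print(for_pairs == while_pairs)
```

In the original attempt j was initialised once, outside the outer loop, so after the first pass it stayed at 6 and the inner condition was never true again.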

Replace all the subdiagonals of a matrix for a given k in Python

I would like to replace all values of the subdiagonals under the k-diagonal.
For example :
We first import the numpy library :
import numpy as np
Then we create the matrix :
In [14]: matrix = np.matrix('1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1')
We are then getting :
In [15]: print(matrix)
Out[16]:
[[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]]
We then get the diagonals under the k-diagonal for k = 1 for example :
In [17]: lowerdiags = [np.diag(matrix, k=e+1).tolist() for e in range(-len(matrix), k)]
In [18]: print(lowerdiags)
Out[19]: [[1], [1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
And I'm stuck there. What should I add so that, for k = 1, all values on those subdiagonals are replaced by 0, like this (knowing that we have just found the subdiagonals):
[[0 1 1 1 1 1]
[0 0 1 1 1 1]
[0 0 0 1 1 1]
[0 0 0 0 1 1]
[0 0 0 0 0 1]]
or even for k = 0 :
[[1 1 1 1 1 1]
[0 1 1 1 1 1]
[0 0 1 1 1 1]
[0 0 0 1 1 1]
[0 0 0 0 1 1]]
Thank you for your help and your patience.
I found a way using the numpy function fill_diagonal and iterating over the diagonals below k:
# Import numpy library
import numpy as np
def Exercise_3(matrix, k):
    # print initial matrix
    print(matrix)
    # offset walks over every diagonal below the k-th one
    for offset in range(-len(matrix) + 1, k):
        if offset < 0:
            # Slice so that the lower diagonal becomes the main diagonal of the view
            np.fill_diagonal(matrix[-offset:, :offset], 0)
        elif offset > 0:
            # Slice so that the upper diagonal becomes the main diagonal of the view
            np.fill_diagonal(matrix[:-offset, offset:], 0)
        else:
            # Just replace the main diagonal by 0
            np.fill_diagonal(matrix, 0)
        # print to see each change on the matrix
        #print(matrix)
        #print(offset)
    return matrix
def main():
    k = 0
    # another way of creating a matrix
    #matrix = np.matrix('1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1; 1 1 1 1 1 1')
    # matrix of 5 rows and 5 columns filled by 1
    matrix = np.array(([1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1],[1,1,1,1,1]))
    NewMatrix = Exercise_3(matrix, k)
    print(NewMatrix)
main()
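As an alternative sketch, not part of the answer above: NumPy's np.triu returns a copy of an array with every element below the k-th diagonal set to zero, which appears to be exactly the result asked for in the question. The wrapper name zero_below_diagonal is made up here, and the example uses a plain 5x6 array of ones rather than np.matrix.

```python
import numpy as np


def zero_below_diagonal(matrix, k):
    """Return a copy of matrix with all entries below the k-th diagonal set to 0."""
    return np.triu(matrix, k)


ones = np.ones((5, 6), dtype=int)

result_k1 = zero_below_diagonal(ones, 1)
result_k0 = zero_below_diagonal(ones, 0)

print(result_k1)
print(result_k0)
```

Unlike the fill_diagonal loop, this does not modify the input in place, so the original matrix stays available.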

Series calculation based on shifted values / recursive algorithm

I have the following:
df['PositionLong'] = 0
df['PositionLong'] = np.where(df['Alpha'] == 1, 1, (np.where(np.logical_and(df['PositionLong'].shift(1) == 1, df['Bravo'] == 1), 1, 0)))
This line basically only takes df['Alpha'] into account, not df['PositionLong'].shift(1). It does not seem to recognize it, and I don't understand why.
It produces this:
df['Alpha'] df['Bravo'] df['PositionLong']
0 0 0
1 1 1
0 1 0
1 1 1
1 1 1
However what I wanted the code to do is this:
df['Alpha'] df['Bravo'] df['PositionLong']
0 0 0
1 1 1
0 1 1
1 1 1
1 1 1
I believe the solution is to loop each row, but this will take very long.
Can you help me please?
You are looking for a recursive function, since each PositionLong value depends on the previous PositionLong value, which in turn was determined by Alpha.
But numpy.where is a regular function, so df['PositionLong'].shift(1) is evaluated as a series of 0 values, since you initialise the series with 0.
A manual loop need not be expensive. You can use numba to efficiently implement your recursive algorithm:
import numpy as np
from numba import njit
@njit
def rec_algo(alpha, bravo):
    res = np.empty(alpha.shape)
    # first row has no previous position, so it depends on alpha only
    res[0] = 1 if alpha[0] == 1 else 0
    for i in range(1, len(res)):
        if (alpha[i] == 1) or ((res[i-1] == 1) and bravo[i] == 1):
            res[i] = 1
        else:
            res[i] = 0
    return res
df['PositionLong'] = rec_algo(df['Alpha'].values, df['Bravo'].values).astype(int)
Result:
print(df)
Alpha Bravo PositionLong
0 0 0 0
1 1 1 1
2 0 1 1
3 1 1 1
4 1 1 1
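For readers without numba installed, the same recurrence can be sketched in plain Python to check the logic on the sample data. The function name position_long is made up for this illustration, and the inputs are ordinary lists; with a DataFrame you could pass df['Alpha'].tolist() and df['Bravo'].tolist(). It will be slower than the compiled version on large inputs.

```python
def position_long(alpha, bravo):
    """Row i is 1 if alpha[i] == 1, or if the previous result is 1 and bravo[i] == 1."""
    result = []
    previous = 0  # there is no position before the first row
    for a, b in zip(alpha, bravo):
        current = 1 if a == 1 or (previous == 1 and b == 1) else 0
        result.append(current)
        previous = current
    return result


alpha = [0, 1, 0, 1, 1]
bravo = [0, 1, 1, 1, 1]
positions = position_long(alpha, bravo)
print(positions)
```

The third row is the interesting one: Alpha is 0 there, but the position carries over because the previous result was 1 and Bravo is 1.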

Python numpy zeros array being assigned 1 for every value when only one index is updated

The following is my code:
amount_features = X.shape[1]
best_features = np.zeros((amount_features,), dtype=int)
best_accuracy = 0
best_accuracy_index = 0
def find_best_features(best_features, best_accuracy):
    for i in range(amount_features):
        trial_features = best_features
        trial_features[i] = 1
        svc = SVC(C = 10, gamma = .1)
        svc.fit(X_train[:,trial_features==1],y_train)
        y_pred = svc.predict(X_test[:,trial_features==1])
        accuracy = metrics.accuracy_score(y_test,y_pred)
        if (accuracy > best_accuracy):
            best_accuracy = accuracy
            best_accuracy_index = i
    print(best_accuracy_index)
    best_features[best_accuracy_index] = 1
    return best_features, best_accuracy
bf, ba = find_best_features(best_features, best_accuracy)
print(bf, ba)
And this is my output:
25
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] 0.865853658537
And my expected output:
25
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0] 0.865853658537
I am trying to update the zeros array with the index that gives the highest accuracy. As you see it should be index 25, and I follow that by assigning the 25 index for my array equal to 1. However, when I print the array it shows every index has been updated to 1.
Not sure what is the mishap. Thanks for spending your limited time on Earth to help me.
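Change trial_features = best_features to trial_features = np.copy(best_features). Plain assignment does not copy the array: both names refer to the same object, so every trial_features[i] = 1 also writes into best_features. The reasoning behind the change was already given by @Michael Butscher.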
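A small sketch of the difference between assignment and np.copy, using made-up array names (shared, alias, original, trial) rather than the variables from the question:

```python
import numpy as np

# Plain assignment: both names point at the same array.
shared = np.zeros(4, dtype=int)
alias = shared
alias[0] = 1
same_memory = np.shares_memory(alias, shared)

# np.copy: the trial array is independent of the original.
original = np.zeros(4, dtype=int)
trial = np.copy(original)
trial[0] = 1
copy_is_separate = not np.shares_memory(trial, original)

print(shared, same_memory)
print(original, trial, copy_is_separate)
```

In the question's loop this means each iteration should start from a fresh copy, so that only the index chosen after the loop ends up set to 1.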

Unable to retrieve required indices from multiple NumPy arrays

I have 4 NumPy arrays of the same shape (all 2-D). I need the indices of the last array (d) where its elements are smaller than 20, restricted to the region where the elements of array a are 1 and the elements of arrays b and c are not 1.
I tried as follows:
mask = (a == 1)|(b != 1)|(c != 1)
answer = d[mask | d < 20]
Then I have to set those regions of d to 1 and all other regions of d to 0.
d[answer] = 1
d[d!=1] = 0
print(d)
I could not solve this problem. How do you solve it?
import numpy as np
a = np.array([[0,0,0,1,1,1,1,1,0,0,0],
              [0,0,0,1,1,1,1,1,0,0,0],
              [0,0,0,1,1,1,1,1,0,0,0],
              [0,0,0,1,1,1,1,1,0,0,0],
              [0,0,0,1,1,1,1,1,0,0,0],
              [0,0,0,1,1,1,1,1,0,0,0]])
b = np.array([[0,0,0,1,1,0,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,0,0,0],
              [0,0,0,1,0,1,0,0,0,0,0],
              [0,0,0,1,1,1,0,1,0,0,0],
              [0,0,0,0,0,0,1,0,0,0,0],
              [0,0,0,0,1,0,1,0,0,0,0]])
c = np.array([[0,0,0,0,0,0,1,0,0,0,0],
              [0,0,0,0,0,0,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,0,0,0],
              [0,0,0,0,0,0,1,0,0,0,0],
              [0,0,0,0,1,0,0,0,0,0,0],
              [0,0,0,0,0,1,0,0,0,0,0]])
d = np.array([[0,56,89,67,12,28,11,12,14,8,240],
              [1,57,89,67,18,25,11,12,14,9,230],
              [4,51,89,87,19,20,51,92,54,7,210],
              [6,46,89,67,51,35,11,12,14,6,200],
              [8,36,89,97,43,67,81,42,14,1,220],
              [9,16,89,67,49,97,11,12,14,2,255]])
The conditions should be AND-ed together, instead of OR-ed. You can first get the Boolean array / mask representing desired region, and then modify d based on it:
mask = (a == 1) & (b != 1) & (c != 1) & (d < 20)
d[mask] = 1
d[~mask] = 0
print(d)
Output:
[[0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0]]
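One more detail worth noting about the original attempt: in Python, | and & bind more tightly than comparison operators, so mask | d < 20 is parsed as (mask | d) < 20. That is why each comparison in the answer is wrapped in parentheses. The sketch below shows the difference on small made-up arrays (region and values are illustrative names, not data from the question) and uses np.where to build the 0/1 result without overwriting the input.

```python
import numpy as np

region = np.array([True, True, False])
values = np.array([5, 30, 10])

# Parsed as (region | values) < 20: a bitwise OR on the numbers, then a comparison.
without_parens = region | values < 20

# What was probably intended by the OR version.
with_parens = region | (values < 20)

# The AND-ed condition from the answer, on this toy data.
selected = region & (values < 20)

# 1 where the condition holds, 0 elsewhere; values itself is left untouched.
flags = np.where(selected, 1, 0)

print(without_parens)
print(with_parens)
print(flags)
```

Building the result with np.where keeps the original array intact, which can help if d is needed again later.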
