.argmax(axis =1) not working on a numpy array - python

Hi, I am trying to use the argmax function on a numpy array, but it raises an error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-25-914c12a3a737> in <module>()
5 # TODO - Check for data issues
6 # Hint: You can convert from one-hot to integers with argmax
----> 7 train_df1 = train_df1.argmax(axis = 1)
8
9 # Initialise
AttributeError: 'function' object has no attribute 'argmax'
code:
train_df1
<bound method DataFrame.to_numpy of MEL NV BCC AKIEC BKL DF VASC
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ... ... ...
194 0.0 1.0 0.0 0.0 0.0 0.0 0.0
195 0.0 1.0 0.0 0.0 0.0 0.0 0.0
196 0.0 1.0 0.0 0.0 0.0 0.0 0.0
197 0.0 1.0 0.0 0.0 0.0 0.0 0.0
198 0.0 0.0 1.0 0.0 0.0 0.0 0.0
[199 rows x 7 columns]>
train_df1 = train_df1.argmax(axis = 1)
Does anyone understand why I am getting this?
Thanks

You'll need to post more code for us to duplicate that behaviour, but we can see from the error messages what's going on.
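train_df1 is a function (the bound method DataFrame.to_numpy), not an array, most likely because it was assigned without the trailing parentheses. You need to call the function: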
my_values = train_df1()
my_max = my_values.argmax(axis = 1)
The main clue is that when you type train_df1 at the prompt, it tells you it's a "bound method". That's another name for a function attached to an object, which you still need to call.
I'm also assuming your train_df1 function returns a numpy array, such that the argmax function call will work.
In one line, it's just:
my_max = train_df1().argmax(axis = 1)
You'll definitely not want to redefine "train_df1" as a variable, so be sure to pick a new variable name to hold the indices returned by argmax.
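A minimal reproduction of the problem and the fix. It assumes train_df1 came from writing df.to_numpy without parentheses; the original assignment is not shown in the question, so the small DataFrame and the names df, not_called and labels are made up for illustration:

```python
import pandas as pd

# Hypothetical one-hot labels; the real data is not shown in the question.
df = pd.DataFrame(
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    columns=["MEL", "NV", "BCC"],
)

not_called = df.to_numpy               # missing (): a bound method, not an array
print(callable(not_called))            # True
print(hasattr(not_called, "argmax"))   # False, hence the AttributeError

values = df.to_numpy()                 # calling the method returns a numpy array
labels = values.argmax(axis=1)         # column index of the 1 in each row
print(labels.tolist())                 # [1, 0, 2]
```

Keeping the array in a separate name (values here) leaves the original DataFrame available for later steps.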

Related

How to use get_loc to get the position of a specific label?

I have a pandas Dataframe representing a Matrix:
product 63727 63729 63741 63750 ... 1180572 1181075 1181077 1182263
username ...
ali8 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
micheal54 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
aaron176 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
rose_2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
sara_pv2 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
I'm trying to perform KNN over it to get the most similar users to the one I'm indicating, so I'm trying to use:
query_index = order_products.index.get_loc('rose_2')
to get the index position of the username in question I want to get similar users for.
But that raises the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable
Which I don't know how to fix.
I wrote a small example that might help you:
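import pandas as pd
import numpy as np
df = pd.DataFrame(data={"63727": ["0.0", "0.0", "0.0", "0.0"]}, index=['product', 'username', 'ali8', 'micheal54'])
print(df)  # display(df) also works inside a Jupyter notebook
# np.where returns a tuple of index arrays; [0][0] picks the first match
position = np.where(df.index.values == 'ali8')[0][0]
print(position)  # 2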
You can get the integer for the index position with the specified name this way.
I hope it is helpful.
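For comparison, here is a small sketch of how Index.get_loc itself behaves. The traceback in the question does not show where the TypeError comes from, so this is not a diagnosis; the tiny DataFrames and variable names are invented for illustration. With unique labels get_loc returns a single integer, while with duplicated labels it returns a slice or boolean mask, which may matter if later code expects one position:

```python
import numpy as np
import pandas as pd

# Unique index: get_loc returns a single integer position.
unique = pd.DataFrame({"63727": [0.0, 0.0, 0.0]},
                      index=["ali8", "aaron176", "rose_2"])
pos = unique.index.get_loc("rose_2")
print(pos)  # 2

# Duplicated label: get_loc no longer returns a single integer.
dup = pd.DataFrame({"63727": [0.0, 0.0, 0.0]},
                   index=["rose_2", "ali8", "rose_2"])
loc = dup.index.get_loc("rose_2")
print(type(loc))

# Explicit integer positions of every match, whatever the index looks like.
positions = np.flatnonzero(dup.index == "rose_2")
print(positions.tolist())  # [0, 2]
```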

Iterating over pandas DataFrame with identical columns header

I am trying to iterate through rows and columns of the Pandas DataFrame and write that result in a new DataFrame if some condition is met. I am able to iterate on the following DataFrame which has different names for row and column.
W0O5 W1O5 W2O5 W3O5
W0O5 0.0 0.0 0.0 0.0
W1O5 0.0 0.0 1.0 0.0
W2O5 0.0 1.0 0.0 0.0
W3O5 0.0 0.0 0.0 0.0
I used the following approach
for i in pandas_df.index:
    for j in pandas_df.columns:
        print(i, j)
        print(pandas_df.at[i, j])
        if pandas_df.at[i, j] == 1:
            single_pandas_df.at['WO5', 'WO5_corner'] = 1
where single_pandas_df is the new DataFrame I created, to which I want to add the value at the corresponding row and column.
However, when I try to iterate through a DataFrame with identical headers for rows and columns, as below:
WO5 WO5 WO5 WO5
WO5 0.0 0.0 0.0 0.0
WO5 0.0 0.0 1.0 0.0
WO5 0.0 1.0 0.0 0.0
WO5 0.0 0.0 0.0 0.0
I get the AttributeError saying
AttributeError: 'BlockManager' object has no attribute 'T'
I know the error is due to the duplicate column names. Is there any way to handle such a case in pandas? All of my DataFrames look like the second case, and I need to get the value at each row and column index.
Thanks in advance.
Update after Yolos comment:
Actually I have many such DataFrames as below
DyO7 DyO7 DyO6 DyO7 DyO7 DyO6
DyO7 0.0 3.0 1.0 2.0 1.0 0.0
DyO7 3.0 0.0 0.0 1.0 0.0 1.0
DyO6 1.0 0.0 0.0 0.0 1.0 0.0
DyO7 2.0 1.0 0.0 0.0 3.0 1.0
DyO7 1.0 0.0 1.0 3.0 0.0 0.0
DyO6 0.0 1.0 0.0 1.0 0.0 0.0
and next one as
TaO6 TaO6
TaO6 0.0 1.0
TaO6 1.0 0.0
In these DataFrames 1, 2 and 3 represent corner, edge and face sharing. So if the (i, j) item in the DataFrame is 1, it goes to "..."_corner; if 2, it goes to "..."_edge; and if 3, it goes to "..."_face.
My initial single_pandas DataFrame looks like the following:
DyO6_corner DyO6_edge DyO6_face DyO7_corner DyO7_edge DyO7_face TaO6_corner TaO6_edge TaO6_face WO5_corner WO5_edge WO5_face
DyO6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DyO7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TaO6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
WO5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
After my script above updates this single_pandas DataFrame, there will be a 1 at ('WO5', 'WO5_corner') and it becomes:
DyO6_corner DyO6_edge DyO6_face DyO7_corner DyO7_edge DyO7_face TaO6_corner TaO6_edge TaO6_face WO5_corner WO5_edge WO5_face
DyO6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
DyO7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TaO6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
WO5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0
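One possible way to sidestep label-based lookup with duplicate headers is to read the values by position and use the labels only to build the target column name. The sketch below assumes the target cell is (row label, column label + "_" + sharing type), which is a guess based on the ('WO5', 'WO5_corner') example; the suffix mapping and the small input are illustrative:

```python
import pandas as pd

# Hypothetical data mirroring the second example in the question.
labels = ["WO5"] * 4
pandas_df = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
    index=labels, columns=labels,
)

suffix = {1: "corner", 2: "edge", 3: "face"}  # sharing codes from the question
kinds = ["WO5"]
single_pandas_df = pd.DataFrame(
    0.0, index=kinds,
    columns=[f"{k}_{s}" for k in kinds for s in suffix.values()],
)

values = pandas_df.to_numpy()  # positional access ignores duplicate labels
for i, row_label in enumerate(pandas_df.index):
    for j, col_label in enumerate(pandas_df.columns):
        code = int(values[i, j])
        if code in suffix:
            # The target frame has unique labels, so .at is safe here.
            single_pandas_df.at[row_label, f"{col_label}_{suffix[code]}"] = 1

print(single_pandas_df)
```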

Need to add column names to numpy array

I am trying to create a Connect 4 game with a 6x7 array in Python, and I need column headers so that column 0 is named a, column 1 is named b, and so on. The purpose is for moves to be made by typing 'a' (drops a token in the first column), 'b' (drops a token in the second), etc. This is my code to create the array:
import numpy as np
def clear_board():
    board = np.zeros((6, 7))
    return board
If you need column names, the easiest way is to use a pandas DataFrame instead of a numpy array:
import numpy as np
import pandas as pd
def clear_board():
    board = pd.DataFrame(np.zeros((6, 7)), columns=list('ABCDEFG'))
    return board
>>> clear_board()
A B C D E F G
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Beyond that, take a look at the options provided in this answer
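If you would rather stay with a plain numpy array, one alternative is to keep the letters outside the array and translate them to column indices. This is only a sketch: COLUMN_INDEX and drop_token are hypothetical names, and the rule that a token lands in the lowest empty cell is the usual Connect 4 convention rather than something stated in the question:

```python
import numpy as np

# Map each letter to its column index: 'a' -> 0, 'b' -> 1, ...
COLUMN_INDEX = {letter: i for i, letter in enumerate("abcdefg")}

def clear_board():
    return np.zeros((6, 7))

def drop_token(board, letter, token):
    """Place token in the lowest empty cell of the column named by letter."""
    col = COLUMN_INDEX[letter.lower()]       # KeyError for an unknown letter
    empty_rows = np.flatnonzero(board[:, col] == 0)
    if empty_rows.size == 0:
        raise ValueError(f"column {letter!r} is full")
    board[empty_rows[-1], col] = token       # largest row index = bottom row
    return board

board = clear_board()
drop_token(board, "a", 1)
drop_token(board, "a", 2)
print(board[:, 0])  # the two tokens sit in the bottom two rows of column 'a'
```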

"Cannot reindex from a duplicate axis" when groupby.apply() on MultiIndex columns

I'm playing around with computing subtotals within a DataFrame that looks like this (note the MultiIndex):
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
I can successfully add the subtotals with the following code:
(
    df
    .groupby(level=0)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name), df.sum().to_frame('Total').T]
        )
    )
)
And it looks like this:
0 1 2 3 4 5
A 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
B 1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
Total 0.0 0.0 0.0 0.0 0.0 0.0
However, when I work with the transposed DataFrame, it does not work. The DataFrame looks like:
A B
1 2 1 2
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
And I use the following code:
(
    df2
    .groupby(level=0, axis=1)
    .apply(
        lambda df: pd.concat(
            [df.xs(df.name, axis=1), df.sum(axis=1).to_frame('Total')],
            axis=1
        )
    )
)
I have specified axis=1 everywhere I can think of, but I get an error:
ValueError: cannot reindex from a duplicate axis
I would expect the output to be:
A B
1 2 Total 1 2 Total
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0
Is this a bug? Or have I not specified the axis correctly everywhere? As a workaround, I can obviously transpose the DataFrame, produce the totals, and transpose back, but I'd like to know why it's not working here, and submit a bug report if necessary.
The problem DataFrame can be generated with:
import numpy as np
import pandas as pd
df2 = pd.DataFrame(
    np.zeros([6, 4]),
    columns=pd.MultiIndex.from_product([['A', 'B'], [1, 2]])
)
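This does not explain the ValueError, but as another possible workaround besides transposing, the per-group totals can be built with an explicit loop over the top-level column labels instead of groupby. The sketch below uses non-zero sample values so the totals are visible; the names parts and result are illustrative:

```python
import numpy as np
import pandas as pd

# Same column layout as df2 in the question, with non-zero sample values.
df2 = pd.DataFrame(
    np.arange(24.0).reshape(6, 4),
    columns=pd.MultiIndex.from_product([["A", "B"], [1, 2]]),
)

parts = {}
for key in df2.columns.get_level_values(0).unique():
    block = df2[key].copy()               # sub-columns 1 and 2 of this group
    block["Total"] = block.sum(axis=1)    # row-wise subtotal for the group
    parts[key] = block

# The dict keys become the outer column level again.
result = pd.concat(parts, axis=1)
print(result)
```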

Reorganizing an MxN 2D array of datapoints into an N-dimensional array

I've got a series of measurements in a 2D array such as
T mu1 mu2 mu3 a b c d e
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 2.0 0.0 0.0 0.0 0.0 0.0
0.0 1.0 3.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 2.0 3.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 2.0 0.0 0.0 0.0 0.0 0.0
1.0 1.0 3.0 3.0 0.0 0.0 0.0 0.0 0.0
where T, mu1, mu2 and mu3 are the 4 axes of the variables I control (independent variables). a, b, c, d and e are the measurements I've made (dependent variables).
I would like to convert this 2D array into a 5D array in numpy. By specifying T, mu1, mu2 and mu3 (or at least their 4 indexes) I want to be able to retrieve the corresponding a, b, c, d and e values.
Is there a straightforward way to reshape this kind of array by specifying what columns the axes correspond to? The MultiIndex in Pandas seemed to smartly organize it in a table, but seems ill-suited for high dimensional arrays. I won't necessarily know ahead of time what the shape of the ndarray should be, but it seems to me that based on the values it should be possible to reshape the array properly. The increment values for each axis might also be different, but they will always be uniform.
My current idea involves ignoring the mu1, mu2 and mu3 columns, and stacking sets of T data into a 3D array. From there I would stack sets of 3D mu1 data into a 4D array, and repeat the process with mu2 and mu3. This seems like a tedious process that should have a simple solution though.
First, let's make some fake data:
import numpy as np
# an N x 5 array containing a regular mesh representing the stimulus params
stim_params = np.mgrid[:2, :3, :4, :5, :6].reshape(5, -1).T
# an N x 3 array representing the output values for each simulation run
output_vals = np.arange(720 * 3).reshape(720, 3)
# shuffle the rows for a bit of added realism
shuf = np.random.permutation(stim_params.shape[0])
stim_params = stim_params[shuf]
output_vals = output_vals[shuf]
Now you can use np.lexsort to get the set of indices that will sort the rows of your 2D array of simulation parameters such that the values in each column are in ascending order. Having done that, you can apply these indices to the rows of simulation output values.
# get the number of unique values for each stimulus parameter
params_shape = tuple(np.unique(col).shape[0] for col in stim_params.T)
# get the set of row indices that will sort the stimulus parameters in ascending
# order, starting with the final column
idx = np.lexsort(stim_params[:, ::-1].T)
# sort and reshape the stimulus parameters:
sorted_params = stim_params[idx].T.reshape((5,) + params_shape)
# sort and reshape the output values
sorted_output = output_vals[idx].T.reshape((3,) + params_shape)
I find that the hardest part is often just trying to wrap your head around what all the different dimensions of the outputs correspond to:
# array of stimulus parameters, with dimensions (n_params, p1, p2, p3, p4, p5)
print(sorted_params.shape)
# (5, 2, 3, 4, 5, 6)
# to check that the sorting worked as expected, we can look at the values of the
# 5th parameter when all the others are held constant at 0:
print(sorted_params[4, 0, 0, 0, 0, :])
# [0 1 2 3 4 5]
# ... and the 1st parameter when we hold all the others constant:
print(sorted_params[0, :, 0, 0, 0, 0])
# [0 1]
# ... now let the 1st and 2nd parameters covary:
print(sorted_params[:2, :, :, 0, 0, 0])
# [[[0 0 0]
#   [1 1 1]]
#
#  [[0 1 2]
#   [0 1 2]]]
Hopefully you get the idea. The same indexing logic applies to the sorted simulation outputs:
# array of outputs, with dimensions (n_outputs, p1, p2, p3, p4, p5)
print(sorted_output.shape)
# (3, 2, 3, 4, 5, 6)
# the first output variable whilst holding the first 4 simulation parameters
# constant at 0:
print(sorted_output[0, 0, 0, 0, 0, :])
# [ 0 3 6 9 12 15]
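The same lexsort idea can be wrapped in a helper for a table laid out like the one in the question, with the axis columns first and the measurement columns after them. This is a sketch, not a tested general solution: to_grid is a hypothetical name, it assumes every combination of axis values appears exactly once, and unlike the code above it puts the measurement axis last so that grid[i, j, k] returns all measurements for one grid point:

```python
import numpy as np

def to_grid(table, n_axes):
    """Reshape rows of (axis values..., measurements...) into an N-d grid.

    Assumes every combination of axis values appears exactly once.
    """
    table = np.asarray(table)
    params, values = table[:, :n_axes], table[:, n_axes:]
    axes = [np.unique(col) for col in params.T]   # sorted values per axis
    shape = tuple(len(a) for a in axes)
    if np.prod(shape) != len(table):
        raise ValueError("rows do not form a complete regular grid")
    # lexsort uses the last key as primary, so reverse the columns to make
    # the first axis column vary slowest
    order = np.lexsort(params[:, ::-1].T)
    grid = values[order].reshape(shape + (values.shape[1],))
    return axes, grid

# Fake data: a 2 x 3 x 4 grid with two measurements per point, rows shuffled.
params = np.mgrid[:2, :3, :4].reshape(3, -1).T.astype(float)
meas = np.column_stack([params.sum(axis=1), params[:, 0] * 100 + params[:, 2]])
table = np.hstack([params, meas])
table = table[np.random.default_rng(0).permutation(len(table))]

axes, grid = to_grid(table, n_axes=3)
print(grid.shape)      # (2, 3, 4, 2)
print(grid[1, 2, 3])   # measurements at axis indices (1, 2, 3): 6 and 103
```

The returned axes list holds the sorted unique values of each axis, so np.searchsorted(axes[k], value) can translate an axis value into the index to use for that dimension.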
