np reshape within pandas apply - python

An exception is raised: "Exception: Data must be 1-dimensional."
I'll present the problem with a toy example to be clear.
import pandas as pd
import numpy as np
Initial dataframe:
df = pd.DataFrame({"A": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2]})
>>> df
A C R
0 [10, 15, 12, 14] 2 2
1 [20, 30, 10, 43] 2 2
Conversion to numpy array and reshape:
df['A'] = df['A'].apply(lambda x: np.array(x))
df.apply(lambda x: print(x[0],(x[1],x[2])) ,axis=1)
df['A_reshaped'] = df[['A','R','C']].apply(lambda x: np.reshape(x[0], (x[1], x[2])), axis=1)
The result I expect (instead, the apply above raises the exception):
A C R A_reshaped
0 [10, 15, 12, 14] 2 2 [[10,15],[12,14]]
1 [20, 30, 10, 43] 2 2 [[20,30],[10,43]]
Does anyone know the reason? It seems pandas doesn't accept 2-dimensional arrays in its cells, but that's strange...
Thanks in advance for any help!!!

Using apply directly doesn't work - the return value is a numpy 2d array, and placing it back in the DataFrame confuses Pandas, for some reason.
This seems to work, though:
df['reshaped'] = pd.Series([a.reshape((c, r)) for (a, c, r) in zip(df.A, df.C, df.R)])
>>> df
A C R reshaped
0 [10, 15, 12, 14] 2 2 [[10, 15], [12, 14]]
1 [20, 30, 10, 43] 2 2 [[20, 30], [10, 43]]
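One caveat worth noting: a pd.Series built from a bare list gets a fresh 0..n-1 index, so if df has a non-default index the assignment above will misalign rows. A minimal variant of the same technique, just passing index=df.index explicitly:
df['reshaped'] = pd.Series(
    [a.reshape((c, r)) for (a, c, r) in zip(df.A, df.C, df.R)],
    index=df.index,  # keep the new column aligned with df's own index
)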

Related

Modifying 2d array

I'm stuck trying to modify a 2d array... Nothing I try seems to work... I'm trying to write a function that will add a value at its specific location in the numbers column...
import pandas as pd

def twod_array(num):
    data = {"group": [-1, 0, 1, 2],
            'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
            }
    df = pd.DataFrame(data=data)
    print(df)
    return 0
Currently it prints this:
group numbers
0 -1 [2]
1 0 [14, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
What I'd like to do is to add a value based on the passed input, so for example if I pass 14.5 as a num, this is the output I'd like to see:
group numbers
0 -1 [2]
1 0 [14, 14.5, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
Another example:
If I pass 18 as a num:
group numbers
0 -1 [2]
1 0 [14, 15]
2 1 [16, 17, 18]
3 2 [19, 20, 21]
I'm hoping someone can help with this.
df = pd.DataFrame({"group": [-1, 0, 1, 2],
'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
})
arr = df['numbers'].to_list()
in_num = 18
for i, sub_arr in enumerate(arr):
for j, num in enumerate(sub_arr):
if arr[i][j]>in_num:
if j!=0: arr[i].insert(j,in_num)
else: arr[i-1].insert(-1 ,in_num)
df['numbers'] = arr
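For what it's worth, here is a hedged alternative sketch using the standard library's bisect.insort; the function name add_number and the "last sublist whose first value is <= num" rule are my reading of the intended behaviour, not something stated in the question:
from bisect import insort

import pandas as pd

df = pd.DataFrame({"group": [-1, 0, 1, 2],
                   'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
                   })

def add_number(df, num):
    # Pick the last sublist whose first value is <= num (fall back to the
    # first sublist), then insert num while keeping that sublist sorted.
    lists = df['numbers'].to_list()
    candidates = [sub for sub in lists if sub and sub[0] <= num]
    target = candidates[-1] if candidates else lists[0]
    insort(target, num)
    df['numbers'] = lists
    return df

add_number(df, 14.5)   # row 1 becomes [14, 14.5, 15]
add_number(df, 18)     # row 2 becomes [16, 17, 18]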

Modifying a NaN's position in the dataframe

I'm hoping I can explain this well. I have this df with 2 columns: group and numbers. I'm trying to take that np.nan and pop it into its new group.
def check_for_nan():
    # for example let's say my new value is 14.5
    new_nan_value = 14.5
    data = {"group": [-1, 0, 1, 2, 3],
            'numbers': [[np.nan], [11, 12], [14, 15], [16, 17], [18, 19]],
            }
    df = pd.DataFrame(data=data)
    # *** add some code ***
    # I created a new dataframe to visually show how it should look,
    # but we would want to operate only on the same df from above
    data_2 = {"group": [0, 1, 2, 3],
              'numbers': [[11, 12], [14, np.nan, 15], [16, 17], [18, 19]],
              }
    df_2 = pd.DataFrame(data=data_2)
    # should return the new group number where the nan would live
    return data_2["group"][1]
Output:
current:
group numbers
0 -1 [nan]
1 0 [11, 12]
2 1 [14, 15]
3 2 [16, 17]
4 3 [18, 19]
Desired output when new_nan_value = 14.5
group numbers
0 0 [11, 12]
1 1 [14, nan, 15]
2 2 [16, 17]
3 3 [18, 19]
return 1
With the dataframe you provided:
import pandas as pd
df = pd.DataFrame(
    {
        "group": [-1, 0, 1, 2, 3],
        "numbers": [[pd.NA], [11, 12], [14, 15], [16, 17], [18, 19]],
    }
)
new_nan_value = 14.5
Here is one way to do it:
def move_nan(df, new_nan_value):
    """Helper function.

    Args:
        df: input dataframe.
        new_nan_value: insertion value.

    Returns:
        Dataframe with nan value at insertion point, new group.
    """
    # Reshape dataframe along row axis
    df = df.explode("numbers").dropna().reset_index(drop=True)
    # Insert new row
    insert_pos = df.loc[df["numbers"] < new_nan_value, "numbers"].index[-1] + 1
    df = pd.concat(
        [
            df.loc[: insert_pos - 1, :],
            pd.DataFrame({"group": [pd.NA], "numbers": pd.NA}, index=[insert_pos]),
            df.loc[insert_pos:, :],
        ]
    )
    df["group"] = df["group"].fillna(method="bfill")
    # Find new group value
    new_group = df.loc[df["numbers"].isna(), "group"].values[0]
    # Groupby and reshape dataframe along column axis
    df = df.groupby("group").agg(list).reset_index(drop=False)
    return df, new_group
So that:
df, new_group = move_nan(df, 14.5)
print(df)
# Output
group numbers
0 0 [11, 12]
1 1 [14, nan, 15]
2 2 [16, 17]
3 3 [18, 19]
print(new_group) # 1
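A small compatibility note: newer pandas releases (2.1+) deprecate the method= argument of fillna, so on those versions the back-fill step inside move_nan can be written as:
df["group"] = df["group"].bfill()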

How to use numpy to add each row in a matrix to every row in another matrix

Here is the example:
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[7, 8], [9, 10], [11, 12], [12, 13]])
What I want is to add each row in a to every row in b element-wise and then sum the elements. For example, [1,2] added to every row of b: 1+7=8, 2+8=10, 8+10=18; 1+9=10, 2+10=12, 10+12=22... The result should look like [[18, 22, 26, 28], [22, 26, ...], [26, 30, ...]].
My question is how to accomplish that. I know numpy can be more efficient than a loop, but how do I use matrix operations to calculate this?
I believe this does what you want:
>>> a = np.array([[1,2],[3,4],[5,6]])
>>> b = np.array([[7,8],[9,10],[11,12],[12,13]])
>>> np.sum(a, axis=1)[:,None] + np.sum(b, axis=1)[None,:]
array([[18, 22, 26, 28],
[22, 26, 30, 32],
[26, 30, 34, 36]])
You can use list comprehensions:
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[7, 8], [9, 10], [11, 12], [12, 13]])
[[sum(i) + sum(j) for j in b] for i in a]
Output:
[[18, 22, 26, 28], [22, 26, 30, 32], [26, 30, 34, 36]]
This can be done in the most precise and succinct way as follows:
np.einsum('ijk-> ij', a[:,None,:]+b)
Let me explain each step. It combines the einsum and broadcasting concepts of numpy.
Step 1- a[:,None,:] reshapes matrix a to shape (3,1,2). This middle axis of length 1 is helpful for broadcasting.
Step 2- a[:,None,:] + b broadcasts a and adds matrix b to get a resultant matrix of shape (3,4,2).
Step 3- np.einsum('ijk-> ij', a[:,None,:]+b) does sum reduction along the last axis of matrix obtained from previous step.
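Putting those steps into code on the same a and b (nothing new here, just printing the intermediate shapes for clarity):
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[7, 8], [9, 10], [11, 12], [12, 13]])

step1 = a[:, None, :]                 # shape (3, 1, 2)
step2 = step1 + b                     # broadcasts to shape (3, 4, 2)
step3 = np.einsum('ijk->ij', step2)   # sum over the last axis -> shape (3, 4)
print(step1.shape, step2.shape, step3.shape)
print(step3)
# [[18 22 26 28]
#  [22 26 30 32]
#  [26 30 34 36]]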

Scoring a pandas column vs other columns

I want to score how many of the other columns in df are greater than or equal to a reference column. Given testdf:
testdf = pd.DataFrame({'RefCol': [10, 20, 30, 40],
                       'Col1': [11, 19, 29, 40],
                       'Col2': [12, 21, 28, 39],
                       'Col3': [13, 22, 31, 38]
                       })
I am using the helper function:
def sorter(row):
    sortedrow = row.sort_values()
    return sortedrow.index.get_loc('RefCol')
as:
testdf['Score'] = testdf.apply(sorter, axis=1)
With the actual data this method is very slow; how can I speed it up? Thanks
Looks like you need to compare against RefCol and count how many columns are less than RefCol; use:
testdf.lt(testdf['RefCol'],axis=0).sum(1)
0 0
1 1
2 2
3 2
For greater than or equal to, use:
testdf.drop('RefCol', axis=1).ge(testdf.RefCol, axis=0).sum(1)
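So, assuming the score you want is "how many of the other columns are >= RefCol", the vectorised replacement for the apply would be something like:
testdf['Score'] = testdf.drop('RefCol', axis=1).ge(testdf['RefCol'], axis=0).sum(axis=1)
print(testdf['Score'].tolist())   # [3, 2, 1, 1]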

Transforming a MultiIndex to a row-wise multi-dimensional NumPy array

Suppose I have a MultiIndex DataFrame similar to an example from the MultiIndex docs.
>>> df
0 1 2 3
first second
bar one 0 1 2 3
two 4 5 6 7
baz one 8 9 10 11
two 12 13 14 15
foo one 16 17 18 19
two 20 21 22 23
qux one 24 25 26 27
two 28 29 30 31
I want to generate a NumPy array from this DataFrame with a 3-dimensional structure like
>>> desired_arr
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
How can I do so?
Hopefully it is clear what is happening here - I am effectively unstacking the DataFrame by the first level and then trying to turn each top level in the resulting column MultiIndex to its own 2-dimensional array.
I can get half way there with
>>> df.unstack(1)
0 1 2 3
second one two one two one two one two
first
bar 0 4 1 5 2 6 3 7
baz 8 12 9 13 10 14 11 15
foo 16 20 17 21 18 22 19 23
qux 24 28 25 29 26 30 27 31
but then I am struggling to find a nice way to turn each column into a 2-dimensional array and then join them together, beyond doing so explicitly with loops and lists.
I feel like there should be some way for me to specify the shape of my desired NumPy array beforehand, fill it with np.nan and then use a specific iterating order to fill the values with my DataFrame, but I have not managed to solve the problem with this approach yet.
To generate the sample DataFrame
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
ind = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
df = pd.DataFrame(np.arange(8*4).reshape((8, 4)), index=ind)
Some reshape and swapaxes magic -
df.values.reshape(4,2,-1).swapaxes(1,2)
Generalizable to -
m,n = len(df.index.levels[0]), len(df.index.levels[1])
arr = df.values.reshape(m,n,-1).swapaxes(1,2)
Basically, we split the first axis into two axes of lengths 4 and 2, creating a 3D array, and then swap the last two axes, i.e. push the axis of length 2 to the back (as the last one).
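To make the intermediate shapes concrete (reusing the sample df built from the question):
step1 = df.values.reshape(4, 2, -1)   # shape (4, 2, 4): (first level, second level, columns)
step2 = step1.swapaxes(1, 2)          # shape (4, 4, 2): one 4x2 block per first-level label
print(step1.shape, step2.shape)       # (4, 2, 4) (4, 4, 2)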
Sample output -
In [35]: df.values.reshape(4,2,-1).swapaxes(1,2)
Out[35]:
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]],
[[16, 20],
[17, 21],
[18, 22],
[19, 23]],
[[24, 28],
[25, 29],
[26, 30],
[27, 31]]])
To complete @Divakar's answer, here is a multidimensional generalisation:
# sort values by index
A = df.sort_index()
# fill na
for idx in A.index.names:
    A = A.unstack(idx).fillna(0).stack(1)
# create a tuple with the right dimensions
reshape_size = tuple([len(x) for x in A.index.levels])
# reshape
arr = np.reshape(A.values, reshape_size).swapaxes(0, 1)
