Shuffle the columns of a matrix in Python

Is there a way to randomly permute the columns of a matrix? I tried np.random.permutation, but the result is not what I need.
What I would like is to randomly change the positions of the matrix's columns without changing the order of the values within each column.
E.g.
Starting matrix:
1 6 11 16
2 7 12 17
3 8 13 18
4 9 14 19
5 10 15 20
Resulting matrix:
11  6  1 16
12  7  2 17
13  8  3 18
14  9  4 19
15 10  5 20

You could shuffle the transposed array:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
print(q)
# [[ 1 6 11 16]
# [ 2 7 12 17]
# [ 3 8 13 18]
# [ 4 9 14 19]
# [ 5 10 15 20]]
np.random.shuffle(np.transpose(q))
print(q)
# [[ 1 16 6 11]
# [ 2 17 7 12]
# [ 3 18 8 13]
# [ 4 19 9 14]
# [ 5 20 10 15]]
Another option, which generalizes to any axis, is indexing with a random permutation:
q = np.array([1, 6, 11, 16, 2, 7, 12, 17, 3, 8, 13, 18, 4, 9, 14, 19, 5, 10, 15, 20])
q = q.reshape((5,4))
q = q[:, np.random.permutation(q.shape[1])]
print(q)
# [[ 6 11 16 1]
# [ 7 12 17 2]
# [ 8 13 18 3]
# [ 9 14 19 4]
# [10 15 20 5]]
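On NumPy 1.17+ the Generator API can also do this in a single call: Generator.permutation accepts an axis argument and returns a shuffled copy. A short sketch of the same idea (the seed is arbitrary, not from the original answers):

```python
import numpy as np

# Same 5x4 matrix as above: columns 1-5, 6-10, 11-15, 16-20
q = np.arange(1, 21).reshape(4, 5).T
rng = np.random.default_rng(0)

# permutation with axis=1 reorders whole columns and returns a copy;
# the values inside each column keep their row positions
shuffled = rng.permutation(q, axis=1)
print(shuffled)
```

Unlike np.random.shuffle, this does not modify q in place.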

Reshape data frame, so the index column values become the columns

I want to reshape the data so that the values in the index column become the columns.
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
                  V   W   X   Y   Z
Gender   Male     5  15  11  22   8
         Female   4  12  15  18   7
Location london   4  12  16  21   7
         North    2   7   4   9   4
         South    3   8   6   9   4
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
import pandas as pd

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
                 V   W   X   Y   Z
Gender   Male    5  15  11  22   8
         Female  4  12  15  18   7
Location london  4  12  16  21   7
         North   2   7   4   9   4
         South   3   8   6   9   4
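Since the question mentions nine more categorical variables, the two steps fold naturally into a small reusable helper. The function name and level names below are my own choices, not pandas API:

```python
import pandas as pd

def split_columns_to_index(df, sep='_', names=('variable', 'level')):
    # Transpose, then turn 'Prefix_Suffix' row labels into a two-level index.
    out = df.T
    out.index = out.index.str.split(sep, expand=True)
    out.index.names = names
    out.columns.name = None
    return out

df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')

out = split_columns_to_index(df1)
print(out)
```

This works unchanged for any number of `Variable_Level` column labels, as long as they all share the same separator.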

column index out of range with a data frame

I am currently training a neural network and want to split my data into training and validation sets with an 80:20 ratio. I always split after complete purchases (a complete purchase = all rows with the same purchaseid), so each purchase stays intact.
Unfortunately, I get an IndexError: column index (12) out of range. The error occurs at the line mat[purchaseid, itemid] = 1.0. How can I fix this?
Dataframe:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid':     [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
print(df.head(20))
Methods:
PERCENTAGE_SPLIT = 20

def splitter(df):
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe.shape[0], len(dataframe['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0  # At this position is the error
    return mat
Call:
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat_ = generate_matrix(df_tr, 'train')
val_mat_ = generate_matrix(df_val, 'val')
Error:
IndexError: column index (12) out of range
Dataframe:
#df
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
15 6 0
16 6 5
17 6 12
18 7 9
19 7 9
# df_tr
purchaseid itemid
0 0 3
1 0 8
2 0 2
3 1 10
4 2 3
5 2 10
6 3 4
7 3 12
8 3 3
9 4 12
10 4 3
11 4 4
12 5 8
13 5 6
14 5 3
18 7 9
19 7 9
20 8 13
# df_val
purchaseid itemid
15 6 0
16 6 5
17 6 12
21 9 1
22 9 7
23 9 11
24 9 11
Try this instead: sp.dok_matrix needs the dimensions of the target matrix up front. Looking at your data, I have assumed purchaseid ranges over [0, max(purchaseid)] and itemid over [0, max(itemid)].
def generate_matrix(dataframe, name):
    mat = sp.dok_matrix((dataframe['purchaseid'].max() + 1, dataframe['itemid'].max() + 1),
                        dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0  # now always within bounds
    return mat
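One caveat with sizing each matrix from its own split: an itemid that appears in only one split gives the train and validation matrices different widths. A minimal sketch (my own dense-NumPy variant, not the answer's scipy code) that sizes the matrix from the full frame once, so every split shares one shape:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'purchaseid': [0, 0, 1, 1, 2],
                   'itemid':     [3, 8, 2, 12, 5]})

# Size from the FULL frame, so train/val splits get identical shapes.
n_rows = df['purchaseid'].max() + 1
n_cols = df['itemid'].max() + 1

def generate_dense_matrix(frame, n_rows, n_cols):
    mat = np.zeros((n_rows, n_cols), dtype=np.float32)
    # Vectorized equivalent of the mat[purchaseid, itemid] = 1.0 loop.
    mat[frame['purchaseid'].to_numpy(), frame['itemid'].to_numpy()] = 1.0
    return mat

mat = generate_dense_matrix(df, n_rows, n_cols)
print(mat.shape)  # (3, 13)
```

For real interaction data a sparse format is usually preferable; the point here is only the shared sizing.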

Split matrix in python into square matrices?

Is there a quick and easy way in Python to split an MxN matrix (a 2D NumPy array) into AxA square matrices, starting greedily from the top left?
For example
1 2 3 4
6 7 8 9
1 2 3 4
6 7 8 9
0 0 0 0
If I want to split it into 2x2 matrices, the outcome should be a list like:
1 2
6 7
3 4
8 9
1 2
6 7
3 4
8 9
(Notice the 0 0 0 0 at the bottom gets left out)
Is there a "clean" way to write this? I can write it by brute force, but it is not at all pretty.
You can do this in one line using NumPy:
test = np.arange(35).reshape(5,7)
M, N = test.shape
A = 2
print(test)
print('\n')
split_test = test[0:M-M%A, 0:N-N%A].reshape(M//A, A, -1, A).swapaxes(1, 2).reshape(-1, A, A)
print(split_test)
Output of above code is:
[[ 0 1 2 3 4 5 6]
[ 7 8 9 10 11 12 13]
[14 15 16 17 18 19 20]
[21 22 23 24 25 26 27]
[28 29 30 31 32 33 34]]
[[[ 0 1]
[ 7 8]]
[[ 2 3]
[ 9 10]]
[[ 4 5]
[11 12]]
[[14 15]
[21 22]]
[[16 17]
[23 24]]
[[18 19]
[25 26]]]
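As a sanity check, the same one-liner applied to the 5x4 matrix from the question reproduces the expected 2x2 blocks; the trailing row of zeros is cropped because 5 % 2 == 1:

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [6, 7, 8, 9],
              [1, 2, 3, 4],
              [6, 7, 8, 9],
              [0, 0, 0, 0]])
M, N = a.shape
A = 2
# Crop to multiples of A, carve into A-sized strips, then regroup into blocks.
blocks = a[:M - M % A, :N - N % A].reshape(M // A, A, -1, A).swapaxes(1, 2).reshape(-1, A, A)
print(blocks)
```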
If you are ok with using skimage:
a = np.r_[np.add.outer((1,6,1,6),range(4)),[[0,0,0,0]]]
from skimage.util import view_as_windows
sz = 2,2
view_as_windows(a,sz,sz)
# array([[[[1, 2],
# [6, 7]],
#
# [[3, 4],
# [8, 9]]],
#
#
# [[[1, 2],
# [6, 7]],
#
# [[3, 4],
# [8, 9]]]])

Is there an opposite function of pandas.DataFrame.droplevel (like keeplevel)?

Is there an opposite of pandas.DataFrame.droplevel, i.e. a way to keep some levels of a multi-level index/columns using either the level name or index?
Example:
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], columns=['a', 'b', 'c', 'd']).set_index(['a', 'b', 'c']).T
a 1 5 9 13
b 2 6 10 14
c 3 7 11 15
d 4 8 12 16
Both of the following commands return this dataframe:
df.droplevel(['a','b'], axis=1)
df.droplevel([0, 1], axis=1)
c 3 7 11 15
d 4 8 12 16
I am looking for a "keeplevel" command such that both of the following would return this dataframe:
df.keeplevel(['a','b'], axis=1)
df.keeplevel([0, 1], axis=1)
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
There is no keeplevel because it would be redundant: in a closed, well-defined set, defining what you want to drop automatically defines what you want to keep.
You can therefore take the difference between the levels you have and the ones you want to keep, and hand that to droplevel:
def keeplevel(df, levels, axis=1):
    return df.droplevel(df.axes[axis].droplevel(levels).names, axis=axis)
>>> keeplevel(df, [0, 1])
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
Using set to find the difference:
df.droplevel(list(set(df.columns.names)-set(['a','b'])),axis=1)
Out[134]:
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
You can rebuild the axis Index directly, which should be fast. (The inplace argument of set_axis was removed in pandas 2.0, so this version returns a new frame instead of modifying in place.)
def keep_level(df, keep, axis):
    idx = pd.MultiIndex.from_arrays([df.axes[axis].get_level_values(x) for x in keep])
    return df.set_axis(idx, axis=axis)
keep_level(df, ['a', 'b'], 1)
#a   1   5   9  13
#b   2   6  10  14
#d   4   8  12  16
keep_level(df, [0, 1], 1)
#a   1   5   9  13
#b   2   6  10  14
#d   4   8  12  16
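For a quick end-to-end check, the droplevel-of-the-complement helper can be exercised on the question's frame (reproduced here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], columns=['a', 'b', 'c', 'd']).set_index(['a', 'b', 'c']).T

def keeplevel(df, levels, axis=1):
    # Drop every level that is NOT in `levels`.
    return df.droplevel(df.axes[axis].droplevel(levels).names, axis=axis)

kept = keeplevel(df, ['a', 'b'])
print(kept)
```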

Lookup values of one Pandas dataframe in another

I have two dataframes, and I want to do a lookup much like a VLOOKUP in Excel.
df_orig.head()
A
0 3
1 4
2 6
3 7
4 8
df_new
Combined Length Group_name
0 [8, 9, 112, 114, 134, 135] 6 Group 1
1 [15, 16, 17, 18, 19, 20] 6 Group 2
2 [15, 16, 17, 18, 19] 5 Group 3
3 [16, 17, 18, 19, 20] 5 Group 4
4 [15, 16, 17, 18] 4 Group 5
5 [8, 9, 112, 114] 4 Group 6
6 [18, 19, 20] 3 Group 7
7 [28, 29, 30] 3 Group 8
8 [21, 22] 2 Group 9
9 [28, 29] 2 Group 10
10 [26, 27] 2 Group 11
11 [24, 25] 2 Group 12
12 [3, 4] 2 Group 13
13 [6, 7] 2 Group 14
14 [11, 14] 2 Group 15
15 [12, 13] 2 Group 16
16 [0, 1] 2 Group 17
How can I add the values in df_new["Group_name"] to df_orig["A"]?
The "Group_name" must be based on the lookup of the values from df_orig["A"] in df_new["Combined"].
So it would look like:
df_orig.head()
A Looked_up
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
Thank you!
Two steps: unnest + merge.
df = pd.DataFrame({'Combined': df_new.Combined.sum(),
                   'Group_name': df_new['Group_name'].repeat(df_new.Length)})
df_orig.merge(df.groupby('Combined').head(1).rename(columns={'Combined': 'A'}))
Out[77]:
A Group_name
0 3 Group 13
1 4 Group 13
2 6 Group 14
3 7 Group 14
4 8 Group 1
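On pandas 0.25+ the unnest step is built in as DataFrame.explode, so the same idea becomes a short merge. A sketch on a reduced frame (the astype call is needed because explode leaves the exploded column with object dtype; drop_duplicates keeps the first group when a value occurs in several Combined lists, matching groupby(...).head(1) above):

```python
import pandas as pd

df_orig = pd.DataFrame({'A': [3, 4, 6, 7, 8]})
df_new = pd.DataFrame({
    'Combined': [[8, 9, 112], [3, 4], [6, 7]],
    'Group_name': ['Group 1', 'Group 13', 'Group 14'],
})

lookup = (df_new.explode('Combined')          # one row per list element
                .rename(columns={'Combined': 'A'})
                .astype({'A': 'int64'})       # explode leaves object dtype
                .drop_duplicates('A'))        # keep first group per value
result = df_orig.merge(lookup[['A', 'Group_name']], on='A', how='left')
print(result)
```

how='left' keeps rows of df_orig whose value appears in no list, with NaN in Group_name.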
Here is one way that mimics a VLOOKUP; a minimal example is below.
import pandas as pd
df_origin = pd.DataFrame({'A': [3, 11, 0, 12, 6]})
df_new = pd.DataFrame({'Combined': [[3, 4, 5], [6, 7], [11, 14, 20],
[12, 13], [3, 1], [0, 4]],
'Group_name': ['Group 13', 'Group 14', 'Group 15',
'Group 16', 'Group 17', 'Group 18']})
df_new['ID'] = list(zip(*df_new['Combined'].tolist()))[0]
df_origin['Group_name'] = df_origin['A'].map(df_new.drop_duplicates('ID')\
.set_index('ID')['Group_name'])
Result
A Group_name
0 3 Group 13
1 11 Group 15
2 0 Group 18
3 12 Group 16
4 6 Group 14
Explanation
Extract the first element of each list in df_new['Combined'] via zip.
Use drop_duplicates and then create a series mapping ID to Group_name.
Finally, use pd.Series.map to map df_origin['A'] to Group_name via this series.
