How to create dataframe based on matrix? - python

I have two dataframes, "df1" and "df2", and one matrix, "res".
df1   df2
a     a
b     c
c     e
d
There are 4 records in df1 and 3 records in df2, so res is a 4x3 matrix:
               df2 (index)
               0     1     2
df1 (index) 0  100   0     0
            1  0     0     0
            2  0     100   0
            3  0     0     0
Based on this matrix, I want the following output in the form of a dataframe:
df1 df2 score
a a 100
a c 0
a e 0
b a 0
b c 0
b e 0
c a 0
c c 100
c e 0
d a 0
d c 0
d e 0

Set the index and column labels of res from the values in df1 and df2:
res.index = df1[:len(res.index)]
res.columns = df2[:len(res.columns)]
And then reshape by DataFrame.melt:
df = res.rename_axis(index='df1', columns='df2').melt(ignore_index=False)
Or DataFrame.stack:
df = res.rename_axis(index='df1', columns='df2').stack().reset_index(name='value')
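The stack variant can be sketched end to end as follows. This is a minimal example under assumptions: the question never names the column inside df1 and df2, so the column name "name" is hypothetical, and the result column is called "score" to match the desired output rather than "value".

```python
import pandas as pd

# Assumed inputs: the column name "name" is hypothetical (not given in the question).
df1 = pd.DataFrame({"name": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"name": ["a", "c", "e"]})
res = pd.DataFrame([[100, 0, 0],
                    [0, 0, 0],
                    [0, 100, 0],
                    [0, 0, 0]])

# Label the rows with df1's values and the columns with df2's values.
labelled = res.copy()
labelled.index = pd.Index(df1["name"], name="df1")
labelled.columns = pd.Index(df2["name"], name="df2")

# stack() turns every (row, column) cell into one row of a long Series;
# reset_index() then exposes both labels as ordinary columns.
long_df = labelled.stack().reset_index(name="score")
print(long_df)
```

Naming the two axes before stacking is what gives the output columns the names df1 and df2 without a separate rename step.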

Related

Match a data frame columns to another data frame rows content

I have a pandas data frame as follows
A  B  C  D  ...  Z
and another data frame in which every row has zero or more letters, as follows:
Letters
A,C,D
A,B,F
A,H,G
A
B,F
None
I want to match the two dataframes to get something like this:
A  B  C  D  ...  Z
1  0  1  1  0    0
Please include an example and the desired output in the question so that it can be answered.
Example:
data = ['A,C,D', 'A,B,F', 'A,E,G', None]
df = pd.DataFrame(data, columns=['letter'])
df :
letter
0 A,C,D
1 A,B,F
2 A,E,G
3 None
Use get_dummies and groupby:
pd.get_dummies(df['letter'].str.split(',').explode()).groupby(level=0).sum()
output:
A B C D E F G
0 1 0 1 1 0 0 0
1 1 1 0 0 0 1 0
2 1 0 0 0 1 0 1
3 0 0 0 0 0 0 0
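The question asks for one column per letter from A to Z, while the output above only contains letters that actually occur. A possible extension, sketched here as an assumption about what the asker wants, is to reindex the columns against the full alphabet so that unseen letters appear as zero columns.

```python
import string
import pandas as pd

df = pd.DataFrame({"letter": ["A,C,D", "A,B,F", "A,E,G", None]})

# One row per (original row, letter); the None row survives as a missing value.
exploded = df["letter"].str.split(",").explode()

# One indicator column per letter, summed back to one row per original index.
counts = pd.get_dummies(exploded).groupby(level=0).sum()

# Assumption: every letter A..Z should be a column, present or not.
full = counts.reindex(columns=list(string.ascii_uppercase), fill_value=0).astype(int)
print(full.loc[:, "A":"G"])
```

The row that held None ends up all zeros because get_dummies ignores missing values by default.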

Pandas: occurrence matrix from one hot encoding from pandas dataframe

I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If I have the column names in lists instead of the one-hot format, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
              .reindex(unique_categories, axis=0)
              .reindex(unique_categories, axis=1)
              .fillna(0))
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math! With data being the one-hot dataframe from the question:
import numpy as np
u = np.diag(np.ones(data.shape[1], dtype=bool))
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
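Why this works: entry (i, j) of the product of the transposed one-hot table with itself counts the rows in which both column i and column j are 1, and the diagonal holds each column's own total, which is then masked out. A self-contained sketch of the same idea, using DataFrame.where for the masking step instead of multiplication (a swapped-in but equivalent choice):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [0, 0, 1, 0],
                     "b": [1, 1, 1, 0],
                     "c": [0, 1, 0, 1],
                     "d": [1, 1, 1, 0]})

# (columns x rows) dot (rows x columns): counts of rows where both columns are 1.
co = data.T.dot(data)

# Keep off-diagonal cells, replace the diagonal (self co-occurrence) with 0.
off_diagonal = ~np.eye(len(co), dtype=bool)
co = co.where(off_diagonal, 0)
print(co)
```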

Replace upper and lower triangle values in a dataframe with zero, or keep only diagonal values

I have the following DataFrame as a toy example:
a = [5,2,6,8]
b = [2,10,19,16]
c = [3,8,15,17]
d = [3,8,12,20]
df = pd.DataFrame([a,b,c,d], columns = ['a','b','c','d'])
df
I want to create a new DataFrame df1 that keeps only the diagonal elements and converts upper and lower triangular values to zero.
My final dataset should look like:
a b c d
0 5 0 0 0
1 0 10 0 0
2 0 0 15 0
3 0 0 0 20
You could use numpy.diag:
df = pd.DataFrame(data=np.diag(np.diag(df)), columns=df.columns)
print(df)
Output
a b c d
0 5 0 0 0
1 0 10 0 0
2 0 0 15 0
3 0 0 0 20
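The double call is the key step: numpy.diag extracts the diagonal when given a 2-D array and builds a diagonal matrix when given a 1-D array. A small sketch that spells out both steps and also carries over the original index (the variable names diagonal and df1 are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 2, 6, 8],
                   [2, 10, 19, 16],
                   [3, 8, 15, 17],
                   [3, 8, 12, 20]], columns=["a", "b", "c", "d"])

# Step 1: 2-D input, so np.diag returns the 1-D diagonal.
diagonal = np.diag(df.to_numpy())

# Step 2: 1-D input, so np.diag returns a square matrix with zeros elsewhere.
df1 = pd.DataFrame(np.diag(diagonal), index=df.index, columns=df.columns)
print(df1)
```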
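Alternatively, without numpy, copy the diagonal into a zero-filled frame element by element:
import pandas as pd
def diag(df):
    # Start from zeros with the same labels, then copy only the diagonal cells.
    res_df = pd.DataFrame(0, index=df.index, columns=df.columns)
    for i in range(min(df.shape)):
        res_df.iloc[i, i] = df.iloc[i, i]
    return res_df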

How to reshape dataframe if they have same index?

If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the dataframe based on index like below i.e
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d
Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d
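The role of cumcount here is to number the rows within each index group (0, 1, ...), which supplies the column position that the duplicated index alone cannot. A runnable sketch of the one-liner above, broken into steps (the names position and wide are illustrative):

```python
import pandas as pd

df = pd.DataFrame(["a", "b", "c", "d"], index=[0, 0, 1, 1])

# Position of each row within its index group: 0, 1, 0, 1.
position = df.groupby(level=0).cumcount()

# Append the position as a second index level, then pivot it into columns.
wide = df.set_index(position, append=True)[0].unstack()
print(wide)
```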
You can use pivot with cumcount:
a = df.groupby(level=0).cumcount()
df = df.assign(i=df.index, c=a).pivot(index='i', columns='c', values=0)
Couple of ways
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
Out[455]:
c 0 1
i
0 a b
1 c d
to remove names
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
.rename_axis(index=None, columns=None))
Out[457]:
0 1
0 a b
1 c d

Transform the relationship data with weight into a Matrix in python

The input data (data.txt) looks like this:
col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
I want the output (result.txt) to look like this:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
I would use pandas in this way
import pandas as pd
# Read your data from a .csv file
df = pd.read_csv('yourdata.csv')
# Pivot table
mat = pd.pivot_table(df,index='col1',columns='col2',values='weight')
# Rebuild the index
index = mat.index.union(mat.columns)
# Build the new full matrix and fill NaN values with 0
mat = mat.reindex(index=index, columns=index).fillna(0)
# Make the matrix symmetric
m = mat + mat.T
This returns:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
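The sample in the question is whitespace-separated rather than comma-separated, so the read step needs a separator. The sketch below runs the same pivot, reindex, and symmetrize steps on an in-memory copy of the sample; the io.StringIO wrapper stands in for the real file, and the final cast to int is an added assumption to match the integer output shown.

```python
import io
import pandas as pd

# In-memory stand-in for data.txt (whitespace-separated, as in the question).
raw = """col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Rows a..c by columns b..d; missing pairs become NaN.
mat = df.pivot(index="col1", columns="col2", values="weight")

# Square the matrix over all labels, fill the gaps, and mirror it.
labels = mat.index.union(mat.columns)
mat = mat.reindex(index=labels, columns=labels).fillna(0)
sym = (mat + mat.T).astype(int)
print(sym)
```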
EDIT: instead of pivot_table() you can also use:
mat = df.pivot(index='col1',columns='col2',values='weight')
Alternatively, map a, b, c, d to integer indices, treat col1 as the row index i and col2 as the column index j, and fill the matrix row by row. For example, for the first row of the input, i = 0, j = 1, so weights(i, j) = 1.
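The index-mapping approach can be sketched as follows. The sample rows are hard-coded for illustration, and writing each weight to both (i, j) and (j, i) is an assumption made to reproduce the symmetric output the question asks for.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, hard-coded for illustration.
edges = [("a", "b", 1), ("a", "c", 2), ("a", "d", 0),
         ("b", "c", 3), ("b", "d", 0), ("c", "d", 0)]

# Map every label to an integer position: a -> 0, b -> 1, ...
labels = sorted({name for src, dst, _ in edges for name in (src, dst)})
position = {label: i for i, label in enumerate(labels)}

# Fill the matrix row by row, mirroring each weight across the diagonal.
weights = np.zeros((len(labels), len(labels)), dtype=int)
for src, dst, weight in edges:
    i, j = position[src], position[dst]
    weights[i, j] = weight
    weights[j, i] = weight

mat = pd.DataFrame(weights, index=labels, columns=labels)
print(mat)
```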
