I would like to transpose a list of items into a square matrix format using Python.
I tried pivot_table in pandas, but it didn't work.
Here is my code; the input is a two-column CSV file:
import csv
from collections import defaultdict
import pandas as pd

with open(path_to_file, "r") as f:
    reader = csv.reader(f, delimiter=',')
    data = list(reader)
row_count = len(data)
print(row_count - 1)

df = pd.read_csv(path_to_file)
groups = df.groupby(['transmitter chan', 'receiver chan'])
max_for_AS = defaultdict(int)
df = df.assign(ID=[i for i in range(len(df))])
print(df)

for g in groups:
    transmitter, count = g[0][0], len(g[1])
    max_for_AS[transmitter] = max(max_for_AS[transmitter], count)
for g in groups:
    transmitter, receiver, count = g[0][0], g[0][1], len(g[1])
    if count == max_for_AS[transmitter]:
        dataFinal = "{} , {} , {}".format(transmitter, receiver, count)
        print(dataFinal)
Data:
V1 V2 count
0 A R 1
1 Z T 4
2 E B 9
3 R O 8
4 T M 7
5 Y K 5
6 B I 6
7 T Z 2
8 A O 7
9 Y B 8
I think you need:
df = pd.read_csv(path_to_file)
df1 = df.pivot(index='V1',columns='V2',values='count').fillna(0).astype(int)
df1 = df.set_index(['V1','V2'])['count'].unstack(fill_value=0)
But if there are duplicate (V1, V2) pairs, you need to aggregate them:
df1 = df.pivot_table(index='V1',columns='V2',values='count', fill_value=0)
df1 = df.groupby(['V1','V2'])['count'].mean().unstack(fill_value=0)
# to change the ordering, add reindex
df1 = df1.reindex(index=df.V1.unique(), columns=df.V2.unique())
print (df1)
V2 R T B O M K I Z
V1
A 1 0 0 7 0 0 0 0
Z 0 4 0 0 0 0 0 0
E 0 0 9 0 0 0 0 0
R 0 0 0 8 0 0 0 0
T 0 0 0 0 7 0 0 2
Y 0 0 8 0 0 5 0 0
B 0 0 0 0 0 0 6 0
Since it's not clear what you are trying to achieve, I'll approach this answer with an assumption.
I assume that you have a pandas DataFrame. If that's true, to get its transpose using numpy, you might have to:
Convert the DataFrame (df) to a numpy ndarray like this: df = df.values
Find the transpose using numpy.transpose on the result of step 1
Edit:
A better way: you can also do df.transpose() (equivalently, df.T).
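As a minimal sketch of both routes (using a small made-up frame), the values come out the same either way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Route 1: drop to numpy and transpose the ndarray (labels are lost)
arr_t = np.transpose(df.values)

# Route 2: transpose the DataFrame directly (labels are kept)
df_t = df.transpose()  # same as df.T

print(arr_t)
print(df_t)
```

The DataFrame route is usually preferable when you still need the row/column labels afterwards.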
I have a dataframe in one-hot format:
import pandas as pd

dummy_data = {'a': [0,0,1,0], 'b': [1,1,1,0], 'c': [0,1,0,1], 'd': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the occurrence matrix from this dataframe. If instead of one-hot format I have the column names in lists, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (pd.crosstab(df.val_x, df.val_y)
.reindex(unique_categories, axis=0).reindex(unique_categories, axis=1)).fillna(0)
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How to get the occurrence matrix directly from one hot dataframe?
You can have some fun with matrix math!
import numpy as np

u = np.diag(np.ones(df.shape[1], dtype=bool))  # boolean identity matrix
df.T.dot(df) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
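Put together with the question's dummy_data, a self-contained version of this idea might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 0], 'b': [1, 1, 1, 0],
                   'c': [0, 1, 0, 1], 'd': [1, 1, 1, 0]})

# Boolean identity matrix: True on the diagonal, False elsewhere
u = np.diag(np.ones(df.shape[1], dtype=bool))

# df.T.dot(df) is the Gram matrix, which counts pairwise co-occurrences;
# multiplying by ~u zeroes out the diagonal (self-occurrences)
cooc = df.T.dot(df) * (~u)
print(cooc)
```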
I have the following DataFrame as a toy example:
import pandas as pd

a = [5,2,6,8]
b = [2,10,19,16]
c = [3,8,15,17]
d = [3,8,12,20]
df = pd.DataFrame([a,b,c,d], columns = ['a','b','c','d'])
df
I want to create a new DataFrame df1 that keeps only the diagonal elements and converts upper and lower triangular values to zero.
My final dataset should look like:
a b c d
0 5 0 0 0
1 0 10 0 0
2 0 0 15 0
3 0 0 0 20
You could use numpy.diag:
import numpy as np

df = pd.DataFrame(data=np.diag(np.diag(df)), columns=df.columns)
print(df)
Output
a b c d
0 5 0 0 0
1 0 10 0 0
2 0 0 15 0
3 0 0 0 20
import pandas as pd

def diag(df):
    res_df = pd.DataFrame(0, index=df.index, columns=df.columns)
    for i in range(min(df.shape)):
        res_df.iloc[i, i] = df.iloc[i, i]
    return res_df
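A quick self-contained check of this helper against the question's toy data (repeating the function so the snippet runs on its own):

```python
import pandas as pd

def diag(df):
    # Start from an all-zero frame with the same labels
    res_df = pd.DataFrame(0, index=df.index, columns=df.columns)
    # Copy only the main-diagonal cells across
    for i in range(min(df.shape)):
        res_df.iloc[i, i] = df.iloc[i, i]
    return res_df

df = pd.DataFrame([[5, 2, 3, 3], [2, 10, 8, 8],
                   [6, 19, 15, 12], [8, 16, 17, 20]],
                  columns=['a', 'b', 'c', 'd'])
print(diag(df))
```

Unlike the np.diag approach, this loop also works when the frame is not square (it stops at the shorter dimension).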
Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# str
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
# i ii
# A B
# a b a b c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# tuple
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast

def parse_dummy(x):
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)
ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
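If an actual three-level MultiIndex is wanted instead of plain tuple labels, the parsed tuples can be passed to pd.MultiIndex.from_tuples. A minimal sketch, assuming column labels in the stringified form shown above (and splitting on the last underscore, to be safer in case the dummy values themselves contain underscores):

```python
import ast
import pandas as pd

# Column labels in the stringified form produced above
cols = ["('i', 'A')_a", "('i', 'A')_b",
        "('ii', 'B')_a", "('ii', 'B')_b", "('ii', 'B')_c"]

def parse_dummy(x):
    # "('i', 'A')_a" -> ('i', 'A', 'a'); split on the LAST underscore
    prefix, _, value = x.rpartition('_')
    return ast.literal_eval(prefix) + (value,)

mi = pd.MultiIndex.from_tuples([parse_dummy(c) for c in cols])
print(mi)
```

Assigning this back via ret.columns = mi would give a proper three-level column index rather than flat tuple names.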
I had a similar need, but with a more complex DataFrame: a MultiIndex as row index, plus numerical columns that should not be converted to dummies. So my case required scanning the columns, expanding to dummies only the columns of dtype='object', and building the new column labels as a concatenation of the original column name and the value of the dummy variable, because I didn't want to add a new column index level.
Here is the code. First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects obj1 obj2
attributes a1 a2 a3 a1 a2 a3
time node
0 n1 6 5 truck 3 ros 3
n2 5 6 car 9 lol 7
1 n1 0 8 truck 7 zoe 8
n2 4 3 truck 8 bay 3
2 n1 5 8 van 0 bay 0
n2 4 8 car 5 lol 4
And here is the code to dummify only the object columns
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects obj1 obj2
attributes a1 a2 a3_car a3_truck a3_van a1 a2_bay a2_lol a2_ros a2_zoe a3
time node
0 n1 6 5 0 1 0 3 0 0 1 0 3
n2 5 6 1 0 0 9 0 1 0 0 7
1 n1 0 8 0 1 0 7 0 0 0 1 8
n2 4 3 0 1 0 8 1 0 0 0 3
2 n1 5 8 0 0 1 0 1 0 0 0 0
n2 4 8 1 0 0 5 0 1 0 0 4
It can be easily changed to answer to original use case, adding one more row to the columns multi index.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
print(df)
i ii
A B
0 a b
1 b a
2 a c
df.columns = pd.MultiIndex.from_tuples([t+('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [t[1]], [c for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
i ii
A B
a b a b c
0 1 0 0 1 0
1 0 1 1 0 0
2 1 0 0 0 1
Note that it still works if there are numerical columns: it just adds an empty additional level to them in the column index as well.
I have the following DataFrame (named df2 later):
recipe_id ingredients
0 3332 [11307, 11322, 11632, 11338, 11478, 11438]
1 3333 [11322, 11338, 11632, 11314, 11682, 11478, 108...
2 3334 [11632, 11682, 11338, 11337, 10837, 11435, 113...
3 3335 [11149, 11322, 11532, 11996, 10616, 10837, 113...
4 3336 [11330, 11632, 11422, 11256, 11338, 11314, 114...
5 3812 [959, 92, 3, 554, 12271, 202]
...
I want to create another DataFrame which will have the following columns: ['ingredients', "recipe_id1", "recipe_id2", ..., "recipe_idn"], where n is the total number of recipes in the database. I did that with the following snippet:
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
After I create this DataFrame (which I did already) and populate it (the problem I'm having), the output should look like this:
In [1]:
# Create and populate ingredient_df by some method
columns = ['ingredient'] + (list(df2['recipe_id'].unique()))
ingredient_df = pd.DataFrame(columns=columns)
ingredient_df = populate_df(ingredient_df, df2)
Out [1]:
In [2]:
ingredient_df
Out[2]:
ingredient ... 3332 3333 3334 3335 3336 ...
...
11322 ... 1 1 0 1 0 ...
...
In the example above, the value at (11322, 3334) is 0 because ingredient 11322 is not present in the recipe with id 3334.
In other words, I want every (ingredient, recipe_id) cell to be 1 if the ingredient is present in that recipe, and 0 otherwise.
I've managed to do this by iterating over all recipes and through all ingredients, but this is very slow. How can I do this in a more robust and elegant way using Pandas methods (if this is possible at all)?
setup
df = pd.DataFrame(
dict(
recipe_id=list('abcde'),
ingredients=[list('xyz'),
list('tuv'),
list('ytw'),
list('vy'),
list('zxs')]
)
)[['recipe_id', 'ingredients']]
df
recipe_id ingredients
0 a [x, y, z]
1 b [t, u, v]
2 c [y, t, w]
3 d [v, y]
4 e [z, x, s]
method 1
df.set_index('recipe_id').ingredients.apply(pd.value_counts) \
.fillna(0).astype(int).T.rename_axis('ingredients')
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
method 2
idx = np.repeat(df.index.values, df.ingredients.str.len())
df1 = df.drop('ingredients', axis=1).loc[idx]
df1['ingredients'] = df.ingredients.sum()
df1.groupby('ingredients').recipe_id.apply(pd.value_counts) \
    .unstack(fill_value=0).rename_axis('recipe_id', axis=1)
recipe_id a b c d e
ingredients
s 0 0 0 0 1
t 0 1 1 0 0
u 0 1 0 0 0
v 0 1 0 1 0
w 0 0 1 0 0
x 1 0 0 0 1
y 1 0 1 1 0
z 1 0 0 0 1
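On newer pandas versions (0.25+), DataFrame.explode gives a more direct route to the same table; a sketch using the setup frame above:

```python
import pandas as pd

df = pd.DataFrame({
    'recipe_id': list('abcde'),
    'ingredients': [list('xyz'), list('tuv'), list('ytw'),
                    list('vy'), list('zxs')],
})

# One row per (recipe, ingredient) pair, then count with crosstab
long = df.explode('ingredients')
table = pd.crosstab(long['ingredients'], long['recipe_id'])
print(table)
```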
The input data format is like this (data.txt):
col1 col2 weight
a b 1
a c 2
a d 0
b c 3
b d 0
c d 0
I want the output data format to be like this (result.txt):
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
I would use pandas in this way:
import pandas as pd
# Read your data from a .csv file
df = pd.read_csv('yourdata.csv')
# Pivot table
mat = pd.pivot_table(df,index='col1',columns='col2',values='weight')
# Rebuild the index
index = mat.index.union(mat.columns)
# Build the new full matrix and fill NaN values with 0
mat = mat.reindex(index=index, columns=index).fillna(0)
# Make the matrix symmetric
m = mat + mat.T
This returns:
a b c d
a 0 1 2 0
b 1 0 3 0
c 2 3 0 0
d 0 0 0 0
EDIT: instead of pivot_table() you can also use:
mat = df.pivot(index='col1',columns='col2',values='weight')
Give a, b, c, d integer index values, set col1 = i and col2 = j, and evaluate row by row. For example, in row 1, i = 0 and j = 1, so weights(i, j) = 1.
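The row-by-row idea above can be sketched with a plain numpy array, assuming the labels are first mapped to integer positions:

```python
import numpy as np
import pandas as pd

# The question's edge list: (col1, col2, weight)
rows = [('a', 'b', 1), ('a', 'c', 2), ('a', 'd', 0),
        ('b', 'c', 3), ('b', 'd', 0), ('c', 'd', 0)]

labels = ['a', 'b', 'c', 'd']
pos = {lab: k for k, lab in enumerate(labels)}  # label -> row/column index

mat = np.zeros((len(labels), len(labels)), dtype=int)
for c1, c2, w in rows:
    i, j = pos[c1], pos[c2]
    mat[i, j] = w
    mat[j, i] = w  # mirror so the matrix stays symmetric

result = pd.DataFrame(mat, index=labels, columns=labels)
print(result)
```

This produces the same symmetric matrix as the pivot-and-add-transpose approach, at the cost of an explicit loop.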