From the data frame below, I am trying to calculate an Excel-style SUMPRODUCT of columns V1, V2 and V3 against columns S1, S2 and S3.
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
'Qty': [100, 150, 200],
'Remarks': ['Bad', 'Avg', 'Good'],
'V1': [0,1,1],
'V2': [1,1,0],
'V3': [0,0,1],
'S1': [1,0,1],
'S2': [0,1,0],
'S3': [1,0,1]
})
I am looking for a way to do this without having to use each column's name, like:
df['SP'] = df[['V1', 'S1']].prod(axis=1) + df[['V2', 'S2']].prod(axis=1) + df[['V3', 'S3']].prod(axis=1)
In my real data frame, I have more than 50 columns in each of the 'V' and 'S' categories, so the above approach is not practical.
Any suggestions?
Thanks!
Filter the S-like and V-like columns, then multiply the S columns by the corresponding V columns and sum the result along the columns axis:
s = df.filter(regex=r'S\d+')
p = df.filter(regex=r'V\d+')
df['SP'] = s.mul(p.values).sum(axis=1)
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP
0 A 100 Bad 0 1 0 1 0 1 0
1 B 150 Avg 1 1 0 0 1 0 1
2 C 200 Good 1 0 1 1 0 1 2
PS: This solution assumes that the S and V columns appear in the same order in the original dataframe.
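If the column order cannot be guaranteed, one option is to align the two groups by their numeric suffix rather than by position. The following is only a minimal sketch, assuming every Vn has a matching Sn and that each name is a single letter followed by digits. The S columns are deliberately shuffled here to show that their order does not matter.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'V1': [0, 1, 1], 'V2': [1, 1, 0], 'V3': [0, 0, 1],
                   # S columns deliberately out of order
                   'S3': [1, 0, 1], 'S1': [1, 0, 1], 'S2': [0, 1, 0]})

# Strip the leading letter so both frames share the labels '1', '2', '3'
v = df.filter(regex=r'^V\d+$').rename(columns=lambda c: c[1:])
s = df.filter(regex=r'^S\d+$').rename(columns=lambda c: c[1:])

# mul() aligns on column labels, so column positions no longer matter
df['SP'] = v.mul(s).sum(axis=1)
print(df['SP'].tolist())
```

Anchoring the regex with ^ and $ also keeps a previously added column such as SP from being picked up by the filter.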
You could try something like this:
# need to edit these two lines to work with your larger DataFrame
v_cols = df.columns[3:6] # ['V1', 'V2', 'V3']
s_cols = df.columns[6:] # ['S1', 'S2', 'S3']
df['SP'] = (df[v_cols].to_numpy() * df[s_cols].to_numpy()).sum(axis=1)
Edited with an alternative after seeing a comment from @ALollz about MultiIndex making alignment simpler:
df.set_index(['Name', 'Qty', 'Remarks'], inplace=True)
n_cols = df.shape[1] // 2
v_cols = df.columns[:n_cols]
s_cols = df.columns[n_cols:]
df['SP'] = (df[v_cols].to_numpy() * df[s_cols].to_numpy()).sum(axis=1)
You can then reset index if you prefer:
df.reset_index(inplace=True)
Results:
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP
0 A 100 Bad 0 1 0 1 0 1 0
1 B 150 Avg 1 1 0 0 1 0 1
2 C 200 Good 1 0 1 1 0 1 2
If your Vn and Sn columns are in the same order:
v_cols = df.filter(like='V').columns
s_cols = df.filter(like='S').columns
df['SP2'] = sum([df[[v, s]].prod(axis=1) for v, s in zip(v_cols, s_cols)])
print(df)
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP SP2
0 A 100 Bad 0 1 0 1 0 1 0 0
1 B 150 Avg 1 1 0 0 1 0 1 1
2 C 200 Good 1 0 1 1 0 1 2 2
I have the following dataframe:
>>> df
n1 n2 dense c1 c2 c3
0 1 4 [1, 4] a h1 tt
1 2 5 [2, 5] b bbw ebay
2 3 6 [3, 6] c we yahoo
If I want to create one-hot encoded columns for the c1, c2, c3 columns:
>>> df_updated = pd.get_dummies(df, prefix_sep='_', dummy_na=True, columns=['c1', 'c2', 'c3'])
>>> df_updated
n1 n2 dense c1_a c1_b c1_c c1_nan c2_bbw c2_h1 c2_we c2_nan c3_ebay c3_tt c3_yahoo c3_nan
0 1 4 [1, 4] 1 0 0 0 0 1 0 0 0 1 0 0
1 2 5 [2, 5] 0 1 0 0 1 0 0 0 1 0 0 0
2 3 6 [3, 6] 0 0 1 0 0 0 1 0 0 0 1 0
But how can I get a list of the columns generated by get_dummies()?
Ex. ['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1', 'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
I know one way of doing that is list(set(df_updated.columns) - set(df.columns)) but is there a better way?
One way is to store the columns to be encoded in a variable and then use filter:
cols, sep = ['c1', 'c2', 'c3'], '_'
df_updated = pd.get_dummies(df, prefix_sep=sep,
                            dummy_na=True, columns=cols)
# group the alternation so that ^ and the separator apply to every prefix
df_dum = df_updated.filter(regex=rf'^(?:{"|".join(cols)}){sep}\w+', axis=1)
Or, more simply, use difference (sort=False keeps the original column order):
cols_dum = list(df_updated.columns.difference(df.columns, sort=False))
Output:
print(list(df_dum.columns))  # or print(cols_dum)
['c1_a', 'c1_b', 'c1_c', 'c1_nan', 'c2_bbw', 'c2_h1',
'c2_we', 'c2_nan', 'c3_ebay', 'c3_tt', 'c3_yahoo', 'c3_nan']
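Both ideas can be wrapped in a small function that returns the encoded frame together with the names of the new columns. This is only a sketch: dummies_with_names is a hypothetical helper, not part of pandas, and it assumes none of the generated names collides with an existing column.

```python
import pandas as pd

def dummies_with_names(frame, columns, sep='_'):
    """Hypothetical helper: return the encoded frame and the list of new columns."""
    encoded = pd.get_dummies(frame, prefix_sep=sep, dummy_na=True, columns=columns)
    # anything that was not in the input frame was created by get_dummies
    new_cols = [c for c in encoded.columns if c not in frame.columns]
    return encoded, new_cols

df = pd.DataFrame({'n1': [1, 2, 3], 'n2': [4, 5, 6],
                   'c1': ['a', 'b', 'c'],
                   'c2': ['h1', 'bbw', 'we'],
                   'c3': ['tt', 'ebay', 'yahoo']})

df_updated, new_cols = dummies_with_names(df, ['c1', 'c2', 'c3'])
print(new_cols)
```

The list comprehension keeps the columns in the order in which get_dummies produced them.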
I am trying to groupby-aggregate a dataframe using lambda functions that are created programmatically. This is so I can simulate a one-hot encoder of the categories present in a column.
Dataframe:
df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'],[30,'B']]),
columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result. Not sure what I am doing wrong.
I know I could do this with get_dummies, but using lambdas is more convenient for automation. Also, I can ensure the order of the output columns.
Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
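Note that crosstab counts occurrences, so an ID/category pair that appears more than once would give a value above 1. A minimal sketch of capping the counts at 1, using a hypothetical duplicate row (10, 'A') that is not in the original data:

```python
import pandas as pd

# The question's data plus one hypothetical duplicate row (10, 'A')
df = pd.DataFrame({'ID': [10, 10, 10, 20, 30],
                   'category': ['A', 'A', 'B', 'A', 'B']})

counts = pd.crosstab(df['ID'], df['category'])   # the pair 10/A is counted twice
one_hot = counts.clip(upper=1).reset_index()     # cap every count at 1
print(one_hot)
```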
You can use pd.get_dummies with Groupby.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
Or, use pd.concat with pd.get_dummies (only the ID column is kept, so the string column is not summed):
In [4329]: res = pd.concat([df['ID'], pd.get_dummies(df['category'])], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
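As for why the lambdas in the question give the same values in every column: a likely cause is Python's late binding of closures. Each lambda looks up column only when it is called, and by then the comprehension has finished and column is 'B', which matches the result shown in the question. A minimal sketch of one common workaround, binding the current value through a default argument (the names parts and df_g are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 10, 20, 30],
                   'category': ['A', 'B', 'A', 'B']})
one_hot_columns = ['A', 'B']

grouped = df.groupby('ID')['category']
parts = {}
for col in one_hot_columns:
    # col=col freezes the current value when the lambda is defined
    parts[col] = grouped.agg(lambda x, col=col: int(x.eq(col).any()))

# One column per category, in the order given by one_hot_columns
df_g = pd.DataFrame(parts)
print(df_g)
```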
I defined a MultiIndex Dataframe as follows:
columns = pd.MultiIndex.from_product(
[assets, ['A', 'B', 'C']],
names=['asset', 'var']
)
res = pd.DataFrame(0, index=data.index, columns=columns)
However, I have not succeeded in setting or updating single values of this DataFrame. Any suggestions? Or should I switch to NumPy arrays for efficiency?
Use DataFrame.loc with a tuple to select a MultiIndex column and set a new value, like:
assets = ['X','Y']
columns = pd.MultiIndex.from_product(
[assets, ['A', 'B', 'C']],
names=['asset', 'var']
)
res = pd.DataFrame(0, index=range(3), columns=columns)
print (res)
asset X Y
var A B C A B C
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
res.loc[0, ('X','B')] = 100
print (res)
asset X Y
var A B C A B C
0 0 100 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
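The same .loc pattern extends to setting several cells at once with pd.IndexSlice. A short sketch, assuming the column MultiIndex is sorted (which is the case for from_product with sorted inputs):

```python
import pandas as pd

assets = ['X', 'Y']
columns = pd.MultiIndex.from_product([assets, ['A', 'B', 'C']],
                                     names=['asset', 'var'])
res = pd.DataFrame(0, index=range(3), columns=columns)

idx = pd.IndexSlice
res.loc[:, idx[:, 'A']] = 1   # variable 'A' of every asset, all rows
res.loc[1, idx['Y', :]] = 5   # every variable of asset 'Y', row 1 only
print(res)
```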
Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: The multi-index tuple is converted into a string and the same suffix mechanism applies as with the simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# <class 'str'>
What I would like to get, however, is that the dummies create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
# i ii
# A B
# a b a b c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# <class 'tuple'>
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast
def parse_dummy(x):
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)
ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
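If a real three-level MultiIndex is wanted, the parsed tuples can be passed to pd.MultiIndex.from_tuples. The following is a sketch under assumptions: the flat string labels are typed by hand in the format shown above rather than produced by get_dummies, and the label is split on its last underscore, which assumes the dummy value itself contains no underscore.

```python
import ast
import pandas as pd

# Flat string labels in the format shown above (typed by hand for this sketch)
ret = pd.DataFrame([[1, 0, 0, 1, 0],
                    [0, 1, 1, 0, 0],
                    [1, 0, 0, 0, 1]],
                   columns=["('i', 'A')_a", "('i', 'A')_b",
                            "('ii', 'B')_a", "('ii', 'B')_b", "('ii', 'B')_c"])

def parse_dummy(label):
    # tuple text is left of the last underscore, the dummy value is right of it
    prefix, value = label.rsplit('_', 1)
    return ast.literal_eval(prefix) + (value,)

tuples = []
for label in ret.columns:
    tuples.append(parse_dummy(label))

# from_tuples builds a proper MultiIndex with three levels
ret.columns = pd.MultiIndex.from_tuples(tuples)
print(ret)
```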
I had a similar need, but with a more complex DataFrame: a MultiIndex as the row index, and numerical columns that should not be converted to dummies. So my case required scanning through the columns, expanding only the columns of dtype='object' into dummies, and building each new column label by concatenating the name of the original column with the dummy value. I did this because I didn't want to add a new column index level.
Here is the code.
First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects obj1 obj2
attributes a1 a2 a3 a1 a2 a3
time node
0 n1 6 5 truck 3 ros 3
n2 5 6 car 9 lol 7
1 n1 0 8 truck 7 zoe 8
n2 4 3 truck 8 bay 3
2 n1 5 8 van 0 bay 0
n2 4 8 car 5 lol 4
And here is the code to dummify only the object columns
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects obj1 obj2
attributes a1 a2 a3_car a3_truck a3_van a1 a2_bay a2_lol a2_ros a2_zoe a3
time node
0 n1 6 5 0 1 0 3 0 0 1 0 3
n2 5 6 1 0 0 9 0 1 0 0 7
1 n1 0 8 0 1 0 7 0 0 0 1 8
n2 4 3 0 1 0 8 1 0 0 0 3
2 n1 5 8 0 0 1 0 1 0 0 0 0
n2 4 8 1 0 0 5 0 1 0 0 4
It can easily be adapted to the original use case by adding one more level to the columns MultiIndex.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
print(df)
i ii
A B
0 a b
1 b a
2 a c
df.columns = pd.MultiIndex.from_tuples([t+('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product([[t[0]], [t[1]], [c for c in dummy_block.columns]],
                                                     names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
i ii
A B
a b a b c
0 1 0 0 1 0
1 0 1 1 0 0
2 1 0 0 0 1
Note that it still works if there are numerical columns: it just adds an empty additional level to them in the columns index as well.
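When every column is categorical, a shorter route may be to build one dummy block per column and let pd.concat use the original column tuples as keys. This is a sketch rather than a drop-in replacement for the loop above: it assumes all columns should be encoded, and the name blocks is illustrative.

```python
import pandas as pd

df = pd.DataFrame({('i', 'A'): ['a', 'b', 'a'],
                   ('ii', 'B'): ['b', 'a', 'c']})

blocks = {}
for col in df.columns:   # col is a tuple such as ('i', 'A')
    blocks[col] = pd.get_dummies(df[col], dtype=int)

# Tuple keys become the outer levels; the dummy values form the innermost level
ret = pd.concat(blocks, axis=1)
print(ret)
```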
I would like to transpose a list of items into a square matrix format using Python.
I tried pivot_table in pandas but it didn't work.
Here is my code; the input is a two-column CSV file:
import csv
from collections import defaultdict
import pandas as pd
with open(path_to_file, "r") as f:
    reader = csv.reader(f, delimiter=',')
    data = list(reader)
    row_count = len(data)
print(row_count - 1)
df = pd.read_csv(path_to_file)
groups = df.groupby(['transmitter chan', 'receiver chan'])
max_for_AS = defaultdict(int)
df = df.assign(ID=range(len(df)))  # range replaces Python 2's xrange
print(df)
for g in groups:
    transmitter, count = g[0][0], len(g[1])
    max_for_AS[transmitter] = max(max_for_AS[transmitter], count)
for g in groups:
    transmitter, receiver, count = g[0][0], g[0][1], len(g[1])
    if count == max_for_AS[transmitter]:
        dataFinal = "{} , {} , {}".format(transmitter, receiver, count)
        print(dataFinal)
Data:
V1 V2 count
0 A R 1
1 Z T 4
2 E B 9
3 R O 8
4 T M 7
5 Y K 5
6 B I 6
7 T Z 2
8 A O 7
9 Y B 8
I think you need:
df = pd.read_csv(path_to_file)
df1 = df.pivot(index='V1', columns='V2', values='count').fillna(0).astype(int)
# alternative
df1 = df.set_index(['V1', 'V2'])['count'].unstack(fill_value=0)
But if there are duplicate V1/V2 pairs, they need to be aggregated:
df1 = df.pivot_table(index='V1', columns='V2', values='count', fill_value=0)
# alternative
df1 = df.groupby(['V1', 'V2'])['count'].mean().unstack(fill_value=0)
# to change the ordering, add reindex
df1 = df1.reindex(index=df.V1.unique(), columns=df.V2.unique())
print (df1)
V2 R T B O M K I Z
V1
A 1 0 0 7 0 0 0 0
Z 0 4 0 0 0 0 0 0
E 0 0 9 0 0 0 0 0
R 0 0 0 8 0 0 0 0
T 0 0 0 0 7 0 0 2
Y 0 0 8 0 0 5 0 0
B 0 0 0 0 0 0 6 0
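The result above has 7 rows and 8 columns. If a strictly square matrix is wanted, one option is to reindex both axes with the union of the labels from V1 and V2. A minimal sketch using the sample data from the question; summing duplicates with aggfunc='sum' is an assumption here, as the answer above uses the mean:

```python
import pandas as pd

df = pd.DataFrame({'V1': list('AZERTYBTAY'),
                   'V2': list('RTBOMKIZOB'),
                   'count': [1, 4, 9, 8, 7, 5, 6, 2, 7, 8]})

# Every label that occurs in either column, in sorted order
labels = sorted(set(df['V1']) | set(df['V2']))

square = (df.pivot_table(index='V1', columns='V2', values='count',
                         aggfunc='sum', fill_value=0)
            .reindex(index=labels, columns=labels, fill_value=0))
print(square)
```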
Since it's not clear what you are trying to achieve, I'll base this answer on an assumption.
I assume that you have a pandas DataFrame. If so, to transpose it using NumPy, you would:
1. Convert the DataFrame (df) to a NumPy ndarray like this: df = df.values
2. Find the transpose using numpy.transpose on the result of step 1
Edit:
A better way: you can also do df.transpose()
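A small sketch of the difference between the two routes, using a made-up three-row frame: the NumPy route returns a bare array, while df.transpose() keeps the row and column labels.

```python
import numpy as np
import pandas as pd

# Made-up example data
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

arr_t = np.transpose(df.to_numpy())   # plain 2x3 array, labels are lost
df_t = df.transpose()                 # 2x3 DataFrame: rows x, y and columns a, b, c
print(df_t)
```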