I have a dataframe with multiple bit (0/1) columns, and I want to combine them into several integer columns. Can someone guide me on how to do that? Here is an example:
Test A B C D E
t1 0 0 0 1 0
t2 1 0 1 0 1
t3 1 1 1 1 0
t4 0 0 0 0 1
Here, I want to combine groups of columns together, so I will be combining {A, B, C} and {D, E}. Here is the expected output:
Test X Y
t1 0 2
t2 5 1
t3 7 2
t4 0 1
Can someone please guide me on how to do this in Python?
Thanks.
First convert the values to strings, then apply a lambda function that joins each row and parses it as a binary number:
import pandas as pd

df = df.set_index('Test')
a = df[['A','B','C']].astype(str).apply(lambda x: int(''.join(x), 2), axis=1)
b = df[['D','E']].astype(str).apply(lambda x: int(''.join(x), 2), axis=1)
df = pd.DataFrame({'X': a, 'Y': b}).reset_index()
print (df)
Test X Y
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
Another, faster solution, inspired by the other answers:
import numpy as np

df = df.set_index('Test')
# define the column groups in a dictionary
cols = {'X': ['A','B','C'], 'Y': ['D','E']}
# dictionary of Series
d = {k: df[v].dot(1 << np.arange(len(v) - 1, -1, -1)) for k, v in cols.items()}
# alternative, inspired by Divakar's answer
# d = {k: pd.Series((2**np.arange(len(v)-1, -1, -1)).dot(df[v].values.T)) for k, v in cols.items()}
df = pd.concat(d, axis=1).reset_index()
print (df)
Test X Y
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
Dynamic solution: create the dictionary of column names with groupby over a helper array built by arange and floor division:
df = df.set_index('Test')
cols = pd.Series(df.columns).groupby(np.arange(len(df.columns)) // 3).apply(list).to_dict()
print (cols)
{0: ['A', 'B', 'C'], 1: ['D', 'E']}
d = {k: df[v].dot(1 << np.arange(len(v) - 1, -1, -1)) for k, v in cols.items()}
df = pd.concat(d, axis=1).reset_index()
print (df)
Test 0 1
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
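If you prefer the combined columns to be named X and Y instead of 0 and 1, a minimal follow-up sketch (the name mapping here is just an illustration) is to rename them after the concat:
df = df.rename(columns={0: 'X', 1: 'Y'})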
You can write a function that combines any list of bit columns into binary like this:
def join_columns(df, columns, name):
    series = None
    for column in columns:
        if series is not None:
            series *= 2
            series += df[column]
        else:
            series = df[column].copy()
    series.name = name
    return series
Then use it to combine columns in your dataframe:
X = join_columns(df, ['A', 'B', 'C'], 'X')
Y = join_columns(df, ['D', 'E'], 'Y')
print(pd.concat([X, Y], axis = 1))
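The running series is essentially Horner's method in base 2: each iteration doubles the accumulated value (a one-bit left shift) and adds the next bit column, so no string conversion is needed.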
From the data frame below, I am trying to calculate an Excel-style SUMPRODUCT of columns V1, V2 and V3 against columns S1, S2 and S3.
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
'Qty': [100, 150, 200],
'Remarks': ['Bad', 'Avg', 'Good'],
'V1': [0,1,1],
'V2': [1,1,0],
'V3': [0,0,1],
'S1': [1,0,1],
'S2': [0,1,0],
'S3': [1,0,1]
})
I am looking for a way to do this without having to reference each column by name, like:
df['SP'] = df[['V1', 'S1']].prod(axis=1) + df[['V2', 'S2']].prod(axis=1) + df[['V3', 'S3']].prod(axis=1)
In my real data frame, I have more than 50 columns in both the 'V' and 'S' categories, so the above approach is not feasible.
Any suggestions?
Thanks!
Filter the S-like and V-like columns, then multiply the S columns by the corresponding V columns and sum the result along the columns axis:
s = df.filter(regex=r'S\d+')
p = df.filter(regex=r'V\d+')
df['SP'] = s.mul(p.values).sum(axis=1)
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP
0 A 100 Bad 0 1 0 1 0 1 0
1 B 150 Avg 1 1 0 0 1 0 1
2 C 200 Good 1 0 1 1 0 1 2
PS: This solution assumes that the order of appearance of S and V columns in the original dataframe matches.
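If the S and V columns might not appear in matching order, a minimal sketch (assuming the names are literally V<i> and S<i>) could reorder the S columns by their numeric suffix before multiplying:
p = df.filter(regex=r'V\d+')
s = df.filter(regex=r'S\d+')
s = s[['S' + c[1:] for c in p.columns]]  # line S1 up with V1, S2 with V2, ...
df['SP'] = s.mul(p.values).sum(axis=1)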
You could try something like this:
# need to edit these two lines to work with your larger DataFrame
v_cols = df.columns[3:6] # ['V1', 'V2', 'V3']
s_cols = df.columns[6:] # ['S1', 'S2', 'S3']
df['SP'] = (df[v_cols].to_numpy() * df[s_cols].to_numpy()).sum(axis=1)
Edited with an alternative after seeing the comment from @ALollz about a MultiIndex making alignment simpler:
df.set_index(['Name', 'Qty', 'Remarks'], inplace=True)
n_cols = df.shape[1] // 2
v_cols = df.columns[:n_cols]
s_cols = df.columns[n_cols:]
df['SP'] = (df[v_cols].to_numpy() * df[s_cols].to_numpy()).sum(axis=1)
You can then reset the index if you prefer:
df.reset_index(inplace=True)
Results:
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP
0 A 100 Bad 0 1 0 1 0 1 0
1 B 150 Avg 1 1 0 0 1 0 1
2 C 200 Good 1 0 1 1 0 1 2
If your Vn and Sn columns are in matching order:
v_cols = df.filter(like='V').columns
s_cols = df.filter(like='S').columns
df['SP2'] = sum([df[[v, s]].prod(axis=1) for v, s in zip(v_cols, s_cols)])
print(df)
Name Qty Remarks V1 V2 V3 S1 S2 S3 SP SP2
0 A 100 Bad 0 1 0 1 0 1 0 0
1 B 150 Avg 1 1 0 0 1 0 1 1
2 C 200 Good 1 0 1 1 0 1 2 2
I have a number of dfs stored in a list (df_list). Some dfs share identical values in column 'b'. The dfs with the same 'b' value should be extracted from the list and stored in a new list of dataframes. Is there a way to 'groupby' dfs in a list programmatically, to handle more cases where this will happen?
Example data and the expected output are shown below. All comments welcome. Thanks so much.
Example data:
df1 = pd.DataFrame(data={'id': [1, 2, 3], 'a': [1,2,3], 'b': ['t1','t1','t1']})
df2 = pd.DataFrame(data={'id': [10, 11, 12], 'a': [2,3,4], 'b': ['t1','t1','t1']})
df3 = pd.DataFrame(data={'id': [1, 2, 3], 'a': [1,2,3], 'b': ['t2','t2','t2']})
df4 = pd.DataFrame(data={'id': [1, 2, 3], 'a': [1,2,3], 'b': ['t3','t3','t3']})
df5 = pd.DataFrame(data={'id': [10, 11, 12], 'a': [2,3,4], 'b': ['t1','t1','t1']})
df_list = (df1, df2, df3, df4, df5)
Expected output: grouped lists
df_list_t1 = (df1, df2, df5)
df_list_t2 = (df3)
df_list_t3 = (df4)
You can use itertools.groupby to group the dataframes (note that groupby requires its input to be sorted by the same key):
from itertools import groupby
out = []
for _, g in groupby(
    sorted(df_list, key=lambda k: k["b"].tolist()), lambda k: k["b"].tolist()
):
    out.append(list(g))
# pretty print the list:
for subl in out:
    print(*subl, sep="\n\n")
    print("-" * 80)
Prints:
id a b
0 1 1 t1
1 2 2 t1
2 3 3 t1
id a b
0 10 2 t1
1 11 3 t1
2 12 4 t1
id a b
0 10 2 t1
1 11 3 t1
2 12 4 t1
--------------------------------------------------------------------------------
id a b
0 1 1 t2
1 2 2 t2
2 3 3 t2
--------------------------------------------------------------------------------
id a b
0 1 1 t3
1 2 2 t3
2 3 3 t3
--------------------------------------------------------------------------------
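Since every frame in the example has a single constant value in 'b', a slightly simpler key (this is an assumption about your data) could use just the first value instead of the whole list:
from itertools import groupby

key = lambda d: d['b'].iloc[0]
grouped = [list(g) for _, g in groupby(sorted(df_list, key=key), key=key)]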
You can do this simply, as below, using a defaultdict:
from collections import defaultdict
dfs = defaultdict(list)
for df in [df1, df2, df3, df4, df5]:
    k = df['b'].unique()[0]
    dfs[k].append(df)

df_list_t1, df_list_t2, df_list_t3 = list(dfs.values())
Output:
>>> df_list_t1
[ id a b
0 1 1 t1
1 2 2 t1
2 3 3 t1,
id a b
0 10 2 t1
1 11 3 t1
2 12 4 t1,
id a b
0 10 2 t1
1 11 3 t1
2 12 4 t1]
>>>
>>> df_list_t2
[ id a b
0 1 1 t2
1 2 2 t2
2 3 3 t2]
>>>
>>> df_list_t3
[ id a b
0 1 1 t3
1 2 2 t3
2 3 3 t3]
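If you would rather not rely on the dictionary's insertion order when unpacking, you can also pick the groups by key (assuming 't1', 't2' and 't3' are the values present in column 'b'):
df_list_t1, df_list_t2, df_list_t3 = dfs['t1'], dfs['t2'], dfs['t3']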
I have a file that consists of three integer columns: A, B and C. Using Python, let's say I would like to groupby() column 'A' and get the size() of each group with values greater than 4, 6 and 8 in column 'B'. So I implemented the code below:
>>> import pandas as pd
>>>
>>> df = pd.read_csv("test.txt", sep="\t")
>>> df
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
>>>
>>> out1 = df[df['B'] > 4].groupby(['A']).size().reset_index()
>>> out1
A 0
0 1 1
1 2 2
>>> out2 = df[df['B'] > 6].groupby(['A']).size().reset_index()
>>> out2
A 0
0 2 1
>>> out3 = df[df['B'] > 8].groupby(['A']).size().reset_index()
>>> out3
Empty DataFrame
Columns: [A, 0]
Index: []
>>>
out1 is the output that I want. But for out2 and out3, how do I get a data frame similar to out1, with zeros filled in for the missing groups, as below?
out2:
   A  0
0  1  0
1  2  1
out3:
   A  0
0  1  0
1  2  0
Thanks in advance.
The idea is to create a boolean mask, convert it to integers and aggregate with sum. Here it is necessary to group by the Series df['A'] rather than by the column name A, because the mask is a standalone Series with no 'A' column:
out3 = (df['B'] > 8).astype(int).groupby(df['A']).sum().reset_index()
#alternative
#out3 = (df['B'] > 8).view('i1').groupby(df['A']).sum().reset_index()
print (out3)
A B
0 1 0
1 2 0
Another idea is to create a helper column, e.g. reassign B to the mask values and then aggregate with sum:
out3 = df.assign(B = (df['B'] > 8).astype(int)).groupby('A')['B'].sum().reset_index()
print (out3)
A B
0 1 0
1 2 0
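If you need the result for several thresholds at once, a small sketch (thresholds 4, 6 and 8 taken from the question) could build them all in one pass:
outs = {t: (df['B'] > t).astype(int).groupby(df['A']).sum().reset_index() for t in (4, 6, 8)}
print (outs[8])  # same frame as out3 above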
Using get_dummies(), it is possible to create one-hot encoded dummy variables for categorical data. For example:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'a'],
'B': ['b', 'a', 'c']})
print(pd.get_dummies(df))
# A_a A_b B_a B_b B_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
So far, so good. But how can I use get_dummies() in combination with multi-index columns? The default behavior is not very practical: the multi-index tuple is converted into a string, and the same suffix mechanism applies as with simple-index columns.
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
ret = pd.get_dummies(df)
print(ret)
print(type(ret.columns[0]))
# ('i','A')_a ('i','A')_b ('ii','B')_a ('ii','B')_b ('ii','B')_c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# str
What I would like to get, however, is for the dummies to create a new column level:
ret = pd.get_dummies(df, ???)
print(ret)
print(type(ret.columns[0]))
# i ii
# A B
# a b a b c
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
#
# tuple
#
# Note that the ret would be equivalent to the following:
# ('i','A','a') ('i','A','b') ('ii','B','a') ('ii','B','b') ('ii','B','c')
# 0 1 0 0 1 0
# 1 0 1 1 0 0
# 2 1 0 0 0 1
How could this be achieved?
Update: I placed a feature request for better support of multi-index data-frames in get_dummies: https://github.com/pandas-dev/pandas/issues/26560
You can parse the column names and rename them:
import ast

def parse_dummy(x):
    parts = x.split('_')
    return ast.literal_eval(parts[0]) + (parts[1],)

ret.columns = pd.Series(ret.columns).apply(parse_dummy)
# (i, A, a) (i, A, b) (ii, B, a) (ii, B, b) (ii, B, c)
#0 1 0 0 1 0
#1 0 1 1 0 0
#2 1 0 0 0 1
Note that this DataFrame is not the same as a DataFrame with three-level multiindex column names.
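If you do want an actual three-level MultiIndex rather than tuple labels, a minimal follow-up sketch (reusing ret from above) is:
ret.columns = pd.MultiIndex.from_tuples(list(ret.columns))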
I had a similar need, but with a more complex DataFrame: a MultiIndex as the row index and numerical columns which should not be converted to dummies. So my case required scanning through the columns, expanding to dummies only the columns of dtype='object', and building a new column index by concatenating the name of the dummified column with the value of the dummy variable itself, because I didn't want to add a new column index level.
Here is the code. First, build a dataframe in the format I need:
import pandas as pd
import numpy as np
df_size = 3
objects = ['obj1','obj2']
attributes = ['a1','a2','a3']
cols = pd.MultiIndex.from_product([objects, attributes], names=['objects', 'attributes'])
lab1 = ['car','truck','van']
lab2 = ['bay','zoe','ros','lol']
time = np.arange(df_size)
node = ['n1','n2']
idx = pd.MultiIndex.from_product([time, node], names=['time', 'node'])
df = pd.DataFrame(np.random.randint(10,size=(len(idx),len(cols))),columns=cols,index=idx)
c1 = map(lambda i:lab1[i],np.random.randint(len(lab1),size=len(idx)))
c2 = map(lambda i:lab2[i],np.random.randint(len(lab2),size=len(idx)))
df[('obj1','a3')]=list(c1)
df[('obj2','a2')]=list(c2)
print(df)
objects obj1 obj2
attributes a1 a2 a3 a1 a2 a3
time node
0 n1 6 5 truck 3 ros 3
n2 5 6 car 9 lol 7
1 n1 0 8 truck 7 zoe 8
n2 4 3 truck 8 bay 3
2 n1 5 8 van 0 bay 0
n2 4 8 car 5 lol 4
And here is the code to dummify only the object columns:
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [f'{t[1]}_{c}' for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
objects obj1 obj2
attributes a1 a2 a3_car a3_truck a3_van a1 a2_bay a2_lol a2_ros a2_zoe a3
time node
0 n1 6 5 0 1 0 3 0 0 1 0 3
n2 5 6 1 0 0 9 0 1 0 0 7
1 n1 0 8 0 1 0 7 0 0 0 1 8
n2 4 3 0 1 0 8 1 0 0 0 3
2 n1 5 8 0 0 1 0 1 0 0 0 0
n2 4 8 1 0 0 5 0 1 0 0 4
It can easily be changed to answer the original use case by adding one more level to the columns multi-index:
df = pd.DataFrame({('i','A'): ['a', 'b', 'a'],
('ii','B'): ['b', 'a', 'c']})
print(df)
i ii
A B
0 a b
1 b a
2 a c
df.columns = pd.MultiIndex.from_tuples([t + ('',) for t in df.columns])
for t in [df.columns[i] for i, dt in enumerate(df.dtypes) if dt == 'object']:
    dummy_block = pd.get_dummies(df[t])
    dummy_block.columns = pd.MultiIndex.from_product(
        [[t[0]], [t[1]], [c for c in dummy_block.columns]],
        names=df.columns.names)
    df = pd.concat([df.drop(t, axis=1), dummy_block], axis=1).sort_index(axis=1)
print(df)
i ii
A B
a b a b c
0 1 0 0 1 0
1 0 1 1 0 0
2 1 0 0 0 1
Note that it still works if there are numerical columns; it just adds an empty additional level to them in the columns index as well.
Is there a way in pandas to select, out of a grouped dataframe, the groups with more than x members?
something like:
grouped = df.groupby(['a', 'b'])
dupes = [g[['a', 'b', 'c', 'd']] for _, g in grouped if len(g) > 1]
I can't find a solution in the docs or on SO.
Use filter:
grouped.filter(lambda x: len(x) > 1)
Example:
In [64]:
df = pd.DataFrame({'a':[0,0,1,2],'b':np.arange(4)})
df
Out[64]:
a b
0 0 0
1 0 1
2 1 2
3 2 3
In [65]:
df.groupby('a').filter(lambda x: len(x)>1)
Out[65]:
a b
0 0 0
1 0 1
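A transform-based alternative (a sketch, not from the original answer) avoids calling a Python lambda per group and keeps the original row order:
df[df.groupby('a')['a'].transform('size') > 1]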