Is there a way in pandas to select, from a grouped DataFrame, the groups with more than x members?
Something like:
grouped = df.groupby(['a', 'b'])
dupes = [g[['a', 'b', 'c', 'd']] for _, g in grouped if len(g) > 1]
I can't find a solution in the docs or on SO.
Use filter:
grouped.filter(lambda x: len(x) > 1)
Example:
In [64]:
df = pd.DataFrame({'a':[0,0,1,2],'b':np.arange(4)})
df
Out[64]:
a b
0 0 0
1 0 1
2 1 2
3 2 3
In [65]:
df.groupby('a').filter(lambda x: len(x)>1)
Out[65]:
a b
0 0 0
1 0 1
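For the multi-column grouping in the question, the same approach applies; a minimal sketch (the data here is made up, with the column names taken from the question):
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 2],
                   'b': [1, 1, 2, 3],
                   'c': list('wxyz'),
                   'd': range(4)})

# keep only the (a, b) groups that contain more than one row
dupes = df.groupby(['a', 'b']).filter(lambda g: len(g) > 1)

# an equivalent boolean mask using transform('size'), which avoids the
# per-group Python call and is often faster when there are many groups
dupes = df[df.groupby(['a', 'b'])['a'].transform('size') > 1]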
I defined a MultiIndex DataFrame as follows:
columns = pd.MultiIndex.from_product(
[assets, ['A', 'B', 'C']],
names=['asset', 'var']
)
res = pd.DataFrame(0, index=data.index, columns=columns)
However, I have not had any success setting or updating single values of this DataFrame. Any suggestions? Or should I switch to NumPy arrays for efficiency?
Use DataFrame.loc with a tuple to select a MultiIndex column and set the new value, like:
assets = ['X','Y']
columns = pd.MultiIndex.from_product(
[assets, ['A', 'B', 'C']],
names=['asset', 'var']
)
res = pd.DataFrame(0, index=range(3), columns=columns)
print (res)
asset X Y
var A B C A B C
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
res.loc[0, ('X','B')] = 100
print (res)
asset X Y
var A B C A B C
0 0 100 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
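If many individual cells need updating, DataFrame.at is the fast path for scalar access, and pd.IndexSlice can target one variable across all assets; a small sketch continuing from the frame above:
# fast scalar update of a single cell
res.at[1, ('Y', 'C')] = 5

# set variable 'A' for every asset in row 2 at once
res.loc[2, pd.IndexSlice[:, 'A']] = 7
print (res)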
I have a DataFrame with multiple bit columns that I want to combine into integer columns. Can someone guide me how to do that? Here is an example:
Test A B C D E
t1 0 0 0 1 0
t2 1 0 1 0 1
t3 1 1 1 1 0
t4 0 0 0 0 1
Here I want to combine 3 columns at a time, so I will be combining {A, B, C} and {D, E}; here is the expected output:
Test X Y
t1 0 2
t2 5 1
t3 7 2
t4 0 1
Can someone please guide me on how to do this in Python?
Thanks.
First convert the columns to strings, then apply a lambda function that parses the joined bits as a base-2 integer:
df = df.set_index('Test')
a = df[['A','B','C']].astype(str).apply(lambda x: int(''.join(x), 2), axis=1)
b = df[['D','E']].astype(str).apply(lambda x: int(''.join(x), 2), axis=1)
df = pd.DataFrame({'X':a, 'Y':b}).reset_index()
print (df)
Test X Y
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
Another, faster solution, inspired by the other answers:
df = df.set_index('Test')
#define columns in dictionary
cols = {'X':['A','B','C'],'Y':['D','E']}
#dictionary of Series
d = {k:df[v].dot((1 << np.arange(len(v) - 1, -1, -1))) for k, v in cols.items()}
#alternative, inspired by divakar answer
#d ={k:pd.Series((2**np.arange(len(v)-1,-1,-1)).dot(df[v].values.T)) for k,v in cols.items()}
df = pd.concat(d, axis=1).reset_index()
print (df)
Test X Y
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
Dynamic solution: create the dict of column names by grouping the columns with a helper array built by floor-dividing np.arange:
df = df.set_index('Test')
cols = pd.Series(df.columns).groupby(np.arange(len(df.columns)) // 3).apply(list).to_dict()
print (cols)
{0: ['A', 'B', 'C'], 1: ['D', 'E']}
d = {k:df[v].dot((1 << np.arange(len(v) - 1, -1, -1))) for k, v in cols.items()}
df = pd.concat(d, axis=1).reset_index()
print (df)
Test 0 1
0 t1 0 2
1 t2 5 1
2 t3 7 2
3 t4 0 1
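The concatenated result above keeps the generated group numbers as column names; if the X/Y names from the question are wanted, they can be mapped back afterwards, for example:
# rename the group numbers produced by the dynamic solution
df = df.rename(columns={0: 'X', 1: 'Y'})
print (df)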
You can write a function that combines any list of columns as binary digits, like this:
def join_columns(df, columns, name):
    series = None
    for column in columns:
        if series is not None:
            # shift the accumulated bits left and add the next bit column
            series *= 2
            series += df[column]
        else:
            series = df[column].copy()
    series.name = name
    return series
Then use it to combine columns in your dataframe:
X = join_columns(df, ['A', 'B', 'C'], 'X')
Y = join_columns(df, ['D', 'E'], 'Y')
print(pd.concat([X, Y], axis = 1))
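If the pieces should be stitched back together with the Test column, a short follow-up (assuming Test is still a regular column, as in the question):
out = pd.concat([df['Test'], X, Y], axis=1)
print(out)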
For each row in a DataFrame, I wish to create duplicates of it with an additional column to identify each duplicate.
E.g. the original DataFrame is
A | A
B | B
I wish to make a duplicate of each row with an additional column to identify it, resulting in:
A | A | 1
A | A | 2
B | B | 1
B | B | 2
You can use df.reindex followed by a groupby on df.index.
df = df.reindex(df.index.repeat(2))
df['count'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Similarly, using reindex and assign with np.tile:
df = df.reindex(df.index.repeat(2))\
       .assign(count=np.tile([1, 2], len(df)))\
       .reset_index(drop=True)
df
a b count
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Use Index.repeat with loc; for the count, use groupby with cumcount:
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
print (df)
a b
0 A A
1 B B
df = df.loc[df.index.repeat(2)]
df['new'] = df.groupby(level=0).cumcount() + 1
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Or:
df = df.loc[df.index.repeat(2)]
df['new'] = np.tile([1, 2], len(df) // 2)
df = df.reset_index(drop=True)
print (df)
a b new
0 A A 1
1 A A 2
2 B B 1
3 B B 2
Setup
Borrowed from @jezrael
df = pd.DataFrame({'a': ['A', 'B'], 'b': ['A', 'B']})
a b
0 A A
1 B B
Solution 1
Create a pd.MultiIndex with pd.MultiIndex.from_product
Then use pd.DataFrame.reindex
idx = pd.MultiIndex.from_product(
[df.index, [1, 2]],
names=[df.index.name, 'New']
)
df.reindex(idx, level=0).reset_index('New')
New a b
0 1 A A
0 2 A A
1 1 B B
1 2 B B
Solution 2
This uses the same loc and reindex concept used by @cᴏʟᴅsᴘᴇᴇᴅ and @jezrael, but simplifies the final answer by using list and int multiplication rather than np.tile.
df.loc[df.index.repeat(2)].assign(New=[1, 2] * len(df))
a b New
0 A A 1
0 A A 2
1 B B 1
1 B B 2
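If more than two copies are needed, the same pattern should generalize; a small sketch where k is a hypothetical repeat count:
k = 3  # hypothetical number of copies per row
df.loc[df.index.repeat(k)].assign(New=list(range(1, k + 1)) * len(df))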
Use pd.concat() to repeat, and then groupby with cumcount() to count:
In [24]: df = pd.DataFrame({'col1': ['A', 'B'], 'col2': ['A', 'B']})
In [25]: df
Out[25]:
col1 col2
0 A A
1 B B
In [26]: df_repeat = pd.concat([df]*3).sort_index()
In [27]: df_repeat
Out[27]:
col1 col2
0 A A
0 A A
0 A A
1 B B
1 B B
1 B B
In [28]: df_repeat["count"] = df_repeat.groupby(level=0).cumcount() + 1
In [29]: df_repeat # df_repeat.reset_index(drop=True); if index reset required.
Out[29]:
col1 col2 count
0 A A 1
0 A A 2
0 A A 3
1 B B 1
1 B B 2
1 B B 3
If I have a dataframe like
df= pd.DataFrame(['a','b','c','d'],index=[0,0,1,1])
0
0 a
0 b
1 c
1 d
How can I reshape the DataFrame based on the index, as shown below, i.e.
df= pd.DataFrame([['a','b'],['c','d']],index=[0,1])
0 1
0 a b
1 c d
Let's use set_index, groupby, cumcount, and unstack:
df.set_index(df.groupby(level=0).cumcount(), append=True)[0].unstack()
Output:
0 1
0 a b
1 c d
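A side note: if the groups have unequal sizes, this unstack approach should still work, filling the missing positions with NaN; a quick sketch with a made-up extra row:
df2 = pd.DataFrame(['a', 'b', 'c', 'd', 'e'], index=[0, 0, 1, 1, 1])
# group 0 has only two items, so its third column comes out as NaN
df2.set_index(df2.groupby(level=0).cumcount(), append=True)[0].unstack()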
You can use pivot with cumcount:
a = df.groupby(level=0).cumcount()
df = df.assign(c=a).pivot(columns='c', values=0)
A couple of ways:
1.
In [490]: df.groupby(df.index)[0].agg(lambda x: list(x)).apply(pd.Series)
Out[490]:
0 1
0 a b
1 c d
2.
In [447]: df.groupby(df.index).apply(lambda x: pd.Series(x.values.tolist()).str[0])
Out[447]:
0 1
0 a b
1 c d
3.
In [455]: df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
Out[455]:
c 0 1
i
0 a b
1 c d
to remove names
In [457]: (df.assign(i=df.index, c=df.groupby(level=0).cumcount()).pivot(index='i', columns='c', values=0)
.rename_axis(None).rename_axis(None, axis=1))
Out[457]:
0 1
0 a b
1 c d
I want to slice a column in a DataFrame (which contains only strings) based on the integers from a Series. Here is an example:
data = pandas.DataFrame(['abc','scb','dvb'])
indices = pandas.Series([0,1,0])
Then apply some function so I get the following:
0
0 a
1 c
2 d
You can use plain Python to manipulate the lists beforehand.
l1 = ['abc','scb','dvb']
l2 = [0,1,0]
l3 = [l1[i][l2[i]] for i in range(len(l1))]
You get l3 as
['a', 'c', 'd']
Now converting it to DataFrame
data = pd.DataFrame(l3)
You get the desired dataframe
You can use the following vectorized approach:
In [191]: [tuple(x) for x in indices.reset_index().values]
Out[191]: [(0, 0), (1, 1), (2, 0)]
In [192]: data[0].str.extractall(r'(.)') \
.loc[[tuple(x) for x in indices.reset_index().values]]
Out[192]:
0
match
0 0 a
1 1 c
2 0 d
In [193]: data[0].str.extractall(r'(.)') \
.loc[[tuple(x) for x in indices.reset_index().values]] \
.reset_index(level=1, drop=True)
Out[193]:
0
0 a
1 c
2 d
Explanation:
In [194]: data[0].str.extractall(r'(.)')
Out[194]:
0
match
0 0 a
1 b
2 c
1 0 s
1 c
2 b
2 0 d
1 v
2 b
In [195]: data[0].str.extractall(r'(.)').loc[ [ (0,0), (1,1) ] ]
Out[195]:
0
match
0 0 a
1 1 c
Numpy solution:
In [259]: a = np.array([list(x) for x in data.values.reshape(1, len(data))[0]])
In [260]: a
Out[260]:
array([['a', 'b', 'c'],
['s', 'c', 'b'],
['d', 'v', 'b']],
dtype='<U1')
In [263]: pd.Series(a[np.arange(len(data)), indices])
Out[263]:
0 a
1 c
2 d
dtype: object