Pandas Groupby: group by a column containing tuples - python

I'm trying to group by a column containing tuples. Each tuple has a different length.
I'd like to perform simple groupby operations on this column of tuples, such as sum or count.
Example:
df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a', 'm'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
print(df)
outputs:
   col1          col2
0     1        (a, b)
1     2        (a, m)
2     3     (b, n, k)
3     4  (a, c, k, z)
I'd like to be able to group col1 by the individual elements of col2, applying for instance a sum.
Expected output would be:
  col2  sum_col1
0    a         7
1    b         4
2    c         4
3    n         3
4    m         2
5    k         7
6    z         4
I feel that pd.melt might be of use here, but I can't see exactly how.

Here is an approach using pd.get_dummies and pd.melt:
import pandas as pd
df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a', 'm'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
value_col = 'col1'
id_col = 'col2'
Unpack tuples to DataFrame:
df = df.join(df.col2.apply(pd.Series))
Create indicator (dummy) columns from the unpacked values:
dummy_cols = df.columns.difference([value_col, id_col])
dfd = pd.get_dummies(df[dummy_cols.tolist() + [value_col]])
Producing:
   col1  0_a  0_b  1_b  1_c  1_m  1_n  2_k  3_z
0     1    1    0    1    0    0    0    0    0
1     2    1    0    0    0    1    0    0    0
2     3    0    1    0    0    0    1    1    0
3     4    1    0    0    1    0    0    1    1
Then melt it and strip the positional prefixes from the variable column:
dfd = pd.melt(dfd, value_vars=dfd.columns.difference([value_col]).tolist(), id_vars=value_col)
dfd['variable'] = dfd.variable.str.replace(r'\d_', '', regex=True)
print(dfd.head())
Yielding:
col1 variable value
0 1 a 1
1 2 a 1
2 3 a 0
3 4 a 1
4 1 b 0
And finally get your output:
dfd[dfd.value != 0].groupby('variable')[value_col].sum()
variable
a    7
b    4
c    4
k    7
m    2
n    3
z    4
Name: col1, dtype: int64
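On pandas 0.25+ this whole pipeline collapses to DataFrame.explode plus an ordinary groupby; a minimal sketch on the question's data:
import pandas as pd

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a', 'm'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})

# explode emits one row per tuple element, repeating col1 alongside it
out = (df.explode('col2')
         .groupby('col2', as_index=False)['col1'].sum()
         .rename(columns={'col1': 'sum_col1'}))
print(out)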

Related

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. Using the code below, I'm grouping rows by column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100, with both conditions met on the same row. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
    'A': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B': [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C': ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition, then filter the original column C with Series.isin using boolean indexing and an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with 'any' to test whether at least one row in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
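For intuition, transform('any') broadcasts the per-group result back to every row, yielding a full-length boolean mask; a quick sketch on the question's data:
mask = (df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')
print(mask.tolist())
# [False, False, False, False, True, True, True, False, False, False]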
Your solution is possible with .any() and not (or, equivalently, .all() on the inverted conditions), but for a large DataFrame it will be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print(df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
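For clarity, here is the isin approach unpacked into named steps (a sketch using the question's data):
import pandas as pd

df = pd.DataFrame({
    'A': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B': [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C': ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})

# Step 1: rows where both conditions hold on the same row (only row 5 here)
bad_rows = df['A'].lt(1) & df['B'].gt(100)

# Step 2: the groups those rows belong to ({'e'})
bad_groups = df.loc[bad_rows, 'C']

# Step 3: keep every row whose group was not flagged
df1 = df[~df['C'].isin(bad_groups)]
print(df1)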

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN, then ffill() + bfill() within each ID group using groupby() + apply():
df[['B', 'C']] = df[['B', 'C']].replace('0', float('NaN'))
df[['B', 'C']] = df.groupby('ID')[['B', 'C']].apply(lambda x: x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
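A minimal sketch of that transform() variant, assuming the same string '0' placeholders as above:
df[['B', 'C']] = (df[['B', 'C']]
                  .replace('0', float('NaN'))
                  .groupby(df['ID'])          # group by ID while keeping the frame's index
                  .transform(lambda s: s.ffill().bfill())
                  .fillna('0'))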
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
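As a reminder of the semantics: combine_first keeps the caller's values and fills its missing entries from the argument, aligning on the index. A tiny sketch:
import pandas as pd

a = pd.Series([1.0, None], index=['x', 'y'])
b = pd.Series([10.0, 20.0], index=['x', 'y'])
print(a.combine_first(b))  # x stays 1.0, y is filled with 20.0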
Alternatively, replace the row manually when the indices involved are known:
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would also look at other pandas functions, such as join and merge.
Output:
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

How to group pandas DataFrame if some values are range of integers, while others are pure integer?

I want to group a df by a column col_2, which contains mostly integers, but some cells contain a range of integers. In my real-life example, each unique integer represents a specific serial number of an assembled part. Each row in the dataframe represents a single part, which is allocated to the assembled part by col_2. Some parts can only be allocated to the assembled part with a given uncertainty (range).
The expected output would be one single group for each referenced integer (assembled part S/N). For example, the entry col_1 = c should be allocated to both groups where col_2 = 1 and col_2 = 2.
df = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'col_2': [1, 2, range(1, 3), 3, range(2, 5), 5]})
col_1 col_2
0 a 1
1 b 2
2 c (1, 2)
3 d 3
4 e (2, 3, 4)
5 f 5
print(df.groupby(['col_2']).groups)
The code above gives an error:
TypeError: '<' not supported between instances of 'range' and 'int'
I think this does what you want:
s = df.col_2.apply(pd.Series).set_index(df.col_1).stack().astype(int)
s.reset_index().groupby(0).col_1.apply(list)
The first step gives you:
col_1
a 0 1
b 0 2
c 0 1
1 2
d 0 3
e 0 2
1 3
2 4
f 0 5
And the final result is:
1 [a, c]
2 [b, c, e]
3 [d, e]
4 [e]
5 [f]
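On pandas 0.25+ the same expansion can be written with DataFrame.explode; a sketch (scalars must first be wrapped so every cell is list-like):
s = df.assign(col_2=df.col_2.map(lambda x: list(x) if isinstance(x, range) else [x]))
print(s.explode('col_2').groupby('col_2').col_1.apply(list))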
Try this: normalize every entry to a tuple, so that all group keys share one hashable, orderable type (range objects cannot be ordered, which is why groupby raises):
df['col_2'] = df.col_2.map(lambda x: tuple(x) if isinstance(x, range) else (x,))
print(df.groupby(['col_2']).groups)
Note this groups by the whole tuple, so c ends up in its own (1, 2) group rather than in both the 1 and 2 groups; for one group per integer, expand the ranges first as in the answer above.

Python - Pandas - Edit duplicate items keeping last

Let's say my df is:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})
How can I zero out col2 on every row except the last one of each col1 group? In this case the result should be:
col1 col2
a 0
a 0
a 30
b 0
b 20
c 10
d 0
d 0
d 30
Thanks!
Use loc and duplicated with the argument keep='last':
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
col1 col2
0 a 0
1 a 0
2 a 30
3 b 0
4 b 20
5 c 10
6 d 0
7 d 0
8 d 30
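Equivalently, the same mask can be applied with Series.where instead of loc assignment; a sketch:
# keep col2 only on the last occurrence of each col1 value, zero it elsewhere
df['col2'] = df['col2'].where(~df.duplicated(subset='col1', keep='last'), 0)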

Create a unique indicator to join two datasets in pandas/python

How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
example: make a unique indicator (col5)
then set up a join with another dataframe using the same logic
col1 col2 col3 col4 col5
apple pear mango tea applepearmangotea
then do a join something like
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key; you just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))

d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
A B C D
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 1 1 g h
d2
A B E F
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 2 0 g h
We can perform the equivalent of a left join using pd.DataFrame.merge
d1.merge(d2, 'left')
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
We can be explicit with the columns
d1.merge(d2, 'left', on=['A', 'B'])
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
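One more reason to prefer merging on the columns directly: a concatenated key can collide. A contrived sketch (hypothetical data, not from the question):
import pandas as pd

left = pd.DataFrame({'col1': ['apple'], 'col2': ['peartea']})
right = pd.DataFrame({'col1': ['applepear'], 'col2': ['tea']})

left['key'] = left['col1'] + left['col2']     # 'applepeartea'
right['key'] = right['col1'] + right['col2']  # also 'applepeartea' -- same key, different rows

# Joining on 'key' would wrongly match these rows; merging on ['col1', 'col2'] would not.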
