Pandas - Keeping groups having at least two different codes - python

I'm working with a DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
                   'brand': ['A', 'B', 'X', 'A', 'B', 'C', 'X', 'B', 'C', 'X', 'A', 'B'],
                   'code': [2185, 2185, 0, 1410, 1390, 1390, 0, 3670, 4870, 0, 2000, 0]})
print(df)
group brand code
0 1 A 2185
1 1 B 2185
2 1 X 0
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
10 4 A 2000
11 4 B 0
My goal is to view only the groups having at least two different codes. Missing codes, labelled with 0, should not be taken into consideration in the filtering criterion. For example, even though the two records from group 4 have different codes, we don't keep this group in the final DataFrame since one of the codes is missing.
The resulting DataFrame on the above example should look like this:
group brand code
1 2 A 1410
2 2 B 1390
3 2 C 1390
4 2 X 0
5 3 B 3670
6 3 C 4870
7 3 X 0
I didn't manage to do much with this problem. I think that the first step should be to create a mask to remove the records with a missing (0) code. Something like:
mask = df['code'].eq(0)
df = df[~mask]
print(df)
group brand code
0 1 A 2185
1 1 B 2185
3 2 A 1410
4 2 B 1390
5 2 C 1390
7 3 B 3670
8 3 C 4870
10 4 A 2000
And now I would only keep the groups having at least two different codes, but I don't know how to work this out in Python. Also, this method removes the records with a missing code from my final DataFrame, which I don't want: I want a view of the full group.
Any help would be appreciated.

This is a case for transform():
mask = (df.groupby('group')['code']
          .transform(lambda x: x.mask(x == 0)   # mask out the 0 values
                                .nunique())     # count the unique codes
          .gt(1))                               # at least 2
df[mask]
Output:
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
Option 2: Similar idea, but without the lambda function:
mask = (df['code'].mask(df['code'] == 0)   # mask out the 0 values
          .groupby(df['group'])            # group by the original 'group' column
          .transform('nunique')            # count unique codes per group
          .gt(1))                          # at least 2
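As with the first option, df[mask] then returns the same filtered frame.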

We can also use groupby.filter:
df.groupby('group').filter(lambda x: x.code.mask(x.code.eq(0)).nunique()>1)
or, probably faster than the previous:
import numpy as np

(df.assign(code=df['code'].replace(0, np.nan))   # treat 0 as missing
   .groupby('group')
   .filter(lambda x: x.code.nunique() > 1)
   .fillna({'code': 0}))                         # put the 0s back
Output
group brand code
3 2 A 1410
4 2 B 1390
5 2 C 1390
6 2 X 0
7 3 B 3670
8 3 C 4870
9 3 X 0
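If the lambda-based filters feel slow on a large frame, a rough lambda-free sketch (assuming the same df as in the question) is to count the distinct non-zero codes per group once and then keep the qualifying groups with isin:
# distinct non-zero codes per group
valid = df.loc[df['code'].ne(0)].groupby('group')['code'].nunique()
# groups with at least two different codes
keep = valid[valid.gt(1)].index
df[df['group'].isin(keep)]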

Related

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. Using below, I'm grouping values using column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100. This has to occur on the same row though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
    'A': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B': [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C': ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition and then filter the original column C with Series.isin in boolean indexing, using the inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is to use GroupBy.transform with GroupBy.any to test whether at least one value matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible with any (and not) so that the lambda returns a scalar, but for a large DataFrame it should be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print (df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f

pd.Series.explode and ValueError: cannot reindex from a duplicate axis

I consulted a lot of the posts on ValueError: cannot reindex from a duplicate axis (What does `ValueError: cannot reindex from a duplicate axis` mean? and other related posts). I understand that the error can arise with duplicate row indices or column names, but I still can't quite figure out what exactly is throwing me the error.
Below is my best at reproducing the spirit of the dataframe, which does throw the error.
d = {"id" : [1,2,3,4,5],
"cata" : [['aaa1','bbb2','ccc3'],['aaa4','bbb5','ccc6'],['aaa7','bbb8','ccc9'],['aaa10','bbb11','ccc12'],['aaa13','bbb14','ccc15']],
"catb" : [['ddd1','eee2','fff3','ggg4'],['ddd5','eee6','fff7','ggg8'],['ddd9','eee10','fff11','ggg12'],['ddd13','eee14','fff15','ggg16'],['ddd17','eee18','fff19','ggg20']],
"catc" : [['hhh1','iii2','jjj3', 'kkk4', 'lll5'],['hhh6','iii7','jjj8', 'kkk9', 'lll10'],['hhh11','iii12','jjj13', 'kkk14', 'lll15'],['hhh16','iii17','jjj18', 'kkk18', 'lll19'],['hhh20','iii21','jjj22', 'kkk23', 'lll24']]}
df = pd.DataFrame(d)
df.head()
id cata catb catc
0 1 [aaa1, bbb2, ccc3] [ddd1, eee2, fff3, ggg4] [hhh1, iii2, jjj3, kkk4, lll5]
1 2 [aaa4, bbb5, ccc6] [ddd5, eee6, fff7, ggg8] [hhh6, iii7, jjj8, kkk9, lll10]
2 3 [aaa7, bbb8, ccc9] [ddd9, eee10, fff11, ggg12] [hhh11, iii12, jjj13, kkk14, lll15]
3 4 [aaa10, bbb11, ccc12] [ddd13, eee14, fff15, ggg16] [hhh16, iii17, jjj18, kkk18, lll19]
4 5 [aaa13, bbb14, ccc15] [ddd17, eee18, fff19, ggg20] [hhh20, iii21, jjj22, kkk23, lll24]
df.set_index(['id']).apply(pd.Series.explode).reset_index()
Here is the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-63-17e7c29b180c> in <module>()
----> 1 df.set_index(['id']).apply(pd.Series.explode).reset_index()
14 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
3097 # trying to reindex on an axis with duplicates
3098 if not self.is_unique and len(indexer):
-> 3099 raise ValueError("cannot reindex from a duplicate axis")
3100
3101 def reindex(self, target, method=None, level=None, limit=None, tolerance=None):
ValueError: cannot reindex from a duplicate axis
The dataset I'm using is a few hundred MBs and it's a pain - lots of lists inside lists, but the example above is a fair representation of where I'm stuck. Even when I try to generate a fake dataframe with unique values, I still don't understand why I'm getting the ValueError.
I have explored other ways to explode the lists, like using df.apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1), which doesn't throw a ValueError; however, it's definitely not as fast and I'd probably reconsider how I'm processing the df. Still, I want to understand why I'm getting the ValueError when I don't have any obvious duplicate values.
Thanks!!!!
Adding the desired output below, which I generated by chaining apply/stack and dropping levels.
id cata catb catc
0 1 aaa1 ddd1 hhh1
1 1 bbb2 eee2 iii2
2 1 ccc3 fff3 jjj3
3 1 NaN ggg4 kkk4
4 1 NaN NaN lll5
5 2 aaa4 ddd5 hhh6
6 2 bbb5 eee6 iii7
7 2 ccc6 fff7 jjj8
8 2 NaN ggg8 kkk9
9 2 NaN NaN lll10
10 3 aaa7 ddd9 hhh11
11 3 bbb8 eee10 iii12
12 3 ccc9 fff11 jjj13
13 3 NaN ggg12 kkk14
14 3 NaN NaN lll15
15 4 aaa10 ddd13 hhh16
16 4 bbb11 eee14 iii17
17 4 ccc12 fff15 jjj18
18 4 NaN ggg16 kkk18
19 4 NaN NaN lll19
20 5 aaa13 ddd17 hhh20
21 5 bbb14 eee18 iii21
22 5 ccc15 fff19 jjj22
23 5 NaN ggg20 kkk23
24 5 NaN NaN lll24
I could not resolve the pd.Series.explode() error, but the following builds a long-form frame with an 'id' column.
tmp = pd.concat([df['id'],
                 df['cata'].apply(pd.Series),
                 df['catb'].apply(pd.Series),
                 df['catc'].apply(pd.Series)], axis=1)
tmp2 = tmp.unstack().to_frame().reset_index()
tmp2 = tmp2[tmp2['level_0'] != 'id']
tmp2.drop('level_1', axis=1, inplace=True)
tmp2 = tmp2.rename(columns={'level_0': 'id', 0: 'value'})
tmp2.reset_index(drop=True, inplace=True)
id value
0 0 aaa1
1 0 aaa4
2 0 aaa7
3 0 aaa10
4 0 aaa13
5 1 bbb2
6 1 bbb5
7 1 bbb8
8 1 bbb11
9 1 bbb14
10 2 ccc3
11 2 ccc6
12 2 ccc9
...
I had to rethink how I was parsing the data. What I accidentally omitted from this post was that I ended up with unbalanced lists as a consequence of using .str.findall(regex_pattern).to_frame() on different columns. The lists were unbalanced because certain metadata fields were missing over the years (e.g., "name"). However, because I started with a column of lists of lists, I exploded that with df.explode and then used findall to extract patterns into new columns, which meant that null values could be created too.
For a 500MB dataset of several hundred thousand rows of fields with string type data, the whole process took probably less than 5 min.
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3],
     0: [['x', 'y', 'z'], ['a', 'b', 'c'], ['a', 'b', 'c']],
     1: [['a', 'b', 'c'], ['a', 'b', 'c'], ['a', 'b', 'c']],
     2: [['a', 'b', 'c'], ['x', 'y', 'z'], ['a', 'b', 'c']]},
)
print(df)
"""
id 0 1 2
0 1 [x, y, z] [a, b, c] [a, b, c]
1 2 [a, b, c] [a, b, c] [x, y, z]
2 3 [a, b, c] [a, b, c] [a, b, c]
"""
bb = (df.set_index('id')
        .stack()                      # one row per (id, original column)
        .explode()                    # one row per list element
        .reset_index(name='val')
        .drop(columns='level_1'))
print (bb)
"""
id val
0 1 x
1 1 y
2 1 z
3 1 a
4 1 b
5 1 c
6 1 a
7 1 b
8 1 c
9 2 a
10 2 b
11 2 c
12 2 a
13 2 b
14 2 c
15 2 x
16 2 y
17 2 z
18 3 a
19 3 b
20 3 c
21 3 a
22 3 b
23 3 c
24 3 a
25 3 b
26 3 c
"""
aa = df.set_index('id').apply(pd.Series.explode).reset_index()
print(aa)
"""
id 0 1 2
0 1 x a a
1 1 y b b
2 1 z c c
3 2 a a x
4 2 b b y
5 2 c c z
6 3 a a a
7 3 b b b
8 3 c c c
"""

How can I extract a column from dataframe and attach it to rows while keeping other columns intact

How can I extract a column from a pandas DataFrame and attach it to the rows, while keeping the other columns intact?
This is my example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': np.arange(0, 5),
                   'sample_1': [5, 6, 7, 8, 9],
                   'sample_2': [10, 11, 12, 13, 14],
                   'group_id': ["A", "B", "C", "D", "E"]})
The output I'm looking for is:
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
                    'sample_1': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
                    'group_id': ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]})
I have tried to slice the dataframe and concat using pd.concat but it was giving NaN values.
My original dataset is large.
You could do this using stack: set the index to the columns you don't want to modify, call stack, sort the stacked values (so the sample_1 rows come before the sample_2 rows), then reset the index:
df.set_index(['ID','group_id']).stack().sort_values(0).reset_index([0,1]).reset_index(drop=True)
ID group_id 0
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
Using pd.wide_to_long:
res = pd.wide_to_long(df, stubnames='sample_', i='ID', j='group_id')
res.index = res.index.droplevel(1)
res = res.rename(columns={'sample_': 'sample_1'}).reset_index()
print(res)
ID group_id sample_1
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
The function you are looking for is called melt.
For example:
df2 = pd.melt(df, id_vars=['ID', 'group_id'], value_vars=['sample_1', 'sample_2'], value_name='sample_1')
df2 = df2.drop('variable', axis=1)
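If your pandas version rejects a value_name that collides with an existing column, a hedged variant is to melt to a neutral name and rename afterwards:
df2 = (pd.melt(df, id_vars=['ID', 'group_id'],
               value_vars=['sample_1', 'sample_2'],
               value_name='value')        # a name that is not already a column
         .drop('variable', axis=1)
         .rename(columns={'value': 'sample_1'}))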

Slicing multiple ranges of columns in Pandas, by list of names

I am trying to select multiple columns in a Pandas dataframe using two different approaches:
1) via the column numbers, for example columns 1-3 and columns 6 onwards;
and
2) via a list of column names, for instance:
years = list(range(2000, 2017))
months = list(range(1, 13))
years_month = ["A", "B", "C"]
for y in years:
    for m in months:
        y_m = str(y) + "-" + str(m)
        years_month.append(y_m)
Then, years_month would contain the following (abbreviated here):
['A',
'B',
'C',
'2000-1',
'2000-2',
'2000-3',
'2000-4',
'2000-5',
'2000-6',
'2000-7',
'2000-8',
'2000-9',
'2000-10',
'2000-11',
'2000-12',
'2001-1',
'2001-2',
'2001-3',
'2001-4',
'2001-5',
'2001-6',
'2001-7',
'2001-8',
'2001-9',
'2001-10',
'2001-11',
'2001-12']
That said, what is the best (or correct) way to select only the columns whose names are in the list years_month, in each of the two approaches?
I think you need numpy.r_ to concatenate the column positions, then use iloc for selecting:
import numpy as np

print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
and for second approach subset by list:
print (df[years_month])
Sample:
df = pd.DataFrame({'2000-1':[1,3,5],
'2000-2':[5,3,6],
'2000-3':[7,8,9],
'2000-4':[1,3,5],
'2000-5':[5,3,6],
'2000-6':[7,8,9],
'2000-7':[1,3,5],
'2000-8':[5,3,6],
'2000-9':[7,4,3],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
print (df)
2000-1 2000-2 2000-3 2000-4 2000-5 2000-6 2000-7 2000-8 2000-9 A \
0 1 5 7 1 5 7 1 5 7 1
1 3 3 8 3 3 8 3 3 4 2
2 5 6 9 5 6 9 5 6 3 3
B C
0 4 7
1 5 8
2 6 9
print (df.iloc[:, np.r_[1:3, 6:len(df.columns)]])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9
You can also sum ranges (casting to list is necessary in Python 3):
rng = list(range(1,3)) + list(range(6, len(df.columns)))
print (rng)
[1, 2, 6, 7, 8, 9, 10, 11]
print (df.iloc[:, rng])
2000-2 2000-3 2000-7 2000-8 2000-9 A B C
0 5 7 1 5 7 1 4 7
1 3 8 3 3 4 2 5 8
2 6 9 5 6 3 3 6 9
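One caveat for the list-based subset: in recent pandas versions, df[years_month] raises a KeyError if any label in the list is missing from the columns, so intersecting with the actual columns first is a safer sketch:
cols = df.columns.intersection(years_month)   # keep only the labels that exist
print (df[cols])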
I'm not sure what exactly you are asking, but in general DataFrame.loc allows you to select by label and DataFrame.iloc by integer position.
For example, selecting columns 0, 1 and 4:
dataframe.iloc[:, [0, 1, 4]]
and selecting columns labelled 'A', 'B' and 'C':
dataframe.loc[:, ['A', 'B', 'C']]

Python Pandas add column with relative order numbers

How do I add an order-number column to an existing DataFrame?
This is my DataFrame:
import pandas as pd
import math
frame = pd.DataFrame([[1, 4, 2], [8, 9, 2], [10, 2, 1]], columns=['a', 'b', 'c'])

def add_stats(row):
    row['sum'] = sum([row['a'], row['b'], row['c']])
    row['sum_sq'] = sum(math.pow(v, 2) for v in [row['a'], row['b'], row['c']])
    row['max'] = max(row['a'], row['b'], row['c'])
    return row

frame = frame.apply(add_stats, axis=1)
print(frame.head())
The resulting data is:
a b c sum sum_sq max
0 1 4 2 7 21 4
1 8 9 2 19 149 9
2 10 2 1 13 105 10
First, I would like to add 3 extra columns with order numbers, sorting on sum, sum_sq and max, respectively. Next, these 3 columns should be combined into one column - the mean of the order numbers - but I do know how to do that part (with apply and axis=1).
I think you're looking for rank, since you mention sorting. Given your example, add:
frame['sum_order'] = frame['sum'].rank()
frame['sum_sq_order'] = frame['sum_sq'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
To get:
a b c sum sum_sq max sum_order sum_sq_order max_order mean_order
0 1 4 2 7 21 4 1 1 1 1.000000
1 8 9 2 19 149 9 3 3 2 2.666667
2 10 2 1 13 105 10 2 2 3 2.333333
The rank method has some options as well, for example to control how ties are numbered and where NA values end up.
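A minimal sketch of a few of those options, shown on the 'sum' column from above:
frame['sum'].rank(method='min')        # tied values all get the lowest rank in the tie
frame['sum'].rank(method='dense')      # like 'min', but ranks increase without gaps
frame['sum'].rank(ascending=False)     # rank 1 goes to the largest value
frame['sum'].rank(na_option='bottom')  # NA values are ranked last instead of staying NaN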
