Pandas: pick the highest value for each unique id - python

I have a df of customers
CUST_ID | SEGMENT | AREA
1 | B | CAD
1 | A | RAM
2 | B | CAD
2 | C | RAM
3 | B | RAM
4 | A | RAM
I want to count the number of unique CUST_IDs per SEGMENT, so I did:
df.groupby(['SEGMENT'])['CUST_ID'].nunique()
However, if the same CUST_ID appears with different SEGMENT types, the count per SEGMENT gets inflated. I want to pick the highest-value SEGMENT per CUST_ID and then count, with A being the highest and C the lowest. So the resulting df would look like:
CUST_ID | SEGMENT | AREA
1 | A | RAM
2 | B | CAD
3 | B | RAM
4 | A | RAM
and the count would be
A - 2
B - 2
C - 0
How would I be able to do this?

You can try grouping by the CUST_ID column, then filtering rows to keep only the min value of the SEGMENT column:
out = (df.groupby(['CUST_ID'])
         .apply(lambda g: g[g['SEGMENT'].eq(g['SEGMENT'].min())])
         .reset_index(drop=True))
NOTE: Since you want to pick the highest-value SEGMENT per CUST_ID and then count, with A being the highest and C the lowest: in the ASCII table, A is 65 and C is 67, so when comparing, A is actually smaller than C. That's why min is used here.
print(out)
CUST_ID SEGMENT AREA
0 1 A RAM
1 2 B CAD
2 3 B RAM
3 4 A RAM
res = out.value_counts('SEGMENT')
print(res)
A 2
B 2
Name: SEGMENT, dtype: int64
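If relying on ASCII order feels fragile, a minimal alternative sketch (assuming the customers frame from the question) is to make the ranking explicit with an ordered categorical, so that 'A' is genuinely the maximum:
import pandas as pd

df = pd.DataFrame({'CUST_ID': [1, 1, 2, 2, 3, 4],
                   'SEGMENT': ['B', 'A', 'B', 'C', 'B', 'A'],
                   'AREA': ['CAD', 'RAM', 'CAD', 'RAM', 'RAM', 'RAM']})

# Encode the priority explicitly: A > B > C, independent of ASCII order.
seg_rank = pd.CategoricalDtype(categories=['C', 'B', 'A'], ordered=True)

best = (df.assign(SEGMENT=df['SEGMENT'].astype(seg_rank))
          .sort_values('SEGMENT', ascending=False)   # highest-priority segment first
          .drop_duplicates('CUST_ID')                # keep the best row per customer
          .sort_values('CUST_ID'))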

You can do it like this:
(df.sort_values('SEGMENT')
   .drop_duplicates('CUST_ID')                  # keep only the first (highest) SEGMENT per 'CUST_ID'
   .groupby('SEGMENT')['CUST_ID'].nunique()     # or just `.size()`, since there are no duplicates left
)
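Neither answer reports segment C with a zero count, as the question's expected output does. A minimal sketch for that, assuming out is the de-duplicated frame from the first answer: count the segments, then reindex over the full segment list with a fill value of 0.
all_segments = ['A', 'B', 'C']
counts = out['SEGMENT'].value_counts().reindex(all_segments, fill_value=0)
print(counts)   # A 2, B 2, C 0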

Related

Filtering a pandas dataframe to remove duplicates with a criterion

I am new to pandas dataframes, so I apologize in case there's an easy or even built-in way to do this.
Let's say I have a dataframe df with 3 columns: A (a string), B (a float) and C (a bool). Values of column A are not unique. B is a random number, and rows with the same A value can have different values of B. Column C is True if the value of A is repeated in the dataset.
An example
| | A | B | C |
|---|-----|-----|-------|
| 0 | cat | 10 | True |
| 1 | dog | 10 | False |
| 2 | cat | 20 | True |
| 3 | bee | 100 | False |
(The column C is actually redundant and could be obtained with df['C']=df['A'].duplicated(keep=False))
What I want to obtain is a dataframe where, for duplicated entries of A (C==True), only the row with the highest B value is kept.
I know how to get the list of rows with maximum value of B:
df.loc[df[df['C']].groupby('A')['B'].idxmax()] #is this the best way actually?
but what I want is the opposite: filter df so as to get only the entries that are not duplicated (C==False) plus, for the duplicated ones, only the row with the highest B.
One possibility could be to concatenate df[~df['C']] and the table above, but is that really the best way?
One approach:
res = df.loc[df.groupby("A")["B"].idxmax()]  # idxmax returns index labels, so index with .loc
print(res)
Output
A B C
3 bee 100 False
2 cat 20 True
1 dog 10 False
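An equivalent sketch using the sort-then-dedupe pattern instead of idxmax (assuming the example frame from the question): sorting by B and keeping the last duplicate of each A retains the highest-B row.
import pandas as pd

df = pd.DataFrame({'A': ['cat', 'dog', 'cat', 'bee'],
                   'B': [10, 10, 20, 100],
                   'C': [True, False, True, False]})

res = df.sort_values('B').drop_duplicates('A', keep='last').sort_index()
print(res)
#      A    B      C
# 1  dog   10  False
# 2  cat   20   True
# 3  bee  100  False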

How do you identify which IDs have an increasing value over time in another column in a Python dataframe?

Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
ID column contains the ID of a particular person.
Value column contains the value of their transaction.
Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID with the increasing value of transactions over time?
I'm looking for some way I can extract ID 1 as my desired ID with increasing value of transactions, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 as its trend of transactions is declining over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by values or by date:
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
import numpy as np

df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
              np.where(x.diff() < 0, 'decrease', '--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming this won't happen
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
              np.where(x.diff() < 0, 'decrease', '--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether "decrease" appears in a row, or only "--" does, and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase
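A more compact alternative sketch, assuming the date column has been parsed with pd.to_datetime: sort each id's rows by date and test Series.is_monotonic_increasing, again requiring more than one transaction so that id 2 is excluded.
increasing = df.groupby('id').apply(
    lambda g: len(g) > 1 and g.sort_values('date')['value'].is_monotonic_increasing
)
print(increasing[increasing].index.tolist())   # [1]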

Replace the cell with the most frequent word in Pandas DataFrame

I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({'Source1': ['Corona,Corona,Corona', 'Sars,Sars', 'Corona,Sars',
                               'Sars,Corona', 'Sars'],
                   'Area': ['A,A,A,B', 'A', 'A,B,B,C', 'C,C,B,C', 'A,B,C']})
df
Source1 Area
0 Corona,Corona,Corona A,A,A,B
1 Sars,Sars A
2 Corona,Sars A,B,B,C
3 Sars,Corona C,C,B,C
4 Sars A,B,C
I want to check each cell in each column (the real data has many columns), find the frequency of each unique word (unique words are separated by ','), and replace the whole entry with the most frequent word.
In the case of a tie, it doesn't matter which word to replace. So the desired output would look like this:
df
Source Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
In this case, I randomly chose to pick the first word when there is a tie, but it really doesn't matter.
Thanks in advance.
Create DataFrames with Series.str.split and expand=True, then use DataFrame.mode and select the first column by position:
df['Source1'] = df['Source1'].str.split(',', expand=True).mode(axis=1).iloc[:, 0]
df['Area'] = df['Area'].str.split(',', expand=True).mode(axis=1).iloc[:, 0]
print (df)
Source1 Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
Another idea with collections.Counter.most_common:
from collections import Counter
f = lambda x: [Counter(y.split(',')).most_common(1)[0][0] for y in x]
df[['Source1', 'Area']] = df[['Source1', 'Area']].apply(f)
#all columns
#df = df.apply(f)
print (df)
Source1 Area
0 Corona A
1 Sars A
2 Corona B
3 Sars C
4 Sars A
Here is my offering, which can be executed in a single line for each series and requires no extra imports.
df['Area'] = df['Area'].apply(lambda x: max(x.replace(',',''), key=x.count))
After removing all ',' characters from the Area values, we replace each field with the element that occurs most often (or the first element in the case of a tie), via the key=x.count argument.
You could also use something similar (demonstrated with the Source1 series), returning the maximum from the list of elements created by splitting the field.
df['Source1'] = df['Source1'].apply(lambda x: max(list(x.split(',')), key=x.count))
+---+---------+------+
| | Source1 | Area |
+---+---------+------+
| 0 | Corona | A |
| 1 | Sars | A |
| 2 | Corona | B |
| 3 | Sars | C |
| 4 | Sars | A |
+---+---------+------+
Two methods shown above to highlight choices; both would work adequately on either or both series.

Use pandas groupby.size() results for arithmetical operation

I've got the following problem, which I'm stuck on and unfortunately cannot resolve by myself or through similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I got a Dataframe with several columns and one column that indicates the ID of a user. It might happen that the same user has several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this operation I tried to use df.groupby('userID').size(), which in turn I want to use for another simple calculation, like a division or whatever.
But as I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the dataframe above is my main data frame, so to speak. Besides this frame I've got a second frame looking like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So in the 'value/appearances' column I basically want the result of the number in the value column divided by the number of appearances of that user in the main dataframe. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, this aligns with the df index, so it introduces NaN for the last row:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
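For the second frame from the edit, a minimal sketch of the requested division (the name df2 and its columns are assumptions based on the example): map the per-user row counts from the main frame onto it and divide.
# Rows per user in the main frame: {1: 2, 2: 1, 3: 1}
appearances = df.groupby('userID').size()

# df2 is the hypothetical second frame with 'userID' and 'value' columns.
df2['value/appearances'] = df2['value'] / df2['userID'].map(appearances)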

How to exclude a single value from Groupby method using Pandas

I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing the Excel file):
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0.
Compare against the result without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
Source.
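Applied to this question's data, the filter pattern might look like the following sketch (assuming df is the frame produced by fillna(0) above): drop every Method group whose value is 0, then aggregate as before.
# Keep only rows belonging to Method groups other than 0.
kept = df.groupby('Method').filter(lambda g: (g['Method'] != 0).all())

b = kept.groupby(['Team', 'Method']).agg({'Method': 'count'})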
