I have the dataframe below:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If the value in No is duplicated, I want to keep only the first Fee and blank out the others.
It should look like below:
No: Fee:
111 500
111
222 300
222
123 400
I actually have no idea where to start, so please guide me here.
Thanks.
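For reference, the sample frame can be rebuilt like this (a minimal sketch, with the column names taken from the table above):

import pandas as pd

df = pd.DataFrame({'No': [111, 111, 222, 222, 123],
                   'Fee': [500, 500, 300, 300, 400]})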
Use DataFrame.duplicated and set an empty string for the duplicated rows with DataFrame.loc:
# if you need to test duplicates by both columns
mask = df.duplicated(['No','Fee'])
df.loc[mask, 'Fee'] = ''
print (df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But then the numeric column is lost, because numbers are mixed with strings:
print (df['Fee'].dtype)
object
A possible solution is to use missing values if a numeric column is needed:
import numpy as np
df.loc[mask, 'Fee'] = np.nan
print (df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print (df['Fee'].dtype)
float64
If an integer column is needed, convert to the nullable Int64 dtype:
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print (df)
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print (df['Fee'].dtype)
Int64
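If duplicates should be flagged by the No column alone, as the question describes, only that column needs to be passed to duplicated - a small variation on the same idea:

mask = df.duplicated('No')
df.loc[mask, 'Fee'] = np.nan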
There is pandas DataFrame as:
print(df)
call_id calling_number call_status
1 123 BUSY
2 456 BUSY
3 789 BUSY
4 123 NO_ANSWERED
5 456 NO_ANSWERED
6 789 NO_ANSWERED
Records with other call_status values (say "ERROR" or something else I can't predict) may appear in the dataframe. I need to add a new column on the fly for such a value.
I have applied the pivot_table() function and I get the result I want:
df1 = df.pivot_table(index='calling_number', columns='call_status', values='call_id', aggfunc='count').fillna(0).astype('int64')
calling_number ANSWERED BUSY NO_ANSWERED
123 0 1 1
456 0 1 1
789 0 1 1
Now I need to add one more column that contains the percentage of answered calls for the given calling_number, calculated as the ratio of ANSWERED to the total.
The source dataframe 'df' may not contain entries with call_status = 'ANSWERED', so in that case the percentage column should naturally have a zero value.
Expected result is :
calling_number ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
123 0 1 1 0
456 0 1 1 0
789 0 1 1 0
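For reference, the sample data can be rebuilt like this (a minimal sketch, using the column names shown above):

import pandas as pd

df = pd.DataFrame({'call_id': [1, 2, 3, 4, 5, 6],
                   'calling_number': [123, 456, 789, 123, 456, 789],
                   'call_status': ['BUSY', 'BUSY', 'BUSY',
                                   'NO_ANSWERED', 'NO_ANSWERED', 'NO_ANSWERED']})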
Use crosstab:
df1 = pd.crosstab(df['calling_number'], df['call_status'])
Or, if you need to avoid the NaNs produced by the count aggregation, use pivot_table with the added parameter fill_value=0:
df1 = df.pivot_table(index='calling_number',
                     columns='call_status',
                     values='call_id',
                     aggfunc='count',
                     fill_value=0)
Then, for the ratio, divide by the summed values per row:
df1 = df1.div(df1.sum(axis=1), axis=0)
print (df1)
ANSWERED BUSY NO_ANSWERED
calling_number
123 0.333333 0.333333 0.333333
456 0.333333 0.333333 0.333333
789 0.333333 0.333333 0.333333
EDIT: To add categories that may not exist in the data, use DataFrame.reindex:
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1['ANSWERED'].sum()).fillna(0)
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
If you need the total per row:
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
EDIT1:
Solution that replaces unexpected values with ERROR:
print (df)
call_id calling_number call_status
0 1 123 ttt
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
L = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
df['call_status'] = df['call_status'].where(df['call_status'].isin(L), 'ERROR')
print (df)
call_id calling_number call_status
0 1 123 ERROR
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=L + ['ERROR'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ERROR ANS_PERC(%)
calling_number
123 0 0 1 1 0.0
456 0 1 1 0 0.0
789 0 1 1 0 0.0
I like the crosstab idea, but I am a fan of column manipulation so that it's easy to refer back to:
# define a function to capture all the other call_statuses into one bucket
def tester(x):
    if x not in ['ANSWERED', 'BUSY', 'NO_ANSWERED']:
        return 'OTHER'
    else:
        return x
#capture the simplified status in a new column
df['refined_status'] = df['call_status'].apply(tester)
#Do the pivot (or cross tab) to capture the sums:
df1 = df.pivot_table(values="call_id", index='calling_number', columns='refined_status', aggfunc='count')
#Apply a division to get the percentages:
df1["TOTAL"] = df1[['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER']].sum(axis=1)
df1["ANS_PERC"] = df1["ANSWERED"]/df1.TOTAL * 100
print(df1)
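One caveat, not part of the original answer: if one of the statuses never occurs in df, the pivot will not create that column and the TOTAL line above would raise a KeyError. Reindexing the columns before summing guards against that:

df1 = df1.reindex(columns=['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER'], fill_value=0)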
I have the following DataFrame and I want to convert it as shown below:
import pandas as pd
df = pd.DataFrame({'ID':[111,111,111,222,222,333],
'class':['merc','humvee','bmw','vw','bmw','merc'],
'imp':[1,2,3,1,2,1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, but grouped by a particular column, ID in this case and maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.
Use GroupBy.cumcount for a counter and then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID',df.groupby('ID').cumcount()])
.stack()
.unstack(1, fill_value='')
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method would be to use groupby and concat. Although this is not totally dynamic, it works fine if you only have two columns you want to work with, namely class and imp:
s = df.set_index([df['ID'],df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'],s['imp']],axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1
My data consists of ids, each with a certain distance to a point. The goal is to count, per id, the rows where the distance is equal to or smaller than the radius.
Following example shows my DataFrame:
id distance radius
111 0.5 1
111 2 1
111 1 1
222 1 2
222 3 2
333 5 3
333 4 3
The output should look like this:
id count
111 2
222 1
333 0
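For reference, the sample frame can be rebuilt like this (a minimal sketch, using the columns shown above):

import pandas as pd

df = pd.DataFrame({'id': [111, 111, 111, 222, 222, 333, 333],
                   'distance': [0.5, 2, 1, 1, 3, 5, 4],
                   'radius': [1, 1, 1, 2, 2, 3, 3]})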
You can do:
df['distance'].le(df['radius']).groupby(df['id']).sum()
Output:
id
111 2.0
222 1.0
333 0.0
dtype: float64
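If integer counts are preferred (the boolean sums come back as float64 here), the result can be cast afterwards - a small addition, not part of the original output:

df['distance'].le(df['radius']).groupby(df['id']).sum().astype(int)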
Or you can do:
(df.loc[df.distance <= df.radius, 'id']
.value_counts()
.reindex(df['id'].unique(), fill_value=0)
)
Output:
111 2
222 1
333 0
Name: id, dtype: int64
I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
I want to add df2's QUANTITY to df1's QUANTITY on matching ID, FILLER1 and FILLER2, so that df1's QUANTITY values become 13, 4 and 10.
Thx in advance for any help provided!
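For reference, the two frames can be rebuilt like this (a minimal sketch; IDs are assumed to be integers, as in the answer's output below):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'FILLER1': [123, 123, 123],
                    'FILLER2': [132, 132, 132],
                    'QUANTITY': [12, 5, 10]})
df2 = pd.DataFrame({'ID': [1, 2],
                    'FILLER1': [123, 123],
                    'FILLER2': [132, 132],
                    'QUANTITY': [1, -1]})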
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0
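A more compact alternative, not from the original answer and just a sketch using index alignment, is to add the two frames on a shared index; fill_value=0 keeps rows that exist in only one of the frames:

out = (df1.set_index(['ID', 'FILLER1', 'FILLER2'])
          .add(df2.set_index(['ID', 'FILLER1', 'FILLER2']), fill_value=0)
          .reset_index())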
The current dataframe:
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 100
111 1/3/2018 500
111 1/4/2018 400
111 1/5/2018 0
222 4/1/2018 2000 200
222 4/2/2018 100
222 4/3/2018 700
222 4/4/2018 0
222 4/5/2018 0
222 4/6/2018 1000
222 4/7/2018 0
This is the dataframe I am trying to get. Basically, I am trying to fill the Start Value for each row. As you can see, every ID has a start value on the first day; the next day's start value = the last day's start value - the last day's payment.
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 1000 100
111 1/3/2018 900 500
111 1/4/2018 400 400
111 1/5/2018 0 0
222 4/1/2018 2000 200
222 4/2/2018 1800 100
222 4/3/2018 1700 700
222 4/4/2018 1000 0
222 4/5/2018 1000 0
222 4/6/2018 1000 1000
222 4/7/2018 0 0
Right now, I use Excel with this formula.
Start Value = if(ID in this row == ID in last row, last row's start value - last row's payment, Start Value)
It works well, but I am wondering if I can do it in Python/pandas. Thank you.
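For reference, the frame can be rebuilt like this (a sketch; the column is named StartValue without a space to match the answer's code, and the missing start values are NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [111] * 5 + [222] * 7,
    'Date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018',
             '4/1/2018', '4/2/2018', '4/3/2018', '4/4/2018', '4/5/2018',
             '4/6/2018', '4/7/2018'],
    'StartValue': [1000, np.nan, np.nan, np.nan, np.nan,
                   2000, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Payment': [0, 100, 500, 400, 0, 200, 100, 700, 0, 0, 1000, 0],
})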
We can use groupby with shift + cumsum. ffill sets up the initial value for all rows under the same ID, and then we subtract the cumulative payment made up to the previous row, which gives the remaining value at that point.
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True))
Out[61]:
0 1000.0
1 1000.0
2 900.0
3 400.0
4 0.0
5 2000.0
6 1800.0
7 1700.0
8 1000.0
9 1000.0
10 1000.0
11 0.0
Name: StartValue, dtype: float64
Assign it back by adding inplace=True:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True),inplace=True)
df
Out[63]:
ID Date StartValue Payment
0 111 1/1/2018 1000.0 0
1 111 1/2/2018 1000.0 100
2 111 1/3/2018 900.0 500
3 111 1/4/2018 400.0 400
4 111 1/5/2018 0.0 0
5 222 4/1/2018 2000.0 200
6 222 4/2/2018 1800.0 100
7 222 4/3/2018 1700.0 700
8 222 4/4/2018 1000.0 0
9 222 4/5/2018 1000.0 0
10 222 4/6/2018 1000.0 1000
11 222 4/7/2018 0.0 0