I have the dataframe below:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If the value in No is duplicated, I want to keep only the first Fee and blank out the others.
It should look like below:
No: Fee:
111 500
111
222 300
222
123 400
I actually have no idea where to start, so please guide me here.
Thanks.
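For reference, the sample frame can be rebuilt like this (a minimal sketch, with the column names taken from the table above):

import pandas as pd

df = pd.DataFrame({'No': [111, 111, 222, 222, 123],
                   'Fee': [500, 500, 300, 300, 400]})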
Use DataFrame.duplicated and set an empty string for the duplicated rows with DataFrame.loc:
# if you need to test duplicates by both columns
mask = df.duplicated(['No','Fee'])
df.loc[mask, 'Fee'] = ''
print (df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But then the numeric column is lost, because numbers are mixed with strings:
print (df['Fee'].dtype)
object
A possible solution is to use missing values if a numeric column is needed:
import numpy as np
df.loc[mask, 'Fee'] = np.nan
print (df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print (df['Fee'].dtype)
float64
If an integer column is needed, convert to the nullable Int64 dtype:
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print (df)
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print (df['Fee'].dtype)
Int64
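If duplicates should be flagged by the No column alone, as the question describes, only that column needs to be passed to duplicated - a small variation on the same idea:

mask = df.duplicated('No')
df.loc[mask, 'Fee'] = np.nan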
There is pandas DataFrame as:
print(df)
call_id calling_number call_status
1 123 BUSY
2 456 BUSY
3 789 BUSY
4 123 NO_ANSWERED
5 456 NO_ANSWERED
6 789 NO_ANSWERED
Records with other call_status values (say "ERROR" or something else I can't predict) may appear in the dataframe. I need to add a new column on the fly for such a value.
I have applied the pivot_table() function and I get the result I want:
df1 = df.pivot_table(index='calling_number', columns='call_status', values='call_id', aggfunc='count').fillna(0).astype('int64')
calling_number ANSWERED BUSY NO_ANSWERED
123 0 1 1
456 0 1 1
789 0 1 1
Now I need to add one more column that contains the percentage of answered calls for the given calling_number, calculated as the ratio of ANSWERED to the total.
The source dataframe 'df' may not contain entries with call_status = 'ANSWERED', so in that case the percentage column should naturally have a zero value.
Expected result is :
calling_number ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
123 0 1 1 0
456 0 1 1 0
789 0 1 1 0
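For reference, the sample data can be rebuilt like this (a minimal sketch, using the column names shown above):

import pandas as pd

df = pd.DataFrame({'call_id': [1, 2, 3, 4, 5, 6],
                   'calling_number': [123, 456, 789, 123, 456, 789],
                   'call_status': ['BUSY', 'BUSY', 'BUSY',
                                   'NO_ANSWERED', 'NO_ANSWERED', 'NO_ANSWERED']})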
Use crosstab:
df1 = pd.crosstab(df['calling_number'], df['call_status'])
Or, if you need to avoid the NaNs produced by the count aggregation, use pivot_table with the added parameter fill_value=0:
df1 = df.pivot_table(index='calling_number',
                     columns='call_status',
                     values='call_id',
                     aggfunc='count',
                     fill_value=0)
Then, for the ratio, divide by the summed values per row:
df1 = df1.div(df1.sum(axis=1), axis=0)
print (df1)
ANSWERED BUSY NO_ANSWERED
calling_number
123 0.333333 0.333333 0.333333
456 0.333333 0.333333 0.333333
789 0.333333 0.333333 0.333333
EDIT: To add categories that may not exist in the data, use DataFrame.reindex:
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1['ANSWERED'].sum()).fillna(0)
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
If you need the total per row:
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
EDIT1:
Solution that replaces unexpected values with ERROR:
print (df)
call_id calling_number call_status
0 1 123 ttt
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
L = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
df['call_status'] = df['call_status'].where(df['call_status'].isin(L), 'ERROR')
print (df)
call_id calling_number call_status
0 1 123 ERROR
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
.reindex(columns=L + ['ERROR'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ERROR ANS_PERC(%)
calling_number
123 0 0 1 1 0.0
456 0 1 1 0 0.0
789 0 1 1 0 0.0
I like the crosstab idea, but I am a fan of column manipulation so that it's easy to refer back to:
# define a function to capture all the other call_statuses into one bucket
def tester(x):
    if x not in ['ANSWERED', 'BUSY', 'NO_ANSWERED']:
        return 'OTHER'
    else:
        return x
#capture the simplified status in a new column
df['refined_status'] = df['call_status'].apply(tester)
#Do the pivot (or cross tab) to capture the sums:
df1 = df.pivot_table(values="call_id", index='calling_number', columns='refined_status', aggfunc='count')
#Apply a division to get the percentages:
df1["TOTAL"] = df1[['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER']].sum(axis=1)
df1["ANS_PERC"] = df1["ANSWERED"]/df1.TOTAL * 100
print(df1)
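One caveat, not part of the original answer: if one of the statuses never occurs in df, the pivot will not create that column and the TOTAL line above would raise a KeyError. Reindexing the columns before summing guards against that:

df1 = df1.reindex(columns=['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER'], fill_value=0)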
I have the following DataFrame and I want to convert it as shown below:
import pandas as pd
df = pd.DataFrame({'ID':[111,111,111,222,222,333],
'class':['merc','humvee','bmw','vw','bmw','merc'],
'imp':[1,2,3,1,2,1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, but grouped by a particular column, ID in this case and maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.
Use GroupBy.cumcount for a counter and then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID',df.groupby('ID').cumcount()])
.stack()
.unstack(1, fill_value='')
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method would be to use groupby and concat. Although this is not totally dynamic, it works fine if you only have two columns you want to work with, namely class and imp:
s = df.set_index([df['ID'],df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'],s['imp']],axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1
My data consists of ids, each with a certain distance to a point. The goal is to count, per id, the rows where the distance is equal to or smaller than the radius.
Following example shows my DataFrame:
id distance radius
111 0.5 1
111 2 1
111 1 1
222 1 2
222 3 2
333 5 3
333 4 3
The output should look like this:
id count
111 2
222 1
333 0
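For reference, the sample frame can be rebuilt like this (a minimal sketch, using the columns shown above):

import pandas as pd

df = pd.DataFrame({'id': [111, 111, 111, 222, 222, 333, 333],
                   'distance': [0.5, 2, 1, 1, 3, 5, 4],
                   'radius': [1, 1, 1, 2, 2, 3, 3]})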
You can do:
df['distance'].le(df['radius']).groupby(df['id']).sum()
Output:
id
111 2.0
222 1.0
333 0.0
dtype: float64
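If integer counts are preferred (the boolean sums come back as float64 here), the result can be cast afterwards - a small addition, not part of the original output:

df['distance'].le(df['radius']).groupby(df['id']).sum().astype(int)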
Or you can do:
(df.loc[df.distance <= df.radius, 'id']
.value_counts()
.reindex(df['id'].unique(), fill_value=0)
)
Output:
111 2
222 1
333 0
Name: id, dtype: int64
I have the following mock DataFrames:
df1:
ID FILLER1 FILLER2 QUANTITY
01 123 132 12
02 123 132 5
03 123 132 10
df2:
ID FILLER1 FILLER2 QUANTITY
01 123 132 +1
02 123 132 -1
I want to add df2's QUANTITY to df1's QUANTITY on matching ID, FILLER1 and FILLER2, so that df1's QUANTITY values become 13, 4 and 10.
Thx in advance for any help provided!
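For reference, the two frames can be rebuilt like this (a minimal sketch; IDs are assumed to be integers, as in the answer's output below):

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'FILLER1': [123, 123, 123],
                    'FILLER2': [132, 132, 132],
                    'QUANTITY': [12, 5, 10]})
df2 = pd.DataFrame({'ID': [1, 2],
                    'FILLER1': [123, 123],
                    'FILLER2': [132, 132],
                    'QUANTITY': [1, -1]})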
The question is not super clear, but if I understand what you're trying to do, here is a way:
# A left join and filling 0 instead of NaN for that third row
In [19]: merged = df1.merge(df2, on=['ID', 'FILLER1', 'FILLER2'], how='left').fillna(0)
In [20]: merged
Out[20]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y
0 1 123 132 12 1.0
1 2 123 132 5 -1.0
2 3 123 132 10 0.0
# Adding new quantity column
In [21]: merged['QUANTITY'] = merged['QUANTITY_x'] + merged['QUANTITY_y']
In [22]: merged
Out[22]:
ID FILLER1 FILLER2 QUANTITY_x QUANTITY_y QUANTITY
0 1 123 132 12 1.0 13.0
1 2 123 132 5 -1.0 4.0
2 3 123 132 10 0.0 10.0
# Removing _x and _y columns
In [23]: merged = merged[['ID', 'FILLER1', 'FILLER2', 'QUANTITY']]
In [24]: merged
Out[24]:
ID FILLER1 FILLER2 QUANTITY
0 1 123 132 13.0
1 2 123 132 4.0
2 3 123 132 10.0
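A more compact alternative, not from the original answer and just a sketch using index alignment, is to add the two frames on a shared index; fill_value=0 keeps rows that exist in only one of the frames:

out = (df1.set_index(['ID', 'FILLER1', 'FILLER2'])
          .add(df2.set_index(['ID', 'FILLER1', 'FILLER2']), fill_value=0)
          .reset_index())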
The current dataframe:
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 100
111 1/3/2018 500
111 1/4/2018 400
111 1/5/2018 0
222 4/1/2018 2000 200
222 4/2/2018 100
222 4/3/2018 700
222 4/4/2018 0
222 4/5/2018 0
222 4/6/2018 1000
222 4/7/2018 0
This is the dataframe I am trying to get. Basically, I am trying to fill the Start Value for each row. As you can see, every ID has a start value on the first day; the next day's start value = the last day's start value - the last day's payment.
ID Date Start Value Payment
111 1/1/2018 1000 0
111 1/2/2018 1000 100
111 1/3/2018 900 500
111 1/4/2018 400 400
111 1/5/2018 0 0
222 4/1/2018 2000 200
222 4/2/2018 1800 100
222 4/3/2018 1700 700
222 4/4/2018 1000 0
222 4/5/2018 1000 0
222 4/6/2018 1000 1000
222 4/7/2018 0 0
Right now, I use Excel with this formula.
Start Value = if(ID in this row == ID in last row, last row's start value - last row's payment, Start Value)
It works well, but I am wondering if I can do it in Python/pandas. Thank you.
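For reference, the frame can be rebuilt like this (a sketch; the column is named StartValue without a space to match the answer's code, and the missing start values are NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [111] * 5 + [222] * 7,
    'Date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018',
             '4/1/2018', '4/2/2018', '4/3/2018', '4/4/2018', '4/5/2018',
             '4/6/2018', '4/7/2018'],
    'StartValue': [1000, np.nan, np.nan, np.nan, np.nan,
                   2000, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'Payment': [0, 100, 500, 400, 0, 200, 100, 700, 0, 0, 1000, 0],
})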
We can use groupby with shift + cumsum. ffill sets up the initial value for all rows under the same ID, and then we subtract the cumulative payment made up to the previous row, which gives the remaining value at that point.
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True))
Out[61]:
0 1000.0
1 1000.0
2 900.0
3 400.0
4 0.0
5 2000.0
6 1800.0
7 1700.0
8 1000.0
9 1000.0
10 1000.0
11 0.0
Name: StartValue, dtype: float64
Assign it back by adding inplace=True:
df.StartValue.fillna(df.groupby('ID').apply(lambda x : x['StartValue'].ffill()-x['Payment'].shift().cumsum()).reset_index(level=0,drop=True),inplace=True)
df
Out[63]:
ID Date StartValue Payment
0 111 1/1/2018 1000.0 0
1 111 1/2/2018 1000.0 100
2 111 1/3/2018 900.0 500
3 111 1/4/2018 400.0 400
4 111 1/5/2018 0.0 0
5 222 4/1/2018 2000.0 200
6 222 4/2/2018 1800.0 100
7 222 4/3/2018 1700.0 700
8 222 4/4/2018 1000.0 0
9 222 4/5/2018 1000.0 0
10 222 4/6/2018 1000.0 1000
11 222 4/7/2018 0.0 0