I have below dataframe:
No: Fee:
111 500
111 500
222 300
222 300
123 400
If data in No is duplicate, I want to keep only one fee and remove others.
Should look like below:
No: Fee:
111 500
111
222 300
222
123 400
I actually have no idea where to start, so please guide here.
Thanks.
Use DataFrame.duplicated with set empty string by DataFrame.loc:
#if need test duplicated by both columns
mask = df.duplicated(['No','Fee'])
df.loc[mask, 'Fee'] = ''
print (df)
No Fee
0 111 500
1 111
2 222 300
3 222
4 123 400
But then lost numeric column, because mixed numbers with strings:
print (df['Fee'].dtype)
object
Possible solution is use missing values if need numeric column:
df.loc[mask, 'Fee'] = np.nan
print (df)
No Fee
0 111 500.0
1 111 NaN
2 222 300.0
3 222 NaN
4 123 400.0
print (df['Fee'].dtype)
float64
df.loc[mask, 'Fee'] = np.nan
df['Fee'] = df['Fee'].astype('Int64')
print (df)
No Fee
0 111 500
1 111 <NA>
2 222 300
3 222 <NA>
4 123 400
print (df['Fee'].dtype)
Int64
I have a DataFrame and I want to convert it into the following:
import pandas as pd
df = pd.DataFrame({'ID':[111,111,111,222,222,333],
'class':['merc','humvee','bmw','vw','bmw','merc'],
'imp':[1,2,3,1,2,1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, but grouped by a particular column, ID in this case and maintaining the row order.
My attempt: I tried using .set_index() und .unstack(), but it did not work.
Use GroupBy.cumcount for counter and then reshape by DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID',df.groupby('ID').cumcount()])
.stack()
.unstack(1, fill_value='')
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method would be to use groupby and concat - although this is not totally dynamic it works fine if you only have two columns you want to work with, namely class and imp
s = df.set_index([df['ID'],df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'],s['imp']],axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1
I have two columns "ID" and "division" as shown below.
df = pd.DataFrame(np.array([['111', 'AAA'],['222','AAA'],['333','BBB'],['444','CCC'],['444','AAA'],['222','BBB'],['111','BBB']]),columns=['ID','division'])
ID division
0 111 AAA
1 222 AAA
2 333 BBB
3 444 CCC
4 444 AAA
5 222 BBB
6 111 BBB
The expected output is as shown below where I need to pivot on the same column but the count is dependent on "division". This should be presented in a heatmap.
df = pd.DataFrame(np.array([['0','2','1','1'],['2','0','1','1'],['1','1','0','0'],['1','1','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 2 1 1
222 2 0 1 1
333 1 1 0 0
444 1 1 0 0
So, technically I am doing an overlap between ID's with respect to division.
Example:
The highlighted box in red where the overlap between 111 and 222 ID's is 2(AAA and BBB). where as the overlap between 111 and 444 is 1 (AAA highlighted in the black box).
I could do this in excel in 2 steps.Not sure if below one helps.
Step1:=SUM(COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,$G2),COUNTIFS($B$2:$B$8,$B2,$A$2:$A$8,H$1))-1
Step2:=IF($G12=H$1,0,SUMIFS(H$2:H$8,$G$2:$G$8,$G12))
But is there any way that we can do it in Python using dataframes.
Appreciate your help
Case-2
if df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
ID division count
0 111 AAA 4
1 222 AAA 5
2 333 BBB 6
3 444 CCC 3
4 444 AAA 2
5 222 BBB 2
6 111 BBB 7
Expected output would be
df_result = pd.DataFrame(np.array([['0','18','13','6'],['18','0','8','7'],['13','8','0','0'],['6','7','0','0']]),columns=['111','222','333','444'],index=['111','222','333','444'])
111 222 333 444
111 0 18 13 6
222 18 0 8 7
333 13 8 0 0
444 6 7 0 0
Calculation: Here there is an overlap between 111 and 222 with respect to divisions AAA and BBB hence the sum would be 4+5+2+7=18
Another way to do this is to use a self join with merge and pd.crosstab:
df_out = df.merge(df, on='division')
results = pd.crosstab(df_out.ID_x, df_out.ID_y)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 2.0 1.0 1.0
222 2.0 0.0 1.0 1.0
333 1.0 1.0 0.0 0.0
444 1.0 1.0 0.0 0.0
Case 2
df = pd.DataFrame(np.array([['111', 'AAA','4'],['222','AAA','5'],['333','BBB','6'],
['444','CCC','3'],['444','AAA','2'], ['222','BBB','2'],
['111','BBB','7']]),columns=['ID','division','count'])
df['count'] = df['count'].astype(int)
df_out = df.merge(df, on='division')
df_out = df_out.assign(count = df_out.count_x + df_out.count_y)
results = pd.crosstab(df_out.ID_x, df_out.ID_y, df_out['count'], aggfunc='sum').fillna(0)
np.fill_diagonal(results.values, 0)
Output:
ID_y 111 222 333 444
ID_x
111 0.0 18.0 13.0 6.0
222 18.0 0.0 8.0 7.0
333 13.0 8.0 0.0 0.0
444 6.0 7.0 0.0 0.0
I have a dataframe like bellow
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5 days intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, then you may first call assign for a copy based assignment, and then do the groupby:
df['Bin'] = df.assign(
Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range and use numpy.digitize.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')\
.astype(np.int64).values
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3