Comparing Two Data Frames in Python

I have two data frames. I have to compare them and get the positions of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2:
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output of (3,5)-(Indore - Chennai).

import pandas as pd

df1 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'], 'B': [24, 90, 67],
                    'C': ['Kota', 'Delhi', 'Indore'], 'D': [6000.0, 41500.0, 4500.0]})
df2 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'], 'B': [24, 90, 67],
                    'C': ['Kota', 'Delhi', 'Chennai'], 'D': [6000.0, 41500.0, 4500.0]})
# Tag each frame so we can see which one each unmatched row comes from
df1['df'] = 'df1'
df2['df'] = 'df2'
# Rows that appear in only one frame survive drop_duplicates with keep=False
df = pd.concat([df1, df2], sort=False).drop_duplicates(subset=['A', 'B', 'C', 'D'], keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added the df column to show which data frame each difference comes from.
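If you also need the exact (row, column) position of each mismatch, here is a minimal sketch, assuming the two frames are row-aligned (as df1 and df2 above are; if not, sort both by a key column first) and comparing only the original columns:

import numpy as np

cols = ['A', 'B', 'C', 'D']
mask = df1[cols].ne(df2[cols])  # cell-level mask of mismatches

# Report each unmatched position together with the differing values
for r, c in zip(*np.where(mask)):
    print((r, cols[c]), df1[cols].iat[r, c], '-', df2[cols].iat[r, c])
# (2, 'C') Indore - Chennai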

An unwanted level and wrong calculation in the plot with pandas

I have the following dataset:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                   'city': ['Pau', 'Pau', 'Pau', 'Pau', 'Pau', 'Pau', 'Lyon', 'Dax', 'Dax', 'Lyon', 'Lyon', 'Lyon'],
                   'type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'B'],
                   'val': [100, 90, 95, 95, 90, 75, 100, 70, 75, 90, 95, 85]})
id city type val
0 1 Pau A 100
1 2 Pau A 90
2 3 Pau A 95
3 4 Pau A 95
4 5 Pau B 90
5 6 Pau B 75
6 7 Lyon B 100
7 8 Dax A 70
8 9 Dax B 75
9 10 Lyon A 90
10 11 Lyon B 95
11 12 Lyon B 85
And I want to create a plot grouped by the variable city, showing the frequency percentage per type. I have tried this:
df.groupby(['city','type']).agg({'type':'count'}).transform(lambda x: x/x.sum()).unstack().plot()
But I get wrong values per group and an unwanted 'None'. The expected values should be:
type A B
city
Dax .50 .50
Lyon .33 .66
Pau .66 .33
Looking at your requirement, you may want crosstab with normalize:
pd.crosstab(df['city'],df['type'],normalize='index').plot()
Where:
print(pd.crosstab(df['city'],df['type'],normalize='index'))
type A B
city
Dax 0.500000 0.500000
Lyon 0.250000 0.750000
Pau 0.666667 0.333333
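For reference, the same table can be built with groupby; using size() instead of agg({'type': 'count'}) avoids the extra column level behind the unwanted 'None', and normalizing within each city group (rather than over the whole column) fixes the values. A sketch:

out = (df.groupby(['city', 'type']).size()
         .groupby(level='city')
         .transform(lambda s: s / s.sum())
         .unstack(fill_value=0))
out.plot()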

Conditional filling of column based on string

I have a dataset in which I have to conditionally fill or drop rows, but I have been unsuccessful so far.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Now, I have some empty cells. I can fill them with fillna or a regex, or I can drop the empty cells.
I want to handle only the leading empty cells, up to where the strings start: either drop those rows or fill them with ".".
Like below
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is this possible using pandas, or with any looping?
You can try this:
import numpy as np

df['Name'] = df['Name'].replace('', np.nan)  # treat empty strings as missing
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')  # '.' only before the first Name
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
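For the second desired output, where the leading rows are dropped instead of filled, a minimal sketch (applied to the original frame, before the fill above):

mask = df['Name'].replace('', np.nan).ffill().notna()
print(df[mask])  # keeps rows from the first non-empty Name onward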

How to perform conditional updates of column values in a Pandas DataFrame?

I have the dataframe below. Is there any way to perform conditional addition of column values in pandas?
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 NaN 4 5 5 54 3 2
222 bbb pune 1 70 NaN 5 4 4 8 3 4
333 ccc mumbai 2 NaN NaN 9 3 4 8 4 3
444 ddd hyd 4 NaN NaN 3 8 6 4 2 7
What I want to achieve:
If City is 'pune', default_sal should be copied into total_sal; for example, for emp_id 111, total_sal should be 90.
If City is not 'pune', total_sal should be computed from months_worked; for example, for emp_id 333, months_worked is 2, so total_sal is the sum of the jan and feb values, 9 + 3 = 12.
Desired O/P
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 90 4 5 5 54 3 2
222 bbb pune 1 70 70 5 4 4 8 3 4
333 ccc mumbai 2 NaN 12 9 3 4 8 4 3
444 ddd hyd 4 NaN 21 3 8 6 4 2 7
Use np.where after creating a helper series:

import numpy as np

# Sum the first months_worked month columns for each row (jan is column 6)
s1 = pd.Series([df.iloc[x, 6:y + 6].sum() for x, y in enumerate(df.months_worked)], index=df.index)
np.where(df.City == 'pune', df.default_sal, s1)
Out[429]: array([90., 70., 12., 21.])
df['total_sal'] = np.where(df.City == 'pune', df.default_sal, s1)
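A loop-free variant of the helper series, as a sketch (it assumes the month columns are jan..jun, in calendar order):

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']
# Keep the first months_worked month values per row, NaN out the rest
first_n = df[months].where(np.arange(len(months)) < df['months_worked'].to_numpy()[:, None])
df['total_sal'] = np.where(df['City'] == 'pune', df['default_sal'], first_n.sum(axis=1))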

Multiple dataframe dates and groups conditions

A B C D E
0 2002-01-13 Dan 2002-01-15 26 -1
1 2002-01-13 Dan 2002-01-15 10 0
2 2002-01-13 Dan 2002-01-15 16 1
3 2002-01-13 Vic 2002-01-17 14 0
4 2002-01-13 Vic 2002-01-03 18 0
5 2002-01-28 Mel 2002-02-08 37 0
6 2002-01-28 Mel 2002-02-06 29 0
7 2002-01-28 Mel 2002-02-10 20 0
8 2002-01-28 Rob 2002-02-12 30 -1
9 2002-01-28 Rob 2002-02-12 48 1
10 2002-01-28 Rob 2002-02-12 0 1
11 2002-01-28 Rob 2002-02-01 19 0
Wen answered a very similar question an hour ago, but I forgot to include some conditions. I'll write them down in bold style:
I want to create a new df['F'] column with the following conditions, per each B group and ignoring zeros in the D column:
F = D value, where A dates are nearest to 10 days later than C date and where E = 0.
If E = 0 doesn't exist in the nearest A date to 10 days (case of 2002-01-28 Rob), F will be the mean of the D values where E = -1 and E = 1.
If there are two C dates at the same distance to 10 days from A (case of 2002-01-28 Mel), F will be the mean of these same-period D values.
Output should be:
A B C D E F
0 2002-01-13 Dan 2002-01-15 26 -1 10
1 2002-01-13 Dan 2002-01-15 10 0 10
2 2002-01-13 Dan 2002-01-15 16 1 10
3 2002-01-13 Vic 2002-01-17 14 0 14
4 2002-01-13 Vic 2002-01-03 18 0 14
5 2002-01-28 Mel 2002-02-08 37 0 33
6 2002-01-28 Mel 2002-02-06 29 0 33
7 2002-01-28 Mel 2002-02-10 20 0 33
8 2002-01-28 Rob 2002-02-12 30 -1 39
9 2002-01-28 Rob 2002-02-12 48 1 39
10 2002-01-28 Rob 2002-02-12 0 1 39
11 2002-01-28 Rob 2002-02-01 19 0 39
Wen answered:
df['F'] = abs((df.C - df.A).dt.days - 10)  # distance of each C from "10 days after A"
# Per B group, find the rows at the minimum distance and map the mean of their D back
df['F'] = df.B.map(df.loc[df.F == df.groupby('B').F.transform('min')].groupby('B').D.mean())
df
But now I can't work out how to add the new conditions (the ones I've put in bold style).
Change the mapper to:
m = (df.loc[(df.F == df.groupby('B').F.transform('min')) & (df.D != 0)]
       .groupby('B')
       .apply(lambda x: x['D'][x['E'] == 0].mean() if (x['E'] == 0).any() else x['D'].mean()))
df['F'] = df.B.map(m)
This keeps only the minimum-distance rows whose D is not zero; then, per B group, it takes the mean of D where E == 0 if any such row exists, and otherwise the mean of all remaining D values (the Rob case).
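For completeness, a minimal reconstruction of the sample frame to test against (a sketch; it assumes A and C are parsed as datetimes):

import pandas as pd

df = pd.DataFrame({
    'A': pd.to_datetime(['2002-01-13'] * 5 + ['2002-01-28'] * 7),
    'B': ['Dan'] * 3 + ['Vic'] * 2 + ['Mel'] * 3 + ['Rob'] * 4,
    'C': pd.to_datetime(['2002-01-15'] * 3
                        + ['2002-01-17', '2002-01-03']
                        + ['2002-02-08', '2002-02-06', '2002-02-10']
                        + ['2002-02-12', '2002-02-12', '2002-02-12', '2002-02-01']),
    'D': [26, 10, 16, 14, 18, 37, 29, 20, 30, 48, 0, 19],
    'E': [-1, 0, 1, 0, 0, 0, 0, 0, -1, 1, 1, 0],
})

Running the distance line and the updated mapper on this frame reproduces the expected F column above.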

How to reset an unordered index to an ordered one in Python?

I have a dataframe like below
textdata
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 65 3 airtel chennai
3 23 6 vodafone mumbai
4 45 1 airtel gurgaon
5 65 3 airtel ongole
6 23 4 vodafone mumbai
7 45 1 airtel telangana
8 65 3 airtel chennai
In my data, user_category values 1, 2, 4 and 6 are transactional and 3 is promotional, so I have split the data using the following commands:
transactional = textdata[textdata['user_category'].isin([1, 2, 4, 6])]
promotional = textdata[textdata['user_category'].isin([3])]
So I got the output for transactional and promotional like below:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
3 23 6 vodafone mumbai
4 45 1 airtel gurgaon
6 23 4 vodafone mumbai
7 45 1 airtel telangana
promotional
id user_category operator circle
2 65 3 airtel chennai
8 65 3 airtel chennai
5 65 3 airtel ongole
But what I am expecting is a re-ordered index.
Expected output:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 23 6 vodafone mumbai
3 45 1 airtel gurgaon
4 23 4 vodafone mumbai
5 45 1 airtel telangana
promotional
id user_category operator circle
1 65 3 airtel chennai
2 65 3 airtel chennai
3 65 3 airtel ongole
This is how I tried to do that:
transactional.reset_index(inplace = True)
And this is what I got:
transactional
index id user_category operator circle
0 0 23 1 vodafone mumbai
1 1 45 2 airtel andhra
2 3 23 6 vodafone mumbai
3 4 45 1 airtel gurgaon
4 6 23 4 vodafone mumbai
5 7 45 1 airtel telangana
But I am expecting it the following way:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 23 6 vodafone mumbai
3 45 1 airtel gurgaon
4 23 4 vodafone mumbai
5 45 1 airtel telangana
Please help me with how I can do this, but don't suggest deleting the column afterwards like this:
del transactional['index']
Thanks in advance.
Use the drop=True option of reset_index.
drop : boolean, default False.
Do not try to insert index into dataframe columns. This resets the index to the default integer index
So, instead of calling:
transactional.reset_index(inplace = True)
Do:
transactional.reset_index(inplace = True, drop=True)
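Equivalently, without inplace, assign the returned frame:

transactional = transactional.reset_index(drop=True)
promotional = promotional.reset_index(drop=True)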
