I have two data frames. I have to compare them and get the position of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect output like (3,5) - (Indore, Chennai): the position of the mismatched cell together with the two differing values.
import pandas as pd

df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
# tag each frame, then concatenate and keep only the rows that do not appear in both
df1['df']='df1'
df2['df']='df2'
df=pd.concat([df1,df2],sort=False).drop_duplicates(subset=['A','B','C','D'],keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added the df column to show which data frame each difference comes from.
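The question also asks for the position of the mismatched cell. A minimal sketch for recovering the coordinates, assuming the rows of the two frames can be aligned by sorting on the key column A:

import numpy as np

a = df1.drop(columns='df').sort_values('A').reset_index(drop=True)
b = df2.drop(columns='df').sort_values('A').reset_index(drop=True)
rows, cols = np.where(a.ne(b))   # coordinates of every mismatched cell
for r, c in zip(rows, cols):
    print((r, c), '-', (a.iat[r, c], b.iat[r, c]))

For the example above this prints (1, 2) - ('Indore', 'Chennai').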
I have the following dataset:
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'city':['Pau','Pau','Pau','Pau','Pau','Pau','Lyon','Dax','Dax','Lyon','Lyon','Lyon'],
                   'type':['A','A','A','A','B','B','B','A','B','A','B','B'],
                   'val':[100,90,95,95,90,75,100,70,75,90,95,85]})
id city type val
0 1 Pau A 100
1 2 Pau A 90
2 3 Pau A 95
3 4 Pau A 95
4 5 Pau B 90
5 6 Pau B 75
6 7 Lyon B 100
7 8 Dax A 70
8 9 Dax B 75
9 10 Lyon A 90
10 11 Lyon B 95
11 12 Lyon B 85
I want to create a plot grouped by the variable city, showing the frequency percentage per type. I have tried this:
df.groupby(['city','type']).agg({'type':'count'}).transform(lambda x: x/x.sum()).unstack().plot()
But I get wrong values per group and an unwanted 'None'. The expected values should be:
type A B
city
Dax .50 .50
Lyon .33 .66
Pau .66 .33
Looking at your requirement, you may want crosstab with normalize:
pd.crosstab(df['city'],df['type'],normalize='index').plot()
Where:
print(pd.crosstab(df['city'],df['type'],normalize='index'))
type A B
city
Dax 0.500000 0.500000
Lyon 0.250000 0.750000
Pau 0.666667 0.333333
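For comparison, the groupby attempt can also be repaired: count rows per (city, type) pair, then normalize within each city before unstacking. A minimal sketch:

counts = df.groupby(['city', 'type']).size()   # rows per (city, type)
pct = counts.div(counts.groupby(level='city').sum(), level='city')   # divide by city totals
pct.unstack().plot()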
I have a dataset in which I have to conditionally fill or drop rows, but so far I have been unsuccessful.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Now, I have some empty cells. I could fill them with fillna or a regex, or drop them.
But I want to handle only the leading empty cells, up to where the strings start, either dropping them or filling them with "."
Like below
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is this possible using pandas, or with a loop?
You can try this:
import numpy as np

# treat empty strings as missing, then replace only the leading NaNs:
# after a forward fill the only cells still NaN are those before the first
# non-missing name, so `where` swaps exactly those for '.'
df['Name'] = df['Name'].replace('', np.nan)
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
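For the second expected output (dropping the leading rows instead of filling them), the same forward-fill trick works as a boolean filter. A minimal sketch, run on the original frame before the '.' fill above:

dropped = df[df['Name'].ffill().notna()]   # keep rows at or after the first non-missing Name
print(dropped)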
I have the dataframe below. Is there any way to perform conditional addition of column values in pandas?
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 NaN 4 5 5 54 3 2
222 bbb pune 1 70 NaN 5 4 4 8 3 4
333 ccc mumbai 2 NaN NaN 9 3 4 8 4 3
444 ddd hyd 4 NaN NaN 3 8 6 4 2 7
What I want to achieve:
If City is pune, default_sal should be copied into total_sal; for example, for emp_id 111, total_sal should be 90.
If City is not pune, total_sal should be the sum of the first months_worked month columns; for example, for emp_id 333, months_worked is 2, so total_sal should be jan + feb = 9 + 3 = 12.
Desired output:
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 90 4 5 5 54 3 2
222 bbb pune 1 70 70 5 4 4 8 3 4
333 ccc mumbai 2 NaN 12 9 3 4 8 4 3
444 ddd hyd 4 NaN 21 3 8 6 4 2 7
Use np.where after creating a helper Series:
# sum of the first months_worked month columns for each row (jan is column 6)
s1=pd.Series([df.iloc[x,6:y+6].sum() for x,y in enumerate(df.months_worked)],index=df.index)
np.where(df.City=='pune',df.default_sal,s1)
Out[429]: array([90., 70., 12., 21.])
# assign back to fill the column:
df['total_sal']=np.where(df.City=='pune',df.default_sal,s1)
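If the Python-level loop is a concern, a fully vectorized sketch is possible too, assuming the month columns are jan..jun and appear in calendar order:

import numpy as np

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']
# per row: True for the first months_worked months, False afterwards
mask = np.arange(len(months)) < df['months_worked'].to_numpy()[:, None]
worked = (df[months].to_numpy() * mask).sum(axis=1)
df['total_sal'] = np.where(df['City'] == 'pune', df['default_sal'], worked)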
A B C D E
0 2002-01-13 Dan 2002-01-15 26 -1
1 2002-01-13 Dan 2002-01-15 10 0
2 2002-01-13 Dan 2002-01-15 16 1
3 2002-01-13 Vic 2002-01-17 14 0
4 2002-01-13 Vic 2002-01-03 18 0
5 2002-01-28 Mel 2002-02-08 37 0
6 2002-01-28 Mel 2002-02-06 29 0
7 2002-01-28 Mel 2002-02-10 20 0
8 2002-01-28 Rob 2002-02-12 30 -1
9 2002-01-28 Rob 2002-02-12 48 1
10 2002-01-28 Rob 2002-02-12 0 1
11 2002-01-28 Rob 2002-02-01 19 0
Wen answered a very similar question an hour ago, but I forgot to include some conditions. I'll write them down below:
I want to create a new df['F'] column with the following conditions, per each B group and ignoring zeros in column D:
F is the D value of the row whose C date is nearest to 10 days after the A date and where E = 0.
If no E = 0 row exists at that nearest date (the case of 2002-01-28 Rob), F will be the mean of the D values where E = -1 and E = 1.
If two C dates are at the same distance from 10 days after A (the case of 2002-01-28 Mel), F will be the mean of the D values of both rows.
Output should be:
A B C D E F
0 2002-01-13 Dan 2002-01-15 26 -1 10
1 2002-01-13 Dan 2002-01-15 10 0 10
2 2002-01-13 Dan 2002-01-15 16 1 10
3 2002-01-13 Vic 2002-01-17 14 0 14
4 2002-01-13 Vic 2002-01-03 18 0 14
5 2002-01-28 Mel 2002-02-08 37 0 33
6 2002-01-28 Mel 2002-02-06 29 0 33
7 2002-01-28 Mel 2002-02-10 20 0 33
8 2002-01-28 Rob 2002-02-12 30 -1 39
9 2002-01-28 Rob 2002-02-12 48 1 39
10 2002-01-28 Rob 2002-02-12 0 1 39
11 2002-01-28 Rob 2002-02-01 19 0 39
Wen answered:
df['F']=abs((df.C-df.A).dt.days-10)  # distance of each C date from 10 days after A
# per B group, keep the rows at the minimum distance and map the mean of their D values
df['F']=df.B.map(df.loc[df.F==df.groupby('B').F.transform('min')].groupby('B').D.mean())
df
But now I can't work out how to incorporate the new conditions listed above.
Change the mapper so that, within each B group, only the minimum-distance rows with a non-zero D are kept; take the mean of D where E==0 if such a row exists, otherwise the mean of all remaining D values:
m=df.loc[(df.F==df.groupby('B').F.transform('min'))&(df.D!=0)].groupby('B').apply(lambda x : x['D'][x['E']==0].mean() if (x['E']==0).any() else x['D'].mean())
df['F']=df.B.map(m)
For the sample data m maps Dan to 10, Mel to 33, Rob to 39 and Vic to 14, matching the expected F column.
I have a dataframe like below
textdata
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 65 3 airtel chennai
3 23 6 vodafone mumbai
4 45 1 airtel gurgaon
5 65 3 airtel ongole
6 23 4 vodafone mumbai
7 45 1 airtel telangana
8 65 3 airtel chennai
In my data, user_category values 1, 2, 4 and 6 are transactional and 3 is promotional, so I have split the data using the following commands:
transactional = textdata[textdata['user_category'].isin([1,2,4,6])]
promotional = textdata[textdata['user_category'].isin([3])]
This gave me the following output for transactional and promotional:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
3 23 6 vodafone mumbai
4 45 1 airtel gurgaon
6 23 4 vodafone mumbai
7 45 1 airtel telangana
promotional
id user_category operator circle
2 65 3 airtel chennai
8 65 3 airtel chennai
5 65 3 airtel ongole
but what I expect is for the index to be renumbered sequentially.
Expected output:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 23 6 vodafone mumbai
3 45 1 airtel gurgaon
4 23 4 vodafone mumbai
5 45 1 airtel telangana
promotional
id user_category operator circle
1 65 3 airtel chennai
2 65 3 airtel chennai
3 65 3 airtel ongole
This is what I tried:
transactional.reset_index(inplace = True)
And this is what I got:
transactional
index id user_category operator circle
0 0 23 1 vodafone mumbai
1 1 45 2 airtel andhra
2 3 23 6 vodafone mumbai
3 4 45 1 airtel gurgaon
4 6 23 4 vodafone mumbai
5 7 45 1 airtel telangana
But I am expecting it like this:
transactional
id user_category operator circle
0 23 1 vodafone mumbai
1 45 2 airtel andhra
2 23 6 vodafone mumbai
3 45 1 airtel gurgaon
4 23 4 vodafone mumbai
5 45 1 airtel telangana
Please help me with how I can do this, but don't suggest something like:
del transactional['index']
Thanks in advance.
Use the drop=True option of reset_index. From the pandas documentation:
drop : boolean, default False
    Do not try to insert index into dataframe columns. This resets the index to the default integer index.
So, instead of calling:
transactional.reset_index(inplace = True)
Do:
transactional.reset_index(inplace = True, drop=True)
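inplace=True is optional here; the non-mutating equivalent (a sketch covering both frames) is:

transactional = transactional.reset_index(drop=True)
promotional = promotional.reset_index(drop=True)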