How to subtract rows of one pandas data frame from another? - python

The operation that I want to do is similar to merger. For example, with the inner merger we get a data frame that contains rows that are present in the first AND second data frame. With the outer merger we get a data frame that are present EITHER in the first OR in the second data frame.
What I need is a data frame that contains rows that are present in the first data frame AND NOT present in the second one? Is there a fast and elegant way to do it?

Consider Following:
df_one is first DataFrame
df_two is second DataFrame
Present in First DataFrame and Not in Second DataFrame
Solution: by Index
df = df_one[~df_one.index.isin(df_two.index)]
index can be replaced by required column upon which you wish to do exclusion.
In above example, I've used index as a reference between both Data Frames
Additionally, you can also use a more complex query using boolean pandas.Series to solve for above.

How about something like the following?
print df1
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
print df2
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
As long as there is a non-key commonly named column, you can let the added on sufffexes do the work (if there is no non-key common column then you could create one to use temporarily ... df1['common'] = 1 and df2['common'] = 1):
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.foo_y.isnull()]
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
Or you can use isin but you would have to create a single key:
df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df1['Team'] + df2['Year'].astype(str)
print df1[~df1.key.isin(df2.key)]
Team Year foo key
0 Hawks 2001 5 Hawks2001
2 Nets 1987 3 Nets1987
4 Nets 2001 8 Nets2001
5 Nets 2000 10 Nets2000
6 Heat 2004 6 Heat2004
7 Pacers 2003 12 Pacers2003

You could run into errors if your non-index column has cells with NaN.
print df1
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
8 Problem 2112 NaN
print df2
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
3 Problem 2112 NaN
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.foo_y.isnull()]
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
6 Problem 2112 NaN NaN
The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.
Solution:
What I do is to add a unique column to the inner DataFrame and set a value for all rows. Then when you join, you can check to see if that column is NaN for the inner table to find unique records in the outer table.
df2['in_df2']='yes'
print df2
Team Year foo in_df2
0 Pacers 2003 12 yes
1 Heat 2004 6 yes
2 Nets 1988 6 yes
3 Problem 2112 NaN yes
new = df1.merge(df2,on=['Team','Year'],how='left')
print new[new.in_df2.isnull()]
Team Year foo_x foo_y in_df1 in_df2
0 Hawks 2001 5 NaN yes NaN
1 Hawks 2004 4 NaN yes NaN
2 Nets 1987 3 NaN yes NaN
4 Nets 2001 8 NaN yes NaN
5 Nets 2000 10 NaN yes NaN
NB. The problem row is now correctly filtered out, because it has a value for in_df2.
Problem 2112 NaN NaN yes yes

I suggest using parameter 'indicator' in merge. Also if 'on' is None this defaults to the intersection of the columns in both DataFrames.
new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
4 Nets 2001 8
5 Nets 2000 10
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in ‘left’ DataFrame,
“right_only” for observations whose merge key only appears in ‘right’ DataFrame,
and “both” if the observation’s merge key is found in both.

Related

Moving rows value to another existing column

I have a messy datasets as attached below
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 GS 2001 Assigned NaN
3 V 2004 Received NaN
I am trying to move over the corresponding value into the right columns. So ideally should be like this one.
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
I have tried to find the solutions in this platforms but no luck. I used df.loc to placed the datasets but seems like the result is not like what I expected. I would really appreciate your support for solving this issue. Thank you
*Update
It works with #jezrael solution, thanks! but is it possible if we use it for this case?
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 GS 2001 Assigned NaN
3 3 V 2004 Received NaN
And the result should be like this:
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received
You can create mask by last column for test if missing values by Series.isna and then use DataFrame.shift with axis=1 only for filtered rows:
m = df.iloc[:, -1].isna()
df[m] = df[m].shift(axis=1)
print (df)
Sales Credit type Year Status
0 NaN GS 2000 Confirmed
1 NaN V 2000 Assigned
2 NaN GS 2001 Assigned
3 NaN V 2004 Received
If need set all columns without first use DataFrame.iloc with indexing .iloc[m, 1:]:
m = df.iloc[:, -1].isna().to_numpy()
df.iloc[m, 1:] = df.iloc[m, 1:].shift(axis=1)
print (df)
ID Sales Credit_type Year Status
0 1 Aston GS 2000 Confirmed
1 1 NaN V 2000 Assigned
2 2 NaN GS 2001 Assigned
3 3 NaN V 2004 Received

Adding new dataframe colonms using information extracted from the url in the url column, but the url could be missing information

Given: A pandas dataframe that contains a user_url column among other columns.
Expectation: New columns added to the original dataframe where the columns are composed of information extracted from the URL in the user_url column. Those columns being car_make, model, year and user_id.
Some Extra info: We know that the car_make will only contain letters either with or without a '-'. The model can contain any characters. The year will only be 4 digits long. The user_id will consist of digits of any length.
I tired using a regex to split the url but it failed when there was missing information or extra information. I also tried just splinting the data but I has the same issue using split.
Given dataframe
mpg miles user_url
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11
Expected Dataframe
mpg miles user_url car_make model year \
0 NaN NaN https://www.somewebsite.com/suzuki/swift/2015/674857 suzuki swift 2015
1 31.6 NaN https://www.somewebsite.com/bmw/x3/2009/461150 bmw x3 2009
2 28.5 NaN https://www.somewebsite.com/mercedes-benz/e300/1998/13 mercedes-benz e300 1998
3 46.8 NaN https://www.somewebsite.com/320d/2010/247233 NaN 320d 2010
4 21.0 244.4 https://www.somewebsite.com/honda/pass/2019/1038865 honda pass 2019
5 25.0 254.4 https://www.somewebsite.com/volkswagen/passat/11 volkswagen passat NaN
user_id
0 674857
1 461150
2 13
3 247233
4 1038865
5 11
you just have to do
split=df['user_url'].str.split("/", n = 4, expand = True)
df['car_make']=split[3]
df.loc[df['car_make'].str.contains('1|2|3|4|5|6|7|8|9|0'),'car_make']=np.nan

Pandas how to add value to an existing data-frame by index

I have an example data frame let's call it df. I want to add more numbers to df but i don't want to start adding after NaN's which will be the index 7 i want to start adding from index 3.
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 NaN NaN
4 1965 NaN Nan
5 1970 NaN Nan
6 1975 NaN Nan
Let's say we have a column like this:
number2
0 25
1 30
2 35
3 40
my target is to get a df like this
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 25 NaN
4 1965 30 Nan
5 1970 35 Nan
6 1975 40 Nan
I hope I explained it well enough. Thank you for your support !
number2 = [25,30,35,40]
df.loc[df.number.isna(), 'number'] = number2
Result df:

Alternative to Excel SUM in Pandas

I have a dataframe (i.e df1) with the below values. I wanted to SUM Row 4 to 9 and put the resulting value in Row3. How can we achieve it? In excel it has been simple SUM formula like this =SUM(B9:B14) but what is the alternative in pandas?
Detail Value
0 Day 23
1 Month Aug
2 Year 2020
3 Total Tickets NaN
4 Pune 2
5 Mumbai 3
6 Thane 33
7 Kolkatta NaN
8 Hyderabad NaN
9 Kerala 283

Filling NA's with numeric order in pandas

I have a pandas data frame as below:
LogdiffT Year Country
0 -0.003094 2002 Australia
1 -0.015327 2001 NaN
2 0.100617 2000 NaN
3 0.067728 1999 NaN
4 0.089962 2010 China
5 -0.041844 2009 NaN
6 -0.031013 2008 NaN
7 0.091948 2007 NaN
8 0.082764 2006 Greece
9 0.103519 2005 NaN
10 -0.048100 2004 NaN
11 -0.014992 2003 NaN
12 0.166187 1966 Japan
If you see all NA's under the country column and following a country name belongs to that country name until a new country name is encountered. Like all 3 NA's after Australia stand for Australis and the 3 NAs' after China stand for China and so on. I want to recode this variable as a numeric variable such that all observations which belong to one country are coded as same. Like all 4 obs for Australia(AUS+3NAs) should have 1, China as 2 and so on.
In SAS I can do order by and use first. and last. and recode. How do we do similar stuff in pandas.
Any ideas?
EDIT:
I tried implementing one of below solution. And here is one interesting thing I am getting and am not sure why?
My data frame is as above. When I run this:
df.Country or df['Country'] I get an error that there is no column called Country when there is;
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-360-361952a0cbf3> in <module>()
2 data_train=data_2yr[features] # Subsetting the features from original data
3 # Recoding Country, Year variable
----> 4 data_train.Country
/Users/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
1945 return self[name]
1946 raise AttributeError("'%s' object has no attribute '%s'" %
-> 1947 (type(self).__name__, name))
1948
1949 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'Country'
Because of this I am not able to implement the solutions proposed. What is going wrong here?
Call df = df.ffill() and then call factorize, factorize returns a tuple of array values and an index which is composed of your Series values, we only want the array values here:
In [476]:
df['Country'] = df['Country'].factorize()[0]
df
Out[476]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 0
2 0.100617 2000 0
3 0.067728 1999 0
4 0.089962 2010 1
5 -0.041844 2009 1
6 -0.031013 2008 1
7 0.091948 2007 1
8 0.082764 2006 2
9 0.103519 2005 2
10 -0.048100 2004 2
11 -0.014992 2003 2
12 0.166187 1966 3
output from factorize:
In [480]:
df['Country'].factorize()
Out[480]:
(array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]),
Index(['Australia', 'China', 'Greece', 'Japan'], dtype='object'))
As suggest by #John Galt you could compact the above into a one-liner:
df['Country'] = df['Country'].ffill().factorize()[0]
Let's say your dataframe is called df and you have a nested dictionary of your country codes as shown below. Then you can use replace
country_code = {'Country': {'Australia':1, 'China':2, 'Greece':3, 'Japan':4}}
df = df.replace(country_code)
# alternatively df['Country'] = df['Country'].replace(country_code['Country'])
df['Country'] = df['Country'].fillna(method='pad') # fills up the nans
Here's one way to do it.
Get unique countries list by dropping NaNs
In [66]: country_dict = {y:x for x,y in enumerate(df['Country'].dropna().unique())}
In [67]: country_dict
Out[67]: {'Australia': 0, 'China': 1, 'Greece': 2, 'Japan': 3}
Replace Country with country_dict
In [68]: dff = df.replace({'Country': country_dict})
In [69]: dff
Out[69]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 NaN
2 0.100617 2000 NaN
3 0.067728 1999 NaN
4 0.089962 2010 1
5 -0.041844 2009 NaN
6 -0.031013 2008 NaN
7 0.091948 2007 NaN
8 0.082764 2006 2
9 0.103519 2005 NaN
10 -0.048100 2004 NaN
11 -0.014992 2003 NaN
12 0.166187 1966 3
And, then ffill() with previous values.
In [70]: dff.ffill()
Out[70]:
LogdiffT Year Country
0 -0.003094 2002 0
1 -0.015327 2001 0
2 0.100617 2000 0
3 0.067728 1999 0
4 0.089962 2010 1
5 -0.041844 2009 1
6 -0.031013 2008 1
7 0.091948 2007 1
8 0.082764 2006 2
9 0.103519 2005 2
10 -0.048100 2004 2
11 -0.014992 2003 2
12 0.166187 1966 3

Categories