How to compare or merge two data frames using python pandas?

How do I compare/merge two data frames based on the start and data columns and get the missing gaps with the count?
Dataframe 1
id start
1 2009
1 2010
1 2011
1 2012
2 2010
2 2011
2 2012
2 2013
2 2014
Dataframe 2
id data
1 2010
1 2012
2 2010
2 2011
2 2012
Expected Output:
id first last size
1 2009 2009 1
1 2011 2011 1
2 2013 2014 2
How may I achieve this?

Use merge with indicator=True and outer join first:
df11 = df1.rename(columns={'start':'data'})
df = df2.merge(df11, how='outer', indicator=True, on=['id','data']).sort_values(['id','data'])
print (df)
id data _merge
5 1 2009 right_only
0 1 2010 both
6 1 2011 right_only
1 1 2012 both
2 2 2010 both
3 2 2011 both
4 2 2012 both
7 2 2013 right_only
8 2 2014 right_only
Then apply the same grouping solution as before, only changing the condition:
# boolean mask checking rows that are not right_only, stored in a variable for reuse
m = (df['_merge'] != 'right_only').rename('g')
# create unique group ids for runs of consecutive right_only rows via cumulative sum
df.index = m.cumsum()
print (df)
id data _merge
g
0 1 2009 right_only
1 1 2010 both
1 1 2011 right_only
2 1 2012 both
3 2 2010 both
4 2 2011 both
5 2 2012 both
5 2 2013 right_only
5 2 2014 right_only
# filter only the right_only rows and aggregate first, last and count
df2 = (df[~m.values].groupby(['id', 'g'])['data']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
id first last size
0 1 2009 2009 1
1 1 2011 2011 1
2 2 2013 2014 2
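Put together, the whole pipeline can be run end to end (a sketch with the sample frames built inline):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'start': [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014]})
df2 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'data': [2010, 2012, 2010, 2011, 2012]})

# outer merge with indicator marks years present only in df1 as right_only
df = (df2.merge(df1.rename(columns={'start': 'data'}),
                how='outer', indicator=True, on=['id', 'data'])
         .sort_values(['id', 'data']))

# consecutive right_only rows share a group id created by the cumulative sum
m = (df['_merge'] != 'right_only').rename('g')
df.index = m.cumsum()

# aggregate each gap to its first year, last year and length
gaps = (df[~m.values].groupby(['id', 'g'])['data']
          .agg(['first', 'last', 'size'])
          .reset_index(level=1, drop=True)
          .reset_index())
print(gaps)
#    id  first  last  size
# 0   1   2009  2009     1
# 1   1   2011  2011     1
# 2   2   2013  2014     2
```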

I answered a similar question for you yesterday. I don't know where you are getting the first and last columns, but here is a way to find the missing years based on the example above:
from functools import reduce
import pandas as pd

df1_year = pd.DataFrame(df1.groupby('id')['start'].apply(list))
df2_year = pd.DataFrame(df2.groupby('id')['data'].apply(list))
dfs = [df1_year,df2_year]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final.reset_index(inplace=True)
def noMatch(a, b):
    return [x for x in a if x not in b]
df3 = []
for i in range(0, len(df_final)):
    df3.append(noMatch(df_final['start'][i], df_final['data'][i]))
missing_year = pd.DataFrame(df3)
missing_year['missingYear'] = missing_year.values.tolist()
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','missingYear']]
df4 = []
for i in range(0, len(df_concat)):
    df4.append(df_concat.applymap(lambda x: x[i] if isinstance(x, list) else x))
df_final1 = reduce(lambda left,right: pd.merge(left,right,on='id'), df4)
pd.concat([df_final1[['id','missingYear_x']], df_final1[['id','missingYear_y']].rename(columns={'missingYear_y':'missingYear_x'})]).rename(columns={'missingYear_x':'missingYear'}).sort_index()
id missingYear
0 1 2009
0 1 2011
1 2 2013
1 2 2014
To add it to df2, per your comment, just append the data.
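The same per-id missing-years list can also be sketched with a single left merge and indicator (my own simplification, not the code above; the frame names follow the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'start': [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014]})
df2 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'data': [2010, 2012, 2010, 2011, 2012]})

# a left merge flags df1 years that have no match in df2 for the same id
merged = (df1.rename(columns={'start': 'data'})
             .merge(df2, on=['id', 'data'], how='left', indicator=True))
missing = (merged.loc[merged['_merge'] == 'left_only', ['id', 'data']]
                 .rename(columns={'data': 'missingYear'})
                 .reset_index(drop=True))
print(missing)
#    id  missingYear
# 0   1         2009
# 1   1         2011
# 2   2         2013
# 3   2         2014
```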

Related

Creating subsets of df using pandas groupby and getting a value based on a function

I have a df similar to the one below. I need to select rows where df['Year 2'] is equal or closest to df['Year'] in subsets grouped by df['ID'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following piece of code, using groupby and passing a function to get the proper row with the closest value for both columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This particular line returns 'int' object is not callable. Any ideas how to fix this line of code or a fresh approach to the problem is appreciated.
TYIA.
You can subtract both columns with Series.sub, take the absolute value, and get the indices of the minimum values per group with GroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If you need a new column filled by booleans, use Index.isin:
df['new'] = df.index.isin(idx)
print (df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If you need to filter the rows, use DataFrame.loc:
df1 = df.loc[idx]
print (df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
A one-line solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
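As a quick check, the approach runs end to end on the sample data (a sketch; the arrow markers from the question are omitted):

```python
import pandas as pd

df = pd.DataFrame({'Year':   [2020, 2020, 2017, 2017, 2019, 2019],
                   'ID':     [12, 12, 10, 10, 6, 6],
                   'A':      [0, 0, 1, 0, 0, 1],
                   'Year 2': [2019, 2020, 2017, 2018, 2017, 2018],
                   'C':      [0, 0, 0, 0, 0, 0]})

# index of the row with the smallest |Year 2 - Year| within each ID
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
df1 = df.loc[idx]
print(df1)
```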
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which keeps the selected rows in their original order, and:
df.loc[idx]
which returns them in the group order of idx

How to merge on multiple keys and add remaining info on conditions?

I have a simple DataFrame on which I am performing a merge. It has three labels: Id, Year and a Value. I have another DataFrame that has the same Id, a different Year, and some names. For a simple example, df1 looks like this:
Id Value Year
1 10 2010
6 11 2020
3 12 2019
4 15 2018
2 17 2017
and df2 looks like this:
Id names Year
1 bs 2017
2 fs 2017
6 td 2020
4 dh 2018
3 sv 2019
So I'm merging using:
df3 = pd.merge(df1, df2, left_on=['Id', 'Year'],right_on=['Id', 'Year'],how='left')
The answer I want to get is this but I don't know how to do it:
Id Value Year names
1 10 2010 bs
6 11 2020 td
3 12 2019 sv
4 15 2018 dh
2 17 2017 fs
So the idea is that rows with years at or below 2017 can be assigned the names from the 2017 data; the dataframe I have is much longer.
You can make a temporary column where you use some constant for years <= 2017 and merge on Id and this column (np.where requires import numpy as np):
df1["tmp"] = np.where(df1["Year"] <= 2017, 1, df1["Year"])
df2["tmp"] = np.where(df2["Year"] <= 2017, 1, df2["Year"])
df3 = pd.merge(
df1, df2, left_on=["Id", "tmp"], right_on=["Id", "tmp"], how="left"
)
print(
df3[["Id", "Value", "Year_x", "names"]].rename(columns={"Year_x": "Year"})
)
Prints:
Id Value Year names
0 1 10 2010 bs
1 6 11 2020 td
2 3 12 2019 sv
3 4 15 2018 dh
4 2 17 2017 fs
As you are going to attach the names column of df2 to df1 by matching Id, we can give the two dataframes the same Id index and join them after dropping the Year column of df2 (this works here because each Id appears only once in df2).
We can use .join() with .set_index() as follows:
(df1.set_index('Id')
.join(
df2.set_index('Id')
.drop(columns='Year')
)
).reset_index()
# Result
Id Value Year names
0 1 10 2010 bs
1 6 11 2020 td
2 3 12 2019 sv
3 4 15 2018 dh
4 2 17 2017 fs
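For reference, the tmp-column trick runs end to end on the sample data (a sketch with the frames built inline; selecting only the needed df2 columns avoids a Year_x/Year_y collision):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 6, 3, 4, 2],
                    'Value': [10, 11, 12, 15, 17],
                    'Year': [2010, 2020, 2019, 2018, 2017]})
df2 = pd.DataFrame({'Id': [1, 2, 6, 4, 3],
                    'names': ['bs', 'fs', 'td', 'dh', 'sv'],
                    'Year': [2017, 2017, 2020, 2018, 2019]})

# collapse every year <= 2017 to one constant so they all match the 2017 names
df1['tmp'] = np.where(df1['Year'] <= 2017, 1, df1['Year'])
df2['tmp'] = np.where(df2['Year'] <= 2017, 1, df2['Year'])

df3 = (df1.merge(df2[['Id', 'tmp', 'names']], on=['Id', 'tmp'], how='left')
          .drop(columns='tmp'))
print(df3)
```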

Look for dates in a dataframe that are in list, grouping by ID

I have an identification data frame with a series of years that represent the start of cycles, as such:
ID YEAR1 YEAR2 YEAR3 YEAR4 YEAR5
1 2002 2004 2006 2008 2010
2 2006 2009 2012 2015 2018
...
and I have a main dataframe, which has the column ID and also Year. I would like to match the two dataframes in the following way: create a new column 'start_cycle' and return 1 if, for that ID, the Year column is any of the years in the identifying dataframe (any of the YEAR1, YEAR2, YEAR3... columns). Such as:
ID YEAR start_cycle
1 2005 0
1 2006 1
2 2006 1
2 2010 0
How could I do this? Thank you!
# convert df1 to ID - YEAR pair by stacking
df1_stacked = df1.set_index('ID').stack().rename('YEAR') \
.reset_index(drop=False)[['ID', 'YEAR']] \
.drop_duplicates()
# perform a left join of df2 with df1_stacked, if the value exists in both df2 and df1_stacked,
# it will have an indicator of both which should result in 1 for the start_cycle.
df2_with_indicator = df2.merge(df1_stacked, how='left', indicator=True) \
.rename(columns={'_merge': 'start_cycle'}) \
.assign(start_cycle=lambda df: df.start_cycle.eq('both').astype(int))
ID YEAR start_cycle
0 1 2005 0
1 1 2006 1
2 2 2006 1
3 2 2010 0
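End to end, with the frames from the question built inline (a sketch; df1 is the wide identification frame and df2 the main frame, as in this answer):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2],
                    'YEAR1': [2002, 2006], 'YEAR2': [2004, 2009],
                    'YEAR3': [2006, 2012], 'YEAR4': [2008, 2015],
                    'YEAR5': [2010, 2018]})
df2 = pd.DataFrame({'ID': [1, 1, 2, 2], 'YEAR': [2005, 2006, 2006, 2010]})

# long (ID, YEAR) pairs from the wide identification frame
df1_stacked = (df1.set_index('ID').stack().rename('YEAR')
                  .reset_index()[['ID', 'YEAR']]
                  .drop_duplicates())

# left merge: rows found in both frames are cycle starts
df2_with_indicator = (df2.merge(df1_stacked, how='left', indicator=True)
                         .rename(columns={'_merge': 'start_cycle'})
                         .assign(start_cycle=lambda d: d.start_cycle.eq('both').astype(int)))
print(df2_with_indicator)
```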
Working example
Data
print(df)
ID YEAR1 YEAR2 YEAR3 YEAR4 YEAR5
0 1 2002 2004 2006 2008 2010
1 2 2006 2009 2012 2015 2018
print(df1)
ID YEAR
0 1 2005
1 1 2006
2 2 2006
3 2 2010
Solution
melt df and check, for each (ID, YEAR) pair of df1, whether it appears among the melted pairs; that gives a boolean which you can convert to integer. Note the match must include the ID: checking the year alone would wrongly flag (2, 2010), since 2010 is a cycle start for ID 1.
pairs = pd.melt(df, id_vars=['ID'], value_name='YEAR')[['ID', 'YEAR']]
df1['start_cycle'] = df1.set_index(['ID', 'YEAR']).index.isin(pairs.set_index(['ID', 'YEAR']).index).astype(int)
ID YEAR start_cycle
0 1 2005 0
1 1 2006 1
2 2 2006 1
3 2 2010 0

Converting columns with date in names to separate rows in Python

I already got an answer to this question in R; I am wondering how this can be implemented in Python.
Let's say we have a pandas DataFrame like this:
import pandas as pd
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
which displays like this:
2019Q1 2019Q2 2019Q3
0 1 2 3
How can I transform it to look like this:
Year Quarter Value
2019 1 1
2019 2 2
2019 3 3
Use Series.str.split with expand=True for a MultiIndex, then reshape by DataFrame.unstack; last, clean the data with Series.reset_index and Series.rename_axis:
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
d.columns = d.columns.str.split('Q', expand=True)
df = (d.unstack(0)
.reset_index(level=2, drop=True)
.rename_axis(('Year','Quarter'))
.reset_index(name='Value'))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Thank you @Jon Clements for another solution:
df = (d.melt()
.variable
.str.extract(r'(?P<Year>\d{4})Q(?P<Quarter>\d)')
.assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Alternative with split:
df = (d.melt()
.variable
.str.split('Q', expand=True)
.rename(columns={0:'Year',1:'Quarter'})
.assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Using DataFrame.stack with DataFrame.pop and Series.str.split:
df = d.stack().reset_index(level=1).rename(columns={0:'Value'})
df[['Year', 'Quarter']] = df.pop('level_1').str.split('Q', expand=True)
Value Year Quarter
0 1 2019 1
0 2 2019 2
0 3 2019 3
If you care about the order of columns, use reindex:
df = df.reindex(['Year', 'Quarter', 'Value'], axis=1)
Year Quarter Value
0 2019 1 1
0 2019 2 2
0 2019 3 3
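All the variants agree; for a quick end-to-end check, the melt-plus-split version with the sample frame (a sketch; note that Year and Quarter come out as strings from the split):

```python
import pandas as pd

d = pd.DataFrame({'2019Q1': [1], '2019Q2': [2], '2019Q3': [3]})

# split each column name on 'Q' into Year and Quarter, keep the values
df = (d.melt()
       .variable
       .str.split('Q', expand=True)
       .rename(columns={0: 'Year', 1: 'Quarter'})
       .assign(Value=d.T.values.flatten()))
print(df)
```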

pandas dataframe join with where restriction

I have 2 pandas DataFrames looking like this:
ranks:
year name rank
2015 A 1
2015 B 2
2015 C 3
2014 A 4
2014 B 5
2014 C 6
and tourneys:
date name
20150506 A
20150708 B
20150910 C
20141212 A
20141111 B
20141010 C
I want to join these two DataFrames based on the name column, however as you can see the names are not unique and are repeated each year. Hence a restriction for the join is that the year of ranks should match the first four characters of date of tourneys.
the result should look like this:
date name_t year name_r rank
20150506 A 2015 A 1
20150708 B 2015 B 2
20150910 C 2015 C 3
20141212 A 2014 A 4
20141111 B 2014 B 5
20141010 C 2014 C 6
I am aware of the join method in pandas, however, I would also need to restrict the join with some sort of WHERE ranks.year == tourneys.date[:4].
Create a new date4 column in df2 and then merge df1 and df2.
In [103]: df2['date4'] = df2['date'] // 10000
Now, merge df1 and df2 on ['year', 'name'] and ['date4', 'name'] combinations.
In [104]: df1.merge(df2, left_on=['year', 'name'], right_on=['date4', 'name'])
Out[104]:
year name rank date date4
0 2015 A 1 20150506 2015
1 2015 B 2 20150708 2015
2 2015 C 3 20150910 2015
3 2014 A 4 20141212 2014
4 2014 B 5 20141111 2014
5 2014 C 6 20141010 2014
Where df1 and df2 looks like
In [105]: df1
Out[105]:
year name rank
0 2015 A 1
1 2015 B 2
2 2015 C 3
3 2014 A 4
4 2014 B 5
5 2014 C 6
In [106]: df2
Out[106]:
date name date4
0 20150506 A 2015
1 20150708 B 2015
2 20150910 C 2015
3 20141212 A 2014
4 20141111 B 2014
5 20141010 C 2014
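Put together (a sketch; integer floor division extracts the year from the eight-digit date):

```python
import pandas as pd

df1 = pd.DataFrame({'year': [2015, 2015, 2015, 2014, 2014, 2014],
                    'name': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'rank': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'date': [20150506, 20150708, 20150910,
                             20141212, 20141111, 20141010],
                    'name': ['A', 'B', 'C', 'A', 'B', 'C']})

# first four digits of the date give the year to join on
df2['date4'] = df2['date'] // 10000
merged = df1.merge(df2, left_on=['year', 'name'], right_on=['date4', 'name'])
print(merged)
```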
