pandas dataframe join with where restriction - python

I have 2 pandas DataFrames looking like this:
ranks:
year name rank
2015 A 1
2015 B 2
2015 C 3
2014 A 4
2014 B 5
2014 C 6
and tourneys:
date name
20150506 A
20150708 B
20150910 C
20141212 A
20141111 B
20141010 C
I want to join these two DataFrames on the name column; however, as you can see, the names are not unique and are repeated each year. Hence a restriction for the join is that the year of ranks must match the first four characters of date in tourneys.
The result should look like this:
date name_t year name_r rank
20150506 A 2015 A 1
20150708 B 2015 B 2
20150910 C 2015 C 3
20141212 A 2014 A 4
20141111 B 2014 B 5
20141010 C 2014 C 6
I am aware of the join method in pandas; however, I would also need to restrict the join with something like WHERE ranks.year == tourneys.date[:4].

Create a new date4 column for df2, then merge df1 and df2. Integer division keeps the year as an int without a float round-trip:
In [103]: df2['date4'] = df2['date'] // 10000
Now merge df1 and df2 on the ['year', 'name'] and ['date4', 'name'] column pairs.
In [104]: df1.merge(df2, left_on=['year', 'name'], right_on=['date4', 'name'])
Out[104]:
year name rank date date4
0 2015 A 1 20150506 2015
1 2015 B 2 20150708 2015
2 2015 C 3 20150910 2015
3 2014 A 4 20141212 2014
4 2014 B 5 20141111 2014
5 2014 C 6 20141010 2014
where df1 and df2 look like:
In [105]: df1
Out[105]:
year name rank
0 2015 A 1
1 2015 B 2
2 2015 C 3
3 2014 A 4
4 2014 B 5
5 2014 C 6
In [106]: df2
Out[106]:
date name date4
0 20150506 A 2015
1 20150708 B 2015
2 20150910 C 2015
3 20141212 A 2014
4 20141111 B 2014
5 20141010 C 2014
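For reference, the whole answer can be run end to end as a short script; the frames below are rebuilt from the question's tables:

```python
import pandas as pd

# Rebuild the example frames from the question.
ranks = pd.DataFrame({
    'year': [2015, 2015, 2015, 2014, 2014, 2014],
    'name': list('ABCABC'),
    'rank': [1, 2, 3, 4, 5, 6],
})
tourneys = pd.DataFrame({
    'date': [20150506, 20150708, 20150910, 20141212, 20141111, 20141010],
    'name': list('ABCABC'),
})

# Derive the year from the first four digits of the integer date,
# then merge on the (year, name) pair.
tourneys['date4'] = tourneys['date'] // 10000
merged = ranks.merge(tourneys, left_on=['year', 'name'],
                     right_on=['date4', 'name'])
print(merged)
```

Since 'name' appears in both key lists under the same name, pandas keeps a single name column in the result.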

Related

Doing joins between 2 csv files [duplicate]

df2 only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them while replacing df1's 2019 values with those from df2 for the same year.
The expected result looks like this:
type year value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False), below, clearly has multiple 2019 values for a single type. How should I improve the code? Thank you.
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates after the concat to keep the last row per type and year.
This solution works if the type and year pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last'))
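Putting the pieces together, a runnable sketch of this answer, with the frames rebuilt from the question's tables (the inputs' year column name is used throughout):

```python
import pandas as pd

df1 = pd.DataFrame({'type': list('aaabbccc'),
                    'year': [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
                    'value': [12, 2, 3, 50, 10, 1, 5, 8]})
df2 = pd.DataFrame({'type': list('abcd'),
                    'year': [2019] * 4,
                    'value': [13, 5, 5, 20]})

# df2 comes last in the concat, so keep='last' retains its rows
# wherever a (type, year) pair appears in both frames.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last'))
print(df)
```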

merging two csv using python [duplicate]


create new column with values from another column based on condition

I have a dataframe
A B Value FY
1 5 a 2020
2 6 b 2020
3 7 c 2021
4 8 d 2021
I want to create a column 'prev_FY' that, for each row, holds the 'Value' from the corresponding row of the previous FY;
my desired output is:
A B Value FY prev_FY
1 5 a 2020
2 6 b 2020
3 7 c 2021 a
4 8 d 2021 b
I tried using pivot_table, but it does not work, as the values remain the same as those corresponding to the FY. A shift is not feasible as I have millions of rows.
Use:
df['g'] = df.groupby('FY').cumcount()
df2 = df[['FY','Value','g']].assign(FY = df['FY'].add(1))
df = df.merge(df2, on=['FY','g'], how='left', suffixes=('','_prev')).drop('g', axis=1)
print (df)
A B Value FY Value_prev
0 1 5 a 2020 NaN
1 2 6 b 2020 NaN
2 3 7 c 2021 a
3 4 8 d 2021 b
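As a runnable sketch of this answer, with the frame rebuilt from the question's table. Note that the match is positional within each FY (via cumcount), which assumes rows appear in the same order every year:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'Value': ['a', 'b', 'c', 'd'],
                   'FY': [2020, 2020, 2021, 2021]})

# Number the rows within each FY so they can be matched positionally.
df['g'] = df.groupby('FY').cumcount()
# Shift every row one year forward, then merge it back onto the frame.
prev = df[['FY', 'Value', 'g']].assign(FY=df['FY'].add(1))
df = (df.merge(prev, on=['FY', 'g'], how='left', suffixes=('', '_prev'))
        .drop('g', axis=1))
print(df)
```

Rows in the earliest FY have no predecessor, so their Value_prev is NaN.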

how to compare or merge two data frames using python pandas?

How do I compare/merge two data frames based on the start and data columns and get the missing gaps with the count?
DataFrame 1
id start
1 2009
1 2010
1 2011
1 2012
2 2010
2 2011
2 2012
2 2013
2 2014
DataFrame 2
id data
1 2010
1 2012
2 2010
2 2011
2 2012
Expected Output:
id first last size
1 2009 2009 1
1 2011 2011 1
2 2013 2014 2
How may I achieve this?
Use merge with indicator=True and outer join first:
df11 = df1.rename(columns={'start':'data'})
df = df2.merge(df11, how='outer', indicator=True, on=['id','data']).sort_values(['id','data'])
print (df)
id data _merge
5 1 2009 right_only
0 1 2010 both
6 1 2011 right_only
1 1 2012 both
2 2 2010 both
3 2 2011 both
4 2 2012 both
7 2 2013 right_only
8 2 2014 right_only
Then apply the earlier solution, changing only the condition:
# boolean mask marking rows that are not right_only, stored for reuse
m = (df['_merge'] != 'right_only').rename('g')
# cumulative sum creates a unique group id for each run of consecutive right_only rows
df.index = m.cumsum()
print (df)
id data _merge
g
0 1 2009 right_only
1 1 2010 both
1 1 2011 right_only
2 1 2012 both
3 2 2010 both
4 2 2011 both
5 2 2012 both
5 2 2013 right_only
5 2 2014 right_only
#filter only the right_only rows and aggregate first, last and size
df2 = (df[~m.values].groupby(['id', 'g'])['data']
.agg(['first','last','size'])
.reset_index(level=1, drop=True)
.reset_index())
print (df2)
id first last size
0 1 2009 2009 1
1 1 2011 2011 1
2 2 2013 2014 2
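The full pipeline can be run as a self-contained script, with the frames rebuilt from the question's tables:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    'start': [2009, 2010, 2011, 2012,
                              2010, 2011, 2012, 2013, 2014]})
df2 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'data': [2010, 2012, 2010, 2011, 2012]})

# Outer merge with indicator=True flags rows present only in df1
# (i.e. years missing from df2) as 'right_only'.
df11 = df1.rename(columns={'start': 'data'})
df = (df2.merge(df11, how='outer', indicator=True, on=['id', 'data'])
         .sort_values(['id', 'data']))

# Consecutive right_only rows share a group id built by cumsum.
m = (df['_merge'] != 'right_only').rename('g')
df.index = m.cumsum()

# Aggregate each run of missing years into first, last and size.
out = (df[~m.values].groupby(['id', 'g'])['data']
         .agg(['first', 'last', 'size'])
         .reset_index(level=1, drop=True)
         .reset_index())
print(out)
```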
I answered a similar question for you yesterday. I don't know where you are getting the first and last columns, but here is a way to find the missing years based on the example above:
from functools import reduce

df1_year = pd.DataFrame(df1.groupby('id')['start'].apply(list))
df2_year = pd.DataFrame(df2.groupby('id')['data'].apply(list))
dfs = [df1_year, df2_year]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.reset_index(inplace=True)

def noMatch(a, b):
    return [x for x in a if x not in b]

df3 = []
for i in range(0, len(df_final)):
    df3.append(noMatch(df_final['start'][i], df_final['data'][i]))
missing_year = pd.DataFrame(df3)
missing_year['missingYear'] = missing_year.values.tolist()
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id', 'missingYear']]
df4 = []
for i in range(0, len(df_concat)):
    df4.append(df_concat.applymap(lambda x: x[i] if isinstance(x, list) else x))
df_final1 = reduce(lambda left, right: pd.merge(left, right, on='id'), df4)
pd.concat([df_final1[['id', 'missingYear_x']],
           df_final1[['id', 'missingYear_y']].rename(columns={'missingYear_y': 'missingYear_x'})]
          ).rename(columns={'missingYear_x': 'missingYear'}).sort_index()
id missingYear
0 1 2009
0 1 2011
1 2 2013
1 2 2014
To add it to df2, per your comment, just append the data.

Pandas pivot table

When trying pd.pivot_table on a given dataset, I noticed that it only creates the levels that actually occur within each parent group, not all possible levels. For example, on a dataset like this:
YEAR CLASS
0 2013 A
1 2013 A
2 2013 B
3 2013 B
4 2013 B
5 2013 C
6 2013 C
7 2013 D
8 2014 A
9 2014 A
10 2014 A
11 2014 B
12 2014 B
13 2014 B
14 2014 C
15 2014 C
there is no level D for year 2014, so the pivot table will look like this:
pd.pivot_table(d,index=["YEAR","CLASS"],values=["YEAR"],aggfunc=[len],fill_value=0)
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
What I want is to get a separate group for D in 2014 with length 0 in my pivot table. How can I include all possible levels in the child variable for the parent variable?
I think you can use crosstab and stack:
print(pd.pivot_table(df,
                     index=["YEAR", "CLASS"],
                     values=["YEAR"],
                     aggfunc=[len],
                     fill_value=0))
len
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
print(pd.crosstab(df['YEAR'], df['CLASS']))
CLASS A B C D
YEAR
2013 2 3 2 1
2014 3 3 2 0
df = pd.crosstab(df['YEAR'],df['CLASS']).stack()
df.name = 'len'
print(df)
YEAR CLASS
2013 A 2
B 3
C 2
D 1
2014 A 3
B 3
C 2
D 0
Name: len, dtype: int64
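A self-contained sketch of the crosstab-and-stack approach, with the example data rebuilt from the question (a fresh variable name, counts, is used to avoid overwriting the input frame):

```python
import pandas as pd

# Rebuild the question's data: class D occurs in 2013 but not 2014.
d = pd.DataFrame({'YEAR': [2013] * 8 + [2014] * 8,
                  'CLASS': list('AABBBCCD') + list('AAABBBCC')})

# crosstab produces the full YEAR x CLASS grid, filling absent
# combinations with 0; stack turns it back into a MultiIndex Series.
counts = pd.crosstab(d['YEAR'], d['CLASS']).stack()
counts.name = 'len'
print(counts)
```

The (2014, D) entry now exists with a count of 0, which the pivot table omitted.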
