I'd like to compare two data frames. xyz has all of the same columns as abc, but it has one additional column.
In the comparison, I'd like to match up the two like columns (Sport) but only show the SportLeague in the output (if a difference exists, that is). For example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz.
Here's the code that builds the two data frames:
import pandas as pd
import numpy as np
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz
Code already tried:
abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)
This raises an error, since the data frames don't have the exact same columns.
For example, if xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague'] as the difference between the data frames.
Further clarification of the logic:
Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not Found in xyz data frame". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID']".
Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']".
What I've explained above is similar to the .compare method. However, the data frames in this example may not be the same length and may have different numbers of columns.
If I understand you correctly, we basically want to merge both DataFrames, apply a number of comparisons between them, and add a column that explains the course of action to take for each comparison result.
Note: in the example here I have added one sport ('Cricket') to your df abc, to trigger the condition where abc['Sport'] does not exist in xyz['Sport'].
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)
Sport Year ID
0 Football 2021 1
1 Basketball 2021 2
2 Baseball 2022 3
3 Hockey 2022 4
4 Cricket 2022 5
I've left xyz unaltered. Now, let's merge these two dfs:
df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
Now we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default). Note that np.select evaluates the conditions in order and takes the first match, so the combined Year-and-ID condition must come before the single-field ones. Like this:
conditions = [
    df.Year_abc.isnull(),                                     # Sport in xyz, but not in abc
    df.Year_xyz.isnull(),                                     # Sport in abc, but not in xyz
    (df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),  # both Year and ID differ
    df.Year_xyz != df.Year_abc,                               # only Year differs
    df.ID_xyz != df.ID_abc,                                   # only ID differs
]
choices = ['Sport not in abc',
           'Sport not in xyz',
           'Change year and ID to xyz',
           'Change year to xyz',
           'Change ID to xyz']
df['action'] = np.select(conditions, choices, default=np.nan)
The result is below, with a new column action noting which course of action to take.
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc \
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
action
0 Change year and ID to xyz # match, but mismatch year and ID
1 Change year and ID to xyz # match, but mismatch year and ID
2 Sport not in abc # no match: Sport in xyz, but not in abc
3 Change ID to xyz # match, but mismatch ID
4 Change year and ID to xyz # match, but mismatch year and ID
5 nan # complete match: no action needed
6 Sport not in xyz # no match: Sport in abc, but not in xyz
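One detail worth flagging: because the choices are strings, np.select casts the np.nan default to the string 'nan', which is what row 5 shows. If you'd rather see an explicit label for a complete match, a plain string default avoids that (a small variant of the line above, with an assumed label):
# A string default keeps np.select from coercing np.nan to the string 'nan'
df['action'] = np.select(conditions, choices, default='No action needed')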
Let me know if this is a correct interpretation of what you are looking to achieve.
I have a dataframe (results) of EPL results from the past 28 years, and I am trying to calculate the average home team points (HPts) from their previous 5 home games within the current season. The rows are already in chronological order. What I am effectively looking for is a version of the starter code below that partitions by HomeTeam and Season and calculates the mean of HPts using a window of the previous 5 rows with matching HomeTeam and Season. Clearly the existing code as written does not do what I need (it looks only at the last 5 rows regardless of team and season), but it is there to show what I mean as a starting point.
HomeTeam AwayTeam Season Result HPts APts
0 Arsenal Coventry 1993 A 0 3
1 Aston Villa QPR 1993 H 3 0
2 Chelsea Blackburn 1993 A 0 3
3 Liverpool Sheffield Weds 1993 H 3 0
4 Man City Leeds 1993 D 1 1
.. ... ... ... ... ... ...
375 Liverpool Crystal Palace 2020 H 3 0
376 Man City Everton 2020 H 3 0
377 Sheffield United Burnley 2020 H 3 0
378 West Ham Southampton 2020 H 3 0
379 Wolves Man United 2020 A 0 3
[10804 rows x 6 columns]
# Starting point for my code for home team avg points from last 5 home games
results['HomeLast5'] = results['HPts'].rolling(5).mean()
Anyone know how I can add a new column with the rolling average points for a given team and season? I could probably figure out a way of doing this with a loop, but I'm sure that's not going to be the most efficient way to solve this problem.
Group the dataframe by HomeTeam and Season, then calculate the rolling mean on HPts. Then, to assign the calculated mean back to the original dataframe, drop levels 0 and 1 from the index so that index alignment works properly.
g = results.groupby(['HomeTeam', 'Season'])['HPts']
results['HomeLast5'] = g.rolling(5).mean().droplevel([0, 1])
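Note that rolling(5) includes the current match in each window. If "previous 5 home games" is meant to exclude the current game, one option (a sketch, assuming that reading of the question) is to shift within each group before rolling:
# Shift by one within each (HomeTeam, Season) group so the window
# covers the five games before the current one, excluding it.
results['HomeLast5'] = (results.groupby(['HomeTeam', 'Season'])['HPts']
                               .transform(lambda s: s.shift().rolling(5).mean()))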
I have a dataframe df1 like:
cycleName quarter product qty price sell/buy
0 2020 q3 wood 10 100 sell
1 2020 q3 leather 5 200 buy
2 2020 q3 wood 2 200 buy
3 2020 q4 wood 12 40 sell
4 2020 q4 leather 12 40 sell
5 2021 q1 wood 12 80 sell
6 2021 q2 leather 12 90 sell
And another dataframe df2 as below. It has unique products of df1:
product currentValue
0 wood 20
1 leather 50
I want to create a new column in df2, called income, based on calculations on df1 data. For example, if the product is wood, income2020 will be created by looking at the rows where cycleName is 2020: if sell/buy is sell, add qty * price, else subtract qty * price.
product currentValue income2020
0 wood 20 10 * 100 - 2 * 200 + 12 * 40 (=1080)
1 leather 50 -5 * 200 + 12 * 40 (= -520)
I have a problem statement in Python, which I am trying to solve using pandas dataframes, which I am very new to. I am not able to understand how to create that column in df2 based on different conditions on df1.
You can map sell to 1 and buy to -1 using pd.Series.map, then multiply the columns qty, price and sell/buy using df.prod. To keep only the 2020 cycleName values use df.query, then group by product and take the sum with GroupBy.sum:
df_2020 = df1.query('cycleName == 2020').copy()  # equivalently: df1[df1['cycleName'] == 2020].copy()
df_2020['sell/buy'] = df_2020['sell/buy'].map({'sell':1, 'buy':-1})
df_2020[['qty', 'price', 'sell/buy']].prod(axis=1).groupby(df_2020['product']).sum()
product
leather -520
wood 1080
dtype: int64
Note:
Use .copy, else you would get a SettingWithCopyWarning.
To maintain the original product order, use sort=False in df.groupby:
(df_2020[['qty', 'price', 'sell/buy']]
    .prod(axis=1)
    .groupby(df_2020['product'], sort=False)
    .sum()
)
product
wood 1080
leather -520
dtype: int64
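To get this into df2 as the income2020 column the question asks for, one option (a sketch reusing the grouped sum above) is to map the per-product totals onto df2['product']:
# Build the per-product sums, then map them onto df2 to create income2020.
income = (df_2020[['qty', 'price', 'sell/buy']]
          .prod(axis=1)
          .groupby(df_2020['product']).sum())
df2['income2020'] = df2['product'].map(income)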
I want to find a matching row for another row in a Pandas dataframe. Given this example frame:
name location type year area delta
0 building NY a 2019 650.3 ?
1 building NY b 2019 400.0 ?
2 park LA a 2017 890.7 ?
3 lake SF b 2007 142.2 ?
4 park LA b 2017 333.3 ?
...
Each row has a matching row where all values are equal, except the "type" and the "area". For example rows 0 and 1 match, as do rows 2 and 4, ...
I want to somehow get the matching rows and write the difference between their areas in their "delta" column (e.g. |650.3 - 400.0| = 250.3 for row 0).
The "delta" column doesn't exist yet, but an empty column could be easily added with df["Delta"] = 0. I just don't know how to be able to fill the delta column for ALL rows.
I tried getting a matching row with df[name = 'building' & location = 'type' ... ~& type = 'a'], but I can't edit the result I get from that. Maybe I also don't quite understand when I get a copy and when a reference.
I hope my problem is clear. If not, I am happy to explain further.
Thanks a lot already for your help!
IIUC, you want groupby.transform:
df['delta'] = (df.groupby(df.columns.difference(['type', 'area']).tolist())['area']
                 .transform('diff').abs())
print(df)
name location type year area delta
0 building NY a 2019 650.3 NaN
1 building NY b 2019 400.0 250.3
2 park LA a 2017 890.7 NaN
3 lake SF b 2007 142.2 NaN
4 park LA b 2017 333.3 557.4
If you want to write the difference in both rows of the delta column:
df['delta'] = (df.groupby(df.columns.difference(['type', 'area']).tolist())['area']
                 .transform(lambda x: x.diff().bfill()).abs())
print(df)
name location type year area delta
0 building NY a 2019 650.3 250.3
1 building NY b 2019 400.0 250.3
2 park LA a 2017 890.7 557.4
3 lake SF b 2007 142.2 NaN
4 park LA b 2017 333.3 557.4
Detail:
df.columns.difference(['type', 'area']).tolist()
# or: [*df.columns.difference(['type', 'area'])]
# Output: ['location', 'name', 'year']
A solution with merge:
df['other_type'] = np.where(df['type'] == 'a', 'b', 'a')
(df.merge(df,
          left_on=['name', 'location', 'year', 'type'],
          right_on=['name', 'location', 'year', 'other_type'],
          suffixes=['', '_r'])
   .assign(delta=lambda x: (x['area'] - x['area_r']).abs())  # absolute difference, per the question
   .drop(['area_r', 'other_type_r'], axis=1)
)
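Note that this self-merge relies on every row having exactly one counterpart of the opposite type: merge defaults to an inner join, so any row without a partner silently drops out of the result.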
I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing, for each year, the country with the maximum count.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby along with size to find the count of each category, then sort the values and slice by the number of years (note this assumes each year's maximum is among the largest counts overall; a more robust variant follows the result). You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
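If one year's counts dominate, the slice above can return two rows for the same year. A variant that guarantees one row per year (a sketch, not part of the original answer) drops duplicate years after sorting by count:
# Sort by count, keep the first (largest) row per year, then restore year order.
new_df = (df.groupby(['Year', 'Country']).size().rename('Count').reset_index()
            .sort_values('Count', ascending=False)
            .drop_duplicates('Year')
            .sort_values('Year')
            .reset_index(drop=True))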
Use:
1.
First get the count of each Year and Country pair with groupby and size.
Then get the index of each year's max value with idxmax and select the rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year', 'Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Here's a method without groupby:
Count = (pd.Series(list(zip(df.Year, df.Country)))
           .value_counts().head(2)  # head(2): top two counts, one per year in this sample
           .reset_index(name='Count'))
Count[['Year', 'Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Count Year Country
0 3 2015 US
1 2 2016 India