Pandas DataFrames: Create new rows with calculations across existing rows - python

How can I create new rows from an existing DataFrame by grouping on certain fields ("Country" and "Industry" in the example) and applying some math to other fields ("Field" and "Value" in the example)?
Source DataFrame
df = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 'Retail',
                                'Energy', 'Energy', 'Retail', 'Retail'],
                   'Field': ['Import', 'Export', 'Import', 'Export',
                             'Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 80, 10, 20, 5, 30, 10]})
Country Industry Field Value
0 USA Finance Import 100
1 USA Finance Export 50
2 USA Retail Import 80
3 USA Retail Export 10
4 USA Energy Import 20
5 USA Energy Export 5
6 Canada Retail Import 30
7 Canada Retail Export 10
Target DataFrame
Net = Import - Export
Country Industry Field Value
0 USA Finance Net 50
1 USA Retail Net 70
2 USA Energy Net 15
3 Canada Retail Net 20

There are quite possibly many ways. Here's one using groupby and unstack:
(df.groupby(['Country', 'Industry', 'Field'], sort=False)['Value']
   .sum()
   .unstack('Field')
   .eval('Import - Export')
   .reset_index(name='Value'))
Country Industry Value
0 USA Finance 50
1 USA Retail 70
2 USA Energy 15
3 Canada Retail 20
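Note that the chain above drops the Field column from the target layout. To reproduce the target exactly, the same chain can be extended with assign (a small sketch of the same idea; the column reorder at the end is only cosmetic):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 'Retail',
                                'Energy', 'Energy', 'Retail', 'Retail'],
                   'Field': ['Import', 'Export', 'Import', 'Export',
                             'Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 80, 10, 20, 5, 30, 10]})

out = (df.groupby(['Country', 'Industry', 'Field'], sort=False)['Value']
         .sum()
         .unstack('Field')
         .eval('Import - Export')
         .reset_index(name='Value')
         .assign(Field='Net')                          # add the constant Field column
         [['Country', 'Industry', 'Field', 'Value']])  # match the target column order
print(out)
```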

If I understand correctly (IIUC), you can index on the grouping columns and subtract the aligned slices. Note the subtraction must be Import minus Export to match the target:
df = df.set_index(['Country', 'Industry'])
newdf = (df.loc[df.Field == 'Import', 'Value'] - df.loc[df.Field == 'Export', 'Value']).reset_index().assign(Field='Net')
newdf
Country Industry Value Field
0 USA Finance 50 Net
1 USA Retail 70 Net
2 USA Energy 15 Net
3 Canada Retail 20 Net
pivot_table (after diff(axis=1), the Import column holds Import − Export, and dropna(axis=1) removes the all-NaN Export column):
(df.pivot_table(index=['Country', 'Industry'], columns='Field',
                values='Value', aggfunc='sum')
   .diff(axis=1)
   .dropna(axis=1)
   .rename(columns={'Import': 'Value'})
   .reset_index())
Out[112]:
Field Country Industry Value
0 Canada Retail 20.0
1 USA Energy 15.0
2 USA Finance 50.0
3 USA Retail 70.0
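The diff/dropna trick depends on pivot_table producing the columns in alphabetical order (Export before Import). A sketch that subtracts the columns by name instead, which reads more plainly and does not depend on column order:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'Canada', 'Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 'Retail'],
                   'Field': ['Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 30, 10]})

pvt = df.pivot_table(index=['Country', 'Industry'], columns='Field',
                     values='Value', aggfunc='sum')
# subtract by column label rather than by position
out = (pvt['Import'] - pvt['Export']).reset_index(name='Value')
print(out)
```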

You can do it this way to add those rows to your original dataframe:
(df.set_index(['Country', 'Industry', 'Field'])
   .unstack()['Value']
   .eval('Net = Import - Export')
   .stack()
   .rename('Value')
   .reset_index())
Output:
Country Industry Field Value
0 Canada Retail Export 10
1 Canada Retail Import 30
2 Canada Retail Net 20
3 USA Energy Export 5
4 USA Energy Import 20
5 USA Energy Net 15
6 USA Finance Export 50
7 USA Finance Import 100
8 USA Finance Net 50
9 USA Retail Export 10
10 USA Retail Import 80
11 USA Retail Net 70

You can use GroupBy.diff(), then recreate the Field column, and finally use DataFrame.dropna. Note this relies on Import appearing directly before Export within each group; the .abs() flips the resulting Export − Import sign:
df['Value'] = df.groupby(['Country', 'Industry'])['Value'].diff().abs()
df['Field'] = 'Net'
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Country Industry Field Value
0 USA Finance Net 50.0
1 USA Retail Net 70.0
2 USA Energy Net 15.0
3 Canada Retail Net 20.0

This answer takes advantage of the fact that pandas puts the group keys in the multiindex of the resulting dataframe. (If there were only one group key, you could use loc.)
>>> s = df.groupby(['Country', 'Industry', 'Field'])['Value'].sum()
>>> s.xs('Import', axis=0, level='Field') - s.xs('Export', axis=0, level='Field')
Country Industry
Canada Retail 20
USA Energy 15
Finance 50
Retail 70
Name: Value, dtype: int64
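As noted, with a single group key xs is unnecessary; a minimal sketch of the plain .loc variant on simplified data:

```python
import pandas as pd

df = pd.DataFrame({'Field': ['Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 30, 10]})

s = df.groupby('Field')['Value'].sum()   # index is just 'Field', no MultiIndex
net = s.loc['Import'] - s.loc['Export']  # plain label lookup, no level= needed
print(net)
```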

Related

Missing value replacement using mode in pandas in a subgroup of a group

I have the data set below. I need to group subsets of the columns and fill the missing values using the mode. Specifically, I need to fill the missing values for Tom from the UK: grouping the Tom/UK rows, the most frequent value in that group should replace each NaN. The desired output replaces every NaN in that group with the group's mode.
Here is the dataset:
Name location Value
Tom USA 20
Tom UK NaN
Tom USA NaN
Tom UK 20
Jack India NaN
Nihal Africa 30
Tom UK NaN
Tom UK 20
Tom UK 30
Tom UK 20
Tom UK 30
Sam UK 30
Sam UK 30
try:
df = (df.set_index(['Name', 'location'])
        .fillna(df[df.Name.eq('Tom') & df.location.eq('UK')]
                  .groupby(['Name', 'location'])
                  .agg(pd.Series.mode)
                  .to_dict())
        .reset_index())
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
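The snippet above targets only the Tom/UK group. If every Name/location group should instead be filled with its own mode (an extension beyond the original question; `group_mode_fill` is a hypothetical helper name), a transform-based sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Tom', 'Tom', 'Tom', 'Jack'],
                   'location': ['UK', 'UK', 'UK', 'India'],
                   'Value': [20, np.nan, 20, np.nan]})

def group_mode_fill(s):
    m = s.mode()                                      # most frequent non-NaN value(s)
    return s.fillna(m.iloc[0]) if not m.empty else s  # leave all-NaN groups alone

df['Value'] = df.groupby(['Name', 'location'])['Value'].transform(group_mode_fill)
print(df)
```

Groups whose values are all NaN (Jack/India here) have no mode, so they are left untouched.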

Compare two values from different DataFrames and, based on that, add a value in pandas

I need to compare two different DataFrames and, based on the result, add a value to a column.
country = {'Year':[2020,2021],'Host':['Mexico','Panama'],'Winners':['Canada','Japan']}
country_df = pd.DataFrame(country,columns=['Year','Host','Winners'])
Year Host Winners
0 2020 Mexico Canada
1 2021 Panama Japan
all_country = {'Country': ['USA','Mexico','USA','Panama','Japan'],'Year':[2021,2020,2020,2021,2021]}
all_country_df = pd.DataFrame(all_country, columns=['Country', 'Year'])
Country Year
0 USA 2021
1 Mexico 2020
2 USA 2020
3 Panama 2021
4 Japan 2021
I want to compare all_country_df with country_df to find which country was the host in a given year, as well as the winner, so something like:
all_country= {'Country':['USA','Mexico','USA','Panama','Japan'],'Year':[2021,2020,2020,2021,2021],'Winner':[None,None,None,None,'Winner'],'Host':[None,'Host',None,'Host',None]}
all_Country_df=pd.DataFrame(all_country,columns=['Country','Year','Winner','Host'])
Like this
Country Year Winner Host
0 USA 2021 None None
1 Mexico 2020 None Host
2 USA 2020 None None
3 Panama 2021 None Host
4 Japan 2021 Winner None
Try with merge and np.where. Note the merged frame carries the Winners and Host columns from country_df, so those are the names to compare against:
import numpy as np

newdf = all_country_df.merge(country_df)
newdf['Winners'] = np.where(newdf['Country'].ne(newdf['Winners']), np.nan, 'Winner')
newdf['Host'] = np.where(newdf['Country'].ne(newdf['Host']), np.nan, 'Host')
print(newdf)
Output:
Country Year Host Winners
0 USA 2021 nan nan
1 Panama 2021 Host nan
2 Japan 2021 nan Winner
3 Mexico 2020 Host nan
4 USA 2020 nan nan

Merge dataframes on same row

I have Python code that gets links from a dataframe (df1), collects data from a website, and returns the output in a new dataframe.
df1:
id Name link Country Continent
1 Company1 www.link1.com France Europe
2 Company2 www.link2.com France Europe
3 Company3 www.Link3.com France Europe
The output from the code is df2:
link numberOfPPL City
www.link1.com 8 Paris
www.link1.com 9 Paris
www.link2.com 15 Paris
www.link2.com 1 Paris
I want to join these two dataframes into one (dfinal). My code:
dfinal = df1.append(df2, ignore_index=True)
I got dfinal:
link numberOfPPL City id Name Country Continent
www.link1.com 8 Paris
www.link1.com 9 Paris
www.link2.com 15 Paris
www.link2.com 1 Paris
www.link1.com 1 Company1 France Continent
..
..
I want my final dataframe to be like this:
link numberOfPPL City id Name Country Continent
www.link1.com 8 Paris 1 Company1 France Europe
www.link1.com 9 Paris 1 Company1 France Europe
www.link2.com 15 Paris 2 Company2 France Europe
www.link2.com 1 Paris 2 Company2 France Europe
Can anyone help, please?
You can merge the two dataframes on 'link':
outputDF = df2.merge(df1, how='left', on=['link'])
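For reference, a self-contained sketch of that merge on the sample data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'link': ['www.link1.com', 'www.link2.com', 'www.Link3.com'],
                    'Country': ['France', 'France', 'France'],
                    'Continent': ['Europe', 'Europe', 'Europe']})
df2 = pd.DataFrame({'link': ['www.link1.com', 'www.link1.com',
                             'www.link2.com', 'www.link2.com'],
                    'numberOfPPL': [8, 9, 15, 1],
                    'City': ['Paris', 'Paris', 'Paris', 'Paris']})

# a left merge keeps every df2 row and attaches the matching df1 columns
dfinal = df2.merge(df1, how='left', on='link')
print(dfinal)
```

Company3's row drops out because no df2 row references www.Link3.com; an outer merge would keep it.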

Fuzzy Matching Two Columns in the Same Dataframe Using Python

I have two datasets within the same data frame, each showing a list of companies. One dataset is from 2017 and the other is from this year. I am trying to match the two company datasets to each other and figured fuzzy matching (FuzzyWuzzy) was the best way to do this. Using a partial ratio, I want columns with the values listed as: last year's company name, the highest fuzzy matching ratio, and this year's company associated with that highest score.
The original data frame, assigned to the variable "data", has last year's company names under the column "Company" and this year's company names under the column "Company name". To accomplish this task, I tried to create a function with the extractOne fuzzy matching process, apply that function to each value/row in the dataframe, and then add the results to my original data frame.
Here is the code below:
names_array = []
ratio_array = []

def match_names(last_year, this_year):
    for row in last_year:
        x = process.extractOne(row, this_year)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

# last year's company names dataset
last_year = data['Company'].dropna().values
# this year's company names dataset
this_year = data['Company name'].values

name_match, ratio_match = match_names(last_year, this_year)
data['this_year'] = pd.Series(name_match)
data['match_rating'] = pd.Series(ratio_match)
data.to_csv("test.csv")
However, every time I execute this part of the code, the two added columns do not show up in the CSV. In fact, "test.csv" contains the same data frame as before, despite being newly created. If anyone could point out the problem or help me out in any way, it would be truly appreciated.
Edit (data frame preview):
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
Then, after the "Company" entries (last year's company data) end, the "Company name" column (this year's company data) begins:
4168 NaN LEWIS TENNIS
4169 NaN CHUCKS PRO SHOP AT
4170 NaN CHUCK KINYON
4171 NaN LAKE COUNTRY RACQUET CLUB
4172 NaN SPORTS ACADEMY & RAC CLUB
Your dataframe structure is odd considering that one column only begins once the other ends; however, we can make it work. Let's take the following sample dataframe for the data that you supplied:
Company Company name
0 BODYPHLO SPORTIQUE NaN
1 JOSEPH A PERRY NaN
2 PCH RESORT TENNIS SHOP NaN
3 GREYSTONE GOLF CLUB INC. NaN
4 MUSGROVE COUNTRY CLUB NaN
5 CITY OF PELHAM RACQUET CLUB NaN
6 NORTHRIVER YACHT CLUB NaN
7 LAKE FOREST NaN
8 TNL TENNIS PRO SHOP NaN
9 SOUTHERN ATHLETIC CLUB NaN
10 ORANGE BEACH TENNIS CENTER NaN
11 NaN LEWIS TENNIS
12 NaN CHUCKS PRO SHOP AT
13 NaN CHUCK KINYON
14 NaN LAKE COUNTRY RACQUET CLUB
15 NaN SPORTS ACADEMY & RAC CLUB
Then perform your matching:
import pandas as pd
from fuzzywuzzy import process, fuzz

known_list = data['Company name'].dropna()

def find_match(x):
    match = process.extractOne(x['Company'], known_list,
                               scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

data[['this year', 'match_rating']] = (data.dropna(subset=['Company'])
                                           .apply(find_match, axis=1, result_type='expand'))
Yields:
Company Company name this year \
0 BODYPHLO SPORTIQUE NaN SPORTS ACADEMY & RAC CLUB
1 JOSEPH A PERRY NaN CHUCKS PRO SHOP AT
2 PCH RESORT TENNIS SHOP NaN LEWIS TENNIS
3 GREYSTONE GOLF CLUB INC. NaN LAKE COUNTRY RACQUET CLUB
4 MUSGROVE COUNTRY CLUB NaN LAKE COUNTRY RACQUET CLUB
5 CITY OF PELHAM RACQUET CLUB NaN LAKE COUNTRY RACQUET CLUB
6 NORTHRIVER YACHT CLUB NaN LAKE COUNTRY RACQUET CLUB
7 LAKE FOREST NaN LAKE COUNTRY RACQUET CLUB
8 TNL TENNIS PRO SHOP NaN LEWIS TENNIS
9 SOUTHERN ATHLETIC CLUB NaN SPORTS ACADEMY & RAC CLUB
10 ORANGE BEACH TENNIS CENTER NaN LEWIS TENNIS
match_rating
0 47.0
1 43.0
2 67.0
3 43.0
4 67.0
5 72.0
6 48.0
7 64.0
8 67.0
9 50.0
10 67.0

How to groupby and collapse with pandas?

I have a dataframe of the following type:
Country Year Age Male Female
0 Canada 2005 50 400 25
1 Canada 2005 51 100 25
2 Canada 2006 50 100 70
3 Columbia 2005 50 75 75
I would like to, for example, get the total number of males+females of any age, grouped by country and year. I.e. I'm trying to understand what operation could allow me to see a table such as
Country Year Total over ages and sexes
0 Canada 2005 550
1 Canada 2006 170
2 Columbia 2005 150
In the above example, the value 550 comes from the total number of males and females in Canada for the year 2005, regardless of age: so 550 = 400+25+100+25.
I probably need to groupby Country and Year, but I'm not sure how to collapse the ages and total the number of males and females.
df["Total"] = df.Male + df.Female
df.groupby(["Country", "Year"]).Total.sum()
Output:
Country Year
Canada 2005 550
2006 170
Columbia 2005 150
Name: Total, dtype: int64
Update
cᴏʟᴅsᴘᴇᴇᴅ's chained version:
(df.assign(Total=df.Male + df.Female)
.groupby(['Country', 'Year'])
.Total
.sum()
.reset_index(name='Total over ages and sexes'))
