Column Differences for MultiIndex Dataframes - python
I've got (probably) quite a simple problem that I just cannot wrap my head around right now. I'm collecting the following two series:
import pandas as pd
from pandas_datareader import wb
countries = [
'DZA', 'ARM','AZE','BLR','BIH','BRN','KHM','CHN','HRV', 'CZE','EGY',\
'EST','GEO','HUN','IND','IDN','ISR','JPN','JOR','KAZ','KOR','KGZ','LAO','LVA',\
'LBN','LTU','MYS','MDA','MNG','MMR','MKD','PHL','POL','ROU', 'RUS','SAU',\
'SGP','SVK','SVN','TJK','THA','TUR','UKR','UZB','VNM'
]
dat = wb.download(indicator='FR.INR.LEND', country=countries, start=2010, end=2019)
dat.columns = ['lending_rate']
us = wb.download(indicator='FR.INR.LEND', country='US', start=2010, end=2019)
us.columns = ['lending_rate_us']
dat2=pd.concat([dat,us])
dat2
I'd like to take the difference between lending_rate and lending_rate_us, but I want to subtract the US values (lending_rate_us) from lending_rate for every other country, i.e. avoid the NaNs that a naive subtraction would otherwise produce everywhere.
So I guess what I'm trying to do is to copy the values for lending_rate_us to all other countries to then take the difference between both columns.
Does anyone have an idea how to do that (or an alternative idea that makes more sense)?
Thanks!
EDIT:
I tried the following, alas without success:
from pandas_datareader import wb
countries = [
'DZA', 'ARM','AZE','BLR','BIH','BRN','KHM','CHN','HRV', 'CZE','EGY',\
'EST','GEO','HUN','IND','IDN','ISR','JPN','JOR','KAZ','KOR','KGZ','LAO','LVA',\
'LBN','LTU','MYS','MDA','MNG','MMR','MKD','PHL','POL','ROU', 'RUS','SAU',\
'SGP','SVK','SVN','TJK','THA','TUR','UKR','UZB','VNM'
]
dat = wb.download(indicator='FR.INR.LEND', country=countries, start=2010, end=2019)
dat.columns = ['lending_rate']
us = wb.download(indicator='FR.INR.LEND', country='US', start=2010, end=2019)
us.columns = ['lending_rate']
for i in dat.index.get_level_values(0).unique():
    dat["lending_rate_spread"] = dat.loc[i, :] - us.loc["United States", :]
dat
Output:
lending_rate lending_rate_spread
country year
Armenia
2019 12.141989 NaN
2018 12.793042 NaN
2017 14.406002 NaN
2016 17.356706 NaN
2015 17.590330 NaN
... ... ... ...
Vietnam
2014 8.665000 NaN
2013 10.374167 NaN
2012 13.471667 NaN
2011 16.953833 NaN
2010 13.135250 NaN
450 rows × 2 columns
But when I just print the result of the loop without creating a new column I get the correct values:
for i in dat.index.get_level_values(0):
    print(dat.loc[i, :] - us.loc["United States", :])
Output:
lending_rate
year
2019 6.859489
2018 7.888875
2017 10.309335
2016 13.845039
2015 14.330330
2014 13.158665
2013 12.744987
2012 13.980068
2011 14.504474
2010 15.950428
lending_rate
year
2019 11.998715
2018 12.544167
2017 12.445833
2016 12.863560
2015 14.274167
I don't understand why I would get the correct result but not present it in the correct way?
In response to your comment, I reviewed the data again. I reworked the data for each country because NA values existed, and found that every country has data for all 10 years.
The method @Paul commented on is therefore possible, so I modified the code accordingly.
# Assumes reset_index() has been applied to both dat and us, and that every
# country has the same 10 years in the same order as the US rows.
dat['lending_rate_us'] = us['lending_rate_us'].tolist() * dat['country'].nunique()
dat['lending_rate_spread'] = dat['lending_rate'] - dat['lending_rate_us']
dat.head()
   country  year  lending_rate  lending_rate_us  lending_rate_spread
0  Armenia  2019     12.141989         5.282500             6.859489
1  Armenia  2018     12.793042         4.904167             7.888875
2  Armenia  2017     14.406002         4.096667            10.309335
3  Armenia  2016     17.356706         3.511667            13.845039
4  Armenia  2015     17.590330         3.260000            14.330330
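If you would rather keep the (country, year) MultiIndex, one option is to look up the US rate by year with join. The loop in the question most likely produced NaNs because the right-hand side was indexed by year only, so it could not align with the two-level index of dat. Below is a minimal sketch under assumptions: the frames are tiny made-up stand-ins for the downloaded data (the numbers are illustrative, not World Bank values), and us_by_year is a helper name introduced here.

```python
import pandas as pd

# Made-up stand-ins for the downloaded frames (illustrative values only).
idx = pd.MultiIndex.from_product(
    [["Armenia", "Vietnam"], [2019, 2018]], names=["country", "year"]
)
dat = pd.DataFrame({"lending_rate": [12.0, 13.0, 8.0, 9.0]}, index=idx)

us_idx = pd.MultiIndex.from_product(
    [["United States"], [2019, 2018]], names=["country", "year"]
)
us = pd.DataFrame({"lending_rate_us": [5.0, 4.5]}, index=us_idx)

# Drop the country level so the US series is indexed by year alone.
us_by_year = us["lending_rate_us"].droplevel("country")

# Copy the US rate onto every row with the same year, then subtract.
dat = dat.join(us_by_year, on="year")
dat["lending_rate_spread"] = dat["lending_rate"] - dat["lending_rate_us"]
print(dat)
```

Because the lookup is by year rather than by position, this should also cope with countries that have missing years or a different row order.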
Related
Pandas subtract value from one dataframe from columns in another dataframe depending on index
I have two dataframes like the ones below, let's call them df1 and df2 respectively.
                   val1
state   year
ALABAMA 2012  22.186789
        2016  27.725147
        2020  25.461653
ALASKA  2012  13.988918
        2016  14.730641
        2020  10.061191
ARIZONA 2012   9.064766
        2016   3.543962
        2020  -0.308710
          val2
year
2000 -0.491702
2004  2.434132
2008 -7.399984
2012 -3.935184
2016 -2.181941
2020 -4.448889
For each row in df1, I want to subtract val2 in df2 from every corresponding year in df1. I.e., I want to find the difference between val1 and val2 for each year in every state. The dataframe I am trying to obtain is
party_simplified       val1  difference
state   year
ALABAMA 2012      22.186789   26.121973
        2016      27.725147   29.907088
        2020      25.461653   29.910542
ALASKA  2012      13.988918   17.924102
        2016      14.730641   16.912582
        2020      10.061191   14.510080
ARIZONA 2012       9.064766   12.999950
        2016       3.543962    5.725903
        2020      -0.308710    4.140180
I have been able to accomplish this using a for loop like the one below, but am wondering if there is a more efficient way.
for i in range(2012, 2024, 4):
    df1.loc[(slice(None), i), 'difference'] = df1.loc[(slice(None), i), 'val1'] - df2.loc[i]['val2']
This works for me, though you might need to tweak it based on how your indices are set up:
merged = df1.reset_index().merge(df2, left_on='year', right_index=True)
df_new = (merged
          .assign(**{'difference': merged.val1 - merged.val2})
          .set_index(['state', 'year'])
          .filter(['val1', 'difference'])
          .sort_index())
df_new
                   val1  difference
state   year
ALABAMA 2012  22.186789   26.121973
        2016  27.725147   29.907088
        2020  25.461653   29.910542
ALASKA  2012  13.988918   17.924102
        2016  14.730641   16.912582
        2020  10.061191   14.510080
ARIZONA 2012   9.064766   12.999950
        2016   3.543962    5.725903
        2020  -0.308710    4.140179
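Another option, if you want to avoid the reset_index/merge round trip, is to let pandas broadcast the subtraction across one index level with Series.sub(..., level=...). The sketch below is only illustrative: it uses two of the states from the question and, as an assumption made here for brevity, restricts df2 to the three years that appear in df1.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [["ALABAMA", "ALASKA"], [2012, 2016, 2020]], names=["state", "year"]
)
df1 = pd.DataFrame(
    {"val1": [22.186789, 27.725147, 25.461653, 13.988918, 14.730641, 10.061191]},
    index=idx,
)
# Only the years that occur in df1 (an assumption made for this sketch).
df2 = pd.DataFrame(
    {"val2": [-3.935184, -2.181941, -4.448889]},
    index=pd.Index([2012, 2016, 2020], name="year"),
)

# Subtract val2 from val1, matching rows on the "year" level only.
df1["difference"] = df1["val1"].sub(df2["val2"], level="year")
print(df1)
```

The assignment aligns on the full (state, year) index, so the row order of df1 is left unchanged.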
Move data from row 1 to row 0
I have this function written in Python. I want it to show the difference between consecutive rows of the production column. Here's the code:
def print_df():
    mycursor.execute("SELECT * FROM productions")
    myresult = mycursor.fetchall()
    myresult.sort(key=lambda x: x[0])
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Dif'] = abs(df['Production (Ton)'].diff())
    print(abs(df))
And of course the output is this:
    Year  Production (Ton)      Dif
0   2010            339491      NaN
1   2011            366999  27508.0
2   2012            361986   5013.0
3   2013            329461  32525.0
4   2014            355464  26003.0
5   2015            344998  10466.0
6   2016            274317  70681.0
7   2017            200916  73401.0
8   2018            217246  16330.0
9   2019            119830  97416.0
10  2020             66640  53190.0
But I want the output like this:
    Year  Production (Ton)      Dif
0   2010            339491  27508.0
1   2011            366999   5013.0
2   2012            361986  32525.0
3   2013            329461  26003.0
4   2014            355464  10466.0
5   2015            344998  70681.0
6   2016            274317  73401.0
7   2017            200916  16330.0
8   2018            217246  97416.0
9   2019            119830  53190.0
10  2020             66640  66640.0
What should I change or add to my code?
You can use a negative period input to diff to get the differences the way you want, and then fillna to fill the last value with the value from the Production column:
df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()
Output:
    Year  Production (Ton)      Dif
0   2010            339491  27508.0
1   2011            366999   5013.0
2   2012            361986  32525.0
3   2013            329461  26003.0
4   2014            355464  10466.0
5   2015            344998  70681.0
6   2016            274317  73401.0
7   2017            200916  16330.0
8   2018            217246  97416.0
9   2019            119830  53190.0
10  2020             66640  66640.0
Use shift(-1) to shift all rows one position up.
df['Dif'] = (df['Production (Ton)'] - df['Production (Ton)'].shift(-1).fillna(0)).abs()
Notice that by setting fillna(0), you avoid the NaNs. You can also use diff:
df['Dif'] = df['Production (Ton)'].diff().shift(-1).fillna(0).abs()
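To see the negative-period diff in isolation, here is a small self-contained sketch that uses only the last three rows of the question's data (2018 to 2020). The variable name prod is introduced here for readability.

```python
import pandas as pd

# Last three rows of the production data from the question.
df = pd.DataFrame(
    {"Year": [2018, 2019, 2020], "Production (Ton)": [217246, 119830, 66640]}
)
prod = df["Production (Ton)"]

# diff(-1) computes row[i] - row[i+1]; the last row has no successor and
# becomes NaN, which is then filled with that row's own production value.
df["Dif"] = prod.diff(-1).fillna(prod).abs()
print(df)
```

The resulting Dif column matches rows 8 to 10 of the desired output: 97416.0, 53190.0, 66640.0.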
argument of type "float" is not iterable when trying to use for loop
I have a countrydf as below, in which each cell in the Country column contains a list of the countries where the movie was released.
countrydf
id  Country         release_year
s1  [US]            2020
s2  [South Africa]  2021
s3  NaN             2021
s4  NaN             2021
s5  [India]         2021
I want to make a new df which looks like this:
country_yeardf
Year  US   UK   Japan  India
1925  NaN  NaN  NaN    NaN
1926  NaN  NaN  NaN    NaN
1927  NaN  NaN  NaN    NaN
1928  NaN  NaN  NaN    NaN
It has the release year and the number of movies released in each country. My solution is this: starting from a blank df like the second one, run a for loop to count the number of movies released and then update the value in the corresponding cell.
countrylist=['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0,8807)):
        if x in countrydf.country[j]:
            t=int (countrydf.release_year[j] )
            country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
An error occurred, which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
      1 for x in countrylist:
      2     for j in li:
----> 3         if x in countrydf.country[j]:
      4             t=int(countrydf.release_year[j])
      5             country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which one is of float type here. I have checked the type of countrydf.country[j] and it returned int. I am using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for the df that I want to create?
P.S.: my English is not so good, so I hope you understand.
Here is a solution using groupby:
df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
  country  year
0      US  2015
1   India  2015
2      US  2015
3  Russia  2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country  India  Russia   US
year
2015       1.0     NaN  2.0
2016       NaN     1.0  NaN
Some alternative ways to achieve this in pandas without loops.
If the Country column has more than one value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
              India  South Africa  US
release_year
2020              0             0   1
2021              1             1   0
Else, if Country has just one value per row in the list, as you have shown in the example, you can use crosstab:
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country       India  South Africa  US
release_year
2020              0             0   1
2021              1             1   0
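As for the error itself: a missing value in the Country column is stored as NaN, which is a float, so `x in NaN` fails with "argument of type 'float' is not iterable". A rough sketch of one way around it, using the five sample rows from the question, is to explode the lists into one row per country, drop the missing ones, and then cross-tabulate. The names exploded and counts are introduced here.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question; missing countries are NaN (a float).
countrydf = pd.DataFrame(
    {
        "id": ["s1", "s2", "s3", "s4", "s5"],
        "Country": [["US"], ["South Africa"], np.nan, np.nan, ["India"]],
        "release_year": [2020, 2021, 2021, 2021, 2021],
    }
)

# One row per (movie, country); rows without a country are dropped.
exploded = countrydf.explode("Country").dropna(subset=["Country"])

# Count movies per release year and country.
counts = pd.crosstab(exploded["release_year"], exploded["Country"])
print(counts)
```

This also handles rows whose list holds several countries, since explode gives each of them its own row.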
How to get years without starting with df=df.set_index
I have this dataframe. I can obtain the Interest values that are more than 15% above the mean with:
df[df['Interest']>(df["Interest"].mean()*1.15)].Interest.to_string()
This gives me all the values that are more than 15% above the mean in that category. The question is: how do I get the years in which these values occurred without starting with
df=df.set_index('Year')
as the expression above otherwise requires getting my year values with df.iloc?
How do I get the year where these values occurred without starting with df.set_index('Year')
Use .loc:
>>> df
    Year  Dividends  Interest  Other Types     Rent  Royalties  Trade Income
0   2007    7632014   4643033       206207   626668      89715      18654926
1   2008    6718487   4220161       379049   735494      58535      29677697
2   2009    1226858   5682198       482776  1015181     138083      22712088
3   2010     978925   2229315       565625  1260765     146791      15219378
4   2011    1500621   2452712       675770  1325025     244073      19697549
5   2012     308064   2346778       591180  1483543     378998      33030888
6   2013     275019   4274425       707344  1664747     296136      17503798
7   2014     226634   3124281       891466  1807172     443671      16023363
8   2015    2171559   3474825      1144862  1858838     585733      16778858
9   2016     767713   4646350      2616322  1942102     458543      13970498
10  2017     759016   4918320      1659303  2001220     796343       9730659
11  2018     687308   6057191      1524474  2127583    1224471      19570540
>>> df.loc[df['Interest']>(df["Interest"].mean()*1.15), ['Year', 'Interest']]
    Year  Interest
0   2007   4643033
2   2009   5682198
9   2016   4646350
10  2017   4918320
11  2018   6057191
This will return a DataFrame with Year and the Interest values that match your condition:
df[df['Interest']>(df["Interest"].mean()*1.15)][['Year', 'Interest']]
This will return the Year:
df.loc[df["Interest"]>df["Interest"].mean()*1.15]["Year"]
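For anyone who wants to try this without the full table, here is a minimal sketch that rebuilds only the Year and Interest columns from the data shown above. Doing the row filter and the column selection in a single .loc call is a stylistic choice here, not a requirement; threshold and result are names introduced for the example.

```python
import pandas as pd

# Year and Interest columns copied from the table above.
df = pd.DataFrame(
    {
        "Year": list(range(2007, 2019)),
        "Interest": [
            4643033, 4220161, 5682198, 2229315, 2452712, 2346778,
            4274425, 3124281, 3474825, 4646350, 4918320, 6057191,
        ],
    }
)

# Rows whose Interest is more than 15% above the column mean.
threshold = df["Interest"].mean() * 1.15
result = df.loc[df["Interest"] > threshold, ["Year", "Interest"]]
print(result)
```

Year stays an ordinary column throughout, so no set_index call is needed.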
I have multiIndexes for my dataframe, how do I calculate the sum for one level?
Hi everyone, I want to calculate the sum of Violent_type count according to year. For example, calculating total count of violent_type for year 2013, which is 18728+121662+1035. But I don't know how to select the data when there are multiIndexes. Any advice will be appreciated. Thanks.
The level argument in pandas.DataFrame.groupby() is what you are looking for.
level: int, level name, or sequence of such, default None
    If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
To answer your question, you only need:
df.groupby(level=[0, 1]).sum()
# or
df.groupby(level=['district', 'year']).sum()
To see the effect:
import pandas as pd
iterables = [['001', 'SST'], [2013, 2014], ['Dangerous', 'Non-Violent', 'Violent']]
index = pd.MultiIndex.from_product(iterables, names=['district', 'year', 'Violent_type'])
df = pd.DataFrame(list(range(0, len(index))), index=index, columns=['count'])
'''
print(df)
                            count
district year Violent_type
001      2013 Dangerous         0
              Non-Violent       1
              Violent           2
         2014 Dangerous         3
              Non-Violent       4
              Violent           5
SST      2013 Dangerous         6
              Non-Violent       7
              Violent           8
         2014 Dangerous         9
              Non-Violent      10
              Violent          11
'''
print(df.groupby(level=[0, 1]).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
print(df.groupby(level=['district', 'year']).sum())
'''
               count
district year
001      2013      3
         2014     12
SST      2013     21
         2014     30
'''
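If you need the total per year across all districts, as in the 2013 example from the question, grouping on the year level alone should be enough. A short sketch reusing the same toy index as above (per_year is a name introduced here):

```python
import pandas as pd

iterables = [["001", "SST"], [2013, 2014], ["Dangerous", "Non-Violent", "Violent"]]
index = pd.MultiIndex.from_product(
    iterables, names=["district", "year", "Violent_type"]
)
df = pd.DataFrame({"count": list(range(len(index)))}, index=index)

# Sum over every district and Violent_type within each year.
per_year = df.groupby(level="year").sum()
print(per_year)
```

With the toy counts 0 to 11 this gives 24 for 2013 and 42 for 2014.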