Find key from value for Pandas Series - python

I have a dictionary whose values are lists of strings taken from a pandas Series. I want to make a new Series that looks up each value in the original Series and returns the associated key. Example:
import pandas as pd

df = pd.DataFrame({'season': ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013',
                              'Swe 2014', 'Swe 2014', 'Swe 2013',
                              'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})

nmdict = {'Norway': [s for s in list(set(df.season)) if 'No' in s],
          'Sweden': [s for s in list(set(df.season)) if 'S' in s]}
Desired result with df['country'] as the new column name:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
Due to the nature of my data I must build nmdict manually, as shown. I've tried reversing nmdict so I could map season to country, but couldn't, because the value lists are not the same length.
More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a VLOOKUP-style solution, but according to this answer, I shouldn't be using a dictionary in this way.
Any answers appreciated.

I've done it in a verbose manner to allow you to follow through.
First, let's define a function that determines the value for 'country':

In [4]: def get_country(s):
   ...:     if 'Nor' in s:
   ...:         return 'Norway'
   ...:     if 'S' in s:
   ...:         return 'Sweden'
   ...:     # return 'Default Country'  # if you get unmatched values

In [5]: get_country('Sven')
Out[5]: 'Sweden'

In [6]: get_country('Norv')
Out[6]: 'Norway'
We can use map to run get_country on every value. Pandas Series also have an apply() method which works similarly*.
In [7]: map(get_country, df['season'])
Out[7]:
['Norway',
'Norway',
'Norway',
'Norway',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Norway']
Now we assign that result to the column called 'country'
In [8]: df['country'] = map(get_country, df['season'])
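Note: the session above is Python 2, where map returns a list. On Python 3, map returns an iterator, so wrap it in list() before assigning:

df['country'] = list(map(get_country, df['season']))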
Let's view the final result:
In [9]: df
Out[9]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
*With apply() here's how it would look:
In [16]: df['country'] = df['season'].apply(get_country)
In [17]: df
Out[17]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
A more scalable country matcher
A sketch you can adapt:

# Modify this as needed
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}

def get_country(s):
    """
    Run the passed string s against the "matchers" for each country.
    Return the first matched country.
    """
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country
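Applied to the question's DataFrame it behaves like the verbose version above (in Python 3.7+ dicts preserve insertion order, so the matchers are tried in the order written):

df['country'] = df['season'].apply(get_country)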

IIUC, I would do the following (any season that matches neither pattern is left unchanged):
df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)
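If you prefer a fully vectorized form, numpy's select does the same cascade; a minimal sketch, not part of the original answer (the 'other' default is a placeholder assumption):

import numpy as np

conditions = [df['season'].str.contains('No'), df['season'].str.contains('S')]
df['country'] = np.select(conditions, ['Norway', 'Sweden'], default='other')  # first matching condition wins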

You could create the country dictionary using a dictionary comprehension:
country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}
to get:
{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}
This works fine for two countries; otherwise you can apply a self-defined function in a similar way:
def country_dict(country_id):
    if country_id.startswith('S'):
        return 'Sweden'
    elif country_id.startswith('N'):
        return 'Norway'
    elif country_id.startswith('XX'):
        return ...
    else:
        return 'default'
Either way, map the dictionary (or the function; Series.map accepts both a dict and a callable) to the country_id part of the season column, extracted using pandas string methods:
df['country'] = df.season.str.split().str.get(0).map(country_dict)
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
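For the record, the nmdict from the question can also be inverted directly, since each season string appears in exactly one country's list (a sketch using the question's data):

inv = {season: country for country, seasons in nmdict.items() for season in seasons}
df['country'] = df['season'].map(inv)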

Related

calculate bad month from the given csv

I tried finding the five worst months in the data, but I'm not sure about the process and I'm very confused. The answer should be something like (June 2001, July 2002), but when I tried to solve it my answer wasn't as expected: only the January data was sorted. This is the way I tried solving it; the CSV data is included below.
My solution is given below:
import pandas as pd

PATH = "tourist_arrival.csv"
df = pd.read_csv(PATH)
print(df.sort_values(by=['Jan.', 'Feb.', 'Mar.', 'Apr.', 'May.', 'Jun.', 'Jul.', 'Aug.', 'Sep.', 'Oct.', 'Nov.', 'Dec.'],
                     ascending=False))
Year ,Jan.,Feb.,Mar.,Apr.,May.,Jun.,Jul.,Aug.,Sep.,Oct.,Nov.,Dec.,Total
1992, 17451,27489,31505,30682,29089,22469,20942,27338,24839,42647,32341,27561,334353
1993 ,19238,23931,30818,20121,20585,19602,13588,21583,23939,42242,30378,27542,293567
1994, 21735,24872,31586,27292,26232,22907,19739,27610,27959,39393,28008,29198,326531
1995 ,22207,28240,34219,33994,27843,25650,23980,27686,30569,46845,35782,26380,363395
1996 ,27886,29676,39336,36331,29728,26749,22684,29080,32181,47314,37650,34998,393613
1997,25585,32861,43177,35229,33456,26367,26091,35549,31981,56272,40173,35116,421857
1998,28822,37956,41338,41087,35814,29181,27895,36174,39664,62487,47403,35863,463684
1999,29752,38134,46218,40774,42712,31049,27193,38449,44117,66543,48865,37698,491504
2000,25307,38959,44944,43635,28363,26933,24480,34670,43523,59195,52993,40644,463646
2001,30454,38680,46709,39083,28345,13030,18329,25322,31170,41245,30282,18588,361237
2002,17176,20668,28815,21253,19887,17218,16621,21093,23752,35272,28723,24990,275468
2003,21215,24349,27737,25851,22704,20351,22661,27568,28724,45459,38398,33115,338132
2004,30988,35631,44290,33514,26802,19793,24860,33162,25496,43373,36381,31007,385297
2005,25477,20338,29875,23414,25541,22608,23996,36910,36066,51498,41505,38170,375398
2006,28769,25728,36873,21983,22870,26210,25183,33150,33362,49670,44119,36009,383926
2007,33192,39934,54722,40942,35854,31316,35437,44683,45552,70644,52273,42156,526705
2008,36913,46675,58735,38475,30410,24349,25427,40011,41622,66421,52399,38840,500277
2009,29278,40617,49567,43337,30037,31749,30432,44174,42771,72522,54423,41049,509956
2010,33645,49264,63058,45509,32542,33263,38991,54672,54848,79130,67537,50408,602867
2011,42622,56339,67565,59751,46202,46115,42661,71398,63033,96996,83460,60073,736215
2012,52501,66459,89151,69796,50317,53630,49995,71964,66383,86379,83173,63344,803092
2013,47846,67264,88697,65152,52834,54599,54011,68478,66755,99426,75485,57069,797616
melt your DataFrame and then sort_values:
output = (df.melt("Year", df.drop(["Year", "Total"], axis=1).columns, var_name="Month")
            .sort_values("value")
            .reset_index(drop=True))
>>> output
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
.. ... ... ...
259 2012 Oct. 86379
260 2013 Mar. 88697
261 2012 Mar. 89151
262 2011 Oct. 96996
263 2013 Oct. 99426
[264 rows x 3 columns]
For just the 5 worst months, you can do:
>>> output.iloc[:5]
Year Month value
0 2001 Jun. 13030
1 1993 Jul. 13588
2 2002 Jul. 16621
3 2002 Jan. 17176
4 2002 Jun. 17218
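If you only need the 5 worst months, DataFrame.nsmallest also works and skips the full sort (a minor variation on the answer above):

df.melt("Year", df.drop(["Year", "Total"], axis=1).columns, var_name="Month").nsmallest(5, "value")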

Adding columns and index to sum up values using Pandas in Python

I have a .csv file; after reading it using pandas I have this output:
Year Month Brunei Darussalam ... Thailand Viet Nam Myanmar
348 2007 Jan 3813 ... 25863 12555 4887
349 2007 Feb 3471 ... 22575 11969 3749
350 2007 Mar 4547 ... 33087 14060 5480
351 2007 Apr 3265 ... 34500 15553 6838
352 2007 May 3641 ... 30555 14995 5295
.. ... ... ... ... ... ... ...
474 2017 Jul 5625 ... 48620 71153 12619
475 2017 Aug 4610 ... 40993 51866 10934
476 2017 Sep 5387 ... 39692 40270 9888
477 2017 Oct 4202 ... 61448 39013 11616
478 2017 Nov 5258 ... 39304 36964 11402
I use this to sum every country across all the years and display the top 3:
top3_country = new_df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3)
My output, however, is this:
Indonesia 27572424
Malaysia 11337420
Philippines 6548622
I want to add column names and an index to the summed values, as if it were a new DataFrame, like this:
Countries Visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622
Sorry, I am just starting to learn pandas; any help will be gladly appreciated.
Use Series.reset_index to get a two-column DataFrame and then set the new column names from a list:
top3_country = top3_country.reset_index()
top3_country.columns = ['Countries', 'Visitors']
Or use Series.rename_axis with Series.reset_index:
top3_country = top3_country.rename_axis('Countries').reset_index(name='Visitors')
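Either way yields the desired frame:

     Countries  Visitors
0    Indonesia  27572424
1     Malaysia  11337420
2  Philippines   6548622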
Alternatively, you can wrap the result back into a pd.DataFrame, use reset_index, and rename. Change your code to:

import pandas as pd

top3_country = (pd.DataFrame(df.iloc[0:, 2:9].sum(axis=0).sort_values(ascending=False).nlargest(3))
                .reset_index()
                .rename(columns={'index': 'Countries', 0: 'visitors'}))
top3_country
Countries visitors
0 Indonesia 27572424
1 Malaysia 11337420
2 Philippines 6548622

using numpy to calculate mean

I am trying to calculate the mean GNP for each country from 2006 to 2015, but when I apply the aggregation with the mean function it does not calculate the mean across 2006-2015; instead it just displays the values for each year. Please tell me what went wrong. I am able to sort by country, but the mean just won't work on the data.
wb_indicator = 'NY.GNP.ATLS.CD'
start_year = 2006
end_year = 2015

df_ex = wb.download(indicator=wb_indicator,
                    country=['all'],
                    start=start_year,
                    end=end_year)
df_ex1 = df_ex.reset_index()
df_ex1.groupby(['country']).agg({'NY.GNP.ATLS.CD': [np.mean]})
df_ex1.head(20)
Output:
                   country  year  NY.GNP.ATLS.CD
0               Arab World  2015    2.767920e+12
1               Arab World  2014    2.897113e+12
2               Arab World  2013    2.832769e+12
3               Arab World  2012    2.590610e+12
4               Arab World  2011    2.190786e+12
5               Arab World  2010    2.055967e+12
6               Arab World  2009    1.932056e+12
7               Arab World  2008    1.858270e+12
8               Arab World  2007    1.547924e+12
9               Arab World  2006    1.312967e+12
10  Caribbean small states  2015    6.680302e+10
11  Caribbean small states  2014    6.664219e+10
This should work:
import pandas as pd
import wbdata as wb
import datetime
wb_indicator = 'NY.GNP.ATLS.CD'
data_date = (datetime.datetime(2006, 1, 1), datetime.datetime(2015, 1, 1))
data = wb.get_data(wb_indicator, data_date=data_date, pandas=True)
gnp_means = data.reset_index().groupby('country').mean()
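Note that the problem in the original snippet is that the result of groupby(...).agg(...) is never assigned: it returns a new frame that is immediately discarded, and df_ex1.head(20) then just shows the raw per-year data. With the original pandas_datareader download, the minimal fix is simply:

gnp_means = df_ex1.groupby('country')['NY.GNP.ATLS.CD'].mean()  # keep the aggregated result
print(gnp_means.head(20))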

iterate through df column and return value in dataframe based on row index, column reference

My goal is to compare each value from the column "year" against the appropriate year column (i.e. 1999, 2000, ...) and return the corresponding value from that column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.
Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999-2010. My ultimate goal is to return the GDP from the year the educational data is from.
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gdp_ed_list = []
for value in df_combined_column_named['year']:  # loop through each year in the year column
    if value in df_combined_column_named.columns:  # compare year to column names
        idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()  # supposed to get the index associated with value
        gdp_ed = df_combined_column_named.get_value(idx, value)  # get the value of the cell found at idx, value
        gdp_ed_list.append(gdp_ed)  # append to a list
Currently, my code is getting stuck at the index.tolist() step. It is returning the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
2 for value in df_combined_column_named['year']: #loops through each year in year column
3 if value in df_combined_column_named.columns: #compares year to column names
----> 4 idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
5 gdp_ed = df_combined_column_named.get_value(idx, value)
6 gdp_ed_list.append(gdp_ed)
KeyError: u'2004'
Any thoughts?
It looks like you are trying to match the value in the year column to column labels and then extract the values in the corresponding cells. You could do that by looping through the rows (see below), but I don't think it would be the fastest way.
Instead, you could use pd.melt to coalesce the columns with year-like labels into a single column, say, year_col:
In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')
In [39]: melted
Out[39]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col value
0 Afghanistan 2004 11 5 8 1999 NaN
1 Albania 2004 11 11 11 1999 3.414761e+09
2 Algeria 2005 13 13 13 1999 4.864061e+10
3 Andorra 2008 11 12 11 1999 1.239840e+09
4 Anguilla 2008 11 11 11 1999 NaN
5 Afghanistan 2004 11 5 8 2000 NaN
...
The benefit of "melting" the DataFrame in this way is that
now you would have both year and year_col columns. The values you are looking for are in the rows where year equals year_col. And that is easy to obtain by using .loc:
In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col \
25 Afghanistan 2004 11 5 8 2004
26 Albania 2004 11 11 11 2004
32 Algeria 2005 13 13 13 2005
48 Andorra 2008 11 12 11 2008
49 Anguilla 2008 11 11 11 2008
value
25 5.285462e+09
26 7.314865e+09
32 1.030000e+11
48 4.001349e+09
49 NaN
Thus, you could use
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
'2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
'2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
'2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
'2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
'2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
'2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
'2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
'2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
'2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
'2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
'2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
'men_ed_yrs': [11, 11, 13, 11, 11],
'total_ed_yrs': [8, 11, 13, 11, 11],
'women_ed_yrs': [5, 11, 13, 12, 11],
'year': ['2004', '2004', '2005', '2008', '2008']})
melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs',
'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)
Why was a KeyError raised:
The KeyError is being raised by df_combined_column_named['year'][value]. Suppose value is '2004'. Then df_combined_column_named['year'] is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). df_combined_column_named['year'][value] fails because it attempts to index this Series with the string '2004' which is not in the integer index.
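A standalone two-liner reproduces the failure (hypothetical data, for illustration only):

s = pd.Series(['2004', '2004', '2005'])  # values are strings, index is the integers 0, 1, 2
s['2004']  # KeyError: '2004' is a value, not a label in the integer index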
Alternatively, here is another way to achieve the goal by looping through the rows using iterrows. This is perhaps simpler to understand, but in general using iterrows is slow compared to other column-based Pandas-centric methods:
data = []
for idx, row in df.iterrows():
    data.append((row['country'], row['year'], row[row['year']]))

result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)
prints
country year value
0 Afghanistan 2004 5.285462e+09
1 Albania 2004 7.314865e+09
2 Algeria 2005 1.030000e+11
3 Andorra 2008 4.001349e+09
4 Anguilla 2008 NaN
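For completeness, the same lookup can be vectorized with positional indexing; a sketch, not from the original answer:

import numpy as np

col_pos = df.columns.get_indexer(df['year'])            # column position of each row's year label
df['value'] = df.to_numpy()[np.arange(len(df)), col_pos]  # pick one cell per row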

Pandas Split Column String and Plot unique values

I have a dataframe Df that looks like this:
Country Year
0 Australia, USA 2015
1 USA, Hong Kong, UK 1982
2 USA 2012
3 USA 1994
4 USA, France 2013
5 Japan 1988
6 Japan 1997
7 USA 2013
8 Mexico 2000
9 USA, UK 2005
10 USA 2012
11 USA, UK 2014
12 USA 1980
13 USA 1992
14 USA 1997
15 USA 2003
16 USA 2004
17 USA 2007
18 USA, Germany 2009
19 Japan 2006
20 Japan 1995
I want to make a bar chart for the Country column. If I try this

Df.Country.value_counts().plot(kind='bar')

the resulting plot is incorrect because it doesn't separate the countries: combined entries like 'Australia, USA' are counted as single categories. My goal is to obtain a bar chart that plots the count of each country in the column, but to achieve that I first have to somehow split the string in each row (where needed) and then plot the data. I know I can use Df.Country.str.split(', ') to split the strings, but if I do this I can't plot the data.

Does anyone have an idea how to solve this problem?
You could use the vectorized Series.str.split method to split the countries:
In [163]: df['Country'].str.split(r',\s+', expand=True)
Out[163]:
0 1 2
0 Australia USA None
1 USA Hong Kong UK
2 USA None None
3 USA None None
4 USA France None
...
If you stack this DataFrame to move all the values into a single column, then you can apply value_counts and plot as before:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA', 'USA, France', 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK', 'USA', 'USA, UK', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA, Germany', 'Japan', 'Japan'],
'Year': [2015, 1982, 2012, 1994, 2013, 1988, 1997, 2013, 2000, 2005, 2012, 2014, 1980, 1992, 1997, 2003, 2004, 2007, 2009, 2006, 1995]})
counts = df['Country'].str.split(r',\s+', expand=True).stack().value_counts()
counts.plot(kind='bar')
plt.show()
Alternatively, count with collections.Counter (splitting on ', ' so country names don't keep a leading space):

from collections import Counter

c = pd.Series(Counter(df.Country.str.split(', ').sum()))
c.plot(kind='bar', title='Country Count')
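In pandas 0.25+, Series.explode gives the same counts without Counter; a minimal sketch:

df['Country'].str.split(', ').explode().value_counts().plot(kind='bar')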
new_df = pd.concat([pd.Series(row['Year'], row['Country'].split(', '))
                    for _, row in df.iterrows()]).reset_index()

(Here df is your original DataFrame.) This will give you one data point for each country name.
Hope this helps.
Cheers!
