Pandas Split Column String and Plot unique values - python

I have a dataframe Df that looks like this:
               Country  Year
 0      Australia, USA  2015
 1  USA, Hong Kong, UK  1982
 2                 USA  2012
 3                 USA  1994
 4         USA, France  2013
 5               Japan  1988
 6               Japan  1997
 7                 USA  2013
 8              Mexico  2000
 9             USA, UK  2005
10                 USA  2012
11             USA, UK  2014
12                 USA  1980
13                 USA  1992
14                 USA  1997
15                 USA  2003
16                 USA  2004
17                 USA  2007
18        USA, Germany  2009
19              Japan  2006
20              Japan  1995
I want to make a bar chart for the Country column. If I try this
Df.Country.value_counts().plot(kind='bar')
I get a plot that is incorrect because it doesn't separate the countries. My goal is to obtain a bar chart that plots the count of each country in the column, but to achieve that I first have to somehow split the string in each row (where needed) and then plot the data. I know I can use Df.Country.str.split(', ') to split the strings, but if I do this I can't plot the data.
Does anyone have an idea how to solve this problem?

You could use the vectorized Series.str.split method to split the Country strings:
In [163]: df['Country'].str.split(r',\s+', expand=True)
Out[163]:
           0          1     2
0  Australia        USA  None
1        USA  Hong Kong    UK
2        USA       None  None
3        USA       None  None
4        USA     France  None
...
If you stack this DataFrame to move all the values into a single column, then you can apply value_counts and plot as before:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(
    {'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA', 'USA, France',
                 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK', 'USA', 'USA, UK',
                 'USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA, Germany',
                 'Japan', 'Japan'],
     'Year': [2015, 1982, 2012, 1994, 2013, 1988, 1997, 2013, 2000, 2005, 2012,
              2014, 1980, 1992, 1997, 2003, 2004, 2007, 2009, 2006, 1995]})

counts = df['Country'].str.split(r',\s+', expand=True).stack().value_counts()
counts.plot(kind='bar')
plt.show()

from collections import Counter

c = pd.Series(Counter(df.Country.str.split(', ').sum()))
c.plot(kind='bar', title='Country Count')
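If your pandas is 0.25 or newer, Series.explode gives the same counts without Counter; a minimal sketch:
counts = df.Country.str.split(', ').explode().value_counts()
counts.plot(kind='bar', title='Country Count')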

new_df = pd.concat([pd.Series(row['Year'], row['Country'].split(', '))
                    for _, row in Df.iterrows()]).reset_index()
(Df is your original DataFrame.)
this will give you one data point for each country name.
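To go from there to the plot, a possible follow-up (my addition; reset_index leaves the columns named 'index' and 0, so I rename them first):
import matplotlib.pyplot as plt

new_df.columns = ['Country', 'Year']       # country names were the Series index
new_df['Country'].value_counts().plot(kind='bar')
plt.show()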
Hope this helps.
Cheers!

Related

How to pull value from a column when several columns match in two data frames?

I am trying to write a script which will search a database similar to that in Table 1, based on a product/region/year specification outlined in Table 2. The plan is to search Table 1 for a match to each specification in Table 2 and then pull the observation value, as seen in Table 2 - with results.
I need this code to run several loops in which the year criterion is relaxed. For example, loop 1 would search for a match on Product_L1, Geography_L1 and Year, loop 2 would search for a match on Product_L1, Geography_L1 and Year-1, and so on.
Table 1
Product level 1   Product level 2   Region level 1   Region level 2   Year   Obs. value
Portland cement   Cement            Peru             South America    2021   1
Portland cement   Cement            Switzerland      Europe           2021   2
Portland cement   Cement            USA              North America    2021   3
Portland cement   Cement            Brazil           South America    2021   4
Portland cement   Cement            South Africa     Africa           2021   5
Portland cement   Cement            India            Asia             2021   6
Portland cement   Cement            Brazil           South America    2020   7
Table 2
Product level 1   Product level 2   Region level 1   Region level 2   Year
Portland cement   Cement            Brazil           South America    2021
Portland cement   Cement            Switzerland      Europe           2021
Table 2 - with results
Product level 1   Product level 2   Region level 1   Region level 2   Year   Loop 1   Loop 2   x
Portland cement   Cement            Brazil           South America    2021   4        7
I have tried using the following code, but it comes up with the error 'Can only compare identically-labeled Series objects'. Does anyone have any suggestions on how to prevent this error?
Table_2['Loop_1'] = np.where((Table_1.Product_L1 == Table_2.Product_L1)
                             & (Table_1.Geography_L1 == Table_2.Geography_L1)
                             & (Table_1.Year == Table_2.Year),
                             Table_1(['obs_value'], ''))
You can perform a merge operation and provide a list of columns that you want from Table_1.
import pandas as pd

Table_1 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement", "Portland cement", "Portland cement",
                   "Portland cement", "Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement", "Cement", "Cement", "Cement", "Cement", "Cement"],
    "Geography_L1": ["Peru", "Switzerland", "USA", "Brazil", "South Africa", "India", "Brazil"],
    "Geography_L2": ["South America", "Europe", "North America", "South America", "Africa",
                     "Asia", "South America"],
    "Year": [2021, 2021, 2021, 2021, 2021, 2021, 2020],
    "obs_value": [1, 2, 3, 4, 5, 6, 7]
})

Table_2 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement"],
    "Geography_L1": ["Brazil", "Switzerland"],
    "Geography_L2": ["South America", "Europe"],
    "Year": [2021, 2021]
})

columns_list = ['Product_L1', 'Product_L2', 'Geography_L1', 'Geography_L2', 'Year', 'obs_value']
result = pd.merge(Table_2, Table_1[columns_list], how='left')
result is a new dataframe:
        Product_L1 Product_L2 Geography_L1   Geography_L2  Year  obs_value
0  Portland cement     Cement       Brazil  South America  2021          4
1  Portland cement     Cement  Switzerland         Europe  2021          2
EDIT: Based upon the update to the question, I think what you are trying to do is achievable using set_index and unstack. This will create a new dataframe with the observed values listed in columns 'Year_2020', 'Year_2021' etc.
index_columns = ['Product_L1','Product_L2','Geography_L1','Geography_L2', 'Year']
edit_df = Table_1.set_index(index_columns)['obs_value'].unstack().add_prefix('Year_').reset_index()
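Building on this, a sketch of the relaxed-year loops themselves (my addition; the column names Loop_1/Loop_2 follow the question, and it assumes at most one Table_1 match per key, as in the sample data):
key_cols = ['Product_L1', 'Product_L2', 'Geography_L1', 'Geography_L2']
for loop in (1, 2):
    lookup = Table_2[key_cols].copy()
    lookup['Year'] = Table_2['Year'] - (loop - 1)  # loop 1: Year, loop 2: Year-1, ...
    merged = pd.merge(lookup, Table_1[columns_list], how='left')
    Table_2['Loop_%d' % loop] = merged['obs_value'].values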

Extracting country information from description using geograpy

PROBLEM: I want to extract country information from a user description. So far, I'm giving the geograpy package a try. I like its behavior when the input is not very clear (for example Evesham or Rochdale); however, the package interprets some strings like Zaragoza, Spain as two mentions even though the user is clearly saying that the location is in Spain. I also don't know why amsterdam doesn't give Holland as output... How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?
DATA: My data example is:
user_location
2 Socialist Republic of Alachua
3 Hérault, France
4 Gwalior, India
5 Zaragoza,España
7 amsterdam
8 Evesham
9 Rochdale
I want to get something like this:
user_location country
2 Socialist Republic of Alachua ['USSR', 'United States']
3 Hérault, France ['France']
4 Gwalior, India ['India']
5 Zaragoza,España ['Spain']
7 amsterdam ['Holland']
8 Evesham ['United Kingdom']
9 Rochdale ['United Kingdom', 'United States']
REPREX:
import pandas as pd
import geograpy  # the geograpy3 package is imported under the name geograpy
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#> user_location country
#> 2 Socialist Republic of Alachua [USSR, Union of Soviet Socialist Republics, Al...
#> 3 Hérault, France [France, Hérault]
#> 4 Gwalior, India [British Indian Ocean Territory, Gwalior, India]
#> 5 Zaragoza,España [Zaragoza, España, Spain, El Salvador]
#> 7 amsterdam []
#> 8 Evesham [Evesham, United Kingdom]
#> 9 Rochdale [Rochdale, United Kingdom, United States]
Created on 2020-06-02 by the reprexpy package
geograpy3 was no longer behaving correctly regarding country lookup, since it didn't check whether None was returned by pycountry. As a committer, I just fixed this.
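For illustration, the kind of guard involved looks roughly like this (a sketch of the pattern, not geograpy3's actual code):
import pycountry

def safe_country_name(name):
    """Return the canonical country name, or None when pycountry finds no match."""
    try:
        match = pycountry.countries.lookup(name)
    except LookupError:
        return None
    return match.name if match is not None else None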
I have added your slightly modified example (to avoid the pandas import) as a unit test case:
def testStackoverflow62152428(self):
    '''
    see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
    '''
    examples = {2: 'Socialist Republic of Alachua', 3: 'Hérault, France',
                4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ',
                8: 'Evesham', 9: 'Rochdale'}
    for index, text in examples.items():
        places = geograpy.get_geoPlace_context(text=text)
        print("example %d: %s" % (index, places.countries))
and the result is now:
example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
Indeed, there is room for improvement for example 5. I have added an issue at https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned ...

Python - fuzzy string matching - TypeError: expected string or bytes-like object

I am trying to fuzzy merge two dataframes in Python using the code below:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')
The idea is to fuzzy merge based on two columns of each dataframe, Name and Year. I tried to combine these two in one field (NamePeriod) and then merge on that, but I am getting the following error:
TypeError: expected string or bytes-like object
Any idea how to perform this fuzzy merge? Here is how these columns look in the dataframes:
print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])
Name ... NamePeriod
0 NaN ... NaN
1 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2019
2 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2018
3 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2017
4 NAM TAI PROPERTY INC. ... NAM TAI PROPERTY INC. 2016
... ... ...
15922 Huitao Technology Co., Ltd. ... NaN
15923 Leaping Group Co., Ltd. ... NaN
15924 PUYI, INC. ... NaN
15925 Puhui Wealth Investment Management Co., Ltd. ... NaN
15926 Tidal Royalty Corp. ... NaN
[15927 rows x 3 columns]
prospectus_issuer_name fyear NamePeriod
0 ALCAN ALUM LTD 1990 ALCAN ALUM LTD 1990
1 ALCAN ALUM LTD 1991 ALCAN ALUM LTD 1991
2 ALCAN ALUM LTD 1992 ALCAN ALUM LTD 1992
3 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
4 AMOCO CDA PETE CO 1992 AMOCO CDA PETE CO 1992
... ... ...
1798 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1799 KOREA GAS CORP 2016 KOREA GAS CORP 2016
1800 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1801 PETROLEOS MEXICANOS 2016 PETROLEOS MEXICANOS 2016
1802 BOC AVIATION PTE LTD GLOBAL 2016 BOC AVIATION PTE LTD GLOBAL 2016
[1803 rows x 3 columns]
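As for the TypeError itself: the printouts above show NaN in several NamePeriod rows of filings, and passing a float NaN to process.extract would raise exactly this error. Dropping (or filling) those rows before matching is one way around it, for example:
# NaN entries are floats, which fuzzywuzzy cannot process as strings
filings = filings.dropna(subset=['NamePeriod'])
prospectus = prospectus.dropna(subset=['NamePeriod'])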
Here is the full code I try to run:
import pandas as pd
from rapidfuzz import process, utils
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)
from rapidfuzz import fuzz, utils
df3.dropna(subset = ["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
This gives me the error IndexError: list index out of range.
To summarize the problem:
- there are two DataFrames that both have a key for the name and the year
- you would like to merge the two DataFrames and remove all duplicate elements, a duplicate being an element with the same year and a very similar name
I am working with the following two example DataFrames:
import pandas as pd
df1 = pd.DataFrame({
    'Name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'Period': [2019, 2019, 2018]})
df2 = pd.DataFrame({
    'prospectus_issuer_name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'fyear': [2019, 2019, 1992]})
My approach to this problem would be to start by concatenating the two DataFrames:
df1.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
df2.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([df1, df2], ignore_index=True)
Afterwards it is possible to iterate over this new DataFrame and remove all duplicate rows. I am using RapidFuzz here, since it is faster than FuzzyWuzzy (I am the author). The following code creates a list of preprocessed names ahead of time, since each entry might be used multiple times and the preprocessing takes a big share of the runtime. It then iterates over the rows and compares each one with all rows that have a higher index (rows with a lower index are already compared, since ratio(a, b) == ratio(b, a)) and the same year. Filtering on the year first allows the slow string-matching algorithm to run a lot less often. For all rows with the same year and a very similar name, the first row is kept and the others are deleted. You might have to play around with the score_cutoff and the matching algorithm to see which fits your needs best.
from rapidfuzz import fuzz, utils

names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
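Regarding the IndexError in the updated question: a likely cause (my reading, not confirmed in the thread) is that dropna keeps the original integer labels, so the positional list names[i1] no longer lines up with df3's index once rows are removed. Resetting the index immediately after dropping rows avoids the mismatch:
df3.dropna(subset=['name'], inplace=True)
df3.reset_index(drop=True, inplace=True)  # realign labels with list positions
names = [utils.default_process(x) for x in df3['name']]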

I have country, start and end year for all baseball players. I need to know how many players per country played each year

I have a dataset with 20,000 players. Columns are birthCountry, debut_year and final_year.
birthCountry debut_year final_year
0 USA 2004 2015
1 USA 1954 1976
2 USA 1962 1971
3 USA 1977 1990
4 USA 2001 2006
I need to get a table as follows:
         1980  1981  1982
USA        50    49    48
CANADA     XX    XX    XX
MEXICO     XX    XX    XX
...
Where each cell represents the number of players that were born in a particular country, that played during that year.
I created a nested list, containing all years that each player played. The length of this list is the same as the length of the df. In the df, I created one additional column per year and I tried to add 1 for each player/year combination.
The idea was to use this to create a groupby or pivot_table
# create a list of years
years = list(range(min(df['debut_year'].values), max(df['final_year'].values) + 1))
# create a list of countries
countries = df.birthCountry.unique()
# add columns for years
for n in range(1841, 2019):  # years are from 1841 to 2018
    df[n] = ''
# now I have one additional column for every year. A lot of new empty columns
# temporary lists
templist = list(range(0, len(df)))
# every element of the following list contains all the years each player played
templist2 = []
for i in templist:
    templist2.append(list(range(int(df.iloc[i, 1]), int(df.iloc[i, 2]))))
# add 1 if the player played that year
for i in range(len(df)):
    for j in templist2[i]:
        df.iloc[i][j] = 1
It ran for some time and then nothing had changed in the original dataframe. (Likely because df.iloc[i][j] = 1 is chained indexing, which assigns to a temporary copy rather than to df itself.)
Here you can probably find a better, more elegant solution.
To limit the size of the example, I created the following source DataFrame:
df = pd.DataFrame(data=[[1, 'USA', 1974, 1978], [2, 'USA', 1976, 1981],
                        [3, 'USA', 1975, 1979], [4, 'USA', 1977, 1980],
                        [5, 'Mex', 1976, 1979], [6, 'Mex', 1978, 1980]],
                  columns=['Id', 'birthCountry', 'debut_year', 'final_year'])
The first step of the actual computation is to create a Series containing the
years in which each player was active:
years = df.apply(lambda row: pd.Series(range(row.debut_year,
                                             row.final_year + 1)),
                 axis=1).stack().astype(int).rename('year')
The second step is to create an auxiliary DataFrame - a join of
df.birthCountry and years:
df2 = df[['birthCountry']].join(years.reset_index(level=1, drop=True))
And the last step is to produce the actual result:
df2.groupby(['birthCountry', 'year']).size().rename('Count')\
    .unstack().fillna(0, downcast='infer')
For the above test data, the result is:
year          1974  1975  1976  1977  1978  1979  1980  1981
birthCountry
Mex              0     0     1     1     2     2     1     0
USA              1     2     3     4     4     3     2     1
I think my solution is more "pandasonic" than the one proposed earlier by Remy.
I was able to come up with the following solution if I understand the structure of your df variable correctly. I made a dictionary list (using a smaller range of years) with the same structure for my example:
df = [{'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2016},
      {'birthCountry': 'CANADA', 'debut_year': 2010, 'final_year': 2016},
      {'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'CANADA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'MEXICO', 'debut_year': 2012, 'final_year': 2016}]

countries = {}
for field in df:
    if field['birthCountry'] not in countries.keys():
        countries[field['birthCountry']] = {year: 0 for year in range(2010, 2019)}
    for year in range(field['debut_year'], field['final_year']):
        countries[field['birthCountry']][year] += 1
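To view those counts in the requested country-by-year layout, one extra step (my addition, not part of the original answer):
import pandas as pd

# countries is {country: {year: count}}; a dict of dicts becomes a DataFrame
# with countries as columns, so transpose to put countries on the rows
result = pd.DataFrame(countries).T
print(result)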

Find key from value for Pandas Series

I have a dictionary whose values come from a pandas Series. I want to look up each value of the Series in the dictionary and return a new Series with the associated key. Example:
import pandas as pd
df = pd.DataFrame({'season': ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013',
                              'Swe 2014', 'Swe 2014', 'Swe 2013',
                              'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})
nmdict = {'Norway': [s for s in list(set(df.season)) if 'No' in s],
          'Sweden': [s for s in list(set(df.season)) if 'S' in s]}
Desired result with df['country'] as the new column name:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
Due to the nature of my data, I must make the nmdict manually, as shown. I've tried this, but couldn't invert my nmdict, as the arrays are not the same length.
More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a vlookup solution, but according to this answer, I shouldn't be using the dictionary in this way.
Any answers appreciated.
I've done it in a verbose manner to allow you to follow through.
First, let's define a function that determines the value 'country'
In [4]: def get_country(s):
   ...:     if 'Nor' in s:
   ...:         return 'Norway'
   ...:     if 'S' in s:
   ...:         return 'Sweden'
   ...:     # return 'Default Country'  # if you get unmatched values
In [5]: get_country('Sven')
Out[5]: 'Sweden'
In [6]: get_country('Norv')
Out[6]: 'Norway'
We can use map to run get_country on every row (these examples are from Python 2, where map returns a list; in Python 3 you would wrap it in list(), e.g. list(map(get_country, df['season']))). Pandas Series also have an apply() method which works similarly*.
In [7]: map(get_country, df['season'])
Out[7]:
['Norway',
'Norway',
'Norway',
'Norway',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Norway']
Now we assign that result to the column called 'country'
In [8]: df['country'] = map(get_country, df['season'])
Let's view the final result:
In [9]: df
Out[9]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
*With apply() here's how it would look:
In [16]: df['country'] = df['season'].apply(get_country)
In [17]: df
Out[17]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
A more scalable country matcher
pseudo-code only :)
# Modify this as needed
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}

def get_country(s):
    """
    Run the passed string s against "matchers" for each country.
    Return the first matched country.
    """
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country
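Despite the "pseudo-code only" label, this runs as-is; a quick sanity check (my example calls):
print(get_country('Sven 2013'))  # Sweden
print(get_country('Norv 2014'))  # Norway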
IIUC, I would do the following:
df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)
You could create the country dictionary using a dictionary comprehension:
country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}
to get:
{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}
This works fine for two countries; otherwise you can apply a self-defined function in a similar way:
def country_dict(country_id):
    if country_id.startswith('S'):
        return 'Sweden'
    elif country_id.startswith('N'):
        return 'Norway'
    elif country_id.startswith('XX'):
        return ...
    else:
        return 'default'
Either way, map the dictionary to the country_id part of the season column, extracted using pandas string methods:
df['country'] = df.season.str.split().str.get(0).map(country_dict)
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
