Extracting country information from description using geograpy - python

PROBLEM: I want to extract country information from a user description. So far, I'm giving a try with the geograpy package. I like the behavior when the input is not very clear for example in Evesham or Rochdale, however, the package interprets some strings like Zaragoza, Spain as two mentions while the user is clearing saying that its location is in Spain. Still, I don't know why amsterdam is not giving as output Holland... How can I improve the outputs? Am I missing anything important? Is there a better package to achieve this?
DATA: My data example is:
user_location
2 Socialist Republic of Alachua
3 Hérault, France
4 Gwalior, India
5 Zaragoza,España
7 amsterdam
8 Evesham
9 Rochdale
I want to get something like this:
user_location country
2 Socialist Republic of Alachua ['USSR', 'United States']
3 Hérault, France ['France']
4 Gwalior, India ['India']
5 Zaragoza,España ['Spain']
7 amsterdam ['Holland']
8 Evesham ['United Kingdom']
9 Rochdale ['United Kingdom', 'United States']
REPREX:
import pandas as pd
import geograpy3
df = pd.DataFrame.from_dict({'user_location': {2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}})
df['country'] = df['user_location'].apply(lambda x: geograpy.get_place_context(text=x).countries if pd.notnull(x) else x)
print(df)
#> user_location country
#> 2 Socialist Republic of Alachua [USSR, Union of Soviet Socialist Republics, Al...
#> 3 Hérault, France [France, Hérault]
#> 4 Gwalior, India [British Indian Ocean Territory, Gwalior, India]
#> 5 Zaragoza,España [Zaragoza, España, Spain, El Salvador]
#> 7 amsterdam []
#> 8 Evesham [Evesham, United Kingdom]
#> 9 Rochdale [Rochdale, United Kingdom, United States]
Created on 2020-06-02 by the reprexpy package

geograpy3 was not behaving correctly anymore regarding country lookup since it didn't check if None was returned by pycountry. As a committer i just fixed this.
I have added your slightly modified example (to avoid the pandas import) as a unit test case:
def testStackoverflow62152428(self):
'''
see https://stackoverflow.com/questions/62152428/extracting-country-information-from-description-using-geograpy?noredirect=1#comment112899776_62152428
'''
examples={2: 'Socialist Republic of Alachua', 3: 'Hérault, France', 4: 'Gwalior, India', 5: 'Zaragoza,España', 7: 'amsterdam ', 8: 'Evesham', 9: 'Rochdale'}
for index,text in examples.items():
places=geograpy.get_geoPlace_context(text=text)
print("example %d: %s" % (index,places.countries))
and the result is now:
example 2: ['United States']
example 3: ['France']
example 4: ['British Indian Ocean Territory', 'India']
example 5: ['Spain', 'El Salvador']
example 7: []
example 8: ['United Kingdom']
example 9: ['United Kingdom', 'United States']
indeed there is room for improvement for example 5. I have added an issue https://github.com/somnathrakshit/geograpy3/issues/7 - please stay tuned ...

Related

How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City
Continent
New York
America
Vancouver
America
London
Europe
Berlin
Europe
Tokyo
Asia
Bangkok
Asia
Note: this is the minimum reproductible example to keep it simple, but the real dataset is more like city -> country -> continent
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with arguments such as "if Europe in cities" but that doesn't do anything and I think that's because it's "false" since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your Code:
First of all, let's take a look at a Code Snippet used by you: if Europe in cities: was returned nothing Correct!
It is because you are comparing the whole list [Europe] instead of individual list element ['London', 'Berlin']
Solution:
Initially, I have imported all the important modules and regenerated a List of Sample Data provided by you.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, As you can see in your Expected Output we have 2 Columns mentioned below:
City [Which is already available in the form of cities (List)]
Continent [Which we have to generate based on other Lists. In our case: Europe, America, Asia]
For Generating a proper Continent List follow the Code mentioned below:
# Make Continent list
continent = []
# Compare the list of Europe, America and Asia with cities
for city in cities:
if city in Europe:
continent.append('Europe')
elif city in America:
continent.append('America')
elif city in Asia:
continent.append('Asia')
else:
pass
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see we have received the expected Continent List. Now let's generate the pd.DataFrame() from the same:
# Make dataframe from 'City' and 'Continent List`
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this Solution helps you. But if you are still facing Errors then feel free to start a thread below.
1 : Counting elements
You just count the number of cities in each continent and create a list with it :
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
continent += [name for _ in range(len(cont))]
cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent}
print(df)
And this gives you the following result :
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This is I think the best solution.
2: With dictionnary
You can create an intermediate dictionnary.
Starting from your code
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this :
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
for city in cont_cities:
continent[city] = cont_name
This give you the following result :
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame :
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution allows you not to override your cities tuple
I think on the long run you might want to elimninate loops for large datasets. Also, you might need to include more continent depending on the content of your data.
import pandas as pd
continent = {
'0': 'Europe',
'1': 'America',
'2': 'Asia'
}
df= pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent']= df['level_0'].astype(str).map(continent)
df.drop(['level_0','level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case

Assign values from a dictionary to a new column based on condition

This my data frame
City
sales
San Diego
500
Texas
400
Nebraska
300
Macau
200
Rome
100
London
50
Manchester
70
I want to add the country at the end which will look like this
City
sales
Country
San Diego
500
US
Texas
400
US
Nebraska
300
US
Macau
200
Hong Kong
Rome
100
Italy
London
50
England
Manchester
200
England
The countries are stored in below dictionary
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
nd = {}
for k,v in d.items():
# Check if it's a list, if so then iterate through
if ((hasattr(v, '__iter__') and not isinstance(v, str))):
for item in v:
nd[item] = k
else:
nd[v] = k
return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
You can implement this using geopy
You can install geopy by pip install geopy
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom

How to calculate a win streak in Python/Pandas

I'm trying to calculate the win-streak or losing-streak going into a game. My goal is to generate a betting decision based on these streak factors or a recent record. I am new to Python and Pandas (and programming in general), so any detailed explanation of what code does would be welcome.
Here's my data
Season Game Date Game Index Away Team Away Score Home Team Home Score Winner Loser
0 2014 Regular Season Saturday, March 22, 2014 2014032201 Los Angeles Dodgers 3 Arizona D'Backs 1 Los Angeles Dodgers Arizona D'Backs
1 2014 Regular Season Sunday, March 23, 2014 2014032301 Los Angeles Dodgers 7 Arizona D'Backs 5 Los Angeles Dodgers Arizona D'Backs
2 2014 Regular Season Sunday, March 30, 2014 2014033001 Los Angeles Dodgers 1 San Diego Padres 3 San Diego Padres Los Angeles Dodgers
3 2014 Regular Season Monday, March 31, 2014 2014033101 Seattle Mariners 10 Los Angeles Angels 3 Seattle Mariners Los Angeles Angels
4 2014 Regular Season Monday, March 31, 2014 2014033102 San Francisco Giants 9 Arizona D'Backs 8 San Francisco Giants Arizona D'Backs
5 2014 Regular Season Monday, March 31, 2014 2014033103 Boston Red Sox 1 Baltimore Orioles 2 Baltimore Orioles Boston Red Sox
6 2014 Regular Season Monday, March 31, 2014 2014033104 Minnesota Twins 3 Chicago White Sox 5 Chicago White Sox Minnesota Twins
7 2014 Regular Season Monday, March 31, 2014 2014033105 St. Louis Cardinals 1 Cincinnati Reds 0 St. Louis Cardinals Cincinnati Reds
8 2014 Regular Season Monday, March 31, 2014 2014033106 Kansas City Royals 3 Detroit Tigers 4 Detroit Tigers Kansas City Royals
9 2014 Regular Season Monday, March 31, 2014 2014033107 Colorado Rockies 1 Miami Marlins 10 Miami Marlins Colorado Rockies
Dictionary below:
{'Away Score': {0: 3, 1: 7, 2: 1, 3: 10, 4: 9},
'Away Team': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'Los Angeles Dodgers',
3: 'Seattle Mariners',
4: 'San Francisco Giants'},
'Game Date': {0: 'Saturday, March 22, 2014',
1: 'Sunday, March 23, 2014',
2: 'Sunday, March 30, 2014',
3: 'Monday, March 31, 2014',
4: 'Monday, March 31, 2014'},
'Game Index': {0: 2014032201,
1: 2014032301,
2: 2014033001,
3: 2014033101,
4: 2014033102},
'Home Score': {0: 1, 1: 5, 2: 3, 3: 3, 4: 8},
'Home Team': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'San Diego Padres',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Loser': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'Los Angeles Dodgers',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Season': {0: '2014 Regular Season',
1: '2014 Regular Season',
2: '2014 Regular Season',
3: '2014 Regular Season',
4: '2014 Regular Season'},
'Winner': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'San Diego Padres',
3: 'Seattle Mariners',
4: 'San Francisco Giants'}}
I've tried looping through the season and the team, and then creating a streak count based on [this]: https://github.com/nhcamp/EPL-Betting/blob/master/EPL%20Match%20Results%20DF.ipynb github project.
I run into key errors early in building my loops, and I have trouble identifying data
game_table = pd.read_csv('MLB_Scores_2014_2018.csv')
# Get Team List
team_list = game_table['Away Team'].unique()
# Get Season List
season_list = game_table['Season'].unique()
#Defining "chunks" to append gamedata to the total dataframe
chunks = []
for season in season_list:
# Looping through seasons. Streaks reset for each season
season_games = game_table[game_table['Season'] == season]
for team in team_list:
# Looping through teams
season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
#Setting streak list and streak counter values
streak_list = []
streak = 0
# Looping through each game
for game in season_team_games.iterrow():
# Check if team is a winner, and up the streak
if game_table['Winner'] == team:
streak_list.append(streak)
streak += 1
# If not the winner, append streak and set to zero
elif game_table['Winner'] != team:
streak_list.append(streak)
streak = 0
# Just in case something wierd happens with the scores
else:
streak_list.append(streak)
game_table['Streak'] = streak_list
chunk_list.append(game_table)
And that's kind of where I lose it. How do I append separately if each team is the home team or the away team? Is there a better way to display this data?
As a general matter, I want to add a win-streak and/or losing-streak for each team in each game. Headers would look like this:
| Season | Game Date | Game Index | Away Team | Away Score | Home Team | Home Score | Winner | Loser | Away Win Streak | Away Lose Streak | Home Win Streak | Home Lose Streak |
Edit: this error message has been resolved
I also get an error creating the dataframe 'season_team_games."
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
The error you are seeing come from the statement
season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
When you're adding two boolean conditions, you need to separate them out with parentheses. This is because the | operator takes precedence over the == operator. So this should become:
season_team_games = season_games[(season_games['Away Team'] == team) | (season_games['Home Team'] == team)]
I know there is more to the question than this error, but as mentioned in the comment, once you provide some text based data, it might be easier to help

Pandas Split Column String and Plot unique values

I have a dataframe Df that looks like this:
Country Year
0 Australia, USA 2015
1 USA, Hong Kong, UK 1982
2 USA 2012
3 USA 1994
4 USA, France 2013
5 Japan 1988
6 Japan 1997
7 USA 2013
8 Mexico 2000
9 USA, UK 2005
10 USA 2012
11 USA, UK 2014
12 USA 1980
13 USA 1992
14 USA 1997
15 USA 2003
16 USA 2004
17 USA 2007
18 USA, Germany 2009
19 Japan 2006
20 Japan 1995
I want to make a bar chart for the Country column, if i try this
Df.Country.value_counts().plot(kind='bar')
I get this plot
which is incorrect because it doesn't separate the countries. My goal is to obtain a bar chart that plots the count of each country in the column, but to achieve that, first i have to somehow split the string in each row (if needed) and then plot the data. I know i can use Df.Country.str.split(', ') to split the strings, but if i do this i can't plot the data.
Anyone has an idea how to solve this problem?
You could use the vectorized Series.str.split method to split the Countrys:
In [163]: df['Country'].str.split(r',\s+', expand=True)
Out[163]:
0 1 2
0 Australia USA None
1 USA Hong Kong UK
2 USA None None
3 USA None None
4 USA France None
...
If you stack this DataFrame to move all the values into a single column, then you can apply value_counts and plot as before:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(
{'Country': ['Australia, USA', 'USA, Hong Kong, UK', 'USA', 'USA', 'USA, France', 'Japan', 'Japan', 'USA', 'Mexico', 'USA, UK', 'USA', 'USA, UK', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'USA, Germany', 'Japan', 'Japan'],
'Year': [2015, 1982, 2012, 1994, 2013, 1988, 1997, 2013, 2000, 2005, 2012, 2014, 1980, 1992, 1997, 2003, 2004, 2007, 2009, 2006, 1995]})
counts = df['Country'].str.split(r',\s+', expand=True).stack().value_counts()
counts.plot(kind='bar')
plt.show()
from collections import Counter
c = pd.Series(Counter(df.Country.str.split(',').sum()))
>>> c.plot(kind='bar', title='Country Count')
new_df = pd.concat([Series(row['Year'], row['Country'].split(',')) for _, row in DF.iterrows()]).reset_index()
(DF is your old DF).
this will give you one data point for each country name.
Hope this helps.
Cheers!

How can I improve performance on my apply() with fuzzy matching statement

I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. Works great, but the performance is pretty bad on larger frames. Please take a look at my apply() that does the extracting/scoring and let me know if you have any ideas that could speed it up.
import pandas as pd
import numpy as np
import fuzzywuzzy as fw
Create a frame of raw data
dfRaw = pd.DataFrame({'City': {0: u'St Louis',
1: 'Omaha',
2: 'Chicogo',
3: 'Kansas city',
4: 'Des Moine'},
'State' : {0: 'MO', 1: 'NE', 2 : 'IL', 3 : 'MO', 4 : 'IA'}})
Which yields
City State
0 St Louis MO
1 Omaha NE
2 Chicogo IL
3 Kansas city MO
4 Des Moine IA
Then a frame that represents the good data that we want to look up
dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
1: u'Omaha',
2: u'Chicago',
3: u'Kansas City',
4: u'Des Moines'},
'State' : {0: 'MO', 1: 'NE', 2 : 'IL',
3 : 'KS', 4 : 'IA'},
u'Zip': {0: '63201', 1: '68104', 2: '60290',
3: '68101', 4: '50301'}})
Which yields
City State Zip
0 Saint Louis MO 63201
1 Omaha NE 68104
2 Chicago IL 60290
3 Kansas City KS 68101
4 Des Moines IA 50301
and now the muzz function. EDIT: Added choices= right[match_col_name] line and used choices in the apply per Brenbarn suggestion. I also, per Brenbarn suggestion, ran some tests with the extractOne() without the apply and it it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?
def muzz(left, right, on, match_col_name='match_on',score_col_name='score_match',
right_suffix='_match', score_cutoff=80):
right[match_col_name] = np.sum(right[on],axis=1)
choices= right[match_col_name]
###The offending statement###
left[[match_col_name,score_col_name]] =
pd.Series(np.sum(left[on],axis=1)).apply(lambda x : pd.Series(
fw.process.extractOne(x,choices,score_cutoff=score_cutoff)))
dfTemp = pd.merge(left,right,how='left',on=match_col_name,suffixes=('',right_suffix))
return dfTemp.drop(match_col_name, axis=1)
Calling muzz
muzz(dfRaw.copy(),dfLocations,on=['City','State'], score_cutoff=85)
Which yields
City State score_match City_match State_match Zip
0 St Louis MO 87 Saint Louis MO 63201
1 Omaha NE 100 Omaha NE 68104
2 Chicogo IL 89 Chicago IL 60290
3 Kansas city MO NaN NaN NaN NaN
4 Des Moine IA 96 Des Moines IA 50301

Categories