How can I combine the Index in the python dataframe? - python

In the following, can I make a single index for all the entries that share a common index?
cric = pd.Series(['India', 'Pakistan', 'South Africa', 'England', 'New Zealand'],
index = ['Cricket', 'Cricket', 'Cricket', 'Cricket', 'Cricket'])
ftbl = pd.Series(['England', 'South Africa', 'Australia', 'Netherlands', 'New Zealand'],
index = ['Football', 'Football', 'Football', 'Football' , 'Football'])
hock = pd.Series(['India', 'Pakistan', 'South Korea', 'England', 'India', 'New Zealand'],
index = ['Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey'])
all_countries_1 = cric.append(ftbl)
all_countries_1 = all_countries_1.append(ftbl)
all_countries_1 = all_countries_1.append(hock)
all_countries_1 = all_countries_1.to_frame()
all_countries_1.columns = ['Countries']
all_countries_1
I want the following as my output:

Is this what you are looking for?
# zip the first three chars of the index and the index together
z = list(zip(all_countries_1.index.str[:3], all_countries_1.index))
# create multi index
idx = pd.MultiIndex.from_tuples(z)
# assign index
all_countries_1.index = idx
                 Countries
Cri Cricket      India
    Cricket      Pakistan
    Cricket      South Africa
    Cricket      England
    Cricket      New Zealand
Foo Football     England
    Football     South Africa
    Football     Australia
    Football     Netherlands
    Football     New Zealand
    Football     England
    Football     South Africa
    Football     Australia
    Football     Netherlands
    Football     New Zealand
Hoc Hockey       India
    Hockey       Pakistan
    Hockey       South Korea
    Hockey       England
    Hockey       India
    Hockey       New Zealand

If, by single index, you mean an index made of autoincrementing numbers, there is nothing special you have to do. That is the default index for a DataFrame, so calling the reset_index() method will get what you want. The next step will probably be to rename the column that held your old index. You can chain rename with reset_index and take care of it in one line.
all_countries_1 = all_countries_1.reset_index().rename(columns={"index":"Sports"})
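Note that Series.append was removed in pandas 2.0, so on current versions the frame in the question has to be built with pd.concat instead. The following is a minimal sketch, assuming a shortened version of the question's data; the level names 'Sport' and 'n' are made up for illustration. It adds a per-sport counter as a second index level, which makes pandas display each sport label once per group.

```python
import pandas as pd

# Shortened sample data in the same shape as the question's Series
cric = pd.Series(['India', 'Pakistan'], index=['Cricket', 'Cricket'])
ftbl = pd.Series(['England', 'Australia'], index=['Football', 'Football'])

# pd.concat replaces the chained Series.append calls
all_countries = pd.concat([cric, ftbl]).to_frame(name='Countries')

# Number the rows within each sport and use that counter as a second index level
counter = all_countries.groupby(level=0).cumcount()
all_countries.index = pd.MultiIndex.from_arrays(
    [all_countries.index, counter], names=['Sport', 'n']
)
print(all_countries)
```

Selecting all_countries.loc['Cricket'] then returns only the cricket rows.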

Related

Converting dataframe to dictionary with country by continent

I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country    continent
Algeria    Africa
Angola     Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
This is the right format, but it only kept the last country it found for each continent.
Does anyone have an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
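To see why the dict(...) attempt in the question kept only one country per continent, here is a small plain-Python sketch. The rows list is a made-up stand-in for the CSV contents. A dictionary holds one value per key, so each later country overwrites the earlier one unless the values are collected into a list.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the CSV's (country, continent) pairs
rows = [("Algeria", "Africa"), ("Angola", "Africa"), ("Yemen", "Asia")]

# dict() keeps one value per key, so later countries overwrite earlier ones
overwritten = dict((continent, country) for country, continent in rows)

# Appending to a list per key keeps every country
by_continent = defaultdict(list)
for country, continent in rows:
    by_continent[continent].append(country)

print(overwritten)
print(dict(by_continent))
```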

How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City         Continent
New York     America
Vancouver    America
London       Europe
Berlin       Europe
Tokyo        Asia
Bangkok      Asia
Note: this is the minimal reproducible example to keep it simple, but the real dataset is more like city -> country -> continent
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with arguments such as "if Europe in cities" but that doesn't do anything and I think that's because it's "false" since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your code:
First of all, let's take a look at the snippet you used: if Europe in cities: does nothing, which is correct behaviour.
That is because it checks whether the whole Europe sequence is itself an element of cities, instead of checking each individual element ('London', 'Berlin').
Solution:
Initially, I have imported all the important modules and regenerated a List of Sample Data provided by you.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
As you can see in your expected output, we need the 2 columns below:
City (already available as the cities list)
Continent (which we have to generate from the other lists, in our case Europe, America, and Asia)
To generate a proper Continent list, follow the code below:
# Make Continent list
continent = []
# Compare the list of Europe, America and Asia with cities
for city in cities:
    if city in Europe:
        continent.append('Europe')
    elif city in America:
        continent.append('America')
    elif city in Asia:
        continent.append('Asia')
    else:
        # Keep the list the same length as cities when a city matches no continent
        continent.append(None)
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see we have received the expected Continent List. Now let's generate the pd.DataFrame() from the same:
# Make dataframe from the 'City' and 'Continent' lists
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this Solution helps you. But if you are still facing Errors then feel free to start a thread below.
1: Counting elements
You just count the number of cities in each continent and build a list from it:
import pandas as pd
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    # Repeat the continent name once per city in that continent
    continent += [name for _ in range(len(cont))]
    cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent})
print(df)
And this gives you the following result :
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
I think this is the best solution.
2: With a dictionary
You can create an intermediate dictionary.
Starting from your code
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this :
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    for city in cont_cities:
        continent[city] = cont_name
This gives you the following result:
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame :
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution lets you avoid overwriting your cities tuple.
I think in the long run you might want to eliminate loops for large datasets. Also, you might need to include more continents depending on the content of your data.
import pandas as pd
continent = {
'0': 'Europe',
'1': 'America',
'2': 'Asia'
}
df= pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent']= df['level_0'].astype(str).map(continent)
df.drop(['level_0','level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
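For the real case mentioned in the question (city -> country -> continent), one loop-free option is to chain two Series.map lookups. This is only a sketch under assumptions: the two dictionaries and the 'Atlantis' row are hypothetical, and in practice the lookups would come from your own data.

```python
import pandas as pd

# Hypothetical lookup tables; real ones would be loaded from your dataset
city_to_country = {'London': 'United Kingdom', 'Berlin': 'Germany', 'Tokyo': 'Japan'}
country_to_continent = {'United Kingdom': 'Europe', 'Germany': 'Europe', 'Japan': 'Asia'}

df = pd.DataFrame({'City': ['London', 'Berlin', 'Tokyo', 'Atlantis']})

# First hop: city -> country; second hop: country -> continent
df['Country'] = df['City'].map(city_to_country)
df['Continent'] = df['Country'].map(country_to_continent)
print(df)
```

Cities missing from the lookup (here the made-up 'Atlantis') end up as NaN in both new columns, which makes gaps in the lookup tables easy to find.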

create new column of dataframe base on value of another dataframe run fast?

I want to create a new column, df_cau2['continent']. First, here are my two DataFrames:
country_continent
                  Continent
Country
Afghanistan       Asia
Albania           Europe
Algeria           Africa
American Samoa    Oceania
and
df_cau2
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
To create the new continent column, I use apply on df_cau2 like this:
def same_continent(home, away):
    if country_continent.loc[home].Continent == country_continent.loc[away].Continent:
        return country_continent.loc[home].Continent
    return 'None'
df_cau2['continent'] = df_cau2.apply(lambda x: same_continent(x['home_team'], x['away_team']), axis=1)
df_cau2.head()
With 39480 rows in df_cau2, this code runs too slowly. How can I change it to run faster? I am thinking about using np.select, but I don't know how to use it in this case.
This is the result that I want:
date home_team away_team home_score away_score tournament city country neutral continent
7611 1970-09-11 Iran Turkey 1 1 Friendly Teheran Iran False None
31221 2009-03-11 Nepal Pakistan 1 0 Friendly Kathmandu Nepal False Asia
32716 2010-11-17 Colombia Peru 1 1 Friendly Bogotá Colombia False South America
Thanks
IIUC, you want to set continent column only if home_team and away_team columns are in the same continent:
home_continent = df1['home_team'].map(df2.squeeze())
away_continent = df1['away_team'].map(df2.squeeze())
m = home_continent == away_continent
df1.loc[m, 'continent'] = home_continent.loc[m]
print(df1)
# Output
home_team away_team continent
0 Canada England NaN
1 France Spain Europe
2 China Japan Asia
Set up an MRE:
df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
'away_team': ['England', 'Spain', 'Japan']})
print(df1)
df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
'France', 'Japan', 'Spain'],
'Continent': ['North America', 'Asia', 'Europe',
'Europe', 'Asia', 'Europe']}).set_index('Country')
print(df2)
# Output df1
home_team away_team
0 Canada England
1 France Spain
2 China Japan
# Output df2
Continent
Country
Canada North America
China Asia
England Europe
France Europe
Japan Asia
Spain Europe
Consider merging the continent lookup data frame twice to create home and away continent columns. Since you will then have both continents, assign the new shared continent column conditionally with numpy.where:
df_cau2 = (
    df_cau2.merge(
        country_continent.reset_index(),
        left_on = "home_team",
        right_on = "Country",
        how = "left"
    ).merge(
        country_continent.reset_index(),
        left_on = "away_team",
        right_on = "Country",
        how = "left",
        suffixes = ["_home", "_away"]
    )
)
df_cau2["shared_continent"] = np.where(
    df_cau2["Continent_home"].eq(df_cau2["Continent_away"]),
    df_cau2["Continent_home"],
    np.nan
)
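A self-contained sketch that combines the two ideas above (Series.map for the lookups, numpy.where for the condition) may make the vectorized approach easier to try. The sample rows and continent assignments are hypothetical toy data shaped like the question's frames, and 'None' is used as the fallback string because that is what the question's function returns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy lookup, indexed by country like the question's country_continent
country_continent = pd.DataFrame(
    {'Continent': ['Asia', 'Europe', 'Asia', 'Asia', 'South America', 'South America']},
    index=pd.Index(['Iran', 'Turkey', 'Nepal', 'Pakistan', 'Colombia', 'Peru'], name='Country'),
)
df_cau2 = pd.DataFrame({
    'home_team': ['Iran', 'Nepal', 'Colombia'],
    'away_team': ['Turkey', 'Pakistan', 'Peru'],
})

# Look up both continents for the whole column at once instead of row by row
lookup = country_continent['Continent']
home = df_cau2['home_team'].map(lookup)
away = df_cau2['away_team'].map(lookup)

# Keep the continent where both sides agree, otherwise the string 'None'
df_cau2['continent'] = np.where(home.eq(away), home, 'None')
print(df_cau2)
```

Teams missing from the lookup map to NaN, and NaN never compares equal, so such rows also get 'None'.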

How can I use Python to turn country code into full name and infer the country name based on the city name on an Excel file? [closed]

I'm a beginner in Python.
Now I have 2 columns in my Excel file. One is the country column and the other is the city column.
For the country column, most values are country codes, some are full country names, some are U.S. state codes, and fewer than 1% are blank.
The city column clearly shows the full city name (not a city code), but nearly 20% of the values are blank.
How can I use Python to create a new column that shows the full country name based on the country code, keeps the name unchanged if the country column already holds a full country name, and shows United States when the value is a U.S. state code?
The tricky part is that a value in the country column can be ambiguous. Take CO as an example: CO can stand for Colombia or Colorado, so I cannot tell at first whether it is a country or a state, but when I check the corresponding city name I can tell (e.g. Longmont for Colorado, Bogota for Colombia). How can I handle this ambiguity and infer the full country name in the new column from the corresponding city name?
I appreciate your help!
Explanation
The task was coded using the following logic.
Process simple abbreviations such as U.S.
Country length greater than 3
    Have country and city
        Find closest country/city pair in cities
    Country only
        Find closest country match in the list of countries from the two-letter country codes
Country length equals 3
    Find country with three-letter country codes
Country length equals 2 (could be country or state code)
    Code does not exist in list of states
        Must be country code, so look up country in two-letter country codes
    Code does not exist in list of countries
        Must be state code for USA, so country is United States
    Could be country or state code
        Check city with this as a state code
        Check city with this as a country code
        Must be best match of these two possibilities
Note: String matching uses fuzzy matching to allow for flexibility in spelling of names
The rapidfuzz library was used over fuzzywuzzy since it's an order of magnitude faster
Code
import pandas as pd
from rapidfuzz import fuzz
def find_closest_country(country):
    ' Country with the closest name in list of countries in country code '
    ratios = [fuzz.partial_ratio(country, x) for x in alpha2.values()]
    rated_countries = [(info, r) for info, r in zip(alpha2.values(), ratios)]
    # Best match with shortest name
    return sorted(rated_countries, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
def check_city_country(city, country):
    ' City, Country pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(country, x['country']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
def check_city_subregion(city, subregion):
    ' City, subregion pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(subregion, x['subcountry']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
def lookup(country, city):
    '''
    Finds country based upon country and city
    country - country name or country code
    city - name of city
    '''
    if country.lower() == 'u.s.':
        # Picks up common US acronym
        country = "US"
    if len(country) > 3:
        # Must be country since too long for abbreviation
        if city:
            # Find closest city country pair in list of cities
            city_info = check_city_country(city, country)
            if city_info:
                return city_info[0]['country']
        # No city, so find closest country in list of countries (2 code abbreviations reverse lookup)
        countries = find_closest_country(country)
        if countries:
            return countries[0]
        return None
    elif len(country) == 3:
        # 3 letter abbreviation
        country = country.upper()
        return alpha3.get(country, None)
    elif len(country) == 2:
        # Two letter country abbreviation
        country = country.upper()
        if not country in states:
            # Not a state code, so look up country from code
            return alpha2.get(country, None)
        if not country in alpha2:
            # Not a country code, so must be state code for US
            return "United States of America"
        # Could be country or state code
        if city:
            # Have 2 letter code (could be country or state)
            pos_country = alpha2[country] # possible country
            pos_state = states[country] # possible state
            # check closest country with this city
            pos_countries = check_city_country(city, pos_country)
            # If state code, country would be United States
            pos_us = check_city_country(city, "United States")
            if pos_countries[1] > pos_us[1]:
                # Provided better match as country code
                return pos_countries[0]['country']
            else:
                # Provided better match as state code (i.e. "United States")
                return pos_us[0]['country']
        else:
            return alpha2[country]
    else:
        return None
Data
# State Codes
# https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97
states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts","MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}
# two letter country codes
# https://gist.github.com/carlopires/1261951/d13ca7320a6abcd4b0aa800d351a31b54cefdff4
alpha2 = {
'AD': 'Andorra',
'AE': 'United Arab Emirates',
'AF': 'Afghanistan',
'AG': 'Antigua & Barbuda',
'AI': 'Anguilla',
'AL': 'Albania',
'AM': 'Armenia',
'AN': 'Netherlands Antilles',
'AO': 'Angola',
'AQ': 'Antarctica',
'AR': 'Argentina',
'AS': 'American Samoa',
'AT': 'Austria',
'AU': 'Australia',
'AW': 'Aruba',
'AZ': 'Azerbaijan',
'BA': 'Bosnia and Herzegovina',
'BB': 'Barbados',
'BD': 'Bangladesh',
'BE': 'Belgium',
'BF': 'Burkina Faso',
'BG': 'Bulgaria',
'BH': 'Bahrain',
'BI': 'Burundi',
'BJ': 'Benin',
'BM': 'Bermuda',
'BN': 'Brunei Darussalam',
'BO': 'Bolivia',
'BR': 'Brazil',
'BS': 'Bahamas',
'BT': 'Bhutan',
'BU': 'Burma (no longer exists)',
'BV': 'Bouvet Island',
'BW': 'Botswana',
'BY': 'Belarus',
'BZ': 'Belize',
'CA': 'Canada',
'CC': 'Cocos (Keeling) Islands',
'CF': 'Central African Republic',
'CG': 'Congo',
'CH': 'Switzerland',
'CI': 'Côte D\'ivoire (Ivory Coast)',
'CK': 'Cook Islands',
'CL': 'Chile',
'CM': 'Cameroon',
'CN': 'China',
'CO': 'Colombia',
'CR': 'Costa Rica',
'CS': 'Czechoslovakia (no longer exists)',
'CU': 'Cuba',
'CV': 'Cape Verde',
'CX': 'Christmas Island',
'CY': 'Cyprus',
'CZ': 'Czech Republic',
'DD': 'German Democratic Republic (no longer exists)',
'DE': 'Germany',
'DJ': 'Djibouti',
'DK': 'Denmark',
'DM': 'Dominica',
'DO': 'Dominican Republic',
'DZ': 'Algeria',
'EC': 'Ecuador',
'EE': 'Estonia',
'EG': 'Egypt',
'EH': 'Western Sahara',
'ER': 'Eritrea',
'ES': 'Spain',
'ET': 'Ethiopia',
'FI': 'Finland',
'FJ': 'Fiji',
'FK': 'Falkland Islands (Malvinas)',
'FM': 'Micronesia',
'FO': 'Faroe Islands',
'FR': 'France',
'FX': 'France, Metropolitan',
'GA': 'Gabon',
'GB': 'United Kingdom (Great Britain)',
'GD': 'Grenada',
'GE': 'Georgia',
'GF': 'French Guiana',
'GH': 'Ghana',
'GI': 'Gibraltar',
'GL': 'Greenland',
'GM': 'Gambia',
'GN': 'Guinea',
'GP': 'Guadeloupe',
'GQ': 'Equatorial Guinea',
'GR': 'Greece',
'GS': 'South Georgia and the South Sandwich Islands',
'GT': 'Guatemala',
'GU': 'Guam',
'GW': 'Guinea-Bissau',
'GY': 'Guyana',
'HK': 'Hong Kong',
'HM': 'Heard & McDonald Islands',
'HN': 'Honduras',
'HR': 'Croatia',
'HT': 'Haiti',
'HU': 'Hungary',
'ID': 'Indonesia',
'IE': 'Ireland',
'IL': 'Israel',
'IN': 'India',
'IO': 'British Indian Ocean Territory',
'IQ': 'Iraq',
'IR': 'Islamic Republic of Iran',
'IS': 'Iceland',
'IT': 'Italy',
'JM': 'Jamaica',
'JO': 'Jordan',
'JP': 'Japan',
'KE': 'Kenya',
'KG': 'Kyrgyzstan',
'KH': 'Cambodia',
'KI': 'Kiribati',
'KM': 'Comoros',
'KN': 'St. Kitts and Nevis',
'KP': 'Korea, Democratic People\'s Republic of',
'KR': 'Korea, Republic of',
'KW': 'Kuwait',
'KY': 'Cayman Islands',
'KZ': 'Kazakhstan',
'LA': 'Lao People\'s Democratic Republic',
'LB': 'Lebanon',
'LC': 'Saint Lucia',
'LI': 'Liechtenstein',
'LK': 'Sri Lanka',
'LR': 'Liberia',
'LS': 'Lesotho',
'LT': 'Lithuania',
'LU': 'Luxembourg',
'LV': 'Latvia',
'LY': 'Libyan Arab Jamahiriya',
'MA': 'Morocco',
'MC': 'Monaco',
'MD': 'Moldova, Republic of',
'MG': 'Madagascar',
'MH': 'Marshall Islands',
'ML': 'Mali',
'MN': 'Mongolia',
'MM': 'Myanmar',
'MO': 'Macau',
'MP': 'Northern Mariana Islands',
'MQ': 'Martinique',
'MR': 'Mauritania',
'MS': 'Montserrat',
'MT': 'Malta',
'MU': 'Mauritius',
'MV': 'Maldives',
'MW': 'Malawi',
'MX': 'Mexico',
'MY': 'Malaysia',
'MZ': 'Mozambique',
'NA': 'Namibia',
'NC': 'New Caledonia',
'NE': 'Niger',
'NF': 'Norfolk Island',
'NG': 'Nigeria',
'NI': 'Nicaragua',
'NL': 'Netherlands',
'NO': 'Norway',
'NP': 'Nepal',
'NR': 'Nauru',
'NT': 'Neutral Zone (no longer exists)',
'NU': 'Niue',
'NZ': 'New Zealand',
'OM': 'Oman',
'PA': 'Panama',
'PE': 'Peru',
'PF': 'French Polynesia',
'PG': 'Papua New Guinea',
'PH': 'Philippines',
'PK': 'Pakistan',
'PL': 'Poland',
'PM': 'St. Pierre & Miquelon',
'PN': 'Pitcairn',
'PR': 'Puerto Rico',
'PT': 'Portugal',
'PW': 'Palau',
'PY': 'Paraguay',
'QA': 'Qatar',
'RE': 'Réunion',
'RO': 'Romania',
'RU': 'Russian Federation',
'RW': 'Rwanda',
'SA': 'Saudi Arabia',
'SB': 'Solomon Islands',
'SC': 'Seychelles',
'SD': 'Sudan',
'SE': 'Sweden',
'SG': 'Singapore',
'SH': 'St. Helena',
'SI': 'Slovenia',
'SJ': 'Svalbard & Jan Mayen Islands',
'SK': 'Slovakia',
'SL': 'Sierra Leone',
'SM': 'San Marino',
'SN': 'Senegal',
'SO': 'Somalia',
'SR': 'Suriname',
'ST': 'Sao Tome & Principe',
'SU': 'Union of Soviet Socialist Republics (no longer exists)',
'SV': 'El Salvador',
'SY': 'Syrian Arab Republic',
'SZ': 'Swaziland',
'TC': 'Turks & Caicos Islands',
'TD': 'Chad',
'TF': 'French Southern Territories',
'TG': 'Togo',
'TH': 'Thailand',
'TJ': 'Tajikistan',
'TK': 'Tokelau',
'TM': 'Turkmenistan',
'TN': 'Tunisia',
'TO': 'Tonga',
'TP': 'East Timor',
'TR': 'Turkey',
'TT': 'Trinidad & Tobago',
'TV': 'Tuvalu',
'TW': 'Taiwan, Province of China',
'TZ': 'Tanzania, United Republic of',
'UA': 'Ukraine',
'UG': 'Uganda',
'UM': 'United States Minor Outlying Islands',
'US': 'United States of America',
'UY': 'Uruguay',
'UZ': 'Uzbekistan',
'VA': 'Vatican City State (Holy See)',
'VC': 'St. Vincent & the Grenadines',
'VE': 'Venezuela',
'VG': 'British Virgin Islands',
'VI': 'United States Virgin Islands',
'VN': 'Viet Nam',
'VU': 'Vanuatu',
'WF': 'Wallis & Futuna Islands',
'WS': 'Samoa',
'YD': 'Democratic Yemen (no longer exists)',
'YE': 'Yemen',
'YT': 'Mayotte',
'YU': 'Yugoslavia',
'ZA': 'South Africa',
'ZM': 'Zambia',
'ZR': 'Zaire',
'ZW': 'Zimbabwe',
'ZZ': 'Unknown or unspecified country',
}
# Three letter codes
#https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Uses_and_applications
alpha3 = """ABW Aruba
AFG Afghanistan
AGO Angola
AIA Anguilla
ALA Åland Islands
ALB Albania
AND Andorra
ARE United Arab Emirates
ARG Argentina
ARM Armenia
ASM American Samoa
ATA Antarctica
ATF French Southern Territories
ATG Antigua and Barbuda
AUS Australia
AUT Austria
AZE Azerbaijan
BDI Burundi
BEL Belgium
BEN Benin
BES Bonaire, Sint Eustatius and Saba
BFA Burkina Faso
BGD Bangladesh
BGR Bulgaria
BHR Bahrain
BHS Bahamas
BIH Bosnia and Herzegovina
BLM Saint Barthélemy
BLR Belarus
BLZ Belize
BMU Bermuda
BOL Bolivia (Plurinational State of)
BRA Brazil
BRB Barbados
BRN Brunei Darussalam
BTN Bhutan
BVT Bouvet Island
BWA Botswana
CAF Central African Republic
CAN Canada
CCK Cocos (Keeling) Islands
CHE Switzerland
CHL Chile
CHN China
CIV Côte d'Ivoire
CMR Cameroon
COD Congo, Democratic Republic of the
COG Congo
COK Cook Islands
COL Colombia
COM Comoros
CPV Cabo Verde
CRI Costa Rica
CUB Cuba
CUW Curaçao
CXR Christmas Island
CYM Cayman Islands
CYP Cyprus
CZE Czechia
DEU Germany
DJI Djibouti
DMA Dominica
DNK Denmark
DOM Dominican Republic
DZA Algeria
ECU Ecuador
EGY Egypt
ERI Eritrea
ESH Western Sahara
ESP Spain
EST Estonia
ETH Ethiopia
FIN Finland
FJI Fiji
FLK Falkland Islands (Malvinas)
FRA France
FRO Faroe Islands
FSM Micronesia (Federated States of)
GAB Gabon
GBR United Kingdom of Great Britain and Northern Ireland
GEO Georgia
GGY Guernsey
GHA Ghana
GIB Gibraltar
GIN Guinea
GLP Guadeloupe
GMB Gambia
GNB Guinea-Bissau
GNQ Equatorial Guinea
GRC Greece
GRD Grenada
GRL Greenland
GTM Guatemala
GUF French Guiana
GUM Guam
GUY Guyana
HKG Hong Kong
HMD Heard Island and McDonald Islands
HND Honduras
HRV Croatia
HTI Haiti
HUN Hungary
IDN Indonesia
IMN Isle of Man
IND India
IOT British Indian Ocean Territory
IRL Ireland
IRN Iran (Islamic Republic of)
IRQ Iraq
ISL Iceland
ISR Israel
ITA Italy
JAM Jamaica
JEY Jersey
JOR Jordan
JPN Japan
KAZ Kazakhstan
KEN Kenya
KGZ Kyrgyzstan
KHM Cambodia
KIR Kiribati
KNA Saint Kitts and Nevis
KOR Korea, Republic of
KWT Kuwait
LAO Lao People's Democratic Republic
LBN Lebanon
LBR Liberia
LBY Libya
LCA Saint Lucia
LIE Liechtenstein
LKA Sri Lanka
LSO Lesotho
LTU Lithuania
LUX Luxembourg
LVA Latvia
MAC Macao
MAF Saint Martin (French part)
MAR Morocco
MCO Monaco
MDA Moldova, Republic of
MDG Madagascar
MDV Maldives
MEX Mexico
MHL Marshall Islands
MKD North Macedonia
MLI Mali
MLT Malta
MMR Myanmar
MNE Montenegro
MNG Mongolia
MNP Northern Mariana Islands
MOZ Mozambique
MRT Mauritania
MSR Montserrat
MTQ Martinique
MUS Mauritius
MWI Malawi
MYS Malaysia
MYT Mayotte
NAM Namibia
NCL New Caledonia
NER Niger
NFK Norfolk Island
NGA Nigeria
NIC Nicaragua
NIU Niue
NLD Netherlands
NOR Norway
NPL Nepal
NRU Nauru
NZL New Zealand
OMN Oman
PAK Pakistan
PAN Panama
PCN Pitcairn
PER Peru
PHL Philippines
PLW Palau
PNG Papua New Guinea
POL Poland
PRI Puerto Rico
PRK Korea (Democratic People's Republic of)
PRT Portugal
PRY Paraguay
PSE Palestine, State of
PYF French Polynesia
QAT Qatar
REU Réunion
ROU Romania
RUS Russian Federation
RWA Rwanda
SAU Saudi Arabia
SDN Sudan
SEN Senegal
SGP Singapore
SGS South Georgia and the South Sandwich Islands
SHN Saint Helena, Ascension and Tristan da Cunha
SJM Svalbard and Jan Mayen
SLB Solomon Islands
SLE Sierra Leone
SLV El Salvador
SMR San Marino
SOM Somalia
SPM Saint Pierre and Miquelon
SRB Serbia
SSD South Sudan
STP Sao Tome and Principe
SUR Suriname
SVK Slovakia
SVN Slovenia
SWE Sweden
SWZ Eswatini
SXM Sint Maarten (Dutch part)
SYC Seychelles
SYR Syrian Arab Republic
TCA Turks and Caicos Islands
TCD Chad
TGO Togo
THA Thailand
TJK Tajikistan
TKL Tokelau
TKM Turkmenistan
TLS Timor-Leste
TON Tonga
TTO Trinidad and Tobago
TUN Tunisia
TUR Turkey
TUV Tuvalu
TWN Taiwan, Province of China
TZA Tanzania, United Republic of
UGA Uganda
UKR Ukraine
UMI United States Minor Outlying Islands
URY Uruguay
USA United States of America
UZB Uzbekistan
VAT Holy See
VCT Saint Vincent and the Grenadines
VEN Venezuela (Bolivarian Republic of)
VGB Virgin Islands (British)
VIR Virgin Islands (U.S.)
VNM Viet Nam
VUT Vanuatu
WLF Wallis and Futuna
WSM Samoa
YEM Yemen
ZAF South Africa
ZMB Zambia
ZWE Zimbabwe"""
# Convert to dictionary (split each line into code and name at the first run of whitespace)
alpha3 = dict(s.split(maxsplit=1) for s in alpha3.splitlines())
# List of World Cities & Country
# cities https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv
# Online CSV File
import csv
import urllib.request
import io
def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.DictReader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
    # Return a list: a DictReader can only be iterated once, but cities is scanned on every lookup
    return list(csvfile)
url = 'https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv'
cities = csv_import(url)
Test
Excel File (Input)
country city
u.s.
DZ
AS
co Longmont
co Bogota
AL
AL Huntsville
usa
AFG
BLR Minsk
AUS
united states
Korea seoul
Korea Pyongyang
Test Code
df = pd.read_excel('country_test.xlsx') # Load Excel File
df.fillna('', inplace=True)
# Get name of country based upon country and city
df['country_'] = df.apply(lambda row: lookup(row['country'], row['city']), axis = 1)
Resulting Dataframe
country city country_
0 u.s. United States of America
1 DZ Algeria
2 AS American Samoa
3 co Longmont United States
4 co Bogota Colombia
5 AL Albania
6 AL Huntsville United States
7 usa United States of America
8 AFG Afghanistan
9 BLR Minsk Belarus
10 AUS Australia
11 united states United States of America
12 Korea seoul South Korea
13 Korea Pyongyang North Korea
Well, you can keep a JSON file of the form {key (state): values (cities belonging to that state)}, read it with Python, and match each city to its corresponding state.
One piece of advice for this approach is to create two dictionaries (i.e. dic = {'CO':'Colombia',...} and dic_state = {'CO':'Colorado',...}). Then use an if statement to check whether the country is the USA. If it is, use dic_state. Finally, you can create the new column with the appropriate command (this depends on the package/module that you are using).
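The two-dictionary idea can be sketched roughly as follows. This is not a complete solution: the lookup tables are deliberately tiny, and the us_cities set and the full_country function are hypothetical names introduced here to show one way of deciding whether an ambiguous code such as CO is a U.S. state.

```python
# Hypothetical, deliberately tiny lookup tables
dic = {'CO': 'Colombia', 'DZ': 'Algeria', 'US': 'United States'}
dic_state = {'CO': 'Colorado', 'AL': 'Alabama'}
# Hypothetical set of known U.S. cities used to resolve ambiguous codes
us_cities = {'Longmont', 'Huntsville'}

def full_country(code, city=''):
    """Return a full country name for a two-letter code, or None if unknown."""
    code = code.strip().upper()
    # Treat the code as a state if it is only a state code,
    # or if it is ambiguous and the city is a known U.S. city
    if code in dic_state and (code not in dic or city in us_cities):
        return 'United States'
    return dic.get(code)

print(full_country('co', 'Longmont'))
print(full_country('co', 'Bogota'))
```

With pandas, a function like this could be applied row by row to build the new column, at the cost of maintaining the city list yourself.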
Good luck!

NaN error from .map on a column in a dataframe

I have a dataframe with a column of state names spelled out, and I'm trying to convert them to their two-letter abbreviations. I found a separate CSV file with all the state names and converted it into a dictionary. I then tried to use that dictionary to map the column but got NaN values in my output column.
The original dataframe I had contains a column with city and state grouped together. I've split them into two separate columns and the state is the one that I'm playing around with.
Here's what my dataframe looks like after I've split them:
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York New York
1443 Los Angeles, California 3990456 Los Angeles California
3415 Chicago, Illinois 2705994 Chicago Illinois
17040 Houston, Texas 2325502 Houston Texas
665 Phoenix, Arizona 1660272 Phoenix Arizona
This is what a few rows of my dictionary looks like:
print(states_dic)
{'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID'
Here's what I've tried:
newtop50['state'] = newtop50['state'].map(states_dic)
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York NaN
1443 Los Angeles, California 3990456 Los Angeles NaN
3415 Chicago, Illinois 2705994 Chicago NaN
17040 Houston, Texas 2325502 Houston NaN
665 Phoenix, Arizona 1660272 Phoenix NaN
Not quite sure what I'm missing here?
You have explained that you have split the city_state column into city and state. For map to work, the value must be an exact match. What I speculate is that you have spaces on either side of the state series.
Try doing
newtop50['state'].str.strip().map(states_dic)
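The following sketch reproduces the suspected cause on made-up data. It assumes the city_state column was split on a bare comma (the question does not show the split), which leaves a leading space in the state part so that no dictionary key matches exactly.

```python
import pandas as pd

# Hypothetical two-entry mapping and sample column
states_dic = {'New York': 'NY', 'California': 'CA'}
city_state = pd.Series(['New York, New York', 'Los Angeles, California'])

# Splitting on ',' alone leaves a leading space: ' New York', ' California'
state = city_state.str.split(',').str[1]

# No exact key match, so every value maps to NaN
print(state.map(states_dic).tolist())
# After stripping the whitespace, the keys match
print(state.str.strip().map(states_dic).tolist())
```

Printing newtop50['state'].tolist() on the real data would show any stray spaces directly, since the quotes in the list output make them visible.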
In case you don't want to create the mapping manually (the example dictionary has missing values), you can use this module:
import us
states_dic=us.states.mapping('name', 'abbr')
df.state.map(states_dic)
11698 NY
1443 CA
3415 IL
17040 TX
665 AZ
