Python get a list of cities, states, region - python

I have a dataframe that contains a column of cities. I am looking to match the city with its region. For example, San Francisco would be West.
Here is my original dataframe:
import pandas as pd

data = {'city': ['San Francisco', 'New York', 'Chicago', 'Philadelphia', 'Boston'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
                     city  year  reports
Cochice     San Francisco  2012        4
Pima             New York  2012       24
Santa Cruz        Chicago  2013       31
Maricopa     Philadelphia  2014        2
Yuma               Boston  2014        3
Here I pull data that contains region by state. However, it does not contain city.
pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')
How do I get the state per city? That way I can then join the original dataframe including state with the second dataframe that has region.

On this GitHub project there is a CSV that the creator claims contains all American cities and states.
The following data is presented:
City|State short name|State full name|County|City Alias Mixed Case
Example:
San Francisco|CA|California|SAN FRANCISCO|San Francisco
San Francisco|CA|California|SAN MATEO|San Francisco Intnl Airport
San Francisco|CA|California|SAN MATEO|San Francisco
San Francisco|CA|California|SAN FRANCISCO|Presidio
San Francisco|CA|California|SAN FRANCISCO|Bank Of America
San Francisco|CA|California|SAN FRANCISCO|Wells Fargo Bank
San Francisco|CA|California|SAN FRANCISCO|First Interstate Bank
San Francisco|CA|California|SAN FRANCISCO|Uc San Francisco
San Francisco|CA|California|SAN FRANCISCO|Union Bank Of California
San Francisco|CA|California|SAN FRANCISCO|Irs Service Center
San Francisco|CA|California|SAN FRANCISCO|At & T
San Francisco|CA|California|SAN FRANCISCO|Pacific Gas And Electric
Sacramento|CA|California|SACRAMENTO|Sacramento
Sacramento|CA|California|SACRAMENTO|Ca Franchise Tx Brd Brm
Sacramento|CA|California|SACRAMENTO|Ca State Govt Brm
I suggest you parse the above file to extract the info you need (in this case, the state for a given city) and then match it with the region in the other CSV you have.
Better still would be to create your own table from all the CSVs you have access to, containing only the info you really need.
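A minimal sketch of that two-step lookup is shown below. It reads a few of the pipe-delimited sample lines above from an in-memory string rather than from the real file, and the region table is a one-row stand-in: its column names (State, State Code, Region, Division) are an assumption about the census CSV, so check them against the file you download.

```python
import io
import pandas as pd

# A few of the sample lines from the city/state file, read from memory.
city_csv = io.StringIO(
    "City|State short name|State full name|County|City Alias Mixed Case\n"
    "San Francisco|CA|California|SAN FRANCISCO|San Francisco\n"
    "San Francisco|CA|California|SAN MATEO|San Francisco\n"
    "Sacramento|CA|California|SACRAMENTO|Sacramento\n"
)
cities = pd.read_csv(city_csv, sep="|")

# Keep one row per (city, state) pair; the file repeats cities per alias.
city_to_state = (
    cities[["City", "State short name"]]
    .drop_duplicates()
    .rename(columns={"City": "city", "State short name": "state"})
)

# Stand-in for the region CSV; the column names here are assumed.
regions = pd.DataFrame({
    "State": ["California"],
    "State Code": ["CA"],
    "Region": ["West"],
    "Division": ["Pacific"],
})

df = pd.DataFrame({"city": ["San Francisco", "Sacramento"]})

# city -> state, then state -> region.
result = (
    df.merge(city_to_state, on="city", how="left")
      .merge(regions[["State Code", "Region"]],
             left_on="state", right_on="State Code", how="left")
)
print(result[["city", "state", "Region"]])
```

Note that a city name alone is not unique across states (there are many Springfields), so a left merge on city can produce several rows per city; if your data has any hint of the state, use it to disambiguate.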

Related

Draw a Map of cities in python

I have a ranking of cities across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is draw a map of the world where those cities are colored according to their position in the ranking above. I am open to other ways of representing this (such as bubbles whose size depends on the city's position in the ranking or, if necessary, showing only a sample of cities taken from the top, the middle and the bottom of the ranking).
Thank you,
Federico
Your question has two parts; finding the location of each city and then drawing them on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with its position in the list determining the size of that circle.
import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

# Start zoomed out so the whole world is visible.
m = folium.Map(location=[20, 0], zoom_start=2)

for counter, city in enumerate(cities):
    # Circles grow with the city's position in the list.
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
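One possible way to make that adjustment is sketched below: a small helper that scales the radius to a fixed range regardless of how many cities are ranked, and also returns a colour so that rank can be shown by colour as the question asked. The function name rank_to_style, the radius range and the red-to-blue colour scheme are all illustrative choices, not part of Folium. Unlike the snippet above, this sketch gives the top-ranked city the largest circle.

```python
def rank_to_style(position, total, min_radius=3.0, max_radius=15.0):
    """Return (radius, hex_colour) for the city at 0-based `position`
    in a ranking of `total` cities. Hypothetical helper for illustration."""
    if total < 1 or not 0 <= position < total:
        raise ValueError("position must satisfy 0 <= position < total")

    # fraction is 1.0 for the top-ranked city and 0.0 for the last one.
    fraction = 1.0 if total == 1 else 1.0 - position / (total - 1)

    # Linear interpolation between the smallest and largest radius.
    radius = min_radius + fraction * (max_radius - min_radius)

    # Blend from blue (bottom of the ranking) to red (top of the ranking).
    red = round(255 * fraction)
    blue = 255 - red
    return radius, f"#{red:02x}00{blue:02x}"


top_radius, top_colour = rank_to_style(0, 5)
last_radius, last_colour = rank_to_style(4, 5)
print(top_radius, top_colour, last_radius, last_colour)
```

The returned values could then be passed as the radius, color and fill_color arguments of folium.CircleMarker inside the loop.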

Split a row into more rows based on a string (regex)

I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df with one row per team.
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
    Metropolitan        NHL
0       New York    Rangers
0       New York  Islanders
0       New York     Devils
1    Los Angeles      Kings
1    Los Angeles      Ducks
2  San Francisco     Sharks
If you want to reset the index (to 0..5), you can do this, either after the above command or as part of it:
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
    Metropolitan        NHL
0       New York    Rangers
1       New York  Islanders
2       New York     Devils
3    Los Angeles      Kings
4    Los Angeles      Ducks
5  San Francisco     Sharks
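For reference, the whole thing can be written as one self-contained chain. This is a sketch of the same approach, using reset_index(drop=True) as a shorter alternative for renumbering the rows; the variable names boundary and result are just for illustration.

```python
import pandas as pd

cities4 = pd.DataFrame({
    "Metropolitan": ["New York", "Los Angeles", "San Francisco"],
    "NHL": ["RangersIslandersDevils", "KingsDucks", "Sharks"],
})

# Zero-width pattern: the position between a lower-case letter
# and the upper-case letter that follows it.
boundary = r"(?<=[a-z])(?=[A-Z])"

result = (
    cities4.assign(NHL=cities4["NHL"].str.split(boundary))  # lists of team names
    .explode("NHL")                                         # one row per team
    .reset_index(drop=True)                                 # renumber rows 0..n-1
)
print(result)
```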

NaN error from .map on a column in a dataframe

I have a dataframe that contains a column with state names spelled out, and I'm trying to convert it to the two-letter abbreviation form. I found a separate CSV file with all the state names and converted it into a dictionary. I then tried to use that dictionary to map the column but got NaN values in my output column.
The original dataframe I had contains a column with city and state grouped together. I've split them into two separate columns and the state is the one that I'm playing around with.
Here's what my dataframe looks like after I've split them:
print(newtop50.head())
                    city_state     2018         city       state
11698       New York, New York  8398748     New York    New York
1443   Los Angeles, California  3990456  Los Angeles  California
3415         Chicago, Illinois  2705994      Chicago    Illinois
17040           Houston, Texas  2325502      Houston       Texas
665           Phoenix, Arizona  1660272      Phoenix     Arizona
This is what a few entries of my dictionary look like:
print(states_dic)
{'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID'
Here's what I've tried:
newtop50['state'] = newtop50['state'].map(states_dic)
print(newtop50.head())
                    city_state     2018         city state
11698       New York, New York  8398748     New York   NaN
1443   Los Angeles, California  3990456  Los Angeles   NaN
3415         Chicago, Illinois  2705994      Chicago   NaN
17040           Houston, Texas  2325502      Houston   NaN
665           Phoenix, Arizona  1660272      Phoenix   NaN
Not quite sure what I'm missing here?
You have explained that you have split the city_state column into city and state. For map to work, the value must be an exact match. What I speculate is that you have spaces on either side of the state series.
Try doing
newtop50['state'].str.strip().map(states_dic)
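To see why this is a plausible cause, here is a small sketch that assumes the column was split on the comma alone, which leaves a leading space in front of each state name. The three-row frame and the shortened dictionary are illustrative, not the asker's real data.

```python
import pandas as pd

states_dic = {"New York": "NY", "California": "CA", "Illinois": "IL"}

newtop50 = pd.DataFrame({
    "city_state": ["New York, New York",
                   "Los Angeles, California",
                   "Chicago, Illinois"],
})

# Splitting on "," keeps the space after the comma: " New York", " California", ...
parts = newtop50["city_state"].str.split(",", expand=True)
newtop50["city"] = parts[0]
newtop50["state"] = parts[1]

# No key in states_dic starts with a space, so every lookup misses.
before = newtop50["state"].map(states_dic)

# Stripping whitespace first makes the keys match exactly.
after = newtop50["state"].str.strip().map(states_dic)

print(before.tolist())
print(after.tolist())
```

Printing repr(newtop50['state'].iloc[0]) on the real data is a quick way to check whether stray whitespace is actually present.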
In case you don't want to create the mapping manually (the example dictionary is incomplete), you can use the us module:
import us
states_dic = us.states.mapping('name', 'abbr')
newtop50['state'].str.strip().map(states_dic)
11698 NY
1443 CA
3415 IL
17040 TX
665 AZ

Columns returning NaN with dictionary mapping

I have a data frame that looks like below:
City           State  Country
Chicago        IL     United States
Boston
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco
Sacramento
Vancouver      BC     Canada
Toronto
I have 3 lists that contain all the missing values:
city_list = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state_list = ['MA', 'CA', 'CA', 'ON']
country_list = ['United States', 'United States', 'United States', 'Canada']
And here's my ideal result:
City           State  Country
Chicago        IL     United States
Boston         MA     United States
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco  CA     United States
Sacramento     CA     United States
Vancouver      BC     Canada
Toronto        ON     Canada
I used a potential method that's suggested by a helpful person, but I've been scratching my head and couldn't figure out what went wrong. And here's the code:
state_dict = dict(zip(city_list, state_list))
country_dict = dict(zip(city_list, country_list))
df = df.set_index('City')
df['State'] = df['State'].map(state_dict)
df['Country'] = df['Country'].map(country_dict)
df.reset_index()
print(df.City, df.State, df.Country)
But every cell of the State and Country columns returns NaN.
City           State  Country
Chicago        NaN    NaN
Boston         NaN    NaN
San Diego      NaN    NaN
Los Angeles    NaN    NaN
San Francisco  NaN    NaN
Sacramento     NaN    NaN
Vancouver      NaN    NaN
Toronto        NaN    NaN
What went wrong here? And how would you change the code? Thanks.
I think that map should be called on the 'City' rather than 'State' field, like so:
df['State'] = df['City'].map(state_dict)
However, this has the problem that it overwrites any original 'State' values for cities which are not in your dictionary - e.g. 'Chicago'. One solution that gets around this is the following syntactically clumsier (but I believe correct) code:
df['State'] = df.apply(lambda x: state_dict[x['City']] if x['City'] in state_dict else x['State'], axis=1)
And it'll be the same idea for the country field.
I should add that this only works if you do not first set 'City' as index as you have in your example.
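An alternative that avoids the row-wise lambda is to map the city to its new value and use that only to fill the gaps. The sketch below assumes the empty cells are real missing values (NaN) rather than empty strings, and it uses a shortened version of the data for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Boston", "San Diego", "Toronto"],
    "State": ["IL", np.nan, "CA", np.nan],
    "Country": ["United States", np.nan, "United States", np.nan],
})

state_dict = {"Boston": "MA", "Toronto": "ON"}
country_dict = {"Boston": "United States", "Toronto": "Canada"}

# Look values up by City, but only use them where the column is missing,
# so existing entries such as Chicago's 'IL' are left untouched.
df["State"] = df["State"].fillna(df["City"].map(state_dict))
df["Country"] = df["Country"].fillna(df["City"].map(country_dict))

print(df)
```

If the blanks are empty strings instead, they would need to be converted to NaN first, for example with df.replace('', np.nan).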

Apply fuzzy matching across a dataframe column and save results in a new column

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set.
df1 =
Company                              City       State  ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE  St. Louis  MO     63101
CITYARCHRIVER 2015 FOUNDATION        St. Louis  MO     63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE  St. Louis  MO     63102
LACKEY SHEET METAL                   St. Louis  MO     63102
and
df2 =
FDA Company                    FDA City    FDA State  FDA ZIP
LACKEY SHEET METAL             St. Louis   MO         63102
PRIMUS STERILIZER COMPANY LLC  Great Bend  KS         67530
HELGET GAS PRODUCTS INC        Omaha       NE         68127
ORTHOQUEST LLC                 La Vista    NE         68128
I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string in df1['Company'] to each string in df2['FDA Company'] using several different matching functions from the fuzzywuzzy module and return the name of the best match and its score. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved in a new column in combined_data. The results would look like
combined_data =
Company City State ZIP FDA Company FDA City FDA State FDA ZIP fuzzy.token_sort_ratio match fuzzy.ratio match
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 LACKEY SHEET METAL St. Louis MO 63102 LACKEY SHEET METAL 100 LACKEY SHEET METAL 100
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 PRIMUS STERILIZER COMPANY LLC Great Bend KS 67530
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 HELGET GAS PRODUCTS INC Omaha NE 68127
LACKEY SHEET METAL St. Louis MO 63102 ORTHOQUEST LLC La Vista NE 68128
I tried doing
combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)
But I got an error because the lengths of the columns are different.
I am stumped. How can I accomplish this?
I couldn't tell what you were doing. This is how I would do it.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Create a series of tuples to compare:
compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
Create a special function to calculate fuzzy metrics and return a series.
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
Apply metrics to the compare series
compare.apply(metrics)
There are a bunch of ways to do this next part:
Get closest matches to each row of df1
compare.apply(metrics).unstack().idxmax().unstack(0)
Get closest matches to each row of df2
compare.apply(metrics).unstack(0).idxmax().unstack(0)
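If the goal is simply a new column holding the best match and its score for every row of df1, a plain loop over the candidates may be easier to follow. The sketch below swaps in difflib.SequenceMatcher from the standard library as the scorer so that it runs without fuzzywuzzy installed; its scores will not necessarily equal those of fuzz.ratio or fuzz.token_sort_ratio, and best_match is a hypothetical helper name. Replace the scoring line with the fuzzywuzzy function you want.

```python
from difflib import SequenceMatcher

import pandas as pd


def best_match(name, candidates):
    """Return (candidate, score) for the candidate most similar to `name`.
    Scores are 0-100. difflib is a stand-in for a fuzzywuzzy scorer."""
    scored = [
        (cand, round(100 * SequenceMatcher(None, name, cand).ratio()))
        for cand in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


df1 = pd.DataFrame({"Company": ["LACKEY SHEET METAL",
                                "CITYARCHRIVER 2015 FOUNDATION"]})
df2 = pd.DataFrame({"FDA Company": ["LACKEY SHEET METAL",
                                    "PRIMUS STERILIZER COMPANY LLC",
                                    "HELGET GAS PRODUCTS INC"]})

# Compare every df1 company against all df2 companies; the two frames
# may have different lengths because nothing is paired row by row.
matches = [best_match(name, df2["FDA Company"]) for name in df1["Company"]]
df1["best_match"] = [m[0] for m in matches]
df1["best_score"] = [m[1] for m in matches]

print(df1)
```

This compares every pair of names, so the work grows with len(df1) * len(df2); for large tables some form of pre-filtering (for example by state or ZIP) would likely be needed.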
