I'm working with a dataframe that has a column of fully spelled-out state names, and I'm trying to convert them to their two-letter abbreviations. I found a separate CSV file with all the state names and converted it into a dictionary. I then tried to use that dictionary to map the column, but got NaN for every value in the output column.
The original dataframe I had contains a column with city and state grouped together. I've split them into two separate columns and the state is the one that I'm playing around with.
Here's what my dataframe looks like after I've split them:
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York New York
1443 Los Angeles, California 3990456 Los Angeles California
3415 Chicago, Illinois 2705994 Chicago Illinois
17040 Houston, Texas 2325502 Houston Texas
665 Phoenix, Arizona 1660272 Phoenix Arizona
This is what a few rows of my dictionary looks like:
print(states_dic)
{'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID'
Here's what I've tried:
newtop50['state'] = newtop50['state'].map(states_dic)
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York NaN
1443 Los Angeles, California 3990456 Los Angeles NaN
3415 Chicago, Illinois 2705994 Chicago NaN
17040 Houston, Texas 2325502 Houston NaN
665 Phoenix, Arizona 1660272 Phoenix NaN
Not quite sure what I'm missing here?
You have explained that you have split the city_state column into city and state. For map to work, the value must be an exact match. What I speculate is that you have spaces on either side of the state series.
Try doing
newtop50['state'].str.strip().map(states_dic)
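To see why stray whitespace produces NaN, here is a minimal sketch. The sample frame and the two-entry dictionary are made up for illustration, and it assumes the split was done on "," alone, which leaves a leading space on the state.

```python
import pandas as pd

# Hypothetical sample: splitting "City, State" on "," keeps the space after the comma.
sample = pd.DataFrame({"city_state": ["New York, New York", "Phoenix, Arizona"]})
sample[["city", "state"]] = sample["city_state"].str.split(",", expand=True)

# Shortened stand-in for the full states dictionary.
states_dic = {"New York": "NY", "Arizona": "AZ"}

# ' New York' is not a key in the dictionary, so map() returns NaN for it.
raw = sample["state"].map(states_dic)

# Stripping whitespace first makes the keys match exactly.
cleaned = sample["state"].str.strip().map(states_dic)

# Anything still unmapped after stripping is genuinely missing from the dictionary.
unmatched = sample.loc[cleaned.isna(), "state"].unique()

print(repr(sample["state"].iloc[0]))
print(raw.tolist(), cleaned.tolist(), list(unmatched))
```

Printing a value with repr() is a quick way to spot hidden spaces, and the unmatched list tells you which names the dictionary does not cover.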
In case you don't want to create the mapping manually (the example dictionary has missing values), you can use this module:
import us
states_dic=us.states.mapping('name', 'abbr')
df.state.map(states_dic)
11698 NY
1443 CA
3415 IL
17040 TX
665 AZ
Related
I have 2 CSV files. In file1 I have a list of research group names. In file2 I have a list of the full research group names, with their locations as well. I want to join these 2 CSV files where the names have matching words in them.
I am using JupyterLab for this, and I get the pandas error ValueError: "Columns must be same length as key" on this line:
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
file1.csv has about 5,000 rows and file2.csv has about 15,000.
file1.csv
research_groups_names_f1
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
University of California, Los Angeles
Harvard University
file2.csv
research_groups_names_f2             Locatio_f2
Chinese Academy of Sciences (CAS)    China
University of Michigan (U-M)         USA
The University of Queensland (UQ)    USA
University of California             USA
file_output.csv
research_groups_names_f1                 research_groups_names_f2             Locatio_f2
Chinese Academy of Sciences              Chinese Academy of Sciences(CAS)     China
CAS                                      Chinese Academy of Sciences (CAS)    China
U-M                                      University of Michigan (U-M)         USA
UQ                                       The University of Queensland (UQ)    Australia
Harvard University                       Not found                            USA
University of California, Los Angeles    University of California             USA
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')
def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names_f1"] == row["research_groups_names_f2"]
            or row["research_groups_names_f1"] in n["research_groups_names_f2"]
        ):
            return n
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
The issue here is that you're trying to merge on some very different values. Fuzzy matching may not help because the distance between CAS and Chinese Academy of Sciences (CAS) is quite large. The two have very little in common. You'll have to develop some custom approach based on your understanding of what the possible group names could be. Here is one approach that gets you most of the way there.
The idea here is to match on the university name OR the abbreviation. So in df2 we can split off the abbreviation, explode it into a new row, and remove the parentheses; in df we remove any abbreviation surrounded by parentheses.
The only leftover value is UCLA, which is the only sample that doesn't follow the same structure as the others. In this case fuzzy matching like I mentioned in my first comment probably would help.
import pandas as pd

df = pd.DataFrame({'research_groups_names_f1': [
    'Chinese Academy of Sciences (CAS)',
    'CAS',
    'U-M',
    'UQ',
    'University of California, Los Angeles',
    'Harvard University']})
df2 = pd.DataFrame({'research_groups_names_f2': ['Chinese Academy of Sciences (CAS)',
                                                 'University of Michigan (U-M)',
                                                 'The University of Queensland (UQ)',
                                                 'University of California'],
                    'Locatio_f2': ['China', 'USA', 'USA', 'USA']})
# Build one key per row in df2: the full name and, on its own row, the abbreviation
df2['key'] = df2['research_groups_names_f2'].str.split(r'\(')
df2 = df2.explode('key')
df2['key'] = df2['key'].str.replace(r'\(|\)', '', regex=True)
# In df, drop any parenthesized abbreviation so the full name lines up with df2's key
df['key'] = df['research_groups_names_f1'].str.replace(r'\(.*\)', '', regex=True)
df.merge(df2, on='key', how='left').drop(columns='key')
Output
research_groups_names_f1 research_groups_names_f2 Locatio_f2
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) USA
3 UQ The University of Queensland (UQ) USA
4 University of California, Los Angeles NaN NaN
5 Harvard University NaN NaN
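For the rows that are still unmatched, the fuzzy-matching idea could be sketched with the standard library's difflib. The helper name closest_name and the 0.7 cutoff are my own choices for illustration, not something from the question, and the cutoff would need tuning on the real data.

```python
import difflib

# Full names copied from the sample df2 above.
candidates = [
    "Chinese Academy of Sciences (CAS)",
    "University of Michigan (U-M)",
    "The University of Queensland (UQ)",
    "University of California",
]

def closest_name(name, choices, cutoff=0.7):
    """Return the most similar entry in choices, or None if nothing reaches cutoff."""
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# The UCLA row differs from its target only by the ", Los Angeles" suffix.
ucla = closest_name("University of California, Los Angeles", candidates)

# Harvard has no counterpart in the candidates, so it should stay unmatched.
harvard = closest_name("Harvard University", candidates)

print(ucla, harvard)
```

A fairly high cutoff is deliberate here: many names share the word "University", so a loose threshold would start pairing unrelated institutions.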
I have this df and I want to split it:
cities3 = {'Metropolitan': ['New York', 'Los Angeles', 'San Francisco'],
'NHL': ['RangersIslandersDevils', 'KingsDucks', 'Sharks']}
cities4 = pd.DataFrame(cities3)
to get a new df in which each NHL team is on its own row.
What code can I use?
You can split your column based on an upper-case letter preceded by a lower-case one using this regex:
(?<=[a-z])(?=[A-Z])
and then you can use the technique described in this answer to replace the column with its exploded version:
cities4 = cities4.assign(NHL=cities4['NHL'].str.split(r'(?<=[a-z])(?=[A-Z])')).explode('NHL')
Output:
Metropolitan NHL
0 New York Rangers
0 New York Islanders
0 New York Devils
1 Los Angeles Kings
1 Los Angeles Ducks
2 San Francisco Sharks
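To check what that regex does on its own, here is a small sketch using only the re module. Both lookarounds are zero-width, so the split happens between a lower-case and an upper-case letter without consuming either one (splitting on empty matches needs Python 3.7 or later).

```python
import re

# Zero-width boundary: a lower-case letter behind, an upper-case letter ahead.
boundary = re.compile(r"(?<=[a-z])(?=[A-Z])")

three = boundary.split("RangersIslandersDevils")
two = boundary.split("KingsDucks")
one = boundary.split("Sharks")  # no boundary inside, so the name stays whole

print(three, two, one)
```

Keep in mind that any name with a capital directly after a lower-case letter would also be split at that point, so it is worth eyeballing the result on the full data.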
If you want to reset the index (to 0..5) you can do this (either after the above command or as a part of it)
cities4.reset_index().reindex(cities4.columns, axis=1)
Output:
Metropolitan NHL
0 New York Rangers
1 New York Islanders
2 New York Devils
3 Los Angeles Kings
4 Los Angeles Ducks
5 San Francisco Sharks
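As a side note, two shorter spellings should give the same 0..5 index. This is a sketch on the sample data; explode's ignore_index argument assumes pandas 1.1 or newer.

```python
import pandas as pd

cities4 = pd.DataFrame({
    "Metropolitan": ["New York", "Los Angeles", "San Francisco"],
    "NHL": ["RangersIslandersDevils", "KingsDucks", "Sharks"],
})

split_teams = cities4["NHL"].str.split(r"(?<=[a-z])(?=[A-Z])")
exploded = cities4.assign(NHL=split_teams).explode("NHL")

# Approach from the answer: reset the index, then keep only the original columns.
via_reindex = exploded.reset_index().reindex(exploded.columns, axis=1)

# Alternative 1: drop the old index while resetting.
via_drop = exploded.reset_index(drop=True)

# Alternative 2: let explode renumber the rows itself.
via_explode = cities4.assign(NHL=split_teams).explode("NHL", ignore_index=True)

print(via_drop)
```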
In the following, can I make a single index entry for all the rows that share a common index?
import pandas as pd

cric = pd.Series(['India', 'Pakistan', 'South Africa', 'England', 'New Zealand'],
                 index=['Cricket', 'Cricket', 'Cricket', 'Cricket', 'Cricket'])
ftbl = pd.Series(['England', 'South Africa', 'Australia', 'Netherlands', 'New Zealand'],
                 index=['Football', 'Football', 'Football', 'Football', 'Football'])
hock = pd.Series(['India', 'Pakistan', 'South Korea', 'England', 'India', 'New Zealand'],
                 index=['Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey'])
# Series.append was removed in pandas 2.0; pd.concat stacks the same pieces
# (ftbl appears twice, as in the original sequence of appends)
all_countries_1 = pd.concat([cric, ftbl, ftbl, hock])
all_countries_1 = all_countries_1.to_frame()
all_countries_1.columns = ['Countries']
all_countries_1
I want the following as my output
Is this what you are looking for?
# zip the first three chars of the index and the index together
z = list(zip(all_countries_1.index.str[:3], all_countries_1.index))
# create multi index
idx = pd.MultiIndex.from_tuples(z)
# assign index
all_countries_1.index = idx
               Countries
Cri Cricket    India
    Cricket    Pakistan
    Cricket    South Africa
    Cricket    England
    Cricket    New Zealand
Foo Football   England
    Football   South Africa
    Football   Australia
    Football   Netherlands
    Football   New Zealand
    Football   England
    Football   South Africa
    Football   Australia
    Football   Netherlands
    Football   New Zealand
Hoc Hockey     India
    Hockey     Pakistan
    Hockey     South Korea
    Hockey     England
    Hockey     India
    Hockey     New Zealand
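The same two-level index can also be built without zip, using MultiIndex.from_arrays. This is a sketch on a shortened, made-up frame; the level names "code" and "sport" are my own additions.

```python
import pandas as pd

# Shortened stand-in for all_countries_1.
frame = pd.Series(
    ["India", "Pakistan", "England"],
    index=["Cricket", "Cricket", "Football"],
).to_frame("Countries")

# Outer level: first three characters of the label; inner level: the label itself.
frame.index = pd.MultiIndex.from_arrays(
    [frame.index.str[:3], frame.index],
    names=["code", "sport"],  # hypothetical level names
)

# Selecting on the outer level returns every row of that group.
cricket_rows = frame.loc["Cri"]
print(frame)
```

Naming the levels is optional, but it makes later calls such as reset_index or groupby(level="sport") easier to read.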
If, by single index, you mean an index made of auto-incrementing numbers, there is nothing special you have to do. That is the default index for a DataFrame, so the reset_index() method will get you what you want. The next step will probably be to rename the column that used to be the index. You can chain rename with reset_index and take care of it in one line.
all_countries_1 = all_countries_1.reset_index().rename(columns={"index":"Sports"})
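An equivalent spelling, sketched on a small made-up frame, names the index first with rename_axis so that reset_index produces the "Sports" column directly.

```python
import pandas as pd

# Small stand-in for all_countries_1.
frame = pd.DataFrame(
    {"Countries": ["India", "England", "Pakistan"]},
    index=["Cricket", "Football", "Hockey"],
)

# Variant from the answer: reset first, then rename the generated "index" column.
via_rename = frame.reset_index().rename(columns={"index": "Sports"})

# Alternative: name the index first, so reset_index uses that name for the new column.
via_axis = frame.rename_axis("Sports").reset_index()

print(via_axis)
```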
I have a data frame that looks like below:
City State Country
Chicago IL United States
Boston
San Diego CA United States
Los Angeles CA United States
San Francisco
Sacramento
Vancouver BC Canada
Toronto
I have 3 lists that contain all the missing values:
city_list = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state_list = ['MA', 'CA', 'CA', 'ON']
country_list = ['United States', 'United States', 'United States', 'Canada']
And here's my ideal result:
City State Country
Chicago IL United States
Boston MA United States
San Diego CA United States
Los Angeles CA United States
San Francisco CA United States
Sacramento CA United States
Vancouver BC Canada
Toronto ON Canada
I used a potential method that's suggested by a helpful person, but I've been scratching my head and couldn't figure out what went wrong. And here's the code:
state_dict = dict(zip(city_list, state_list))
country_dict = dict(zip(city_list, country_list))
df = df.set_index('City')
df['State'] = df['State'].map(state_dict)
df['Country'] = df['Country'].map(country_dict)
df.reset_index()
print(df.City, df.State, df.Country)
But every cell of the State and Country columns comes back as NaN:
City State Country
Chicago NaN NaN
Boston NaN NaN
San Diego NaN NaN
Los Angeles NaN NaN
San Francisco NaN NaN
Sacramento NaN NaN
Vancouver NaN NaN
Toronto NaN NaN
What went wrong here? And how would you change the code? Thanks.
I think that map should be called on the 'City' rather than 'State' field, like so:
df['State'] = df['City'].map(state_dict)
However, this has the problem that it overwrites any original 'State' values for cities which are not in your dictionary - e.g. 'Chicago'. One solution that gets around this is the following syntactically clumsier (but I believe correct) code:
df['State'] = df.apply(lambda x: state_dict[x['City']] if x['City'] in state_dict else x['State'], axis=1)
And it'll be the same idea for the country field.
I should add that this only works if you do not first set 'City' as index as you have in your example.
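A shorter variant of the same idea is to look the state up by city and use it only where the existing value is missing. This sketch assumes the blanks are real missing values (NaN/None) rather than empty strings, and that 'City' is still an ordinary column.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Chicago", "Boston", "San Diego", "Los Angeles",
             "San Francisco", "Sacramento", "Vancouver", "Toronto"],
    "State": ["IL", None, "CA", "CA", None, None, "BC", None],
    "Country": ["United States", None, "United States", "United States",
                None, None, "Canada", None],
})

city_list = ["Boston", "San Francisco", "Sacramento", "Toronto"]
state_list = ["MA", "CA", "CA", "ON"]
country_list = ["United States", "United States", "United States", "Canada"]

state_dict = dict(zip(city_list, state_list))
country_dict = dict(zip(city_list, country_list))

# map() looks up each City; fillna() uses that lookup only where the value is missing.
df["State"] = df["State"].fillna(df["City"].map(state_dict))
df["Country"] = df["Country"].fillna(df["City"].map(country_dict))

print(df)
```

The original attempt returned NaN everywhere because map() was called on the 'State' and 'Country' values, which are never keys of the city-based dictionaries.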
I have a dataframe that contains a column of cities. I am looking to match the city with its region. For example, San Francisco would be West.
Here is my original dataframe:
data = {'city': ['San Francisco', 'New York', 'Chicago', 'Philadelphia', 'Boston'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
                     city  year  reports
Cochice     San Francisco  2012        4
Pima             New York  2012       24
Santa Cruz        Chicago  2013       31
Maricopa     Philadelphia  2014        2
Yuma               Boston  2014        3
Here I pull data that contains region by state. However, it does not contain city.
pd.read_csv('https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv')
How do I get the state per city? That way I can then join the original dataframe including state with the second dataframe that has region.
In this GitHub project there is a CSV that, according to its creator, contains all American cities and states.
The following data is presented:
City|State short name|State full name|County|City Alias Mixed Case
Example:
San Francisco|CA|California|SAN FRANCISCO|San Francisco
San Francisco|CA|California|SAN MATEO|San Francisco Intnl Airport
San Francisco|CA|California|SAN MATEO|San Francisco
San Francisco|CA|California|SAN FRANCISCO|Presidio
San Francisco|CA|California|SAN FRANCISCO|Bank Of America
San Francisco|CA|California|SAN FRANCISCO|Wells Fargo Bank
San Francisco|CA|California|SAN FRANCISCO|First Interstate Bank
San Francisco|CA|California|SAN FRANCISCO|Uc San Francisco
San Francisco|CA|California|SAN FRANCISCO|Union Bank Of California
San Francisco|CA|California|SAN FRANCISCO|Irs Service Center
San Francisco|CA|California|SAN FRANCISCO|At & T
San Francisco|CA|California|SAN FRANCISCO|Pacific Gas And Electric
Sacramento|CA|California|SACRAMENTO|Sacramento
Sacramento|CA|California|SACRAMENTO|Ca Franchise Tx Brd Brm
Sacramento|CA|California|SACRAMENTO|Ca State Govt Brm
I suggest you parse the above file to extract the info you need (in this case, the state for a given city) and then join it with the region from the other CSV you have.
Better still, build your own table from all the CSVs you have access to, containing only the info you really need.
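The parse-then-join idea could be sketched like this. The pipe-delimited rows are taken from the example above; the column names I assign to them, the 'State Code' and 'Region' column names of the regions table, and the small reports frame are all assumptions for illustration, so check them against the real files.

```python
import io
import pandas as pd

# A few rows in the pipe-delimited layout shown above.
city_rows = io.StringIO(
    "San Francisco|CA|California|SAN FRANCISCO|San Francisco\n"
    "San Francisco|CA|California|SAN MATEO|San Francisco Intnl Airport\n"
    "Sacramento|CA|California|SACRAMENTO|Sacramento\n"
)
cols = ["city", "state_code", "state", "county", "alias"]  # assumed names
cities = pd.read_csv(city_rows, sep="|", names=cols)

# Each city appears once per alias, so keep one row per (city, state) pair.
city_to_state = cities[["city", "state_code"]].drop_duplicates()

# Stand-in for the census regions CSV; the column names are assumed.
regions = pd.DataFrame({"State Code": ["CA"], "Region": ["West"]})

# Hypothetical reports frame.
df = pd.DataFrame({"city": ["San Francisco", "Sacramento"], "reports": [4, 7]})

out = (
    df.merge(city_to_state, on="city", how="left")
      .merge(regions, left_on="state_code", right_on="State Code", how="left")
)
print(out[["city", "state_code", "Region", "reports"]])
```

One caveat: a city name alone is not unique across states, so on the full file a name can map to several states and the first merge would then duplicate rows. You would need some extra rule to pick the intended state.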