I have a dataframe with two columns, ["country"] and ["city"], which give each country and its cities.
I need to create a dict using a dict comprehension, with the country as the key and a list of its cities as the value (some countries have only one city, others many).
I'm able to define the keys and create a list, but every city in the dataframe appears as a value for every key. I am not able to express the condition that the value's country should match the key:
Dic = {k: list(megacities["city"]) for k,f in megacities.groupby('country')}
for k in Dic:
    print("{}:{}\n".format(k, Dic[k]))
Part of the output that I receive is:
Argentina:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Bangladesh:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Brazil:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
So basically the expected output would be:
Argentina:['Buenos Aires']
Bangladesh:['Dhaka']
Brazil:['São Paulo', 'Rio de Janeiro']
How should I write the dict comprehension, in terms of syntax, to establish that condition for the values?
Lastly, the dataframe:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6839 139.7744 Japan JP JPN Tōkyō primary 39105000 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 35362000 1360771077
2 Delhi Delhi 28.6667 77.2167 India IN IND Delhi admin 31870000 1356872604
3 Manila Manila 14.6000 120.9833 Philippines PH PHL Manila primary 23971000 1608618140
4 São Paulo Sao Paulo -23.5504 -46.6339 Brazil BR BRA São Paulo admin 22495000 1076532519
5 Seoul Seoul 37.5600 126.9900 South Korea KR KOR Seoul primary 22394000 1410836482
6 Mumbai Mumbai 19.0758 72.8775 India IN IND Mahārāshtra admin 22186000 1356226629
7 Shanghai Shanghai 31.1667 121.4667 China CN CHN Shanghai admin 22118000 1156073548
8 Mexico City Mexico City 19.4333 -99.1333 Mexico MX MEX Ciudad de México primary 21505000 1484247881
9 Guangzhou Guangzhou 23.1288 113.2590 China CN CHN Guangdong admin 21489000 1156237133
10 Cairo Cairo 30.0444 31.2358 Egypt EG EGY Al Qāhirah primary 19787000 1818253931
11 Beijing Beijing 39.9040 116.4075 China CN CHN Beijing primary 19437000 1156228865
12 New York New York 40.6943 -73.9249 United States US USA New York NaN 18713220 1840034016
13 Kolkāta Kolkata 22.5727 88.3639 India IN IND West Bengal admin 18698000 1356060520
14 Moscow Moscow 55.7558 37.6178 Russia RU RUS Moskva primary 17693000 1643318494
15 Bangkok Bangkok 13.7500 100.5167 Thailand TH THA Krung Thep Maha Nakhon primary 17573000 1764068610
16 Dhaka Dhaka 23.7289 90.3944 Bangladesh BD BGD Dhaka primary 16839000 1050529279
17 Buenos Aires Buenos Aires -34.5997 -58.3819 Argentina AR ARG Buenos Aires, Ciudad Autónoma de primary 16216000 1032717330
18 Ōsaka Osaka 34.7520 135.4582 Japan JP JPN Ōsaka admin 15490000 1392419823
19 Lagos Lagos 6.4500 3.4000 Nigeria NG NGA Lagos minor 15487000 1566593751
20 Istanbul Istanbul 41.0100 28.9603 Turkey TR TUR İstanbul admin 15311000 1792756324
21 Karachi Karachi 24.8600 67.0100 Pakistan PK PAK Sindh admin 15292000 1586129469
22 Kinshasa Kinshasa -4.3317 15.3139 Congo (Kinshasa) CD COD Kinshasa primary 15056000 1180000363
23 Shenzhen Shenzhen 22.5350 114.0540 China CN CHN Guangdong minor 14678000 1156158707
24 Bangalore Bangalore 12.9791 77.5913 India IN IND Karnātaka admin 13999000 1356410365
25 Ho Chi Minh City Ho Chi Minh City 10.8167 106.6333 Vietnam VN VNM Hồ Chí Minh admin 13954000 1704774326
26 Tehran Tehran 35.7000 51.4167 Iran IR IRN Tehrān primary 13819000 1364305026
27 Los Angeles Los Angeles 34.1139 -118.4068 United States US USA California NaN 12750807 1840020491
28 Rio de Janeiro Rio de Janeiro -22.9083 -43.1964 Brazil BR BRA Rio de Janeiro admin 12486000 1076887657
29 Chengdu Chengdu 30.6600 104.0633 China CN CHN Sichuan admin 11920000 1156421555
30 Baoding Baoding 38.8671 115.4845 China CN CHN Hebei NaN 11860000 1156256829
31 Chennai Chennai 13.0825 80.2750 India IN IND Tamil Nādu admin 11564000 1356374944
32 Lahore Lahore 31.5497 74.3436 Pakistan PK PAK Punjab admin 11148000 1586801463
33 London London 51.5072 -0.1275 United Kingdom GB GBR London, City of primary 11120000 1826645935
34 Paris Paris 48.8566 2.3522 France FR FRA Île-de-France primary 11027000 1250015082
35 Tianjin Tianjin 39.1467 117.2056 China CN CHN Tianjin admin 10932000 1156174046
36 Linyi Linyi 35.0606 118.3425 China CN CHN Shandong NaN 10820000 1156086320
37 Shijiazhuang Shijiazhuang 38.0422 114.5086 China CN CHN Hebei admin 10784600 1156217541
38 Zhengzhou Zhengzhou 34.7492 113.6605 China CN CHN Henan admin 10136000 1156183137
39 Nanyang Nanyang 32.9987 112.5292 China CN CHN Henan NaN 10013600 1156192287
Many thanks!
Try:
d = {i: g["city"].to_list() for i, g in df.groupby("country")}
print(d)
Prints:
{
"Argentina": ["Buenos Aires"],
"Bangladesh": ["Dhaka"],
"Brazil": ["São Paulo", "Rio de Janeiro"],
"China": [
"Shanghai",
"Guangzhou",
"Beijing",
"Shenzhen",
"Chengdu",
"Baoding",
"Tianjin",
"Linyi",
"Shijiazhuang",
"Zhengzhou",
"Nanyang",
],
"Congo (Kinshasa)": ["Kinshasa"],
"Egypt": ["Cairo"],
"France": ["Paris"],
"India": ["Delhi", "Mumbai", "Kolkāta", "Bangalore", "Chennai"],
"Indonesia": ["Jakarta"],
"Iran": ["Tehran"],
"Japan": ["Tokyo", "Ōsaka"],
"Mexico": ["Mexico City"],
"Nigeria": ["Lagos"],
"Pakistan": ["Karachi", "Lahore"],
"Philippines": ["Manila"],
"Russia": ["Moscow"],
"South Korea": ["Seoul"],
"Thailand": ["Bangkok"],
"Turkey": ["Istanbul"],
"United Kingdom": ["London"],
"United States": ["New York", "Los Angeles"],
"Vietnam": ["Ho Chi Minh City"],
}
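Since you are doing the groupby, you need to fetch the city column from each group (f) rather than from the whole dataframe. Note that unique() returns a NumPy array and drops duplicate names; append .tolist() if you need a plain list: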
Dic = {k: f['city'].unique() for k,f in megacities.groupby('country')}
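To make the difference concrete, here is a minimal sketch on a six-row stand-in for megacities (the sample rows are taken from the question; the names wrong, by_country and by_country_agg are only for illustration). It shows why the original comprehension repeats every city, the corrected comprehension, and an equivalent groupby/agg form:

```python
import pandas as pd

# Small stand-in for the `megacities` dataframe (only the two relevant columns).
megacities = pd.DataFrame({
    "city": ["Tokyo", "São Paulo", "Dhaka", "Buenos Aires", "Ōsaka", "Rio de Janeiro"],
    "country": ["Japan", "Brazil", "Bangladesh", "Argentina", "Japan", "Brazil"],
})

# Original attempt: the value ignores the group `f`, so every key gets all cities.
wrong = {k: list(megacities["city"]) for k, f in megacities.groupby("country")}

# Fix: take the column from the group `f`, which holds only that country's rows.
by_country = {k: f["city"].tolist() for k, f in megacities.groupby("country")}

# Equivalent result without an explicit comprehension.
by_country_agg = megacities.groupby("country")["city"].agg(list).to_dict()

for k, v in by_country.items():
    print("{}:{}".format(k, v))
```

The comprehension iterates over (key, sub-dataframe) pairs, so the only change needed is to read the column from the sub-dataframe instead of the full one.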
Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending each row horizontally, like below:
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in Pandas?
In this case we might need to use merge to restrict to the relevant rows before concat if we need to consider both city and state.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
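You can use append (a shortcut for concat) to achieve that:
result = df1.append(df2, sort=False)
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use the equivalent pd.concat call:
result = pd.concat([df1, df2], sort=False)
If your dataframes have overlapping indexes, you can use:
result = pd.concat([df1, df2], ignore_index=True, sort=False)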
Also, you can find more information in the pandas documentation.
UPDATE: After appending your dataframes, you can filter the result to keep only the rows whose city appears in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
& result['city'].isin(df2['city'])]
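For reference, the stack-then-filter approach above can be sketched end to end with pd.concat. This assumes the question's frames under the names df_1 and df_2; the final sort step and the temporary _order column are my own additions to interleave each city's rows as in the requested output, not part of the answer above:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "city": ["New York", "Chicago", "Miami", "Dallas", "Seattle"],
    "state": ["NY", "IL", "FL", "TX", "WA"],
    "salary": [85000, 65000, 75000, 78000, 96000],
})
df_2 = pd.DataFrame({
    "city": ["New York", "Chicago", "Miami"],
    "state": ["NY", "IL", "FL"],
    "taxes": [15000, 5000, 6500],
})

# Stack the frames vertically; columns missing from one frame become NaN.
result = pd.concat([df_1, df_2], ignore_index=True, sort=False)

# Keep only cities that appear in both frames.
in_both = result["city"].isin(df_1["city"]) & result["city"].isin(df_2["city"])
result = result.loc[in_both]

# Put each city's two rows next to each other, in df_1's city order.
# A stable sort keeps the salary row ahead of the taxes row.
order = {city: i for i, city in enumerate(df_1["city"])}
result = (
    result.assign(_order=result["city"].map(order))
    .sort_values("_order", kind="stable")
    .drop(columns="_order")
    .reset_index(drop=True)
)
print(result)
```

Because the salary and taxes columns each gain NaN entries, pandas stores them as floats, which matches the 85000.0 / NaN layout shown in the stack() answer.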
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1)=="salary"),
stacked.where(stacked.index.get_level_values(-1)=="taxes")],
axis=1,
keys=["salary", "taxes"]) \
.droplevel(-1) \
.reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0
This is my data frame:
City        sales
San Diego   500
Texas       400
Nebraska    300
Macau       200
Rome        100
London      50
Manchester  70
I want to add the country at the end, which will look like this:
City        sales  Country
San Diego   500    US
Texas       400    US
Nebraska    300    US
Macau       200    Hong Kong
Rome        100    Italy
London      50     England
Manchester  70     England
The countries are stored in the dictionary below:
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # Check if it's a list (any non-string iterable); if so, iterate through it
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
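As a compact variant of the same idea, here is a sketch that inverts the dictionary with a single comprehension by wrapping bare strings in a list first. The data is copied from the question; city_to_country is a name I made up for the inverted mapping:

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["San Diego", "Texas", "Nebraska", "Macau", "Rome", "London", "Manchester"],
    "sales": [500, 400, 300, 200, 100, 50, 70],
})
country = {
    "US": ["San Diego", "Texas", "Nebraska"],
    "Hong Kong": "Macau",
    "England": ["London", "Manchester"],
    "Italy": "Rome",
}

# Invert the dict: wrap bare strings in a list so every value can be iterated.
city_to_country = {
    city: name
    for name, cities in country.items()
    for city in ([cities] if isinstance(cities, str) else cities)
}

# Cities missing from the mapping would come back as NaN.
df["Country"] = df["City"].map(city_to_country)
print(df)
```

If a city were listed under two countries, the later entry would silently win, so this only suits mappings where each city appears once.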
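You can implement this using geopy, which looks each city up through the Nominatim geocoding service (so it needs network access, and geocode returns None when a name is not found).
You can install geopy with pip install geopy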
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get the country name in English
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
I have a dataframe that I'm working with that contains a column with state names spelled out, and I'm trying to convert it to the two-letter abbreviation form. I found a separate CSV file with all the state names and converted it into a dictionary. I then tried to use that dictionary to map the column, but got NaN values in my output column.
The original dataframe I had contains a column with city and state grouped together. I've split them into two separate columns and the state is the one that I'm playing around with.
Here's what my dataframe looks like after I've split them:
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York New York
1443 Los Angeles, California 3990456 Los Angeles California
3415 Chicago, Illinois 2705994 Chicago Illinois
17040 Houston, Texas 2325502 Houston Texas
665 Phoenix, Arizona 1660272 Phoenix Arizona
This is what a few entries of my dictionary look like:
print(states_dic)
{'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID'
Here's what I've tried:
newtop50['state'] = newtop50['state'].map(states_dic)
print(newtop50.head())
city_state 2018 city state
11698 New York, New York 8398748 New York NaN
1443 Los Angeles, California 3990456 Los Angeles NaN
3415 Chicago, Illinois 2705994 Chicago NaN
17040 Houston, Texas 2325502 Houston NaN
665 Phoenix, Arizona 1660272 Phoenix NaN
Not quite sure what I'm missing here?
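You have explained that you split the city_state column into city and state. For map to work, each value must exactly match a dictionary key. My guess is that the state values have leading or trailing whitespace; for example, splitting "Chicago, Illinois" on "," leaves " Illinois" with a leading space, which is not a key in the dictionary.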
Try doing
newtop50['state'].str.strip().map(states_dic)
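A small reproduction of this, as a sketch: it assumes the columns were created by splitting on "," alone (the question does not show how the split was done) and uses a three-entry stand-in for states_dic. The names parts, unmatched and fixed are illustrative only:

```python
import pandas as pd

states_dic = {"New York": "NY", "California": "CA", "Illinois": "IL"}
newtop50 = pd.DataFrame({
    "city_state": ["New York, New York", "Los Angeles, California", "Chicago, Illinois"],
})

# Assumed split: on "," alone, which leaves a leading space in the second part.
parts = newtop50["city_state"].str.split(",", expand=True)
newtop50["city"] = parts[0]
newtop50["state"] = parts[1]

# Printing the list shows the quotes, which makes the stray spaces visible.
print(newtop50["state"].tolist())

# No key matches " New York" etc., so every lookup yields NaN.
unmatched = newtop50["state"].map(states_dic)

# Stripping whitespace first makes the keys match.
fixed = newtop50["state"].str.strip().map(states_dic)
print(fixed.tolist())
```

Printing a column with .tolist() (or repr of a single value) is a quick way to spot invisible whitespace that the normal dataframe display hides.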
In case you don't want to create the mapping manually (the example dictionary is incomplete), you can use the us module:
import us
states_dic=us.states.mapping('name', 'abbr')
df.state.map(states_dic)
11698 NY
1443 CA
3415 IL
17040 TX
665 AZ