I have two dataframes, df1 and df2. df1 contains the correct data, which will be used to validate the data in df2.
I want to find latitudes and longitudes in df2 that don't match the city name in df1. I also want to find cities in df2 that are "located" in the wrong country.
Here's the df1 dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394 United States
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
Here's the df2 dataframe:
id location city country
16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada
16310427-5502 52.0000, 84.9833 Belokurikha Russia
16501010-4957 -14.4667, 136.2667 Katherine Australia
16110430-8679 40.5626, -74.5743 Finderne United States
16990624-4174 27.7526, -90.7394 South Pasadena China
16790311-9092 35.98157, -160.41182 Jiangshan United States
16650927-9151 44.7667, 39.8667 West Islip Russia
16530328-2221 -22.8858, -48.4450 Botucatu Brazil
16411229-7314 42.8821, -71.1709 Hampstead United States
16060229-4175 -7.7296, 38.9500 Kibiti Tanzania
Here's my code so far:
city_df = pd.merge(df1, df2, on='city', how='left')
First, add lat and lng columns to df2:
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
Then merge df1 into df2 on the city column:
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how='right', suffixes=('_correct', ''))
Finally, find the cities in df2 that are "located" in the wrong country:
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
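The same merged frame answers the first part of the question as well. A minimal sketch, using np.isclose rather than exact float equality, that flags rows where either coordinate disagrees with the reference value from df1:
import numpy as np

# Flag rows where either coordinate disagrees with df1's reference values;
# the notna() guard skips cities that have no reference row in df1 at all.
coord_mismatch = ~(
    np.isclose(city_df['lat'], city_df['lat_correct'])
    & np.isclose(city_df['lng'], city_df['lng_correct'])
) & city_df['lat_correct'].notna()
print(city_df[coord_mismatch])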
To compare the two dataframes, it's easier to first put df1 and df2 into similar formats. For example, df1 would look like this:
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
And df2:
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
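A sketch of how you might reshape the original frames into those two layouts (the names lat2, lng2 and country2 are just the labels chosen for this example):
df1 = df1.set_index('city')[['lat', 'lng', 'country']]

# Split the combined location string into floats, then mirror df1's shape
df2[['lat2', 'lng2']] = df2['location'].str.split(', ', expand=True).astype(float)
df2 = (df2.rename(columns={'country': 'country2'})
          .set_index('city')[['country2', 'lng2', 'lat2']])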
Then you can use pd.concat on axis=1 as follows:
df3 = pd.concat([df1, df2], axis=1)
in order to get the following df:
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
Finally, from the concatenated df3 you can get the rows where latitudes and longitudes in df2 don't match the city name in df1 (note the |, since a row is wrong if either coordinate differs):
df3[(df3['lat'] != df3['lat2']) | (df3['lng'] != df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
To find cities in df2 that are "located" in the wrong country:
df3[df3['country']!=df3['country2']]
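Keep in mind that NaN != NaN, so this expression also returns every row that exists in only one of the two frames; if you only want genuine country conflicts, chain .dropna() as in the coordinate check above:
df3[df3['country'] != df3['country2']].dropna()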
I want to replace a missing country with the country that corresponds to the city, i.e. find another record with the same city and copy its country; if there is no other record with the same city, remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()
You can group by the three columns city, lat and lng, and fill the missing values with the first non-NaN value in each group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # 'is not None' matters here: first_valid_index() can legitimately return 0
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
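A simpler route to the same fill, if I'm not mistaken, is transform('first'), which already skips NaNs within each group; the final dropna handles the "remove if no other record" part of the question:
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform('first')
)
df = df.dropna(subset=['country'])  # drop cities with no usable country anywhere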
I've solved this kind of problem with the geopy package. You can reverse-geocode the lat and lng, then take the country from the geopy output. This way you avoid NaNs and always get an answer based on geo information.
Install the package with:
pip3 install geopy
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])  # lat column
    lon = str(df.iloc[i, 3])  # lng column
    # Reverse-geocode the coordinates and take the country from the address
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the user_agent parameter; for exercise purposes this value should work.
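Nominatim's usage policy also throttles requests, so for more than a handful of rows geopy's RateLimiter helper is worth adding. A sketch, assuming the same lat/lng columns as above:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geo = Nominatim(user_agent="geoapiExercises")
reverse = RateLimiter(geo.reverse, min_delay_seconds=1)  # stay within the usage policy

df['country'] = [
    reverse(f"{lat},{lng}").raw['address']['country']
    for lat, lng in zip(df['lat'], df['lng'])
]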
I have a pandas dataframe (4 of 8 columns shown):
df = pd.DataFrame({
    "departure_country": ["Mexico", "Mexico", "United States", "United States", "United States",
                          "United States", "Japan", "United States", "United States", "United States"],
    "departure_city": ["Guadalajara", "Guadalajara", "New York", "Chicago", "Los Angeles",
                       "Michigan", "Tokyo", "New York", "New York", "Chicago"],
    "destination_country": ["United States", "United States", "United States", "United States", "Mexico",
                            "United States", "United States", "Mexico", "United States", "Japan"],
    "destination_city": ["Los Angeles", "Los Angeles", "Chicago", "New York", "Guadalajara",
                         "New York", "Chicago", "Guadalajara", "Michigan", "Tokyo"],
})
df
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States New York United States Chicago
3 United States Chicago United States New York
4 United States Los Angeles Mexico Guadalajara
5 United States Michigan United States New York
6 Japan Tokyo United States Chicago
7 United States New York Mexico Guadalajara
8 United States New York United States Michigan
9 United States Chicago Japan Tokyo
I want to analyze the data in each group, so I would first like to group by "the same pair" of departure and destination, something like:
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States Los Angeles Mexico Guadalajara
3 United States New York United States Chicago
4 United States Chicago United States New York
5 United States Michigan United States New York
6 United States New York United States Michigan
7 Japan Tokyo United States Chicago
8 United States Chicago Japan Tokyo
9 United States New York Mexico Guadalajara
Is it possible to do this in a DataFrame? I have tried groupby and key-value approaches, but failed.
Really appreciate your help with this, thanks!
I'm sure someone could think of a better-optimized solution, but one way is to create sorted tuples of your country/city pairs and sort by them:
print(df.assign(country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
                city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()])
        .sort_values(["country", "city"], ascending=False)
        .filter(like="_"))
departure_country departure_city destination_country destination_city
5 United States Michigan United States New York
8 United States New York United States Michigan
2 United States New York United States Chicago
3 United States Chicago United States New York
7 United States New York Mexico Guadalajara
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
4 United States Los Angeles Mexico Guadalajara
6 Japan Tokyo United States Chicago
9 United States Chicago Japan Tokyo
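If the goal is to actually group on the unordered pair (not just sort by it), one option, sketched under the same tuple idea, is to keep the helper columns and pass them to groupby:
pairs = df.assign(
    country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
    city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()],
)
# e.g. count the trips per unordered route
print(pairs.groupby(["country", "city"]).size())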
I have the following pandas series:
Reducedset['% Renewable']
Which gives me:
Asia China 19.7549
Japan 10.2328
India 14.9691
South Korea 2.27935
Iran 5.70772
North America United States 11.571
Canada 61.9454
Europe United Kingdom 10.6005
Russian Federation 17.2887
Germany 17.9015
France 17.0203
Italy 33.6672
Spain 37.9686
Australia Australia 11.8108
South America Brazil 69.648
Name: % Renewable, dtype: object
I then cut this series into 5 bins:
binning = pd.cut(Reducedset['% Renewable'], 5)
Which gives me:
Asia China (15.753, 29.227]
Japan (2.212, 15.753]
India (2.212, 15.753]
South Korea (2.212, 15.753]
Iran (2.212, 15.753]
North America United States (2.212, 15.753]
Canada (56.174, 69.648]
Europe United Kingdom (2.212, 15.753]
Russian Federation (15.753, 29.227]
Germany (15.753, 29.227]
France (15.753, 29.227]
Italy (29.227, 42.701]
Spain (29.227, 42.701]
Australia Australia (2.212, 15.753]
South America Brazil (56.174, 69.648]
Name: % Renewable, dtype: category
Categories (5, interval[float64]): [(2.212, 15.753] < (15.753, 29.227] < (29.227, 42.701] <
(42.701, 56.174] < (56.174, 69.648]]
I then grouped this binned data in order to calculate the number of countries in each bin:
Reduced = Reducedset.groupby(binning)['% Renewable'].agg(['count'])
Which gives me:
% Renewable
(2.212, 15.753] 7
(15.753, 29.227] 4
(29.227, 42.701] 2
(42.701, 56.174] 0
(56.174, 69.648] 2
Name: count, dtype: int64
However, the index has disappeared and I still want to keep the index for the 'continents' (the outer index).
Thus, on the very left of the (% Renewable) column it should say:
Asia
North America
Europe
Australia
South America
When I try doing it like this:
print(Reducedset['% Renewable']
      .groupby([Reducedset['% Renewable'].index.get_level_values(0),
                pd.cut(Reducedset['% Renewable'], 5)])
      .count())
It works!
Problem solved!
Let's assume the following data:
import numpy as np
import pandas as pd

np.random.seed(1)
s = pd.Series(np.random.randint(0, 10, 16),
              index=pd.MultiIndex.from_arrays([list('aaaabbccdddddeee'),
                                               list('abcdefghijklmnop')]))
What you are looking for, IIUC, is then:
print(s.groupby([s.index.get_level_values(0), #that is the continent for you
pd.cut(s, 5)]) #that is the binning you created
.count())
a (-0.009, 1.8] 0
(1.8, 3.6] 0
(3.6, 5.4] 2
(5.4, 7.2] 0
(7.2, 9.0] 2
b (-0.009, 1.8] 2
(1.8, 3.6] 0
(3.6, 5.4] 0
(5.4, 7.2] 0
(7.2, 9.0] 0
c (-0.009, 1.8] 1
(1.8, 3.6] 0
(3.6, 5.4] 0
(5.4, 7.2] 1
(7.2, 9.0] 0
d (-0.009, 1.8] 0
(1.8, 3.6] 1
(3.6, 5.4] 2
(5.4, 7.2] 1
(7.2, 9.0] 1
e (-0.009, 1.8] 0
(1.8, 3.6] 2
(3.6, 5.4] 1
(5.4, 7.2] 0
(7.2, 9.0] 0
dtype: int64
I have a column (of type Series) from a DataFrame (energy["Energy Supply"]) like this:
Country
China 127191
United States 90838
Japan 18984
United Kingdom 7920
Russian Federation 30709
Canada 10431
Germany 13261
India 33195
France 10597
South Korea 11007
Italy 6530
Spain 4923
Iran NaN
Australia 5386
Brazil 12149
Name: Energy Supply, dtype: object
Currently it is of type object.
The NaN values came from this code:
peta = row["Energy Supply"]
if peta == "...":
    # pd.to_numeric(row["Energy Supply"], errors='coerce')
    row["Energy Supply"] = np.NaN
The commented line works in a similar way.
I don't understand why this Series is now of type object.
I checked each numeric value, and they are all of type float.
I would want the whole Series to be of type float or float64.
I tried to convert the whole series again into numeric by doing:
energy["Energy Supply"] = pd.to_numeric(energy["Energy Supply"], errors='coerce')
but after that, this series was changed into:
Country
China NaN
United States NaN
Japan NaN
United Kingdom NaN
Russian Federation 30709.0
Canada 10431.0
Germany 13261.0
India 33195.0
France NaN
South Korea NaN
Italy NaN
Spain NaN
Iran NaN
Australia NaN
Brazil 12149.0
Name: Energy Supply, dtype: float64
I wonder why the values 127191, 90838 were converted into NaN, while 30709, 10431 remained as numbers?
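Without seeing the raw file it is hard to say for certain, but to_numeric only coerces values it cannot parse, so the NaNs suggest those particular entries are not plain digit strings (stray whitespace or hidden footnote characters are common in scraped data). A quick way to inspect them, as a sketch:
# Print the exact representation of each value before conversion
for country, value in energy["Energy Supply"].items():
    print(country, repr(value), type(value))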
New to python, can't seem to understand how to proceed.
After binning and editing my dataframe I was able to come up with this:
Continents % Renewable Country
0 Asia (15.753, 29.227] China
1 North America (2.212, 15.753] United States
2 Asia (2.212, 15.753] Japan
3 Europe (2.212, 15.753] United Kingdom
4 Europe (15.753, 29.227] Russian Federation
5 North America (56.174, 69.648] Canada
6 Europe (15.753, 29.227] Germany
7 Asia (2.212, 15.753] India
8 Europe (15.753, 29.227] France
9 Asia (2.212, 15.753] South Korea
10 Europe (29.227, 42.701] Italy
11 Europe (29.227, 42.701] Spain
12 Asia (2.212, 15.753] Iran
13 Australia (2.212, 15.753] Australia
14 South America (56.174, 69.648] Brazil
Now I set Continents and % Renewable as a MultiIndex using:
Top15 = Top15.groupby(by=['Continents', '% Renewable']).sum()
to get the following:
Country
Continents % Renewable
Asia (15.753, 29.227] China
(2.212, 15.753] JapanIndiaSouth KoreaIran
Australia (2.212, 15.753] Australia
Europe (15.753, 29.227] Russian FederationGermanyFrance
(2.212, 15.753] United Kingdom
(29.227, 42.701] ItalySpain
North America (2.212, 15.753] United States
(56.174, 69.648] Canada
South America (56.174, 69.648] Brazil
Now I would like to have a column giving the number of countries in each index, i.e.:
in the 1st row, China = 1,
and in the 2nd row, JapanIndiaSouth KoreaIran would be 4.
So in the end I want something like this:
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
I just don't know how to get there.
Also, the numbers need to be sorted in descending order, while still keeping the index grouping in place.
Top15.groupby(['Continents', '% Renewable']).Country.count()
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
To sort in the order you'd like:
Top15_count = Top15.groupby(['Continents', '% Renewable']).Country.count()
(Top15_count.reset_index()
            .sort_values(['Continents', 'Country'], ascending=[True, False])
            .set_index(['Continents', '% Renewable']).Country)
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
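An alternative sketch that avoids the reset_index/set_index round trip is to sort within each continent group:
Top15_count = Top15.groupby(['Continents', '% Renewable']).Country.count()
print(Top15_count.groupby(level=0, group_keys=False)
                 .apply(lambda s: s.sort_values(ascending=False)))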
Solution with size (see: What is the difference between size and count in pandas?):
print(Top15.groupby(['Continents', '% Renewable']).size())
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
dtype: int64
Use sort_values if you need to change the order; for a DataFrame add reset_index, and finally, if you need the MultiIndex back, add set_index:
print(Top15.groupby(['Continents', '% Renewable'])
           .size()
           .reset_index(name='COUNT')
           .sort_values(['Continents', 'COUNT'], ascending=[True, False])
           .set_index(['Continents', '% Renewable']).COUNT)
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: COUNT, dtype: int64