Search pandas dataframe and edit values - python

I want to replace a missing country with the country that corresponds to the city, i.e. find another data point with the same city and copy its country; if there is no other record with the same city, then remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()

You can group by the three columns city, lat, and lng, and fill missing values with the first non-NaN value in each group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # first_valid_index() may legitimately be 0, so test against None, not truthiness
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
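The question also asks to remove rows whose country cannot be recovered from another record; a minimal follow-up step, assuming the remaining NaN values mark exactly those rows:
df = df.dropna(subset=['country'])  # drops the Kigoma and Dodoma rows above
As a side note, GroupBy.transform('first') already returns the first non-NaN value per group, so df.groupby(['city', 'lat', 'lng'])['country'].transform('first') is an equivalent, shorter spelling of the lambda.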

I've solved a similar problem with the geopy package.
You can use the lat and lng values and then filter the geopy output for the country. This way you avoid NaNs and always get an answer based on geo information.
Install the package with pip:
pip3 install geopy
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])
    lon = str(df.iloc[i, 3])
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the user_agent API; for exercise purposes this value should work.
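One caveat: Nominatim's free endpoint throttles aggressive clients, so looping over a large dataframe can fail midway. A sketch using geopy's RateLimiter wrapper, assuming a one-second delay is acceptable and the coordinate columns are named lat and lng:
from geopy.extra.rate_limiter import RateLimiter

reverse = RateLimiter(geo.reverse, min_delay_seconds=1)  # at most one request per second
df['country'] = [reverse(f"{lat},{lng}").raw['address']['country']
                 for lat, lng in zip(df['lat'], df['lng'])]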

Related

Find a specific value in another column per group; if it does not exist in that group, delete that group

I am having trouble deleting some records from a dataframe. I want to group by a certain column, check whether each group contains a specific value in another column, and if that value does not exist, delete the whole group from the first column (the column we grouped by). My data looks like this:
search_value_per_group  Column_to_Be_Grouped
Pakistan                Ehsan
Saudi Arab              Irshad
Pakistan                Ayesha
India                   Ehsan
Switzerland             Ehsan
Nigeria                 Ehsan
Saudi Arabia            Ayesha
UK                      Ayesha
Pakistan                Zohan
Afghanistan             Zohan
Iraq                    Zohan
Iran                    Zohan
USA                     Zohan
Netherland              Irshad
Switzerland             Irshad
India                   Irshad
I want to delete every group that does not contain Pakistan. For example, I will delete all Irshad rows from Column_to_Be_Grouped because no Irshad row has Pakistan. My desired output looks like this:
search_value_per_group  Column_to_Be_Grouped
Pakistan                Ehsan
Pakistan                Ayesha
India                   Ehsan
Switzerland             Ehsan
Nigeria                 Ehsan
Saudi Arabia            Ayesha
UK                      Ayesha
Pakistan                Zohan
Afghanistan             Zohan
Iraq                    Zohan
Iran                    Zohan
USA                     Zohan
Get all matched groups and filter them by boolean indexing with Series.isin:
groups = df.loc[df['search_value_per_group'].eq('Pakistan'),'Column_to_Be_Grouped']
df1 = df[df['Column_to_Be_Grouped'].isin(groups)]
print (df1)
search_value_per_group Column_to_Be_Grouped
0 Pakistan Ehsan
2 Pakistan Ayesha
3 India Ehsan
4 Switzerland Ehsan
5 Nigeria Ehsan
6 Saudi Arabia Ayesha
7 UK Ayesha
8 Pakistan Zohan
9 Afghanistan Zohan
10 Iraq Zohan
11 Iran Zohan
12 USA Zohan
You can use GroupBy.transform('any') to generate a boolean Series for boolean indexing:
out = df[df['search_value_per_group'].eq('Pakistan')
         .groupby(df['Column_to_Be_Grouped']).transform('any')]
output:
search_value_per_group Column_to_Be_Grouped
0 Pakistan Ehsan
2 Pakistan Ayesha
3 India Ehsan
4 Switzerland Ehsan
5 Nigeria Ehsan
6 Saudi Arabia Ayesha
7 UK Ayesha
8 Pakistan Zohan
9 Afghanistan Zohan
10 Iraq Zohan
11 Iran Zohan
12 USA Zohan
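If you prefer to state the condition at group level, GroupBy.filter is an equivalent, though usually slower, spelling; a sketch on the same frame:
out = df.groupby('Column_to_Be_Grouped').filter(
    lambda g: g['search_value_per_group'].eq('Pakistan').any()  # keep groups containing Pakistan
)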

Compare two dataframe in pandas and edit results

I have two dataframes, df1 and df2. df1 contains correct data that will be used to match data in df2.
I want to find latitudes and longitudes in df2 that don't match the City name in df1.
Also I want to find cities in df2 that are "located" in the wrong country
Here's df1 dataframe
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394 United States
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
Here's df2 dataframe
id location city country
16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada
16310427-5502 52.0000, 84.9833 Belokurikha Russia
16501010-4957 -14.4667, 136.2667 Katherine Australia
16110430-8679 40.5626, -74.5743 Finderne United States
16990624-4174 27.7526, -90.7394 South Pasadena China
16790311-9092 35.98157, -160.41182 Jiangshan United States
16650927-9151 44.7667, 39.8667 West Islip Russia
16530328-2221 -22.8858, -48.4450 Botucatu Brazil
16411229-7314 42.8821, -71.1709 Hampstead United States
16060229-4175 -7.7296, 38.9500 Kibiti Tanzania
Here's my code so far:
city_df = pd.merge(df1,df2,on ='city',how ='left')
First add lat and lng columns to df2
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
Then merge df1 to df2 based on cities
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))
Find cities in df2 that are "located" in the wrong country
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
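The other half of the question, coordinates that don't match the city, can be answered from the same merged frame; a sketch, assuming exact float equality is good enough here (np.isclose would be more robust):
# flag rows whose coordinates differ from df1's reference coordinates
m2 = city_df['lat_correct'].notna() & (
    (city_df['lat'] != city_df['lat_correct']) | (city_df['lng'] != city_df['lng_correct'])
)
print(city_df[m2])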
To compare the two dataframes, it's easier to first put df1 and df2 into similar formats. For example, df1 would look like this:
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
And df2 :
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
Then you can use the pd.concat method on axis=1 as follows:
df3 = pd.concat([df1, df2], axis=1)
in order to get the following df:
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
Finally, from the concatenated df3 you can get the rows where latitudes and longitudes in df2 don't match the city name in df1:
df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
To find cities in df2 that are "located" in the wrong country:
df3[df3['country']!=df3['country2']]
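As an aside, comparing parsed floats with != can misfire; a safer sketch of the coordinate check using np.isclose, flagging a row when either coordinate differs:
import numpy as np

both = df3.dropna()  # keep only cities present in both frames
print(both[~(np.isclose(both['lat'], both['lat2'])
             & np.isclose(both['lng'], both['lng2']))])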

How to Groupby columns(ignore order) in Pandas DataFrame?

I have a pandas dataframe (4 of 8 columns):
df = pd.DataFrame({
    "departure_country": ["Mexico", "Mexico", "United States", "United States", "United States", "United States", "Japan", "United States", "United States", "United States"],
    "departure_city": ["Guadalajara", "Guadalajara", "New York", "Chicago", "Los Angeles", "Michigan", "Tokyo", "New York", "New York", "Chicago"],
    "destination_country": ["United States", "United States", "United States", "United States", "Mexico", "United States", "United States", "Mexico", "United States", "Japan"],
    "destination_city": ["Los Angeles", "Los Angeles", "Chicago", "New York", "Guadalajara", "New York", "Chicago", "Guadalajara", "Michigan", "Tokyo"]
})
df
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States New York United States Chicago
3 United States Chicago United States New York
4 United States Los Angeles Mexico Guadalajara
5 United States Michigan United States New York
6 Japan Tokyo United States Chicago
7 United States New York Mexico Guadalajara
8 United States New York United States Michigan
9 United States Chicago Japan Tokyo
I want to analyze the data in each group, so I would like to first group by "the same pair" of departure and destination, something like:
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States Los Angeles Mexico Guadalajara
3 United States New York United States Chicago
4 United States Chicago United States New York
5 United States Michigan United States New York
6 United States New York United States Michigan
7 Japan Tokyo United States Chicago
8 United States Chicago Japan Tokyo
9 United States New York Mexico Guadalajara
Is it possible to do this in a DataFrame? I have tried groupby and key-value approaches, but failed.
I'd really appreciate your help with this, thanks!
I'm sure someone could think of a better-optimized solution, but one way is to create sorted tuples of your country/city pairs and sort by them:
print(df.assign(country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
                city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()])
        .sort_values(["country", "city"], ascending=False)
        .filter(like="_"))
departure_country departure_city destination_country destination_city
5 United States Michigan United States New York
8 United States New York United States Michigan
2 United States New York United States Chicago
3 United States Chicago United States New York
7 United States New York Mexico Guadalajara
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
4 United States Los Angeles Mexico Guadalajara
6 Japan Tokyo United States Chicago
9 United States Chicago Japan Tokyo
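If you need the actual groups rather than just a sorted view, the same order-insensitive tuples work as groupby keys; a sketch reusing the assignment above:
keys = df.assign(country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
                 city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()])
for (country, city), group in keys.groupby(["country", "city"]):
    print(country, city, len(group))  # each departure/destination pair, ignoring direction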

Issue with joining repeated values/rows

New to Python; I can't seem to understand how to proceed.
After binning and editing my dataframe I was able to come up with this:
Continents % Renewable Country
0 Asia (15.753, 29.227] China
1 North America (2.212, 15.753] United States
2 Asia (2.212, 15.753] Japan
3 Europe (2.212, 15.753] United Kingdom
4 Europe (15.753, 29.227] Russian Federation
5 North America (56.174, 69.648] Canada
6 Europe (15.753, 29.227] Germany
7 Asia (2.212, 15.753] India
8 Europe (15.753, 29.227] France
9 Asia (2.212, 15.753] South Korea
10 Europe (29.227, 42.701] Italy
11 Europe (29.227, 42.701] Spain
12 Asia (2.212, 15.753] Iran
13 Australia (2.212, 15.753] Australia
14 South America (56.174, 69.648] Brazil
Now I set Continents and % Renewable as a multiindex using:
Top15 = Top15.groupby(by=['Continents', '% Renewable']).sum()
to get the following:
Country
Continents % Renewable
Asia (15.753, 29.227] China
(2.212, 15.753] JapanIndiaSouth KoreaIran
Australia (2.212, 15.753] Australia
Europe (15.753, 29.227] Russian FederationGermanyFrance
(2.212, 15.753] United Kingdom
(29.227, 42.701] ItalySpain
North America (2.212, 15.753] United States
(56.174, 69.648] Canada
South America (56.174, 69.648] Brazil
Now I would like a column giving the number of countries in each index, i.e. in the 1st row China = 1, and in the 2nd row JapanIndiaSouth KoreaIran would be 4.
So in the end I want something like this:
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
I just don't know how to get there.
Also, the numbers need to be sorted in descending order, while still keeping the index grouping in place.
Top15.groupby(['Continents', '% Renewable']).Country.count()
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
To sort in the order you'd like:
Top15_count = Top15.groupby(['Continents', '% Renewable']).Country.count()
Top15_count.reset_index() \
    .sort_values(['Continents', 'Country'], ascending=[True, False]) \
    .set_index(['Continents', '% Renewable']).Country
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
Solution with size:
What is the difference between size and count in pandas?
print (Top15.groupby(['Continents', '% Renewable']).size())
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
dtype: int64
Use sort_values if you need to change the order; for a dataframe add reset_index, and finally, if you need the MultiIndex back, add set_index:
print(Top15.groupby(['Continents', '% Renewable'])
           .size()
           .reset_index(name='COUNT')
           .sort_values(['Continents', 'COUNT'], ascending=[True, False])
           .set_index(['Continents', '% Renewable']).COUNT)
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: COUNT, dtype: int64
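On pandas 1.1 and later, DataFrame.value_counts produces the same counts in a single call, though it sorts by count globally rather than within each continent; a sketch, assuming Top15 still has one row per country:
print(Top15.value_counts(subset=['Continents', '% Renewable']))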

Generate an indented multiindex

Country %Renewable
China (15.753, 29.227]
United States (2.212, 15.753]
Japan (2.212, 15.753]
United Kingdom (2.212, 15.753]
Russian Federation (15.753, 29.227]
Canada (56.174, 69.648]
Germany (15.753, 29.227]
India (2.212, 15.753]
France (15.753, 29.227]
South Korea (2.212, 15.753]
Italy (29.227, 42.701]
Spain (29.227, 42.701]
Iran (2.212, 15.753]
Australia (2.212, 15.753]
Brazil (56.174, 69.648]
I have this dataframe and I want a Series with a multi-index 'Continent' --> '% Renewable'. I know I can use groupby; the problem is that I am not sure how to build the sub-multiindex correctly, or how to deal with categoricals.
example result series:
Continent % Renewable Country
Europe (2.212, 15.753] ['France', 'United Kingdom', 'Russian Federation']
(15.753, 29.227] ['Germany', 'France']
(29.227, 42.701] ['Italy', 'Spain']
Asia (2.212, 15.753] ['India', 'South Korea', 'Iran', 'Japan', 'Iran']
(15.753, 29.227] ['China']
Oceania (2.212, 15.753] ['Australia']
North America (2.212, 15.753] ['United States']
(56.174, 69.648] ['Canada']
South America (56.174, 69.648] ['Brazil']
ContinentDict = {'China': 'Asia',
                 'United States': 'North America',
                 'Japan': 'Asia',
                 'United Kingdom': 'Europe',
                 'Russian Federation': 'Europe',
                 'Canada': 'North America',
                 'Germany': 'Europe',
                 'India': 'Asia',
                 'France': 'Europe',
                 'South Korea': 'Asia',
                 'Italy': 'Europe',
                 'Spain': 'Europe',
                 'Iran': 'Asia',
                 'Australia': 'Australia',
                 'Brazil': 'South America'}
This is the dict for the country-to-continent conversion.
You can use replace to do the continent mapping, and then tolist to get the list of values for each group:
In [53]: df['Continent'] = df.Country.replace(ContinentDict)
In [55]: df.groupby(['Continent', '%Renewable']).apply(lambda x: x.Country.tolist())
Out[55]:
Continent %Renewable
Asia (15.753, 29.227] [China]
(2.212, 15.753] [Japan, India, South Korea, Iran]
Australia (2.212, 15.753] [Australia]
Europe (15.753, 29.227] [Russian Federation, Germany, France]
(2.212, 15.753] [United Kingdom]
(29.227, 42.701] [Italy, Spain]
North America (2.212, 15.753] [United States]
(56.174, 69.648] [Canada]
South America (56.174, 69.648] [Brazil]
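On recent pandas versions the same result reads more directly as a column aggregation; a sketch, with output matching the apply version above:
df['Continent'] = df.Country.replace(ContinentDict)
print(df.groupby(['Continent', '%Renewable'])['Country'].agg(list))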
