Country %Renewable
China (15.753, 29.227]
United States (2.212, 15.753]
Japan (2.212, 15.753]
United Kingdom (2.212, 15.753]
Russian Federation (15.753, 29.227]
Canada (56.174, 69.648]
Germany (15.753, 29.227]
India (2.212, 15.753]
France (15.753, 29.227]
South Korea (2.212, 15.753]
Italy (29.227, 42.701]
Spain (29.227, 42.701]
Iran (2.212, 15.753]
Australia (2.212, 15.753]
Brazil (56.174, 69.648]
I have this dataframe. I want a Series with a MultiIndex of 'Continent' --> '% Renewable'. I know I can use groupby; the problem is that I'm not sure how to build the inner level of the MultiIndex correctly, and how to deal with categoricals.
example result series:
Continent % Renewable Country
Europe (2.212, 15.753] ['France', 'United Kingdom', 'Russian Federation']
(15.753, 29.227] ['Germany', 'France']
(29.227, 42.701] ['Italy', 'Spain']
Asia (2.212, 15.753] ['India', 'South Korea', 'Iran', 'Japan', 'Iran']
(15.753, 29.227] ['China']
Oceania (2.212, 15.753] ['Australia']
North America (2.212, 15.753] ['United States']
(56.174, 69.648] ['Canada']
South America (56.174, 69.648] ['Brazil']
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
This is the dict for converting country to continent.
You can use replace to do the continent mapping, and then tolist to get the list of values for each group:
In [53]: df['Continent'] = df.Country.replace(ContinentDict)
In [55]: df.groupby(['Continent', '%Renewable']).apply(lambda x: x.Country.tolist())
Out[55]:
Continent %Renewable
Asia (15.753, 29.227] [China]
(2.212, 15.753] [Japan, India, South Korea, Iran]
Australia (2.212, 15.753] [Australia]
Europe (15.753, 29.227] [Russian Federation, Germany, France]
(2.212, 15.753] [United Kingdom]
(29.227, 42.701] [Italy, Spain]
North America (2.212, 15.753] [United States]
(56.174, 69.648] [Canada]
South America (56.174, 69.648] [Brazil]
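On a recent pandas, the same idea can be sketched in a self-contained way. This is a minimal sketch on toy data, not the answer's exact code: it assumes `map` in place of `replace` (equivalent for a plain dict lookup) and uses `observed=True` so the empty categorical bins never show up in the result.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["China", "Japan", "Canada", "Brazil"],
    "%Renewable": [19.75, 10.23, 61.95, 69.65],
})
ContinentDict = {"China": "Asia", "Japan": "Asia",
                 "Canada": "North America", "Brazil": "South America"}

# map is enough for a plain dict lookup (replace also works)
df["Continent"] = df["Country"].map(ContinentDict)
df["bin"] = pd.cut(df["%Renewable"], 2)  # categorical bins

# observed=True keeps only (Continent, bin) pairs that actually occur,
# which matters because pd.cut produces a categorical column
out = df.groupby(["Continent", "bin"], observed=True)["Country"].apply(list)
print(out)
```

With the old default `observed=False`, a categorical grouper can also emit the unobserved bin combinations, which is exactly the zeros problem discussed further down this page.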
Related
I have two dataframes, df1 and df2. df1 contains correct data that will be used to match data in df2.
I want to find latitudes and longitudes in df2 that don't match the City name in df1.
Also I want to find cities in df2 that are "located" in the wrong country
Here's df1 dataframe
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394 United States
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
Here's df2 dataframe
id location city country
16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada
16310427-5502 52.0000, 84.9833 Belokurikha Russia
16501010-4957 -14.4667, 136.2667 Katherine Australia
16110430-8679 40.5626, -74.5743 Finderne United States
16990624-4174 27.7526, -90.7394 South Pasadena China
16790311-9092 35.98157, -160.41182 Jiangshan United States
16650927-9151 44.7667, 39.8667 West Islip Russia
16530328-2221 -22.8858, -48.4450 Botucatu Brazil
16411229-7314 42.8821, -71.1709 Hampstead United States
16060229-4175 -7.7296, 38.9500 Kibiti Tanzania
Here's my code so far:
city_df = pd.merge(df1,df2,on ='city',how ='left')
First add lat and lng columns to df2
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
Then merge df1 to df2 based on cities
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how ='right', suffixes=('_correct', ''))
Find cities in df2 that are "located" in the wrong country
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
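The mask above covers the wrong-country check; for the coordinate part of the question, a similar mask works on the merged frame. A hedged sketch on a toy two-row frame (the `*_correct` column names mirror the merge above, and `np.isclose` with a tolerance avoids exact float comparison):

```python
import numpy as np
import pandas as pd

# toy merged frame: *_correct columns come from df1, plain columns from df2
city_df = pd.DataFrame({
    "city": ["Katherine", "Hampstead"],
    "lat_correct": [-14.4667, 42.8821], "lng_correct": [132.2667, -71.1709],
    "lat":         [-14.4667, 42.8821], "lng":         [136.2667, -71.1709],
})

# a row is suspect if either coordinate differs beyond the tolerance
bad = ~(np.isclose(city_df["lat_correct"], city_df["lat"], atol=1e-4)
        & np.isclose(city_df["lng_correct"], city_df["lng"], atol=1e-4))
print(city_df.loc[bad, "city"].tolist())  # Katherine's lng is off by 4 degrees
```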
To compare the two data frames, it's easier to first get df1 and df2 into similar formats. For example, df1 would look like this:
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
And df2 :
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
Then you can use the pd.concat method on axis=1 as follows:
df3 = pd.concat([df1, df2], axis=1)
in order to get the following df:
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
Finally, from the concatenated df3, you can get the rows where the latitudes and longitudes in df2 don't match the city name in df1 (note that & flags rows where both coordinates differ; use | instead if a mismatch in either coordinate should count):
df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
To find cities in df2 that are "located" in the wrong country:
df3[df3['country']!=df3['country2']]
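A self-contained sketch of the concat approach, with two-row toy frames indexed by city (this relies on city names being unique in each frame, an assumption the axis=1 alignment needs):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"lat": [-14.4667, 42.8821], "lng": [132.2667, -71.1709],
     "country": ["Australia", "United States"]},
    index=pd.Index(["Katherine", "Hampstead"], name="city"))
df2 = pd.DataFrame(
    {"country2": ["Australia", "Russia"],
     "lng2": [136.2667, -71.1709], "lat2": [-14.4667, 42.8821]},
    index=pd.Index(["Katherine", "Hampstead"], name="city"))

df3 = pd.concat([df1, df2], axis=1)  # aligns rows on the shared city index

wrong_country = df3[df3["country"] != df3["country2"]]
print(wrong_country.index.tolist())  # Hampstead is listed under Russia in df2
```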
I want to replace a missing country with the country that corresponds to the city, i.e. find another data point with the same city and copy its country; if there is no other record with the same city, then remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()
You can group by the three columns city, lat, and lng, and fill the missing values with the first non-NaN value in each group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # first_valid_index() can legitimately return 0, so test against None
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
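A shorter equivalent worth knowing (a sketch on toy data, not the answer's exact code): GroupBy's 'first' aggregation already skips nulls, so `transform('first')` fills each group with its first known country, and rows whose city is never matched anywhere can then be dropped:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dīla", "Dīla", "Kigoma"],
    "lat":  [6.4104, 6.4104, 21.1072],
    "lng":  [38.3100, 38.3100, -76.1367],
    "country": [None, "Ethiopia", None],
})

# 'first' returns the first non-null value per group, so no lambda is needed
df["country"] = df.groupby(["city", "lat", "lng"])["country"].transform("first")
df = df.dropna(subset=["country"])  # Kigoma has no record with a known country
print(df)
```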
I've solved such a problem with the geopy package.
You can reverse-geocode the lat and long, then read the country out of the geopy output. This way you will avoid NaNs and always get an answer based on geo information.
pip3 install geopy
to install the geopy package.
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])
    lon = str(df.iloc[i, 3])
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the Nominatim user_agent policy; for exercise purposes this key should work.
I'm working on an assignment of Applied Data Science.
Question:
Cut % Renewable into 5 bins. Group Top15 by the Continent, as well as these new % Renewable bins. How many countries are in each of these groups?
This function should return a Series with a MultiIndex of Continent, then the bins for % Renewable. Do not include groups with no countries.
This is my code:
def answer_twelve():
Top15 = answer_one()
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Top15['Continent'] = Top15.index.to_series().map(ContinentDict)
Top15['bins'] = pd.cut(Top15['% Renewable'],5)
return pd.Series(Top15.groupby(by = ['Continent', 'bins']).size())#,apply(lambda x:s if x['Rank']==0 continue))
answer_twelve()
This is my output for the above code
Continent bins
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
(29.227, 42.701] 0
(42.701, 56.174] 0
(56.174, 69.648] 0
Australia (2.212, 15.753] 1
(15.753, 29.227] 0
(29.227, 42.701] 0
(42.701, 56.174] 0
(56.174, 69.648] 0
Europe (2.212, 15.753] 1
(15.753, 29.227] 3
(29.227, 42.701] 2
(42.701, 56.174] 0
(56.174, 69.648] 0
North America (2.212, 15.753] 1
(15.753, 29.227] 0
(29.227, 42.701] 0
(42.701, 56.174] 0
(56.174, 69.648] 1
South America (2.212, 15.753] 0
(15.753, 29.227] 0
(29.227, 42.701] 0
(42.701, 56.174] 0
(56.174, 69.648] 1
dtype: int64
Required output is
Continent bins
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (2.212, 15.753] 1
(15.753, 29.227] 3
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Countries, dtype: int64
I want to get rid of the zeros. I tried using
pd.Series(Top15.groupby(by = ['Continent', 'bins']).size().apply(lambda x:s if x['Rank']==0 continue))
But I keep getting the following error
File "<ipython-input-317-14bc05bb2137>", line 20
return pd.Series(Top15.groupby(by = ['Continent', 'bins']).size().apply(lambda x:s if x['Rank']==0 continue))
^
SyntaxError: invalid syntax
I'm unable to figure out my mistake. Please help me!
Use pandas and just drop the rows where the value is zero.
if column_name is your column:
df = df[df.column_name != 0]
lambda x:s if x['Rank']==0 continue
This doesn't make any sense, as continue is useful only within a loop.
Note that a lambda must return a value for every input. Instead, return a blank:
lambda x:"" if x['Rank']==0 else s
You can use a 'for' loop to iterate through the values, and then use replace() to replace 0's with NaN;
now you can just drop them with dropna().
I tried using drop() or droplevel() instead of replacing them, but it didn't work. Here is my code:
for k, v in pd_series.items():
    if v == 0:
        pd_series.replace(to_replace=v, value=np.nan, inplace=True)
pd_series.dropna(axis=0, inplace=True)
print(pd_series)
You might need to change the dtype of the result. The output was:
Continent bins
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (2.212, 15.753] 1
(15.753, 29.227] 3
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
dtype: int64
I have the following pandas series:
Reducedset['% Renewable']
Which gives me:
Asia China 19.7549
Japan 10.2328
India 14.9691
South Korea 2.27935
Iran 5.70772
North America United States 11.571
Canada 61.9454
Europe United Kingdom 10.6005
Russian Federation 17.2887
Germany 17.9015
France 17.0203
Italy 33.6672
Spain 37.9686
Australia Australia 11.8108
South America Brazil 69.648
Name: % Renewable, dtype: object
I then cut this series into 5 bins:
binning = pd.cut(Top15['% Renewable'],5)
Which gives me:
Asia China (15.753, 29.227]
Japan (2.212, 15.753]
India (2.212, 15.753]
South Korea (2.212, 15.753]
Iran (2.212, 15.753]
North America United States (2.212, 15.753]
Canada (56.174, 69.648]
Europe United Kingdom (2.212, 15.753]
Russian Federation (15.753, 29.227]
Germany (15.753, 29.227]
France (15.753, 29.227]
Italy (29.227, 42.701]
Spain (29.227, 42.701]
Australia Australia (2.212, 15.753]
South America Brazil (56.174, 69.648]
Name: % Renewable, dtype: category
Categories (5, interval[float64]): [(2.212, 15.753] < (15.753, 29.227] < (29.227, 42.701] <
(42.701, 56.174] < (56.174, 69.648]]
I then grouped this binned data in order to calculate the number of countries in each bin:
Reduced = Reducedset.groupby(binning)['% Renewable'].agg(['count'])
Which gives me:
% Renewable
(2.212, 15.753] 7
(15.753, 29.227] 4
(29.227, 42.701] 2
(42.701, 56.174] 0
(56.174, 69.648] 2
Name: count, dtype: int64
However, the index has disappeared and I still want to keep the index for the 'continents' (the outer index).
Thus, on the very left of the (% Renewable) column it should say:
Asia
North America
Europe
Australia
South America
When I try doing that by:
print(Reducedset['% Renewable'].groupby(
    [Reducedset['% Renewable'].index.get_level_values(0),
     pd.cut(Reducedset['% Renewable'], 5)]).count())
It works!
Problem solved!
Let's assume the following data:
np.random.seed(1)
s = pd.Series(np.random.randint(0, 10, 16),
              index=pd.MultiIndex.from_arrays([list('aaaabbccdddddeee'),
                                               list('abcdefghijklmnop')]))
What you are looking for, IIUC, is then:
print(s.groupby([s.index.get_level_values(0),  # that is the continent for you
                 pd.cut(s, 5)])                # that is the binning you created
       .count())
a (-0.009, 1.8] 0
(1.8, 3.6] 0
(3.6, 5.4] 2
(5.4, 7.2] 0
(7.2, 9.0] 2
b (-0.009, 1.8] 2
(1.8, 3.6] 0
(3.6, 5.4] 0
(5.4, 7.2] 0
(7.2, 9.0] 0
c (-0.009, 1.8] 1
(1.8, 3.6] 0
(3.6, 5.4] 0
(5.4, 7.2] 1
(7.2, 9.0] 0
d (-0.009, 1.8] 0
(1.8, 3.6] 1
(3.6, 5.4] 2
(5.4, 7.2] 1
(7.2, 9.0] 1
e (-0.009, 1.8] 0
(1.8, 3.6] 2
(3.6, 5.4] 1
(5.4, 7.2] 0
(7.2, 9.0] 0
dtype: int64
New to Python; I can't seem to understand how to proceed.
After binning and editing my data frame, I was able to come up with this:
Continents % Renewable Country
0 Asia (15.753, 29.227] China
1 North America (2.212, 15.753] United States
2 Asia (2.212, 15.753] Japan
3 Europe (2.212, 15.753] United Kingdom
4 Europe (15.753, 29.227] Russian Federation
5 North America (56.174, 69.648] Canada
6 Europe (15.753, 29.227] Germany
7 Asia (2.212, 15.753] India
8 Europe (15.753, 29.227] France
9 Asia (2.212, 15.753] South Korea
10 Europe (29.227, 42.701] Italy
11 Europe (29.227, 42.701] Spain
12 Asia (2.212, 15.753] Iran
13 Australia (2.212, 15.753] Australia
14 South America (56.174, 69.648] Brazil
Now when I set the Continents and % Renewable as a MultiIndex using:
Top15 = Top15.groupby(by=['Continents', '% Renewable']).sum()
to get the following:
Country
Continents % Renewable
Asia (15.753, 29.227] China
(2.212, 15.753] JapanIndiaSouth KoreaIran
Australia (2.212, 15.753] Australia
Europe (15.753, 29.227] Russian FederationGermanyFrance
(2.212, 15.753] United Kingdom
(29.227, 42.701] ItalySpain
North America (2.212, 15.753] United States
(56.174, 69.648] Canada
South America (56.174, 69.648] Brazil
Now I would like to have a column that gives me the number of countries in each index, i.e.:
in the 1st row, China = 1,
and in the 2nd row, JapanIndiaSouth KoreaIran would be 4.
So in the end I want something like this :
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
I just don't know how to get there.
Also, the numbers need to be sorted in descending order, while still keeping the index grouping in place.
Top15.groupby(['Continents', '% Renewable']).Country.count()
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
To sort in the order you'd like
Top15_count = Top15.groupby(['Continents', '% Renewable']).Country.count()
Top15_count.reset_index() \
    .sort_values(
        ['Continents', 'Country'],
        ascending=[True, False]
    ).set_index(['Continents', '% Renewable']).Country
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: Country, dtype: int64
Solution with size:
What is the difference between size and count in pandas?
print (Top15.groupby(['Continents', '% Renewable']).size())
Continents % Renewable
Asia (15.753, 29.227] 1
(2.212, 15.753] 4
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(2.212, 15.753] 1
(29.227, 42.701] 2
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
dtype: int64
Use sort_values if you need to change the order; for a DataFrame, add reset_index, and lastly, if you need a MultiIndex, add set_index:
print (Top15.groupby(['Continents', '% Renewable']) \
            .size() \
            .reset_index(name='COUNT') \
            .sort_values(['Continents', 'COUNT'], ascending=[True, False]) \
            .set_index(['Continents','% Renewable']).COUNT)
Continents % Renewable
Asia (2.212, 15.753] 4
(15.753, 29.227] 1
Australia (2.212, 15.753] 1
Europe (15.753, 29.227] 3
(29.227, 42.701] 2
(2.212, 15.753] 1
North America (2.212, 15.753] 1
(56.174, 69.648] 1
South America (56.174, 69.648] 1
Name: COUNT, dtype: int64
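An alternative to the reset_index round trip above, kept as a sketch on toy data (string bins standing in for the real intervals): sort each outer group's values with a grouped apply, which keeps the MultiIndex as-is.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Asia", "(15.753, 29.227]"), ("Asia", "(2.212, 15.753]"),
     ("Europe", "(2.212, 15.753]"), ("Europe", "(15.753, 29.227]")],
    names=["Continents", "% Renewable"])
s = pd.Series([1, 4, 1, 3], index=idx, name="COUNT")

# sort descending within each continent; group_keys=False avoids
# prepending the continent a second time to the index
out = s.groupby(level=0, group_keys=False).apply(
    lambda g: g.sort_values(ascending=False))
print(out)
```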