How to groupby columns (ignoring order) in a Pandas DataFrame? - python

I have a pandas dataframe (4 of its 8 columns shown):
df = pd.DataFrame({
    "departure_country": ["Mexico", "Mexico", "United States", "United States", "United States",
                          "United States", "Japan", "United States", "United States", "United States"],
    "departure_city": ["Guadalajara", "Guadalajara", "New York", "Chicago", "Los Angeles",
                       "Michigan", "Tokyo", "New York", "New York", "Chicago"],
    "destination_country": ["United States", "United States", "United States", "United States", "Mexico",
                            "United States", "United States", "Mexico", "United States", "Japan"],
    "destination_city": ["Los Angeles", "Los Angeles", "Chicago", "New York", "Guadalajara",
                         "New York", "Chicago", "Guadalajara", "Michigan", "Tokyo"]
})
df
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States New York United States Chicago
3 United States Chicago United States New York
4 United States Los Angeles Mexico Guadalajara
5 United States Michigan United States New York
6 Japan Tokyo United States Chicago
7 United States New York Mexico Guadalajara
8 United States New York United States Michigan
9 United States Chicago Japan Tokyo
I want to analyze the data in each group, so I would like to group by "the same pair" of departure and destination first, something like:
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States Los Angeles Mexico Guadalajara
3 United States New York United States Chicago
4 United States Chicago United States New York
5 United States Michigan United States New York
6 United States New York United States Michigan
7 Japan Tokyo United States Chicago
8 United States Chicago Japan Tokyo
9 United States New York Mexico Guadalajara
Is it possible to do this in a DataFrame? I have tried groupby and key-value approaches, but failed.
Really appreciate your help with this, thanks!

I'm sure someone could think of a better-optimized solution, but one way is to create sorted tuples of your country/city pairs and sort by them:
print(df.assign(country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
                city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()])
        .sort_values(["country", "city"], ascending=False)
        .filter(like="_"))
departure_country departure_city destination_country destination_city
5 United States Michigan United States New York
8 United States New York United States Michigan
2 United States New York United States Chicago
3 United States Chicago United States New York
7 United States New York Mexico Guadalajara
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
4 United States Los Angeles Mexico Guadalajara
6 Japan Tokyo United States Chicago
9 United States Chicago Japan Tokyo
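If you want an actual groupby after establishing the order-insensitive keys, the same sorted-tuple trick can feed groupby directly. A minimal sketch (the keyed name is mine, not from the answer):
# Reuse the sorted-tuple keys as groupby keys, so that A->B and B->A
# trips land in the same group.
keyed = df.assign(
    country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
    city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()],
)
for (country_pair, city_pair), group in keyed.groupby(["country", "city"]):
    print(country_pair, city_pair, "->", len(group), "rows")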

Related

Compare two dataframes in pandas and edit results

I have two dataframes, df1 and df2. df1 contains correct data that will be used to match data in df2.
I want to find latitudes and longitudes in df2 that don't match the city name in df1.
I also want to find cities in df2 that are "located" in the wrong country.
Here's df1 dataframe
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394 United States
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
Here's df2 dataframe
id location city country
16620625-5686 45.5333, -73.2833 Saint-Basile-le-Grand Canada
16310427-5502 52.0000, 84.9833 Belokurikha Russia
16501010-4957 -14.4667, 136.2667 Katherine Australia
16110430-8679 40.5626, -74.5743 Finderne United States
16990624-4174 27.7526, -90.7394 South Pasadena China
16790311-9092 35.98157, -160.41182 Jiangshan United States
16650927-9151 44.7667, 39.8667 West Islip Russia
16530328-2221 -22.8858, -48.4450 Botucatu Brazil
16411229-7314 42.8821, -71.1709 Hampstead United States
16060229-4175 -7.7296, 38.9500 Kibiti Tanzania
Here's my code so far:
city_df = pd.merge(df1, df2, on='city', how='left')
First add lat and lng columns to df2
df2[['lat', 'lng']] = df2['location'].str.split(', ', expand=True)
df2[['lat', 'lng']] = df2[['lat', 'lng']].astype(float)
Then merge df1 to df2 based on cities
city_df = pd.merge(df1[['lat', 'lng', 'city', 'country']], df2, on='city', how='right', suffixes=('_correct', ''))
Find cities in df2 that are "located" in the wrong country
m = ~((city_df['country_correct'] == city_df['country']) | city_df['country_correct'].isna())
print(city_df[m])
lat_correct lng_correct city country_correct id location country lat lng
4 27.7526 -82.7394 South Pasadena United States 16990624-4174 27.7526, -90.7394 China 27.75260 -90.73940
5 28.7412 118.6225 Jiangshan China 16790311-9092 35.98157, -160.41182 United States 35.98157 -160.41182
6 40.7097 -73.2971 West Islip United States 16650927-9151 44.7667, 39.8667 Russia 44.76670 39.86670
To compare the two data frames, it's easier to first put df1 and df2 into similar formats. For example, df1 would look like this:
lat lng country
city
Katherine -14.4667 132.2667 Australia
South Pasadena 27.7526 -82.7394 United States
Beaconsfield 45.4333 -73.8667 Canada
Ferney-Voltaire 46.2558 6.1081 France
Jiangshan 28.7412 118.6225 China
Dīla 6.4104 38.3100 Ethiopia
Gibara 21.1072 -76.1367 Cuba
Hampstead 42.8821 -71.1709 United States
West Islip 40.7097 -73.2971 United States
Paulínia -22.7611 -47.1542 Brazil
And df2 :
country2 lng2 lat2
city
Saint-Basile-le-Grand Canada -73.2833 45.5333
Belokurikha Russia 84.9833 52.0000
Katherine Australia 132.2667 -14.4667
Finderne United States -74.5743 40.5626
South Pasadena United States -82.7394 27.7526
West Islip United States -160.41182 35.98157
Belorechensk Russia 39.8667 44.7667
Botucatu Brazil -48.4450 -22.8858
Hampstead United States -71.1709 42.8821
Kibiti Tanzania 38.9500 -7.7296
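The reshaping that produces these two frames is not shown in the answer; a minimal sketch of the mechanics, assuming the raw df1/df2 from the question and the column names above:
# Index df1 by city, keeping only the columns used in the comparison
df1 = df1.set_index('city')[['lat', 'lng', 'country']]

# Split df2's "location" string into numeric coordinates, rename the
# columns so they don't collide with df1's after concat, and index by city
df2[['lat2', 'lng2']] = df2['location'].str.split(', ', expand=True).astype(float)
df2 = df2.set_index('city')[['country', 'lng2', 'lat2']].rename(columns={'country': 'country2'})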
Then you can use the pd.concat method on axis=1, as follows, to get the df below:
df3 = pd.concat([df1, df2], axis=1)
lat lng country country2 lng2 lat2
city
Katherine -14.4667 132.2667 Australia Australia 132.2667 -14.4667
South Pasadena 27.7526 -82.7394 United States United States -82.7394 27.7526
Beaconsfield 45.4333 -73.8667 Canada NaN NaN NaN
Ferney-Voltaire 46.2558 6.1081 France NaN NaN NaN
Jiangshan 28.7412 118.6225 China NaN NaN NaN
Dīla 6.4104 38.3100 Ethiopia NaN NaN NaN
Gibara 21.1072 -76.1367 Cuba NaN NaN NaN
Hampstead 42.8821 -71.1709 United States United States -71.1709 42.8821
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
Paulínia -22.7611 -47.1542 Brazil NaN NaN NaN
Saint-Basile-le-Grand NaN NaN NaN Canada -73.2833 45.5333
Belokurikha NaN NaN NaN Russia 84.9833 52.0000
Finderne NaN NaN NaN United States -74.5743 40.5626
Belorechensk NaN NaN NaN Russia 39.8667 44.7667
Botucatu NaN NaN NaN Brazil -48.4450 -22.8858
Kibiti NaN NaN NaN Tanzania 38.9500 -7.7296
Finally, from the concatenated df3 you can get the rows where latitudes and longitudes in df2 don't match the city name in df1:
df3[(df3['lat']!=df3['lat2']) & (df3['lng']!=df3['lng2'])].dropna()
lat lng country country2 lng2 lat2
city
West Islip 40.7097 -73.2971 United States United States -160.41182 35.98157
To find cities in df2 that are "located" in the wrong country:
df3[df3['country']!=df3['country2']]
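One caveat, not in the original answer: rows present in only one frame also pass this filter, because NaN never compares equal. If you only want genuine mismatches, drop the unmatched rows first:
# Keep only cities present in both frames before comparing countries
matched = df3.dropna(subset=['country', 'country2'])
print(matched[matched['country'] != matched['country2']])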

Search pandas dataframe and edit values

I want to replace a missing country with the country that corresponds to the city, i.e. find another data point with the same city and copy its country; if there is no other record with the same city, then remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()
You can group by the three columns city, lat and lng, and fill the missing values with the first non-NaN value in each group:
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # first_valid_index() returns None when the group has no non-NaN value
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
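The question also asks to remove rows whose country could not be copied from another record with the same city. Assuming the filled frame above, a one-liner sketch:
# Drop the rows that still have no country after the group-wise fill
df = df.dropna(subset=['country']).reset_index(drop=True)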
I've solved such a problem with the geopy package: use the lat and lng values, then filter the geopy output for the country. This way you avoid NaNs and always get an answer based on the geo information. Run
pip3 install geopy
to install the package.
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
# reverse-geocode each row's coordinates and write the country back
for i in range(len(df)):
    lat = str(df.iloc[i, 2])
    lon = str(df.iloc[i, 3])
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the user_agent API; for exercise purposes this value should work.
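One caveat worth adding, not part of the original answer: Nominatim's usage policy allows roughly one request per second, and geopy ships a RateLimiter wrapper for exactly that:
from geopy.extra.rate_limiter import RateLimiter

# Wrap the reverse geocoder so consecutive calls wait at least one second
reverse = RateLimiter(geo.reverse, min_delay_seconds=1)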

Performing a split on a Pandas dataframe and creating a new frame

I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities                    Cities
New York, Wisconsin, Atlanta     New York
New York, Wisconsin, Atlanta     Wisconsin
New York, Wisconsin, Atlanta     Atlanta
Tokyo, Kyoto, Suzuki             Tokyo
Tokyo, Kyoto, Suzuki             Kyoto
Tokyo, Kyoto, Suzuki             Suzuki
Paris, Bordeaux, Lyon            Paris
Paris, Bordeaux, Lyon            Bordeaux
Paris, Bordeaux, Lyon            Lyon
Mumbai, Delhi, Bangalore         Mumbai
Mumbai, Delhi, Bangalore         Delhi
Mumbai, Delhi, Bangalore         Bangalore
London, Manchester, Bermingham   London
London, Manchester, Bermingham   Manchester
London, Manchester, Bermingham   Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column.
Here's a replicable version of df:
df = pd.DataFrame({'Merged_Cities': ['New York, Wisconsin, Atlanta',
                                     'Tokyo, Kyoto, Suzuki',
                                     'Paris, Bordeaux, Lyon',
                                     'Mumbai, Delhi, Bangalore',
                                     'London, Manchester, Bermingham']})
Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
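A small follow-up: explode keeps the original index, which is why the row labels repeat (0, 0, 0, 1, ...). If you prefer a fresh 0..n-1 index, reset it afterwards:
# Give the exploded frame a clean RangeIndex
df = df.reset_index(drop=True)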
This is really similar to @AndrejKesely's answer, except it merges df and the cities on their index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki

Parsing a table with no <table>/<td>/<tr> tags where the data is nested in <div> tags - beautifulsoup, selenium and webdriver_manager

I'm trying to scrape the whole table at this URL:
url = "https://www.topuniversities.com/university-rankings/university-subject-rankings/2021/psychology"
The problem is that there's no <table> tag, and no <tr> or <td> tags either. All the row data sits in nested <div> tags.
The code I'm using is this:
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
import time
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
driver.quit()
print(soup)
Also, I'm only getting data from one column (the one named "Overall Score") in the nested <div> tags.
Something else I realised is that the soup output only contains data for the first 10 rows, while I'm trying to get all 302 rows.
Thanks a lot for any advice you could give me.
EDIT
I managed to get what I expected by following @KunduK's answer. This is the code I used in the end:
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089_indicators.txt?1614801117').json()
df = pd.DataFrame(res["data"])
df = df[["uni", "region", "location", "city", "overall",
         "ind_69", "ind_70", "ind_76", "ind_77"]]
headers = {"uni": "University", "overall": "Overall Score", "ind_69": "H-index Citations",
           "ind_70": "Citations per Paper", "ind_76": "Academic Reputation", "ind_77": "Employer Reputation"}
df.rename(columns=headers, inplace=True)
# each selected cell is an HTML fragment; keep only the text of its inner <div>
for column in headers.values():
    df[column] = df[column].apply(lambda value: BeautifulSoup(value, 'html.parser').find('div').text)
df
You don't need selenium. If you go to the network tab you will find the link below, which returns the data as JSON; you just need to loop through it and fetch the values.
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089.txt?1615516693?v=1616064930668
Code:
import requests

res = requests.get("https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089.txt?1615516693?v=1616064930668").json()
print("Total records :{}".format(len(res['data'])))
for item in res['data']:
    print(item['country'])
    print(item['city'])
    print(item['score'])
    print("============")
Output:
Total records :302
United States
Cambridge
98.6
============
United States
Stanford
96.4
============
United Kingdom
Oxford
95.5
============
United Kingdom
Cambridge
94.8
============
United States
Berkeley
92.3
============
United States
Los Angeles
91.4
============
United States
New Haven
90.9
============
United States
Ann Arbor
89.5
============
United States
Cambridge
89.3
============
United Kingdom
London
89.2
============
United States
Philadelphia
89.2
============
United States
New York City
89.1
============
United States
New York City
88.4
============
United States
Chicago
88.2
============
Netherlands
Amsterdam
87.7
============
Singapore
Singapore
87.2
============
Canada
Vancouver
87.2
============
United States
Princeton
87
============
Canada
Toronto
86.1
============
United Kingdom
London
85.7
============
Australia
Parkville
85.7
============
United States
Evanston
85.5
============
Belgium
Leuven
85.2
============
United Kingdom
London
85.1
============
Australia
Sydney
85.1
============
Australia
Brisbane
84.4
============
Singapore
Singapore
84.3
============
United States
Durham
83.6
============
Canada
Montreal
83.5
============
Australia
Sydney
83.4
============
Netherlands
Utrecht
82.9
============
United States
Champaign
82.7
============
United Kingdom
Edinburgh
82.5
============
United Kingdom
Manchester
81.7
============
Hong Kong SAR
Hong Kong
81.7
============
United States
Austin
81.6
============
United States
Pittsburgh
81.5
============
Australia
Canberra
81.3
============
Netherlands
Rotterdam
81.2
============
United States
East Lansing
81.1
============
Germany
Berlin
81
============
Australia
Perth
81
============
Germany
Berlin
80.9
============
Netherlands
Groningen
80.9
============
United States
Ithaca
80.7
============
Hong Kong SAR
Hong Kong
80.4
============
United States
Madison
80.4
============
United States
Columbus
80.3
============
Switzerland
Zürich
80.3
============
United States
San Diego
80.2
============
Australia
Melbourne
80.1
============
Netherlands
Leiden
79.8
============
United States
Seattle
79.8
============
Netherlands
Tilburg
79.6
============
United States
Minneapolis
79.5
============
China (Mainland)
Beijing
79.4
============
New Zealand
Auckland
79.3
============
Netherlands
Maastricht
79.1
============
United States
University Park
79.1
============
United States
Chapel Hill
79.1
============
Belgium
Louvain-la-Neuve
78.9
============
Netherlands
Nijmegen
78.5
============
United Kingdom
Coventry
78.5
============
United States
Nashville
78.5
============
Netherlands
Amsterdam
78.5
============
United States
Baltimore
78.4
============
United Kingdom
Exeter
78.3
============
United States
College Park
78.3
============
United Kingdom
Cardiff
78.2
============
Germany
Munich
78.2
============
Chile
Santiago
78.1
============
New Zealand
Kelburn, Wellington
78.1
============
United States
Providence
78
============
Australia
Sydney
77.8
============
Belgium
Ghent
77.8
============
United States
Boston
77.3
============
United States
Los Angeles
77.3
============
Japan
Tokyo
77.1
============
United Kingdom
Birmingham
77.1
============
United Kingdom
Bristol
77
============
New Zealand
Dunedin
77
============
China (Mainland)
Beijing
76.9
============
Italy
Rome
76.9
============
Italy
Padua
76.9
============
United States
Charlottesville
76.9
============
Sweden
Stockholm
76.8
============
Spain
Madrid
76.8
============
United Kingdom
York
76.8
============
United States
Phoenix
76.6
============
Denmark
Aarhus
76.5
============
...and so on.
Network Tab (screenshot omitted)
I have inspected the URL that you provided. It appears that the data (received from the XHR request https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3519089.txt?1616049862?v=1616050007711) is split by pagination, which is why you are seeing only 10 entries of it.
You have two options to deal with this problem:
Emulate click on next page button
Read full data from XHR URL in JSON format
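For completeness, option 1 would look roughly like the sketch below; the CSS selector is hypothetical and must be checked against the live page. Option 2 is what the code in the EDIT above already does.
from selenium.webdriver.common.by import By

# Hypothetical pagination selector -- inspect the live page for the real one
next_button = driver.find_element(By.CSS_SELECTOR, "a.page-link.next")
next_button.click()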

Convert pandas series type from object to float64

I have a column (of type Series) from a DataFrame (energy["Energy Supply"]) like this:
Country
China 127191
United States 90838
Japan 18984
United Kingdom 7920
Russian Federation 30709
Canada 10431
Germany 13261
India 33195
France 10597
South Korea 11007
Italy 6530
Spain 4923
Iran NaN
Australia 5386
Brazil 12149
Name: Energy Supply, dtype: object
Currently it is of type object.
The NaN value was produced by this code:
peta = row["Energy Supply"]
if peta == "...":
    # pd.to_numeric(row["Energy Supply"], errors='coerce')
    row["Energy Supply"] = np.NaN
The commented-out line works in a similar way.
I don't understand why this Series is now of type object.
I checked each numeric value, and they are all of type float.
I want the whole Series to be of type float or float64.
I tried to convert the whole series into numeric again by doing:
energy["Energy Supply"] = pd.to_numeric(energy["Energy Supply"], errors='coerce')
but after that, this series was changed into:
Country
China NaN
United States NaN
Japan NaN
United Kingdom NaN
Russian Federation 30709.0
Canada 10431.0
Germany 13261.0
India 33195.0
France NaN
South Korea NaN
Italy NaN
Spain NaN
Iran NaN
Australia NaN
Brazil 12149.0
Name: Energy Supply, dtype: float64
I wonder why the values 127191, 90838 were converted into NaN, while 30709, 10431 remained as numbers?
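A quick diagnostic, not part of the original question: inspect the concrete Python types stored in the column. pd.to_numeric(errors='coerce') turns any string it cannot parse into NaN, so a mix of str and float values would explain both the object dtype and the selective NaNs:
# Count the concrete Python types stored in the object column
print(energy["Energy Supply"].map(type).value_counts())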
