How to lookup multiple entries in table to calculate proportion? - python

I have a table of lots of sports teams. For each team I need to know the number of fans attending their game as a percentage of all the fans in that region with the same name suffix. The table below gives you an idea of what I'm working with:
Region Team Suffix Attending Fans
North West blue city city 181
North East Black and white united united 130
North West blue and white city city 101
North East Purple United united 12
North East red city city 73
North East red and white 112
North West Red city city 162
North East white shorts united united 93
North East orange and black city city 68
North West pink united united 4
North West red united united 192
North West orange united united 42
In the above example, the percentage of attending Red City fans as a proportion of the attending fans from all North West teams suffixed with 'city' is 162 / (181 + 101 + 162) ≈ 36.49 %.
What I would like to know is:
1. How do I look up the relevant elements so that I can perform the calculation?
2. How do I automate this so it occurs for every team (including those that do not have a suffix)?

Here is one way. The idea is to perform a groupby.sum() and map this onto the dataframe as part of your calculation.
import pandas as pd, numpy as np

df = pd.DataFrame([['North West', 'blue city', 'city', 181],
                   ['North East', 'Black and white united', 'united', 130],
                   ['North West', 'blue and white city', 'city', 101],
                   ['North East', 'Purple United', 'united', 12],
                   ['North East', 'red city', 'city', 73],
                   ['North East', 'red and white', '', 112],
                   ['North West', 'Red city', 'city', 162],
                   ['North East', 'white shorts united', 'united', 93],
                   ['North East', 'orange and black city', 'city', 68],
                   ['North West', 'pink united', 'united', 4],
                   ['North West', 'red united', 'united', 192],
                   ['North West', 'orange united', 'united', 42]],
                  columns=['Region', 'Team', 'Suffix', 'Attending Fans'])

# total attending fans per (Region, Suffix) group
g = df.groupby(['Region', 'Suffix'])['Attending Fans'].sum()

# look up each row's group total and express the row as a percentage of it
df['Pct'] = 100 * df['Attending Fans'] / np.fromiter(
    map(g.get, map(tuple, df[['Region', 'Suffix']].values)), dtype=float)
# Region Team Suffix Attending Fans Pct
# 0 North West blue city city 181 40.765766
# 1 North East Black and white united united 130 55.319149
# 2 North West blue and white city city 101 22.747748
# 3 North East Purple United united 12 5.106383
# 4 North East red city city 73 51.773050
# 5 North East red and white 112 100.000000
# 6 North West Red city city 162 36.486486
# 7 North East white shorts united united 93 39.574468
# 8 North East orange and black city city 68 48.226950
# 9 North West pink united united 4 1.680672
# 10 North West red united united 192 80.672269
# 11 North West orange united united 42 17.647059
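A more direct route is groupby(...).transform('sum'), which aligns each group's total back to the original rows, so the manual tuple lookup with np.fromiter isn't needed. A minimal sketch of the same calculation, using the same df as above:

# group totals broadcast back to each row; an empty Suffix forms its own group
totals = df.groupby(['Region', 'Suffix'])['Attending Fans'].transform('sum')
df['Pct'] = 100 * df['Attending Fans'] / totals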


Python Dict Comprehension retrieve value from 1 dataframe column if match another column value

I have a dataframe with two columns, ["country"] and ["city"], which give each country and its cities.
I need to create a dict, using a dict comprehension, with the country as the key and a list of its city/cities as the value (some countries have only one city, others many).
I'm able to define the keys and create a list, but every city in the dataframe appears as the values; I can't work out how to add the condition that the values should only be the cities belonging to the key country:
Dic = {k: list(megacities["city"]) for k, f in megacities.groupby('country')}
for k in Dic:
    print("{}:{}\n".format(k, Dic[k]))
Part of the output that I receive is:
Argentina:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Bangladesh:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
Brazil:['Tokyo', 'Jakarta', 'Delhi', 'Manila', 'São Paulo', 'Seoul', 'Mumbai', 'Shanghai', 'Mexico City', 'Guangzhou', 'Cairo', 'Beijing', 'New York', 'Kolkāta', 'Moscow', 'Bangkok', 'Dhaka', 'Buenos Aires', 'Ōsaka', 'Lagos', 'Istanbul', 'Karachi', 'Kinshasa', 'Shenzhen', 'Bangalore', 'Ho Chi Minh City', 'Tehran', 'Los Angeles', 'Rio de Janeiro', 'Chengdu', 'Baoding', 'Chennai', 'Lahore', 'London', 'Paris', 'Tianjin', 'Linyi', 'Shijiazhuang', 'Zhengzhou', 'Nanyang']
So basically the expected output would be:
Argentina:['Buenos Aires']
Bangladesh:['Dhaka']
Brazil:['São Paulo', 'Rio de Janeiro']
How should I proceed, in terms of syntax, to establish that condition for the values in the dict comprehension?
Lastly, the dataframe:
city city_ascii lat lng country iso2 iso3 admin_name capital population id
0 Tokyo Tokyo 35.6839 139.7744 Japan JP JPN Tōkyō primary 39105000 1392685764
1 Jakarta Jakarta -6.2146 106.8451 Indonesia ID IDN Jakarta primary 35362000 1360771077
2 Delhi Delhi 28.6667 77.2167 India IN IND Delhi admin 31870000 1356872604
3 Manila Manila 14.6000 120.9833 Philippines PH PHL Manila primary 23971000 1608618140
4 São Paulo Sao Paulo -23.5504 -46.6339 Brazil BR BRA São Paulo admin 22495000 1076532519
5 Seoul Seoul 37.5600 126.9900 South Korea KR KOR Seoul primary 22394000 1410836482
6 Mumbai Mumbai 19.0758 72.8775 India IN IND Mahārāshtra admin 22186000 1356226629
7 Shanghai Shanghai 31.1667 121.4667 China CN CHN Shanghai admin 22118000 1156073548
8 Mexico City Mexico City 19.4333 -99.1333 Mexico MX MEX Ciudad de México primary 21505000 1484247881
9 Guangzhou Guangzhou 23.1288 113.2590 China CN CHN Guangdong admin 21489000 1156237133
10 Cairo Cairo 30.0444 31.2358 Egypt EG EGY Al Qāhirah primary 19787000 1818253931
11 Beijing Beijing 39.9040 116.4075 China CN CHN Beijing primary 19437000 1156228865
12 New York New York 40.6943 -73.9249 United States US USA New York NaN 18713220 1840034016
13 Kolkāta Kolkata 22.5727 88.3639 India IN IND West Bengal admin 18698000 1356060520
14 Moscow Moscow 55.7558 37.6178 Russia RU RUS Moskva primary 17693000 1643318494
15 Bangkok Bangkok 13.7500 100.5167 Thailand TH THA Krung Thep Maha Nakhon primary 17573000 1764068610
16 Dhaka Dhaka 23.7289 90.3944 Bangladesh BD BGD Dhaka primary 16839000 1050529279
17 Buenos Aires Buenos Aires -34.5997 -58.3819 Argentina AR ARG Buenos Aires, Ciudad Autónoma de primary 16216000 1032717330
18 Ōsaka Osaka 34.7520 135.4582 Japan JP JPN Ōsaka admin 15490000 1392419823
19 Lagos Lagos 6.4500 3.4000 Nigeria NG NGA Lagos minor 15487000 1566593751
20 Istanbul Istanbul 41.0100 28.9603 Turkey TR TUR İstanbul admin 15311000 1792756324
21 Karachi Karachi 24.8600 67.0100 Pakistan PK PAK Sindh admin 15292000 1586129469
22 Kinshasa Kinshasa -4.3317 15.3139 Congo (Kinshasa) CD COD Kinshasa primary 15056000 1180000363
23 Shenzhen Shenzhen 22.5350 114.0540 China CN CHN Guangdong minor 14678000 1156158707
24 Bangalore Bangalore 12.9791 77.5913 India IN IND Karnātaka admin 13999000 1356410365
25 Ho Chi Minh City Ho Chi Minh City 10.8167 106.6333 Vietnam VN VNM Hồ Chí Minh admin 13954000 1704774326
26 Tehran Tehran 35.7000 51.4167 Iran IR IRN Tehrān primary 13819000 1364305026
27 Los Angeles Los Angeles 34.1139 -118.4068 United States US USA California NaN 12750807 1840020491
28 Rio de Janeiro Rio de Janeiro -22.9083 -43.1964 Brazil BR BRA Rio de Janeiro admin 12486000 1076887657
29 Chengdu Chengdu 30.6600 104.0633 China CN CHN Sichuan admin 11920000 1156421555
30 Baoding Baoding 38.8671 115.4845 China CN CHN Hebei NaN 11860000 1156256829
31 Chennai Chennai 13.0825 80.2750 India IN IND Tamil Nādu admin 11564000 1356374944
32 Lahore Lahore 31.5497 74.3436 Pakistan PK PAK Punjab admin 11148000 1586801463
33 London London 51.5072 -0.1275 United Kingdom GB GBR London, City of primary 11120000 1826645935
34 Paris Paris 48.8566 2.3522 France FR FRA Île-de-France primary 11027000 1250015082
35 Tianjin Tianjin 39.1467 117.2056 China CN CHN Tianjin admin 10932000 1156174046
36 Linyi Linyi 35.0606 118.3425 China CN CHN Shandong NaN 10820000 1156086320
37 Shijiazhuang Shijiazhuang 38.0422 114.5086 China CN CHN Hebei admin 10784600 1156217541
38 Zhengzhou Zhengzhou 34.7492 113.6605 China CN CHN Henan admin 10136000 1156183137
39 Nanyang Nanyang 32.9987 112.5292 China CN CHN Henan NaN 10013600 1156192287
Many thanks!
Try:
d = {i: g["city"].to_list() for i, g in df.groupby("country")}
print(d)
Prints:
{
"Argentina": ["Buenos Aires"],
"Bangladesh": ["Dhaka"],
"Brazil": ["São Paulo", "Rio de Janeiro"],
"China": [
"Shanghai",
"Guangzhou",
"Beijing",
"Shenzhen",
"Chengdu",
"Baoding",
"Tianjin",
"Linyi",
"Shijiazhuang",
"Zhengzhou",
"Nanyang",
],
"Congo (Kinshasa)": ["Kinshasa"],
"Egypt": ["Cairo"],
"France": ["Paris"],
"India": ["Delhi", "Mumbai", "Kolkāta", "Bangalore", "Chennai"],
"Indonesia": ["Jakarta"],
"Iran": ["Tehran"],
"Japan": ["Tokyo", "Ōsaka"],
"Mexico": ["Mexico City"],
"Nigeria": ["Lagos"],
"Pakistan": ["Karachi", "Lahore"],
"Philippines": ["Manila"],
"Russia": ["Moscow"],
"South Korea": ["Seoul"],
"Thailand": ["Bangkok"],
"Turkey": ["Istanbul"],
"United Kingdom": ["London"],
"United States": ["New York", "Los Angeles"],
"Vietnam": ["Ho Chi Minh City"],
}
Since you are already doing the groupby, you just need to fetch the cities from each group rather than from the whole dataframe:
Dic = {k: f['city'].unique().tolist() for k, f in megacities.groupby('country')}
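If a dict comprehension isn't strictly required, pandas can also build the mapping directly; a small sketch of the same result:

# one list of cities per country, converted straight to a dict
d = megacities.groupby("country")["city"].agg(list).to_dict()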

Draw a Map of cities in python

I have a ranking of cities around the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is draw a map of the world where those cities are colored according to their position in the ranking above. I am open to other approaches to the representation (such as bubbles whose size increases with the city's position in the ranking or, if necessary, plotting only a sample of cities taken from the top, the middle and the bottom of the ranking).
Thank you,
Federico
Your question has two parts: finding the location of each city, and then drawing them on the map. Assuming you have the latitude and longitude of each city, here's how you'd tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example of how you might draw a circle for each city, with its position in the list used to determine the size of that circle.
import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

m = folium.Map(zoom_start=15)

for counter, city in enumerate(cities):
    # later entries in the list get a larger circle
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')
Output:
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
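For the first part (getting coordinates for each name), one option is to geocode the names programmatically. Here is a minimal sketch using geopy's Nominatim geocoder; this is not part of the answer above, the user_agent string is arbitrary, and underscores in names like New_York_Greater would need cleaning up first:

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="city_ranking_map")

cities = []
for name in ['Seoul', 'Tokyo', 'Paris']:  # e.g. entries from rank_2000
    location = geolocator.geocode(name)
    if location is not None:  # skip names the geocoder cannot resolve
        cities.append({'name': name, 'coords': [location.latitude, location.longitude]})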

Delete the rows with repeated characters in the dataframe

I have a large dataset from a CSV file to clean using the patterns I've identified, but I can't upload the file here, so I've hardcoded a small sample to give an overview of what I'm looking for. The identified patterns are the repeated characters in the values. However, if you look at the dataframe below, there are repeated 'single characters' like ssssss, fffff, aaaaa, etc., repeated 'double characters' like dgdg, bvbvbv, tutu, etc., and repeated 'triple characters' such as yutyut and fdgfdg.
That said, would it also be possible to delete the rows with ANY repeated 'single/double/triple characters' so that I can apply this to the large dataset? For example, the dataframe here only shows the patterns I identified above; however, the large dataset could contain repeated runs of ANY letters, like 'uuuu', 'zzzz', 'eded', 'rsrsrs', 'xyzxyz', etc.
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 wssssss The Crescent tyutyut Mill Road
2 qfdgfdgdg dddfffff qdffgfdgfggfbvbvbv sefsdfdyuytutu
3 Green Lane Highfield Road Springfield Road School Lane
4 Kingsway Stanley Road George Street Albert Road
5 Church Street New Street Queensway Broadway
6 qaaaaass mjkhjk chfghfghh fghfhfh
Here's the code:
import pandas as pd
import numpy as np
data = {'Address1': ['High Street', 'wssssss', 'qfdgfdgdg', 'Green Lane', 'Kingsway', 'Church Street', 'qaaaaass'],
'Address2': ['Park Avenue', 'The Crescent', 'dddfffff', 'Highfield Road', 'Stanley Road', 'New Street', 'mjkhjk'],
'Address3': ['St. John’s Road', 'tyutyut', 'qdffgfdgfggfbvbvbv', 'Springfield Road', 'George Street', 'Queensway', 'chfghfghh'],
'Address4': ['The Grove', 'Mill Road', 'sefsdfdyuytutu', 'School Lane', 'Albert Road', 'Broadway', 'fghfhfh']}
address_details = pd.DataFrame(data)
#Code to delete the data for the identified patterns
print(address_details)
The output I expect is:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 Green Lane Highfield Road Springfield Road School Lane
2 Kingsway Stanley Road George Street Albert Road
3 Church Street New Street Queensway Broadway
Please advise, thank you!
Try with str.contains and loc with agg:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)])
Output:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
3 Green Lane Highfield Road Springfield Road School Lane
4 Kingsway Stanley Road George Street Albert Road
5 Church Street New Street Queensway Broadway
Or if you care about index:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"(.)\1+\b"), axis=1).any(axis=1)].reset_index(drop=True))
Output:
Address1 Address2 Address3 Address4
0 High Street Park Avenue St. John’s Road The Grove
1 Green Lane Highfield Road Springfield Road School Lane
2 Kingsway Stanley Road George Street Albert Road
3 Church Street New Street Queensway Broadway
Edit:
For only lowercase letters, try:
print(address_details.loc[~address_details.agg(lambda x: x.str.contains(r"([a-z]+)\1{1,}\b"), axis=1).any(axis=1)].reset_index(drop=True))
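As a quick sanity check on what that pattern matches, here is a standalone sketch with the re module, separate from the answer above:

import re

pattern = re.compile(r"([a-z]+)\1+\b")

# repeated single, double and triple character runs all match; normal words do not
for text in ["wssssss", "tyutyut", "qfdgfdgdg", "High Street", "Queensway"]:
    print(text, bool(pattern.search(text)))
# wssssss True, tyutyut True, qfdgfdgdg True, High Street False, Queensway False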

Assign values from a dictionary to a new column based on condition

This is my data frame:
City         sales
San Diego      500
Texas          400
Nebraska       300
Macau          200
Rome           100
London          50
Manchester      70
I want to add the country at the end, which will look like this:
City         sales  Country
San Diego      500  US
Texas          400  US
Nebraska       300  US
Macau          200  Hong Kong
Rome           100  Italy
London          50  England
Manchester      70  England
The countries are stored in the dictionary below:
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because the values are a mix of lists and strings, and strings are technically iterable too, so distinguishing them is more annoying. But here's a function that can flatten (invert) your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # Check if it's a list-like value (but not a string); if so, iterate through it
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd

d = flatten_dict(country)
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
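If you'd rather skip the helper function, the same inverted mapping can be built with a single comprehension; a sketch that assumes the values are only strings or lists, as in the example dictionary:

# invert {country: city or list of cities} into {city: country}
d = {city: ctry
     for ctry, cities in country.items()
     for city in ([cities] if isinstance(cities, str) else cities)}

df['Country'] = df['City'].map(d)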
Alternatively, you can implement this using geopy.
You can install geopy with pip install geopy.
Here is the documentation: https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
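One caveat not covered above: the public Nominatim service throttles requests, so calling geocode once per row on a large dataframe can fail partway through. geopy ships a RateLimiter helper that spaces out the calls; a sketch reusing the geolocator defined above:

from geopy.extra.rate_limiter import RateLimiter

# wrap geocode so successive calls are at least one second apart
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df['Country'] = df['City'].apply(lambda x: geocode(x).address.split(', ')[-1])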

How can I combine the Index in the python dataframe?

In the following, can I make a single index for all the entries that share a common index?
cric = pd.Series(['India', 'Pakistan', 'South Africa', 'England', 'New Zealand'],
                 index=['Cricket', 'Cricket', 'Cricket', 'Cricket', 'Cricket'])
ftbl = pd.Series(['England', 'South Africa', 'Australia', 'Netherlands', 'New Zealand'],
                 index=['Football', 'Football', 'Football', 'Football', 'Football'])
hock = pd.Series(['India', 'Pakistan', 'South Korea', 'England', 'India', 'New Zealand'],
                 index=['Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey', 'Hockey'])
all_countries_1 = cric.append(ftbl)
all_countries_1 = all_countries_1.append(ftbl)
all_countries_1 = all_countries_1.append(hock)
all_countries_1 = all_countries_1.to_frame()
all_countries_1.columns = ['Countries']
all_countries_1
I want the following as my output:
Is this what you are looking for?
# zip the first three chars of the index and the index together
z = list(zip(all_countries_1.index.str[:3], all_countries_1.index))
# create multi index
idx = pd.MultiIndex.from_tuples(z)
# assign index
all_countries_1.index = idx
Countries
Cri Cricket India
Cricket Pakistan
Cricket South Africa
Cricket England
Cricket New Zealand
Foo Football England
Football South Africa
Football Australia
Football Netherlands
Football New Zealand
Football England
Football South Africa
Football Australia
Football Netherlands
Football New Zealand
Hoc Hockey India
Hockey Pakistan
Hockey South Korea
Hockey England
Hockey India
Hockey New Zealand
If, by single index, you mean an index made of autoincrementing numbers, there is nothing special you have to do. That is the default index for a DataFrame, so using the reset_index() method will get what you want. The next step will probably be to rename your index column. You can chain that method with reset_index and take care of it in one line.
all_countries_1 = all_countries_1.reset_index().rename(columns={"index":"Sports"})
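Note that Series.append was removed in pandas 2.0. On a recent version, pd.concat with keys= builds the combined Series and a grouped MultiIndex in one step; a sketch using the same cric, ftbl and hock Series (appending ftbl only once):

# keys= adds the outer index level, giving a grouped MultiIndex directly
all_countries_1 = (pd.concat([cric, ftbl, hock],
                             keys=['Cricket', 'Football', 'Hockey'])
                   .to_frame(name='Countries'))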
