The dataframe has 122,145 rows.
Here is a snippet of the data:
country_name,subdivision_1_name,subdivision_2_name,city_name
Spain,Madrid,Madrid,Sevilla La Nueva
Spain,Principality of Asturias,Asturias,Sevares
Spain,Catalonia,Barcelona,Seva
Spain,Cantabria,Cantabria,Setien
Spain,Basque Country,Biscay,Sestao
Spain,Navarre,Navarre,Sesma
Spain,Catalonia,Barcelona,Barcelona
I want to replace city_name with subdivision_2_name whenever both of the following conditions are satisfied:
subdivision_2_name and city_name have the same country_name and the same subdivision_1_name, and
subdivision_2_name also appears as a city_name.
Example: for city_name "Seva", the subdivision_2_name "Barcelona" also appears as a city_name in the dataframe with the same country_name "Spain" and the same subdivision_1_name "Catalonia", so I will replace "Seva" with "Barcelona".
I am not able to create a proper apply function, so I have prepared a loop:
for i in range(df.shape[0]):
    if df.subdivision_2_name[i] in set(df.city_name[(df.country_name == df.country_name[i]) & (df.subdivision_1_name == df.subdivision_1_name[i])]):
        df.city_name[i] = df.subdivision_2_name[i]
Edit: this loop took 1637 seconds (~28 min) to run.
Please suggest a better method.
Use:
def f(x):
    if x['subdivision_2_name'].isin(x['city_name']).any():
        x['city_name'] = x['subdivision_2_name']
    return x

df1 = df.groupby(['country_name','subdivision_1_name','subdivision_2_name']).apply(f)
print(df1)
country_name subdivision_1_name subdivision_2_name city_name
0 Spain Madrid Madrid Sevilla La Nueva
1 Spain Principality of Asturias Asturias Sevares
2 Spain Catalonia Barcelona Barcelona
3 Spain Cantabria Cantabria Setien
4 Spain Basque Country Biscay Sestao
5 Spain Navarre Navarre Sesma
6 Spain Catalonia Barcelona Barcelona
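For a dataframe of this size, a vectorized membership test may be noticeably faster than either the loop or groupby().apply. The following is only a minimal sketch, not the method from the answer above. It assumes the condition means "the (country_name, subdivision_1_name, subdivision_2_name) triple also occurs somewhere as a (country_name, subdivision_1_name, city_name) triple", and it rebuilds the sample data inline; the names city_keys, row_keys and mask are illustrative.

```python
import pandas as pd

# Sample data from the question, rebuilt inline for illustration.
df = pd.DataFrame({
    'country_name': ['Spain'] * 7,
    'subdivision_1_name': ['Madrid', 'Principality of Asturias', 'Catalonia',
                           'Cantabria', 'Basque Country', 'Navarre', 'Catalonia'],
    'subdivision_2_name': ['Madrid', 'Asturias', 'Barcelona', 'Cantabria',
                           'Biscay', 'Navarre', 'Barcelona'],
    'city_name': ['Sevilla La Nueva', 'Sevares', 'Seva', 'Setien',
                  'Sestao', 'Sesma', 'Barcelona'],
})

# Every (country, subdivision_1, city) combination that exists in the data.
city_keys = pd.MultiIndex.from_frame(
    df[['country_name', 'subdivision_1_name', 'city_name']])

# For each row, the (country, subdivision_1, subdivision_2) combination to look up.
row_keys = pd.MultiIndex.from_frame(
    df[['country_name', 'subdivision_1_name', 'subdivision_2_name']])

# True where subdivision_2_name is also a city in the same country and subdivision_1.
mask = row_keys.isin(city_keys)

# Overwrite city_name only on the matching rows.
df.loc[mask, 'city_name'] = df.loc[mask, 'subdivision_2_name']
print(df)
```

On the sample, only the two Catalonia rows match, so "Seva" becomes "Barcelona" and the other cities are left unchanged.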
Related
I have a pandas dataframe with a "Bio Location" column. I would like to filter it so that I keep only the rows whose location contains one of the city names in my list. I have written the following code, which works except for one problem.
For example, if the location is "Paris France" and I have Paris in my list, then it returns the result. However, if I had "France Paris", it would not return "Paris". Do you have a solution? Maybe use a regex? Thanks a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index=False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
    # tested with 1x at the start, 1x in the middle and 1x at the end of a str
    'col1': ['Paris France', 'France Paris Test', 'France Paris', 'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
}
df = pd.DataFrame(df)
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
2. Then applying the code below (updated following a comment):
reg = 'Paris|Madrid'  # regex alternation: matches either city anywhere in the string
df = df[df.col1.str.contains(reg)]
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
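When the city names come from a list, the pattern can be built from it rather than typed by hand. The sketch below is one possible way to do that, under a few assumptions: the column is called "Bio Location" as in the question, the sample rows are made up for illustration, and word boundaries (\b) are wanted so that "Paris" does not match inside a longer word such as "Parisian".

```python
import re
import pandas as pd

cities = ['Paris', 'Bruxelles', 'Madrid']

# Made-up sample rows for illustration, including a missing value.
df = pd.DataFrame({'Bio Location': ['Paris France', 'France Paris',
                                    'Berlin Germany', 'Parisian cafe', None]})

# Escape each name so characters with a regex meaning are treated literally,
# then join the names into one alternation wrapped in word boundaries.
pattern = r'\b(?:' + '|'.join(re.escape(c) for c in cities) + r')\b'

# na=False treats missing locations as "no match" instead of propagating NaN.
mask = df['Bio Location'].str.contains(pattern, na=False)
values = df[mask]
print(values)
```

Here the first two rows are kept regardless of word order, while "Parisian cafe" and the missing value are dropped.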
I am trying to adapt the following code from print statement to dataframe output.
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
def on_occurence(pos,location):
    print (i,':',location)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
the print output for the above code is
England UK : UK
Paris FRANCE : FRANCE
ITALY,gh ROME : ITALY
I would like it so the df looked like:
message        country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY
I have tried the following with no luck
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
df = pd.DataFrame(columns=["message","location"])
def on_occurence(pos,location):
    print (i,':',location)
    df = df.append({"message":i,"location":location},ignore_index=True)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
However, the df looks like the following:
message  country
NEW      UK FRANCE ITALY
df = pd.DataFrame(list(zip(places, location)), columns = ["Message", "Country"])
print(df)
My output:
Message Country
0 England UK UK
1 Paris FRANCE FRANCE
2 ITALY,gh ROME ITALY
If you want to print it without Row Index:
print(df.to_string(index=False))
Output in this case is:
Message Country
England UK UK
Paris FRANCE FRANCE
ITALY,gh ROME ITALY
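Note that zip pairs the two lists purely by position, so it lines up here only because the sample lists happen to be in matching order. A hedged sketch that records the actual matches instead is shown below. It assumes a plain substring test is an acceptable stand-in for aho_find_all (whose implementation is not shown in the question), collects each match as a row in a list, and builds the dataframe once at the end.

```python
import pandas as pd

places = ['England UK', 'Paris FRANCE', 'ITALY,gh ROME', 'New']
location = ['UK', 'FRANCE', 'ITALY']

# Collect one dict per match; appending to a list is cheap,
# and the dataframe is created a single time afterwards.
rows = []
for place in places:
    for country in location:
        # Stand-in for the Aho-Corasick search: simple substring test.
        if country in place:
            rows.append({'message': place, 'country': country})

df = pd.DataFrame(rows, columns=['message', 'country'])
print(df.to_string(index=False))
```

With a real Aho-Corasick callback, the same idea applies: have the callback append to the rows list rather than reassigning df inside the function.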
I would recommend using a dictionary instead of two separate lists, e.g.:
placeAndLocation = {
    "england UK" : "UK",
    "Paris France" : "france"
}
and so on.
Then, to loop through it, use:
for place, location in placeAndLocation.items():
    print("place: " + place)
    print("location: " + location)
I find this easier because you can see at a glance which key lines up with which value, and the data is contained in one variable, which makes it easier to read down the line.
My dataframe looks like this:
I want to add the league the club plays in, and the country that league is based in, as new columns, for every row.
I initially tried this using dictionaries with the clubs/countries, and returning the key for a value:
club_country_dict = {'La Liga':['Real Madrid','FC Barcelona'],'France Ligue 1':['Paris Saint-Germain']}
key_list=list(club_country_dict.keys())
val_list=list(club_country_dict.values())
But this ran into issues, since the value for each of my keys is actually a list rather than a single value.
I then tried some if/then logic, with a standalone variable for each league, checking whether the club value was in each variable:
la_Liga = ['Real Madrid','FC Barcelona']
for row in data:
    if data['Club'] in la_Liga:
        data['League'] = 'La Liga'
Apologies for the messy question. Basically, I'm looking to add two new columns to my dataset, 'League' and 'Country', based on the 'Club' column value. I'm not sure what the easiest way to do this is, but I've hit walls trying different approaches. Thanks in advance.
You could convert the dictionary to a data frame and then merge:
df = pd.DataFrame({"Name": ["CR", "Messi", "neymar"], "Club": ["Real Madrid", "FC Barcelona", "Paris Saint-Germain"]})
df.merge(pd.DataFrame(club_country_dict.items(), columns=['League', 'Club']).explode('Club'),
on = 'Club', how='left')
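Another option, sketched here under stated assumptions rather than taken from the answer above, is to invert the league-to-clubs dictionary into a club-to-league dictionary and use Series.map. The league_country mapping is assumed for illustration, since the question does not give one, and club_to_league is an illustrative name.

```python
import pandas as pd

club_country_dict = {
    'La Liga': ['Real Madrid', 'FC Barcelona'],
    'France Ligue 1': ['Paris Saint-Germain'],
}
# Assumed mapping for illustration; extend it with the leagues you need.
league_country = {'La Liga': 'Spain', 'France Ligue 1': 'France'}

# Invert {league: [clubs]} into {club: league} so each club is a single key.
club_to_league = {club: league
                  for league, clubs in club_country_dict.items()
                  for club in clubs}

df = pd.DataFrame({"Name": ["CR", "Messi", "neymar"],
                   "Club": ["Real Madrid", "FC Barcelona", "Paris Saint-Germain"]})

# map() looks each value up in the dict; clubs that are missing become NaN.
df['League'] = df['Club'].map(club_to_league)
df['Country'] = df['League'].map(league_country)
print(df)
```

Unlike replace, map leaves NaN for clubs that are not in the dictionary, which makes missing entries easy to spot.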
Here is one simple way to solve it: use the pandas apply function on rows.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
import pandas as pd
df = pd.DataFrame({"name": ["CR", "Messi", "neymar"], "club": ["RM", "BR", "PSG"]})
country = {"BR": "Spain", "PSG": "France", "RM": "Spain"}
df["country"] = df.apply(lambda row: country[row.club], axis=1)
print(df)
Output:
name club country
0 CR RM Spain
1 Messi BR Spain
2 neymar PSG France
Try the pandas replace method for Series.
df = pd.DataFrame({"Name" : ['Cristiano', 'L. Messi', "Neymar"], 'Club' : ["Real Madrid", "FC Barcelona", "Paris Saint-Germain"]})
df:
Name Club
0 Cristiano Real Madrid
1 L. Messi FC Barcelona
2 Neymar Paris Saint-Germain
Now add new column:
club_country_dict = {'Real Madrid': 'La Liga',
'FC Barcelona' : "La Liga",
'Paris Saint-Germain': 'France Ligue 1'}
df['League'] = df.Club.replace(club_country_dict)
df:
Name Club League
0 Cristiano Real Madrid La Liga
1 L. Messi FC Barcelona La Liga
2 Neymar Paris Saint-Germain France Ligue 1
To cope with the "list problem" in club_country_dict, convert it to
the following Series:
league_club = pd.Series(club_country_dict.values(), index=club_country_dict.keys(),
name='Club').explode()
The result is:
La Liga Real Madrid
La Liga FC Barcelona
France Ligue 1 Paris Saint-Germain
Name: Club, dtype: object
You should also have a "connection" between the league name and its
country (another Series):
league_country = pd.Series({'La Liga': 'Spain', 'France Ligue 1': 'France'}, name='Country')
Of course, add here other leagues of interest with their countries.
The next step is to join them into club_details DataFrame, with Club
as the index:
club_details = league_club.to_frame().join(league_country).reset_index()\
.rename(columns={'index':'League'}).set_index('Club')
The result is:
League Country
Club
Paris Saint-Germain France Ligue 1 France
Real Madrid La Liga Spain
FC Barcelona La Liga Spain
Then, assuming that your first DataFrame is named player, generate
the final result:
result = player.join(club_details, on='Club')
The result is:
Name Club League Country
0 Cristiano Ronaldo Real Madrid La Liga Spain
1 L. Messi FC Barcelona La Liga Spain
2 Neymar Paris Saint-Germain France Ligue 1 France
I have got data like this:
Col
Texas[x]
Dallas
Austin
California[x]
Los Angeles
San Francisco
What I want is this:
col1           Col2
Texas[x]       Dallas
               Austin
California[x]  Los Angeles
               San Francisco
Please help!
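Use str.extract to create the columns and then clean up (this needs import numpy as np; the pattern is a raw string so that \[ is not treated as a string escape):
df.Col.str.extract(r'(.*\[x\])?(.*)').ffill()\
    .replace('', np.nan).dropna()\
    .rename(columns = {0:'Col1', 1: 'Col2'})\
    .set_index('Col1')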
               Col2
Col1
Texas[x]       Dallas
Texas[x]       Austin
California[x]  Los Angeles
California[x]  San Francisco
Update: To address the follow-up question.
df.Col.str.extract(r'(.*\[x\])?(.*)').ffill()\
    .replace('', np.nan).dropna()\
    .rename(columns = {0:'Col1', 1: 'Col2'})
You get
Col1 Col2
1 Texas[x] Dallas
2 Texas[x] Austin
4 California[x] Los Angeles
5 California[x] San Francisco
It seems that [x] marks a state in the list. You can try iterating over the dataframe using iterrows, something like this:
state = None # initialize as None, in case something goes wrong
city = None
rowlist = []
for idx, row in df.iterrows():
    # get the state
    if '[x]' in row['Col']:
        state = row['Col']
        continue
    # now, get the cities
    city = row['Col']
    rowlist.append([state, city])
df2 = pd.DataFrame(rowlist)
This assumes that your initial dataframe is called df and the column name is Col, and it only works if each state is followed by its cities, which seems to be the case in your data sample.
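The same "state row, then its cities" assumption can also be expressed without an explicit loop. The following is a rough sketch, assuming every state row contains the literal text "[x]" and no city does; is_state, states and the Col1/Col2 column names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'Col': ['Texas[x]', 'Dallas', 'Austin',
                           'California[x]', 'Los Angeles', 'San Francisco']})

# Flag the state rows; regex=False makes '[x]' a literal substring.
is_state = df['Col'].str.contains('[x]', regex=False)

# Keep the value only on state rows, then carry it down over the city rows.
states = df['Col'].where(is_state).ffill()

# Pair each city with its state and drop the state rows themselves.
df2 = pd.DataFrame({'Col1': states, 'Col2': df['Col']})[~is_state]
df2 = df2.reset_index(drop=True)
print(df2)
```

This yields one row per city, with the state repeated in Col1, matching what the loop produces.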
I have dataframe which looks like below
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and pick the most frequent city in each country.
My desired output looks like this:
Country City
UK London
USA Washington
This is because London and Washington each appear 2 times, whereas Manchester and Chicago appear only once.
I tried:
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
    apply(lambda x: mode(x)[0][0]).reset_index()
But it seems this won't work on strings.
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a series, using iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
Try the following:
>>> df.City.mode()
0 London
1 Washington
dtype: object
Alternatively, you can use scipy's stats.mode with a lambda:
import pandas as pd
from scipy import stats
df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
However, it also gives the count if you don't want to return only the first value:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])
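If the count is wanted alongside the most frequent city, a pandas-only variant is also possible. The sketch below is one way to do it, not the approach from the answers above: it counts each (Country, City) pair, sorts so the largest count comes first within each country, and keeps the first row per country. Ties would be resolved by whichever row sorts first, which is an assumption of this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['UK', 'USA', 'UK', 'UK', 'USA', 'USA'],
    'City': ['London', 'Washington', 'London', 'Manchester', 'Washington', 'Chicago'],
})

# Number of rows for every (Country, City) pair.
counts = df.groupby(['Country', 'City']).size().reset_index(name='count')

# Largest count first within each country, then keep one row per country.
top = (counts.sort_values(['Country', 'count'], ascending=[True, False])
             .drop_duplicates('Country')
             .reset_index(drop=True))
print(top)
```

The result has one row per country with the columns Country, City and count.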