String mode aggregation with group by function - python

I have a dataframe that looks like this:
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and, for each country, return the city that occurs most often.
My desired output is:
Country City
UK London
USA Washington
Because London and Washington each appear twice, whereas Manchester and Chicago appear only once.
I tried
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
    apply(lambda x: mode(x)[0][0]).reset_index()
But it does not seem to work on strings.

I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a Series; use iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
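A self-contained sketch of the approach above. The frame construction is an assumption about how the question's data is stored. As far as I know, Series.mode returns its results sorted, so if two cities tie within a country, iat[0] would pick the alphabetically first one.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "Country": ["UK", "USA", "UK", "UK", "USA", "USA"],
    "City": ["London", "Washington", "London", "Manchester", "Washington", "Chicago"],
})

# Most frequent city per country; mode() may return several values on a tie,
# so iat[0] keeps only the first of them
res = (
    df.groupby("Country")["City"]
      .agg(lambda s: s.mode().iat[0])
      .reset_index()
)
print(res)
```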

Try the following. For the mode of the whole column:
>>> df.City.mode()
0 London
1 Washington
dtype: object
Or use scipy.stats.mode with a lambda. Depending on the SciPy version, mode may reject non-numeric input such as strings:
import pandas as pd
from scipy import stats
df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
If you do not restrict the result to the first value, it also returns the count:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])
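If the count is wanted but SciPy is not available (or its mode rejects strings), a pandas-only sketch is possible with value_counts on the grouped column. The column name "n" is a hypothetical choice made here to avoid a name clash when resetting the index.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["UK", "USA", "UK", "UK", "USA", "USA"],
    "City": ["London", "Washington", "London", "Manchester", "Washington", "Chicago"],
})

# Counts per (Country, City), sorted by count within each country
counts = df.groupby("Country")["City"].value_counts()

# Keep the top row of each country, then flatten to ordinary columns
top = (
    counts.groupby(level="Country").head(1)
          .rename("n")          # "n" is an illustrative name for the count column
          .reset_index()
)
print(top)
```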

Related

Dataframe - filter the values of a particular column with isin()

I have a pandas dataframe with a column "Bio Location". I would like to filter it so that I keep only the rows whose location contains one of the city names in my list. The code below works, except for one problem.
For example, if the location is "Paris France" and Paris is in my list, the row is returned. However, if the location is "France Paris", it is not. Do you have a solution? Maybe a regex? Thanks a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index = False)
What you want here is .str.contains():
1. The DataFrame I used to test:
df = pd.DataFrame({
    # each city appears once at the start, once in the middle, and once at the end of a string
    'col1': ['Paris France', 'France Paris Test', 'France Paris',
             'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
})
df
Result:
   col1
0  Paris France
1  France Paris Test
2  France Paris
3  Madrid Spain
4  Spain Madrid Test
5  Spain Madrid
2. Then apply the code below (updated following a comment):
# regex alternation: keep rows that contain either city name
reg = 'Paris|Madrid'
df = df[df.col1.str.contains(reg)]
df
Result:
   col1
0  Paris France
1  France Paris Test
2  France Paris
3  Madrid Spain
4  Spain Madrid Test
5  Spain Madrid
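When the city names come from a list, as in the question, the pattern can be built from it. The following is a minimal sketch, not the answerer's code: the sample rows are invented, and the word boundaries, re.escape, case=False and na=False are my assumptions about what is wanted (whole-word, case-insensitive matching, with missing locations treated as non-matches).

```python
import re
import pandas as pd

# Invented sample rows, including a missing value
df = pd.DataFrame({
    "Bio Location": ["Paris France", "France Paris", "Lyon France", "Madrid Spain", None]
})
cities = ["Paris", "Bruxelles", "Madrid"]

# Escape each name in case it contains regex metacharacters, then join with "|"
pattern = r"\b(?:" + "|".join(re.escape(c) for c in cities) + r")\b"

mask = df["Bio Location"].str.contains(pattern, case=False, na=False)
values = df[mask]
print(values)
```

The non-capturing group (?:...) is used because str.contains only needs a yes/no answer per row.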

How to replace a list with first element of list in pandas dataframe column?

I have a pandas dataframe df, which looks like this:
df = pd.DataFrame({'Name':['Harry', 'Sam', 'Raj', 'Jamie', 'Rupert'],
'Country':['USA', "['USA', 'UK', 'India']", "['India', 'USA']", 'Russia', 'China']})
Name Country
Harry USA
Sam ['USA', 'UK', 'India']
Raj ['India', 'USA']
Jamie Russia
Rupert China
Some values in the Country column are lists, and I want to replace each list with its first element, so that the result looks like this:
Name Country
Harry USA
Sam USA
Raj India
Jamie Russia
Rupert China
As you have strings, you could use a regex here:
df['Country'] = df['Country'].str.extract(r'((?<=\[["\'])[^"\']*|^[^"\']+$)')
output (as a new column for clarity):
Name Country Country2
0 Harry USA USA
1 Sam ['USA', 'UK', 'India'] USA
2 Raj ['India', 'USA'] India
3 Jamie Russia Russia
4 Rupert China China
regex:
( # start capturing
(?<=\[["\']) # if preceded by [" or ['
[^"\']* # get all text until " or '
| # OR
^[^"\']+$ # get whole string if it doesn't contain " or '
) # stop capturing
Try something like:
import ast
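def changeStringList(value):
    try:
        # parse strings such as "['USA', 'UK']" into a real list
        myList = ast.literal_eval(value)
        return myList[0]
    except (ValueError, SyntaxError):
        # plain strings such as 'USA' are not valid literals; keep them as-is
        return value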
df["Country"] = df["Country"].apply(changeStringList)
df
Output
     Name Country
0   Harry     USA
1     Sam     USA
2     Raj   India
3   Jamie  Russia
4  Rupert   China
Note that changeStringList tries to parse the string as a Python list and returns its first element. If the value is not a list literal, it returns the value unchanged.
Try this:
import ast
df['Country'] = df['Country'].where(df['Country'].str.contains('[', regex=False), '[\'' + df['Country'] + '\']').apply(ast.literal_eval).str[0]
Output:
>>> df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
A regex solution.
import re
tempArr = []
for val in df["Country"]:
    if val.startswith("["):
        # first run of letters inside the bracketed string
        val = re.findall(r"[A-Za-z]+", val)[0]
    tempArr.append(val)
df["Country"] = tempArr
df
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
If the values are strings, you can use Series.str.strip to remove '[' and ']', then Series.str.split to turn each row into a list, and finally the .str accessor to take the first element:
df['Country'] = df['Country'].str.strip('[|]').str.split(',')\
    .str[0].str.replace("'", "")
Name Country
0 Harry USA
1 Sam USA
2 Raj India
3 Jamie Russia
4 Rupert China
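All of the answers above treat the lists as strings, which matches the sample data. If the column instead holds real Python list objects, no parsing is needed. A minimal sketch under that assumption (the frame below is a variant of the question's data with actual lists):

```python
import pandas as pd

# Assumption: some cells are genuine list objects rather than strings
df = pd.DataFrame({
    "Name": ["Harry", "Sam", "Raj", "Jamie", "Rupert"],
    "Country": ["USA", ["USA", "UK", "India"], ["India", "USA"], "Russia", "China"],
})

# Take the first element of non-empty lists; leave every other value untouched
df["Country"] = df["Country"].apply(
    lambda v: v[0] if isinstance(v, list) and v else v
)
print(df)
```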

converting print statement output in loop into dataframe

I am trying to adapt the following code from print statement to dataframe output.
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
def on_occurence(pos,location):
    print(i, ':', location)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
The print output for the above code is:
England UK : UK
Paris FRANCE : FRANCE
ITALY,gh ROME : ITALY
I would like it so the df looked like:
message        country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY
I have tried the following with no luck
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
df = pd.DataFrame(columns=["message","location"])
def on_occurence(pos,location):
    print(i, ':', location)
    df = df.append({"message":i,"location":location},ignore_index=True)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
However the df looks like the following
message  country
NEW      UK FRANCE ITALY
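Since the two lists line up by position, you can zip them straight into a DataFrame (zip stops at the shorter list, so 'New' is dropped):
df = pd.DataFrame(list(zip(places, location)), columns = ["Message", "Country"])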
print(df)
My output:
Message Country
0 England UK UK
1 Paris FRANCE FRANCE
2 ITALY,gh ROME ITALY
If you want to print it without Row Index:
print(df.to_string(index=False))
Output in this case is:
Message Country
England UK UK
Paris FRANCE FRANCE
ITALY,gh ROME ITALY
I would recommend using a dictionary instead of two separate lists, e.g.:
placeAndLocation = {
    "england UK" : "UK",
    "Paris France" : "france"
}
and so on.
Then, to loop through it, use:
for place, location in placeAndLocation.items():
    print("place: " + place)
    print("location: " + location)
I find this easier because you can see at a glance which place lines up with which value, and the data is contained in one variable, which makes it easier to read down the line.
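For the original goal of turning the callback's output into a DataFrame, one common pattern is to collect rows in a list inside the callback and build the frame once at the end. The sketch below is only illustrative: find_all is a hypothetical plain-substring stand-in for aho_find_all (whose implementation is not shown in the question), and the callback signature is simplified to receive the message directly.

```python
import pandas as pd

places = ["England UK", "Paris FRANCE", "ITALY,gh ROME", "New"]
location = ["UK", "FRANCE", "ITALY"]

def find_all(text, keywords, callback):
    """Hypothetical stand-in for aho_find_all: plain substring search."""
    for kw in keywords:
        if kw in text:
            callback(text, kw)

rows = []  # one dict per match, filled by the callback

def on_occurrence(message, country):
    rows.append({"message": message, "country": country})

for place in places:
    find_all(place, location, on_occurrence)

# Build the DataFrame once, after all matches have been collected
df = pd.DataFrame(rows, columns=["message", "country"])
print(df)
```

Appending to a list and constructing the frame at the end avoids reassigning df inside the callback, which is where the attempt in the question goes wrong.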

Sorting values in a pandas series in ascending order not working when re-assigned

I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China 1
Russian Federation 1
Canada 1
Germany 1
Italy 1
Spain 1
Brazil 1
South Korea 2.27935
Iran 5.70772
Japan 10.2328
United Kingdom 10.6005
United States 11.571
Australia 11.8108
India 14.9691
France 17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China 1
United States 11.571
Japan 10.2328
United Kingdom 10.6005
Russian Federation 1
Canada 1
Germany 1
India 14.9691
France 17.0203
South Korea 2.27935
Italy 1
Spain 1
Iran 5.70772
Australia 11.8108
Brazil 1
Name: HighRenew, dtype: object
Why is this giving me a different output from the one above?
I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
or
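Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).values
sort_values reorders the values, but each value keeps its index label, so assigning the sorted Series back to the column realigns it on the index and restores the original order. Passing a plain array (to_numpy() or .values) drops the index, so the values are assigned by position.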
Thank you to anky for providing me with this fantastic solution!
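The alignment behaviour can be reproduced on a tiny frame. This is a sketch with invented numbers, not the asker's data. It also shows DataFrame.sort_values, which reorders whole rows and so keeps each value attached to its country, whereas positional assignment moves values to different countries.

```python
import pandas as pd

# Small invented frame indexed by Country
top15 = pd.DataFrame(
    {"HighRenew": [3.0, 1.0, 2.0]},
    index=pd.Index(["China", "Japan", "India"], name="Country"),
)
sorted_series = top15["HighRenew"].sort_values()

aligned = top15.copy()
aligned["HighRenew"] = sorted_series              # realigned on Country: order unchanged

positional = top15.copy()
positional["HighRenew"] = sorted_series.to_numpy()  # assigned by position

whole = top15.sort_values("HighRenew")            # rows move together with their labels

print(aligned, positional, whole, sep="\n\n")
```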

How to test string contains elements in list and assign the target element to another column via Pandas

I have a single-column list of company names. Some of the names contain a country name (e.g., "China" in "China A1", "Finland" in "C1 in Finland"). I want to extract each company's country from its name, using a predefined list of country names.
The original dataframe df looks like this:
Company name Country
0 China A1
1 Australia-A2
2 Belgium_C1
3 C1 in Finland
4 D1 of Greece
5 E2 for Pakistan
For now, I can only come up with an inefficient method. Here is my code:
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
for t in country_list:
    df.loc[df['Company name'].str.contains(t), 'Country'] = t
The result shows like
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
I suspect that when country_list contains a large number of elements, i.e., countries, the loop will be time-consuming. Is there a simpler way to tackle my problem?
Here's one way using str.extract:
df['Country'] = df['Company name'].str.extract('('+'|'.join(country_list)+')')
Company name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
You need series.str.extract() here:
pat = r'({})'.format('|'.join(country_list))
# pat-->'(China|America|Greece|Pakistan|Finland|Belgium|Japan|British|Australia)'
df['Country']=df['Company name'].str.extract(pat, expand=False)
Maybe using findall in case you have more than one country name in one cell
df["Company name"].str.findall('|'.join(country_list)).str[0]
Out[758]:
0 China
1 Australia
2 Belgium
3 Finland
4 Greece
5 Pakistan
Name: Company name, dtype: object
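To see what findall returns when a cell contains several countries, or none, here is a small sketch with invented company names. Taking .str[0] keeps only the first match, while .str.join keeps them all in one string.

```python
import pandas as pd

country_list = ['China', 'America', 'Greece', 'Pakistan', 'Finland',
                'Belgium', 'Japan', 'British', 'Australia']

# Invented names: two countries, one country, no country
s = pd.Series(["China-Japan JV", "C1 in Finland", "No match Ltd"])

found = s.str.findall('|'.join(country_list))  # one list of matches per row
first = found.str[0]                           # first match, NaN if the list is empty
joined = found.str.join(", ")                  # all matches as a single string

print(found, first, joined, sep="\n\n")
```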
Using str.extract with Regex
Ex:
import pandas as pd
country_list = ['China','America','Greece','Pakistan','Finland','Belgium','Japan','British','Australia']
df = pd.read_csv(filename)
df["Country"] = df["Company_name"].str.extract("("+"|".join(country_list)+ ")")
print(df)
Output:
Company_name Country
0 China A1 China
1 Australia-A2 Australia
2 Belgium_C1 Belgium
3 C1 in Finland Finland
4 D1 of Greece Greece
5 E2 for Pakistan Pakistan
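A self-contained variant of the str.extract answers, as a sketch: the frame is rebuilt inline instead of read from a file, one unmatched row is added for illustration, and re.escape is my addition in case a name in the list contains regex metacharacters.

```python
import re
import pandas as pd

country_list = ['China', 'America', 'Greece', 'Pakistan', 'Finland',
                'Belgium', 'Japan', 'British', 'Australia']

df = pd.DataFrame({"Company name": [
    "China A1", "Australia-A2", "Belgium_C1",
    "C1 in Finland", "D1 of Greece", "E2 for Pakistan",
    "Z9 Unknown",   # invented row with no country in the name
]})

# One capture group containing every country name as an alternative
pat = "(" + "|".join(map(re.escape, country_list)) + ")"

# expand=False returns a Series; rows without a match become NaN
df["Country"] = df["Company name"].str.extract(pat, expand=False)
print(df)
```

This performs a single vectorised pass over the column instead of one pass per country.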
