Split a dataframe column on non-ASCII characters - python

This is a column with data that contains non-ASCII characters:
Summary 1
United Kingdom - ��Global Consumer Technology - ��American Express
United Kingdom - ��VP Technology - Founder - ��Hogarth Worldwide
Aberdeen - ��SeniorCore Analysis Specialist - ��COREX Group
London, - ��ED, Equit Technology, London - ��Morgan Stanley
United Kingdom - ��Chief Officer, Group Technology - ��BP
How can I split them and save the parts in different columns?
The code I used is:
import io
import pandas as pd
df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep='\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]
df.to_csv("/home/vipul/Desktop/new.csv")

Say, you have a column in a series like this:
s
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
Name: Summary 1, dtype: object
Option 1
Expanding on this answer, you can split on non-ascii characters using str.split:
s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)
   0               1                                2
0  United Kingdom  Global Consumer Technology       American Express
1  United Kingdom  VP Technology - Founder          Hogarth Worldwide
2  Aberdeen        SeniorCore Analysis Specialist   COREX Group
3  London,         ED, Equit Technology, London     Morgan Stanley
4  United Kingdom  Chief Officer, Group Technology  BP
Option 2
str.extractall + unstack:
s.str.extractall('([\x00-\x7f]+)')[0].str.rstrip(r'- ').unstack()
match  0               1                                2
0      United Kingdom  Global Consumer Technology       American Express
1      United Kingdom  VP Technology - Founder          Hogarth Worldwide
2      Aberdeen        SeniorCore Analysis Specialist   COREX Group
3      London,         ED, Equit Technology, London     Morgan Stanley
4      United Kingdom  Chief Officer, Group Technology  BP
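For anyone who wants to try Option 1 without the original CSV, here is a minimal, self-contained sketch. The two sample rows are copied from the question, but the actual non-ASCII characters are unknown (they appear garbled above), so U+FFFD is used as a stand-in; the column names are also only illustrative. The pattern additionally swallows the whitespace before the dash so the pieces come out without trailing spaces.

```python
import pandas as pd

# Hypothetical sample: U+FFFD stands in for the unknown non-ASCII
# characters that show up garbled in the question's data.
s = pd.Series(
    [
        "United Kingdom - \ufffd\ufffdGlobal Consumer Technology - \ufffd\ufffdAmerican Express",
        "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide",
    ],
    name="Summary 1",
)

# Split on: optional whitespace, a dash, optional whitespace, then one or
# more non-ASCII characters. A plain " - " inside a title is left alone.
parts = s.str.split(r"\s*-\s*[^\x00-\x7f]+", expand=True)
parts.columns = ["Location", "Title", "Company"]  # illustrative names
print(parts)
```

Because the split requires non-ASCII characters after the dash, "VP Technology - Founder" stays in a single column.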

Another approach:
a
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
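Use this function to keep only the ASCII characters (those whose Unicode code point is below 128), using the built-in ord function:
def extract_ascii(x):
    # Keep only the characters whose code point is in the ASCII range.
    string_list = filter(lambda y: ord(y) < 128, x)
    return ''.join(string_list)
Then apply it to the column and split on the dash:
df1.a.apply(extract_ascii).str.split('-', expand=True)
Here are the results (row 1 gets an extra column because its title itself contains a dash):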
   0               1                                2                 3
0  United Kingdom  Global Consumer Technology       American Express  None
1  United Kingdom  VP Technology                    Founder           Hogarth Worldwide
2  Aberdeen        SeniorCore Analysis Specialist   COREX Group       None
3  London,         ED, Equit Technology, London     Morgan Stanley    None
4  United Kingdom  Chief Officer, Group Technology  BP                None


Pandas filtering to get names of coaches who coach both the men's and women's teams

I have a dataframe like this -
    Name                       Country                   Discipline  Event
5   AIKMAN Siegfried Gottlieb  Japan                     Hockey      Men
6   AL SAADI Kais              Germany                   Hockey      Men
8   ALEKNO Vladimir            Islamic Republic of Iran  Volleyball  Men
9   ALEKSEEV Alexey            ROC                       Handball    Women
11  ALSHEHRI Saad              Saudi Arabia              Football    Men
.
.
.
I want to get the names of the coaches who coach both the Men and Women teams of a particular game (Discipline).
Please help me with this.
You can use groupby and check for groups that have Event count >= 2:
filtered = df.groupby(['Discipline', 'Name']).filter(lambda x: x['Event'].count() >= 2)
If you want a list of unique names, then simply:
>>> filtered.Name.unique()
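
One caveat: count() counts rows, so a coach who is listed twice for "Men" in the same discipline would also pass the >= 2 check. If that can happen in your data, counting distinct Event values is a possible variant. The sketch below uses made-up coach names and rows purely for illustration.

```python
import pandas as pd

# Hypothetical coaches and rows, used only to illustrate the edge case.
df = pd.DataFrame(
    {
        "Name": ["COACH A", "COACH A", "COACH B", "COACH B", "COACH C"],
        "Discipline": ["Hockey", "Hockey", "Hockey", "Hockey", "Football"],
        "Event": ["Men", "Women", "Men", "Men", "Men"],
    }
)

# Number of distinct events per (Discipline, Name), broadcast to each row.
n_events = df.groupby(["Discipline", "Name"])["Event"].transform("nunique")

# Keep only coaches with at least two distinct events in one discipline.
both = df.loc[n_events >= 2, "Name"].unique()
print(both)
```

Here COACH B has two rows but both are "Men", so only COACH A is returned.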

How to get groupby total and then calculate percentage of a Pandas DataFrame column

I'd be grateful for some help as no amount of googling or playing with .agg is helping me solve this problem.
I have a dataframe with election results. I have grouped by Municipality and PartyName to get a total vote for each party in a municipality and it looks like this snippet after I have reset the index:
   Municipality        PartyName                           TotalValidVotes
0  BUF - Buffalo City  AFRICAN CHRISTIAN DEMOCRATIC PARTY  2519
1  BUF - Buffalo City  AFRICAN INDEPENDENT CONGRESS        15600
2  BUF - Buffalo City  AFRICAN NATIONAL CONGRESS           268052
3  BUF - Buffalo City  CONGRESS OF THE PEOPLE              3913
4  BUF - Buffalo City  DEMOCRATIC ALLIANCE                 106790
I now want to calculate each party's percentage of the total for a Municipality, but I cannot figure out how to generate the sum of the votes per municipality so that I can do the percentage calculation.
It strikes me that this should be easy to do in pandas but I'm at a loss. Thanks in advance.
A simpler and yet more efficient version:
You can use .groupby() + .transform() on 'sum' to get the sum of the group. Then, you can divide the column TotalValidVotes with this sum and multiply by 100 to get the percentages.
df['TotalValidVotes_Pct'] = (df['TotalValidVotes'] / df.groupby('Municipality')['TotalValidVotes'].transform('sum')) * 100
Note that this version uses only vectorized operations and should run faster.
Result:
print(df)
   Municipality        PartyName                           TotalValidVotes  TotalValidVotes_Pct
0  BUF - Buffalo City  AFRICAN CHRISTIAN DEMOCRATIC PARTY  2519             0.634710
1  BUF - Buffalo City  AFRICAN INDEPENDENT CONGRESS        15600            3.930719
2  BUF - Buffalo City  AFRICAN NATIONAL CONGRESS           268052           67.540832
3  BUF - Buffalo City  CONGRESS OF THE PEOPLE              3913             0.985955
4  BUF - Buffalo City  DEMOCRATIC ALLIANCE                 106790           26.907784
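To see that transform('sum') really works per group, here is a small self-contained sketch. The Buffalo City rows come from the question; the second municipality and its two parties are invented so that there is more than one group to check.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # "XYZ - Example Town" and its parties are hypothetical rows.
        "Municipality": ["BUF - Buffalo City"] * 5 + ["XYZ - Example Town"] * 2,
        "PartyName": [
            "AFRICAN CHRISTIAN DEMOCRATIC PARTY",
            "AFRICAN INDEPENDENT CONGRESS",
            "AFRICAN NATIONAL CONGRESS",
            "CONGRESS OF THE PEOPLE",
            "DEMOCRATIC ALLIANCE",
            "PARTY A",
            "PARTY B",
        ],
        "TotalValidVotes": [2519, 15600, 268052, 3913, 106790, 300, 100],
    }
)

# transform("sum") returns one value per row: the total of that row's group.
totals = df.groupby("Municipality")["TotalValidVotes"].transform("sum")
df["TotalValidVotes_Pct"] = df["TotalValidVotes"] / totals * 100

# Sanity check: the percentages of each municipality add up to 100.
check = df.groupby("Municipality")["TotalValidVotes_Pct"].sum()
print(df)
print(check)
```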
You need to groupby the two vars (Municipality and PartyName) first, then groupby the first index (level=0) of the resultant aggregated DataFrame, then calculate the percentage on each group (.apply(...)).
from io import StringIO
import pandas as pd
s = """Municipality        PartyName                           TotalValidVotes
BUF - Buffalo City  AFRICAN CHRISTIAN DEMOCRATIC PARTY  2519
BUF - Buffalo City  AFRICAN INDEPENDENT CONGRESS        15600
BUF - Buffalo City  AFRICAN NATIONAL CONGRESS           268052
BUF - Buffalo City  CONGRESS OF THE PEOPLE              3913
BUF - Buffalo City  DEMOCRATIC ALLIANCE                 106790
"""
# Columns are separated by two or more spaces, hence the regex separator.
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")
df = (
    df.groupby(["Municipality", "PartyName"])
    .agg({"TotalValidVotes": "sum"})
    # group_keys=False keeps the original index, so reset_index() also
    # works on recent pandas versions.
    .groupby(level=0, group_keys=False)
    .apply(lambda g: 100 * g / g.sum())
    .reset_index()
)
Which produces:
   Municipality        PartyName                           TotalValidVotes
0  BUF - Buffalo City  AFRICAN CHRISTIAN DEMOCRATIC PARTY  0.634710
1  BUF - Buffalo City  AFRICAN INDEPENDENT CONGRESS        3.930719
2  BUF - Buffalo City  AFRICAN NATIONAL CONGRESS           67.540832
3  BUF - Buffalo City  CONGRESS OF THE PEOPLE              0.985955
4  BUF - Buffalo City  DEMOCRATIC ALLIANCE                 26.907784
This snippet should work without the need to make an intermediate DataFrame.

Assign values from a dictionary to a new column based on condition

This is my data frame:
City        sales
San Diego   500
Texas       400
Nebraska    300
Macau       200
Rome        100
London      50
Manchester  70
I want to add the country at the end, so that it looks like this:
City        sales  Country
San Diego   500    US
Texas       400    US
Nebraska    300    US
Macau       200    Hong Kong
Rome        100    Italy
London      50     England
Manchester  200    England
The countries are stored in the dictionary below:
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # Check if it's a list, if so then iterate through
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
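As a compact variant of the same idea, the inversion can also be written as a dict comprehension. This is only a sketch under the assumption that every value in the dictionary is either a single string or a list of strings, as in the question; city_to_country is a name chosen here for illustration.

```python
import pandas as pd

country = {
    "US": ["San Diego", "Texas", "Nebraska"],
    "Hong Kong": "Macau",
    "England": ["London", "Manchester"],
    "Italy": "Rome",
}

# Invert the mapping: wrap bare strings in a list so both cases iterate alike.
city_to_country = {
    city: name
    for name, cities in country.items()
    for city in ([cities] if isinstance(cities, str) else cities)
}

df = pd.DataFrame(
    {
        "City": ["San Diego", "Texas", "Nebraska", "Macau", "Rome", "London", "Manchester"],
        "sales": [500, 400, 300, 200, 100, 50, 70],
    }
)

# Cities missing from the mapping would come out as NaN.
df["Country"] = df["City"].map(city_to_country)
print(df)
```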
You can also implement this using geopy.
You can install geopy with pip install geopy.
Here is the documentation: https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
   City        sales  Country
0  San Diego   500    United States
1  Texas       400    United States
2  Nebraska    300    United States
3  Macau       200    中国
4  Rome        100    Italia
5  London      50     United Kingdom
6  Manchester  70     United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
   City        sales  Country
0  San Diego   500    United States
1  Texas       400    United States
2  Nebraska    300    United States
3  Macau       200    China
4  Rome        100    Italy
5  London      50     United Kingdom
6  Manchester  70     United Kingdom

Sorting values in a pandas series in ascending order not working when re-assigned

I am trying to sort a Pandas Series in ascending order.
Top15['HighRenew'].sort_values(ascending=True)
Gives me:
Country
China               1
Russian Federation  1
Canada              1
Germany             1
Italy               1
Spain               1
Brazil              1
South Korea         2.27935
Iran                5.70772
Japan               10.2328
United Kingdom      10.6005
United States       11.571
Australia           11.8108
India               14.9691
France              17.0203
Name: HighRenew, dtype: object
The values are in ascending order.
However, when I then modify the series in the context of the dataframe:
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True)
Top15['HighRenew']
Gives me:
Country
China               1
United States       11.571
Japan               10.2328
United Kingdom      10.6005
Russian Federation  1
Canada              1
Germany             1
India               14.9691
France              17.0203
South Korea         2.27935
Italy               1
Spain               1
Iran                5.70772
Australia           11.8108
Brazil              1
Name: HighRenew, dtype: object
Why is this giving me a different output from the one above?
I would be grateful for any advice.
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).to_numpy()
or
Top15['HighRenew'] = Top15['HighRenew'].sort_values(ascending=True).reset_index(drop=True)
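When you call sort_values, the index labels travel with the values, so assigning the result back to the column aligns it on the index and puts every value back in its original row! Note that the reset_index(drop=True) variant only helps when the DataFrame itself has a default integer index; with a Country index like the one above, the assignment would still align by label.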
Thank you to anky for providing me with this fantastic solution!
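
The alignment behaviour is easy to reproduce on a tiny frame. The sketch below uses a hypothetical three-row DataFrame in place of Top15; the column names aligned and positional are made up for the demonstration.

```python
import pandas as pd

# Hypothetical three-row frame standing in for Top15.
df = pd.DataFrame({"HighRenew": [3, 1, 2]}, index=["x", "y", "z"])

sorted_s = df["HighRenew"].sort_values()  # index is now y, z, x

# Assigning a Series aligns on index labels, so the sorted order is undone.
df["aligned"] = sorted_s

# Assigning a NumPy array is positional, so the sorted order is kept.
df["positional"] = sorted_s.to_numpy()

# Sorting whole rows keeps every column consistent with HighRenew.
sorted_df = df.sort_values("HighRenew")
print(df)
print(sorted_df)
```

Keep in mind that overwriting a single column with its sorted values detaches it from the other columns in each row; if the goal is a sorted table, sorting the whole DataFrame is probably what you want.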

Rounding and sorting dataframe with pandas

https://github.com/haosmark/jupyter_notebooks/blob/master/Coursera%20week%203%20assignment.ipynb
All the way at the bottom of the notebook, in question 3, I'm trying to average, round, and sort the data; however, for some reason the rounding and sorting aren't working at all.
i = df.columns.get_loc('2006')
avgGDP = df[df.columns[i:]].copy()
avgGDP = avgGDP.mean(axis=1).round(2).sort_values(ascending=False)
avgGDP
What am I doing wrong here?
This is what df looks like before I apply average, round, and sort.
Your series is actually sorted, the first line being 1.5e+13 and the last one 4.4e+11:
Country
United States       1.536434e+13
China               6.348609e+12
Japan               5.542208e+12
Germany             3.493025e+12
France              2.681725e+12
United Kingdom      2.487907e+12
Brazil              2.189794e+12
Italy               2.120175e+12
India               1.769297e+12
Canada              1.660648e+12
Russian Federation  1.565460e+12
Spain               1.418078e+12
Australia           1.164043e+12
South Korea         1.106714e+12
Iran                4.441558e+11
Rounding doesn't do anything visible here because the smallest value is 4e+11, and rounding it to 2 decimal places doesn't show on this scale. If you want to keep only 2 decimal places in the scientific notation, you can use .map('{:0.2e}'.format), see my note below.
Note: just for fun, you could also calculate the same with a one-liner:
df.filter(regex='^2').mean(1).sort_values()[::-1].map('{:0.2e}'.format)
Output:
Country
United States       1.54e+13
China               6.35e+12
Japan               5.54e+12
Germany             3.49e+12
France              2.68e+12
United Kingdom      2.49e+12
Brazil              2.19e+12
Italy               2.12e+12
India               1.77e+12
Canada              1.66e+12
Russian Federation  1.57e+12
Spain               1.42e+12
Australia           1.16e+12
South Korea         1.11e+12
Iran                4.44e+11
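The difference between rounding and formatting can be checked on two of the values above. This is a minimal sketch; avg_gdp is just an illustrative name for a Series holding those two averages.

```python
import pandas as pd

# Two of the averages from the output above.
avg_gdp = pd.Series(
    [1.536434e13, 4.441558e11], index=["United States", "Iran"]
)

# round(2) works on decimal places, which changes nothing at this scale.
rounded = avg_gdp.round(2)

# Formatting controls how many significant digits are displayed.
formatted = avg_gdp.map("{:0.2e}".format)
print(formatted)
```

The formatted values are strings, so they no longer sort numerically; it is safer to sort first and format last, as the one-liner does.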
