How to find the full name of an athlete in this case? - python

Let's say this is my data frame:
country Edition sports    Athletes              Medal  Firstname  Score
Germany 1990    Aquatics  HAJOS, Alfred         gold   Alfred     3
Germany 1990    Aquatics  HIRSCHMANN, Otto      silver Otto       2
Germany 1990    Aquatics  DRIVAS, Dimitrios     silver Dimitrios  2
US      2008    Athletics MALOKINIS, Ioannis    gold   Ioannis    1
US      2008    Athletics HAJOS, Alfred         silver Alfred     2
US      2009    Athletics CHASAPIS, Spiridon    gold   Spiridon   3
France  2010    Athletics CHOROPHAS, Efstathios gold   Efstathios 3
France  2010    Athletics CHOROPHAS, Efstathios gold   Efstathios 3
France  2010    golf      HAJOS, Alfred         Bronze Alfred     1
France  2011    golf      ANDREOU, Joannis      silver Joannis    2
Spain   2011    golf      BURKE, Thomas         gold   Thomas     3
I am trying to find out which athlete's first name has the largest sum of scores.
I have tried the following:
df.groupby('Firstname')['Score'].sum().idxmax()
This returns the first name of the athlete, but I want to display the athlete's full name instead. Can anyone help me with this?
For example: I am getting 'Otto' as output, but I want to display 'HIRSCHMANN, Otto'.
Note: what I have noticed in my original data set is that when I groupby('Athlete') the answer is different.

idxmax will only give you the index of the first row with the maximal value. If multiple Firstname values share the max score, it will fail to find all of them.
Try this instead:
sum_score = df.groupby('Firstname')['Score'].sum()
max_score = sum_score.max()
names = sum_score[sum_score == max_score].index
df[df['Firstname'].isin(names)]
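If you only want to display the full name(s) rather than the whole rows, one way (using the Athletes column from your example) is:
df.loc[df['Firstname'].isin(names), 'Athletes'].unique()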

Related

Pandas filtering to get names of coaches who coach both the men's and women's teams

I have a dataframe like this -
Name Country Discipline Event
5 AIKMAN Siegfried Gottlieb Japan Hockey Men
6 AL SAADI Kais Germany Hockey Men
8 ALEKNO Vladimir Islamic Republic of Iran Volleyball Men
9 ALEKSEEV Alexey ROC Handball Women
11 ALSHEHRI Saad Saudi Arabia Football Men
.
.
.
I want to get the names (Name) of coaches who coach both the Men's and Women's team of a particular game (Discipline).
Please help me with this.
You can use groupby and filter for groups that have an Event count >= 2:
filtered = df.groupby(['Discipline', 'Name']).filter(lambda x: x['Event'].count() >= 2)
If you want a list of unique names, then simply:
>>> filtered.Name.unique()
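As a side note, if the same coach/event combination can appear on more than one row, a plain row count may over-match; requiring two distinct Event values is safer (a sketch):
filtered = df.groupby(['Discipline', 'Name']).filter(lambda x: x['Event'].nunique() == 2)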

Creating a column identifier based on another column

I have a df below as
NAME
German Rural
1990 german
Mexican 1998
Mexican City
How can I create a new column based on the values of this column, e.g. if the value contains the term 'German' or 'german' (i.e. matching case-insensitively)?
Desired output:
NAME          Identifier
German Rural  Euro
1990 german   Euro
Mexican 1998  South American
Mexican City  South American
You could do that with something like the following.
import numpy as np

conditions = [df["NAME"].str.lower().str.contains("german"),
              df["NAME"].str.lower().str.contains("mexican")]
values = ["Euro", "South American"]
df["Identifier"] = np.select(conditions, values, default=np.nan)
print(df)
NAME Identifier
0 German Rural Euro
1 1990 german Euro
2 Mexican 1998 South American
3 Mexican City South American
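As a side note, str.contains accepts a case flag, so the .str.lower() step could be dropped:
conditions = [df["NAME"].str.contains("german", case=False),
              df["NAME"].str.contains("mexican", case=False)]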

clustering/grouping names with little or no anomalies into clusters in pandas

I have a dataframe with names field as:
print(df)
names
--------------------------------
0 U.S.A.
1 United States of America
2 USA
4 US America
5 Kenyan Footbal League
6 Kenyan Football League
7 Kenya Football League Assoc.
8 Kenya Footbal League Association
9 Tata Motors
10 Tat Motor
11 Tata Motors Ltd.
12 Tata Motor Limited
13 REL
14 Reliance Limited
15 Reliance Co.
Now I want to group all these similar names into one category, such that the final dataframe looks something like this:
print(df)
names group_name
---------------------------------------------
0 U.S.A. USA
1 United States of America USA
2 USA USA
4 US America USA
5 Kenyan Footbal League Kenya Football League
6 Kenyan Football League Kenya Football League
7 Kenya Football League Assoc. Kenya Football League
8 Kenya Footbal League Association Kenya Football League
9 Tata Motors Tata Motors
10 Tat Motor Tata Motors
11 Tata Motors Ltd. Tata Motors
12 Tata Motor Limited Tata Motors
13 REL Reliance
14 Reliance Limited. Reliance
15 Reliance Co. Reliance
Now this is just 16 records, so it's easy to look up all the possible names and anomalies and create a dictionary for mapping. But in reality I have a data frame with about 5800 unique names (NOTE: 'USA' and 'U.S.A.' are counted as different entities when stating the count of uniques). So is there any programmatic approach to tackle such a scenario?
I tried running fuzzy matching using the difflib and fuzzywuzzy libraries, but even their final results were not concrete. Often difflib would match names up just because they shared words like 'limited' or 'association', even though they referred to two different entities with only that word in common.
Any help is appreciated.
EDIT:
Even if I create a list of stop words such as 'association', 'limited', 'corporations', 'group', etc., there is a chance of missing them when they are written differently. For instance, if 'association' and 'limited' are written as 'assoc.', 'ltd' or 'ltd.', I might miss adding some of these variants to the stop-word list.
I have already tried topic modelling with LDA and NMF; the results were pretty similar to what I had achieved earlier using the difflib and fuzzywuzzy libraries. And yes, I did all the preprocessing (converting to lower case, lemmatization, handling extra whitespace) before any of these approaches.
Late answer; I spent about an hour on this. You can use difflib.SequenceMatcher and keep the matches whose ratio is greater than 0.6. After normalizing a few names by hand, I drop the last word of each candidate match and take the longest remaining string, which gives your desired result:
import difflib
df2 = df.copy()
# normalize a few variants by hand first
df2.loc[df2.names.str.contains('America'), 'names'] = 'US'
df2['names'] = df2.names.str.replace('.', '', regex=False).str.lstrip()  # regex=False: treat '.' literally
df2.loc[df2.names.str.contains('REL'), 'names'] = 'Reliance'
# for each name, keep candidates with ratio > 0.6, drop their last word, take the longest
df['group_name'] = df2.names.apply(
    lambda x: max(sorted(i.rsplit(None, 1)[0] for i in df2.names.tolist()
                         if difflib.SequenceMatcher(None, x, i).ratio() > 0.6), key=len))
print(df)
Output:
names group_name
0 U.S.A. USA
1 United States of America USA
2 USA USA
3 US America USA
4 Kenyan Footbal League Kenya Football League
5 Kenyan Football League Kenya Football League
6 Kenya Football League Assoc. Kenya Football League
7 Kenya Footbal League Association Kenya Football League
8 Tata Motors Tata Motors
9 Tat Motor Tata Motors
10 Tata Motors Ltd. Tata Motors
11 Tata Motor Limited Tata Motors
12 REL Reliance
13 Reliance Limited Reliance
14 Reliance Co. Reliance
That's my best effort at the code.
To my knowledge you cannot get fully accurate results, but there are some things that will help you clean your data:
First, lowercase the strings using .lower()
Strip the strings to remove extra spaces using .strip()
Tokenize the strings
Apply stemming and lemmatization to your data
You should also research sentence similarity; multiple libraries for this exist in Python, such as gensim, spacy and nltk:
https://radimrehurek.com/gensim/tutorial.html
https://spacy.io/
https://www.nltk.org/
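A minimal sketch of those cleaning steps using nltk (assuming the punkt and wordnet data have already been downloaded):
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def normalize(name):
    # lowercase, strip whitespace, tokenize, then lemmatize each token
    tokens = nltk.word_tokenize(name.lower().strip())
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

df['clean_names'] = df['names'].apply(normalize)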
I also created a very basic document-similarity project; you can check it on GitHub:
https://github.com/tawabshakeel/Document-similarity-NLP-
I hope all these things help you solve your problem.

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose 'country' value makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows with proportions less than .01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this, assuming you first attach each row's country proportion as a column (a sketch of that step follows the expected output):
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
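A sketch of one way to build that proportion column, mapping each country to its share of the rows:
wine['proportion'] = wine.country.map(wine.country.value_counts(normalize=True))
wine = wine[wine.proportion >= .01]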
Figured it out:
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter].index
wine = wine[wine.country.isin(country_index)]

pandas groupby include a column in final result

cast year revenue title
id
135397 Chris Pratt 2015 1.392446e+09 Jurassic World
135397 Bryce Dallas Howard 2015 1.392446e+09 Jurassic World
135397 Irrfan Khan 2015 1.392446e+09 Jurassic World
135397 Nick Robinson 2015 1.392446e+09 Jurassic World
Given the above DataFrame, I would like to find the highest earning actors per year (based on the combined revenue of movies they acted in that year). This is what I have so far:
# get the total revenue associated with each cast member for each year
f = {'revenue': sum}
# revenue by year for each cast member
df_actor_yr = df.groupby(['year', 'cast']).agg(f)
df_actor_yr
year cast
1960 Anthony Perkins 2.359350e+08
Charles Laughton 4.423780e+08
Fred MacMurray 1.843242e+08
Jack Kruschen 1.843242e+08
Jean Simmons 4.423780e+08
John Gavin 2.359350e+08
Kirk Douglas 4.423780e+08
Vera Miles 2.359350e+08
1961 Anthony Quayle 2.108215e+08
Anthony Quinn 2.108215e+08
Ben Wright 1.574815e+09
Betty Lou Gerson 1.574815e+09
...
Next, to get the highest earning cast member for each year, I did the following:
df_actor_yr.reset_index(inplace=True)
g = {"revenue": max}
df_actor_yr = df_actor_yr.groupby('year').agg(g)
df_actor_yr
revenue
year
1960 4.423780e+08
1961 1.574815e+09
1962 5.045914e+08
1963 5.617734e+08
1964 8.780804e+08
1965 1.129535e+09
1967 1.345551e+09
1968 4.187094e+08
1969 6.081511e+08
...
This only gives me the year and the maximum revenue for that year. I would also like to get the name of the cast member associated with that revenue. How do I go about doing this?
You can split your logic into 2 steps. First sum by cast and year using GroupBy + sum. Then find the maximum revenue by year using GroupBy + idxmax:
# sum by cast and year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()
# maximums by year
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]
print(res)
cast year revenue
3 NickRobinson 2012 3.401340e+09
0 BryceDallasHoward 2015 1.568978e+09
For the above output, I've used more interesting data:
id cast year revenue title
135397 ChrisPratt 2015 1.392446e+09 JurassicWorld
135397 BryceDallasHoward 2015 1.568978e+09 SomeMovie
135397 IrrfanKhan 2012 1.392446e+09 JurassicWorld
135397 NickRobinson 2012 1.046987e+09 JurassicWorld
135398 NickRobinson 2012 2.354353e+09 SomeOtherMovie
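As a side note, an equivalent way to get the top cast member per year is to sort the summed revenues and keep the first row per year:
res = (df.groupby(['cast', 'year'], as_index=False)['revenue'].sum()
         .sort_values('revenue', ascending=False)
         .drop_duplicates('year')
         .sort_values('year'))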
