How to compare two cells with strings in pandas? - Python

I have a pandas dataframe as follows:
FIRST GOAL WINNER
Algeria brazil
Argentina Argentina
Japan Germany
brazil brazil
france France
I want to check if the first goal scorer is the winner of the game. Can someone help?
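For reference, a minimal sketch that rebuilds the frame above (the column names FIRST GOAL and WINNER are assumed from the table):
import pandas as pd

df = pd.DataFrame({
    'FIRST GOAL': ['Algeria', 'Argentina', 'Japan', 'brazil', 'france'],
    'WINNER': ['brazil', 'Argentina', 'Germany', 'brazil', 'France'],
})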

You need:
df['is_winner'] = df['FIRST GOAL'].str.lower() == df['WINNER'].str.lower()
Output:
FIRST GOAL WINNER is_winner
0 Algeria brazil False
1 Argentina Argentina True
2 Japan Germany False
3 brazil brazil True
4 france France True

IIUC:
You need to compare france to France, which requires normalizing the strings. We can make all letters UPPER, lower, or Title case; I went with lower.
nunique
Stack, then use str.lower to normalize capitalization.
In this answer, I stacked the dataframe in order to only have to call str.lower once on the stacked Series object. I then determined the number of unique values per the first level of the index, which were our old rows. If the number of unique values is equal to one, then the columns must have been equal.
df.stack().str.lower().groupby(level=0).nunique().eq(1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Or
df.assign(is_winner=df.stack().str.lower().groupby(level=0).nunique().eq(1))
FIRST GOAL WINNER is_winner
0 Algeria brazil False
1 Argentina Argentina True
2 Japan Germany False
3 brazil brazil True
4 france France True
Series.str.lower
This is virtually identical to Harv Ipan's answer with the exception that I added str.lower().
df.assign(is_winner=df['FIRST GOAL'].str.lower() == df['WINNER'].str.lower())
applymap
This is succinct: one applymap call that applies str.lower, then a trick that unpacks the values array into the eq operator.
from operator import eq
df.assign(winner=eq(*df.applymap(str.lower).values.T))
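For clarity, the unpacking trick is equivalent to this more explicit sketch (note that on pandas 2.1 and later, DataFrame.map is the preferred spelling of applymap):
lowered = df.applymap(str.lower)   # lower-case every cell
left, right = lowered.values.T     # one 1-D array per column (FIRST GOAL, WINNER)
df.assign(winner=left == right)    # element-wise comparison; eq(left, right) gives the same result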


ValueError: Series.replace cannot use dict-value and non-None to_replace when creating a conditional column

Given this dataframe named df:
Number City Country
one Milan Italy
two Paris France
three London UK
four Berlin Germany
five Milan Italy
six Oxford UK
I would like to create a new column called 'Classification' based on this condition:
if df['Country'] = "Italy" and df['City'] = "Milan", result = "zero" else result = df['Number']
The result I want to achieve is this:
Number City Country Classification
one Milan Italy zero
two Paris France two
three London UK three
four Berlin Germany four
five Milan Italy zero
six Oxford UK six
I tried to use this code:
condition = [(df['Country'] == "Italy") & (df['City'] == 'Milan'),]
values = ['zero']
df['Classification'] = np.select(condition, values)
the result of which is this dataframe:
Number City Country Classification
one Milan Italy zero
two Paris France 0
three London UK 0
four Berlin Germany 0
five Milan Italy zero
six Oxford UK 0
Now I try to replace the 0 in the 'Classification' column with the values of the 'Number' column:
df['Classification'].replace(0, df['Number'])
but the result I get is an error:
ValueError: Series.replace cannot use dict-value and non-None to_replace
I would be very grateful for any suggestion on how to fix this
What you want is np.where:
df['Classification'] = np.where((df['Country'] == "Italy") & (df['City'] == 'Milan'), 'zero', df['Number'])
print(df)
Number City Country Classification
0 one Milan Italy zero
1 two Paris France two
2 three London UK three
3 four Berlin Germany four
4 five Milan Italy zero
5 six Oxford UK six
If you want to use np.select, you need to specify the default argument:
condition = [(df['Country'] == "Italy") & (df['City'] == 'Milan'),]
values = ['zero']
df['Classification'] = np.select(condition, values, default=df['Number'])
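Putting it together, a self-contained sketch of the np.select approach (the frame is rebuilt from the question's example):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Number': ['one', 'two', 'three', 'four', 'five', 'six'],
    'City': ['Milan', 'Paris', 'London', 'Berlin', 'Milan', 'Oxford'],
    'Country': ['Italy', 'France', 'UK', 'Germany', 'Italy', 'UK'],
})
condition = [(df['Country'] == 'Italy') & (df['City'] == 'Milan')]
# Rows matching the condition get 'zero'; all others fall back to their 'Number' value
df['Classification'] = np.select(condition, ['zero'], default=df['Number'])
print(df)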

Create a column that divides the other 2 columns using Pandas Apply()

Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence"
This is what I have written
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases'] / row['population']

df['prevalence'] = df.apply(calculate_ratio, axis=1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
Unless you've been explicitly told to use apply here, you can perform the operation on the columns themselves, which gives a much faster vectorized operation:
G_copy['prevalence'] = G_copy['cases'] / G_copy['population']
If you must use apply, apply it to the dataframe rather than to the two series:
G_copy['prevalence'] = G_copy.apply(lambda row: row['cases'] / row['population'], axis=1)
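For completeness, a self-contained sketch of the vectorized version (the data is taken from the table in the question):
import pandas as pd

G = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China', 'Afghanistan', 'Brazil', 'China'],
    'year': [1999, 1999, 1999, 2000, 2000, 2000],
    'cases': [745, 37737, 212258, 2666, 80488, 213766],
    'population': [19987071, 172006362, 1272915272, 20595360, 174504898, 1280428583],
})
G_copy = G.copy()
G_copy['prevalence'] = G_copy['cases'] / G_copy['population']   # vectorized, no apply needed
print(G_copy)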

How can I fill cells of a new column based on a substring of the original data using pandas?

There are two dataframes, and they contain similar data.
A dataframe
Index Business Address
1 Oils Moskva, Russia
2 Foods Tokyo, Japan
3 IT California, USA
... etc.
B dataframe
Index Country Country Calling Codes
1 USA +1
2 Egypt +20
3 Russia +7
4 Korea +82
5 Japan +81
... etc.
I want to add a column named 'Country Calling Codes' to dataframe A as well.
Then the 'Country' column of dataframe B should be compared with the 'Address' column of A. If the string in 'A.Address' contains the string in 'B.Country', the corresponding 'B.Country Calling Codes' value should be inserted into 'A.Country Calling Codes' for that row.
Result is:
Index Business Address Country Calling Codes
1 Oils Moskva, Russia +7
2 Foods Tokyo, Japan +81
3 IT California, USA +1
I don't know how to approach this because I don't have much experience with pandas. I would be very grateful for any help.
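For reference, a minimal sketch that rebuilds the two frames (column names taken from the tables above); the answer below operates on A and B:
import pandas as pd

A = pd.DataFrame({
    'Index': [1, 2, 3],
    'Business': ['Oils', 'Foods', 'IT'],
    'Address': ['Moskva, Russia', 'Tokyo, Japan', 'California, USA'],
})
B = pd.DataFrame({
    'Index': [1, 2, 3, 4, 5],
    'Country': ['USA', 'Egypt', 'Russia', 'Korea', 'Japan'],
    'Country Calling Codes': ['+1', '+20', '+7', '+82', '+81'],
})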
Use Series.str.extract to pull the country name out of the Address column, then map it to the codes with Series.map:
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print (A)
Index Business Address Country Calling Codes
0 1 Oils Moskva, Russia +7
1 2 Foods Tokyo, Japan +81
2 3 IT California, USA +1
Detail:
print (A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False))
0 Russia
1 Japan
2 USA
Name: Address, dtype: object

How to plot frequency count of pandas column

I have a pandas dataframe like this:
Year Winner
4 1954 Germany
9 1974 Germany
13 1990 Germany
19 2014 Germany
5 1958 Brazil
6 1962 Brazil
8 1970 Brazil
14 1994 Brazil
16 2002 Brazil
How can I plot the frequency count of the Winner column, so that the y-axis shows the frequency and the x-axis shows the country name?
I tried:
import numpy as np
import pandas as pd
df.groupby('Winner').size().plot.hist()
df1['Winner'].value_counts().plot.hist()
You are close; you need Series.plot.bar, because value_counts already counts the frequencies:
df1['Winner'].value_counts().plot.bar()
Also working:
df1.groupby('Winner').size().plot.bar()
The difference between the solutions is that the output of value_counts is sorted in descending order, so the first element is the most frequently occurring one.
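A self-contained sketch of the bar-plot approach, with the frame rebuilt from the question's data:
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({
    'Year': [1954, 1974, 1990, 2014, 1958, 1962, 1970, 1994, 2002],
    'Winner': ['Germany'] * 4 + ['Brazil'] * 5,
})
ax = df1['Winner'].value_counts().plot.bar()   # one bar per country, height = frequency
ax.set_xlabel('Winner')
ax.set_ylabel('Frequency')
plt.show()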
In addition to @jezrael's answer, you can also do:
df1['Winner'].value_counts().plot(kind='bar')
Another variant of @jezrael's approach:
df1.groupby('Winner').size().plot(kind='bar')
This also works with a recent version of plotly; you just need to add text_auto=True.
Example:
px.histogram(df, x="User Country", text_auto=True)
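Adapted to the Winner example above, a minimal sketch (this assumes plotly 5.5 or later, where text_auto was added, and reuses the df1 built in the earlier sketch):
import plotly.express as px

fig = px.histogram(df1, x='Winner', text_auto=True)   # one bar per country, labelled with its count
fig.show()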

How do I turn objects or values in pandas df into boolean?

What I want to do is change values in a column into boolean.
What I am looking at: I have a dataset of artists with a column named "Death Year".
That column contains either the death year or NaN, which I changed to 'Alive'. I want to turn this column into booleans: a death year becomes False and 'Alive' becomes True. The dtype of this column is object.
Reproducible Example:
df = pd.DataFrame({'DeathYear':[2005,2003,np.nan,1993]})
DeathYear
0 2005.0
1 2003.0
2 NaN
3 1993.0
which you turned into
df['DeathYear'] = df['DeathYear'].fillna('Alive')
DeathYear
0 2005
1 2003
2 Alive
3 1993
You can just use
df['BoolDeathYear'] = df['DeathYear'] == 'Alive'
DeathYear BoolDeathYear
0 2005 False
1 2003 False
2 Alive True
3 1993 False
Notice that, if your final goal is to have the bool column, you don't have to fill the NaNs at all.
You can just do:
df['BoolDeathYear'] = df['DeathYear'].isnull()
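A minimal sketch of that approach on the original frame, before any fillna:
import numpy as np
import pandas as pd

df = pd.DataFrame({'DeathYear': [2005, 2003, np.nan, 1993]})
df['BoolDeathYear'] = df['DeathYear'].isnull()   # True where no death year is recorded, i.e. the artist is alive
print(df)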
