How to plot frequency count of pandas column - python

I have a pandas dataframe like this:
    Year   Winner
4   1954  Germany
9   1974  Germany
13  1990  Germany
19  2014  Germany
5   1958   Brazil
6   1962   Brazil
8   1970   Brazil
14  1994   Brazil
16  2002   Brazil
How can I plot the frequency count of the column Winner, so that the y-axis has the frequency and the x-axis has the country name?
I tried:
import numpy as np
import pandas as pd
df.groupby('Winner').size().plot.hist()
df1['Winner'].value_counts().plot.hist()

You are close; you need Series.plot.bar, because value_counts already counts the frequencies:
df1['Winner'].value_counts().plot.bar()
Also working:
df1.groupby('Winner').size().plot.bar()
The difference between the solutions is that the output of value_counts is in descending order, so the first element is the most frequently occurring one.
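A quick sketch of that ordering difference (with two hypothetical Argentina rows added to the sample data, so that the two orderings actually differ):

```python
import pandas as pd

# the sample winners, plus two Argentina rows for illustration only
df1 = pd.DataFrame({"Winner": ["Germany"] * 4 + ["Brazil"] * 5 + ["Argentina"] * 2})

counts = df1["Winner"].value_counts()  # sorted by frequency, descending
sizes = df1.groupby("Winner").size()   # sorted by group label, ascending

print(counts.index.tolist())  # ['Brazil', 'Germany', 'Argentina']
print(sizes.index.tolist())   # ['Argentina', 'Brazil', 'Germany']
```

So the two bar charts contain the same bars, just in a different order along the x-axis.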

In addition to @jezrael's answer, you can also do:
df1['Winner'].value_counts().plot(kind='bar')
The other one from @jezrael could be:
df1.groupby('Winner').size().plot(kind='bar')

Just want to say this works with the latest version of plotly; you just need to add text_auto=True.
Example:
import plotly.express as px

px.histogram(df, x="User Country", text_auto=True)

Related

Add new column based on two conditions

I have the following table in python:
Country   Year  Date
Spain     2020  2020-08-10
Germany   2020  2020-08-10
Italy     2019  2020-08-11
Spain     2019  2020-08-20
Spain     2020  2020-06-10
I would like to add a new column that gives 1 if it's the first date of the year in a country and 0 if it's not the first date.
I've tried to write a function but I'm conscious that it doesn't really make sense:
def first_date(x, country, year):
    if df["date"] == df[(df["country"] == country) & (df["year"] == year)]["date"].min():
        x == 1
    else:
        x == 0
There are many ways to achieve this. Let's create a groupby object to get the min index of each country so we can do some assignment using .loc
As an aside, using if with pandas is usually an anti-pattern; there are native pandas functions that achieve the same thing while taking advantage of the vectorised code base under the hood.
Recommend reading: https://pandas.pydata.org/docs/user_guide/10min.html
df.loc[df.groupby(['Country'])['Date'].idxmin(), 'x'] = 1
df['x'] = df['x'].fillna(0)
Country Year Date x
0 Spain 2020 2020-08-10 0.0
1 Germany 2020 2020-08-10 1.0
2 Italy 2019 2020-08-11 1.0
3 Spain 2019 2020-08-20 0.0
4 Spain 2020 2020-06-10 1.0
or using np.where with df.index.isin
import numpy as np
df['x'] = np.where(
df.index.isin(df.groupby(['Country'])['Date'].transform('idxmin')),1,0)
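Note that both snippets above group on Country only. If "first date of the year in a country" is meant per (Country, Year) pair, a sketch grouping on both keys with the same .loc/idxmin idea:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain", "Germany", "Italy", "Spain", "Spain"],
    "Year": [2020, 2020, 2019, 2019, 2020],
    "Date": pd.to_datetime(["2020-08-10", "2020-08-10", "2020-08-11",
                            "2020-08-20", "2020-06-10"]),
})

# mark the earliest Date within each (Country, Year) group with 1
df["x"] = 0
df.loc[df.groupby(["Country", "Year"])["Date"].idxmin(), "x"] = 1
print(df)
```

With this grouping the Spain/2019 row also gets x=1, since it is the only date in its group.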

How to return the rows from the largest value from a group by in Pandas?

I am ranking each instance of a group by. I want to return only the rows where the largest "Rank" occurs. In this example the only rows I want to return are those where "Rank" is the largest for each individual State grouping.
import pandas as pd
import numpy as np
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
df1['Rank'] = df1.groupby(['State'])['Sales'].cumcount().add(1)
Use:
In [1001]: df1[df1['Rank'].eq(df1.groupby('State')['Rank'].transform('max'))]
Out[1001]:
Product State Sales Rank
8 Box North Carolina 18 2
9 Markers Alaska 16 3
10 Markers California 18 3
11 Pen Texas 14 4
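Since Rank comes from cumcount, the maximum rank within each State group is by construction its last row, so (assuming the same df1 as above) the result can also be reproduced without computing Rank at all, using GroupBy.tail:

```python
import pandas as pd

data = {'Product': ['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
        'State': ['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
        'Sales': [14,24,31,12,13,7,9,31,18,16,18,14]}
df1 = pd.DataFrame(data)

# the last occurrence of each State is the row with the highest cumcount rank
last_per_state = df1.groupby('State').tail(1)
print(last_per_state)
```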
Not exactly sure what the desired output should look like, but following your requirements the below should work. It will give you only the max rank on a per-State, per-Product basis:
>>> df1.groupby(['State','Product'], as_index=False).max()

DataFrame non-NaN series assignment results in NaN

I cannot find a reason why, when I assign the scaled variable (which is non-NaN) to the original DataFrame, I get NaNs even though the index matches (years).
Can anyone help? I am leaving out details which I think are not necessary, happy to provide more details if needed.
So, given the following multi-index dataframe df:
                     value
country        year
Canada         2007      1
               2006      2
               2005      3
United Kingdom 2007      4
               2006      5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
You can assign it as a new column if reindexed and converted to a list first, like this:
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
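The NaNs come from index alignment: df has a (country, year) MultiIndex while scaled is indexed by year only, so no labels match during assignment. An alternative sketch that sidesteps the list conversion by reindexing scaled against the year level for all countries at once:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.MultiIndex.from_tuples(
        [("Canada", 2007), ("Canada", 2006), ("Canada", 2005),
         ("United Kingdom", 2007), ("United Kingdom", 2006)],
        names=["country", "year"]),
)
scaled = pd.Series({2006: 99, 2007: 54, 2005: 78})

# align scaled on the year level, then assign positionally via a plain array
df["new_values"] = scaled.reindex(df.index.get_level_values("year")).to_numpy()
print(df)
```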

Create a column that divides the other 2 columns using Pandas Apply()

Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence"
This is what I have written
def calc_prevalence(G):
assert 'cases' in G.columns and 'population' in G.columns
G_copy = G.copy()
G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
return row['cases']/row['population']
df['prevalence'] = df.apply(calculate_ratio, axis = 1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use an apply function here for some reason, you can call the operation on the columns themselves, resulting in a much faster vectorized operation, i.e.:
G_copy['prevalence']=G_copy['cases']/G_copy['population']
Finally, if you must use an apply for some reason, apply on the df instead of the two series;
G_copy['prevalence']=G_copy.apply(lambda row: row['cases']/row['population'],axis=1)
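For completeness: the KeyError happens because G_copy['cases','population'] looks up a single column named with the tuple ('cases', 'population'). A minimal repair of the original function (double brackets plus axis=1, and returning the copy instead of calling display):

```python
import pandas as pd

def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    # [['cases', 'population']] selects a two-column DataFrame;
    # axis=1 makes apply pass one row at a time to the lambda
    G_copy['prevalence'] = G_copy[['cases', 'population']].apply(
        lambda x: x['cases'] / x['population'], axis=1)
    return G_copy

df = pd.DataFrame({
    'country': ['Afghanistan', 'Brazil', 'China'],
    'year': [1999, 1999, 1999],
    'cases': [745, 37737, 212258],
    'population': [19987071, 172006362, 1272915272],
})
result = calc_prevalence(df)
print(result)
```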

How can I add the values of one column based on a groupby in Python?

Suppose I have the following dataframe:
year count
2001 14
2004 16
2001 2
2005 21
2001 22
2004 14
2001 8
I want to group by the year column and add the count column for each given year. I would like my result to be
year count
2001 46
2004 30
2005 21
I am struggling a bit to find a way to do this; can anyone help?
import pandas as pd
df = pd.read_csv("test.csv")
df['count'] = pd.to_numeric(df['count'])
#df['count'] = df.groupby(['year'])['count'].sum()
total = df.groupby(['year'])['count'].sum()
print(total)
Yields:
year
2001 46
2004 30
2005 21
Hope this helps! Assuming your pandas DataFrame is named df, the groupby runs like this:
df.groupby('year')[['count']].sum()
It will return the DataFrame you want.
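If you'd rather keep year as an ordinary column in the result instead of the index, as_index=False does that; a sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"year": [2001, 2004, 2001, 2005, 2001, 2004, 2001],
                   "count": [14, 16, 2, 21, 22, 14, 8]})

# as_index=False keeps 'year' as a regular column in the output
total = df.groupby("year", as_index=False)["count"].sum()
print(total)
#    year  count
# 0  2001     46
# 1  2004     30
# 2  2005     21
```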
