Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence"
This is what I have written
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases']/row['population']
df['prevalence'] = df.apply(calculate_ratio, axis = 1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use apply here, you can operate on the columns themselves, which gives a much faster vectorized operation, i.e.:
G_copy['prevalence'] = G_copy['cases'] / G_copy['population']
Finally, if you must use apply, call it on the DataFrame with axis=1 instead of on the two columns. The KeyError comes from G_copy['cases','population']: pandas looks up the tuple ('cases', 'population') as a single column label. Selecting two columns takes a list, G_copy[['cases','population']].
G_copy['prevalence'] = G_copy.apply(lambda row: row['cases'] / row['population'], axis=1)
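For reference, here is a minimal sketch of the asker's function with the column selection fixed, checked against the vectorized division. It assumes the function should return the copy instead of calling display (which only exists in notebooks); the frame is rebuilt from the rows in the question.

```python
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"] * 2,
    "year": [1999] * 3 + [2000] * 3,
    "cases": [745, 37737, 212258, 2666, 80488, 213766],
    "population": [19987071, 172006362, 1272915272,
                   20595360, 174504898, 1280428583],
})


def calc_prevalence(G):
    assert "cases" in G.columns and "population" in G.columns
    G_copy = G.copy()
    # Double brackets select a two-column DataFrame;
    # axis=1 hands the lambda one row at a time.
    G_copy["prevalence"] = G_copy[["cases", "population"]].apply(
        lambda row: row["cases"] / row["population"], axis=1
    )
    # Assumption: return the result rather than display() it.
    return G_copy


result = calc_prevalence(df)

# The vectorized form gives the same numbers without a Python-level loop.
vectorized = df["cases"] / df["population"]
print(result)
```

The original frame is left untouched because the function works on a copy.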
I cannot find a reason why, when I assign the scaled variable (which contains no NaNs) to the original DataFrame, I get NaNs even though the index (years) matches.
Can anyone help? I am leaving out details which I think are not necessary, happy to provide more details if needed.
So, given the following multi-index dataframe df:
value
country year
Canada 2007 1
2006 2
2005 3
United Kingdom 2007 4
2006 5
And the following series scaled:
2006 99
2007 54
2005 78
dtype: int64
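The NaNs most likely come from index alignment: when a Series is assigned, pandas matches it against the target's (country, year) labels rather than by position, and the plain year labels of scaled do not match those. You can assign it as a new column if it is reindexed and converted to a list first, so that the values are assigned by position, like this: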
df.loc["Canada", "new_values"] = scaled.reindex(df.loc["Canada", :].index).to_list()
print(df.loc["Canada", :])
# Output
value new_values
year
2007 1 54.0
2006 2 99.0
2005 3 78.0
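An alternative that avoids the list conversion is to look the values up by the year level of the index. This is only a sketch, not the answerer's method: it rebuilds df and scaled from the question, and the names years, is_canada and mapped are made up for the illustration.

```python
import pandas as pd

# Frame and series rebuilt from the question.
df = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.MultiIndex.from_tuples(
        [("Canada", 2007), ("Canada", 2006), ("Canada", 2005),
         ("United Kingdom", 2007), ("United Kingdom", 2006)],
        names=["country", "year"],
    ),
)
scaled = pd.Series({2006: 99, 2007: 54, 2005: 78})

# Look up each row's year in `scaled`, ignoring the country level.
years = df.index.get_level_values("year")
mapped = pd.Series(years.map(scaled).to_numpy(), index=df.index)

# Keep the looked-up values for Canada only; other rows become NaN.
is_canada = df.index.get_level_values("country") == "Canada"
df["new_values"] = mapped.where(is_canada)
print(df)
```

Because mapped carries the same MultiIndex as df, the assignment aligns row by row and no values are lost.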
I have data for many countries over a period of time (2001-2003). It looks something like this:
index year country inflation GDP
1 2001 AFG nan 48
2 2002 AFG nan 49
3 2003 AFG nan 50
4 2001 CHI 3.0 nan
5 2002 CHI 5.0 nan
6 2003 CHI 7.0 nan
7 2001 USA nan 220
8 2002 USA 4.0 250
9 2003 USA 2.5 280
I want to drop countries in case there is no data (i.e. values are missing for all years) for any given variable.
In the example table above, I want to drop AFG (because it misses all values for inflation) and CHI (GDP missing). I don't want to drop observation #7 just because one year is missing.
What's the best way to do that?
This should work: it keeps only the countries for which neither inflation nor GDP is missing in every year:
(
    df.groupby(['country'])
    .filter(lambda x: not x['inflation'].isnull().all() and not x['GDP'].isnull().all())
)
Note, if you have more than two columns you can work on a more general version of this:
df.groupby(['country']).filter(lambda x: not x.isnull().all().any())
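To see the filter at work, here is a small sketch on the sample table from the question. The value_cols list is an assumption added for the example, so that only the measurement columns are checked.

```python
import numpy as np
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "year": [2001, 2002, 2003] * 3,
    "country": ["AFG"] * 3 + ["CHI"] * 3 + ["USA"] * 3,
    "inflation": [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    "GDP": [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})

value_cols = ["inflation", "GDP"]  # hypothetical list of columns to check

# Keep a country only if every value column has at least one non-null entry.
kept = df.groupby("country").filter(
    lambda g: g[value_cols].notna().any().all()
)
print(kept)
```

AFG and CHI are dropped because one of their columns is empty for all years; USA stays even though its 2001 inflation is missing.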
If you want this to consider only a specific range of years instead of all years, you can set up a mask and change the code a bit:
mask = (df['year'] >= 2002) & (df['year'] <= 2003) # mask of years
# mask.loc[x.index] aligns the mask to the rows of each group
grp = df.groupby(['country']).filter(lambda x: not x[mask.loc[x.index]].isnull().all().any())
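The year-range variant can also be written without a lambda by counting non-null values per country inside the range. This is a sketch under assumptions: the helper name countries_with_data and its parameters are made up, and the data is the sample table from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2001, 2002, 2003] * 3,
    "country": ["AFG"] * 3 + ["CHI"] * 3 + ["USA"] * 3,
    "inflation": [np.nan, np.nan, np.nan, 3.0, 5.0, 7.0, np.nan, 4.0, 2.5],
    "GDP": [48, 49, 50, np.nan, np.nan, np.nan, 220, 250, 280],
})


def countries_with_data(frame, first_year, last_year, value_cols):
    """Keep countries with at least one non-null value per column in the range."""
    in_range = frame["year"].between(first_year, last_year)
    # count() ignores NaN, so 0 means "no data at all in this range".
    counts = frame[in_range].groupby("country")[value_cols].count()
    good = counts[(counts > 0).all(axis=1)].index
    return frame[frame["country"].isin(good)]


cols = ["inflation", "GDP"]
recent = countries_with_data(df, 2002, 2003, cols)
only_2001 = countries_with_data(df, 2001, 2001, cols)
print(recent)
```

Restricting the range to 2001 alone removes every country in this sample, because each one misses a value in that year.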
You can also try this:
# a sum of 0 is taken to mean the column has no values for that country
# (an all-NaN group sums to 0, but so does a group whose real values cancel out)
group_by = df.groupby(['country']).agg({'inflation': 'sum', 'GDP': 'sum'}).reset_index()
# extract only the countries with information in both columns
indexes = group_by[(group_by['GDP'] != 0) & (group_by['inflation'] != 0)].index
final_countries = list(group_by.loc[group_by.index.isin(indexes), 'country'])
# keep only the rows for those countries
df = df.drop(df[~df.country.isin(final_countries)].index)
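One caveat with testing the sum against 0 is that real values can also add up to 0. The sketch below uses a made-up two-country frame (not data from the question) to compare the sum test with counting non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: X has real inflation values that cancel out,
# Y has no inflation data at all.
df = pd.DataFrame({
    "country": ["X", "X", "Y", "Y"],
    "inflation": [-1.0, 1.0, np.nan, np.nan],
    "GDP": [10.0, 20.0, 30.0, 40.0],
})
cols = ["inflation", "GDP"]

sums = df.groupby("country")[cols].sum()      # all-NaN groups sum to 0
counts = df.groupby("country")[cols].count()  # number of non-null values

kept_by_sum = sums[(sums != 0).all(axis=1)].index.tolist()
kept_by_count = counts[(counts > 0).all(axis=1)].index.tolist()

print(kept_by_sum)    # X is lost here because -1 + 1 == 0
print(kept_by_count)  # X is kept, Y is dropped
```

Counting is therefore the safer test when the columns can contain negative numbers or zeros.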
You could reshape the data frame from long to wide, drop nulls, and then convert back to long.
To convert from long to wide, you can use the pivot functions.
Here's code for dropping nulls after it's reshaped:
df.dropna(axis=0, how='any', inplace=True) # delete rows where any value is null
To convert back to long, you can use pd.melt.
I have countries_of_the_world.csv. Basically, it's the table with the following bits of information:
Country Region GDP
Austria Western Europe 100
Chad Africa 30
I need to group the countries by region and, inside each region, sort them by GDP in descending order. It should look like:
Region Country GDP
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50
I tried 'groupby', but it doesn't work without an aggregation function applied, so I tried a lambda, but it didn't sort correctly:
countries.sort_values(['GDP'], ascending=False).groupby(['Region','Country']).aggregate(lambda x:x)
How can I handle it?
Use DataFrame.sort_values by both columns and then convert Region and Country to MultiIndex by DataFrame.set_index:
df1 = (countries.sort_values(['Region','GDP'], ascending=[True, False])
                .set_index(['Region','Country']))
print (df1)
GDP
Region Country
Africa Egypt 42
Chad 30
Kongo 28
Oceania Australia 120
New Zealand 100
Indonesia 50
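A self-contained version of the same idea, for trying it without the CSV file. The rows and region assignments are taken from the expected output in the question; the unsorted input order is made up.

```python
import pandas as pd

# Rows from the question's expected output, deliberately shuffled.
countries = pd.DataFrame({
    "Country": ["Chad", "Australia", "Egypt", "Indonesia", "Kongo", "New Zealand"],
    "Region": ["Africa", "Oceania", "Africa", "Oceania", "Africa", "Oceania"],
    "GDP": [30, 120, 42, 50, 28, 100],
})

# Region ascending, GDP descending inside each region,
# then both labels moved into a MultiIndex.
df1 = (
    countries.sort_values(["Region", "GDP"], ascending=[True, False])
    .set_index(["Region", "Country"])
)
print(df1)

# One region can then be pulled out by its label.
africa = df1.loc["Africa"]
```

No groupby is needed here, since nothing is aggregated; sorting on two keys is enough.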
I have a pandas dataframe which looks like this:
Country Sold
Japan 3432
Japan 4364
Korea 2231
India 1130
India 2342
USA 4333
USA 2356
USA 3423
I have used the code below to get the sum of the "Sold" column:
df1= df.groupby(df['Country'])
df2 = df1.sum()
How can I calculate each country's percentage of the total of the "Sold" column?
You can get the percentage by adding this code
df2["percentage"] = df2['Sold']*100 / df2['Sold'].sum()
In the output dataframe, a column with the percentage of each country is added.
We can divide the original Sold column by the grouped sums, broadcast back to the same length as the original DataFrame with transform. This gives each row's share of its own country's sales:
df.assign(
    pct_per=df['Sold'] / df.groupby('Country')['Sold'].transform('sum')
)
Country Sold pct_per
0 Japan 3432 0.440226
1 Japan 4364 0.559774
2 Korea 2231 1.000000
3 India 1130 0.325461
4 India 2342 0.674539
5 USA 4333 0.428501
6 USA 2356 0.232991
7 USA 3423 0.338509
Simple Solution
You were almost there.
First, group by country.
Then create the new percentage column by dividing the grouped sales by the sum of all sales:
# reset_index() is only there because the groupby makes the grouped column the index
df_grouped_countries = df.groupby(df.Country).sum().reset_index()
df_grouped_countries['pct_sold'] = df_grouped_countries.Sold / df.Sold.sum()
Are you looking for the percentage after or before aggregation?
import pandas as pd
countries = [['Japan',3432],['Japan',4364],['Korea',2231],['India',1130], ['India',2342],['USA',4333],['USA',2356],['USA',3423]]
df = pd.DataFrame(countries,columns=['Country','Sold'])
df1 = df.groupby(df['Country'])
df2 = df1.sum()
df2['percentage'] = (df2['Sold']/df2['Sold'].sum()) * 100
df2
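The two readings of the question, before and after aggregation, can be put side by side. This is a sketch on the data from the question; the column names percentage and pct_of_country are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Japan", "Japan", "Korea", "India", "India", "USA", "USA", "USA"],
    "Sold": [3432, 4364, 2231, 1130, 2342, 4333, 2356, 3423],
})

# After aggregation: each country's share of all sales, in percent.
per_country = df.groupby("Country", as_index=False)["Sold"].sum()
per_country["percentage"] = per_country["Sold"] / per_country["Sold"].sum() * 100

# Before aggregation: each row's share of its own country's sales (a fraction).
df["pct_of_country"] = df["Sold"] / df.groupby("Country")["Sold"].transform("sum")

print(per_country)
print(df)
```

The first result has one row per country and sums to 100; the second keeps all original rows and sums to 1 inside each country.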
I have a pandas dataframe as follows:
FIRST GOAL WINNER
Algeria brazil
Argentina Argentina
Japan Germany
brazil brazil
france France
I want to check whether the first goal scorer is the winner of the game. Can someone help?
You need:
df['is_winner'] = df['FIRST GOAL'].str.lower() == df['WINNER'].str.lower()
Output:
FIRST GOAL WINNER is_winner
0 Algeria brazil False
1 Argentina Argentina True
2 Japan Germany False
3 brazil brazil True
4 france France True
IIUC:
You need to compare france to France which requires normalization of the string. We can make all letters UPPER, lower, or Title. I went with lower.
nunique
Stack, then use str.lower to normalize capitalization.
In this answer, I stacked the dataframe in order to only have to call str.lower once on the stacked Series object. I then determined the number of unique values per the first level of the index, which were our old rows. If the number of unique values is equal to one, then the columns must have been equal.
df.stack().str.lower().groupby(level=0).nunique().eq(1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Or
df.assign(is_winner=df.stack().str.lower().groupby(level=0).nunique().eq(1))
FIRST GOAL WINNER is_winner
0 Algeria brazil False
1 Argentina Argentina True
2 Japan Germany False
3 brazil brazil True
4 france France True
Series.str.lower
This is virtually identical to Harv Ipan's answer with the exception that I added str.lower().
df.assign(is_winner=df['FIRST GOAL'].str.lower() == df['WINNER'].str.lower())
applymap
This is succinct: one call to applymap with str.lower, and then I got tricky by unpacking the transposed values array into the eq operator. Note that applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
from operator import eq
df.assign(winner=eq(*df.applymap(str.lower).values.T))
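All of the comparisons above normalize the case with str.lower. As a variation, not taken from any of the answers, str.casefold can be swapped in; it is a more aggressive form of lowercasing intended for caseless matching. A minimal sketch on the data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "FIRST GOAL": ["Algeria", "Argentina", "Japan", "brazil", "france"],
    "WINNER": ["brazil", "Argentina", "Germany", "brazil", "France"],
})

# casefold() normalizes case on both sides before the element-wise comparison.
df["is_winner"] = df["FIRST GOAL"].str.casefold() == df["WINNER"].str.casefold()
print(df)
```

For plain ASCII team names like these, casefold and lower give the same result.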