Add new column based on two conditions - python

I have the following table in python:
Country  Year  Date
Spain    2020  2020-08-10
Germany  2020  2020-08-10
Italy    2019  2020-08-11
Spain    2019  2020-08-20
Spain    2020  2020-06-10
I would like to add a new column that gives 1 if it's the first date of the year in a country and 0 if it's not the first date.
I've tried to write a function, but I'm conscious that it doesn't really make sense:

def first_date(x, country, year):
    if df["date"] == df[(df["country"] == country) & (df["year"] == year)]["date"].min():
        x == 1
    else:
        x == 0

There are many ways to achieve this. Let's create a groupby object to get the index of the minimum date for each country, so we can do some assignment using .loc.
As an aside, using if with pandas is usually an anti-pattern: pandas has native functions that achieve the same thing while taking advantage of the vectorised code base under the hood.
Recommended reading: https://pandas.pydata.org/docs/user_guide/10min.html
df.loc[df.groupby(['Country'])['Date'].idxmin(), 'x'] = 1
df['x'] = df['x'].fillna(0)
Country Year Date x
0 Spain 2020 2020-08-10 0.0
1 Germany 2020 2020-08-10 1.0
2 Italy 2019 2020-08-11 1.0
3 Spain 2019 2020-08-20 0.0
4 Spain 2020 2020-06-10 1.0
Or using np.where with df.index.isin:

import numpy as np

df['x'] = np.where(
    df.index.isin(df.groupby(['Country'])['Date'].transform('idxmin')), 1, 0)
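Since the question asks for the first date per country *per year*, a transform-based sketch that groups on both keys may be closer to the intent (the frame below mirrors the sample table):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Spain', 'Germany', 'Italy', 'Spain', 'Spain'],
    'Year': [2020, 2020, 2019, 2019, 2020],
    'Date': pd.to_datetime(['2020-08-10', '2020-08-10', '2020-08-11',
                            '2020-08-20', '2020-06-10']),
})

# transform('min') broadcasts each (Country, Year) group's earliest date
# back to every row; eq() then flags the rows that match it.
df['x'] = df['Date'].eq(
    df.groupby(['Country', 'Year'])['Date'].transform('min')).astype(int)
```

Unlike the idxmin approach, this also marks every occurrence of a tied earliest date with 1.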

argument of type "float" is not iterable when trying to use for loop

I have a countrydf as below, in which each cell in the country column contains a list of the countries where the movie was released.
countrydf
id  Country         release_year
s1  [US]            2020
s2  [South Africa]  2021
s3  NaN             2021
s4  NaN             2021
s5  [India]         2021
I want to make a new df which looks like this:
country_yeardf
Year US UK Japan India
1925 NaN NaN NaN NaN
1926 NaN NaN NaN NaN
1927 NaN NaN NaN NaN
1928 NaN NaN NaN NaN
It has the release year and the number of movies released in each country.
My solution is: starting from a blank df like the second one, run a for loop to count the number of movies released and then update the corresponding cell.

countrylist = ['Afghanistan', 'Aland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', ….]
for x in countrylist:
    for j in list(range(0, 8807)):
        if x in countrydf.country[j]:
            t = int(countrydf.release_year[j])
            country_yeardf.at[t, x] = country_yeardf.at[t, x] + 1
an error occurred which read:
TypeError Traceback (most recent call last)
<ipython-input-25-225281f8759a> in <module>()
1 for x in countrylist:
2 for j in li:
----> 3 if x in countrydf.country[j]:
4 t=int(countrydf.release_year[j])
5 country_yeardf.at[t, x] = country_yeardf.at[t, x]+1
TypeError: argument of type 'float' is not iterable
I don't know which one is of float type here; I have checked the type of countrydf.country[j] and it returned int.
I was using pandas and I am just getting started with it. Can anyone please explain the error and suggest a solution for the df that I want to create?
P/s: my English is not so good, so I hope you guys understand.
Here is a solution using groupby:

import pandas as pd

df = pd.DataFrame([['US', 2015], ['India', 2015], ['US', 2015], ['Russia', 2016]], columns=['country', 'year'])
country year
0 US 2015
1 India 2015
2 US 2015
3 Russia 2016
Now just groupby country and year and unstack the output:
df.groupby(['year', 'country']).size().unstack()
country India Russia US
year
2015 1.0 NaN 2.0
2016 NaN 1.0 NaN
Some alternative ways to achieve this in pandas without loops.
If the Country column can have more than one value in the list in each row, you can try the below:
>>df['Country'].str.join("|").str.get_dummies().groupby(df['release_year']).sum()
India South Africa US
release_year
2020 0 0 1
2021 1 1 0
Else, if Country has just one value per row in the list, as shown in the example, you can use crosstab:
>>pd.crosstab(df['release_year'],df['Country'].str[0])
Country India South Africa US
release_year
2020 0 0 1
2021 1 1 0
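As for the TypeError itself: the NaN cells are floats, so `x in countrydf.country[j]` fails on those rows. A sketch that avoids both the loop and the NaN problem is to explode the lists first (column names follow the sample above):

```python
import pandas as pd

countrydf = pd.DataFrame({
    'id': ['s1', 's2', 's3', 's4', 's5'],
    'Country': [['US'], ['South Africa'], float('nan'), float('nan'), ['India']],
    'release_year': [2020, 2021, 2021, 2021, 2021],
})

# explode() gives each list element its own row and leaves NaN rows as NaN;
# crosstab then drops the NaN rows by default while counting the rest.
counts = pd.crosstab(countrydf['release_year'], countrydf['Country'].explode())
```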

Column Differences for MultiIndex Dataframes

I've got (probably) quite a simple problem that I just cannot wrap my head around right now. I'm collecting the following two series:
import pandas as pd
from pandas_datareader import wb

countries = [
    'DZA', 'ARM', 'AZE', 'BLR', 'BIH', 'BRN', 'KHM', 'CHN', 'HRV', 'CZE', 'EGY',
    'EST', 'GEO', 'HUN', 'IND', 'IDN', 'ISR', 'JPN', 'JOR', 'KAZ', 'KOR', 'KGZ', 'LAO', 'LVA',
    'LBN', 'LTU', 'MYS', 'MDA', 'MNG', 'MMR', 'MKD', 'PHL', 'POL', 'ROU', 'RUS', 'SAU',
    'SGP', 'SVK', 'SVN', 'TJK', 'THA', 'TUR', 'UKR', 'UZB', 'VNM'
]
dat = wb.download(indicator='FR.INR.LEND', country=countries, start=2010, end=2019)
dat.columns = ['lending_rate']
us = wb.download(indicator='FR.INR.LEND', country='US', start=2010, end=2019)
us.columns = ['lending_rate_us']
dat2 = pd.concat([dat, us])
dat2
I'd like to take the difference between lending_rate and lending_rate_us, but obviously would like to subtract the US-only lending_rate_us from lending_rate in all other countries (i.e. avoid what would otherwise lead to NaNs everywhere).
So I guess what I'm trying to do is to copy the values for lending_rate_us to all other countries to then take the difference between both columns.
Does anyone have an idea how to do that (or an alternative idea that makes more sense)?
Thanks!
EDIT:
I tried the following, alas without success:

from pandas_datareader import wb

countries = [
    'DZA', 'ARM', 'AZE', 'BLR', 'BIH', 'BRN', 'KHM', 'CHN', 'HRV', 'CZE', 'EGY',
    'EST', 'GEO', 'HUN', 'IND', 'IDN', 'ISR', 'JPN', 'JOR', 'KAZ', 'KOR', 'KGZ', 'LAO', 'LVA',
    'LBN', 'LTU', 'MYS', 'MDA', 'MNG', 'MMR', 'MKD', 'PHL', 'POL', 'ROU', 'RUS', 'SAU',
    'SGP', 'SVK', 'SVN', 'TJK', 'THA', 'TUR', 'UKR', 'UZB', 'VNM'
]
dat = wb.download(indicator='FR.INR.LEND', country=countries, start=2010, end=2019)
dat.columns = ['lending_rate']
us = wb.download(indicator='FR.INR.LEND', country='US', start=2010, end=2019)
us.columns = ['lending_rate']
for i in dat.index.get_level_values(0).unique():
    dat["lending_rate_spread"] = dat.loc[i, :] - us.loc["United States", :]
dat
Output:
lending_rate lending_rate_spread
country year
Armenia
2019 12.141989 NaN
2018 12.793042 NaN
2017 14.406002 NaN
2016 17.356706 NaN
2015 17.590330 NaN
... ... ... ...
Vietnam
2014 8.665000 NaN
2013 10.374167 NaN
2012 13.471667 NaN
2011 16.953833 NaN
2010 13.135250 NaN
450 rows × 2 columns
But when I just print the result of the loop without creating a new column, I get the correct values:

for i in dat.index.get_level_values(0):
    print(dat.loc[i, :] - us.loc["United States", :])
Output:
lending_rate
year
2019 6.859489
2018 7.888875
2017 10.309335
2016 13.845039
2015 14.330330
2014 13.158665
2013 12.744987
2012 13.980068
2011 14.504474
2010 15.950428
lending_rate
year
2019 11.998715
2018 12.544167
2017 12.445833
2016 12.863560
2015 14.274167
I don't understand why I get the correct result when printing but not when assigning it to a new column?
In response to your comment, I reviewed the data again. I reworked the data for each country, as NA data existed, and found that every country has data for all 10 years.
The method @Paul commented on is therefore possible, so I modified the code. Since both frames are sorted by descending year, tiling the 10 US values once per country lines them up row by row:

dat = dat.reset_index()
us = us.reset_index()
dat['lending_rate_us'] = us['lending_rate_us'].tolist() * dat['country'].nunique()
dat['lending_rate_spread'] = dat['lending_rate'] - dat['lending_rate_us']
dat.head()

   country  year  lending_rate  lending_rate_us  lending_rate_spread
0  Armenia  2019     12.141989         5.282500             6.859489
1  Armenia  2018     12.793042         4.904167             7.888875
2  Armenia  2017     14.406002         4.096667            10.309335
3  Armenia  2016     17.356706         3.511667            13.845039
4  Armenia  2015     17.590330         3.260000            14.330330
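An alternative that avoids copying the US column altogether: drop the country level from the US frame and align the subtraction on year. A self-contained sketch with made-up stand-in numbers (the real frames come from wb.download):

```python
import pandas as pd

# Stand-ins for the wb.download results: (country, year) MultiIndex frames.
idx = pd.MultiIndex.from_product(
    [['Armenia', 'Vietnam'], ['2019', '2018']], names=['country', 'year'])
dat = pd.DataFrame({'lending_rate': [12.14, 12.79, 7.7, 7.0]}, index=idx)

us_idx = pd.MultiIndex.from_product(
    [['United States'], ['2019', '2018']], names=['country', 'year'])
us = pd.DataFrame({'lending_rate_us': [5.28, 4.90]}, index=us_idx)

# Drop the country level so the US rates are indexed by year alone, then
# map each row's year to the matching US rate and subtract.
us_by_year = us.droplevel('country')['lending_rate_us']
dat['lending_rate_spread'] = (
    dat['lending_rate']
    - dat.index.get_level_values('year').map(us_by_year).to_numpy())
```

This stays robust if some country is missing a year, since the lookup goes through each row's own year value rather than relying on row order.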

How to use the value of one column as part of a string to fill NaNs in another column?

Let's say I have the following df:
year date_until
1 2010 -
2 2011 30.06.13
3 2011 NaN
4 2015 30.06.18
5 2020 -
I'd like to fill all - and NaN values in the date_until column with 30/06/{year + 1}. I tried the following, but it uses the whole year column instead of the value in the corresponding row:

df['date_until'] = df['date_until'].str.replace('-', f'30/06/{df["year"] + 1}')

My final goal is to calculate the difference between year and the year of date_until, so maybe the step above is even unnecessary.
We can use pd.to_datetime here with errors='coerce' to turn the faulty dates into NaT, then use dt.year to calculate the difference:
df['date_until'] = pd.to_datetime(df['date_until'], format='%d.%m.%y', errors='coerce')
df['diff_year'] = df['date_until'].dt.year - df['year']
year date_until diff_year
0 2010 NaT NaN
1 2011 2013-06-30 2.0
2 2011 NaT NaN
3 2015 2018-06-30 3.0
4 2020 NaT NaN
For everybody who is trying to replace values just like I wanted to in the first place, here is how you could solve it:
for i in range(len(df)):
    if pd.isna(df['date_until'].iloc[i]):
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'
    if df['date_until'].iloc[i] == '-':
        df['date_until'].iloc[i] = f'30.06.{df["year"].iloc[i] + 1}'

But @Erfan's approach is much cleaner.
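For anyone who wants the replacement without a row-by-row loop, a vectorised sketch (the frame below mimics the sample; the four-digit year is an assumption, since the question's format string uses year + 1 directly):

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2011, 2011, 2015, 2020],
                   'date_until': ['-', '30.06.13', None, '30.06.18', '-']})

# Build one boolean mask covering both kinds of missing value, then fill
# only those rows from the year column - no explicit loop needed.
bad = df['date_until'].isna() | df['date_until'].eq('-')
df.loc[bad, 'date_until'] = '30.06.' + (df.loc[bad, 'year'] + 1).astype(str)
```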

Filtering Dataframe in Python

I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing the maximum count of country in every year.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way can be to use groupby along with size to find the count in each category, then sort the values and slice by the number of years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each Year and Country pair with groupby and size.
Then get the index of the max value per year with idxmax and select rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:

df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year','Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just to provide a method without groupby:

Count = (pd.Series(list(zip(df2.Year, df2.Country))).value_counts()
         .head(2).reset_index(name='Count'))
Count[['Year', 'Country']] = Count['index'].apply(pd.Series)
Count.drop(columns='index')
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
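On newer pandas (1.1+, which added DataFrame.value_counts), this becomes a two-liner; a sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016],
    'Country': ['US', 'US', 'UK', 'Indonesia', 'US', 'India', 'India', 'UK'],
})

# Count every (Year, Country) pair, then keep the largest count per year.
counts = df.value_counts(['Year', 'Country']).reset_index(name='Count')
top = counts.loc[counts.groupby('Year')['Count'].idxmax()].reset_index(drop=True)
```

Unlike slicing the top num_year rows, idxmax picks the maximum within each year, so it still works when one year's runner-up outnumbers another year's winner.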

Aggregate function to data frame in pandas

I want to create a dataframe from an aggregate function. I thought that it would create a dataframe by default, as this solution states, but it creates a series and I don't know why (Converting a Pandas GroupBy object to DataFrame).
The dataframe is from Kaggle's San Francisco Salaries. My code:
df=pd.read_csv('Salaries.csv')
in: type(df)
out: pandas.core.frame.DataFrame
in: df.head()
out: EmployeeName JobTitle TotalPay TotalPayBenefits Year Status 2BasePay 2OvertimePay 2OtherPay 2Benefits 2Year
0 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 567595.43 567595.43 2011 NaN 167411.18 0.00 400184.25 NaN 2011-01-01
1 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 538909.28 538909.28 2011 NaN 155966.02 245131.88 137811.38 NaN 2011-01-01
2 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 335279.91 335279.91 2011 NaN 212739.13 106088.18 16452.60 NaN 2011-01-01
3 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 332343.61 332343.61 2011 NaN 77916.00 56120.71 198306.90 NaN 2011-01-01
4 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 326373.19 326373.19 2011 NaN 134401.60 9737.00 182234.59 NaN 2011-01-01
in: df2=df.groupby(['JobTitle'])['TotalPay'].mean()
type(df2)
out: pandas.core.series.Series
I want df2 to be a dataframe with the columns 'JobTitle' and 'TotalPay'.
Breaking down your code:
df2 = df.groupby(['JobTitle'])['TotalPay'].mean()
The groupby is fine. It's the ['TotalPay'] that is the misstep. That is telling the groupby to only execute the mean function on the pd.Series df['TotalPay'] for each group defined by ['JobTitle']. Instead, refer to this column with [['TotalPay']]. Notice the double brackets: those double brackets say pd.DataFrame.
Recap:

df2 = df.groupby(['JobTitle'])[['TotalPay']].mean()
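Equivalently, as_index=False keeps JobTitle as a regular column of the resulting frame; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'JobTitle': ['A', 'A', 'B'],
                   'TotalPay': [10.0, 20.0, 30.0]})

# Double brackets keep the result a DataFrame; as_index=False leaves
# JobTitle as a column instead of moving it into the index.
df2 = df.groupby('JobTitle', as_index=False)[['TotalPay']].mean()
```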
