I have the following dataframe:
count country year age_group gender type
7 Albania 2006 014 f ep
1 Albania 2007 014 f ep
3 Albania 2008 014 f ep
2 Albania 2009 014 f ep
2 Albania 2010 014 f ep
I'm trying to change the "gender" column so that 'f' becomes 'female' and 'm' becomes 'male'.
I tried the following code:
who3['gender'] = pd.np.where(who3['gender'] == 'f', "female")
But it gives me this error:
Now when I try this code:
who3['gender'] = pd.np.where(who3['gender'] == 'f', "female",
pd.np.where(who3['gender'] == 'm', "male"))
I get error below:
What am I doing wrong?
You can also use .replace():
df["gender"] = df["gender"].replace({"f": "female", "m": "male"})
print(df)
Prints:
count country year age_group gender type
0 7 Albania 2006 14 female ep
1 1 Albania 2007 14 female ep
2 3 Albania 2008 14 female ep
3 2 Albania 2009 14 female ep
4 2 Albania 2010 14 female ep
np.where takes the condition as its first argument, the value to use where the condition is met as the second, and the value to use where it is not met as the third. The second and third arguments must be given together, which is why a call with only "female" fails. Try this:
who3['gender'] = np.where(who3['gender'] == 'f', "female", 'male')
Another solution is using replace method:
who3['gender'] = who3['gender'].replace({'f': 'female', 'm': 'male'})
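As a minimal sketch of the three-argument form, the nested call from the question can be completed by giving the inner np.where its own fallback value. The small who3 frame and the "unknown" label below are assumptions made for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real who3 frame has more columns.
who3 = pd.DataFrame({"gender": ["f", "m", "f", "x"]})

# Every np.where call gets all three arguments:
# condition, value where true, value where false.
who3["gender_long"] = np.where(
    who3["gender"] == "f",
    "female",
    np.where(who3["gender"] == "m", "male", "unknown"),
)

# replace() only touches the keys in the mapping and leaves other values alone.
who3["gender_replaced"] = who3["gender"].replace({"f": "female", "m": "male"})

print(who3)
```

The difference between the two shows up on values that are neither 'f' nor 'm': np.where forces you to pick a fallback, while replace keeps the original value.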
I have the following dataframe:
import pandas as pd
fertilityRates = pd.read_csv('fertility_rate.csv')
fertilityRatesRowCount = len(fertilityRates.axes[0])
fertilityRates.head(fertilityRatesRowCount)
I have found a way to find the mean for each row over columns 1960-1969, but would like to do so without removing the column called "Country".
The following is what is outputted after I execute the following commands:
Mean1960To1970 = fertilityRates.iloc[:, 1:11].mean(axis=1)
Mean1960To1970
You can use pandas.DataFrame.loc to select a range of years by label (e.g. "1960":"1968" selects the columns from 1960 through 1968, both endpoints included).
Try this:
Mean1960To1968 = (
fertilityRates[["Country"]]
.assign(Mean= fertilityRates.loc[:, "1960":"1968"].mean(axis=1))
)
# Output :
print(Mean1960To1968)
Country Mean
0 _World 5.004444
1 Afghanistan 7.450000
2 Albania 5.913333
3 Algeria 7.635556
4 Angola 7.030000
5 Antigua and Barbuda 4.223333
6 Arab World 7.023333
7 Argentina 3.073333
8 Armenia 4.133333
9 Aruba 4.044444
10 Australia 3.167778
11 Austria 2.715556
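To make the label-based slice concrete, here is a small self-contained sketch. The tiny fertility frame and its values are made up for illustration (the real data comes from fertility_rate.csv); it only assumes that the year columns are labelled with strings such as "1960".

```python
import pandas as pd

# Hypothetical stand-in for the CSV: a "Country" column plus string year columns.
fertility = pd.DataFrame({
    "Country": ["A", "B"],
    "1960": [6.0, 3.0],
    "1961": [5.0, 3.0],
    "1962": [4.0, 3.0],
    "1963": [1.0, 1.0],
})

# .loc slices by label and includes both endpoints, so "1963" is left out here.
row_means = fertility.loc[:, "1960":"1962"].mean(axis=1)

# Keep "Country" and attach the row-wise mean next to it.
result = fertility[["Country"]].assign(Mean=row_means)
print(result)
```

Compared with iloc[:, 1:11], the label slice does not depend on the position of the columns, only on their names.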
I have a dataframe to which I want to append rows computed with groupby plus some additional conditions. I'm looking for a for loop or any other solution that works.
Or, if it's easier:
first melt the df, then add a new ratio % col, then unmelt.
Since the calculations are custom, I think a for loop could solve it, with or without groupby.
---Lines 6, 7, 8 are my requirement.---
0-14 = child and unemployed
14-50 = young and working
50+ = old and unemployed
# ref line 6,7,8 = showing which rows to (+) and (/)
Currently I want to put 3 conditions in output line 6,7,8:
d = { 'year': [2019,2019,2019,2020,2020,2020],
'age group': ['(0-14)','(14-50)','(50+)','(0-14)','(14-50)','(50+)'],
'con': ['UK','UK','UK','US','US','US'],
'population': [10,20,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
year age group con population
0 2019 (0-14) UK 10
1 2019 (14-50) UK 20
2 2019 (50+) UK 300
3 2020 (0-14) US 400
4 2020 (14-50) US 1000
5 2020 (50+) US 2000
output required:
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2019 young vs child UK-young vs child 2.0 # 20/10
7 2019 old vs young UK-old vs young 15.0 # 300/20
8 2019 unemployed vs working UK-unemployed vs working 15.5 # (300+10)/20
Trials now:
df2 = df.copy()
criteria = [df2['con'].str.contains('0-14'),
df2['con'].str.contains('14-50'),
df2['con'].str.contains('50+')]
#conditions should be according to requirements
values = ['young vs child','old vs young', 'unemployed vs working']
df2['con'] = df2['con']+'_'+np.select(criteria, values, 0)
df2['age group'] = df2['age group']+'_'+np.select(criteria, values, 0)
df.groupby(['year','age group','con']).sum().groupby(level=[0,1]).cumdiv()
pd.concat([df,df2])
#----errors. cumdiv() not found and missing conditions criteria-------
also tried:
df['population'].div(df.groupby('con')['population'].shift(1))
#but looking for customisations into this
#so it can first sum rows and then divide
#according to unemployed condition-- row 8 reference.
CLOSEST TRIAL
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child','old vs young', 'unemployed vs working']
for country in con_list:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:,'population'] = bev_work.loc[:,'population'].max() / bev_child.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country +'-'+new_list[0]
    bev_child.loc[:,'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_child.loc[:,'population'].max() + bev_old.loc[:,'population'].max()/ bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country +'-'+ new_list[2]
    bev_child.loc[:,'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:,'population'] = bev_old.loc[:,'population'].max() / bev_work.loc[:,'population'].max()
    bev_child.loc[:,'con'] = country +'-'+ new_list[1]
    bev_child.loc[:,'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2020 young vs child US-young vs child 2.5
7 2020 unemployed vs working US-unemployed vs working 4.5
8 2020 old vs young US-old vs young 2.0
also
PLEASE find the easiest way to solve it... Please...
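One possible way to get the three ratio rows without a loop is to pivot the age groups into columns, compute the ratios, and melt them back. This is only a sketch: it assumes every (year, con) pair has exactly one row per age group, and it produces the ratio rows for every country, not just the UK rows shown in the required output. The names wide, ratios and long_rows are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020, 2020],
    "age group": ["(0-14)", "(14-50)", "(50+)", "(0-14)", "(14-50)", "(50+)"],
    "con": ["UK", "UK", "UK", "US", "US", "US"],
    "population": [10, 20, 300, 400, 1000, 2000],
})

# One row per (year, con), one column per age group.
wide = df.pivot(index=["year", "con"], columns="age group", values="population")

# The three custom ratios from the requirement.
ratios = pd.DataFrame({
    "young vs child": wide["(14-50)"] / wide["(0-14)"],
    "old vs young": wide["(50+)"] / wide["(14-50)"],
    "unemployed vs working": (wide["(0-14)"] + wide["(50+)"]) / wide["(14-50)"],
})

# Back to the long layout of the original frame.
long_rows = ratios.reset_index().melt(
    id_vars=["year", "con"], var_name="age group", value_name="population"
)
long_rows["con"] = long_rows["con"] + "-" + long_rows["age group"]

# Append the ratio rows below the original rows.
out = pd.concat([df, long_rows[df.columns]], ignore_index=True)
print(out)
```

Writing the third ratio with explicit parentheses avoids the precedence problem in the loop version, where the division is applied before the addition.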
What I'm trying to achieve is to combine the Name values into one comma-delimited value whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input :
pd.DataFrame({'Name': {0: 'John',1: 'Steven',2: 'Ibrahim',3: 'George',4: 'Nancy',5: 'Mo',6: 'Khalil'},
'Country': {0: 'USA',1: 'UK',2: 'UK',3: 'France',4: 'Ireland',5: 'Ireland',6: 'Ireland'},
'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output :
Rows 1 & 2 (in the input) got grouped into one since the Country column is duplicated, and the Salary column is summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried, but I'm not sure how to combine the text in the Name column:
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
Use groupby() and agg():
out=df.groupby('Country',as_index=False).agg({'Name':', '.join,'Salary':'sum'})
If you need only the unique values of the 'Name' column, then use:
out=(df.groupby('Country',as_index=False)
.agg({'Name':lambda x:', '.join(set(x)),'Salary':'sum'}))
Note: use pd.unique() in place of set() if the order of the unique values is important, since set() does not preserve it.
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
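To get the columns in their original order you can add [df.columns] to the pipe; passing sort=False to groupby also keeps the countries in order of first appearance instead of sorting them alphabetically:
df.groupby(['Country'], as_index=False, sort=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]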
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
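Putting the pieces from both answers together, a sketch that keeps the original row and column order and also drops repeated names could look like this. It uses named aggregation instead of a dict, which is a stylistic choice here, not something the answers above require.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Steven", "Ibrahim", "George", "Nancy", "Mo", "Khalil"],
    "Country": ["USA", "UK", "UK", "France", "Ireland", "Ireland", "Ireland"],
    "Salary": [100, 200, 200, 100, 50, 100, 10],
})

out = (
    df.groupby("Country", as_index=False, sort=False)  # sort=False: keep first-seen order
      .agg(
          # pd.unique keeps the order of first appearance, unlike set().
          Name=("Name", lambda s: ", ".join(pd.unique(s))),
          Salary=("Salary", "sum"),
      )[df.columns]  # restore the original column order
)
print(out)
```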
I'm reading the documentation to understand the method filter when used with groupby. In order to understand it, I've got the below scenario:
I'm trying to get the duplicate names grouped by city from my DataFrame df.
Below is my try:
df = pd.DataFrame({
'city':['LA','LA','LA','LA','NY', 'NY'],
'name':['Ana','Pedro','Maria','Maria','Peter','Peter'],
'age':[24, 27, 19, 34, 31, 20],
'sex':['F','M','F','F','M', 'M'] })
df_filtered = df.groupby('city').filter(lambda x: len(x['name']) >= 2)
df_filtered
The output I'm getting is:
city name age sex
LA Ana 24 F
LA Pedro 27 M
LA Maria 19 F
LA Maria 34 F
NY Peter 31 M
NY Peter 20 M
The output I'm expecting is:
city name age sex
LA Maria 19 F
LA Maria 34 F
NY Peter 31 M
NY Peter 20 M
It's not clear to me in which cases I have to use different column names in the "groupby" method and in the "len" inside of the "filter" method
Thank you
How about just duplicated:
df[df.duplicated(['city', 'name'], keep=False)]
You should group by two columns, 'city' and 'name', because filter keeps or drops whole groups:
Yourdf=df.groupby(['city','name']).filter(lambda x : len(x)>=2)
Yourdf
Out[234]:
city name age sex
2 LA Maria 19 F
3 LA Maria 34 F
4 NY Peter 31 M
5 NY Peter 20 M
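To see why the grouping key matters, here is a small comparison on the data from the question. filter() evaluates the function once per group and keeps or drops the whole group, so the columns passed to groupby decide what counts as "a group". The variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["LA", "LA", "LA", "LA", "NY", "NY"],
    "name": ["Ana", "Pedro", "Maria", "Maria", "Peter", "Peter"],
    "age": [24, 27, 19, 34, 31, 20],
    "sex": ["F", "M", "F", "F", "M", "M"],
})

# Grouped by city only: both cities have at least 2 rows, so nothing is dropped.
by_city = df.groupby("city").filter(lambda g: len(g) >= 2)

# Grouped by (city, name): only pairs that occur at least twice survive.
by_pair = df.groupby(["city", "name"]).filter(lambda g: len(g) >= 2)

# Same result without a Python-level lambda.
dup = df[df.duplicated(["city", "name"], keep=False)]

# Another lambda-free variant: broadcast each group's size back to its rows.
sizes = df.groupby(["city", "name"])["name"].transform("size")
via_transform = df[sizes >= 2]

print(by_pair)
```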
I have three dataframes: df_Male, df_Female, df_Transgender
Sample dataframe df_Male
continent avg_count_country avg_age
Asia 55 5
Africa 65 10
Europe 75 8
df_Female
continent avg_count_country avg_age
Asia 50 7
Africa 60 12
Europe 70 0
df_Transgender
continent avg_count_country avg_age
Asia 30 6
Africa 40 11
America 80 10
Now our stacked bar graph should look like this:
The X axis will contain three ticks: Male, Female, Transgender
The Y axis will be Total_count--100
And in each bar, avg_age will be stacked
I was trying this with a pivot table:
pivot_df = df.pivot(index='new_Columns', columns='avg_age ', values='Values')
I'm getting confused about how to plot this. Can anyone please help with how to concatenate the three dataframes into one, so that it creates Male, Female and Transgender columns?
This topic is handled here: https://pandas.pydata.org/pandas-docs/stable/merging.html
(Please note that the third continent in df_Transgender is different from the other dataframes, 'America' instead of 'Europe'; I changed that for the following plot, hoping that this is correct.)
frames = [df_Male, df_Female, df_Transgender]
df = pd.concat(frames, keys=['Male', 'Female', 'Transgender'])
continent avg_count_country avg_age
Male 0 Asia 55 5
1 Africa 65 10
2 Europe 75 8
Female 0 Asia 50 7
1 Africa 60 12
2 Europe 70 0
Transgender 0 Asia 30 6
1 Africa 40 11
2 Europe 80 10
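import matplotlib.pyplot as plt
import numpy as np

btm = np.zeros(3)  # running height of each of the three bars
for name, grp in df.groupby('continent', sort=False):
    plt.bar(grp.index.levels[1], grp.avg_age.values, bottom=btm, tick_label=grp.index.levels[0], label=name)
    btm = btm + grp.avg_age.values  # accumulate so each layer sits on top of all previous ones
plt.legend(ncol = 3)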
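Since, as you commented below, America in the third dataset was no mistake, you can add the missing rows to each dataframe like this before you go on as above (DataFrame.append returned a new frame instead of modifying in place and was removed in pandas 2.0, so pd.concat is used here and the result is assigned back):
df_Male = pd.concat([df_Male, pd.DataFrame([{'avg_age': 0, 'continent': 'America'}])], ignore_index=True)
df_Female = pd.concat([df_Female, pd.DataFrame([{'avg_age': 0, 'continent': 'America'}])], ignore_index=True)
df_Transgender = pd.concat([df_Transgender, pd.DataFrame([{'avg_age': 0, 'continent': 'Europe'}])], ignore_index=True)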
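An alternative sketch that avoids padding the frames by hand is to tag each frame with its gender, concatenate, and pivot with fill_value=0, so a continent missing from one frame simply becomes a zero-height segment. The gender column and the frames dict are names introduced here for illustration; the data is the sample from the question.

```python
import pandas as pd

df_Male = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                        "avg_count_country": [55, 65, 75], "avg_age": [5, 10, 8]})
df_Female = pd.DataFrame({"continent": ["Asia", "Africa", "Europe"],
                          "avg_count_country": [50, 60, 70], "avg_age": [7, 12, 0]})
df_Transgender = pd.DataFrame({"continent": ["Asia", "Africa", "America"],
                               "avg_count_country": [30, 40, 80], "avg_age": [6, 11, 10]})

frames = {"Male": df_Male, "Female": df_Female, "Transgender": df_Transgender}

# Tag every row with the frame it came from, then stack the frames vertically.
combined = pd.concat(
    [frame.assign(gender=name) for name, frame in frames.items()],
    ignore_index=True,
)

# One row per gender, one column per continent; missing combinations become 0.
pivot_df = combined.pivot_table(
    index="gender", columns="continent", values="avg_age",
    aggfunc="sum", fill_value=0,
).reindex(list(frames))  # keep Male, Female, Transgender order on the x axis

print(pivot_df)
```

With matplotlib installed, pivot_df.plot(kind="bar", stacked=True) then draws one bar per gender with the continents stacked inside it.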