Age range to age numerical value (Python)

I want to transform an age range into a numerical age value. I used a def Age(x) function with if statements to do the transformation, but it doesn't work and gives the wrong result.
I attached the images of the step that I did and the result.
The dataset that I used is BlackFriday.
Please help me to clarify the mistakes.
Thank you!

Given what is shown from the result of value_counts, it seems like a simple str.extract with a fillna for ages of 55+ will do:
df.Age.str.extract(r'(?<=-)(\d+)').fillna(56)
Let's consider the following example:
df = pd.DataFrame({'Age':['26-35','36-45', '55+']})
Age
0 26-35
1 36-45
2 55+
df.Age.str.extract(r'(?<=-)(\d+)').fillna(56).rename(columns={0:'Age'})
Age
0 35
1 45
2 56
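Note that str.extract returns strings, and fillna(56) would leave a mixed-type column; if you want actual numbers, a cast is needed. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Age': ['26-35', '36-45', '55+']})

# extract the upper bound after the dash; '55+' has no dash, so fill it with 56
out = df.Age.str.extract(r'(?<=-)(\d+)')[0].fillna('56').astype(int)
print(out.tolist())  # [35, 45, 56]
```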

A simple function to modify age_range to the range's mean:
Here are the age ranges we have:
temp_df['age_range'].unique()
array([70, '18-25', '26-35', '36-45', '46-55', '56-70'], dtype=object)
Function to modify age
def mod_age(df):
    for i in range(df.shape[0]):
        if df.loc[i, 'age_range'] == 70:
            df.loc[i, 'age_range'] = 70  # already numeric, leave unchanged
        elif df.loc[i, 'age_range'] == '18-25':
            df.loc[i, 'age_range'] = (18 + 25) // 2
        elif df.loc[i, 'age_range'] == '26-35':
            df.loc[i, 'age_range'] = (26 + 35) // 2
        elif df.loc[i, 'age_range'] == '36-45':
            df.loc[i, 'age_range'] = (36 + 45) // 2
        elif df.loc[i, 'age_range'] == '46-55':
            df.loc[i, 'age_range'] = (46 + 55) // 2
        elif df.loc[i, 'age_range'] == '56-70':
            df.loc[i, 'age_range'] = (56 + 70) // 2
    age_range  family_size  marital_status  sum
2          70            2          Single    4
25         40            4          Single    2
5          21            2         Married    4
32         50            3          Single    3
13         30            2          Single    5
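A loop-free sketch of the same mapping using Series.map, assuming the range labels reported by unique() above (a small hypothetical sample is used here; values that are already numeric, like 70, fall through the map and are restored via fillna):

```python
import pandas as pd

# midpoints for each string label; 70 is intentionally not a key
age_map = {'18-25': (18 + 25) // 2, '26-35': (26 + 35) // 2,
           '36-45': (36 + 45) // 2, '46-55': (46 + 55) // 2,
           '56-70': (56 + 70) // 2}

df = pd.DataFrame({'age_range': [70, '18-25', '56-70']})

# map labels to midpoints; unmapped values become NaN and are filled
# back in from the original column
df['age_range'] = (df['age_range'].map(age_map)
                                  .fillna(df['age_range'])
                                  .astype(int))
print(df['age_range'].tolist())  # [70, 21, 63]
```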

Related

How to combine the first 2 columns in pandas/python with NaN values

I have a question about combining the first 2 columns in pandas/python when one of them contains NaN values.
Long story short: I need to read an Excel file and make these changes in Python; I cannot change anything in the Excel file itself.
Here is the Excel input,
and the expected output will be:
I managed to read it in, but when I try to combine the first 2 columns I have some problems: in Excel the first column's cells are merged, so once it is read in, only one row has a value and the rest of the rows are all NaN,
such as below:
Year    number   2016
Month            Jan
Month            2016-01
Grade   1        100
NaN     2        99
NaN     3        98
NaN     4        96
NaN     5        92
NaN     Total    485
Is there any function that can easily help me to combine the first two columns and make it as below:
Year           2016
Month          Jan
Month          2016-01
Grade 1        100
Grade 2        99
Grade 3        98
Grade 4        96
Grade 5        92
Grade Total    485
Anything will be really appreciated.
I searched and google the key word for so long but did not find any answer that fits my situation here.
from io import StringIO
import pandas as pd

d = '''
Year,number,2016
Month,,Jan
Month,,2016-01
Grade,1,100
NaN,2,99
NaN,3,98
NaN,4,96
NaN,5,92
NaN,Total,485
'''
df = pd.read_csv(StringIO(d))
df
df['Year'] = df['Year'].ffill()  # forward-fill the merged first column
df = df.fillna('')  # skip this step if your data from excel does not have NaN in col 2
df['Year'] = df['Year'] + ' ' + df['number'].astype(str)
df = df.drop('number', axis=1)
df

How to group a df in Python by one column and take the difference between the max of one column and the min of another?

I have a data frame which looks like this:
student_id  session_id  reading_level_id  st_week  end_week
         1        3334                 3        3         3
         1        3335                 2        4         4
         2        3335                 2        2         2
         2        3336                 2        2         3
         2        3337                 2        3         3
         2        3339                 2        3         4
...
There are multiple session_ids, st_weeks and end_weeks for every student_id. I'm trying to group the data by student_id, and I want to calculate the difference between the maximum end_week and the minimum st_week for each student.
Aiming for an output that would look something like this:
Student_id  Diff
         1     1
         2     2
...
I am relatively new to Python as well as Stack Overflow and have been trying to find an appropriate solution - any help is appreciated.
Using the data you shared, a simpler solution is possible:
Group by student_id, and pass False to the as_index parameter (this works on a dataframe and returns a dataframe);
Next, use a named aggregation to get the max of end_week and the min of st_week for each group;
Get the difference between max_wk and min_wk;
Finally, keep only the required columns.
(
    df.groupby("student_id", as_index=False)
    .agg(max_wk=("end_week", "max"), min_wk=("st_week", "min"))
    .assign(Diff=lambda x: x["max_wk"] - x["min_wk"])
    .loc[:, ["student_id", "Diff"]]
)
student_id Diff
0 1 1
1 2 2
There's probably a more efficient way to do this, but I broke this into separate steps for the grouping to get max and min values for each id, and then created a new column representing the difference. I used numpy's randint() function in this example because I didn't have access to a sample dataframe.
import pandas as pd
import numpy as np
# generate dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(1200, 4)), columns=['student_id', 'session_id', 'st_week', 'end_week'])
# use groupby to get max and min for each student_id
max_vals = df.groupby(['student_id'], sort=False)['end_week'].max().to_frame()
min_vals = df.groupby(['student_id'], sort=False)['st_week'].min().to_frame()
# use join to put max and min back together in one dataframe
merged = min_vals.join(max_vals)
# use assign() to calculate difference as new column
merged = merged.assign(difference=lambda x: x.end_week - x.st_week).reset_index()
merged
student_id st_week end_week difference
0 40 2 99 97
1 23 5 74 69
2 78 9 93 84
3 11 1 97 96
4 97 24 88 64
... ... ... ... ...
95 54 0 96 96
96 18 0 99 99
97 8 18 97 79
98 75 21 97 76
99 33 14 93 79
You can create a custom function and apply it to a group-by over students:
def week_diff(g):
    return g.end_week.max() - g.st_week.min()
df.groupby("student_id").apply(week_diff)
Result:
student_id
1 1
2 2
dtype: int64
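Using the sample rows from the question, the two aggregations can also be computed directly and subtracted, since both grouped results share the same student_id index; a minimal sketch:

```python
import pandas as pd

# the six rows shown in the question
df = pd.DataFrame({'student_id': [1, 1, 2, 2, 2, 2],
                   'st_week':    [3, 4, 2, 2, 3, 3],
                   'end_week':   [3, 4, 2, 3, 3, 4]})

grouped = df.groupby('student_id')
# per-student max end_week minus per-student min st_week
diff = grouped['end_week'].max() - grouped['st_week'].min()
print(diff.tolist())  # [1, 2]
```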

Python pandas: calculating percentages in a DataFrame and turning them into a list

I have a problem calculating percent within a dataframe.
I have the following dataframe called dfGender:
age gender impressions
0 13-17 female 234561
1 13-17 male 34574
2 25-34 female 120665
3 25-34 male 234560
4 35-44 female 5134
5 35-44 male 2405
6 45-54 female 423
7 45-54 male 324
Now I would like to get a list of the total percentages for all female and male impressions, like this: [female%, male%].
My idea is to pivot_table with the following code:
df_genderSum = dfGender.pivot_table(columns='gender', values='impressions', aggfunc='sum')
Then calculating the total of them all:
df_genderSum['total'] = df_genderSum.sum(axis=1)
Then after this making the percent calculations through:
df_genderSum['female%'] = (df_genderSum['female']/df_genderSum['total'])*100
df_genderSum['male%'] = (df_genderSum['male']/df_genderSum['total'])*100
Now this gives me the desired, correct calculations, although I think it's really messy code.
I have 2 questions:
1: Is there a simpler way to do this, where you get a dataframe only existing of:
gender female% male%
impressions "number" "number"
2: How do I make it into a list? I was thinking of the following code:
list = df_genderSum.reset_index().values.tolist()
Any help is appreciated!
You can try:
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
gender
female 57.0276
male 42.9724
and
df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100)).to_list()
[57.02762682448004, 42.972373175519957]
If you want the exact dataframe that you asked for, save the above as "s" and do the following:
s=df.groupby('gender')['impressions'].apply(lambda x : (sum(x)/sum(df['impressions'])*100))
pd.DataFrame(s).T
gender female male
impressions 57.027627 42.972373
Here you go:
df_agg = df.drop(['age'], axis=1).groupby('gender').sum()
print(df_agg['impressions']/df_agg['impressions'].sum()*100)
Prints (can be different based on your data):
F 71.428571
M 28.571429
Name: impressions, dtype: float64
df_genderSum = df_gender.groupby('gender')['impressions'].sum() # same result as pivot_table
df_genderSum /= df_genderSum.sum() # percentage, inplace
# now it is a series, reshape as needed
df_genderSum = df_genderSum.to_frame().T
You can try this one:
(df.groupby('gender').sum()['impressions']/df['impressions'].sum()).to_frame(name = 'impressions').T
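Putting the pieces together, a self-contained sketch that reproduces the [female%, male%] list from the dfGender shown in the question (groupby sorts group labels alphabetically, so female comes first):

```python
import pandas as pd

# dfGender reconstructed from the rows in the question
dfGender = pd.DataFrame({
    'age': ['13-17', '13-17', '25-34', '25-34',
            '35-44', '35-44', '45-54', '45-54'],
    'gender': ['female', 'male'] * 4,
    'impressions': [234561, 34574, 120665, 234560, 5134, 2405, 423, 324],
})

# total impressions per gender, then normalize to percent
pct = dfGender.groupby('gender')['impressions'].sum()
pct = pct / pct.sum() * 100
print(pct.tolist())  # [female%, male%] -> roughly [57.03, 42.97]
```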

Transform a Column conditional on another Column

I have a dataframe that looks like:
Age Age Type
12 Years
5 Days
13 Hours
20 Months
... ......
I want to have my Age column in years, so depending on Age Type (Days, Hours, or Months) I will have to perform a scalar operation. I tried to implement a for loop, but I'm not sure if I'm going about it the right way. Thanks!
Create a mapping dict and multiply:
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}
df.Age * df['Age Type'].map(d)
Out[373]:
0    12.000000
1     0.013699
2     0.001484
3     1.666667
dtype: float64
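A self-contained sketch of the map-dict approach. Note that the column name contains a space, so bracket indexing is used rather than attribute access, and the hours factor assumes a 365-day year:

```python
import pandas as pd

df = pd.DataFrame({'Age': [12, 5, 13, 20],
                   'Age Type': ['Years', 'Days', 'Hours', 'Months']})

# conversion factors from each unit to years (365-day year assumed)
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}

age_in_years = df['Age'] * df['Age Type'].map(d)
print(age_in_years.round(6).tolist())
```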

Is there an "ungroup by" operation opposite to .groupby in pandas?

Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples, I don't care for the information I am losing after averaging in this specific example)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string in to pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
It turns out that df.groupby() returns an object with the original data stored in its .obj attribute. So ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                       'age': [1, 36, 32, 26, 30],
                       'family': [1, 1, 1, 2, 2]})
df.index.name = 'indexer'
print(df)

print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)

print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one way is DataFrame.groupby(...).filter(lambda x: True), which gets you back the original DataFrame.
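A quick sketch of this filter trick with the question's dataframe: a filter that keeps every group hands back all rows, with the original order and index intact.

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                   'age': [1, 36, 32, 26, 30],
                   'family': [1, 1, 1, 2, 2]})

# a predicate that is True for every group keeps all rows
restored = df.groupby('family').filter(lambda g: True)
print(restored.equals(df))  # True
```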
