I have a dataframe that looks like:
Age Age Type
12 Years
5 Days
13 Hours
20 Months
... ......
I want my Age column expressed in years, so when Age Type is Days, Hours, or Months I need to scale the value accordingly. I tried to implement a for loop, but I'm not sure I'm going about it the right way. Thanks!
Create a mapping dict, then multiply:
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}
df['Age'] * df['Age Type'].map(d)
Out[373]:
0 12.000000
1 0.013699
2 0.001484
3 1.666667
dtype: float64
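For completeness, a minimal runnable sketch of this approach, assuming the sample data above (column names Age and Age Type):
import pandas as pd
df = pd.DataFrame({'Age': [12, 5, 13, 20],
                   'Age Type': ['Years', 'Days', 'Hours', 'Months']})
# conversion factor from one unit of each Age Type into years
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}
# look up each row's factor and scale its Age
df['Age'] = df['Age'] * df['Age Type'].map(d)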
I'm trying to get the total number of books that each author wrote and put it in a column called Book Number in my dataframe, which has 15 other columns.
I checked online and people use groupby with count(); however, it doesn't create the column that I want: it only gives an unnamed column of numbers that I can't put back together with the original dataframe.
author_count_df = (df_author["Name"]).groupby(df_author["Name"]).count()
print(author_count_df)
Result:
Name
A D 3
A Gill 4
A GOO 3
ALL SHOT 10
AMIT PATEL 5
..
vishal raina 7
walt walter 6
waqas alhafidh 3
yogesh koshal 8
zainab m.jawad 9
Name: Name, Length: 696, dtype: int64
Expected: A dataframe with
Name other 14 columns from author_df Book Number
A D ... 3
A Gill ... 4
A GOO ... 3
ALL SHOT ... 10
AMIT PATEL ... 5
... ..
vishal raina ... 7
walt walter ... 6
waqas alhafidh ... 3
yogesh koshal ... 8
zainab m.jawad ... 9
Use transform with the groupby and assign it back:
df_author['Book Number'] = df_author.groupby("Name")['Name'].transform('count')
For a new df, use:
author_count_df = df_author.assign(BookNum=df_author.groupby("Name")['Name']
.transform('count'))
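A minimal runnable sketch of the transform idea, using a hypothetical two-column frame:
import pandas as pd
df_author = pd.DataFrame({'Name': ['A D', 'A D', 'A Gill', 'A D', 'A Gill'],
                          'Title': ['t1', 't2', 't3', 't4', 't5']})
# transform('count') returns one value per row (that row's group size),
# so the result aligns with the original index and can be assigned directly
df_author['Book Number'] = df_author.groupby('Name')['Name'].transform('count')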
Use reset_index()
author_count_df = df_author.groupby("Name")["Name"].count().rename("Book Number").reset_index()
Renaming the counts before reset_index() avoids a column-name collision with the Name group key; reset_index() then turns the grouped result back into a regular DataFrame, moving the group keys out of the index and into a column.
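To then put the counts next to the other 14 columns, a left merge on Name does it (a sketch, reusing the frames above):
# every book row keeps its author's total
df_author = df_author.merge(author_count_df, on='Name', how='left')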
You've done the job almost right; you just need to assign the values you got back as a new column, which you can do quite elegantly with the DataFrame.assign method.
Straight from the Docs:
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
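For instance, a toy sketch of assign (the column names here are illustrative, not from the question):
import pandas as pd
df = pd.DataFrame({'temp_c': [17.0, 25.0]})
# assign returns a new frame; df itself is left unchanged
df_f = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)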
I want to transform an age range into a numerical age value. I used a def Age(x) function with if statements to do the transformation, but it doesn't work and gives the wrong result.
I attached images of the steps I took and the result.
The dataset I used is BlackFriday.
Please help me find my mistakes.
Thank you!
Given what is shown in the result of value_counts, it seems like a simple str.extract with a fillna for the 55+ ages will do:
df.Age.str.extract(r'(?<=-)(\d+)').fillna(56)
Let's consider the following example:
df = pd.DataFrame({'Age':['26-35','36-45', '55+']})
Age
0 26-35
1 36-45
2 55+
df.Age.str.extract(r'(?<=-)(\d+)').fillna(56).rename(columns={0:'Age'})
Age
0 35
1 45
2 56
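Note that str.extract yields strings ('35', '45') while fillna(56) mixes in a number, so if you need a numeric column, cast explicitly afterwards (a sketch, assuming the example frame above):
# object column of digit strings and 56 -> proper integers
df.Age.str.extract(r'(?<=-)(\d+)').fillna(56).astype(int)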
A simple function to replace each age_range with the mean of its endpoints:
Here are the age ranges we have:
temp_df['age_range'].unique()
array([70, '18-25', '26-35', '36-45', '46-55', '56-70'], dtype=object)
Function to modify age:
def mod_age(df):
    # replace each age-range string with the integer midpoint of the range
    for i in range(df.shape[0]):
        if df.loc[i, 'age_range'] == 70:            # already numeric, leave as-is
            df.loc[i, 'age_range'] = 70
        elif df.loc[i, 'age_range'] == '18-25':
            df.loc[i, 'age_range'] = (18 + 25) // 2
        elif df.loc[i, 'age_range'] == '26-35':
            df.loc[i, 'age_range'] = (26 + 35) // 2
        elif df.loc[i, 'age_range'] == '36-45':
            df.loc[i, 'age_range'] = (36 + 45) // 2
        elif df.loc[i, 'age_range'] == '46-55':
            df.loc[i, 'age_range'] = (46 + 55) // 2
        elif df.loc[i, 'age_range'] == '56-70':
            df.loc[i, 'age_range'] = (56 + 70) // 2
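Applying it in place then yields rows like the following (the other columns come from the OP's data):
mod_age(temp_df)  # mutates the age_range column in place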
age_range family_size marital_status sum
2 70 2 Single 4
25 40 4 Single 2
5 21 2 Married 4
32 50 3 Single 3
13 30 2 Single 5
I have a big DataFrame (df) which looks like:
Acc_num date_diff
0 29 0:04:43
1 29 0:01:43
2 29 2:22:45
3 29 0:16:21
4 29 0:58:20
5 30 0:00:35
6 34 7:15:26
7 34 4:40:01
8 34 0:56:02
9 34 6:53:44
10 34 1:36:58
.....
Acc_num int64
date_diff timedelta64[ns]
dtype: object
I need to calculate 'date_diff' mean (in timedelta format) for each account number.
df.date_diff.mean() works correctly. But when I try
df.groupby('Acc_num').date_diff.mean(), it raises an exception:
"DataError: No numeric types to aggregate"
I also tried the df.pivot_table() method, but didn't achieve anything.
Could someone help me with this? Thank you in advance!
Weird limitation indeed. But a simple solution would be:
df.groupby('Acc_num').date_diff.agg(lambda g: g.sum() / g.count())
Edit:
Pandas will actually attempt to aggregate non-numeric columns if you pass numeric_only=False
df.groupby('Acc_num').date_diff.mean(numeric_only=False)
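A self-contained sketch of both variants, with made-up timedeltas:
import pandas as pd
df = pd.DataFrame({'Acc_num': [29, 29, 30],
                   'date_diff': pd.to_timedelta(['0:04:43', '0:01:43', '0:00:35'])})
# workaround: sum/count stays in timedelta space on any version
df.groupby('Acc_num').date_diff.agg(lambda g: g.sum() / g.count())
# direct route, on versions where mean() accepts numeric_only
df.groupby('Acc_num').date_diff.mean(numeric_only=False)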
I'd like to change the value associated with the first day in every month for a pandas.Series I have. For example, given something like this:
Date
1984-01-03 0.992701
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 1.009894
1984-02-02 0.996608
1984-02-03 0.996595
...
I'd like to change the values associated with 1984-01-03, 1984-02-01 and so on. I've racked my brain for hours on this one and have looked around Stack Overflow a fair bit. Some solutions have come close. For example, using:
[In]: series.groupby((m_ret.index.year, m_ret.index.month)).first()
[Out]:
Date Date
1984 1 0.992701
2 1.009894
3 1.005963
4 0.997899
5 1.000342
6 0.995429
7 0.994620
8 1.019377
9 0.993209
10 1.000992
11 1.009786
12 0.999069
1985 1 0.981220
2 1.011928
3 0.993042
4 1.015153
...
Is almost there, but I'm struggling to proceed further.
What I'd like to do is set the value associated with the first day present in each month of every year to 1.
series[m_ret.index.is_month_start] = 1 comes close, but the problem is that is_month_start only selects rows where the day of the month is 1. As you can see, that isn't always the case in my series; for example, the first day present in January 1984 is 1984-01-03.
series.groupby(pd.TimeGrouper('BM')).nth(0) doesn't appear to return the first day either; instead I get the last day:
Date
1984-01-31 0.992701
1984-02-29 1.009894
1984-03-30 1.005963
1984-04-30 0.997899
1984-05-31 1.000342
1984-06-29 0.995429
1984-07-31 0.994620
1984-08-31 1.019377
...
I'm completely stumped. Your help is as always, greatly appreciated! Thank you.
One way would be to use your .groupby((m_ret.index.year, m_ret.index.month)) idea, but apply idxmin to the index itself converted into a Series:
In [74]: s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
Out[74]:
Date Date
1984 1 1984-01-03
2 1984-02-01
Name: Date, dtype: datetime64[ns]
In [75]: start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
In [76]: s.loc[start] = 999
In [77]: s
Out[77]:
Date
1984-01-03 999.000000
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 999.000000
1984-02-02 0.996608
1984-02-03 0.996595
dtype: float64
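Here 999 is just a visible placeholder; for the goal stated above, assign 1 instead:
s.loc[start] = 1  # first observed day of each month set to 1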
Consider the DataFrame data:
one two three four
Ohio 2013-01-01 1 2 3
Colorado 2014-01-05 5 6 7
Utah 2015-05-06 9 10 11
New York 2016-10-11 13 14 15
I'd like to extract the row using only the criterion that the year is a given year, e.g., something like data['one'][:][0:4] == '2013'. But the command data['one'][:][0:4] returns
Ohio 2013-01-01
Colorado 2014-01-05
Utah 2015-05-06
New York 2016-10-11
Name: one, dtype: object
I thought this was the right thing to do because the command data['one'][0][0:4] returns
'2013'
Why the difference, and what's the correct way to do this?
Since column 'one' consists of dates, it'd be best to have pandas recognize it as such, instead of recognizing it as strings. You can use pd.to_datetime to do this:
df['one'] = pd.to_datetime(df['one'])
This allows you to filter on date properties without needing to worry about slicing strings. For example, you can check for year using Series.dt.year:
df['one'].dt.year == 2013
Combining this with loc allows you to get all rows where the year is 2013:
df.loc[df['one'].dt.year == 2013, :]
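Putting it together on the sample data (a sketch; the index labels are taken from the question):
import pandas as pd
data = pd.DataFrame({'one': ['2013-01-01', '2014-01-05', '2015-05-06', '2016-10-11'],
                     'two': [1, 5, 9, 13],
                     'three': [2, 6, 10, 14],
                     'four': [3, 7, 11, 15]},
                    index=['Ohio', 'Colorado', 'Utah', 'New York'])
data['one'] = pd.to_datetime(data['one'])       # parse the strings into datetime64
data.loc[data['one'].dt.year == 2013, :]        # rows whose year is 2013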
The condition you are looking for is
df['one'].str[0:4] == "2013"
Basically, you need to tell pandas to operate on the column's values as strings, which is what the .str accessor does.
The way you have it written, df['one'][:] says "give me the column called 'one', then give me all of its rows". Slicing that Series with [0:4] therefore selects the first four rows, not the first four characters of each value; data['one'][0][0:4] only works because it first pulls out a single string and then slices it.
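Used as a boolean mask (this assumes 'one' still holds the original strings rather than parsed datetimes):
data[data['one'].str[0:4] == '2013']  # .str applies the slice to each element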
query works out well too, once 'one' is a datetime column (the .dt accessor may require the python engine):
In [13]: df.query('one.dt.year == 2013', engine='python')
Out[13]:
one two three four
Ohio 2013-01-01 1 2 3