In Pandas, why does a TimedeltaProperties object have no attribute 'years'?
After all, the datetime object has this property.
It seems like a very natural thing for an object concerned with time to have, especially since it already has hours, seconds, and similar attributes.
Is there a workaround so that my column, which is full of values like
10060 days,
can be converted to years? Or better yet, just converted to an integer representation for years?
TimedeltaProperties does not have year or month attributes. According to the TimedeltaProperties source code, it is an
Accessor object for datetimelike properties of the Series values.
But months and years have no constant definition.
One month can span a different number of days depending on the month itself: January has 31 days, April has 30 days, and so on.
One month can also span a different number of days depending on the year (in the case of February): if the year is 2004, February has 29 days; if it is 2003, February has 28 days.
The same goes for years: the length depends on which exact year it is. For example, 2003 has 365 days, while 2004 has 366 days.
Hence, a requirement like "convert 10060 days to years" is not well defined: which years?
As stated above, the exact number of years a given number of days corresponds to depends on which actual years those days span.
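If you need calendar-exact years between two concrete dates, relativedelta computes them against the real calendar. A minimal sketch, assuming the dateutil package is available (it ships as a pandas dependency):
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

# Anchoring the 10060-day span to a concrete start date makes
# "how many years" well defined for that particular span.
start = date(1990, 1, 1)
print(relativedelta(start + timedelta(days=10060), start).years)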
If an approximation is acceptable, this workaround gets you closer:
round((df["Accident Date"] - df["Iw Date Of Birth"]).dt.days / 365, 1)
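A leap-aware variant of the same idea (a sketch reusing the column names from the snippet above) divides by the average year length instead of 365:
# 365.25 absorbs leap days over multi-decade spans such as ages.
round((df["Accident Date"] - df["Iw Date Of Birth"]).dt.days / 365.25, 1)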
I used .astype('timedelta64[Y]') and then .astype('int') to get integer years:
df['age'] = (pd.Timestamp('now') - df.birthDate).astype('timedelta64[Y]').astype('int')
Output:
nflId height weight birthDate collegeName position displayName age
0 2539334 72 190 1990-09-10 Washington CB Desmond Trufant 30
1 2539653 70 186 1988-11-01 Southeastern Louisiana CB Robert Alford 31
2 2543850 69 186 1991-12-18 Purdue SS Ricardo Allen 28
3 2555162 73 227 1994-11-04 Louisiana State MLB Deion Jones 25
4 2555255 75 232 1993-07-01 Minnesota OLB DeVondre Campbell 27
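Note that casting to 'timedelta64[Y]' is rejected by pandas 2.x, where only the s/ms/us/ns resolutions are supported, so on recent versions an explicit calendar computation is safer. A minimal sketch, assuming the same birthDate column:
import pandas as pd

now = pd.Timestamp('now')
# Exact integer age: subtract one year if the birthday hasn't occurred yet.
df['age'] = df['birthDate'].apply(
    lambda b: now.year - b.year - ((now.month, now.day) < (b.month, b.day)))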
I have a data frame like so. I am trying to make a plot with the mean of 'number' for each year on the y-axis and the year on the x-axis. I think what I need to do is make a new data frame with two columns, 'year' and 'avg number', with one row per year. How would I go about doing that?
year number
0 2010 40
1 2010 44
2 2011 33
3 2011 32
4 2012 34
5 2012 56
When opening a question about pandas, please make sure you follow these guidelines: How to make good reproducible pandas examples. It will help us reproduce your environment.
Assuming your dataframe is stored in the df variable:
df.groupby('year').mean().plot()
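A slightly more explicit sketch (assuming matplotlib is available for display): selecting the 'number' column keeps the result a single Series and avoids aggregating any other columns the real frame might have:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [2010, 2010, 2011, 2011, 2012, 2012],
                   'number': [40, 44, 33, 32, 34, 56]})

# One mean per year, plotted with year on the x-axis.
ax = df.groupby('year')['number'].mean().plot()
ax.set_xlabel('year')
ax.set_ylabel('avg number')
plt.show()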
I have a dataset of deaths per week, where week 1 is the week containing the first Thursday of the year. If days of that week fall in the previous year, those are counted as week 53. If there are extra days in January before week 1, those are week 0.
This gives very inconsistent graphs around New Year's, because the week lengths vary there.
I want week 53 merged with week 1, and week 0 merged with week 53, to get a dataset with no partial weeks.
The dataset looks like this:
Period Deaths
0 1995 week 0 (1 dag) 394.0
106 1996 week 53 (2 dagen) 858.0
108 1997 week 1 (5 dagen) 2268.0
160 1997 week 53 (3 dagen) 1124.0
162 1998 week 1 (4 dagen) 1551.0
214 1998 week 53 (4 dagen) 1732.0
216 1999 week 0 (3 dagen) 1250.0
268 1999 week 52 (5 dagen) 2306.0
270 2000 week 0 (2 dagen) 956.0
Is there a good way to solve this?
I can't even wrap my head around it properly, since sometimes there is also a week 53 without extra days.
I have tried various pandas solutions, but none of them work; not technically, but in principle.
Thanks
PS. I've tried this, but it doesn't do exactly what I want:
df['grp'] = (df.Period != df.Period.shift()).cumsum()
out = (df.groupby(['grp', 'Period'])['Deaths']
         .apply(lambda x: ",".join(x))
         .reset_index()[['Period', 'Deaths']])
So I ended up with this solution if anyone needs it:
# A week is partial when its Period mentions a day count ("dag"/"dagen").
df['partial_week'] = df.Period.str.contains('dag')
# Partial week 1: absorb the previous row (last year's trailing partial week).
df['to_first_week'] = df['partial_week'] & df.Period.str.contains('week 1')
# Row followed by a week 0: absorb that following partial week 0.
df['to_last_week'] = df['partial_week'] & df.Period.shift(-1).str.contains('week 0')
df.loc[df['to_first_week'] == True, 'death_fin'] = df['Deaths'].shift(+1) + df['Deaths']
df.loc[df['to_last_week'] == True, 'death_fin'] = df['Deaths'].shift(-1) + df['Deaths']
# Complete weeks keep their own count; absorbed partial weeks stay NaN
# and are dropped below.
df.loc[df['partial_week'] == False, 'death_fin'] = df['Deaths']
df = df.dropna(subset=["death_fin"]).reset_index(drop=True)
I hope it is clear: if week 1 is incomplete, the incomplete week from the previous year is added to week 1; if week 0 is incomplete, it gets added to the last week of the previous year.
I am working with NLSY79 data and I am trying to construct a 'smoothed' income variable that averages over a period of 4 years. Between 1979 and 1994, the NLSY conducted surveys annually, while after 1996 the survey was conducted biennially. This means that my smoothed income variable will average four observations prior to 1994 and only two after 1996.
I would like my smoothed income variable to satisfy the following criteria:
1) It should be an average of 4 income observations from 1979 to 1994 and only 2 from 1996 onward
2) The window should START from a given observation rather than be centered at it. Therefore, my smoothed income variable should tell me the average income over the four years starting from that date
3) It should ignore NaNs
It should, therefore, look like the following (note that I only computed values for 'smoothed income' that could be computed with the data I have provided.)
id year income 'smoothed income'
1 1979 20,000 21,250
1 1980 22,000
1 1981 21,000
1 1982 22,000
...
1 2014 34,000 34,500
1 2016 35,000
2 1979 28,000 28,333
2 1980 NaN
2 1981 28,000
2 1982 29,000
I am relatively new to dataframe manipulation with pandas, so here is what I have tried:
smooth = DATA.groupby('id')['income'].rolling(window=4, min_periods=1).mean()
DATA['smoothIncome'] = smooth.reset_index(level=0, drop=True)
This code handles the NaNs, but it does not accomplish objectives 1) and 2): the window always spans 4 observations, and it trails the current observation rather than starting from it.
Any help would be much appreciated.
Ok, I've modified the code provided by ansev to make it work; filling in the NaNs was causing the problems.
Here's the modified code:
df.set_index('year').groupby('id').income.apply(
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))
               .rolling(4, min_periods=1).mean()
               .shift(-3)
).reset_index()
The only problem I have now is that the mean is not calculated when there are fewer than 4 years remaining (e.g. from 2014 onward, because my data only goes until 2016). Is there a way of shortening the window length after 2014?
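One way to get a forward-looking window that shrinks near the end of the data (a sketch under the same reindexing approach as above, not tested against the full NLSY data): reverse each group, take a trailing rolling mean with min_periods=1, and reverse back:
df.set_index('year').groupby('id').income.apply(
    # Reversing turns the trailing window into a leading one; the trailing
    # mean naturally shrinks at the start of the reversed series, i.e. at
    # the end of the original, so 2014 averages 2014-2016 (NaNs ignored).
    lambda x: x.reindex(range(x.index.min(), x.index.max() + 1))[::-1]
               .rolling(4, min_periods=1).mean()[::-1]
).reset_index()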
I have a dataframe for which I'm looking at histograms of subsets of the data using the column and by arguments of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). When I loop over the axes, I'm trying to determine how to programmatically pull out two pieces of data: the value of 'activity_month' for each subplot, and the number of records with that value:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have a given value of activity_month:
a = "January"
len(df[df["activity_month"] == a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
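For instance, the per-month row counts that the loop was meant to recover are available in one call (reusing the example df above):
# One row per month, with the number of rows for each month.
df['activity_month'].value_counts()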
My grouped data looks like:
deviceid time
01691cbb94f16f737e4c83eca8e5f5e5390c2801 January 10
022009f075929be71975ce70db19cd47780b112f April 566
August 210
January 4
July 578
June 1048
May 1483
02bad1cdf92fbaa9327a65babc1c081e59fbf435 November 309
October 54
Where the last column represents the count. I obtained this grouped representation using the expression:
data1.groupby(['deviceid', 'time'])
How do I get the average for each device id, i.e., the sum of the counts of all months divided by the number of months? My output should look like:
deviceid mean
01691cbb94f16f737e4c83eca8e5f5e5390c2801 10
022009f075929be71975ce70db19cd47780b112f 777.8
02bad1cdf92fbaa9327a65babc1c081e59fbf435 181.5
You can specify the level in the mean method:
s.mean(level=0) # or: s.mean(level='deviceid')
This is equivalent to grouping by the first level of the index and taking the mean of each group: s.groupby(level=0).mean()
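Note that the level keyword of Series.mean was deprecated and later removed (in pandas 2.0), so the groupby form is the portable spelling. A self-contained sketch, with the device ids shortened for brevity:
import pandas as pd

# A small stand-in for the grouped Series (MultiIndex: deviceid, time).
s = pd.Series(
    [10, 566, 210, 4, 578, 1048, 1483, 309, 54],
    index=pd.MultiIndex.from_tuples(
        [('01691cbb', 'January'),
         ('022009f0', 'April'), ('022009f0', 'August'),
         ('022009f0', 'January'), ('022009f0', 'July'),
         ('022009f0', 'June'), ('022009f0', 'May'),
         ('02bad1cd', 'November'), ('02bad1cd', 'October')],
        names=['deviceid', 'time']))

# Group by the first index level and average the monthly counts.
print(s.groupby(level='deviceid').mean())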