I have a big DataFrame (df) which looks like this:
Acc_num date_diff
0 29 0:04:43
1 29 0:01:43
2 29 2:22:45
3 29 0:16:21
4 29 0:58:20
5 30 0:00:35
6 34 7:15:26
7 34 4:40:01
8 34 0:56:02
9 34 6:53:44
10 34 1:36:58
.....
Acc_num int64
date_diff timedelta64[ns]
dtype: object
I need to calculate the 'date_diff' mean (in timedelta format) for each account number.
df.date_diff.mean() works correctly. But when I try the following:
df.groupby('Acc_num').date_diff.mean()
it raises an exception:
"DataError: No numeric types to aggregate"
I also tried the df.pivot_table() method, but didn't achieve anything.
Could someone help me with this? Thank you in advance!
Weird limitation indeed. But a simple solution would be:
df.groupby('Acc_num').date_diff.agg(lambda g: g.sum() / g.count())
Edit:
Pandas will actually attempt to aggregate non-numeric columns if you pass numeric_only=False:
df.groupby('Acc_num').date_diff.mean(numeric_only=False)
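For completeness, a minimal, self-contained sketch of both approaches on toy data (values taken from the question):

import pandas as pd

df = pd.DataFrame({
    'Acc_num': [29, 29, 30],
    'date_diff': pd.to_timedelta(['0:04:43', '0:01:43', '0:00:35']),
})

# Manual aggregation sidesteps the numeric-only restriction
print(df.groupby('Acc_num').date_diff.agg(lambda g: g.sum() / g.count()))

# Equivalent, on pandas versions where mean() accepts numeric_only
print(df.groupby('Acc_num').date_diff.mean(numeric_only=False))

Both calls return a timedelta64[ns] Series indexed by Acc_num.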
My df looks like this:
no_1 no_2 no_3
2022-10-12 4 5 53
2022-10-13 48 4 34
2022-10-14 0 43 93
2022-10-15 0 3 43
...
2022-10-22 8 34 4
I'm simply trying to add a new column whose value is the difference of two other columns, which should be easy, but for some reason it keeps failing.
I have the following:
no_data['no_4'] = no_data['no_3'] - no_data['no_1']
but I keep getting the error:
TypeError: only integer scalar arrays can be converted to a scalar index
I'm afraid my troubleshooting hasn't helped me with this, so any help is much appreciated!
Thanks
My assumption: this is most likely a MultiIndex issue. Try reassigning the column names, and let me know if it works:
no_data.columns=['no_1', 'no_2', 'no_3']
no_data['no_4'] = no_data['no_3'] - no_data['no_1']
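For context, a small sketch of how MultiIndex columns can sneak in (toy data and an .agg() call assumed, since the question doesn't show how no_data was built), and how reassigning plain names clears the problem up:

import pandas as pd

raw = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'no_1': [4, 48, 0],
                    'no_3': [53, 34, 93]})

# An aggregation like this leaves MultiIndex columns behind
no_data = raw.groupby('key').agg({'no_1': ['sum'], 'no_3': ['sum']})
print(no_data.columns)  # MultiIndex([('no_1', 'sum'), ('no_3', 'sum')])

# Flatten to plain names and the arithmetic behaves normally
no_data.columns = ['no_1', 'no_3']
no_data['no_4'] = no_data['no_3'] - no_data['no_1']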
As part of my ongoing quest to get my head around pandas, I am confronted by a surprise Series. I don't understand how and why the output is a Series when I was expecting a DataFrame. If someone could explain what is happening here, it would be much appreciated.
ta, Andrew
Some data:
hash email date subject subject_length
0 65319af6e jbrockmendel@gmail.com 2020-11-28 REF-IntervalIndex._assert_can_do_setop-38112 44
1 0bf58d8a9 simonjayhawkins@gmail.com 2020-11-28 DOC-add-contibutors-to-1.2.0-release-notes-38132 48
2 d16df293c 45562402+rhshadrach@users.noreply.github.com 2020-11-28 TYP-Add-cast-to-ABC-Index-like-types-38043 42
...
Some Code:
def my_function(row):
    output = row['email'].value_counts().sort_values(ascending=False).head(3)
    return output
top_three = dataframe.groupby(pd.Grouper(key='date', freq='1M')).apply(my_function)
Some Output:
date
2020-01-31 jbrockmendel@gmail.com 159
50263213+MomIsBestFriend@users.noreply.github.com 44
TomAugspurger@users.noreply.github.com 41
...
2020-10-31 jbrockmendel@gmail.com 170
2658661+dsaxton@users.noreply.github.com 23
61934744+phofl@users.noreply.github.com 21
2020-11-30 jbrockmendel@gmail.com 134
61934744+phofl@users.noreply.github.com 36
41443370+ivanovmg@users.noreply.github.com 19
Name: email, dtype: int64
It depends on what your groupby is returning.
In your case, you are applying a function to row['email'] and returning a single value_counts result, while the grouping key and the counted values become part of the index. In other words, after the groupby you are returning a single column of values under a MultiIndex, which pandas returns as a Series instead of a DataFrame. A reset_index() would therefore give you what you need.
For more clarity on which data structure is returned, we can do a toy experiment.
For example, for the first case, the apply function is applying the lambda function on groups, where each group contains a dataframe (check [i for i in df.groupby(['a'])] to see what each group contains).
df = pd.DataFrame({'a':[1,1,2,2,3], 'b':[4,5,6,7,8]})
print(df.groupby(['a']).apply(lambda x:x**2))
#dataframe
a b
0 1 16
1 1 25
2 4 36
3 4 49
4 9 64
For the second case, we are applying the lambda function to a Series object, so only a single Series is being returned. In this case, it doesn't return a DataFrame and instead returns a Series.
print(df.groupby(['a'])['b'].apply(lambda x:x**2))
#series
0 16
1 25
2 36
3 49
4 64
Name: b, dtype: int64
This can be solved simply by selecting the column with a list, which keeps the DataFrame structure:
print(df.groupby(['a'])[['b']].apply(lambda x:x**2))
#dataframe
    b
0  16
1  25
2  36
3  49
4  64
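Coming back to the original question: since top_three is a Series with a MultiIndex of (date, email), reset_index() promotes the index levels back to columns and yields a DataFrame. A short sketch (the 'count' column name below is an assumption):

# `top_three` is the Series produced in the question
top_three_df = top_three.reset_index(name='count')
print(type(top_three_df))  # <class 'pandas.core.frame.DataFrame'>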
I have a dataframe that looks like:
Age Age Type
12 Years
5 Days
13 Hours
20 Months
... ......
I want to have my Age column in years, so depending on Age Type (Days, Hours, or Months) I will have to perform a scalar operation. I tried to implement a for loop, but I'm not sure I'm going about it the right way. Thanks!
Create a mapping dict (note: the column name contains a space, so use bracket access):
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}
df.Age * df['Age Type'].map(d)
Out[373]:
0    12.000000
1     0.013699
2     0.001484
3     1.666667
dtype: float64
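A minimal end-to-end sketch of the same idea, assigning the result to a new (hypothetical) Age_in_years column:

import pandas as pd

df = pd.DataFrame({'Age': [12, 5, 13, 20],
                   'Age Type': ['Years', 'Days', 'Hours', 'Months']})
d = {'Years': 1, 'Days': 1/365, 'Hours': 1/365/24, 'Months': 1/12}

# Map each row's unit to a conversion factor, then scale the Age values
df['Age_in_years'] = df.Age * df['Age Type'].map(d)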
The data is from the book "Python for Data Analysis", chapter 8, Bar Plots.
tips = pd.read_csv('ch8/tips.csv')
party_counts = pd.crosstab(tips.day,tips.size)
When I run the code above, I cannot get the result the book shows.
In [70]: party_counts
Out[70]:
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
my result is
In[36]: party_counts
Out[36]:
col_0 1708
day
Fri 19
Sat 87
Sun 76
Thur 62
I checked tips' dtypes:
In[49]: tips.dtypes
Out[49]:
total_bill float64
tip float64
sex object
smoker object
day object
time object
size int64
dtype: object
I also found this question, which likewise has an int column and still gets the crosstab result.
So, what am I doing wrong?
PS: my pandas version is '0.20.2', Python 3.6
size is an attribute of the DataFrame that returns the number of elements in it. If you have a size column, you need to use ['size'] to avoid the ambiguity:
pd.crosstab(tips.day, tips['size'])
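A quick toy illustration of the attribute shadowing (made-up data):

import pandas as pd

tips = pd.DataFrame({'day': ['Fri', 'Sat'], 'size': [2, 3]})
print(tips.size)        # 4 -- total number of elements in the frame
print(tips['size'])     # the actual 'size' column
print(pd.crosstab(tips.day, tips['size']))  # the intended cross-tabulation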
I have a pandas DataFrame with 1408 rows of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's numbers on the same day of the week on which the prior largest/smallest occurred. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, and build a CDF of the numbers associated with each weekday.
To clean the data, I need to remove 18 weeks in total: every week in the DataFrame associated with a holiday, plus the entire week following the week in which the holiday occurred.
After this, I think I should insert a column in the DataFrame that labels all my data with Monday through Friday, for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order, and query on the day of the week.
Methodological suggestions on either 1. or 2., or both, would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
   ....:     if len(df) != 5:
   ....:         return pandas.Series()
   ....:     return pandas.Series(df.Value.rank(), index=df.index.weekday)
   ....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
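As for the labeling in point 2 itself, a minimal sketch assuming a DatetimeIndex (day_name() is available in recent pandas versions; the holiday_weeks set is a placeholder for point 1, not derived from your data):

import pandas as pd

df = pd.DataFrame({'Value': [61, 52, 89]},
                  index=pd.to_datetime(['2012-08-01', '2012-08-02', '2012-08-03']))

# Point 2: label each row with its weekday name for sorting/querying
df['weekday'] = df.index.day_name()

# Point 1 (hypothetical approach): drop rows whose ISO week number falls
# in a precomputed set of holiday-affected weeks
holiday_weeks = {1, 2}  # placeholder values
df = df[~df.index.isocalendar().week.isin(holiday_weeks)]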