I want to count the number of days after grouping by 2 columns:
groups = df.groupby([df.col1,df.col2])
Now I want to count the number of unique days in each group:
result = groups['date_time'].dt.date.nunique()
I'm using something similar when I want to group by day, but here I get an error:
AttributeError: Cannot access attribute 'dt' of 'SeriesGroupBy' objects, try using the 'apply' method
What is the proper way to get the number of days?
You need another variation of groupby - select the column first, so that .dt is applied to a Series rather than to a SeriesGroupBy:
df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
df['date_time1'] = df['date_time'].dt.date
a = df.groupby([df.col1,df.col2]).date_time1.nunique()
Sample:
start = pd.to_datetime('2015-02-24')
rng = pd.date_range(start, periods=10, freq='15H')
df = pd.DataFrame({'date_time': rng, 'col1': [0]*5 + [1]*5, 'col2': [2]*3 + [3]*4+ [4]*3})
print (df)
col1 col2 date_time
0 0 2 2015-02-24 00:00:00
1 0 2 2015-02-24 15:00:00
2 0 2 2015-02-25 06:00:00
3 0 3 2015-02-25 21:00:00
4 0 3 2015-02-26 12:00:00
5 1 3 2015-02-27 03:00:00
6 1 3 2015-02-27 18:00:00
7 1 4 2015-02-28 09:00:00
8 1 4 2015-03-01 00:00:00
9 1 4 2015-03-01 15:00:00
#solution with apply
df1 = df.groupby(['col1','col2'])['date_time'].apply(lambda x: x.dt.date.nunique())
print (df1)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
#create new helper column
df['date_time1'] = df['date_time'].dt.date
df2 = df.groupby([df.col1,df.col2]).date_time1.nunique()
print (df2)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time1, dtype: int64
df3 = df['date_time'].dt.date.groupby([df.col1,df.col2]).nunique()
print (df3)
col1 col2
0 2 2
3 2
1 3 1
4 2
Name: date_time, dtype: int64
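A further variation (a sketch, assuming a pandas version with Series.dt.normalize) avoids converting to python date objects: normalize() zeroes the time component while keeping the datetime64 dtype, so nunique again counts distinct days:
#same counts as above, without .dt.date
df4 = df['date_time'].dt.normalize().groupby([df.col1,df.col2]).nunique()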
Related
I have 2 Dataframes like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this Dataframe:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
What I want to do is find the difference in dates for rows that share the same ID across the two DataFrames, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and summing up the differences for the corresponding ID.
The expected output would be the following, where Date is the summed-up difference in days for the corresponding ID:
ID Date
1 6338
2 7323
3 2760
4 13308
Solution if df1.ID has no duplicates (only df2.ID does): use Series.map to create a new column for subtracting with Series.sub, convert the timedeltas to days with Series.dt.days, and last aggregate sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
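For completeness, the days_between function posted in the question would also work if applied row-wise after a merge - a sketch assuming Date1/Date2 are still strings (i.e. before the pd.to_datetime conversion), and much slower than the vectorized subtraction above:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)

#python-level loop over rows
merged = df1.merge(df2, on='ID')
merged['Date'] = merged.apply(lambda r: days_between(r['Date1'], r['Date2']), axis=1)
out = merged.groupby('ID', as_index=False)['Date'].sum()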
Does this work?
d = pd.merge(df1, df2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format = '%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
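The sums here are Timedelta objects, hence the days suffix; if plain integers are preferred, .dt.days can be chained after the sum (a small tweak to the code above):
d.groupby('ID')['Date'].sum().dt.days.reset_index()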
I have the following dataframe:
df=pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],[1,7,8,4,3,4,3]]),
columns=['9:00:00','9:05:00','09:10:00','09:15:00','09:20:00','09:25:00','09:30:00'])
>>> 9:00:00 9:05:00 09:10:00 09:15:00 09:20:00 09:25:00 09:30:00 ....
a 1 2 3 4 7 9 5
b 2 6 5 4 9 8 2
c 3 5 3 21 12 6 7
d 1 7 8 4 3 4 3
I would like to get, for each row (e.g. a, b, c, d ...), the mean value between specific hours. The hours run from 9 to 15, and I want to group by period, for example to calculate the mean value between 09:00:00 and 11:00:00, between 11 and 12, or between 13 and 15 (or any period I decide on).
I first tried to convert the column values to datetime format, thinking it would then be easier to do this:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
but then I got the column names with a fake year ("1900-01-01 09:00:00")...
And also, the column headers' type was object, so I felt a bit lost...
My end goal is to calculate new columns with the mean value for each row, using only the columns that fall inside the defined time period (e.g. 9-11, etc.).
If you need fixed periods, e.g. every 2 hours, use resample:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
df1 = df.resample('2H', axis=1).mean()
print (df1)
1900-01-01 08:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
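Note that the axis=1 argument of resample is deprecated in recent pandas versions; an equivalent sketch (assuming pandas 2.0+) transposes, resamples along the index, and transposes back:
df1 = df.T.resample('2H').mean().T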
If you need custom periods, it is possible to use cut:
df.columns = pd.to_datetime(df.columns,format="%H:%M:%S")
bins = ['5:00:00','9:00:00','11:00:00','12:00:00', '23:59:59']
dates = pd.to_datetime(bins,format="%H:%M:%S")
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
And last, take the mean per column label; the reason for the NaN columns is that the labels are categoricals, so unused categories still show up:
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 5:00:00-9:00:00 11:00:00-12:00:00 12:00:00-23:59:59
0 4.428571 NaN NaN NaN
1 5.142857 NaN NaN NaN
2 8.142857 NaN NaN NaN
3 4.285714 NaN NaN NaN
To avoid the NaN columns, convert the column names to strings:
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
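A hedged alternative to the string conversion (it assumes the observed parameter of groupby, available in modern pandas) is to transpose and group with observed=True, so unused categories are dropped instead of producing NaN columns:
#transpose, group the rows (former columns) by label, transpose back
df3 = df.T.groupby(level=0, observed=True).mean().T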
EDIT: The same solution with timedeltas, which better match the HH:MM:SS format:
df.columns = pd.to_timedelta(df.columns)
print (df)
0 days 09:00:00 0 days 09:05:00 0 days 09:10:00 0 days 09:15:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
0 days 09:20:00 0 days 09:25:00 0 days 09:30:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
bins = ['9:00:00','11:00:00','12:00:00']
dates = pd.to_timedelta(bins)
labels = [f'{i}-{j}' for i, j in zip(bins[:-1], bins[1:])]
df.columns = pd.cut(df.columns, bins=dates, labels=labels, right=False)
print (df)
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00 \
0 1 2 3 4
1 2 6 5 4
2 3 5 3 21
3 1 7 8 4
9:00:00-11:00:00 9:00:00-11:00:00 9:00:00-11:00:00
0 7 9 5
1 9 8 2
2 12 6 7
3 3 4 3
#missing values because no datetimes exist between 11:00:00 and 12:00:00
df2 = df.mean(level=0, axis=1)
print (df2)
9:00:00-11:00:00 11:00:00-12:00:00
0 4.428571 NaN
1 5.142857 NaN
2 8.142857 NaN
3 4.285714 NaN
df3 = df.rename(columns=str).mean(level=0, axis=1)
print (df3)
9:00:00-11:00:00
0 4.428571
1 5.142857
2 8.142857
3 4.285714
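One caveat for newer pandas: the level argument of DataFrame.mean was removed in pandas 2.0, so the last step would need a groupby instead - a sketch of the equivalent:
#equivalent of df.rename(columns=str).mean(level=0, axis=1) in pandas 2.0+
df3 = df.rename(columns=str).T.groupby(level=0).mean().T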
I am going to show you my code and the results after the execution.
First, import the libraries and create the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1,2,3,4,7,9,5],[2,6,5,4,9,8,2],[3,5,3,21,12,6,7],
                            [1,7,8,4,3,4,3]]),
                  columns=['9:00:00','9:05:00','09:10:00','09:15:00',
                           '09:20:00','09:25:00','09:30:00'])
It would be nice to create a class in order to define what a period is:
class Period():
    def __init__(self, initial, end):
        self.initial = initial
        self.end = end
    def __repr__(self):
        return self.initial + ' -- ' + self.end
With the .loc command we can get a sub-dataframe containing the columns I desire:
def get_colMean(df, period):
    df2 = df.loc[:, period.initial:period.end]
    #mean over the sliced columns only
    array_mean = df2.mean(axis=1).values
    col_name = 'mean_' + period.initial + '--' + period.end
    pd_colMean = pd.DataFrame(array_mean, columns=[col_name])
    return pd_colMean
Finally, we use .join in order to add our column with the means to the original dataframe:
def join_colMean(df, period):
    pd_colMean = get_colMean(df, period)
    df = df.join(pd_colMean)
    return df
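A quick usage sketch with the dataframe defined above (note the period bounds must match existing column labels for the .loc slice to work):
period = Period('9:00:00', '09:20:00')
df = join_colMean(df, period)
print(df)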
The result is the original dataframe with one extra mean column per period.
I have a dataframe with item_id and date_activity columns, where some rows hold the placeholder value 1/1/2000 12:00:00 in date_activity. I want to backfill each such row with the max date_activity for its item_id, using pandas.
Create missing values with Series.duplicated and Series.mask, and then backfill them:
df = pd.DataFrame({'item_id':[1,1,1,2,2,2,2],
'date_active':pd.date_range('2019-02-02', periods=7)})
print (df)
item_id date_active
0 1 2019-02-02
1 1 2019-02-03
2 1 2019-02-04
3 2 2019-02-05
4 2 2019-02-06
5 2 2019-02-07
6 2 2019-02-08
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df)
item_id date_active
0 1 2019-02-04
1 1 2019-02-04
2 1 2019-02-04
3 2 2019-02-08
4 2 2019-02-08
5 2 2019-02-08
6 2 2019-02-08
Details:
print (df['item_id'].duplicated(keep='last'))
0 True
1 True
2 False
3 True
4 True
5 True
6 False
Name: item_id, dtype: bool
print (df['date_active'].mask(df['item_id'].duplicated(keep='last')))
0 NaT
1 NaT
2 2019-02-04
3 NaT
4 NaT
5 NaT
6 2019-02-08
Name: date_active, dtype: datetime64[ns]
EDIT:
With real data it is necessary to sort the values first, so that the last row in each group holds the maximum date:
print (df)
item_id date_active
0 1 7/26/2019 17:06
1 1 8/27/2019 17:06
df['date_active'] = pd.to_datetime(df['date_active'])
df = df.sort_values(['item_id','date_active'])
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df)
item_id date_active
0 1 2019-08-27 17:06:00
1 1 2019-08-27 17:06:00
EDIT1: Use DataFrame.resample to add the missing datetimes per group:
df['date_active'] = pd.to_datetime(df['date_active'])
df = df.sort_values(['item_id','date_active'])
df = (df.set_index('date_active').groupby('item_id')
.resample('D')
.last()
.drop('item_id', axis=1)
.reset_index())
df['date_active'] = df['date_active'].mask(df['item_id'].duplicated(keep='last')).bfill()
print (df.tail())
item_id date_active
28 1 2019-08-27
29 1 2019-08-27
30 1 2019-08-27
31 1 2019-08-27
32 1 2019-08-27
I have a dataframe with 3 columns: one date column and 2 object columns. I need to fill in missing dates for the different COL1 and COL2 combinations, using the max and min dates of the dataframe. The date column contains only the first day of each month.
I have done it in a naive manner, but the original data runs to thousands of records, and iterating through all the COL1 + COL2 combinations and date ranges takes a huge amount of time. The original dataframe contains 15000 records and 30 columns. I need to fill in the missing date + COL1 + COL2 rows and leave the remaining columns empty. For example, if I have data for Jan 2019 for a COL1 + COL2 combination but not for Feb, I want to insert a row for Feb with that COL1 and COL2 and the other columns empty.
The set of unique (COL1 + COL2) combinations must be the same before and after the filling.
Please help me optimize it.
df_1 = pd.DataFrame({'Date':['2018-01-01','2018-02-01','2018-03-01','2018-05-01','2018-05-01'],
'COL1':['A','A','B','B','A'],
'COL2':['1','2','1','2','1']})
df_1['Date'] = pd.to_datetime(df_1['Date'])
Initial Dataframe -->>
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
--
print(df_1.dtypes)
print(df_1)
COLS_COMBO = list(set(df_1[['COL1','COL2']].itertuples(index=False, name=None)))
months_range = [str(i.date()) for i in pd.date_range(start=min(df_1['Date']).date(),
                                                     end=max(df_1['Date']).date(), freq='MS')]
print(COLS_COMBO)
print(months_range)

for col in COLS_COMBO:
    col1, col2 = col[0], col[1]
    for month in months_range:
        d = df_1[(df_1['Date'] == month) & (df_1['COL1'] == col1) & (df_1['COL2'] == col2)]
        if len(d) == 0:
            dx = {'Date': month, 'COL1': col1, 'COL2': col2}
            #naive row-by-row insert (DataFrame.append is deprecated in newer pandas)
            df_1 = df_1.append(dx, ignore_index=True)
print(df_1)
OUTPUT
Data TYPES -->>
Date datetime64[ns]
COL1 object
COL2 object
dtype: object
Unique Combinations of COL1 + COL2 -->>
[('A', '2'), ('B', '2'), ('B', '1'), ('A', '1')]
Months range using min, max in the dataframe -->>
['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01']
My final output is
FINAL Dataframe -->>
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
5 2018-01-01 A 2
6 2018-02-01 A 2
7 2018-03-01 A 2
8 2018-04-01 A 2
9 2018-05-01 A 2
10 2018-01-01 B 2
11 2018-02-01 B 2
12 2018-03-01 B 2
13 2018-04-01 B 2
14 2018-05-01 B 2
15 2018-01-01 B 1
16 2018-02-01 B 1
17 2018-03-01 B 1
18 2018-04-01 B 1
19 2018-05-01 B 1
20 2018-01-01 A 1
21 2018-02-01 A 1
22 2018-03-01 A 1
23 2018-04-01 A 1
24 2018-05-01 A 1
PS:
COL1 is like a parent and COL2 a child, so the original set of combinations must not change, and existing (Date + COL1 + COL2) combinations must not be duplicated or updated.
You can use:
from itertools import product
#get all unique combinations of columns
COLS_COMBO = df_1[['COL1','COL2']].drop_duplicates().values.tolist()
#remove times and create MS date range
dates = df_1['Date'].dt.floor('d')
months_range = pd.date_range(dates.min(), dates.max(), freq='MS')
print(COLS_COMBO)
print(months_range)
#create all combinations of values
df = pd.DataFrame([(c, a, b) for (a, b), c in product(COLS_COMBO, months_range)],
columns=['Date','COL1','COL2'])
print (df)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 1
2 2018-03-01 A 1
3 2018-04-01 A 1
4 2018-05-01 A 1
5 2018-01-01 A 2
6 2018-02-01 A 2
7 2018-03-01 A 2
8 2018-04-01 A 2
9 2018-05-01 A 2
10 2018-01-01 B 1
11 2018-02-01 B 1
12 2018-03-01 B 1
13 2018-04-01 B 1
14 2018-05-01 B 1
15 2018-01-01 B 2
16 2018-02-01 B 2
17 2018-03-01 B 2
18 2018-04-01 B 2
19 2018-05-01 B 2
#add to original df_1 and remove duplicates
df_1 = pd.concat([df_1, df], ignore_index=True).drop_duplicates()
print (df_1)
Date COL1 COL2
0 2018-01-01 A 1
1 2018-02-01 A 2
2 2018-03-01 B 1
3 2018-05-01 B 2
4 2018-05-01 A 1
6 2018-02-01 A 1
7 2018-03-01 A 1
8 2018-04-01 A 1
10 2018-01-01 A 2
12 2018-03-01 A 2
13 2018-04-01 A 2
14 2018-05-01 A 2
15 2018-01-01 B 1
16 2018-02-01 B 1
18 2018-04-01 B 1
19 2018-05-01 B 1
20 2018-01-01 B 2
21 2018-02-01 B 2
22 2018-03-01 B 2
23 2018-04-01 B 2
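The grid-building step can also be written without itertools - a sketch assuming pandas 1.2+, where merge supports how='cross':
#all (Date, COL1, COL2) combinations via a cross join
combos = df_1[['COL1','COL2']].drop_duplicates()
grid = combos.merge(pd.DataFrame({'Date': months_range}), how='cross')
#then concat with df_1 and drop_duplicates exactly as above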
yearCount = df[['antibiotic', 'order_date', 'antiYearCount']]
yearGroups = yearCount.groupby('order_date')
for year in yearGroups:
    yearCount['antiYearCount'] = year.groupby('antibiotic')['antibiotic'].transform(pd.Series.value_counts)
In this case, yearCount is a dataframe containing 'order_date', 'antibiotic' and 'antiYearCount'. I have cleaned 'order_date' to contain only the year of the order. I want to group yearCount by the years in 'order_date', count the number of times each 'antibiotic' appears in each "year group", and then assign that value to yearCount's 'antiYearCount' column.
I think you need to add the order_date column to the groupby; it is then also possible to use 'size' instead of pd.Series.value_counts for the same output:
df = pd.DataFrame({'antibiotic':list('accbbb'),
'antiYearCount':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'order_date': pd.to_datetime(['2012-01-01']*3+['2012-01-02']*3)})
print (df)
C D E antiYearCount antibiotic order_date
0 7 1 5 4 a 2012-01-01
1 8 3 3 5 c 2012-01-01
2 9 5 6 4 c 2012-01-01
3 4 7 9 5 b 2012-01-02
4 2 1 2 5 b 2012-01-02
5 3 0 4 4 b 2012-01-02
#copy to avoid SettingWithCopyWarning, see
#https://stackoverflow.com/a/45035966/2901002
yearCount = df[['antibiotic', 'order_date', 'antiYearCount']].copy()
yearCount['antiYearCount'] = yearCount.groupby(['order_date','antibiotic'])['antibiotic'] \
.transform('size')
print (yearCount)
antibiotic order_date antiYearCount
0 a 2012-01-01 1
1 c 2012-01-01 2
2 c 2012-01-01 2
3 b 2012-01-02 3
4 b 2012-01-02 3
5 b 2012-01-02 3
The original pd.Series.value_counts also works for the same output once order_date is part of the groupby:
yearCount['antiYearCount'] = yearCount.groupby(['order_date','antibiotic'])['antibiotic'] \
                                      .transform(pd.Series.value_counts)
print (yearCount)
antibiotic order_date antiYearCount
0 a 2012-01-01 1
1 c 2012-01-01 2
2 c 2012-01-01 2
3 b 2012-01-02 3
4 b 2012-01-02 3
5 b 2012-01-02 3
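As a side note, transform('count') gives the same output here as well, because 'count' tallies non-NaN values per group and the antibiotic column has no NaNs:
yearCount['antiYearCount'] = yearCount.groupby(['order_date','antibiotic'])['antibiotic'] \
                                      .transform('count')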