I have a pandas DataFrame that contains dates in a column Date. I need to add another column Days that holds the difference in days from the previous row, so the value in the i-th row should be the difference between the dates in rows i and i-1. For the first row, the difference should be 0.
Date Days
08-01-1997 0
09-01-1997 1
10-01-1997 1
13-01-1997 3
14-01-1997 1
15-01-1997 1
01-03-1997 45
03-03-1997 2
04-03-1997 1
05-03-1997 1
13-06-1997 100
I tried this, but it was not useful.
First convert the Date column to a pandas datetime object, then calculate the difference, which is a timedelta object. From there, take the days via Series.dt and assign 0 to the first value:
>>> df['Date']=pd.to_datetime(df['Date'], dayfirst=True)
>>> df['Days']=(df['Date']-df['Date'].shift()).dt.days.fillna(0).astype(int)
OUTPUT
df
Date Days
0 1997-01-08 0
1 1997-01-09 1
2 1997-01-10 1
3 1997-01-13 3
4 1997-01-14 1
5 1997-01-15 1
6 1997-03-01 45
7 1997-03-03 2
8 1997-03-04 1
9 1997-03-05 1
10 1997-06-13 100
You can use diff as well:
df['date_up'] = pd.to_datetime(df['Date'],dayfirst=True)
df['date_diff'] = df['date_up'].diff()
df['date_diff_num_days'] = df['date_diff'].dt.days.fillna(0).astype(int)
df.head()
Date Days date_up date_diff date_diff_num_days
0 08-01-1997 0 1997-01-08 NaT 0
1 09-01-1997 1 1997-01-09 1 days 1
2 10-01-1997 1 1997-01-10 1 days 1
3 13-01-1997 3 1997-01-13 3 days 3
4 14-01-1997 1 1997-01-14 1 days 1
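Putting it together, a minimal end-to-end sketch (the DataFrame below is just the sample data from the question, reconstructed for illustration):

import pandas as pd

# Sample data from the question (day-first date strings)
df = pd.DataFrame({'Date': ['08-01-1997', '09-01-1997', '10-01-1997', '13-01-1997',
                            '14-01-1997', '15-01-1997', '01-03-1997', '03-03-1997',
                            '04-03-1997', '05-03-1997', '13-06-1997']})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# diff() gives NaT for the first row; fillna(0) turns that into the required 0
df['Days'] = df['Date'].diff().dt.days.fillna(0).astype(int)
print(df)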
Related
I want to reduce my data. My initial dataframe looks as follows:
index  time [hh:mm:ss]          value1  value2
0      0 days 00:00:00.000000   3       4
1      0 days 00:00:04.000000   5       2
2      0 days 00:02:02.002300   7       9
3      0 days 00:02:03.000000   9       7
4      0 days 03:02:03.000000   4       3
Now I want to reduce my data so that only the rows at the start of every new minute (and, respectively, every new hour and day) are kept. Put the other way around: only the first row of each new minute should be kept; all remaining rows of that minute should be dropped.
So the resulting table looks as follows:
index  time                     value1  value2
0      0 days 00:00:00.000000   3       4
2      0 days 00:02:02.002300   7       9
4      0 days 03:02:03.000000   4       3
Any ideas how to approach this?
The data uses timedeltas, so it is possible to create a TimedeltaIndex and use DataFrame.resample with a 1-minute frequency together with Resampler.first. Resampling adds a row for every minute, so the all-NaN rows are then removed:
df.index = pd.to_timedelta(df['time [hh:mm:ss]'])
df = df.resample('1Min').first().dropna(how='all').reset_index(drop=True)
print (df)
time [hh:mm:ss] value1 value2
0 0 days 00:00:00.000000 3.0 4.0
1 0 days 00:02:02.002300 7.0 9.0
2 0 days 03:02:03.000000 4.0 3.0
You could extract the D:HH:MM using apply and multiple splits, and then drop the duplicates, keeping the first value.
# Build a "days:hours:minutes" string key for each row
dms = df['time [hh:mm:ss]'].apply(lambda x: ':'.join( [x.split(' days ')[0], *x.split('days ')[1].split(':')[:2]]) )
# drop_duplicates keeps the first occurrence of each key; use its index to slice the dataframe
df.iloc[dms.drop_duplicates().index]
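As a hedged alternative to the string splitting, the same truncation can be done with the timedelta accessor, assuming the column parses cleanly with pd.to_timedelta; flooring to the minute keeps the day and hour components intact, so different days are not merged:

import pandas as pd

# Truncate each timedelta to the minute and keep only the first row of each minute
truncated = pd.to_timedelta(df['time [hh:mm:ss]']).dt.floor('min')
df.loc[~truncated.duplicated()]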
from io import StringIO
import pandas as pd

d = '''index,time,value1,value2
0,0 days 00:00:00.000000,3,4
1,0 days 00:00:04.000000,5,2
2,0 days 00:02:02.002300,7,9
3,0 days 00:02:03.000000,9,7
4,0 days 03:02:03.000000,4,3'''

df = pd.read_csv(StringIO(d), parse_dates=True)
# Strip the leading "0 days " prefix and parse the remaining HH:MM:SS part as a datetime
df['time1'] = pd.to_datetime(df['time'].str.slice(7))
df.set_index('time1', inplace=True)
# Keep the first row of every (hour, minute) combination
df.groupby([df.index.hour, df.index.minute]).head(1).sort_index().reset_index(drop=True)
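Note that str.slice(7) drops the leading "0 days " prefix, so if the data ever spans more than one day, rows from different days with the same hour and minute would be grouped together. A hedged variant that keeps the day component works on the timedeltas directly:

# Group by (days, hours, minutes) of the timedelta and keep the first row of each group
td = pd.to_timedelta(df['time'])
df.groupby([td.dt.days, td.dt.seconds // 3600, (td.dt.seconds // 60) % 60]).head(1)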
I need to aggregate data between a constant date, like the first day of the year, and all the other dates through the year. There are two variants of this problem:
easier - sum:
created_at value
01-01-2012 5
02-01-2012 6
05-01-2012 1
05-01-2012 1
01-02-2012 3
02-02-2012 2
05-02-2012 1
which should output:
Date Month to date sum Year to date sum
01-01-2012 5 5
02-01-2012 11 11
05-01-2012 13 13
01-02-2012 3 14
02-02-2012 5 15
05-02-2012 6 16
and harder - count unique:
created_at value
01-01-2012 a
02-01-2012 b
05-01-2012 c
05-01-2012 c
01-02-2012 a
02-02-2012 a
05-02-2012 d
which should output:
Date Month to date unique Year to date unique
01-01-2012 1 1
02-01-2012 2 2
05-01-2012 3 3
01-02-2012 1 3
02-02-2012 1 3
05-02-2012 2 4
The data is, of course, in a pandas DataFrame. The obvious but very clumsy way is to create a for loop between the starting date and the moving one. The problem looks like a popular one. Is there some reasonable pandas built-in way for this type of computation? Regarding counting unique values, I also want to avoid stacking lists, as I have a large number of rows and unique values.
I was checking out Pandas window functions, but it doesn't look like a solution.
Try with groupby:
Cumulative sum:
df["created_at"] = pd.to_datetime(df["created_at"], format="%d-%m-%Y")
df["Month to date sum"] = df.groupby(df["created_at"].dt.month)["value"].transform('cumsum')
df["Year to date sum"] = df.groupby(df["created_at"].dt.year)["value"].transform('cumsum')
>>> df
created_at value Month to date sum Year to date sum
0 2012-01-01 5 5 5
1 2012-01-02 6 11 11
2 2012-01-05 1 12 12
3 2012-02-01 3 3 15
4 2012-02-02 2 5 17
5 2012-02-05 1 6 18
Cumulative unique count:
df2["created_at"] = pd.to_datetime(df2["created_at"], format="%d-%m-%Y")
df2["Month to date unique"] = df2.groupby(df2["created_at"].dt.month)["value"].apply(lambda x: (~x.duplicated()).cumsum())
df2["Year to date unique"] = df2.groupby(df2["created_at"].dt.year)["value"].apply(lambda x: (~x.duplicated()).cumsum())
>>> df2
created_at value Month to date unique Year to date unique
0 2012-01-01 a 1 1
1 2012-01-02 b 2 2
2 2012-01-05 c 3 3
3 2012-02-01 a 1 3
4 2012-02-02 a 1 3
5 2012-02-05 d 2 4
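One caveat for both snippets: grouping by dt.month alone merges the same calendar month from different years. The sample data stays within 2012, so it does not matter here, but if the data can span several years, a hedged variant is to group by year and month together:

# Month-to-date keyed by (year, month) so that e.g. January 2012 and January 2013 stay separate
by_month = [df["created_at"].dt.year, df["created_at"].dt.month]
df["Month to date sum"] = df.groupby(by_month)["value"].cumsum()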
I have a dataframe looks like :
id TakingTime
1 03-01-2015
1 18-07-2015
1 22-10-2015
1 14-01-2016
2 11-02-2015
2 28-02-2015
2 18-04-2015
2 19-05-2015
3 11-02-2015
3 16-11-2015
3 19-02-2016
3 21-04-2016
4 03-01-2015
4 03-01-2015
4 03-01-2015
4 03-01-2015
The output desired is :
id TakingTime
1 03-01-2015
1 18-07-2015
1 22-10-2015
1 14-01-2016
3 11-02-2015
3 16-11-2015
3 19-02-2016
3 21-04-2016
I want to keep only the ids where the time difference between the first and the last taking time is at least one year, and remove the others.
I tried with
df[df.groupby('ID')['takingtime'].transform(lambda x: x.nunique() > 1)]
But I'm not sure this is the right way to do it, and if it is, what does the threshold compare: days, months, years...?
Use:
idx = df.groupby('id').TakingTime.transform(lambda x: x.dt.year.diff().sum().astype(bool))
df[idx]
Output:
id TakingTime
0 1 2015-03-01
1 1 2015-07-18
2 1 2015-10-22
3 1 2016-01-14
8 3 2015-11-02
9 3 2015-11-16
10 3 2016-02-19
11 3 2016-04-21
Explanation:
For each id, take the difference between consecutive years. If there is any difference greater than 0 (hence sum().astype(bool)), it returns True. We use transform to broadcast the result to the whole group. Finally, slice the dataframe with the resulting boolean index.
Edit:
To analyze a specific amount of time (in days):
days = 865
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()).days >= days)
or:
from datetime import timedelta
days = timedelta(865)
df.groupby('id').TakingTime.transform(lambda x: (x.max() - x.min()) >= days)
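For the one-year threshold from the question, a minimal sketch (using 365 days as the assumed definition of a year, and TakingTime parsed with dayfirst=True):

import pandas as pd

df['TakingTime'] = pd.to_datetime(df['TakingTime'], dayfirst=True)
one_year = pd.Timedelta(days=365)  # assumption: one year == 365 days
# Keep only the ids whose first-to-last span is at least one year
mask = df.groupby('id')['TakingTime'].transform(lambda x: x.max() - x.min() >= one_year)
df[mask]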
I need to create groups using two columns. For example, I took shop_id and week. Here is the df:
shop_id week
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 3 2
6 1 5
Imagine that each group is some promo which took place in each shop consecutively (week by week). So, my attempt was to use sorting, shifting by 1 to get last_week, use booleans, and then iterate over them, incrementing each time the condition is not met:
test_df = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2, 3, 1], 'week': [1, 2, 3, 1, 2, 2, 5]})

def createGroups(df, shop_id, week, group):
    '''Create groups where the shop_id is the same and the weeks are consecutive.'''
    periods = []
    period = 0
    # sorting to create chronological order
    df = df.sort_values(by=[shop_id, week], ignore_index=True)
    last_week = df[week].shift(+1) == df[week] - 1
    last_shop = df[shop_id].shift(+1) == df[shop_id]
    # here I iterate over booleans and increment the group by 1
    # if the shop is different or the last period is more than 1 week ago
    for p, s in zip(last_week, last_shop):
        if p and s:
            periods.append(period)
        else:
            period += 1
            periods.append(period)
    df[group] = periods
    return df
createGroups(test_df, 'shop_id', 'week', 'promo')
And I get the grouping I need:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
3 1 5 2
4 2 1 3
5 2 2 3
6 3 2 4
However, the function seems to be overkill. Any ideas on how to get the same result without a for loop, using native pandas functions? I saw .ngroup() in the docs but have no idea how to apply it to my case. Even better would be to vectorise it somehow, but I don't know how to achieve this :(
First we want to identify the promotions (consecutive weeks), then use groupby().ngroup() to enumerate them:
df = df.sort_values('shop_id')
s = df['week'].diff().ne(1).groupby(df['shop_id']).cumsum()
df['promo'] = df.groupby(['shop_id',s]).ngroup() + 1
Update: This is based on your solution:
df = df.sort_values(['shop_id','week'])
s = df[['shop_id', 'week']]
df['promo'] = (s['shop_id'].ne(s['shop_id'].shift()) |
s['week'].diff().ne(1) ).cumsum()
Output:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
6 1 5 2
3 2 1 3
4 2 2 3
5 3 2 4
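As a usage sketch, the update can be checked against the test_df from the question:

import pandas as pd

test_df = pd.DataFrame({'shop_id': [1, 1, 1, 2, 2, 3, 1],
                        'week':    [1, 2, 3, 1, 2, 2, 5]})
test_df = test_df.sort_values(['shop_id', 'week'])
# A new promo starts whenever the shop changes or the week is not consecutive
new_promo = (test_df['shop_id'].ne(test_df['shop_id'].shift()) |
             test_df['week'].diff().ne(1))
test_df['promo'] = new_promo.cumsum()
print(test_df)  # reproduces the promo column shown above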
I have a Pandas Dataframe with data about calls. Each call has a unique ID and each customer has an ID (but can have multiple Calls). A third column gives a day. For each customer I want to calculate the maximum number of calls made in a period of 7 days.
I have been using the following code to count the number of calls within 7 days of the call on each row:
df['ContactsIN7Days'] = df.apply(lambda row: len(df[(df['PersonID']==row['PersonID']) & (abs(df['Day'] - row['Day']) <=7)]), axis=1)
Output:
CallID Day PersonID ContactsIN7Days
6 2 3 2
3 14 2 2
1 8 1 1
5 1 3 2
2 12 2 2
7 100 3 1
This works; however, it is going to be applied to a big data set. Would there be a way to make this more efficient, e.g. through vectorization?
IIUC, this is a convoluted but, I think, effective solution to your issue. Note that the order of your dataframe is modified as a result, and that your Day column is converted to a timedelta dtype.
Starting from your dataframe df:
CallID Day PersonID
0 6 2 3
1 3 14 2
2 1 8 1
3 5 1 3
4 2 12 2
5 7 100 3
Start by modifying Day to a timedelta series:
df['Day'] = pd.to_timedelta(df['Day'], unit='d')
Then, use pd.merge_asof to merge your dataframe with the count of calls by each individual in a period of 7 days. To get this count, use groupby with a pd.Grouper with a frequency of 7 days:
new_df = (pd.merge_asof(df.sort_values(['Day']),
df.sort_values(['Day'])
.groupby([pd.Grouper(key='Day', freq='7d'), 'PersonID'])
.size()
.to_frame('ContactsIN7Days')
.reset_index(),
left_on='Day', right_on='Day',
left_by='PersonID', right_by='PersonID',
direction='nearest'))
Your resulting new_df will look like this:
CallID Day PersonID ContactsIN7Days
0 5 1 days 3 2
1 6 2 days 3 2
2 1 8 days 1 1
3 2 12 days 2 2
4 3 14 days 2 2
5 7 100 days 3 1
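Since the stated goal is the maximum number of calls per customer within a 7-day period, one more step on top of new_df would give that (a hedged sketch; note the Grouper uses fixed 7-day bins rather than a sliding window):

# Maximum 7-day-bin call count per customer
max_calls = new_df.groupby('PersonID')['ContactsIN7Days'].max()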