Time Ranking DataFrame within the constraints of noise - python

I have a dataframe df with three columns, viz. Date, Time and Name (there can be extra columns). df is sorted in ascending order of Time. On any given Date there can be multiple Time values, which are either about 5 minutes apart or more than 15 minutes apart. On any given day, anything within 5 minutes should be treated as the same time. I want to add a column TimeRank which, on any given day, clusters Times that fall within 5 minutes of each other and gives them the same TimeRank. For example,
Date Name Time TimeRank
0 2017-01-01 Henry 2017-01-01 09:21:01 1
1 2017-01-01 John 2017-01-01 09:23:43 1
2 2017-01-01 Svetlana 2017-01-01 10:15:01 2
3 2017-01-01 Sara 2017-01-01 11:01:01 3
4 2017-01-01 Whitney 2017-01-01 11:03:03 3
5 2017-01-02 Lara 2017-01-02 11:03:03 1
6 2017-01-02 Eugene 2017-01-02 16:46:00 2
7 2017-01-02 Richard 2017-01-02 16:46:00 2
8 2017-01-03 Andy 2017-01-03 11:01:01 1
9 2017-01-03 Paul 2017-01-03 11:03:03 1
Below I have created a sample df. Unfortunately, I am constrained to an older version of pandas (0.16).
import pandas as pd
from random import randint
from datetime import time
dates = pd.date_range('2017-01-01', '2017-01-04')
dates2 = [dates[i] for i in [randint(0, len(dates) -1) for i in range (0, 100)]]
timelist = [time(9,20,45), time(9,21,0), time(9,23,43), time(9,50,0), time(10,15,1), time(11,1,1), time(11,3,3), time(16,45,0), time(16,46,0)]
timelist2 = [timelist[i] for i in [randint(0, len(timelist) -1) for i in range (0, 100)]]
names = ['henry', 'tom', 'andy', 'lara', 'whitney', 'eleanor', 'paloma', 'john', 'james', 'svetlana', 'paul']
names2 = [names[i] for i in [randint(0, len(names)-1) for i in range (0, 100)]]
df = pd.DataFrame({'Date':dates2, 'Time':timelist2, 'Name':names2})
df['Time'] = df.apply(lambda r:pd.datetime.combine(r['Date'],r['Time']), axis=1)
df.sort('Time', inplace=True)

import numpy as np

# Time of day in minutes, and the minute difference to the previous row on the same Date
df.loc[:, 'minutes'] = df.apply(lambda x: x['Time'].minute + 60 * x['Time'].hour, axis=1)
df.loc[:, 'delTime'] = df.groupby('Date')['minutes'].diff()
# Rows within 5 minutes of the previous row belong to the same cluster
df.loc[(df['delTime'] <= 5) & (df['delTime'] >= -5), 'delTime'] = 0
# The first row of each Date has a NaN diff: mark it as a cluster start
df.loc[np.isnan(df['delTime']), 'delTime'] = 1.
# Cluster members take NaN, cluster starts take their own minute value, then forward fill
df.loc[df['delTime'] == 0, 'delTime'] = np.nan
df.loc[~np.isnan(df['delTime']), 'delTime'] = df['minutes']
df = df.ffill()
# Dense-rank the cluster anchor times within each Date
df.loc[:, 'TimeRank'] = df.groupby('Date')['delTime'].rank(method='dense')
df.drop(['minutes', 'delTime'], inplace=True, axis=1)
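For reference, here is an alternative sketch of the same idea (not tested against 0.16, and it assumes groupby().diff() works on the datetime Time column there): start a new cluster whenever the gap to the previous row on the same Date exceeds 5 minutes, then take the running count of cluster starts per Date as the rank.
import pandas as pd

def add_time_rank(df, gap_minutes=5):
    # Hypothetical helper, not from the question: rank 5-minute clusters of Times per Date
    df = df.sort('Time')                      # pandas 0.16 API; sort_values in newer versions
    gaps = df.groupby('Date')['Time'].diff()  # time since the previous row on the same Date
    new_cluster = gaps.isnull() | (gaps > pd.Timedelta(minutes=gap_minutes))
    df['TimeRank'] = new_cluster.astype(int).groupby(df['Date']).cumsum()
    return df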

Related

How to add time values manually to Pandas dataframe TimeStamp column?

Suppose I have a dataframe df looking like this
df
TimeStamp. Column1......Column n.
2017-01-01
2017-01-02
...
But I want it like this
TimeStamp. Column1......Column n.
2017-01-01 00:00:00
2017-01-02 00:00:00
...
How can I add this (00:00:00) to all TimeStamps in the dataframe? Thanks
Find the code below:
import pandas as pd
df = pd.DataFrame([{"Timestamp": "2017-01-01"}, {"Timestamp": "2017-01-01"}], columns=['Timestamp'])
# Append the midnight time to each date string
df_new = df['Timestamp'].apply(lambda k: k + " 00:00:00")
Output:
df_new
0 2017-01-01 00:00:00
1 2017-01-01 00:00:00
Name: Timestamp, dtype: object
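The apply is not strictly needed either; element-wise string concatenation on the same (assumed) column is already vectorized:
df_new = df['Timestamp'] + " 00:00:00"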
import pandas as pd
from datetime import datetime, timedelta
Name = ['a', 'b', 'c', 'd']
Age = [10, 20, 30, 40]
somedate = datetime.date(datetime.now())
DOB = [somedate] * 4
somelistdata = list(zip(Name, Age, DOB))
df = pd.DataFrame(somelistdata, columns = ['Name', 'Age', 'DOB'])
# problem statement
print(df)
# solution to your problem
df['DOB'] = pd.to_datetime(df['DOB']).dt.strftime('%Y-%m-%d %H:%M:%S')
print(df)
Problem statement
Name Age DOB
0 a 10 2019-09-19
1 b 20 2019-09-19
2 c 30 2019-09-19
3 d 40 2019-09-19
Solution
Name Age DOB
0 a 10 2019-09-19 00:00:00
1 b 20 2019-09-19 00:00:00
2 c 30 2019-09-19 00:00:00
3 d 40 2019-09-19 00:00:00

Pandas closest future value not equal to current row

I have a Pandas DataFrame with one column, price, and a DateTimeIndex. I would like to create a new column that is 1 when price increases the next time it changes and 0 if it decreases. Multiple consecutive rows may have the same value of price.
Example:
import pandas as pd
df = pd.DataFrame({"price" : [10, 10, 20, 10, 30, 5]}, index=pd.date_range(start="2017-01-01", end="2017-01-06"))
The output should then be:
2017-01-01 1
2017-01-02 1
2017-01-03 0
2017-01-04 1
2017-01-05 0
2017-01-06 NaN
In practice this DataFrame has ~20 million rows, so I'm really looking for a vectorized way of doing this.
Here is one way to do this:
calculate the price difference and shift up by one;
use numpy.where to assign one to positions where price increases, zero to positions where price decreases;
back fill the indicator column, so rows where the price does not change take the same value as the next available observation;
In code:
import numpy as np
price_diff = df.price.diff().shift(-1)
df['indicator'] = np.where(price_diff.gt(0), 1, np.where(price_diff.lt(0), 0, np.nan))
df['indicator'] = df.indicator.bfill()
df
# price indicator
#2017-01-01 10 1.0
#2017-01-02 10 1.0
#2017-01-03 20 0.0
#2017-01-04 10 1.0
#2017-01-05 30 0.0
#2017-01-06 5 NaN
df['New']=(df-df.shift(-1))[:-1].le(0).astype(int)
df
Out[879]:
price New
2017-01-01 10 1.0
2017-01-02 10 1.0
2017-01-03 20 0.0
2017-01-04 10 1.0
2017-01-05 30 0.0
2017-01-06 5 NaN
Use shift:
sh = df['price'].shift(-1)
out = (df['price'] <= sh).astype(float).where(sh.notnull())  # NaN where there is no next value
or
sh = df['price'].shift(-1)
out = np.where(sh.isnull(), np.nan, df['price'] <= sh)
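np.where returns a plain NumPy array; if the DatetimeIndex should be kept, a small sketch wrapping the result back into a Series (pd and np as imported above):
out = pd.Series(np.where(sh.isnull(), np.nan, df['price'] <= sh), index=df.index)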

how to merge group rows in dataframe based on differences between datetime?

I have a dataframe which contains events on each row, with a Start and End datetime.
import pandas as pd
import datetime
df = pd.DataFrame({ 'Value' : [1.,2.,3.],
'Start' : [datetime.datetime(2017,1,1,0,0,0),datetime.datetime(2017,1,1,0,1,0),datetime.datetime(2017,1,1,0,4,0)],
'End' : [datetime.datetime(2017,1,1,0,0,59),datetime.datetime(2017,1,1,0,5,0),datetime.datetime(2017,1,1,0,6,00)]},
index=[0,1,2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:07:00 2017-01-01 00:06:00 3.0
I would like to group consecutive rows where the difference between the End of one row and the Start of the next row is smaller than a given timedelta.
e.g. here, with a timedelta of 5 seconds I would like to group the rows with index 0 and 1, and with a timedelta of 2 minutes it should yield the rows 0, 1 and 2.
A solution would be to compare consecutive rows with their shifted version using .shift(), however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(minutes=5)
# Gap between the Start of each row and the End of the previous row
df['delta'] = df['Start'] - df['End'].shift()
# Start a new group whenever the gap exceeds the threshold; cumsum labels the groups
df['group'] = (df['delta'] > threshold).cumsum()
groups = df.groupby('group')
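If the goal is to actually merge the events within each group, a hedged follow-up sketch (the choice of aggregations is an assumption, not from the question):
merged = groups.agg({'Start': 'min', 'End': 'max', 'Value': 'sum'})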
I assume you are trying to aggregate based on the time difference.
marker = 60
# Flag rows whose own duration (End - Start) is at most `marker` seconds
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0 True

Pandas GroupBy Date Chunks

I am trying to group a Pandas Dataframe into buckets of 2 days. For example, if I do the below:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03', '2017-01-04', '2017-01-04', '2017-01-05', '2017-01-06']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf', 'dfe', 'dsd', 'erw', 'fds']
df['number_of_apples'] = [1,2,3,4,5,6,2]
df = df.groupby('action_date').sum()
I get a dataframe grouped by action_date with number_of_apples per day.
However, if I wanted to look at the dataframe in chunks of 2 days, how could I do so? I would then like to analyze the number_of_apples per date chunk, either by making new dataframes for the dates 2017-01-01 & 2017-01-03, another for 2017-01-04 & 2017-01-05, and one last one for 2017-01-06, or just by regrouping and working within.
EDIT: I ultimately would like to make lists of users based on the number of apples they have for each day chunk, so I do not want the sum or mean of each chunk's apples. Sorry for the confusion!
Thank you in advance!
You can use resample:
print (df.resample('2D', on='action_date')['number_of_apples'].sum().reset_index())
action_date number_of_apples
0 2017-01-01 3
1 2017-01-03 12
2 2017-01-05 8
EDIT:
print (df.resample('2D', on='action_date')['user_name'].apply(list).reset_index())
action_date user_name
0 2017-01-01 [abc, wdt]
1 2017-01-03 [sdf, dfe, dsd]
2 2017-01-05 [erw, fds]
Try using a TimeGrouper to group by two days.
>>> df.index = df.action_date
>>> dg = df.groupby(pd.TimeGrouper(freq='2D'))['user_name'].apply(list)  # 2 day frequency
>>> dg.head()
action_date
2017-01-01 [abc, wdt]
2017-01-03 [sdf, dfe, dsd]
2017-01-05 [erw, fds]
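In newer pandas versions pd.TimeGrouper is deprecated in favour of pd.Grouper; a sketch of the equivalent call on the original df (with action_date still a column, not the index):
dg = df.groupby(pd.Grouper(key='action_date', freq='2D'))['user_name'].apply(list)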

Compare two dataframes and delete not same dates

I have two dataframes and want to compare them and delete the days in df2 which are not present in df1. I tried to use:
df2[~df2.Date.isin(df1.Date)]
but this does not work and I get an empty dataframe. df2 should end up looking like df1. The dataframes look like the following:
df1
Date
0 20-12-16
1 21-12-16
2 22-12-16
3 23-12-16
4 27-12-16
5 28-12-16
6 29-12-16
7 30-12-16
8 02-01-17
9 03-01-17
10 04-01-17
11 05-01-17
12 06-01-17
df2
Date
0 20-12-16
1 21-12-16
2 22-12-16
3 23-12-16
4 24-12-16
5 25-12-16
6 26-12-16
7 27-12-16
8 28-12-16
9 29-12-16
10 30-12-16
11 31-12-16
12 01-01-17
13 02-01-17
14 03-01-17
15 04-01-17
16 05-01-17
17 06-01-17
It seems the dtypes are different. For comparison they need to be the same.
Check it by:
print (df1.Date.dtype)
print (df2.Date.dtype)
and then convert if necessary:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
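Once both columns are datetime64, isin compares actual dates; a quick check (note the filter should not be negated, see the last answer below):
print (df2[df2.Date.isin(df1.Date)])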
I will add another two solutions: the first with numpy.in1d and the second with merge, which performs an inner join by default:
import numpy as np

df = df2[np.in1d(df2.Date, df1.Date)]
print (df)
Date
0 2016-12-20
1 2016-12-21
2 2016-12-22
3 2016-12-23
7 2016-12-27
8 2016-12-28
9 2016-12-29
10 2016-12-30
13 2017-01-02
14 2017-01-03
15 2017-01-04
16 2017-01-05
17 2017-01-06
df = df1.merge(df2, on='Date')
print (df)
Date
0 2016-12-20
1 2016-12-21
2 2016-12-22
3 2016-12-23
4 2016-12-27
5 2016-12-28
6 2016-12-29
7 2016-12-30
8 2017-01-02
9 2017-01-03
10 2017-01-04
11 2017-01-05
12 2017-01-06
Sample:
d1 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
d2 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '24-12-16', '25-12-16', '26-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '31-12-16', '01-01-17', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
print (df1.Date.dtype)
object
print (df2.Date.dtype)
object
df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')
Your mistake is one of logic. You want to select the df2 dates that are in df1, so you should write
df2[df2.Date.isin(df1.Date)]
not its negation, which keeps the rows whose Date is not in df1.
You could also obtain the same result with set arithmetic (using df1 and df2 from above):
set(df2.Date) - (set(df2.Date) - set(df1.Date))
which can then be turned back into a dataframe with:
pd.DataFrame(sorted(set(df2.Date) - (set(df2.Date) - set(df1.Date))), columns=["Date"])
The sorting is not optimal, though, and can be handled better on the pandas side. If you need plain dates rather than Timestamps, convert afterwards:
df = pd.DataFrame(list(set(df2.Date) - (set(df2.Date) - set(df1.Date))), columns=["Date"])
df.Date = [date.date() for date in df.Date]
or
df.Date.dt.date
(see How do I convert dates in a Pandas data frame to a 'date' data type?)
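A brief sketch of doing the ordering on the pandas side instead of with sorted() (same df1/df2 as above; sort_values requires pandas 0.17+):
df = pd.DataFrame(list(set(df2.Date) - (set(df2.Date) - set(df1.Date))), columns=["Date"])
df = df.sort_values("Date").reset_index(drop=True)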
