I have a dataframe where I need to group the TX/RX column (val below) into pairs, and then put these into a new dataframe with a new index and the timedelta between them as values.
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = pd.date_range('2018-01-01', periods=6, freq='1H1min')
df['id'] = range(1, 7)
df['val'] = ['A', 'B'] * 3
time1 time2 id val
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A
3 2018-01-01 03:00:00 2018-01-01 03:03:00 4 B
4 2018-01-01 04:00:00 2018-01-01 04:04:00 5 A
5 2018-01-01 05:00:00 2018-01-01 05:05:00 6 B
needs to be...
index timedelta A B
0 1 1 2
1 1 3 4
2 1 5 6
I think that pivot_tables or stack/unstack is probably the best way to go about this, but I'm not entirely sure how...
I believe you need:
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60,60,120,120,180,180], 's')
df['id'] = range(1,7)
df['val'] = ['A','B'] * 3
df['t'] = df['time2'] - df['time1']
print (df)
time1 time2 id val t
0 2018-01-01 00:00:00 2018-01-01 00:01:00 1 A 00:01:00
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B 00:01:00
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A 00:02:00
3 2018-01-01 03:00:00 2018-01-01 03:02:00 4 B 00:02:00
4 2018-01-01 04:00:00 2018-01-01 04:03:00 5 A 00:03:00
5 2018-01-01 05:00:00 2018-01-01 05:03:00 6 B 00:03:00
#if necessary convert to seconds
#df['t'] = (df['time2'] - df['time1']).dt.total_seconds()
df = df.pivot(index='t', columns='val', values='id').reset_index().rename_axis(None, axis=1)
#if necessary aggregate values
#df = (df.pivot_table(index='t',columns='val',values='id', aggfunc='mean')
# .reset_index().rename_axis(None, axis=1))
print (df)
t A B
0 00:01:00 1 2
1 00:02:00 3 4
2 00:03:00 5 6
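If you want the timedelta shown as whole minutes, as in the desired output above, one extra conversion on the pivoted frame does it (a small sketch, not part of the original answer):
df['t'] = df['t'].dt.total_seconds().div(60).astype(int)   # timedelta -> minutes as int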
I have a large time series (more than 5 million rows) whose values fluctuate randomly between 2 and 10. A small section of it can be reproduced with the sample-data snippet further down. I want to identify a certain pattern in this series:
When the value of pct_change is >= a threshold T, raise a "reading begins" flag.
After "reading begins" has been raised, keep a "reading continues" flag raised while pct_change is non-zero (whether >= T or < T), until a zero is encountered.
When a zero is encountered, raise a "reading stops" flag; if pct_change is < T after that, raise a "not reading" flag.
I want to write a function that can tell me how many times and for what duration this happened.
If we take a threshold T of 4 and use pct_change from the example data, then the output I want is a table of start and end times for each such cycle.
The main goal behind this is to find how many times this cycle is repeating for different thresholds.
To generate sample data:
import pandas as pd
a = [2,3,4,2,0,14,5,6,3,2,0,4,5,7,8,10,4,0,5,6,7,10,7,6,4,2,0,1,2,5,6]
idx = pd.date_range("2018-01-01", periods=len(a), freq="H")
ts = pd.Series(a, index=idx)
dd = pd.DataFrame()
dd['pct_change'] = ts
dd.head()
Can you please suggest an efficient way of doing this?
First, keep only interesting data (>= T | == 0):
threshold = 4
df = dd.loc[dd["pct_change"].ge(threshold) | dd["pct_change"].eq(0)]
>>> df
pct_change
2018-01-01 02:00:00 4 # group 0, end=2018-01-01 04:00:00
2018-01-01 04:00:00 0
2018-01-01 05:00:00 14 # group 1, end=2018-01-01 10:00:00
2018-01-01 06:00:00 5
2018-01-01 07:00:00 6
2018-01-01 10:00:00 0
2018-01-01 11:00:00 4 # group 2, end=2018-01-01 17:00:00
2018-01-01 12:00:00 5
2018-01-01 13:00:00 7
2018-01-01 14:00:00 8
2018-01-01 15:00:00 10
2018-01-01 16:00:00 4
2018-01-01 17:00:00 0
2018-01-01 18:00:00 5 # group 3, end=2018-01-02 02:00:00
2018-01-01 19:00:00 6
2018-01-01 20:00:00 7
2018-01-01 21:00:00 10
2018-01-01 22:00:00 7
2018-01-01 23:00:00 6
2018-01-02 00:00:00 4
2018-01-02 02:00:00 0
2018-01-02 05:00:00 5 # group 4, end=2018-01-02 06:00:00
2018-01-02 06:00:00 6
Then, create the desired groups (a new group starts right after each zero):
groups = df["pct_change"].eq(0).shift(fill_value=0).cumsum()
>>> groups
2018-01-01 02:00:00 0 # group 0
2018-01-01 04:00:00 0
2018-01-01 05:00:00 1 # group 1
2018-01-01 06:00:00 1
2018-01-01 07:00:00 1
2018-01-01 10:00:00 1
2018-01-01 11:00:00 2 # group 2
2018-01-01 12:00:00 2
2018-01-01 13:00:00 2
2018-01-01 14:00:00 2
2018-01-01 15:00:00 2
2018-01-01 16:00:00 2
2018-01-01 17:00:00 2
2018-01-01 18:00:00 3 # group 3
2018-01-01 19:00:00 3
2018-01-01 20:00:00 3
2018-01-01 21:00:00 3
2018-01-01 22:00:00 3
2018-01-01 23:00:00 3
2018-01-02 00:00:00 3
2018-01-02 02:00:00 3
2018-01-02 05:00:00 4 # group 4
2018-01-02 06:00:00 4
Name: pct_change, dtype: object
Finally, use the groups to build the result:
out = pd.DataFrame(df.groupby(groups) \
.apply(lambda x: (x.index[0], x.index[-1])) \
.tolist(), columns=["StartTime", "EndTime"])
>>> out
StartTime EndTime
0 2018-01-01 02:00:00 2018-01-01 04:00:00 # group 0
1 2018-01-01 05:00:00 2018-01-01 10:00:00 # group 1
2 2018-01-01 11:00:00 2018-01-01 17:00:00 # group 2
3 2018-01-01 18:00:00 2018-01-02 02:00:00 # group 3
4 2018-01-02 05:00:00 2018-01-02 06:00:00 # group 4
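The question also asks how many times and for what duration this happens; both can be read straight off out (a small sketch on top of the result above):
out["Duration"] = out["EndTime"] - out["StartTime"]
n_cycles = len(out)   # how many times the cycle repeats for this threshold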
Bonus
There are some cases where you have to remove groups:
The first pct_change value is 0
Two or more consecutive pct_change values are 0
In both cases the group contains only its terminating zero, so StartTime equals EndTime. To remove them:
out = out[~out["StartTime"].eq(out["EndTime"])]
I have the following DataFrame:
df_h00 = df.copy()
tt = df_h00.set_index('username').post_time_data.str.extractall(r'totalCount\":([^,}]*)')
tt['index'] = tt.index
tt[['user', 'hour']] = pd.DataFrame(tt['index'].values.tolist(),
                                    index=tt.index)
tt = tt.drop(['index'], axis=1)
tt.columns = ['totalCount', 'user', 'hours']
tt.head()
totalCount user hours
username match
lowi 0 15 lowi 0
1 11 lowi 1
2 2 lowi 2
3 0 lowi 3
4 0 lowi 4
I want to convert the column tt['hours'] which is non-null int64 to date time with format "%H:%M".
I've tried the following code:
tthour = tt['hours']
tthour = pd.to_datetime(tthour, format='%H', errors='coerce')
tthour = tthour.to_frame()
tthour.head()
hours
username match
lowi 0 1900-01-01 00:00:00
1 1900-01-01 01:00:00
2 1900-01-01 02:00:00
3 1900-01-01 03:00:00
4 1900-01-01 04:00:00
However, I only want "%H:%M". So the expected output would be like this:
hours
username match
lowi 0 00:00
1 01:00
2 02:00
3 03:00
4 04:00
Datetimes in your expected format do not exist in Python.
The closest to what you need are timedeltas, via to_timedelta with Series.str.zfill, or plain strings:
import numpy as np
import pandas as pd

tt = pd.DataFrame({'hours': np.arange(5)})
tt['td'] = pd.to_timedelta(tt['hours'].astype(str).str.zfill(2) + ':00:00', errors='coerce')
tt['str'] = tt['hours'].astype(str).str.zfill(2) + ':00'
print (tt)
hours td str
0 0 00:00:00 00:00
1 1 01:00:00 01:00
2 2 02:00:00 02:00
3 3 03:00:00 03:00
4 4 04:00:00 04:00
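If you would rather go through real datetimes and only display %H:%M, formatting back to strings also works (a sketch; hhmm is just an illustrative column name):
tt['hhmm'] = pd.to_datetime(tt['hours'].astype(str), format='%H').dt.strftime('%H:%M')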
I have a data frame with a datetime column every 10 minutes and a numerical value:
df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})
And another with a schedule of events, with a start time and end time. There can be multiple events happening at the same time:
df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)
I want to do a left join on df1, with all events that fall inside the start and end times. My output table should be:
time value event
0 2018-01-01 00:00:00 5 A
1 2018-01-01 00:00:00 5 B
2 2018-01-01 00:10:00 15 A
3 2018-01-01 00:10:00 15 B
4 2018-01-01 00:20:00 16 A
5 2018-01-01 00:20:00 16 B
.....
17 2018-01-01 02:50:00 7 F
I attempted these SO solutions, but they fail because of duplicate time intervals.
Setup (Only using a few entries from df1 for brevity):
df1 = pd.DataFrame({'time' : pd.date_range('1/1/2018', periods=20, freq='10min'), 'value' : np.random.randint(2, 20, size=20)})
df2 = pd.DataFrame({'start_time' : ['2018-01-01 00:00:00', '2018-01-01 00:00:00','2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00' ], 'end_time' : ['2018-01-01 01:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00','2018-01-01 02:00:00', '2018-01-01 02:00:00', '2018-01-01 03:00:00'], 'event' : ['A', 'B', 'C', 'D', 'E', 'F'] })
df1 = df1.sample(5)
df2[['start_time', 'end_time']] = df2.iloc[:,0:2].apply(pd.to_datetime)
You can use a couple of straightforward list comprehensions to achieve your result. This answer assumes that all date columns are, in fact, of type datetime in your DataFrame:
Step 1
Find all events that occur within a particular time range using a list comprehension and simple interval checking:
packed = list(zip(df2.start_time, df2.end_time, df2.event))
df1['event'] = [[ev for strt, end, ev in packed if strt <= el <= end] for el in df1.time]
time value event
2 2018-01-01 00:20:00 8 [A, B]
14 2018-01-01 02:20:00 14 [F]
8 2018-01-01 01:20:00 6 [C, D, E]
19 2018-01-01 03:10:00 16 []
4 2018-01-01 00:40:00 7 [A, B]
Step 2:
Finally, explode each list from the last result to a new row using another list comprehension:
pd.DataFrame(
[[t, val, e] for t, val, event in zip(df1.time, df1.value, df1.event)
for e in event
], columns=df1.columns
)
Output:
time value event
0 2018-01-01 00:20:00 8 A
1 2018-01-01 00:20:00 8 B
2 2018-01-01 02:20:00 14 F
3 2018-01-01 01:20:00 6 C
4 2018-01-01 01:20:00 6 D
5 2018-01-01 01:20:00 6 E
6 2018-01-01 00:40:00 7 A
7 2018-01-01 00:40:00 7 B
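On pandas 0.25+, DataFrame.explode can replace the second comprehension (a sketch reusing the list column from Step 1; rows with empty lists become NaN and are dropped):
df1.explode('event').dropna(subset=['event'])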
I'm not entirely sure of your question, but if you are trying to join on "events that fall inside the start and end times," then it sounds like you need something akin to a "between" operator from SQL. Your data doesn't make it particularly clear.
Pandas doesn't have this natively, but pandasql does. It allows you to run SQLite against your dataframe. I think something like this is what you need:
import pandasql as ps
sqlcode = '''
select *
from df1
inner join df2
on df1.time >= df2.start_time and df1.time <= df2.end_time
'''
newdf = ps.sqldf(sqlcode, locals())
Relevant Question:
Merge pandas dataframes where one value is between two others
One option is with the conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
out = df1.conditional_join(
df2,
('time', 'start_time', '>='),
('time', 'end_time', '<=')
)
out.head()
time value start_time end_time event
0 2018-01-01 00:00:00 14 2018-01-01 2018-01-01 01:00:00 A
1 2018-01-01 00:00:00 14 2018-01-01 2018-01-01 01:00:00 B
2 2018-01-01 00:10:00 10 2018-01-01 2018-01-01 01:00:00 A
3 2018-01-01 00:10:00 10 2018-01-01 2018-01-01 01:00:00 B
4 2018-01-01 00:20:00 15 2018-01-01 2018-01-01 01:00:00 A
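To match the three-column output from the question, the interval bounds can be dropped afterwards (a small follow-up sketch):
out = out.drop(columns=['start_time', 'end_time'])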
You can work on df2 to create a column with every timestamp, resampling at '10min' (like df1) for each event, and then use merge. It's a lot of manipulation, so probably not the most efficient.
df2_manip = (df2.set_index('event').stack().reset_index().set_index(0)
.groupby('event').resample('10T').ffill().reset_index(1))
and df2_manip looks like:
0 event level_1
event
A 2018-01-01 00:00:00 A start_time
A 2018-01-01 00:10:00 A start_time
A 2018-01-01 00:20:00 A start_time
A 2018-01-01 00:30:00 A start_time
A 2018-01-01 00:40:00 A start_time
A 2018-01-01 00:50:00 A start_time
A 2018-01-01 01:00:00 A end_time
B 2018-01-01 00:00:00 B start_time
B 2018-01-01 00:10:00 B start_time
B 2018-01-01 00:20:00 B start_time
B 2018-01-01 00:30:00 B start_time
...
Now you can merge:
df1 = df1.merge(df2_manip[[0, 'event']].rename(columns={0:'time'}))
and you get df1:
time value event
0 2018-01-01 00:00:00 9 A
1 2018-01-01 00:00:00 9 B
2 2018-01-01 00:10:00 16 A
3 2018-01-01 00:10:00 16 B
...
33 2018-01-01 02:00:00 6 D
34 2018-01-01 02:00:00 6 E
35 2018-01-01 02:00:00 6 F
36 2018-01-01 02:10:00 2 F
37 2018-01-01 02:20:00 18 F
38 2018-01-01 02:30:00 14 F
39 2018-01-01 02:40:00 5 F
40 2018-01-01 02:50:00 3 F
41 2018-01-01 03:00:00 9 F
I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1st Jan 2000. The query below gives me the correct result, but is there a way it can be done just by passing "2000-01-01"?
result = df[(df['date1'] >= '2000-01-01 00:00') & (df['date1'] <= '2000-01-01 23:59')]
Use partial string indexing, but you need a DatetimeIndex first (on recent pandas versions, select via .loc):
df = df.set_index('date1').loc['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import io
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
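A vectorized variant that avoids the string conversion entirely is comparing against normalized timestamps (a sketch, not from the original answers):
df[df.date1.dt.normalize() == pd.Timestamp('2000-01-01')]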
I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
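Note that on pandas 1.4+ the closed= keyword of date_range was renamed; assuming a recent version, the same index is built with (a sketch):
idx = pd.date_range(df.date.min(), df.date.max() + pd.Timedelta('1H'),
                    freq='15min', inclusive='left')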