More efficient alternative to a nested for loop - Python

I have two dataframes that contain data collected at two different frequencies.
I want to update the label of df2 to that of df1 whenever a df2 timestamp falls within the duration of an event in df1.
I created a nested for loop to do it, but it takes a rather long time.
Here is the code I used:
for i in np.arange(len(df1) - 1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j, "label"] = df1.loc[i, "label"]
Is there a more efficient way of doing this?
df1 shape: (367, 4)
df2 shape: (342423, 9)
Short example data:
import numpy as np
import pandas as pd
data1 = {'timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9],
         'duration': [0.5, 0.3, 0.8, 0.2, 0.4, 0.5, 0.3, 0.7, 0.5],
         'label': ['inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh']}
df1 = pd.DataFrame(data1, columns=['timestamp', 'duration', 'label'])
data2 = {'timestamp': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5],
         'label': ['plc'] * 18}
df2 = pd.DataFrame(data2, columns=['timestamp', 'label'])

I would first use merge_asof to pick, for each row of df2, the highest timestamp from df1 that is not above the timestamp from df2. Next, a simple (vectorized) comparison of df2.timestamp against df1.timestamp + df1.duration is enough to select the matching rows.
Code could be:
df1['t2'] = df1['timestamp'].astype('float64')  # the join columns must have the same dtype
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
# keep the matched label only where the df2 timestamp still falls inside the event
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y
It gives for df2:
timestamp label
0 1.0 inh
1 1.5 inh
2 2.0 exh
3 2.5 plc
4 3.0 inh
5 3.5 inh
6 4.0 exh
7 4.5 plc
8 5.0 inh
9 5.5 plc
10 6.0 exh
11 6.5 exh
12 7.0 inh
13 7.5 plc
14 8.0 exh
15 8.5 exh
16 9.0 inh
17 9.5 inh
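If the events in df1 are sorted by timestamp and never overlap, a plain NumPy sketch with searchsorted gives a similar result without the merge; the column names are taken from the example above, and the inclusive bounds are an assumption you may want to adjust to match the strict comparisons in the original loop:
import numpy as np

# Sketch only: assumes df1 is sorted by timestamp and its events do not overlap.
starts = df1['timestamp'].to_numpy(dtype=float)
ends = starts + df1['duration'].to_numpy(dtype=float)
ts = df2['timestamp'].to_numpy(dtype=float)

# index of the last event that starts at or before each df2 timestamp
pos = np.searchsorted(starts, ts, side='right') - 1
inside = (pos >= 0) & (ts <= ends[np.clip(pos, 0, None)])

df2.loc[inside, 'label'] = df1['label'].to_numpy()[pos[inside]]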


Merging 2 dataframes and sorting by datetime Pandas Python

I want to produce code that creates an additional table from the dataframe data. The new dataframe data2 will have the following changes:
label will be New instead of Old
col1's last index will be deleted
col2's first index will be deleted
date's first index will be deleted and all date values will have 1 minute subtracted
Then I want to concatenate the two data frames into one data frame called merge and sort it by date. Since the first index of data2 is dropped, the order of merge should alternate label: New, Old, New, Old. How can I subtract 1 minute from date_mod and merge the two data frames in order of dates?
import pandas as pd
d = {'col1': [4, 5, 2, 2, 3, 5, 1, 1, 6], 'col2': [6, 2, 1, 7, 3, 5, 3, 3, 9],
'label':['Old','Old','Old','Old','Old','Old','Old','Old','Old'],
'date': ['2022-01-24 10:07:02', '2022-01-27 01:55:03', '2022-01-30 19:09:03', '2022-02-02 14:34:06',
'2022-02-08 12:37:03', '2022-02-10 03:07:02', '2022-02-10 14:02:03', '2022-02-11 00:32:25',
'2022-02-12 21:42:03']}
data = pd.DataFrame(d)
'''
Additional Dataframe
label will have New
'col1'`s last index will be deleted
'col2'`s first index will be deleted
'date' will be first index will be deleted and all date values will be subtracted by 1 minute
'''
a = data['col1'].drop(data['col1'].index[-1])
b = data['col2'].drop(data['col2'].index[0])
# subtract the date_mod by 1 minute
date_mod = pd.to_datetime(data['date'][1:])
data2 = pd.DataFrame({'col1': a, 'col2': b,
                      'label': ['New'] * 8,
                      'date': date_mod})
'''
Merging data and data2
Sort by 'date'
Should go in order as Old, New, Old, New ...
The length of the columns are 1 less than of data bc of the dropped indexes
'''
merge=pd.merge(data,displayer)
The simplest way I can think of: place all adjustments into a function and apply it to a copy of the original dataframe, then simply concat and sort:
data.date = pd.to_datetime(data.date)  # convert the date strings to datetime so 1 minute can be subtracted later

def adjust_data(df):
    df['col1'] = df['col1'].drop(df['col1'].index[-1])
    df['col2'] = df['col2'].drop(df['col2'].index[0])
    df.date = df.date - pd.Timedelta(minutes=1)  # subtract 1 minute from the datetime
    df.label = df.label.replace('Old', 'New')    # change the values in the "label" column

data2 = data.copy()
adjust_data(data2)  # apply the function to data2

# concat both dataframes and sort by the "date" column
merge = pd.concat([data, data2], axis=0).sort_values(by=['date']).reset_index(drop=True)
print(merge)
out:
col1 col2 label date
0 4.0 NaN New 2022-01-24 10:06:02
1 4.0 6.0 Old 2022-01-24 10:07:02
2 5.0 2.0 New 2022-01-27 01:54:03
3 5.0 2.0 Old 2022-01-27 01:55:03
4 2.0 1.0 New 2022-01-30 19:08:03
5 2.0 1.0 Old 2022-01-30 19:09:03
6 2.0 7.0 New 2022-02-02 14:33:06
7 2.0 7.0 Old 2022-02-02 14:34:06
8 3.0 3.0 New 2022-02-08 12:36:03
9 3.0 3.0 Old 2022-02-08 12:37:03
10 5.0 5.0 New 2022-02-10 03:06:02
11 5.0 5.0 Old 2022-02-10 03:07:02
12 1.0 3.0 New 2022-02-10 14:01:03
13 1.0 3.0 Old 2022-02-10 14:02:03
14 1.0 3.0 New 2022-02-11 00:31:25
15 1.0 3.0 Old 2022-02-11 00:32:25
16 NaN 9.0 New 2022-02-12 21:41:03
17 6.0 9.0 Old 2022-02-12 21:42:03

Pandas combine two dataseries into one series

I need to combine the data series rateScore and rate into one.
This is the current DataFrame I have
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from the second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us do it with add:
df['rateScore'] = df['rateScore'].add(df['rate'], fill_value=0)
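Another option, sketched here, is to fill the NaNs in rateScore directly from rate (keeping rateScore where both happen to be present) and then drop rate:
df['rateScore'] = df['rateScore'].fillna(df['rate'])
df = df.drop(columns='rate')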

How to manipulate pandas dataframe with multiple IF statements/ conditions?

I'm relatively new to Python and pandas and am trying to work out how to write an IF statement (or any other statement) that, once it initially returns a value, continues with other IF statements within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?
import pandas as pd
data = {'Yrs': [ '2018','2019', '2020', '2021', '2022'], 'Val': [1.50, 1.75, 2.0, 2.25, 2.5] }
data2 = {'F': ['2015', '2018', '2020'], 'L': ['2019', '2022', '2024'], 'Base': ['2', '5', '5'],
         'O': [20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()
# use this code to get the first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0, 'Yrs']), '2018'] = \
    (1 / pd.to_numeric(df2['Base'])) * pd.to_numeric(df2['S']) * \
    pd.to_numeric(df.at[0, 'Val']) + pd.to_numeric(df2['O'])
# use this code to get the rest of the values until L = Yrs
r.loc[(df2['L'] <= df.at[1, 'Yrs']) & (df2['L'] >= df.at[1, 'Yrs']), '2019'] = \
    (pd.to_numeric(r['2018']) - pd.to_numeric(df2['O'])) * \
    pd.to_numeric(df.at[1, 'Val'] / pd.to_numeric(df.at[0, 'Val'])) + \
    pd.to_numeric(df2['O'])
r
I expect the output to be (the values may be different, but it's the pattern I want):
2018 2019 2020 2021 2022
0 7.75 8.375 NaN NaN NaN
1 11.0 11.5 12 12.5 13.0
2 NaN NaN 18 18.75 19.25
but I get:
2018 2019 2020 2021 2022
0 7.75 8.375 9.0 9.625 10.25
1 11.0 11.5 12 NaN NaN
2 16.50 17.25 18 NaN NaN

How to perform functions that reference previous row on subset of data in a dataframe using groupby

I have some log data that records an item (id) and the timestamp when an action was started, and I want to determine the time between actions on each item.
For example, I have some data that looks like this:
data = [{"timestamp":"2019-05-21T14:17:29.265Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T14:21:49.722Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.695Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-21T15:16:25.696Z","id":"ff9dad92-e7c1-47a5-93a7-6e49533a6e25"},{"timestamp":"2019-05-22T07:51:17.49Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T08:11:13.948Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:52:59.897Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.406Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-22T11:53:03.481Z","id":"ff12891e-5786-438b-891c-abd4244723b4"},{"timestamp":"2019-05-21T14:23:08.147Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:29:18.228Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.831Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T15:17:09.834Z","id":"fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa"},{"timestamp":"2019-05-21T14:02:19.072Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:02:34.867Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T14:12:28.877Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.567Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T15:19:19.582Z","id":"fd3554cd-b83d-49af-a8e6-7bf41c741cd0"},{"timestamp":"2019-05-21T09:58:02.185Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"},{"timestamp":"2019-05-21T10:07:24.044Z","id":"f89c2e3e-06dc-467b-813b-dc92f2692f63"}]
stack = pd.DataFrame(data)
stack.head()
I have tried getting all the unique ids to split the data frame, then getting the time taken together with the index to recombine with the original set, but the function is extremely slow on large datasets and messes up both the index and the timestamp order, so the results end up mismatched.
import ciso8601 as time

records = []
for i in list(stack.id.unique()):
    dff = stack[stack.id == i]
    time_taken = []
    times = []
    i = 0
    for _, row in dff.iterrows():
        if bool(times):
            print(_)
            current_time = time.parse_datetime(row.timestamp)
            prev_time = times[i]
            time_taken = current_time - prev_time
            times.append(current_time)
            i += 1
            records.append(dict(index=_, time_taken=time_taken.seconds))
        else:
            records.append(dict(index=_, time_taken=0))
            times.append(time.parse_datetime(row.timestamp))
x = pd.DataFrame(records).set_index('index')
stack.merge(x, left_index=True, right_index=True, how='inner')
Is there a neat pandas groupby and apply way of doing this, so that I don't have to split the frame and store it in memory, and so that I can reference the previous row in each subset?
Thanks
You can use GroupBy.diff:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = (stack.sort_values(['id', 'timestamp'])
                            .groupby('id')
                            .diff()['timestamp']
                            .dt.total_seconds()
                            .round()
                            .fillna(0))
print(stack['time_taken'])
0 0.0
1 260.0
2 3276.0
3 0.0
4 0.0
5 1196.0
6 13306.0
7 4.0
8 0.0
9 0.0
10 370.0
11 2872.0
...
If you want the resulting dataframe to be ordered by date, instead do:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id','timestamp'])
stack['time_taken'] = (stack.groupby('id')
.diff()['timestamp']
.dt.total_seconds()
.round()
.fillna(0))
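A slightly leaner variant, assuming a reasonably recent pandas version, selects the column before calling diff so that only timestamp is diffed:
stack['timestamp'] = pd.to_datetime(stack['timestamp'])
stack = stack.sort_values(['id', 'timestamp'])
stack['time_taken'] = (stack.groupby('id')['timestamp']
                            .diff()
                            .dt.total_seconds()
                            .round()
                            .fillna(0))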
If you don't need to replace timestamp with datetimes, create a Series of datetimes with to_datetime and pass it to SeriesGroupBy.diff, then convert to seconds with Series.dt.total_seconds, round with Series.round if necessary, and replace missing values with 0:
t = pd.to_datetime(stack['timestamp'])
stack['time_taken'] = t.groupby(stack['id']).diff().dt.total_seconds().round().fillna(0)
print (stack)
id timestamp time_taken
0 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:17:29.265Z 0.0
1 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T14:21:49.722Z 260.0
2 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.695Z 3276.0
3 ff9dad92-e7c1-47a5-93a7-6e49533a6e25 2019-05-21T15:16:25.696Z 0.0
4 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T07:51:17.49Z 0.0
5 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T08:11:13.948Z 1196.0
6 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:52:59.897Z 13306.0
7 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.406Z 4.0
8 ff12891e-5786-438b-891c-abd4244723b4 2019-05-22T11:53:03.481Z 0.0
9 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:23:08.147Z 0.0
10 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T14:29:18.228Z 370.0
11 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.831Z 2872.0
12 fe55bb22-fe5b-4b12-8aaf-d5f0320ac7fa 2019-05-21T15:17:09.834Z 0.0
13 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:19.072Z 0.0
14 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:02:34.867Z 16.0
15 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T14:12:28.877Z 594.0
16 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.567Z 4011.0
17 fd3554cd-b83d-49af-a8e6-7bf41c741cd0 2019-05-21T15:19:19.582Z 0.0
18 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T09:58:02.185Z 0.0
19 f89c2e3e-06dc-467b-813b-dc92f2692f63 2019-05-21T10:07:24.044Z 562.0
Or, if you do need to replace timestamp with datetimes, use @yatu's answer.

python pandas: get rolling value of one DataFrame by rolling index of another DataFrame

I have two dataframes: one has multiple levels of columns, and the other has only a single column level (which is the first level of the first dataframe; put differently, the second dataframe is calculated by grouping the first dataframe).
These two dataframes look like the following:
[image: first dataframe, df1]
[image: second dataframe, df2]
The relationship between df1 and df2 is:
df2 = df1.groupby(axis=1, level='sector').mean()
Then, I get the index of rolling_max of df1 by:
result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)
Let me explain result1 a little. For example, during the five days (the window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened on 2016/2/24, and the index of 2016/2/24 within the five-day range is 1. So, in result1, the value for stock sh600870 on 2016/2/29 is 1.
Now, I want to get the sector price for each stock using the index in result1.
Let's take the same stock as an example: the stock sh600870 is in the sector ’家用电器视听器材白色家电‘. So for 2016/2/29, I want to get the sector price on 2016/2/24, which is 8.770.
How can I do that?
idxmax (or np.argmax) returns an index which is relative to the rolling
window. To make the index relative to df1, add the index of the left edge of
the rolling window:
index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)
Once you have ordinal indices relative to df1, you can use them to index
into df1 or df2 using .iloc.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()
window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)
result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values
print(result)
yields
sector foo bar
stock A B A B
2016-02-01 NaN NaN NaN NaN
2016-02-02 NaN NaN NaN NaN
2016-02-03 NaN NaN NaN NaN
2016-02-04 5.5 5 5 7.5
2016-02-05 5.5 5 5 8.5
2016-02-06 5.5 6.5 5 8.5
2016-02-07 5.5 6.5 5 8.5
2016-02-08 6.5 6.5 5 8.5
2016-02-09 6.5 6.5 6.5 8.5
2016-02-10 6.5 6.5 6.5 6
2016-02-11 6 6.5 4.5 6
2016-02-12 6 6.5 4.5 4
2016-02-13 2 6.5 4.5 5
2016-02-14 4 6.5 4.5 5
2016-02-15 4 6.5 4 3.5
Note that in pandas 0.18 the rolling_apply syntax was changed: DataFrames and Series now have a rolling method, so you would now use:
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
           .rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
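On current pandas versions the module-level rolling_* functions are gone entirely, and Rolling.apply is usually called with raw=True so the function receives plain NumPy arrays; a sketch of the same computation, assuming a recent release:
# raw=True passes NumPy arrays to np.argmax, matching the old rolling_apply behaviour
index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax, raw=True)
shift = (pd.Series(np.arange(len(df1)), index=df1.index)
           .rolling(window=window_size, min_periods=min_periods)
           .min())
index = index.add(shift, axis=0)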
