How to add calculated rows below each row in a pandas DataFrame - python

I have a dataframe_1 as such:
Index  Time       Label
0      0.000 ns   Segment 1
1      2.749 sec  baseline
2      3.459 min  begin test
3      7.009 min  end of test
And I would like to add multiple new rows in between each of dataframe_1's rows, where the Time column for each new row would add an additional minute until reaching dataframe_1's next row's time (and corresponding Label). For example, the above table should ultimately look like this:
Index  Time             Label
0      0.000 ns         Segment 1
1      2.749 sec        baseline
2      00:01:02.749000  baseline + 1min
3      00:02:02.749000  baseline + 2min
4      00:03:02.749000  baseline + 3min
5      3.459 min        begin test
6      00:04:27.540000  begin test + 1min
7      00:05:27.540000  begin test + 2min
8      00:06:27.540000  begin test + 3min
9      7.009 min        end of test
Using Timedelta type via pd.to_timedelta() is perfectly fine.
I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, then add rows for each added minute, and then concatenate the dataframes back together. However, I am unsure of how to accomplish this.
Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?
I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:
baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)
while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
    new_row = baseline_row.copy()
    new_row['Label'] = f'minute {interval_mins}'
    new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
    new_row.index = [baseline_index + interval_mins - 0.5]
    df_legend = df_legend.append(new_row, ignore_index=False)
    df_legend = df_legend.sort_index().reset_index(drop=True)
    pdb.set_trace()
    interval_mins += 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting it back together. I'm just not sure what the best way to do that is, especially since iterating over rows in pandas is apparently very slow.
I would really appreciate some guidance.

This might be faster than your solution.
import numpy as np
import pandas as pd

df.Time = pd.to_timedelta(df.Time)
# counts = number of whole minutes until the next row's time,
# i.e. how many extra rows to insert after each original row
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
2 00:03:27.540000 begin test 3
3 00:07:00.540000 end of test 0
Then use iterrows to get your desired output.
new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
1 00:01:02.749000 baseline + 1min 3
1 00:02:02.749000 baseline + 2min 3
1 00:03:02.749000 baseline + 3min 3
2 00:03:27.540000 begin test 3
2 00:04:27.540000 begin test + 1min 3
2 00:05:27.540000 begin test + 2min 3
2 00:06:27.540000 begin test + 3min 3
3 00:07:00.540000 end of test 0
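If you also want the helper counts column dropped and a clean 0..9 index, as in the desired output in the question, a final cleanup step could be (a small addition, not part of the answer above):
# drop the helper column and renumber the rows
new_df = new_df.drop(columns='counts').reset_index(drop=True)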

I assume that you converted the Time column from the "number unit" format to a string
representation of the time, something like:
Time Label
Index
0 00:00:00.000 Segment 1
1 00:00:02.749 baseline
2 00:03:27.540 begin test
3 00:07:00.540 end of test
Then, to get your result:
Compute timNxt - the Time column shifted by 1 position and converted
to datetime:
timNxt = pd.to_datetime(df.Time.shift(-1))
Define the following "replication" function:
def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
        n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
        tbl.extend([[(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                     row.Label + f' + {i}min'] for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
Apply it to each row of your df and concatenate results:
result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
The result is:
Time Label
0 00:00:00.000000 Segment 1
1 00:00:02.749000 baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 00:03:27.540000 begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 00:07:00.540000 end of test
The resulting DataFrame still has the Time column as strings, but at
least the fractional part of the seconds has 6 digits everywhere.
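If you prefer to keep working with Timedelta values (the question says pd.to_timedelta() is fine), a possible follow-up, not part of the answer above, is converting the string column back:
# 'HH:MM:SS.ffffff' strings parse cleanly into Timedelta
result['Time'] = pd.to_timedelta(result['Time'])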

Related

Update each row value with constant value plus previous value using pandas

I have a data frame with 4 columns; the 1st column is the counter, which has values in hexadecimal.
Data
counter frequency resistance phase
0 15000.000000 698.617126 -0.745298
1 16000.000000 647.001708 -0.269421
2 17000.000000 649.572265 -0.097540
3 18000.000000 665.282775 0.008724
4 19000.000000 690.836975 -0.011101
5 20000.000000 698.051025 -0.093241
6 21000.000000 737.854003 -0.182556
7 22000.000000 648.586792 -0.125149
8 23000.000000 643.014160 -0.172503
9 24000.000000 634.954223 -0.126519
a 25000.000000 631.901733 -0.122870
b 26000.000000 629.401123 -0.123728
c 27000.000000 629.442016 -0.156490
Expected output
| counter | sampling frequency | time      |
| ------- | ------------------ | --------- |
| 0       | -                  | t0=0      |
| 1       | 1                  | t1=t0+sf  |
| 2       | 1                  | t2=t1+sf  |
| 3       | 1                  | t3=t2+sf  |
The time column is the new column added to the original data frame. I want to plot time on the x-axis and frequency, resistance, and phase on the y-axis.
Because the value of any row depends on the value of the previous row, you may have to use a for loop for this problem.
For a constant frequency, you can just calculate the time in advance; there is no need to operate on the dataframe:
sampling_freq = 1
df['time'] = [sampling_freq * i for i in range(len(df))]
If you need to operate on the dataframe (say, the frequency may change at some point), you can use this suggestion to address each cell by row number and column name. The syntax would be a lot simpler using numbers for both row and column, but I prefer to refer to 'time' instead of 2.
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + df.iloc[i, df.columns.get_loc('sampling frequency')]
Or, alternatively, resetting the index so you can iterate through consecutive numbers:
df['time'] = np.zeros(len(df))
df = df.reset_index()
for i in range(1, len(df)):
    df.loc[i, 'time'] = df.loc[i-1, 'time'] + df.loc[i, 'sampling frequency']
df = df.set_index('counter')
Note that, because your sampling frequency is likely constant in the whole experiment, you could simplify it like:
sampling_freq = 1
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + sampling_freq
But it's not going to be better than just creating the time series as in the first example.
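Since each time value is just the previous value plus the current sampling frequency, a vectorized alternative is a cumulative sum. This is a sketch assuming the 'sampling frequency' column already exists and that its first entry is 0 or NaN:
# cumulative sum reproduces t_i = t_{i-1} + sf_i without an explicit loop
df['time'] = df['sampling frequency'].fillna(0).cumsum()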

How to discretize a datetime column?

I have a dataset that contains a column of datetimes for a month, and I need to divide it into two blocks (day and night, or am/pm) and then discretize the time in each block into 10-minute bins. I could add another column of 0 and 1 to show whether it is am or pm, but I cannot discretize it! Can you please help me with it?
df['started_at'] = pd.to_datetime(df['started_at'])
df['start hour'] = df['started_at'].dt.hour.astype('int')
df['mor/aft'] = np.where(df['start hour'] < 12, 1, 0)
df['started_at']
0 16:05:36
2 06:22:40
3 16:08:10
4 12:28:57
6 15:47:30
...
3084526 15:24:24
3084527 16:33:07
3084532 14:08:12
3084535 09:43:46
3084536 17:02:26
If I understood correctly, you are trying to add a column for every ten-minute interval to indicate whether an observation falls in that interval.
You can use lambda expressions to loop through each observation of the series.
Dividing the minutes by 10 and truncating to an integer gives the tens digit of the minutes, based on which you can add the indicator columns.
I also included how to extract the day indicator column with a lambda expression for you to compare; it plays the same role as your np.where() (here 1 flags the afternoon).
import pandas as pd
from datetime import datetime

# make dataframe
df = pd.DataFrame({
    'started_at': ['14:20:56',
                   '00:13:24',
                   '16:01:33']
})

# convert column to datetime
df['started_at'] = pd.to_datetime(df['started_at'])

# make day indicator column
df['day'] = df['started_at'].apply(lambda ts: 1 if ts.hour > 12 else 0)

# make indicator column for every ten minutes
for i in range(24):
    for j in range(6):
        col = 'hour_' + str(i) + '_min_' + str(j) + '0'
        df[col] = df['started_at'].apply(lambda ts: 1 if int(ts.minute/10) == j and ts.hour == i else 0)

print(df)
Output first columns:
started_at day hour_0_min_00 hour_0_min_10 hour_0_min_20
0 2021-11-21 14:20:56 1 0 0 0
1 2021-11-21 00:13:24 0 0 1 0
2 2021-11-21 16:01:33 1 0 0 0
...
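If a single bin label per row is enough (instead of one indicator column per bin), the pandas datetime accessor can discretize directly. A sketch reusing the df built above; the column names bin_start, bin_index and am_pm are my own:
import numpy as np

# floor each timestamp to the start of its 10-minute bin, e.g. 14:20:56 -> 14:20:00
df['bin_start'] = df['started_at'].dt.floor('10min')
# 0..143: which 10-minute slot of the day the timestamp falls into
df['bin_index'] = df['started_at'].dt.hour * 6 + df['started_at'].dt.minute // 10
# am/pm flag, matching the question's np.where approach
df['am_pm'] = np.where(df['started_at'].dt.hour < 12, 'am', 'pm')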

vectorizing a pandas df by row with multiple conditional statements

I'm trying to avoid for loops when applying a function row by row to a pandas df. I have looked at many vectorization examples but have not come across anything that works completely. Ultimately I am trying to add a column to the df containing, for each row, the sum of the points awarded by the conditions that are met (each condition has its own value).
I have looked at np.apply_along_axis, but that's just a hidden loop, and at np.where, but I could not see it working for the 25 conditions that I am checking.
A B C ... R S T
0 0.279610 0.307119 0.553411 ... 0.897890 0.757151 0.735718
1 0.718537 0.974766 0.040607 ... 0.470836 0.103732 0.322093
2 0.222187 0.130348 0.894208 ... 0.480049 0.348090 0.844101
3 0.834743 0.473529 0.031600 ... 0.049258 0.594022 0.562006
4 0.087919 0.044066 0.936441 ... 0.259909 0.979909 0.403292
[5 rows x 20 columns]
def point_calc(row):
    points = 0
    if row[2] >= row[13]:
        points += 1
    if row[2] < 0:
        points -= 3
    if row[4] >= row[8]:
        points += 2
    if row[4] < row[12]:
        points += 1
    if row[16] == row[18]:
        points += 4
    return points
points_list = []
for indx, row in df.iterrows():
    value = point_calc(row)
    points_list.append(value)
df['points'] = points_list
This is obviously not efficient but I am not sure how I can vectorize my code since it requires the values per row for each column in the df to get a custom summation of the conditions.
Any help in pointing me in the right direction would be much appreciated.
Thank you.
UPDATE:
I was able to get a little more speed replacing the df.iterrows section with df.apply.
df['points'] = df.apply(lambda row: point_calc(row), axis=1)
UPDATE2:
I updated the function as follows and substantially decreased the run time: about a 10x speed increase compared with using df.apply and the initial function.
def point_calc(row):
    a1 = np.where(row[:, 2] >= row[:, 13], 1, 0)
    a2 = np.where(row[:, 2] < 0, -3, 0)
    a3 = np.where(row[:, 4] >= row[:, 8], 2, 0)
    # etc.
    all_points = a1 + a2 + a3  # + etc.
    return all_points

df['points'] = point_calc(df.to_numpy())
What I am still working on is using np.vectorize on the function itself to see if that can be improved upon as well.
You can try it the following way:
# this is a small version of your dataframe
df = pd.DataFrame(np.random.random((10,4)), columns=list('ABCD'))
It looks like this:
A B C D
0 0.724198 0.444924 0.554168 0.368286
1 0.512431 0.633557 0.571369 0.812635
2 0.680520 0.666035 0.946170 0.652588
3 0.467660 0.277428 0.964336 0.751566
4 0.762783 0.685524 0.294148 0.515455
5 0.588832 0.276401 0.336392 0.997571
6 0.652105 0.072181 0.426501 0.755760
7 0.238815 0.620558 0.309208 0.427332
8 0.740555 0.566231 0.114300 0.353880
9 0.664978 0.711948 0.929396 0.014719
You can create a Series which counts your points and is initialized with zeros:
points = pd.Series(0, index=df.index)
It looks like this:
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
Afterwards you can add and subtract values line by line if you want.
The condition within the brackets selects the rows where the condition is true,
so -= and += are only applied to those rows.
points.loc[df.A < df.C] += 1
points.loc[df.B < 0] -= 3
At the end you can extract the values of the series as a numpy array if you want (optional):
point_list = points.values
Does this solve your problem?
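For reference, here is the same masking idea applied to the conditions from the original point_calc, addressing columns by position as the question does (a sketch; it assumes the frame has at least 19 columns):
# award points with boolean masks instead of a per-row function
c = df.columns
points = pd.Series(0, index=df.index)
points.loc[df[c[2]] >= df[c[13]]] += 1
points.loc[df[c[2]] < 0] -= 3
points.loc[df[c[4]] >= df[c[8]]] += 2
points.loc[df[c[4]] < df[c[12]]] += 1
points.loc[df[c[16]] == df[c[18]]] += 4
df['points'] = points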

Efficient way to add many rows to a DataFrame

I really want to speed up my code.
My already working code loops through a DataFrame and gets the start and end year. Then I add them to lists. At the end of the loop, I build the DataFrame from the lists.
rows = range(3560)

# initiate lists and dataframe
start_year = []
end_year = []

for i in rows:
    start_year.append(i)
    end_year.append(i)

df = pd.DataFrame({'Start date': start_year, 'End date': end_year})
I get what I expect, but very slowly:
Start date End date
0 1 1
1 2 2
2 3 3
3 4 4
Yes, it can be made faster. The trick is to avoid list.append (or, worse, pd.DataFrame.append) in a loop. You can use list(range(3560)), but you may find np.arange even more efficient. Here you can assign an array to multiple series via dict.fromkeys:
df = pd.DataFrame(dict.fromkeys(['Start date', 'End date'], np.arange(3560)))
print(df.shape)
# (3560, 2)
print(df.head())
# Start date End date
# 0 0 0
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4

How to apply a python function to pandas sub-dataframes split 'from the end' and get a new dataframe?

The problem
Starting from a pandas dataframe df with dim_df rows, I need a new
dataframe df_new obtained by applying a function to every sub-dataframe of dim_blk rows, ideally split starting from the last row (so the first block, not the last, may or may not have the full dim_blk rows), in the most efficient way (maybe vectorized?).
Example
In the following example the dataframe has only a few rows, but the real dataframe will have millions of rows; that's why I need an efficient solution.
dim_df = 7 # dimension of the starting dataframe
dim_blk = 3 # number of rows of the splitted block
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
print(df)
Output:
TEST
0 1
1 2
2 3
3 4
4 5
5 6
6 7
The split blocks I want:
1 # note: this is the first block composed by a <= dim_blk number of rows
2,3,4
5,6,7 # note: this is the last block and it has dim_blk number of rows
I've done it like this (I don't know if it is the most efficient way):
lst = np.arange(dim_df, 0, -dim_blk) # [7 4 1]
lst_mod = lst[1:] # [4 1] to cut off the last empty sub-dataframe
split_df = np.array_split(df, lst_mod[::-1]) # splitted by reversed list
print(split_df)
Output:
split_df: [
TEST
0 1,
TEST
1 2
2 3
3 4,
TEST
4 5
5 6
6 7]
For example:
print(split_df[1])
Output:
TEST
1 2
2 3
3 4
How can I get a new dataframe, df_new, where every row has two columns, min and max (just an example), calculated for every block?
I.e:
# df_new
Min Max
0 1 1
1 2 4
2 5 7
Thank you,
Gilberto
You can convert split_df into a dataframe and then create a new dataframe using the min and max functions, i.e.
split_df = pd.DataFrame(np.array_split(df['TEST'], lst_mod[::-1]))
df_new = pd.DataFrame({"MIN":split_df.min(axis=1),"MAX":split_df.max(axis=1)}).reset_index(drop=True)
Output:
MAX MIN
0 1.0 1.0
1 4.0 2.0
2 7.0 5.0
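A groupby-based alternative (a sketch under the same setup, not part of the original answer) builds block ids counted from the end and aggregates in one pass:
# block id counted from the end: the last dim_blk rows share an id, then the
# previous dim_blk rows, and so on; flipping the ids keeps the original order
blk = np.arange(len(df) - 1, -1, -1) // dim_blk
blk = blk.max() - blk
df_new = df.groupby(blk)['TEST'].agg(Min='min', Max='max').reset_index(drop=True)
#    Min  Max
# 0    1    1
# 1    2    4
# 2    5    7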
Moved solution from question to answer:
The Solution
I thought laterally and found a very speedy solution:
Apply a rolling function to the entire dataframe
Select every dim_blk-th row starting from the end
The code (with different values):
import numpy as np
import pandas as pd
import time
dim_df = 500000
dim_blk = 240
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
start_time = time.time()
df['MAX'] = df['TEST'].rolling(dim_blk).max()
df['MIN'] = df['TEST'].rolling(dim_blk).min()
df[['MAX', 'MIN']] = df[['MAX', 'MIN']].fillna(method='bfill')
df_split = pd.DataFrame(columns=['MIN', 'MAX'])
df_split['MAX'] = df['MAX'][-1::-dim_blk][::-1]
df_split['MIN'] = df['MIN'][-1::-dim_blk][::-1]
df_split.reset_index(inplace=True)
del(df_split['index'])
print(df_split.tail())
print('\n\nEND\n\n')
print("--- %s seconds ---" % (time.time() - start_time))
Time Stats
The original code finishes after 545 seconds. The new code finishes after 0.16 seconds. Awesome!
