find indices where df value is in bin range of other df - python

I am trying to create a new column df2["v2"] in a dataframe filled with the values from a different dataframe df1["v1"].
The first dataframe holds values from measurement 1 which are measured at the times stored in df1["T1"]. The second dataframe should now store the values from measurement 1, but it has a different time sampling. In the real-world task the time sampling is not evenly spaced (nor monotonically increasing, at least by default).
df1 = pd.DataFrame({"T1": [0, 5, 10, 15], "v1":[0, 1, 2, 3]})
df2 = pd.DataFrame({"T2": np.arange(0, 15)})
A stupid way of doing this could be:
df2["v2"] = pd.Series()
for n in range(df1["T1"].size - 1):
    t1 = df1["T1"].iloc[n]
    t2 = df1["T1"].iloc[n + 1]
    mask = (t1 <= df2["T2"]) & (df2["T2"] < t2)
    df2["v2"].loc[mask] = df1["v1"].iloc[n]
The resulting dataframe should look like this:
T2 v2
0 0 0.0
1 1 0.0
2 2 0.0
3 3 0.0
4 4 0.0
5 5 1.0
6 6 1.0
7 7 1.0
8 8 1.0
9 9 1.0
10 10 2.0
11 11 2.0
12 12 2.0
13 13 2.0
14 14 2.0
What's the fastest/most elegant way of achieving the same result?

Here is one way of solving the problem with pd.cut:
bins = pd.cut(df1['T1'], df1['T1'], right=False)
mapping = df1[:-1].set_index(bins[:-1])['v1']
df2['v2'] = df2['T2'].map(mapping)
Details:
Categorize the values in column T1 into discrete intervals characterised by the column T1 itself:
>>> bins
0 [0.0, 5.0)
1 [5.0, 10.0)
2 [10.0, 15.0)
3 NaN
Name: T1, dtype: category
Categories (3, interval[int64]): [[0, 5) < [5, 10) < [10, 15)]
Create a mapping series:
>>> mapping
T1
[0, 5) 0
[5, 10) 1
[10, 15) 2
Name: v1, dtype: int64
Map the values in column T2 with the help of the above mapping series:
>>> df2
T2 v2
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 2
11 11 2
12 12 2
13 13 2
14 14 2
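For completeness, here is one more possible approach, not part of the original answer: a sketch using numpy.searchsorted, assuming df1["T1"] is sorted ascending (sort it first if, as the question notes, it is not monotonically increasing). Times at or beyond the last edge get NaN, matching the pd.cut behaviour above.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"T1": [0, 5, 10, 15], "v1": [0, 1, 2, 3]})
df2 = pd.DataFrame({"T2": np.arange(0, 15)})

edges = df1["T1"].to_numpy()
idx = np.searchsorted(edges, df2["T2"].to_numpy(), side="right") - 1   # bin index for every T2
valid = (idx >= 0) & (idx < len(edges) - 1)                            # outside the bins -> NaN
df2["v2"] = np.where(valid, df1["v1"].to_numpy()[idx.clip(0, len(edges) - 1)], np.nan)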


min/max value of a column based on values of another column, grouped by and transformed in pandas

I'd like to know if I can do all this in one line, rather than multiple lines.
my dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
                   'B': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
                   'desired_output': [5, 5, 5, 5, 5, 5, 20, 20, 20, 20, 20, 20]})
df
ID A B desired_output
0 1 1.0 0 5
1 1 2.0 1 5
2 1 3.0 1 5
3 1 10.0 0 5
4 1 NaN 1 5
5 1 5.0 1 5
6 2 20.0 1 20
7 2 6.0 1 20
8 2 7.0 1 20
9 2 NaN 0 20
10 2 NaN 1 20
11 2 NaN 0 20
I'm trying to find the maximum value of column A for rows where column B == 1, grouped by column ID, and to transform the result directly so that the value is back in the dataframe without extra merging and the like.
something like the following (but without getting errors!)
df['desired_output'] = df.groupby('ID').A.where(df.B == 1).transform('max') ## this gives error
The max function should ignore the NaNs as well. I wonder if I'm trying to do too much in one line, but one can hope there is a way to write this elegantly.
EDIT:
I can get a very similar output by changing the where clause:
df['desired_output'] = df.where(df.B == 1).groupby('ID').A.transform('max') ## this works but output is not what i want
but the output is not exactly what I want. desired_output should not have any NaN, unless all values of A are NaN for when B == 1.
Here is a way to do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'A': [1, 2, 3, 10, np.nan, 5, 20, 6, 7, np.nan, np.nan, np.nan],
    'B': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0],
    'desired_output': [5, 5, 5, 5, 5, 5, 20, 20, 20, 20, 20, 20]
})
df['output'] = df[df.B == 1].groupby('ID').A.max()[df.ID].array
df
Result:
ID A B desired_output output
0 1 1.0 0 5 5.0
1 1 2.0 1 5 5.0
2 1 3.0 1 5 5.0
3 1 10.0 0 5 5.0
4 1 NaN 1 5 5.0
5 1 5.0 1 5 5.0
6 2 20.0 1 20 20.0
7 2 6.0 1 20 20.0
8 2 7.0 1 20 20.0
9 2 NaN 0 20 20.0
10 2 NaN 1 20 20.0
11 2 NaN 0 20 20.0
Decomposition:
df[df.B == 1]      # start by filtering on B
  .groupby('ID')   # group by ID
  .A.max()         # get the max value of column A per group
  [df.ID]          # broadcast the result back to the shape of the ID column
  .array           # fetch the raw values from the Series
Important note: this relies on the index being as in the given example, that is, sorted, starting from 0, with an increment of 1. You will have to reset_index() on your DataFrame before this operation when that is not the case.
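If you would rather avoid the index assumption entirely, a one-line variant in the spirit of the question's EDIT is also possible. This is just a sketch, not part of the answer above: mask only column A and group by the untouched ID column, so rows with B == 0 still receive their group's maximum:
df['output'] = df['A'].where(df['B'].eq(1)).groupby(df['ID']).transform('max')
A group whose A values are all NaN where B == 1 keeps NaN, which matches the requirement stated in the EDIT.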

Pandas Constant Values after each Zero Value

Say I have the following dataframe:
values
0 4
1 0
2 2
3 3
4 0
5 8
6 5
7 1
8 0
9 4
10 7
I want to find a pandas vectorized function (preferably using groupby) that would replace all nonzero values with the first nonzero value in that chunk of nonzero values, i.e. something that would give me
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
Is there a good way of achieving this?
Make a boolean mask that selects the rows holding a zero and the row immediately after each zero, use it with where to replace all other values with NaN, then forward-fill to propagate the values.
m = df['values'].eq(0)
df['new'] = df['values'].where(m | m.shift()).ffill().fillna(df['values'])
Result
print(df)
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
Get the rows holding zeros, plus the rows immediately after them:
zeros = df.index[df['values'].eq(0)]
after_zeros = zeros.union(zeros + 1)
Get the rows that need to be forward filled:
replace = df.index.difference(after_zeros)
replace = replace[replace > zeros[0]]
Set values and forward fill on replace:
df['new'] = df['values']
df.loc[replace, 'new'] = np.nan
df.ffill()
values new
0 4 4.0
1 0 0.0
2 2 2.0
3 3 2.0
4 0 0.0
5 8 8.0
6 5 8.0
7 1 8.0
8 0 0.0
9 4 4.0
10 7 4.0
The following function should do the job for you. Check the comments in the function to understand the workflow of the solution.
import pandas as pd
def ffill_nonZeros(values):
    # get the values that are not equal to 0
    non_zero = values[values != 0]
    # get their indexes
    non_zero_idx = non_zero.index.to_series()
    # find where the indexes are consecutive
    diff = non_zero_idx.diff()
    mask = diff == 1
    # using the mask, set every position in non_zero that continues a run to None
    non_zero[mask] = None
    # fill forward (replace all None values with the previous valid value)
    new_non_zero = non_zero.ffill()
    # put the new values back at their indexes
    new = values.copy()
    new[new_non_zero.index] = new_non_zero
    return new
Now applying this function to your data:
df = pd.DataFrame([4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7], columns=['values'])
df['new'] = ffill_nonZeros(df['values'])
print(df)
Output:
values new
0 4 4
1 0 0
2 2 2
3 3 2
4 0 0
5 8 8
6 5 8
7 1 8
8 0 0
9 4 4
10 7 4
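Since the question mentions a preference for groupby, here is one more possible sketch, not taken from the answers above: number the nonzero chunks with a cumulative count of zeros and broadcast each chunk's first value.
import pandas as pd

df = pd.DataFrame([4, 0, 2, 3, 0, 8, 5, 1, 0, 4, 7], columns=['values'])
s = df['values']
nz = s[s.ne(0)]                      # nonzero rows only
chunk = s.eq(0).cumsum()             # chunk id, increases at every zero
df['new'] = s.copy()                 # zeros stay zero
df.loc[nz.index, 'new'] = nz.groupby(chunk[nz.index]).transform('first')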

Reduce the number of rows in a dataframe based on a condition

I have a dataframe which consists of 9821 rows and one column. The values in it are listed in groups of 161 produced 61 times (161X61=9821). I need to reduce the number of rows to 9660 (161X60=9660) by replacing the first 2 values of each group of 161 into an average of those 2 values. In more simple words, in my existing dataframe the following groups of indexes (0, 1), (61, 62) ... (9760, 9761) need to be averaged in order to get a new dataframe with 9660 rows. Any ideas?
this is what I have (groups of 4 produced 3 times - 4X3=12):
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
11 21
this is what I want (groups of 3 produced 3 times - 3X3=9):
0 10.5
1 12
2 13
3 14.5
4 16
5 17
6 18.5
7 20
8 21
I'm not super happy with this answer but I'm putting it out there for review.
>>> df[df.index%4 == 0] = df.groupby(df.index//4).apply(lambda s: s.iloc[:2].mean()).values
>>> df = df[:-3]
>>> df
0
0 10.5
1 11.0
2 12.0
3 13.0
4 14.5
5 15.0
6 16.0
7 17.0
8 18.5
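For reference, a vectorized sketch (not from the original answers) that does reproduce the desired output of the small 4x3 example: build a group key in which the first two positions of every block share the same label, then take the group means. For the real data, the block length 4 would become the actual group size.
import pandas as pd

df = pd.DataFrame({0: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]})
group = pd.Series(df.index // 4)     # which block each row belongs to
pos = pd.Series(df.index % 4)        # position inside the block
key = pos.mask(pos.eq(0), 1)         # positions 0 and 1 share one key, so they get averaged
out = df[0].groupby([group, key]).mean().reset_index(drop=True)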
rotation - my existing dataframe (161X61)
rot - my new dataframe (161X60)
arr = np.zeros((9821, 1))
rot = pd.DataFrame(arr, index=range(0, 9821))
for i in range(0, 9821):
    if i == 0:
        rot.iloc[i, 0] = (rotation.iloc[i, 0] + rotation.iloc[i+1, 0]) / 2
    elif (i % 61) == 0:
        rot.iloc[i-1, 0] = (rotation.iloc[i, 0] + rotation.iloc[i+1, 0]) / 2
        rot.iloc[i, 0] = 'del'
    else:
        if i == 9820:
            rot.iloc[i, 0] = 'del'
            break
        rot.iloc[i, 0] = rotation.iloc[i+1, 0]
rot.columns = ['alpha']
rot = rot[~rot['alpha'].isin(['del'])]
rot.reset_index(drop=True, inplace=True)
rot

Extrapolate time series data based on Start and end values, using Python?

I have an Excel sheet with values representing the start and end time of a time series, as shown below. Times are in seconds.
Start_Time End_Time Value
0 2 A
2 3 B
3 9 A
9 11 C
I want to extrapolate the values between start and end_time and display the values for each second.
Time Value
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Any help to implement it in Python will be appreciated. Thanks.
Setup
You should easily find how to read your Excel sheet with pandas (for example with pd.read_excel); the options will depend on the file itself, so I won't cover that part.
Below is the reproduction of your sample dataframe, used for the example.
import pandas as pd
df = pd.DataFrame({'Start_Time': [0, 2, 3, 9],
                   'End_Time': [2, 3, 9, 11],
                   'Value': ['A', 'B', 'A', 'C']})
>>> df
Out[]:
End_Time Start_Time Value
0 2 0 A
1 3 2 B
2 9 3 A
3 11 9 C
Solution
(pd.Series(range(df.End_Time.max() + 1), name='Value') # Create a series on whole range
.map(df.set_index('End_Time').Value) # Set values from "df"
.bfill() # Backward fill NaNs values
.rename_axis('Time')) # Purely cosmetic axis rename
Out[]:
Time
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
Name: Value, dtype: object
Walkthrough
Create the whole "Time" range
s = pd.Series(range(df.End_Time.max() + 1))
>>> s
Out[]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
dtype: int32
Use "End_Time" as index for df
>>> df.set_index('End_Time')
Out[]:
Start_Time Value
End_Time
2 0 A
3 2 B
9 3 A
11 9 C
Map df values to corresponding "End_Time" values from s
s = s.map(df.set_index('End_Time').Value)
>>> s
Out[]:
0 NaN
1 NaN
2 A
3 B
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 A
10 NaN
11 C
dtype: object
Backward-fill the NaN values
s = s.bfill()
>>> s
Out[]:
0 A
1 A
2 A
3 B
4 A
5 A
6 A
7 A
8 A
9 A
10 C
11 C
dtype: object
Then rename_axis('Time') only renames the series axis to match your desired output.
Note that this works here because the value is attached to End_Time, i.e. each interval excludes Start_Time.
If instead the value really starts at Start_Time (which is more common), you should change End_Time to Start_Time and bfill() to ffill() (forward fill).
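For reference, a sketch of that forward-fill variant, reusing the df from the Setup above and assuming the value applies from Start_Time onwards:
(pd.Series(range(df.End_Time.max() + 1), name='Value')   # whole time range
   .map(df.set_index('Start_Time').Value)                # values anchored at Start_Time
   .ffill()                                              # propagate forward
   .rename_axis('Time'))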

Pandas DataFrame, calculate max column value relative to current row column value

I have a dataframe:
import pandas as pd

df = pd.DataFrame({
    'epoch': [1, 4, 7, 8, 9, 11, 12, 15, 16, 17],
    'price': [1, 2, 3, 3, 1, 4, 2, 3, 4, 4]
})
epoch price
0 1 1
1 4 2
2 7 3
3 8 3
4 9 1
5 11 4
6 12 2
7 15 3
8 16 4
9 17 4
I have to create a new column that should be calculated in the following way:
For each row:
Find the current row's epoch (call it e_cur).
Calculate e_cur - 3 (three is a constant here, but it will be a variable).
Calculate the maximum price over rows where epoch >= e_cur - 3 and epoch <= e_cur.
In other words, find the maximum price among rows that are at most three epochs away from the current row's epoch.
For example:
Index = 0: e_cur = epoch = 1, e_cur - 3 = 1 - 3 = -2; there is only one (the first) row whose epoch is between -2 and 1, so the price from the first row is the maximum price.
Index = 6: e_cur = epoch = 12, e_cur - 3 = 12 - 3 = 9; there are three rows whose epoch is between 9 and 12, but the row with index = 5 has the maximum price = 4.
Here are the results for every row that I calculated manually:
epoch price max_price_where_epoch_is_between_e_cur-3_and_e_cur
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
As you can see, epoch mostly increases one by one, but sometimes there are "holes".
How to calculate that with pandas?
Using a rolling window:
In [161]: df['between'] = df.epoch.map(df.set_index('epoch')
...: .reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
...: .rolling(3, min_periods=1)
...: .max()['price'])
...:
In [162]: df
Out[162]:
epoch price between
0 1 1 1.0
1 4 2 2.0
2 7 3 3.0
3 8 3 3.0
4 9 1 3.0
5 11 4 4.0
6 12 2 4.0
7 15 3 3.0
8 16 4 4.0
9 17 4 4.0
Explanation:
Helper DF:
In [165]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
Out[165]:
price
epoch
1 1.0
2 NaN
3 NaN
4 2.0
5 NaN
6 NaN
7 3.0
8 3.0
9 1.0
10 NaN
11 4.0
12 2.0
13 NaN
14 NaN
15 3.0
16 4.0
17 4.0
In [166]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1)).rolling(3, min_periods=1).max()
Out[166]:
price
epoch
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
7 3.0
8 3.0
9 3.0
10 3.0
11 4.0
12 4.0
13 4.0
14 2.0
15 3.0
16 4.0
17 4.0
Consider applying a function to the epoch column that locates the required rows and calculates their maximum price:
>> df['between'] = df['epoch'].apply(lambda e: df.loc[
>> (df['epoch'] >= e - 3) & (df['epoch'] <= e), 'price'].max())
>> df
epoch price between
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
I have tried both solutions, from tarashypka and MaxU.
The first solution I tried was Tarashypka's. I tested it on 100k rows, and it took about one minute.
Then I tried MaxU's solution, which finished in about 4 seconds.
I prefer MaxU's solution because of its speed, but from Tarashypka's solution I also learned how to use a lambda function with a DataFrame.
Thank you very very much to all of you.
Best regards and wishes.
