pandas how to select some values from difference - python

How to select values from diff that lie in a certain range?
df['timestamp'].diff() # .select(1 < x < 10)

Using loc with a lambda (the callable receives the diffed Series, so no intermediate variable is needed):
df['timestamp'].diff().loc[lambda x: (x > 1) & (x < 10)]
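Note that this returns the diff values themselves rather than the original rows. With the example frame used in the next answer:

import pandas as pd

df = pd.DataFrame({'timestamp': [8, 4, 1, 5, 3]})
print(df['timestamp'].diff().loc[lambda x: (x > 1) & (x < 10)])
# 3    4.0
# Name: timestamp, dtype: float64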

If I understand correctly, you can obtain the original row in your dataframe where the diff is between 1 and 10 like this:
df.loc[(df['timestamp'].diff() > 1) & (df['timestamp'].diff() < 10)]
Example:
given a df:
>>> df
timestamp
0 8
1 4
2 1
3 5
4 3
With these diff() values:
>>> df.diff()
timestamp
0 NaN
1 -4.0
2 -3.0
3 4.0
4 -2.0
You can extract that row where the diff is in your range:
>>> df.loc[(df['timestamp'].diff() > 1) & (df['timestamp'].diff() < 10)]
timestamp
3 5
Edit: As pointed out by @Wen, calling diff() twice is not efficient. You can compute the diff once, build a mask from it, and use that mask to extract your rows:
msk = df['timestamp'].diff()
df.loc[(msk > 1) & (msk < 10)]
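If you prefer to keep the original shape instead of dropping rows, DataFrame.where does that, filling non-matching rows with NaN (a sketch using the example frame from above):

msk = df.diff()                          # same-shaped frame of diffs
print(df.where((msk > 1) & (msk < 10)))  # non-matching rows become NaN
#    timestamp
# 0        NaN
# 1        NaN
# 2        NaN
# 3        5.0
# 4        NaN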

Related

How to apply a formula to a Pandas DataFrame using a rolling window as well as a mean value across this window?

I am trying to implement a small algorithm that creates a new column in my DataFrame depending on whether a certain condition on another column exceeds a threshold or not. The formula looks like this:
df.loc[:, 'new_column'] = 0
df.loc[sum(abs(np.diff(df.loc[:, 'old_column']) / df.loc[:, 'old_column'].mean())) > threshold, 'new_column'] = 1
However, I don't want to apply this formula to the whole height of the DataFrame; instead, I would like to apply it over a rolling window, i.e. the mean value in the formula should roll through the rows of the DataFrame. I found this page in the documentation but don't know how to apply it to a formula like mine. How could I do something like this?
Using column names a and b instead of old_column and new_column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=10), columns=['a'])
window = 3
# absolute step size relative to the rolling mean of the window
val = df['a'].diff().abs() / df['a'].rolling(window, min_periods=1).mean()
threshold = 1
condition = (val > threshold)
df['b'] = 0
df.loc[condition, 'b'] = 1
Example results:
a b
0 1 0
1 7 1
2 6 0
3 1 1
4 8 1
5 2 1
6 3 0
7 1 0
8 0 0
9 8 1
Pay close attention to NaN values in the intermediate results. diff() returns NaN in the first row, unlike np.diff(), which returns an array one element shorter than its input.
And rolling().mean() will return NaN values depending on your min_periods parameter.
The final result contains no NaN values because (val > threshold) is always False for NaN inputs.
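As a quick illustration of that difference between the two diff flavors (a standalone sketch):

import numpy as np
import pandas as pd

s = pd.Series([1, 7, 6])
print(s.diff().tolist())      # [nan, 6.0, -1.0] -- same length, NaN in front
print(np.diff(s.to_numpy()))  # [ 6 -1]          -- one element shorter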

Python: Split pandas dataframe by range of values

I have a simple dataframe in which I am trying to split into multiple groups based on whether the x column value falls within a range.
e.g. if I have:
print(df1)
x
0 5
1 7.5
2 10
3 12.5
4 15
And wish to create a new dataframe, df2, of values of x which are within the range 7-13 (7 < x < 13)
print(df1)
x
0 5
4 15
print(df2)
x
1 7.5
2 10
3 12.5
I have been able to split the dataframe based on a single-value boolean condition, e.g. (x < 11), using the following, but I have been unable to extend this to a range of values.
thresh = 11
df2 = df1[df1['x'] < thresh]
print(df2)
x
0 5
1 7.5
2 10
You can create a boolean mask for the range (7 < x < 13) by combining the conditions (x > 7) and (x < 13) with &. Then create df2 with this boolean mask; the remaining entries for df1 are given by the negation of the mask:
thresh_low = 7
thresh_high = 13
mask = (df1['x'] > thresh_low) & (df1['x'] < thresh_high)
df2 = df1[mask]
df1 = df1[~mask]
Result:
print(df2)
x
1 7.5
2 10.0
3 12.5
print(df1)
x
0 5.0
4 15.0
You can use between to categorize whether the condition is met and then groupby to split based on your condition. Here I'll store the results in a dict
d = dict(tuple(df1.groupby(df1['x'].between(7, 13, inclusive=False))))
d[True]
# x
#1 7.5
#2 10.0
#3 12.5
d[False]
# x
#0 5.0
#4 15.0
Or with only two possible splits you can manually define the Boolean Series and then split based on it.
m = df1['x'].between(7, 13, inclusive=False)
df_in = df1[m]
df_out = df1[~m]
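Note that in pandas 1.3 and later the boolean inclusive argument is deprecated in favor of strings, so the same strict-range mask would be written as:

m = df1['x'].between(7, 13, inclusive='neither')  # strict 7 < x < 13
df_in = df1[m]
df_out = df1[~m]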

Merge Pandas Dataframe based on boolean function

I am looking for an efficient way to merge two pandas data frames based on a function that takes columns from both data frames as input and returns True or False. E.g., assume I have the following "tables":
import pandas as pd
df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])
def validation(a, b):
    return ((a + b) % 2) == 0
I would like to join df_1 and df_2 on each row where the sum of the first columns is an even number. The resulting table would be:
df_3 =  1  5
        2  4
        2  6
        3  5
Please think of it as a general problem, not as a task to return just df_3. The solution should accept any function that validates a combination of columns and returns True or False.
Thanks, Lazloo
You can do this with a merge on parity:
(df_1.assign(parity=df_1[0] % 2)
     .merge(df_2.assign(parity=df_2[0] % 2), on='parity')
     .drop('parity', axis=1)
)
output:
0_x 0_y
0 1 5
1 3 5
2 2 4
3 2 6
You can use broadcasting, or the outer functions, to compare all rows. You'll run into issues as the length becomes large.
import pandas as pd
import numpy as np
def validation(a, b):
    """a, b : np.array"""
    arr = np.add.outer(a, b)       # how to combine rows
    i, j = np.where(arr % 2 == 0)  # condition
    return pd.DataFrame(np.stack([a[i], b[j]], axis=1))
validation(df_1[0].to_numpy(), df_2[0].to_numpy())
0 1
0 1 5
1 2 4
2 2 6
3 3 5
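For the fully general problem (any validating function), a cross join followed by a filter also works. A sketch, assuming pandas >= 1.2 for how='cross'; memory grows with len(df_1) * len(df_2):

# Pair every row of df_1 with every row of df_2, then keep the pairs
# that the (vectorized) validation function accepts.
pairs = df_1.merge(df_2, how='cross', suffixes=('_1', '_2'))
df_3 = pairs[validation(pairs['0_1'], pairs['0_2'])].reset_index(drop=True)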
In this particular case you might leverage the fact that a sum is even exactly when both addends have the same parity (even + even and odd + odd both give even numbers), so define a parity column and merge on that.
df_1['parity'] = df_1[0]%2
df_2['parity'] = df_2[0]%2
df_3 = df_1.merge(df_2, on='parity')
0_x parity 0_y
0 1 1 5
1 3 1 5
2 2 0 4
3 2 0 6
This is a basic solution, but it is not very efficient if you are working with large dataframes:
df_1.index *= 0
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]
Edit: here is a better solution:
df_1.index = df_1.iloc[:,0] % 2
df_2.index = df_2.iloc[:,0] % 2
df = df_1.join(df_2, lsuffix='_2')

Pandas: splitting data frame based on the slope of data

I have this data frame
x = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
Update: I want a function that, if the slope of a group is negative and the length of the group is more than 2, returns True together with the start and end index of the group. For this case it should return: result=True, start index=5, end index=8.
1- I want to split the data frame based on the slope. This example should have 6 groups.
2- How can I check the length of the groups?
I tried to get the groups with the code below, but I don't know how to split the data frame or how to check the length of each part.
New update: Thanks to Matt W. for his code; I finally found the solution.
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().fillna(0)
df.loc[df['diff'] < 0, 'diff'] = -1
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)

def get_slope(df):
    # least-squares slope of the group's values against their index
    x = np.array(df.iloc[:, 0].index)
    y = np.array(df.iloc[:, 0])
    X = x - x.mean()
    Y = y - y.mean()
    slope = X.dot(Y) / X.dot(X)
    return slope

df['g'] = init[1:]
df.groupby('g').apply(get_slope)
Result (single-row groups have an undefined slope, hence NaN):
0 NaN
1 NaN
2 NaN
3 0.0
4 NaN
5 -1.5
6 NaN
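Building on that, a sketch (with a hypothetical helper name, not from the original answer) of the requested check: report True plus the first and last index of each group longer than 2 rows with a negative slope. For this data it flags the declining run at rows 6-8; extend the start by one row if, as in the question, the peak just before the decline (index 5) should be included.

def negative_runs(df, min_len=3):
    # (True, start_idx, end_idx) for each group with more than 2 rows
    # and a negative least-squares slope
    runs = []
    for _, grp in df.groupby('g'):
        if len(grp) >= min_len and get_slope(grp) < 0:
            runs.append((True, grp.index[0], grp.index[-1]))
    return runs

print(negative_runs(df))  # [(True, 6, 8)] for the example data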
Take the difference and bfill() the start so that the 0th element holds the same number as the first real difference. Then map all negatives to -1 so that any downward step counts as the same "slope". Then shift and compare: whenever the value differs from the previous one, a new group starts, which the loop records as an incrementing label assigned to g.
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1]+1)
df['g'] = init[1:]
df
df
entity diff g
0 5 2.0 1
1 7 2.0 1
2 5 -1.0 2
3 5 0.0 3
4 5 0.0 3
5 6 1.0 4
6 3 -1.0 5
7 2 -1.0 5
8 0 -1.0 5
9 5 5.0 6
Just wanted to present another solution that doesn't require a for-loop:
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
df['g'] = (~(df['diff'] == df['diff'].shift(1))).cumsum()
df
This produces the same g column as the loop above.

Pandas delete first n rows until condition on columns is fulfilled

I am trying to delete some rows from my dataframe. In fact, I want to delete the first n rows, where n is the row number of a certain condition: the dataframe should start with the row that contains the x-y values xEnd, yEnd, and all earlier rows shall be dropped. Somehow I do not get the solution. That is what I have so far.
Example:
import pandas as pd
xEnd=2
yEnd=3
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
n=df["id"].iloc[df["x"]==xEnd and df["y"]==yEnd]
df = df.iloc[n:]
I want my code to reduce the dataframe from
{'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]}
to
{'x':[2,2,2], 'y':[3,4,3], 'id':[3,4,5]}
Use & instead of and.
Use loc instead of iloc. You can use iloc, but it could break depending on the index.
Use idxmax to find the first position.
# I used idxmax to find the index ---------------------v
df.loc[((df['x'] == xEnd) & (df['y'] == yEnd)).idxmax():]
# ^
# |---- finding the index goes with using loc
id x y
3 3 2 3
4 4 2 4
5 5 2 3
Here is an iloc variation
# I used values.argmax to find the position ------------------v
df.iloc[((df['x'] == xEnd) & (df['y'] == yEnd)).values.argmax():]
# ^
# |---- finding the position goes with using iloc
id x y
3 3 2 3
4 4 2 4
5 5 2 3
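One caveat worth noting (not in the original answers): idxmax/argmax on an all-False mask returns the first position, so if the condition never matches, the slice would keep the entire frame. A cheap guard:

hit = (df['x'] == xEnd) & (df['y'] == yEnd)
# slice from the first match, or return an empty frame if there is none
result = df.loc[hit.idxmax():] if hit.any() else df.iloc[0:0]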
Using cummax
df[((df['x'] == xEnd) & (df['y'] == yEnd)).cummax()]
Out[147]:
id x y
3 3 2 3
4 4 2 4
5 5 2 3
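To see why cummax works here (a quick sketch with the same frame): the boolean match series flips to True at the first matching row, and cummax() keeps it True from there on, so boolean indexing keeps that row and everything after it.

hit = (df['x'] == xEnd) & (df['y'] == yEnd)
print(hit.tolist())           # [False, False, False, True, False, True]
print(hit.cummax().tolist())  # [False, False, False, True, True, True]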
