I have a simple dataframe that I am trying to split into multiple groups based on whether the x column value falls within a range.
e.g. if I have:
print(df1)
x
0 5
1 7.5
2 10
3 12.5
4 15
I wish to create a new dataframe, df2, from the values of x that fall within the range 7-13 (7 < x < 13), leaving the remaining rows in df1:
print(df1)
x
0 5
4 15
print(df2)
x
1 7.5
2 10
3 12.5
I have been able to split the dataframe based on a single-value boolean condition, e.g. (x < 11), using the following, but have been unable to extend this to a range of values.
thresh = 11
df2 = df1[df1['x'] < thresh]
print(df2)
x
0 5
1 7.5
2 10
You can create a boolean mask for the range (7 < x < 13) by combining the conditions (x > 7) and (x < 13) with the & operator. Then create df2 with this boolean mask, and keep the remaining entries in df1 by negating the mask with ~:
thresh_low = 7
thresh_high = 13
mask = (df1['x'] > thresh_low) & (df1['x'] < thresh_high)
df2 = df1[mask]
df1 = df1[~mask]
Result:
print(df2)
x
1 7.5
2 10.0
3 12.5
print(df1)
x
0 5.0
4 15.0
You can use between to categorize whether the condition is met and then groupby to split based on your condition. Here I'll store the results in a dict (inclusive='neither' gives the strict 7 < x < 13; older pandas versions spelled this inclusive=False):
d = dict(tuple(df1.groupby(df1['x'].between(7, 13, inclusive='neither'))))
d[True]
# x
#1 7.5
#2 10.0
#3 12.5
d[False]
# x
#0 5.0
#4 15.0
Or, with only two possible splits, you can define the Boolean Series manually and then split on it:
m = df1['x'].between(7, 13, inclusive='neither')
df_in = df1[m]
df_out = df1[~m]
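If you later need more than two buckets, here is a sketch using pd.cut to bin x and groupby to split on those bins. The bin edges are illustrative; note that pd.cut intervals are right-closed by default, so this gives 7 < x <= 13 rather than the strict 7 < x < 13:
import pandas as pd

df1 = pd.DataFrame({'x': [5, 7.5, 10, 12.5, 15]})

# Label each row with the interval its x value falls into
bins = pd.cut(df1['x'], bins=[0, 7, 13, 20])
# One sub-dataframe per interval, keyed by the Interval object
parts = dict(tuple(df1.groupby(bins)))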
Related
I have DataFrame with column Sales.
How can I split it into 2 based on Sales value?
First DataFrame will have data with 'Sales' < s and second with 'Sales' >= s
You can use boolean indexing:
import pandas as pd

df = pd.DataFrame({'Sales':[10,20,30,40,50], 'A':[3,4,7,6,1]})
print (df)
A Sales
0 3 10
1 4 20
2 7 30
3 6 40
4 1 50
s = 30
df1 = df[df['Sales'] >= s]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
df2 = df[df['Sales'] < s]
print (df2)
A Sales
0 3 10
1 4 20
It's also possible to invert the mask with ~:
mask = df['Sales'] >= s
df1 = df[mask]
df2 = df[~mask]
print (df1)
A Sales
2 7 30
3 6 40
4 1 50
print (df2)
A Sales
0 3 10
1 4 20
print (mask)
0 False
1 False
2 True
3 True
4 True
Name: Sales, dtype: bool
print (~mask)
0 True
1 True
2 False
3 False
4 False
Name: Sales, dtype: bool
Using groupby, you could split into two dataframes like this:
In [1047]: df1, df2 = [x for _, x in df.groupby(df['Sales'] < 30)]
In [1048]: df1
Out[1048]:
A Sales
2 7 30
3 6 40
4 1 50
In [1049]: df2
Out[1049]:
A Sales
0 3 10
1 4 20
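Note that this unpacking relies on groupby sorting the group keys, so the False group (Sales >= 30 here) always comes first. If you'd rather not depend on that ordering, a small variation keyed by the group label:
d = dict(tuple(df.groupby(df['Sales'] < 30)))
df1, df2 = d[False], d[True]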
Using groupby and a list comprehension:
Store all the split dataframes in a list variable and access each separated dataframe by its index.
DF = pd.DataFrame({'chr':["chr3","chr3","chr7","chr6","chr1"],'pos':[10,20,30,40,50],})
ans = [y for x, y in DF.groupby('chr')]
Access the separated DFs like this:
ans[0]
ans[1]
ans[-1]  # the last separated DF
Access a column of a separated DF like this:
ans_i_chr = ans[i].chr
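If positional indexing into the list feels fragile, here is a sketch of the same split keyed by the chr value instead, so each group is looked up by name:
d = dict(tuple(DF.groupby('chr')))
d['chr3']       # all rows where chr == 'chr3'
d['chr3'].pos   # the pos column of that group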
One-liner using the walrus operator (Python 3.8+):
df1, df2 = df[(mask:=df['Sales'] >= 30)], df[~mask]
Consider using copy to avoid SettingWithCopyWarning:
df1, df2 = df[(mask:=df['Sales'] >= 30)].copy(), df[~mask].copy()
Alternatively, you can use the query method:
df1, df2 = df.query('Sales >= 30').copy(), df.query('Sales < 30').copy()
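If the threshold lives in a variable, query can reference local variables with the @ prefix, e.g.:
s = 30
df1, df2 = df.query('Sales >= @s').copy(), df.query('Sales < @s').copy()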
I like to use this for speeding up searches or rolling-average .apply(lambda x: ...)-style functions, so I split big files into dictionaries of dataframes:
df_dict = {sale_v: df[df['Sales'] == sale_v] for sale_v in df.Sales.unique()}
This should do it if you want to split based on categorical groups.
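Each dict value is then a ready-made sub-dataframe; for the Sales data above, a quick usage sketch:
df_dict[30]     # just the rows where Sales == 30
list(df_dict)   # the unique Sales values used as keys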
This is my dataframe:
df = pd.DataFrame({'a':list('xxxyyzz'), 'b':[10,20,30,5,3,1,2]})
I group them:
groups = df.groupby('a')
I want to print the groups that have at least one b value above 20. In this case I want to print x.
This is my desired outcome:
x
a b
0 x 10
1 x 20
2 x 30
Compare values with Series.gt, group by the a column passed as a Series (df['a']), and use GroupBy.transform with 'any' to test for at least one True per group:
df1 = df[df['b'].gt(20).groupby(df['a']).transform('any')]
print (df1)
a b
0 x 10
1 x 20
2 x 30
You could check which values are above 20, group by column a, and transform with any to select only those groups with at least one row satisfying the condition:
df[df.b.gt(20).groupby(df.a).transform('any')]
a b
0 x 10
1 x 20
2 x 30
No need for groupby, just use isin:
df[df.a.isin(df.loc[df.b>20,'a'])]
Out[996]:
a b
0 x 10
1 x 20
2 x 30
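Another option with the same result is GroupBy.filter, which keeps only the groups whose rows satisfy the predicate; note it calls the lambda once per group, so it can be slower than the transform approach when there are many groups:
df.groupby('a').filter(lambda g: g['b'].gt(20).any())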
I have this data frame
x = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
Update: I want a function: if the slope is negative and the length of the group is more than 2, it should return True plus the start and end index of the group. For this case it should return: result=True, index=5, index=8.
1- I want to split the data frame based on the slope. This example should have 6 groups.
2- How can I check the length of the groups?
I tried to get the groups with the code below, but I don't know how to split the data frame or how to check the length of each part.
New update: Thanks to Matt W. for his code; I finally found the solution.
import numpy as np
import pandas as pd

df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
df['diff'] = df.entity.diff().fillna(0)
df.loc[df['diff'] < 0, 'diff'] = -1

# Start a new group id whenever the (sign-collapsed) diff changes
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1] + 1)

def get_slope(df):
    # Least-squares slope of the entity values against their index
    x = np.array(df.iloc[:, 0].index)
    y = np.array(df.iloc[:, 0])
    X = x - x.mean()
    Y = y - y.mean()
    slope = (X.dot(Y)) / (X.dot(X))
    return slope

df['g'] = init[1:]
df.groupby('g').apply(get_slope)
Result
0 NaN
1 NaN
2 NaN
3 0.0
4 NaN
5 -1.5
6 NaN
Take the difference and bfill() the start so that the 0th element has the same number as the next. Then set all negatives to -1 so that every decrease counts as the same "slope". Then shift and compare to check whether the next number is the same, iterating through to build a group id that increments whenever the value changes, and assign that to g.
df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1

init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1] + 1)

df['g'] = init[1:]
df
entity diff g
0 5 2.0 1
1 7 2.0 1
2 5 -1.0 2
3 5 0.0 3
4 5 0.0 3
5 6 1.0 4
6 3 -1.0 5
7 2 -1.0 5
8 0 -1.0 5
9 5 5.0 6
Just wanted to present another solution that doesn't require a for-loop:
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
df['g'] = (~(df['diff'] == df['diff'].shift(1))).cumsum()
df
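To answer the updated question (return True plus the start and end index for groups that slope downward and have more than 2 rows), here is a sketch combining the g column above with the get_slope function from the question's update; with this grouping the qualifying run is indices 6-8 (the declining values 3, 2, 0):
slopes = df.groupby('g').apply(get_slope)   # get_slope as defined above
sizes = df.groupby('g').size()
for g_id in slopes.index:
    if sizes[g_id] > 2 and slopes[g_id] < 0:
        idx = df.index[df['g'] == g_id]
        print(True, idx[0], idx[-1])   # True 6 8 for this data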
How to select values from diff that lie in a certain range?
df['timestamp'].diff() # .select(1 < x < 10)
Using loc + lambda:
df['timestamp'].diff().loc[lambda x: (x > 1) & (x < 10)]
If I understand correctly, you can obtain the original row in your dataframe where the diff is between 1 and 10 like this:
df.loc[(df['timestamp'].diff() > 1) & (df['timestamp'].diff() < 10)]
Example:
given a df:
>>> df
timestamp
0 8
1 4
2 1
3 5
4 3
With these diff() values:
>>> df.diff()
timestamp
0 NaN
1 -4.0
2 -3.0
3 4.0
4 -2.0
You can extract that row where the diff is in your range:
>>> df.loc[(df['timestamp'].diff() > 1) & (df['timestamp'].diff() < 10)]
timestamp
3 5
Edit: As pointed out by @Wen, using diff() twice is not very efficient. You can instead create a mask using diff() and use that mask to extract your rows, along the lines of:
msk = df.diff()
df.where((msk > 1) & (msk < 10))
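One caveat: where keeps the original shape and fills non-matching rows with NaN, whereas boolean indexing drops them. If you want only the matching rows, reuse the mask like this:
msk = df['timestamp'].diff()
df[(msk > 1) & (msk < 10)]   # only the matching rows, no NaN padding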
I want to add a column to a df. The values of this new column will depend on the values of the other columns, e.g.:
dc = {'A':[0,9,4,5],'B':[6,0,10,12],'C':[1,3,15,18]}
df = pd.DataFrame(dc)
A B C
0 0 6 1
1 9 0 3
2 4 10 15
3 5 12 18
Now I want to add another column D whose values will depend on values of A,B,C.
So, for example, if I were iterating through the df I would just do:
# Note: assign via df.loc, since writing to the iterrows() row doesn't modify df
for idx, row in df.iterrows():
    if row['A'] != 0 and row['B'] != 0:
        df.loc[idx, 'D'] = (float(row['A']) / float(row['B'])) * row['C']
    elif row['C'] == 0 and row['A'] != 0 and row['B'] == 0:
        df.loc[idx, 'D'] = 250.0
    else:
        df.loc[idx, 'D'] = 20.0
Is there a way to do this without the for loop, e.g. using the where() or apply() functions?
Thanks
apply should work well for you:
In [20]: def func(row):
   ....:     if (row == 0).all():
   ....:         return 250.0
   ....:     elif (row[['A', 'B']] != 0).all():
   ....:         return (float(row['A']) / row['B']) * row['C']
   ....:     else:
   ....:         return 20
   ....:
In [21]: df['D'] = df.apply(func, axis=1)
In [22]: df
Out[22]:
A B C D
0 0 6 1 20.0
1 9 0 3 20.0
2 4 10 15 6.0
3 5 12 18 7.5
[4 rows x 4 columns]
.where can be much faster than .apply, so if all you're doing is if/elses then I'd aim for .where. As you're returning scalars in some cases, np.where will be easier to use than pandas' own .where.
import pandas as pd
import numpy as np
df['D'] = np.where((df.A != 0) & (df.B != 0), (df.A / df.B) * df.C,
          np.where((df.C == 0) & (df.A != 0) & (df.B == 0), 250,
                   20))
A B C D
0 0 6 1 20.0
1 9 0 3 20.0
2 4 10 15 6.0
3 5 12 18 7.5
For a tiny df like this, you wouldn't need to worry about speed. However, on a 10,000-row df of randn, this is almost 2000 times faster than the .apply solution above: 3ms vs 5850ms. That said, if speed isn't a concern, .apply can often be easier to read.
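Once there are more than two branches, nested np.where calls get hard to read; np.select takes parallel lists of conditions and choices and is otherwise equivalent. A sketch of the same logic:
import numpy as np

conditions = [(df.A != 0) & (df.B != 0),
              (df.C == 0) & (df.A != 0) & (df.B == 0)]
choices = [(df.A / df.B) * df.C, 250]
df['D'] = np.select(conditions, choices, default=20)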
Here's a start:
df['D'] = np.nan
df.loc[(df.A != 0) & (df.B != 0), 'D'] = df.A / df.B.astype(float) * df.C
Edit: you should probably just go ahead and cast the whole thing to floats unless you really care about integers for some reason:
df = df.astype(float)
Then you don't have to keep converting in the call itself.