I have this data frame:
import pandas as pd
x = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
Update: I want a function that returns True together with the start and end index of the group whenever the slope is negative and the length of the group is more than 2. For this case it should return: result=True, index=5, index=8.
1- I want to split the data frame based on the slope. This example should have 6 groups.
2- How can I check the length of each group?
I tried to get the groups with the code below, but I don't know how to split the data frame or how to check the length of each part.
New update: Thanks to Matt W. for his code; I finally found the solution.
import numpy as np
import pandas as pd

df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
df['diff'] = df.entity.diff().fillna(0)
df.loc[df['diff'] < 0, 'diff'] = -1  # collapse all negative steps into one value

# start a new group whenever the (sign-collapsed) diff changes
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1] + 1)

def get_slope(df):
    # least-squares slope of entity against the index
    x = np.array(df.iloc[:, 0].index)
    y = np.array(df.iloc[:, 0])
    X = x - x.mean()
    Y = y - y.mean()
    slope = X.dot(Y) / X.dot(X)
    return slope

df['g'] = init[1:]
df.groupby('g').apply(get_slope)
Result
1    NaN
2    NaN
3    NaN
4    0.0
5    NaN
6   -1.5
7    NaN
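To get the requested (True, start index, end index), a minimal sketch building on the groups above scans each group for a negative slope and a length above 2 (note that with this grouping the detected run starts at index 6, the first row of the descending diffs, not at the peak at index 5):
def find_negative_run(df):
    # return (True, start, end) for the first group longer than 2 rows
    # with a negative slope; (False, None, None) otherwise
    for _, grp in df.groupby('g'):
        if len(grp) > 2 and get_slope(grp) < 0:
            return True, grp.index[0], grp.index[-1]
    return False, None, None

print(find_negative_run(df))  # (True, 6, 8) for this data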
Take the difference and bfill() the start so that you have the same number in the 0th element. Then set all negatives to the same value so we can treat them as being the same "slope". Then shift it to check whether the next number is the same, and iterate through, giving us a list of where it changes, which we assign to g.
df = pd.DataFrame({'entity': [5, 7, 5, 5, 5, 6, 3, 2, 0, 5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
init = [0]
for x in df['diff'] == df['diff'].shift(1):
    if x:
        init.append(init[-1])
    else:
        init.append(init[-1] + 1)
df['g'] = init[1:]
df
df
   entity  diff  g
0       5   2.0  1
1       7   2.0  1
2       5  -1.0  2
3       5   0.0  3
4       5   0.0  3
5       6   1.0  4
6       3  -1.0  5
7       2  -1.0  5
8       0  -1.0  5
9       5   5.0  6
Just wanted to present another solution that doesn't require a for-loop:
df = pd.DataFrame({'entity':[5,7,5,5,5,6,3,2,0,5]})
df['diff'] = df.entity.diff().bfill()
df.loc[df['diff'] < 0, 'diff'] = -1
df['g'] = (~(df['diff'] == df['diff'].shift(1))).cumsum()
df
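As for point 2 of the question, the length of each group, groupby.size() gives it directly; a short usage sketch:
sizes = df.groupby('g').size()   # number of rows in each run
print(sizes[sizes > 2])          # only the groups longer than 2 rows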
I have a simple dataframe which I am trying to split into multiple groups based on whether the x column value falls within a range.
e.g. if I have:
print(df1)
x
0 5
1 7.5
2 10
3 12.5
4 15
And wish to create a new dataframe, df2, of the values of x which are within the range 7-13 (7 < x < 13), leaving the rest in df1:
print(df1)
x
0 5
4 15
print(df2)
x
1 7.5
2 10
3 12.5
I have been able to split the dataframe based on a single boolean condition, e.g. (x < 11), using the following, but have been unable to extend this to a range of values.
thresh = 11
df2 = df1[df1['x'] < thresh]
print(df2)
x
0 5
1 7.5
2 10
You can create a boolean mask for the range (7 < x < 13) by combining the conditions (x > 7) and (x < 13) with AND. Then create df2 with this boolean mask; the remaining entries stay in df1 via the negation of the mask:
thresh_low = 7
thresh_high = 13
mask = (df1['x'] > thresh_low) & (df1['x'] < thresh_high)
df2 = df1[mask]
df1 = df1[~mask]
Result:
print(df2)
x
1 7.5
2 10.0
3 12.5
print(df1)
x
0 5.0
4 15.0
You can use between to categorize whether the condition is met and then groupby to split based on your condition (in pandas >= 1.3 the inclusive argument takes a string such as 'neither'; older versions used inclusive=False). Here I'll store the results in a dict:
d = dict(tuple(df1.groupby(df1['x'].between(7, 13, inclusive='neither'))))
d[True]
# x
#1 7.5
#2 10.0
#3 12.5
d[False]
# x
#0 5.0
#4 15.0
Or with only two possible splits you can manually define the Boolean Series and then split based on it.
m = df1['x'].between(7, 13, inclusive='neither')
df_in = df1[m]
df_out = df1[~m]
I'm having trouble coming up with a good solution to the problem below. Please share your thoughts.
I'm attempting to map X,Y spatial data (the physical locations of buckets) to categories / a groupby object in pandas. There is physical space between the buckets, so the data points within a bucket are close together while distinct buckets have larger gaps between them. Buckets are spaced out in two planes, X and Y, and contain differing numbers of points. Only X is shown below for simplicity:
df = pd.DataFrame([0,1,2,6,7,8,12,13,14],columns=['X'])
df['Xdiff'] = df['X'].diff() #Get Diff
X Xdiff
0 0 NaN
1 1 1.0
2 2 1.0
3 6 4.0
4 7 1.0
5 8 1.0
6 12 4.0
7 13 1.0
8 14 1.0
I would like a groupby object whose 1st group is data index locations 0-2, the second is 3-5, the third is 6-8. Something like below
grp1.index = [0,1,2]
grp2.index = [3,4,5]
grp3.index = [6,7,8]
I have attempted cut and groupby with no luck:
new_bins = df[df['Xdiff'] > 1]  # rows with larger diff values
    X  Xdiff
3   6    4.0
6  12    4.0
bins = pd.cut(df['X'], new_bins['X'])
0 NaN
1 NaN
2 NaN
3 NaN
4 (6.0, 12.0]
5 (6.0, 12.0]
6 (6.0, 12.0]
7 NaN
8 NaN
Name: X, dtype: category
Categories (1, interval[float64]): [(6.0, 12.0]]
I appreciate your help. Thanks!
Update:
I'm now working on adding groups outside of pandas then groupby:
groups = []
i = 0
for d in df['Xdiff']:
    if d < 4:
        groups.append(i)
    else:
        i += 1
        groups.append(i)
df['bin'] = groups
df.groupby('bin').count()
And the output:
     X  Xdiff
bin
1    3      2
2    3      3
3    3      3
This works but I believe there must be a better way in pandas. Thanks!
Let's try using boolean logic on Xdiff to create the groups instead:
import pandas as pd
df = pd.DataFrame([0, 1, 2, 6, 7, 8, 12, 13, 14], columns=['X'])
df['Xdiff'] = df['X'].diff()
# Create Bins Based on Where Xdiff is Not 1
df['bin'] = df['Xdiff'].fillna(1).ne(1).cumsum() + 1
# Groupby on new index 'bin'
df = df.groupby('bin').agg('count')
print(df)
Output:
     X  Xdiff
bin
1    3      2
2    3      3
3    3      3
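If you want the row labels of each group, as in the grp1/grp2/grp3 example above, rather than counts, the groups attribute of a groupby object exposes them; a small sketch (assuming the pre-aggregation df, which still has the bin column):
g = df.groupby('bin')
print(g.groups)  # maps each bin to the Index of its rows, e.g. 1 -> [0, 1, 2]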
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({
    'stuff_1_var_1': range(5),
    'stuff_1_var_2': range(2, 7),
    'stuff_2_var_1': range(3, 8),
    'stuff_2_var_2': range(5, 10)
})
stuff_1_var_1 stuff_1_var_2 stuff_2_var_1 stuff_2_var_2
0 0 2 3 5
1 1 3 4 6
I would like to groupby based on the column headers and then add the mean and median of each group as new columns. So my expected output looks like this:
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
Brief explanation:
we have two groups, stuff_1_var_ and stuff_2_var_, for which we calculate the mean and median per row. So, e.g., for stuff_1_var_ it would be:
# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and
(1 + 3) / 2 = 2
The values are then added as a new column stuff_1_var_mean; analogously for the median and for stuff_2_var_.
I got as far as:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T
          stuff_1_var_  stuff_2_var_
0 mean               1             4
  median             1             4
1 mean               2             5
  median             2             5
How can I do the final step(s)?
Your solution should be changed:
df = df.T
pattern = df.index.str.extract(r'(^stuff_\d_var_)', expand=False)
dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Unfortunately agg is not implemented for axis=1, so a possible solution is to create the mean and median separately and then concat:
dfgb = df.groupby(pattern, axis=1).agg(['mean','median'])
NotImplementedError: axis other than 0 is not supported
pattern = df.columns.str.extract(r'(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)
dfgb = pd.concat([g.mean().add_suffix('mean'),
                  g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0,2,1,3]]
print (dfgb)
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1 1 4 4
1 2 2 5 5
2 3 3 6 6
3 4 4 7 7
4 5 5 8 8
Here's a way you can do it:
col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]
df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)
col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]
df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)
print(df.iloc[:,-4:]) # showing last four new columns
stuff_1_var_mean stuff_1_var_median stuff_2_var_mean stuff_2_var_median
0 1.0 1.0 4.0 4.0
1 2.0 2.0 5.0 5.0
2 3.0 3.0 6.0 6.0
3 4.0 4.0 7.0 7.0
4 5.0 5.0 8.0 8.0
Of course, you can put it in a function to avoid repeating the same code.
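For instance, a hypothetical helper along these lines (add_stats and its arguments are illustrative, not part of the answer above):
def add_stats(df, prefix):
    # row-wise mean/median over all columns sharing `prefix`,
    # stored as new columns named `{prefix}mean` / `{prefix}median`
    cols = [c for c in df.columns if c.startswith(prefix)]
    df[f'{prefix}mean'] = df[cols].mean(axis=1)
    df[f'{prefix}median'] = df[cols].median(axis=1)
    return df

for prefix in ('stuff_1_var_', 'stuff_2_var_'):
    df = add_stats(df, prefix)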
I have a DataFrame and I would like to multiply (or divide) every n rows by a specific number from an array. A brief example is the following, where the letters stand for numbers.
df =
0 1
0 A B
1 C D
2 E F
3 G H
4 I J
5 K L
6 M N
7 O P
DataFrame (or numpy array):
0 1
0 W X
1 Y Z
I would like to obtain the following result:
Result =
0 1
0 A/W B/X
1 C/Y D/Z
2 E/W F/X
3 G/Y H/Z
4 I/W J/X
5 K/Y L/Z
6 M/W N/X
7 O/Y P/Z
Is there any way to solve this using df.groupby(df % 2).agg() or df.groupby(df % 2).apply()? I am handling a huge DataFrame and I believe a for loop would take more time than needed.
I know I have to use a function, but I cannot code one that does what I am looking for.
Thanks.
Try the following code. Start by defining a function to be applied to each group:
def dv(tbl):
    return tbl.divide(df2.values, axis='columns')
df2 is converted to its underlying values in order to "free" oneself from index alignment.
Then we read the number of rows in df2 (the size of a group in the grouping of df):
len2 = len(df2.index)
Then the actual division can be performed with a single instruction:
df.groupby(np.arange(len(df.index)) // len2).apply(dv)
np.arange(len(df.index)) // len2 divides df into groups containing the same number of rows as df2; the dv function (defined above) is then applied to each group.
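To make the grouping key concrete, for 8 rows and len2 == 2 the integer division yields:
import numpy as np
np.arange(8) // 2
# array([0, 0, 1, 1, 2, 2, 3, 3])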
For test purposes I created the first DataFrame (df) as:
0 1
0 10.0 11.0
1 12.0 13.0
2 14.0 15.0
3 16.0 17.0
4 18.0 19.0
5 20.0 21.0
6 22.0 23.0
7 24.0 25.0
and the second (df2) as:
0 1
0 2.0 2.5
1 3.0 3.5
The result was:
0 1
0 5.000000 4.400000
1 4.000000 3.714286
2 7.000000 6.000000
3 5.333333 4.857143
4 9.000000 7.600000
5 6.666667 6.000000
6 11.000000 9.200000
7 8.000000 7.142857
Of course, the above code was for division. If you want to multiply, then define a function:
def ml(tbl):
    return tbl.multiply(df2.values, axis='columns')
and apply it calling:
df.groupby(np.arange(len(df.index)) // len2).apply(ml)
You can modify the index of the first dataframe as follows:
df.index = df.index % 2
Then merge on index:
df = df.join(df2, lsuffix='_l', rsuffix='_r')
Then what you want will be something like this:
df['ratio1'] = df['0_l'] / df['0_r']
df['ratio2'] = df['1_l'] / df['1_r']
To get the exact form of your answer:
column_map = {'ratio1': 0, 'ratio2': 1}
df = df[['ratio1', 'ratio2']].rename(columns= column_map)
This should do the trick without requiring a loop or using apply:
df.iloc[::2, 0] = df.iloc[::2, 0] / df2.iloc[0, 0]
df.iloc[1::2, 0] = df.iloc[1::2, 0] / df2.iloc[1, 0]
df.iloc[::2, 1] = df.iloc[::2, 1] / df2.iloc[0, 1]
df.iloc[1::2, 1] = df.iloc[1::2, 1] / df2.iloc[1, 1]
This may also work, and could be used with any number of columns:
df.iloc[::2, :] = df.iloc[::2, :] / df2.iloc[0, :]
df.iloc[1::2, :] = df.iloc[1::2, :] / df2.iloc[1, :]
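As for the groupby approach the question asks about, one possible sketch keys the groups on the index modulo the number of rows in df2 (this assumes len(df) is a multiple of len(df2), as in the example, and builds on df and df2 from above):
result = pd.concat(
    [grp / df2.iloc[key].values for key, grp in df.groupby(df.index % len(df2))]
).sort_index()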
I have the following dataframe:
W Y
0 1 5
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
...
as the table's rows keep going until index 240. I want to get the following dataframe:
W Y
0 1 5
1 2 7
2 3 10
3 4 14
4 5 19
5 6 27
6 7 37
...
Please note that the values of W are arbitrary (just to make the computation here easier; in fact they are np.random.normal in my real program).
Or in other words:
If the Y index is 0, then the value of Y is 5;
If the Y index is between 1 and 4 (inclusive), then Y_i is the sum of the previous element of Y and the current element of W;
If the Y index is >= 5, then the value of Y is: Y_{i-1} + Y_{i-4} - Y_{i-5} + W_i
Using iipr's answer I've managed to compute the first five values by running:
import numpy as np

def calculate(add):
    global value
    value = value + add
    return value
df.Y = np.nan
value = 5
df.loc[0, 'Y'] = value
df.loc[1:5, 'Y'] = df.loc[1:5].apply(lambda row: calculate(*row[['W']]), axis=1)
but I haven't managed to compute the rest of values (where index>=5).
Does anyone have any suggestions?
I wouldn't recommend using apply in this case.
Why not simply use two loops, one for each differently defined range:
for i in df.index[1:5]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1]
for i in df.index[5:]:
    df.loc[i, 'Y'] = df.W.loc[i] + df.Y.loc[i-1] + df.Y.loc[i-4] - df.Y.loc[i-5]
This is straightforward, and next week you will still know what the code does.
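If the frame ever grows well beyond 240 rows, a sketch of the same recurrence over plain numpy arrays (the to_numpy round-trip is the only new ingredient here) avoids the per-element .loc indexing:
import numpy as np

w = df['W'].to_numpy()
y = np.empty(len(w))
y[0] = 5  # the given starting value
for i in range(1, min(5, len(w))):
    y[i] = y[i - 1] + w[i]
for i in range(5, len(w)):
    y[i] = y[i - 1] + y[i - 4] - y[i - 5] + w[i]
df['Y'] = y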