I need to perform a cumulative sum on a grouped data frame, but I need it to reset whenever the previous value is negative and the current value is positive.
In R I could apply a condition to the grouping with the ave() function, but I can't do that in Python, so I'm having trouble thinking of a solution. Can anyone help me out?
Here is a sample:
import pandas as pd
df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40]*10, })
df['CS'] = df.groupby(['GROUP', 'PRODUCT']).FORECAST.cumsum()
# Reset the cumsum when the previous value is <= 0 and the current value is > 0
# condition: (df.FORECAST > 0) & (df.groupby(['GROUP', 'PRODUCT']).FORECAST.shift(1).fillna(0) <= 0)
This solution will reset the sum in any case where the values being summed change from negative to positive (regardless of whether the dataset is as neatly periodic as it is in your example):
import numpy as np
import pandas as pd
df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40]*10, })
cumsum = np.cumsum(df['FORECAST'])
# Array of indices where sum should be reset
reset_ind = np.where(df['FORECAST'].diff() > 0)[0]
# Sums that need to be subtracted at resets
subs = cumsum[reset_ind-1].values
# Repeat subtraction values for every entry BETWEEN resets and values after final reset
rep_subs = np.repeat(subs, np.hstack([np.diff(reset_ind), df['FORECAST'].size - reset_ind[-1]]))
# Stack together values before first reset and resetted sums
df['CS'] = np.hstack([cumsum[:reset_ind[0]], cumsum[reset_ind[0]:] - rep_subs])
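One caveat: if FORECAST never switches from negative to positive, reset_ind is empty and the reset_ind[0] / reset_ind[-1] lookups above will raise an IndexError. A minimal guard, assuming the plain cumulative sum is the desired result in that case, could look like this:
if reset_ind.size == 0:
    df['CS'] = cumsum  # no reset points, so the ordinary cumulative sum stands
else:
    subs = cumsum[reset_ind - 1].values
    rep_subs = np.repeat(subs, np.hstack([np.diff(reset_ind), df['FORECAST'].size - reset_ind[-1]]))
    df['CS'] = np.hstack([cumsum[:reset_ind[0]], cumsum[reset_ind[0]:] - rep_subs])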
Alternatively, based on this solution to a similar question (and my realisation of the usefulness of groupby):
import pandas as pd
import numpy as np
df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40]*10, })
# Create indices to group sums together
df['cumsum'] = (df['FORECAST'].diff() > 0).cumsum()
# Perform group-wise cumsum
df['CS'] = df.groupby(['cumsum'])['FORECAST'].cumsum()
# Remove intermediary cumsum column
df = df.drop(['cumsum'], axis=1)
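Since the original data is grouped by GROUP and PRODUCT, a group-aware variant of the same idea is sketched below (assuming the reset rule is "start a new run whenever FORECAST rises within its group"); for the sample data the CS pattern 100, 60, 20, -20 simply repeats:
import pandas as pd
df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40,
                   'FORECAST': [100, -40, -40, -40] * 10})
# Flag rows where FORECAST rises relative to the previous row of the same group
rise = df.groupby(['GROUP', 'PRODUCT']).FORECAST.diff() > 0
# Running count of those flags gives a run id within each group
df['run'] = rise.astype(int).groupby([df['GROUP'], df['PRODUCT']]).cumsum()
# Cumulative sum restarted at every run boundary, within each group
df['CS'] = df.groupby(['GROUP', 'PRODUCT', 'run']).FORECAST.cumsum()
df = df.drop(columns='run')
print(df['CS'].head(8).tolist())  # [100, 60, 20, -20, 100, 60, 20, -20]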
In the following code, the column df2['age_groups'] definitely has one null value.
I am trying to append this null value to a list, but the list turns out to be empty.
Why am I running into this problem?
import numpy as np
import pandas as pd
np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
null_list=[]
for i in df2['age_groups']:
    if i == float('nan'):
        null_list.append(i)
print(null_list) #empty list
print(df2['age_groups'].isna().sum()) # it shows that there is one null value
and
type(i) == float('nan')
generates the same outcome
To test for missing values (NaN and None are handled the same way in pandas), use pd.isna, not is and not ==:
null_list=[]
for i in df2['age_groups']:
    if pd.isna(i):
        null_list.append(i)
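As a side note, the loop is not strictly needed; boolean indexing with isna gives the same list directly (a small sketch):
null_list = df2.loc[df2['age_groups'].isna(), 'age_groups'].tolist()
print(null_list)  # [nan]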
You only need to fix the if condition. See also this SO question.
Try this:
np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
null_list=[]
for i in df2['age_groups']:
    if i is np.nan:  # <- code changed here to np.nan
        null_list.append(i)
print(null_list)
print(df2['age_groups'].isna().sum())
Output:
[nan]
1
I am new to programming and Python, could you help me?
I have a data frame which looks like this:
import pandas as pd
d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)
I want to slice the data whenever value == 100 and then plot all the slices in one figure.
So my questions are: how do I slice or cut the data as described, and what is the best structure for saving the slices in order to plot them?
Note 1: the value column has no frequency I can rely on; it varies from 0 to 100, and time is arbitrary.
Note 2: I already tried the solution from "How can I slice one column in a dataframe into several series based on a condition", but I get the same table back:
decreased_value = df[df['value'] <= 100][['time', 'value']].reset_index(drop=True)
Thanks in advance!
EDIT:
Here's a simpler way of handling my first answer (thanks to @aneroid for the suggestion).
Get the indices where value==100 and add +1 so that these land at the bottom of each slice:
indices = df.index[df['value'] == 100] + 1
Then use numpy.split (thanks to this answer for that method) to make a list of dataframes:
df_list = np.split(df, indices)
Then do your plotting for each slice in a for loop:
for sub_df in df_list:
    plt.plot(sub_df['time'], sub_df['value'])  # requires matplotlib.pyplot imported as plt
plt.show()
VERBOSE / FROM SCRATCH METHOD:
You can get the indices for where value==100 like this:
indices = df.index[df.value==100]
Then add the smallest and largest indices in order to not leave out the beginning and end of the df:
indices = indices.insert(0,0).to_list()
indices.append(df.index[-1]+1)
Then cycle through a while loop to cut up the dataframe and put each slice into a list of dataframes:
i = 0
df_list = []
while i + 1 < len(indices):
    df_list.append(df.iloc[indices[i]:indices[i+1]])
    i += 1
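As a purely stylistic alternative, the same slicing can be written by pairing consecutive boundaries with zip:
df_list = [df.iloc[start:stop] for start, stop in zip(indices[:-1], indices[1:])]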
I solved the problem using a for loop, which slices and plots at the same time without using the np.split function, and keeps the data structure intact.
Thanks to the previous answer by @k_n_c, which helped me improve it.
import numpy as np
import matplotlib.pyplot as plt
slices = df.index[df['value'] == 100]  # the sample df uses the 'value' column
slices = slices + 1
slices = np.insert(slices, 0, 0, axis=0)
slices = np.append(slices, df.index[-1] + 1)
prev_ind = 0
for ind in slices:
    temp = df.iloc[prev_ind:ind, :]
    plt.plot(temp.time, temp.value)
    prev_ind = ind
plt.show()
import pandas as pd
temp = [79, 80, 81, 80, 80, 79, 76, 75, 76, 78, 80, 81]
for i in range(len(temp)):
    if temp[i] <= 80:
        level = 0
    elif temp[i] > 80 and temp[i] <= 100:
        level = 1
    elif temp[i] <= 75:
        n_level = 0
    elif temp[i] > 75 and temp[i] <= 95:
        n_level = 1
    df = pd.DataFrame([[temp[i], level]], columns=['temp1', 'level1', 'newlevel'])
    print(df)
I am unable to get the output, which is expected to look like this:
temp   level   newlevel
79     0       0
80     0       0
81     1       1
Please find the solution below. I have used list comprehensions to create the data required to build the data frame. Change the conditions inside the list comprehensions to meet your needs.
import pandas as pd
import numpy as np
temp=[79,80,81,80,80,79,76,75,76,78,80,81]
level=[0 if i<=80 else 1 for i in temp]
n_level=[0 if i<=75 else 1 for i in temp]
#print(temp)
#print(level)
#print(n_level)
df=pd.DataFrame({'temp':temp,'level':level,'newlevel':n_level})
df
You can generate your DataFrame using solely Pandas methods.
To generate both "level" columns, it is enough to use cut:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1])})
Note: Both "level" columns of the output DataFrame have category type.
If you are unhappy about that, cast them to int:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]).astype(int),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1]).astype(int)})
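A quick check of the resulting columns (a sketch; note that with the (0, 75] / (75, 1000] edges, 79 and 80 land in the upper newlevel bin, which differs from the expected table in the question):
print(df['level1'].tolist()[:3])    # [0, 0, 1]
print(df['newlevel'].tolist()[:3])  # [1, 1, 1]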
I have a list:
a = [15, 50, 75]
Using the above list, I have to create smaller dataframes from the main dataframe by filtering out rows on the index (the number of rows is defined by the list).
Let's say my main dataframe is df.
The dataframes I'd like to have are df1 (from row index 0-15), df2 (from row index 15-65), and df3 (from row index 65-125).
Since these are just three, I can easily use something like the below:
limit1 = a[0]
limit2 = a[1] + limit1
limit3 = a[2] + limit2
df1 = df.loc[df.index <= limit1]
df2 = df.loc[(df.index > limit1) & (df.index <= limit2)]
df2 = df2.reset_index(drop = True)
df3 = df.loc[(df.index > limit2) & (df.index <= limit3)]
df3 = df3.reset_index(drop = True)
But what if I want to implement this with a long list on the main dataframe df? I am looking for something iterable like the following (which doesn't work):
df1 = df.loc[df.index <= limit1]
for i in range(2, 3):
    for j in range(2, 3):
        for k in range(2, 3):
            df[i] = df.loc[(df.index > limit[j]) & (df.index <= limit[k])]
            df[i] = df[i].reset_index(drop=True)
            print(df[i])
You could modify your code by building the dataframes iteratively, cutting slices off the end of the main dataframe.
dfs = []  # this list contains your partitioned dataframes
a = [15, 50, 75]
for idx in a[::-1]:
    dfs.insert(0, df.iloc[idx:])
    df = df.iloc[:idx]
dfs.insert(0, df)  # add the last remaining dataframe
print(dfs)
Another option is to use a list comprehension as follows:
a = [0, 15, 50, 75]
dfs = [df.iloc[a[i]:a[i+1]] for i in range(len(a)-1)]
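If, as in the question, the list holds slice lengths rather than absolute boundaries, a cumulative sum can convert it first (a sketch under that assumption):
import numpy as np
a = [15, 50, 75]
bounds = [0] + list(np.cumsum(a))  # [0, 15, 65, 140]
dfs = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]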
This does it. It's better to use dictionaries if you want to store multiple variables and call them later. It's bad practice to create variables in an iterative way, so always avoid it.
df = pd.DataFrame(np.linspace(1, 75, 75), columns=['a'])
a = [15, 50, 25]
d = {}
b = 0
for n, i in enumerate(a):
    d[f'df{n}'] = df.iloc[b:b+i]
    b += i
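The stored frames can then be retrieved by key, for example:
print(d['df0'].shape)  # (15, 1) for the sample frame above
print(d['df1'].head())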
Summary
I am trying to iterate over a large dataframe, identify unique groups based on several columns, and apply the mean to another column based on how many rows are in each group. My current approach is very slow when iterating over a large dataset and applying the average function across many columns. Is there a way I can do this more efficiently?
Example
Here's an example of the problem. I want to find unique combinations of ['A', 'B', 'C']. For each unique combination, I want the sum of column 'D' divided by the number of rows in the group.
Edit:
The resulting dataframe should preserve the duplicated groups, but with the edited column 'D'.
import pandas as pd
import numpy as np
import datetime

def time_mean_rows():
    # Generate some random data
    A = np.random.randint(0, 5, 1000)
    B = np.random.randint(0, 5, 1000)
    C = np.random.randint(0, 5, 1000)
    D = np.random.randint(0, 10, 1000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D]).T
    df.columns = ['A', 'B', 'C', 'D']

    tstart = datetime.datetime.now()

    # Get unique combinations of A, B, C
    unique_groups = df[['A', 'B', 'C']].drop_duplicates().reset_index()

    # Iterate unique groups
    normalised_solutions = []
    for idx, row in unique_groups.iterrows():
        # Subset dataframe to the unique group
        sub_df = df[
            (df['A'] == row['A']) &
            (df['B'] == row['B']) &
            (df['C'] == row['C'])
        ]

        # If more than one solution, get mean of column D
        num_solutions = len(sub_df)
        if num_solutions > 1:
            sub_df.loc[:, 'D'] = sub_df.loc[:, 'D'].values.sum(axis=0) / num_solutions
        normalised_solutions.append(sub_df)

    # Concatenate results
    res = pd.concat(normalised_solutions)

    tend = datetime.datetime.now()
    time_elapsed = (tend - tstart).seconds
    print(time_elapsed)
I know the section causing the slowdown is the part where num_solutions > 1. How can I do this more efficiently?
Hm, why don't you use groupby?
df_res = df.groupby(['A', 'B', 'C'])['D'].mean().reset_index()
This is a complement to AT_asks's answer, which only gave the first part of the solution.
Once we have df.groupby(['A', 'B', 'C'])['D'].mean(), we can use it to change the value of column 'D' in a copy of the original dataframe, provided we use a dataframe sharing the same index. The global solution is then:
res = df.set_index(['A', 'B', 'C']).assign(
    D=df.groupby(['A', 'B', 'C'])['D'].mean()).reset_index()
This will contain the same rows (even if in a different order) as the res dataframe from the OP's question.
Here's a solution I found
Using groupby as suggested by AT, then merging back to the original df and dropping the original ['D', 'E'] columns. Nice speedup!
from timeit import default_timer as timer
from datetime import timedelta

def time_mean_rows():
    # Generate some random data
    np.random.seed(seed=42)
    A = np.random.randint(0, 10, 10000)
    B = np.random.randint(0, 10, 10000)
    C = np.random.randint(0, 10, 10000)
    D = np.random.randint(0, 10, 10000)
    E = np.random.randint(0, 10, 10000)

    # init dataframe
    df = pd.DataFrame(data=[A, B, C, D, E]).T
    df.columns = ['A', 'B', 'C', 'D', 'E']

    tstart_grpby = timer()
    cols = ['D', 'E']
    group_df = df.groupby(['A', 'B', 'C'])[cols].mean().reset_index()

    # Merge the group means back onto the original df
    df = pd.merge(df, group_df, how='left', on=['A', 'B', 'C'], suffixes=('_left', ''))

    # Get left columns (have not been normalised) and drop
    drop_cols = [x for x in df.columns if x.endswith('_left')]
    df.drop(drop_cols, inplace=True, axis='columns')

    tend_grpby = timer()
    time_elapsed_grpby = timedelta(seconds=tend_grpby - tstart_grpby).total_seconds()
    print(time_elapsed_grpby)
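For what it's worth, a further simplification (a sketch, not a timed comparison) skips the merge-and-drop step entirely by using transform, which broadcasts the group means straight back onto the original rows:
cols = ['D', 'E']
df[cols] = df.groupby(['A', 'B', 'C'])[cols].transform('mean')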