Pandas: dynamically add rows before & after groups - python

I have a Pandas data frame:
df = pd.DataFrame(data={"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0], "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
   Value  Group
0    5.0      1
1    0.0      1
2    0.0      1
3    8.0      1
4    0.0      2
5    3.0      2
6    0.0      2
7    2.0      2
8    7.0      2
9    0.0      2
Also, I calculated the cumulative sum with respect to two rows for each group:
{"2-cumsum Group 1": array([5., 8.]), "2-cumsum Group 2": array([7., 5.])}
E.g. array([5., 8.]) because array([5.0, 0.0]) (rows 0 and 1) + array([0.0, 8.0]) (rows 2 and 3) = array([5., 8.]).
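For reference, these arrays can be computed by reshaping each group's values into rows of two and summing down the columns (a sketch; it assumes each group's length is a multiple of 2):
cumsums = {f"2-cumsum Group {k}": g["Value"].to_numpy().reshape(-1, 2).sum(axis=0)
           for k, g in df.groupby("Group")}
# {'2-cumsum Group 1': array([5, 8]), '2-cumsum Group 2': array([7, 5])}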
What I now need is to insert exactly two rows at the beginning of df, in between each pair of groups and at the end of df, so that I get the following data frame (the gaps are only for illustration):
    Value  Group
0    10.0      0  # Initialize with 10.0
1    10.0      0  # Initialize with 10.0

2     5.0      1
3     0.0      1
4     0.0      1
5     8.0      1

6     5.0      0  # 5.0 ("2-cumsum Group 1"[0])
7     8.0      0  # 8.0 ("2-cumsum Group 1"[1])

8     0.0      2
9     3.0      2
10    0.0      2
11    2.0      2
12    7.0      2
13    0.0      2

14    7.0      0  # 7.0 ("2-cumsum Group 2"[0])
15    5.0      0  # 5.0 ("2-cumsum Group 2"[1])
Please consider that the original data frame is much larger and has more than just two columns, and that I need to dynamically append rows with varying entries. E.g. the rows to append should fill any additional columns with "10.0" entries. Also, the number of rows the cumulative sum is taken over (2 in this case) is variable (it could be 8).
There are many occasions where I need to generate rows based on other rows in a data frame, but I haven't found any effective solution other than for-loops with temporary cache lists that carry values over from previous iterations.
I would appreciate some help.
Thank you in advance and kind regards.
My original code applied to the exemplary data, in case anybody needs it. It's very convoluted and inefficient, so only consider it if you really need to:
import pandas as pd
import numpy as np
# Some stuff
df = pd.DataFrame(data={"Group1": ["a", "a", "b", "b", "b"],
                        "Group2": [1, 2, 1, 2, 3],
                        "Group3": [1, 9, 2, 1, 1],
                        "Value": [5, 8, 3, 2, 7]})
length = 2
max_value = 20
g = df['Group1'].unique()
h = df["Group2"].unique()
i = range(1,df['Group3'].max()+1)
df2 = (df.set_index(['Group1', 'Group2', 'Group3'])
         .reindex(pd.MultiIndex.from_product([g, h, i]))
         .assign(cc=lambda x: x.groupby(level=0).cumcount() // length)
         .rename_axis(['Group1', 'Group2', 'Group3'], axis=0))
mask = df2['Value'].isna().groupby([pd.Grouper(level=0), df2['cc']]).transform('all')
df2 = df2.loc[~mask].reset_index().fillna(0).drop('cc', axis=1)
values = df2["Value"].copy().to_numpy()
values = np.array_split(values, len(values)/length)
stock = df2["Group1"].copy().to_numpy()
stock = np.array_split(stock, len(stock)/length)
# Generate the "Group" column and generate the "2-cumsum" arrays
# stored in the "volumes" variable
k = 0
a_groups = []
values_cache = []
volumes = []
for e, i in enumerate(values):
    if any(stock[e] == stock[e-1]):
        if np.any(i + values_cache >= max_value):
            k += 1
            volumes.append(values_cache)
            values_cache = i
        else:
            values_cache += i
        a_groups.extend([k] * length)
    else:
        k += 1
        if e:
            volumes.append(values_cache)
        values_cache = i
        a_groups.extend([k] * length)
volumes.append(values_cache)
df2["Group"] = a_groups
print(df2[["Value", "Group"]])
print("\n")
print(f"2-cumsums: {volumes}")
"""Output
Value Group
0 5.0 1
1 0.0 1
2 0.0 1
3 8.0 1
4 0.0 2
5 3.0 2
6 0.0 2
7 2.0 2
8 7.0 2
9 0.0 2
2-cumsums: [array([5., 8.]), array([7., 5.])]"""
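For what it's worth, here is a minimal sketch of a shorter construction for the exemplary data: build one small frame per boundary and concatenate everything in order. It assumes each group's length is a multiple of length, and any extra columns would be added to the boundary frames in the same way (init_value and the Group label 0 follow the conventions of the desired output above):

import pandas as pd

df = pd.DataFrame(data={"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0],
                        "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
length = 2         # window size of the "2-cumsum" (could be 8)
init_value = 10.0  # value for the two rows prepended at the start

# Start with the initialization rows, then alternate group / cumsum rows
pieces = [pd.DataFrame({"Value": [init_value] * length, "Group": 0})]
for _, g in df.groupby("Group", sort=False):
    pieces.append(g)
    # length-wise cumulative sum: reshape into rows of `length`, sum down the columns
    csum = g["Value"].to_numpy().reshape(-1, length).sum(axis=0)
    pieces.append(pd.DataFrame({"Value": csum, "Group": 0}))
result = pd.concat(pieces, ignore_index=True)

This reproduces the 16-row frame shown above, with the inserted rows labelled Group 0.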

Related

Calculate the sum of absolute difference of a column over two adjacent days on id

I have a dataframe like this
import pandas as pd
df = pd.DataFrame(
    dict(
        day=[1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5],
        id=[1, 2, 3, 2, 3, 4, 1, 2, 2, 3, 1, 3, 4],
        value=[1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
    )
)
I want to calculate the sum of absolute differences of the 'value' column between every two consecutive days (ordered from smallest to largest), matched by id, treating null/None/unmatched as 0. To be more specific, the result for days 1 and 2 can be calculated as:
  id: 1     id: 2     id: 3     id: 4     (difference for each id; treat non-existent as 0)
(1 - 0) + (3 - 2) + (5 - 2) + (2 - 0) = 7
And, the final result for my example should be:
 day  res
 1-2    7
 2-3    9
 3-4    9
 4-5   16
How can I achieve the result I want with idiomatic pandas code?
Is it possible to achieve the goal via groupby and some shift operations? One challenge I have with shift is that non-overlapping ids between two days cannot be handled.
Thanks so much for your help!
Pivot the dataframe to reshape it, then calculate the sum of absolute differences:
p = df.pivot(index='id', columns='day', values='value').fillna(0)
# day    1    2
# id
# 1    1.0  0.0
# 2    2.0  3.0
# 3    2.0  5.0
# 4    0.0  2.0
sum(abs(p[1] - p[2]))
# 7
To calculate the sums of absolute differences between all pairs of consecutive days:
p = df.pivot(index='id', columns='day', values='value').fillna(0)
# day    1    2    3    4    5
# id
# 1    1.0  0.0  1.0  0.0  5.0
# 2    2.0  3.0  2.0  7.0  0.0
# 3    2.0  5.0  0.0  3.0  3.0
# 4    0.0  2.0  0.0  0.0  4.0
s = p.diff(axis=1).abs().iloc[:, 1:].sum()
# day
# 2     7.0
# 3     9.0
# 4     9.0
# 5    16.0
# dtype: float64
s.index = [f'{x}-{y}' for x, y in zip(p.columns[:-1], p.columns[1:])]
# 1-2     7.0
# 2-3     9.0
# 3-4     9.0
# 4-5    16.0
# dtype: float64

Group dataframe according to start and stop columns

I'd like to cut/group a pandas DataFrame according to a start and a stop column, but only in the case of start -> stop.
I would like the range of indexes from a 'start' non-zero value to a 'stop' non-zero value, but only if the 'start' non-zero value is followed next by a 'stop' non-zero value, running through the indices from top to bottom.
I attached some code creating a simplified version of the problem and a corresponding image.
import numpy as np
import pandas as pd

col1 = np.zeros(10)
col2 = np.zeros(10)
col1[[0, 1, 5, 8]] = 1
col2[[3, 6, 7, 9]] = 1
df = pd.DataFrame({'start': col1, 'stop': col2})
The desired output would group the indexes somewhat like:
[(1,2,3), (5,6), (8,9)]
Additional info in case this would simplify things:
Merging the columns would be fine.
My original data frame has a pd.TimedeltaIndex.
[Image: visual clarification of the desired result]
First we need to look at the intervals of start and stop and find out which are "valid" interval ends:
>>> ends = df.index.to_series().where(df['stop'].ne(0))
>>> starts = df.index.to_series().where(df['start'].ne(0))
>>> ends
0 NaN
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
6 6.0
7 7.0
8 NaN
9 9.0
dtype: float64
>>> starts
0 0.0
1 1.0
2 NaN
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
8 8.0
9 NaN
dtype: float64
Now we can try to get for each valid start the next valid end:
>>> next_end = ends.bfill().rename('end')
>>> valid_starts = starts.dropna().rename('start')
>>> candidates = valid_starts.to_frame().join(next_end, how='left')
>>> candidates
start end
0 0.0 3.0
1 1.0 3.0
5 5.0 6.0
8 8.0 9.0
Here we see that there is an issue with the interval starting at 0: another interval starts later (at 1) so [0, 3] is not valid and we should only keep [1, 3]. This could be done with groupby + max for example:
>>> intervals = candidates.groupby('end')['start'].max().reset_index().astype(int)
>>> intervals
end start
0 3 1
1 6 5
2 9 8
Finally generating the list of indexes from the endpoints is easy:
>>> intervals.agg(lambda s: list(range(s['start'], s['end'] + 1)), axis='columns')
0    [1, 2, 3]
1       [5, 6]
2       [8, 9]
dtype: object

How to create conditional columns in Pandas Data Frame in which column values are based on other columns

I am new to Python; I am attempting what would be a conditional mutate in R's dplyr.
In short, I would like to create a new column in the data frame called Result where: if df['Test'] is greater than 1, df['Result'] equals the respective df['Count'] for that row; if it is lower than 1, then df['Result'] is
df['Count'] * df['Test']
I have tried df['Result'] = df['Test'].apply(lambda x: df['Count'] if x >= 1 else ...). Unfortunately this puts a whole Series in each entry. I have also attempted to write small functions, which likewise return Series.
I would like the final DataFrame to look like this...
no_  Test  Count  Result
  1   2        1       1
  2   3        5       5
  3   4        1       1
  4   6        2       2
  5   0.5      2       1
You can use np.where:
import numpy as np
df['Result'] = np.where(df['Test'] > 1, df['Count'], df['Count'] * df['Test'])
Output:
   No_  Test  Count  Result
0    1   2.0      1     1.0
1    2   3.0      5     5.0
2    3   4.0      1     1.0
3    4   6.0      2     2.0
4    5   0.5      2     1.0
You can work it out with a list comprehension:
df['Result'] = [df['Count'][i] if df['Test'][i] > 1 else
                df['Count'][i] * df['Test'][i]
                for i in range(df.shape[0])]
Here is a way to do this:
import pandas as pd
df = pd.DataFrame(columns=['Test', 'Count'],
                  data={'Test': [2, 3, 4, 6, 0.5], 'Count': [1, 5, 1, 2, 2]})
df['Result'] = df['Count']
df.loc[df['Test'] < 1, 'Result'] = df['Test'] * df['Count']
Output:
   Test  Count  Result
0   2.0      1     1.0
1   3.0      5     5.0
2   4.0      1     1.0
3   6.0      2     2.0
4   0.5      2     1.0

Filling Pandas columns with lists of unequal lengths

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of lists of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
import pandas as pd
from itertools import zip_longest
from numpy import array

# Column headers
df_cols = ["f1", "f2"]
# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)
# Create list of dataframes to iterate through
df_list = [df1, df2]
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
   f1   f2
0   0    2
1   1    5
2   3    6
3   4    8
4   7  NaN
print(df2)
   f1   f2
0   0    3
1   1    4
2   2    5
3   6    8
4   7  NaN
We'll leverage pd.Series to attach an appropriate index, which lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
   f1   f2
0   0  2.0
1   1  5.0
2   3  6.0
3   4  8.0
4   7  NaN
print(df2)
   f1   f2
0   0  3.0
1   1  4.0
2   2  5.0
3   6  8.0
4   7  NaN
Setup
from numpy import array
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1","f2"]
You could predefine the size of your DataFrames (by setting the index range to the length of the longest column you want to add [or any size bigger than the longest column]) like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
   f1   f2
0   0    2
1   1    5
2   3    6
3   4    8
4   7  NaN
print(df2)
   f1   f2
0   0    3
1   1    4
2   2    5
3   6    8
4   7  NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
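For instance, the question's inner loop could drop the counter variable entirely (a sketch of the same assignment):
for counter, col in enumerate(df_cols):
    df[col] = test_index_list[counter] + 1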
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see #pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
                 pd.DataFrame({'B': nx_lists_into_df[0][1]})],
                axis=1)
#    A    B
# 0  0  2.0
# 1  1  5.0
# 2  3  6.0
# 3  4  8.0
# 4  7  NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
                 pd.DataFrame({'D': nx_lists_into_df[1][1]})],
                axis=1)
#    C    D
# 0  0  3.0
# 1  1  4.0
# 2  2  5.0
# 3  6  8.0
# 4  7  NaN

DataFrame calculating average purchase price

I have a dataframe with two columns: quantity and price.
df = pd.DataFrame([
    [ 1, 5],
    [-1, 6],
    [ 2, 3],
    [-1, 2],
    [-1, 4],
    [ 1, 2],
    [ 1, 3],
    [ 1, 4],
    [-2, 5]], columns=['quantity', 'price'])
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
I have added two new columns amount and cum_qty (cumulative quantity).
Now the dataframe looks like this (positive quantity represents buys, negative quantity represents sells):
   quantity  price  amount  cum_qty
0         1      5       5        1
1        -1      6      -6        0
2         2      3       6        2
3        -1      2      -2        1
4        -1      4      -4        0
5         1      2       2        1
6         1      3       3        2
7         1      4       4        3
8        -2      5     -10        1
I would like to calculate average buy price.
Every time cum_qty = 0, quantity and amount should be reset to zero.
So we are looking at the rows with index = [5, 6, 7].
In each of those rows one item is bought, at prices 2, 3 and 4, which means I have 3 items on stock at an average price of 3 [(2 + 3 + 4)/3].
After the sell at index = 8 has happened (sell transactions don't change the buy price), I will have one item left at a buy price of 3.
So, basically, I have to divide the cumulative buy amounts by the cumulative buy quantities, counting from the last time the cumulative quantity was zero.
How do I calculate the buy price of the position on hand resulting from all transactions with a pandas DataFrame?
Here is a different solution using a loop:
import pandas as pd
import numpy as np
# Original data
df = pd.DataFrame({
    'quantity': [ 1, -1, 2, -1, -1, 1, 1, 1, -2],
    'price': [5, 6, 3, 2, 4, 2, 3, 4, 5]
})
# Process the data and add the new columns
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
df['prev_cum_qty'] = df['cum_qty'].shift(1, fill_value=0)
df['average_price'] = np.nan
for i, row in df.iterrows():
    if row['quantity'] > 0:
        df.iloc[i, df.columns == 'average_price'] = (
            row['amount'] +
            df['average_price'].shift(1, fill_value=df['price'][0])[i] *
            df['prev_cum_qty'][i]
        ) / df['cum_qty'][i]
    else:
        df.iloc[i, df.columns == 'average_price'] = df['average_price'][i-1]
df = df.drop('prev_cum_qty', axis=1)
An advantage of this approach is that it will also work if there are new buys
before the cum_qty gets to zero. As an example, suppose there was a new buy
of 5 at the price of 3, that is, run the following line before processing the
data:
# Add more data, exemplifying a different situation
df = pd.concat([df, pd.DataFrame([{'quantity': 5, 'price': 3}])], ignore_index=True)
I would expect the following result:
   quantity  price  amount  cum_qty  average_price
0         1      5       5        1            5.0
1        -1      6      -6        0            5.0
2         2      3       6        2            3.0
3        -1      2      -2        1            3.0
4        -1      4      -4        0            3.0
5         1      2       2        1            2.0
6         1      3       3        2            2.5
7         1      4       4        3            3.0
8        -2      5     -10        1            3.0
9         5      3      15        6            3.0  # Not 4.0
That is, since there was still 1 item bought at the price 3, the cum_qty is now 6, and the average price is still 3.
Based on my understanding, you need the buy price for each trading cycle, so you can try this:
df['new_index'] = df.cum_qty.eq(0).shift().cumsum().fillna(0.)  # group id for each trading cycle
df = df.loc[df.quantity > 0]  # kick out the selling actions
df.groupby('new_index').apply(lambda x: (x.amount.sum() / x.quantity.sum()))
new_index
0.0    5.0  # 1st ave price 5
1.0    3.0  # 2nd ave price 3
2.0    3.0  # 3rd ave price 3; this cycle has not ended, you still hold a position of 1
dtype: float64
EDIT1 for your additional requirement:
DF = df.groupby('new_index', as_index=False).apply(lambda x: x.amount.cumsum() / x.cum_qty).reset_index()
DF.index = DF.level_1
DF.drop(['level_0', 'level_1'], axis=1, inplace=True)
pd.concat([df, DF], axis=1)
Out[572]:
         quantity  price  amount  cum_qty  new_index    0
level_1
0               1      5       5        1        0.0  5.0
2               2      3       6        2        1.0  3.0
5               1      2       2        1        2.0  2.0
6               1      3       3        2        2.0  2.5
7               1      4       4        3        2.0  3.0
df[df['cum_qty'].map(lambda x: x == 0)].index
will give you the rows at which you have a cum_qty of 0.
df[df['cum_qty'].map(lambda x: x == 0)].index.max()
gives you the last row with a cum_qty of 0.
start = df[df['cum_qty'].map(lambda x: x == 0)].index.max() + 1
end = len(df) - 1
gives you the start and end row numbers of the range you are referring to.
df['amount'][start:end].sum() / df['quantity'][start:end].sum()
gives you the answer from the example you gave (using df['amount'] rather than df['price'], so that buys of more than one item are weighted correctly).
If you want to know this value for each occurrence of cum_qty 0, then you can apply the start/end logic using the index of each occurrence (the result of my first line of code).
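A minimal sketch of that per-cycle idea, assuming the amount and cum_qty columns built in the question (the cycle label is introduced here just for illustration; a new cycle starts right after each row where cum_qty hits zero):
df['cycle'] = df['cum_qty'].eq(0).shift(fill_value=False).cumsum()
buys = df[df['quantity'] > 0]
avg_buy_price = buys.groupby('cycle').apply(lambda g: g['amount'].sum() / g['quantity'].sum())
# cycle
# 0    5.0
# 1    3.0
# 2    3.0
# dtype: float64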
