DataFrame calculating average purchase price - python

I have a dataframe with two columns: quantity and price.
df = pd.DataFrame([
[ 1, 5],
[-1, 6],
[ 2, 3],
[-1, 2],
[-1, 4],
[ 1, 2],
[ 1, 3],
[ 1, 4],
[-2, 5]], columns=['quantity', 'price'])
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
I have added two new columns, amount and cum_qty (cumulative quantity).
Now the dataframe looks like this (positive quantity represents buys, negative quantity represents sells):
quantity price amount cum_qty
0 1 5 5 1
1 -1 6 -6 0
2 2 3 6 2
3 -1 2 -2 1
4 -1 4 -4 0
5 1 2 2 1
6 1 3 3 2
7 1 4 4 3
8 -2 5 -10 1
I would like to calculate the average buy price.
Every time cum_qty reaches 0, quantity and amount should be reset to zero.
So we are looking at the rows with index = [5, 6, 7].
In each of those rows one item is bought, at prices 2, 3 and 4, which means I have 3 on stock at an average price of 3 [(2 + 3 + 4) / 3].
After the sell at index = 8 has happened (sell transactions don't change the buy price), I will have one item left at a price of 3.
So, basically, I have to divide the cumulative buy amount by the cumulative buy quantity, counting from the last point where the cumulative quantity was zero.
How can I calculate the average buy price of the stock on hand after all transactions with a pandas DataFrame?
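For reference, the rule in the last paragraph can also be written without a loop. A minimal vectorized sketch (my own take, assuming the amount and cum_qty columns from above; the column name avg_buy_price is made up, and the answers below take different routes):
cycle = df['cum_qty'].eq(0).shift(fill_value=False).cumsum()  # new cycle id after each zero
buys = df['quantity'].gt(0)
# running buy amount / running buy quantity, restarted per cycle;
# sell rows contribute 0, so they simply carry the previous average forward
df['avg_buy_price'] = (df['amount'].where(buys, 0).groupby(cycle).cumsum()
                       / df['quantity'].where(buys, 0).groupby(cycle).cumsum())
For the data above this yields 5, 5, 3, 3, 3, 2, 2.5, 3, 3, matching the walk-through.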

Here is a different solution using a loop:
import pandas as pd
import numpy as np
# Original data
df = pd.DataFrame({
'quantity': [ 1, -1, 2, -1, -1, 1, 1, 1, -2],
'price': [5, 6, 3, 2, 4, 2, 3, 4, 5]
})
# Process the data and add the new columns
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
df['prev_cum_qty'] = df['cum_qty'].shift(1, fill_value=0)
df['average_price'] = np.nan
for i, row in df.iterrows():
    if row['quantity'] > 0:
        df.iloc[i, df.columns == 'average_price'] = (
            row['amount'] +
            df['average_price'].shift(1, fill_value=df['price'][0])[i] *
            df['prev_cum_qty'][i]
        ) / df['cum_qty'][i]
    else:
        # sells keep the previous average buy price
        df.iloc[i, df.columns == 'average_price'] = df['average_price'][i - 1]
df = df.drop('prev_cum_qty', axis=1)  # drop() returns a copy, so reassign it
An advantage of this approach is that it will also work if there are new buys
before the cum_qty gets to zero. As an example, suppose there was a new buy
of 5 at the price of 3, that is, run the following line before processing the
data:
# Add more data, exemplifying a different situation
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
df = pd.concat([df, pd.DataFrame([{'quantity': 5, 'price': 3}])], ignore_index=True)
I would expect the following result:
quantity price amount cum_qty average_price
0 1 5 5 1 5.0
1 -1 6 -6 0 5.0
2 2 3 6 2 3.0
3 -1 2 -2 1 3.0
4 -1 4 -4 0 3.0
5 1 2 2 1 2.0
6 1 3 3 2 2.5
7 1 4 4 3 3.0
8 -2 5 -10 1 3.0
9 5 3 15 6 3.0 # Not 4.0
That is, since there was still 1 item bought at the price 3, the cum_qty is now 6, and the average price is still 3: (15 + 3 * 1) / 6 = 3.

Based on my understanding, you need the buy price for each trading cycle, so you can try this:
df['new_index'] = df.cum_qty.eq(0).shift().cumsum().fillna(0.)  # group id for each trading cycle
df = df.loc[df.quantity > 0]  # kick out the selling actions
df.groupby('new_index').apply(lambda x: x.amount.sum() / x.quantity.sum())
new_index
0.0    5.0  # 1st ave price 5
1.0    3.0  # 2nd ave price 3
2.0    3.0  # 3rd ave price 3 (NB: this cycle has not ended, the position is still 1)
dtype: float64
EDIT1, for your additional requirement:
DF = df.groupby('new_index', as_index=False).apply(lambda x: x.amount.cumsum() / x.cum_qty).reset_index()
DF.index = DF.level_1
DF.drop(['level_0', 'level_1'], axis=1, inplace=True)
pd.concat([df, DF], axis=1)
Out[572]:
quantity price amount cum_qty new_index 0
level_1
0 1 5 5 1 0.0 5.0
2 2 3 6 2 1.0 3.0
5 1 2 2 1 2.0 2.0
6 1 3 3 2 2.0 2.5
7 1 4 4 3 2.0 3.0

df[df['cum_qty'] == 0].index
will give you the rows at which you have a cum_qty of 0
df[df['cum_qty'] == 0].index.max()
gives you the last row with 0 cum_qty
start = df[df['cum_qty'] == 0].index.max() + 1
end = len(df) - 1
gives you the start and end row numbers of the range you are referring to
df['amount'][start:end].sum() / df['quantity'][start:end].sum()
gives you the answer from your example (using amount rather than price, so buys of more than one unit are weighted correctly; note the slice excludes row end, the final sell)
If you want to know this value for each occurrence of cum_qty 0, then you can apply the start/end logic by using the index of each (the result of my first line of code).
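For example, a sketch of that per-occurrence version (zero_rows and bounds are my own helper names; it also restricts each range to the buy rows, so sells inside a cycle don't distort the average):
zero_rows = df.index[df['cum_qty'] == 0]
bounds = [-1, *zero_rows, len(df) - 1]  # boundaries of each trading cycle
for start, end in zip(bounds[:-1], bounds[1:]):
    cycle = df.loc[start + 1:end]       # .loc slicing is inclusive on both ends
    buys = cycle[cycle['quantity'] > 0]
    print(buys['amount'].sum() / buys['quantity'].sum())
This prints 5.0, 3.0 and 3.0 for the example data.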


How to change several values of pandas DataFrame at once?

Let's consider a very simple data frame:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
A B
0 0 3
1 1 4
2 2 5
3 3 0
4 2 2
5 5 7
I want to do two things with this dataframe:
All numbers below 3 have to be changed to 0
All numbers equal to 0 have to be changed to 10
The problem is that when we apply:
df[df < 3] = 0
df[df == 0] = 10
we also change numbers which were initially not 0, obtaining:
A B
0 10 3
1 10 4
2 10 5
3 3 10
4 10 10
5 5 7
which is not the desired output; it should look like this:
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
My question is: is there any way to change both of those things at the same time? I.e., I want to change numbers smaller than 3 to 0 and numbers equal to 0 to 10, independently of each other.
Note! This example was created just to outline the problem. An obvious solution is to change the order of replacement: first change 0 to 10, and then change numbers smaller than 3 to 0. But I'm struggling with a much more complex problem, and I want to know if it is possible to change both of those at once.
Use applymap() to apply a function to each element in the DataFrame (note that since pandas 2.1, applymap has been renamed to DataFrame.map):
df.applymap(lambda x: 10 if x == 0 else (0 if x < 3 else x))
results in
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
I would do it the following way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
df_orig = df.copy()
df[df_orig < 3] = 0
df[df_orig == 0] = 10
print(df)
output
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
Explanation: I use the .copy method to get a copy of the DataFrame, placed in the variable df_orig; since that DataFrame is not altered during the run of the program, it can be used to select the places to put 0 and 10.
You can create the masks first, then change the values. Both masks are computed before either assignment runs, so they see the original values:
m1 = df < 3
m2 = df == 0
df[m1] = 0
df[m2] = 10
print(df)
A B
0 10 3
1 0 4
2 0 5
3 3 10
4 0 0
5 5 7
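For completeness, numpy's select is another way to apply both replacements in one pass: all conditions are evaluated against the original values, and the first matching condition wins, so putting the zero check first avoids the ordering problem entirely (a sketch, not part of the answers above):
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 2, 5], [3, 4, 5, 0, 2, 7]]).transpose()
df.columns = ["A", "B"]
# conditions are checked in order, so "== 0" must come before "< 3"
df[:] = np.select([df == 0, df < 3], [10, 0], default=df)
print(df)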

Subtract with value in previous row to create a new column by subject

Using Python and this data set https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv I would like to create a new column for each subject that starts with 0 (in the first row) and then contains the difference between each row's SCALE value and the previous row's (row 2 minus row 1, row 3 minus row 2, row 4 minus row 3, etc.).
However, if this produces a negative value, the output should be 0.
Edit: Thank you for the response, that worked perfectly. The only remaining issue is that I'd like to start again with each subject (SUBJECT column). The number of values per subject is not fixed, so something that checks the SUBJECT column and then starts again from 0 would be ideal.
You can use .shift(1) to create a new column with the values moved from the previous rows; then you have both values in the same row and can subtract the columns.
Later you can select all the negative results and assign zero.
import pandas as pd
data = {
'A': [1, 3, 2, 5, 1],
}
df = pd.DataFrame(data)
df['previous'] = df['A'].shift(1)
df['result'] = df['A'] - df['previous']
print(df)
#df['result'] = df['A'] - df['A'].shift(1)
#print(df)
df.loc[ df['result'] < 0 , 'result'] = 0
print(df)
Result:
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 -1.0
3 5 2.0 3.0
4 1 5.0 -4.0
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 0.0
3 5 2.0 3.0
4 1 5.0 0.0
EDIT:
If you use df['result'] = df['A'] - df['A'].shift(1) then you get the result column without creating the previous column.
And if you use .shift(1, fill_value=0) then it will put 0 instead of NaN in the first row.
EDIT:
You can use groupby("SUBJECT") to group by subject and later, in every group, put 0 in the first row.
import pandas as pd
data = {
'S': ['A', 'A', 'A', 'B', 'B', 'B'],
'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)
df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)
df.loc[ df['result'] < 0 , 'result'] = 0
print(df)
all_groups = df.groupby('S')
first_index = all_groups.apply(lambda grp: grp.index[0])
df.loc[first_index, 'result'] = 0
print(df)
Results:
S A result
0 A 1 1
1 A 3 2
2 A 2 -1
3 B 1 -1
4 B 5 4
5 B 1 -4
S A result
0 A 1 1
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
S A result
0 A 1 0
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
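For reference, the same per-subject logic can be written more compactly with groupby().diff(), which is NaN on each group's first row, so fillna(0) produces the required leading zeros (a sketch on the toy frame above):
df = pd.DataFrame({'S': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'A': [1, 3, 2, 1, 5, 1]})
# diff per group, leading NaN -> 0, negatives floored at 0
df['result'] = df.groupby('S')['A'].diff().fillna(0).clip(lower=0)
print(df)
This reproduces the last table above.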

Trying to multiply a certain data cell by another certain data cell in pandas

Due to misunderstandings with my real scenario, I am going to create an artificial one.
Here is the DataFrame.
import pandas as pd
num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, 'NaN', 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
print(numsdf)
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 NaN 3
3 2 4 1000
4 100 5 0
I want to be able to do the following addition: column Number 1, row 4 plus column Number 3, row 3 = column Number 2, row 2. That is, 100 + 1000 = 1100 (the answer should be in place of the NaN).
This should be the expected outcome:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0
How would I do that? I cannot figure it out.
Notice: this solution works only if the indices are the same in all 3 DataFrames.
If possible, replace non-numeric values with missing values and then forward fill the last non-missing values in the same column:
marketcapdf['Market Cap'] = (stockpricedf['Stock Price'] *
                             pd.to_numeric(outstandingdf['Outstanding'],
                                           errors='coerce').ffill())
If working in one DataFrame:
df['Market Cap'] = (df['Stock Price'] *
                    pd.to_numeric(df['Outstanding'],
                                  errors='coerce').ffill())
EDIT: If you need to multiply by the shifted second column, with no change for the first value, use (output shown for a simpler frame with Number 1 = [5, 4, 3, 2, 1] and Number 2 = [1, 2, 3, 4, 5]):
numsdf['new'] = numsdf['Number 1'] * numsdf['Number 2'].shift(fill_value=1)
print(numsdf)
Number 1 Number 2 new
0 5 1 5
1 4 2 4
2 3 3 6
3 2 4 6
4 1 5 4
EDIT1: I create the new columns step by step for better understanding:
import numpy as np
num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, np.nan, 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
#add by shifted values
numsdf['new'] = numsdf['Number 1'].shift(-1, fill_value=0) + numsdf['Number 3']
#shift again
numsdf['new1'] = numsdf['new'].shift(-1, fill_value=0)
#replace NaN by another column
numsdf['new2'] = numsdf['Number 2'].fillna(numsdf['new1'])
print(numsdf)
Number 1 Number 2 Number 3 new new1 new2
0 1 1.0 1 5 5 1.0
1 4 2.0 2 5 5 2.0
2 3 NaN 3 5 1100 1100.0
3 2 4.0 1000 1100 0 4.0
4 100 5.0 0 0 0 5.0
foo = numsdf.iloc[4, 0]  # Number 1, row 4 -> 100
bar = numsdf.iloc[3, 2]  # Number 3, row 3 -> 1000
numsdf.at[2, 'Number 2'] = foo + bar  # 1100 goes in place of the NaN
Output:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0

Pandas: dynamically add rows before & after groups

I have a Pandas data frame:
pd.DataFrame(data={"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0], "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
Value Group
0 5.0 1
1 0.0 1
2 0.0 1
3 8.0 1
4 0.0 2
5 3.0 2
6 0.0 2
7 2.0 2
8 7.0 2
9 0.0 2
Also I calculated the cumulative sum with respect to two rows for each group:
{"2-cumsum Group 1": array([5., 8.]), "2-cumsum Group 2": array([7., 5.])}
E.g. array([5., 8.]) because array([5., 0.]) (rows 0 and 1) + array([0., 8.]) (rows 2 and 3) = array([5., 8.]).
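For what it's worth, a minimal sketch of how such sums can be computed, assuming each group's length is divisible by the window:
import numpy as np
window = 2
sums = {g: v.to_numpy(float).reshape(-1, window).sum(axis=0)
        for g, v in df.groupby('Group')['Value']}
# {1: array([5., 8.]), 2: array([7., 5.])}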
What I now need is to append exactly two rows at the beginning of df, in-between each group and at the end of df so that I get the following data frame (gaps are for illustration purposes):
Value Group
0 10.0 0 # Initialize with 10.0
1 10.0 0 # Initialize with 10.0
2 5.0 1
3 0.0 1
4 0.0 1
5 8.0 1
6 5.0 0 # 5.0 ("2-cumsum Group 1"[0])
7 8.0 0 # 8.0 ("2-cumsum Group 1"[1])
8 0.0 2
9 3.0 2
10 0.0 2
11 2.0 2
12 7.0 2
13 0.0 2
14 7.0 0 # 7.0 ("2-cumsum Group 2"[0])
15 5.0 0 # 5.0 ("2-cumsum Group 2"[1])
Please consider that the original data frame is much larger, has more than just two columns, and I need to dynamically append rows with varying entries. E.g. the rows to append should have an additional column with 10.0 entries. Also, the window of the cumulative sum (2 in this case) is variable (it could be 8).
There are so many occasions where I need to generate rows based on other rows in data frames, but I haven't found any effective solutions other than for-loops and temporary cache lists that save values from previous iterations.
I would appreciate some help.
Thank you in advance and kind regards.
My original code applied to the exemplary data, in case anybody needs it. It's very convoluted and inefficient, so only consider it if you really need to:
import pandas as pd
import numpy as np
# Some stuff
df = pd.DataFrame(data={"Group1": ["a", "a", "b", "b", "b"],
"Group2": [1, 2, 1, 2, 3],
"Group3": [1, 9, 2, 1, 1],
"Value": [5, 8, 3, 2, 7]})
length = 2
max_value = 20
g = df['Group1'].unique()
h = df["Group2"].unique()
i = range(1,df['Group3'].max()+1)
df2 = (df.set_index(['Group1', 'Group2', 'Group3'])
         .reindex(pd.MultiIndex.from_product([g, h, i]))
         .assign(cc=lambda x: x.groupby(level=0).cumcount() // length)
         .rename_axis(['Group1', 'Group2', 'Group3'], axis=0))
df2 = (df2.loc[~df2['Value'].isna()
                  .groupby([pd.Grouper(level=0), df2['cc']])
                  .transform('all')]
          .reset_index().fillna(0).drop('cc', axis=1))
values = df2["Value"].copy().to_numpy()
values = np.array_split(values, len(values)/length)
stock = df2["Group1"].copy().to_numpy()
stock = np.array_split(stock, len(stock)/length)
# Generate the "Group" column and generate the "2-cumsum" arrays
# stored in the "volumes" variable
k = 0
a_groups = []
values_cache = []
volumes = []
for e, i in enumerate(values):
    if any(stock[e] == stock[e-1]):
        if np.any(i + values_cache >= max_value):
            k += 1
            volumes.append(values_cache)
            values_cache = i
        else:
            values_cache += i
        a_groups.extend([k] * length)
    else:
        k += 1
        if e:
            volumes.append(values_cache)
        values_cache = i
        a_groups.extend([k] * length)
volumes.append(values_cache)
df2["Group"] = a_groups
print(df2[["Value", "Group"]])
print("\n")
print(f"2-cumsums: {volumes}")
"""Output
Value Group
0 5.0 1
1 0.0 1
2 0.0 1
3 8.0 1
4 0.0 2
5 3.0 2
6 0.0 2
7 2.0 2
8 7.0 2
9 0.0 2
2-cumsums: [array([5., 8.]), array([7., 5.])]"""
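Not an answer, but a sketch of the insertion step I would try on the simplified frame from the top of the question: build the per-group pieces and concatenate once, instead of growing the frame row by row. The padding value 10.0, the window 2 and the filler group id 0 are taken from the example; for the real data they would be parameters:
import pandas as pd
df = pd.DataFrame({"Value": [5, 0, 0, 8, 0, 3, 0, 2, 7, 0],
                   "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]})
window, pad = 2, 10.0
pieces = [pd.DataFrame({"Value": [pad] * window, "Group": 0})]   # two leading rows
for g, chunk in df.groupby("Group", sort=False):
    csum = chunk["Value"].to_numpy(float).reshape(-1, window).sum(axis=0)
    pieces.append(chunk)                                          # the group itself
    pieces.append(pd.DataFrame({"Value": csum, "Group": 0}))      # its 2-cumsum rows
out = pd.concat(pieces, ignore_index=True)
print(out)
This reproduces the desired frame shown above.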

How to save the result from equation(float) to column, python

I have a data frame that looks like this:
df:
1 2 3.4
-2 2 1.1
2 3 4
-5 5 5
I can use this data in my equation like:
result = abs(int(df[0])) + (int(df[1]) / 2 + float(df[2]) / 32)
So after this calculation I receive a list with the results for each line of df, and the resulting type is float.
Question: How can I save it to one column or DataFrame, and add this column with the results to another dataframe that's the same as df?
I've tried pd.DataFrame(result), which doesn't work.
Assign directly to the new column you're trying to create. Note that int(), float() and abs() are scalar functions that don't apply element-wise to a Series, so use the pandas equivalents:
df[3] = df[0].abs() + df[1] / 2 + df[2] / 32
I think you need Series.astype to cast the columns, combined with Series.abs:
df = pd.DataFrame({0: [1, -2, 2, -5], 1: [2, 2, 3, 5], 2: [3.4, 1.1, 4.0, 5.0]})
print (df)
0 1 2
0 1 2 3.4
1 -2 2 1.1
2 2 3 4.0
3 -5 5 5.0
df[3] = df[0].astype(int).abs() + df[1].astype(int) / 2 + df[2].astype(float) / 32
print (df)
0 1 2 3
0 1 2 3.4 2.106250
1 -2 2 1.1 3.034375
2 2 3 4.0 3.625000
3 -5 5 5.0 7.656250
