How to find the cumsum of a column? - python

import pandas as pd

dict={"asset":["S3","S2","E4","E1","A6","A8"],
"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
I want to compute the cumulative sum of the column "cards_plus1".
How can I do this?
The output of the cumsum column should be:
0
2
5
8
10
13

I want to start with zero instead of 2. I want this output: cards_plus1_cumsum 0 2 5 8 10 13
We can just pad a zero in front of the cumulative sums (dropping the last value so the length stays the same):
dframe["cumsum"] = np.pad(dframe["cards_plus1"][:-1].cumsum(), (1, 0), 'constant')
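As a quick sanity check of the padding approach, here is a sketch on a hypothetical standalone series, assuming the blank last value has already been converted so the column is numeric:

```python
import numpy as np
import pandas as pd

# hypothetical all-numeric cards_plus1 column
s = pd.Series([2, 3, 3, 2, 3, 2])
padded = np.pad(s[:-1].cumsum(), (1, 0), 'constant')
print(list(padded))  # [0, 2, 5, 8, 10, 13]
```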

Try this:
First, replace the blank values with NaN:
import pandas as pd
import numpy as np
dict={"asset":["S3","S2","E4","E1","A6","A8"],"Rank":[1,2,3,4,5,6],"number_of_attributes":[2,1,2,2,1,1],
"number_of_cards":[1,2,2,1,2," "],"cards_plus1":[2,3,3,2,3," "]}
dframe=pd.DataFrame(dict,index=[1,2,3,4,5,6],
columns=["asset","Rank","number_of_attributes","number_of_cards","cards_plus1"])
## replace blank values by nan
## note: with inplace=True the replace call returns None, so don't print it directly
dframe.replace(r'^\s*$', np.nan, regex=True, inplace=True)
print(dframe)
>>> asset Rank number_of_attributes number_of_cards cards_plus1
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
Now the data type of the cards_plus1 column is object; change it to numeric:
### convert data type of the cards_plus1 to numeric
dframe['cards_plus1'] = pd.to_numeric(dframe['cards_plus1'])
Now calculate the cumulative sum:
### now we can calculate cumsum
dframe['cards_plus1_cumsum'] = dframe['cards_plus1'].cumsum()
print(dframe)
>>>
asset Rank number_of_attributes number_of_cards cards_plus1 \
1 S3 1 2 1.0 2.0
2 S2 2 1 2.0 3.0
3 E4 3 2 2.0 3.0
4 E1 4 2 1.0 2.0
5 A6 5 1 2.0 3.0
6 A8 6 1 NaN NaN
cards_plus1_cumsum
1 2.0
2 5.0
3 8.0
4 10.0
5 13.0
6 NaN
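If the goal is the zero-leading version from the question, one more option (a sketch on a standalone numeric series; shift's fill_value argument needs pandas >= 0.24) is to shift the cumulative sum down one row:

```python
import numpy as np
import pandas as pd

# cards_plus1 after the blank has been replaced with NaN and converted to numeric
s = pd.Series([2, 3, 3, 2, 3, np.nan])
print(s.cumsum().shift(1, fill_value=0).tolist())  # [0.0, 2.0, 5.0, 8.0, 10.0, 13.0]
```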
Instead of replacing the blank values with NaN, you can replace them with zero, depending on what you want. Hope this helps.
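A minimal sketch of the zero-replacement variant, on a hypothetical standalone column:

```python
import pandas as pd

dframe = pd.DataFrame({"cards_plus1": [2, 3, 3, 2, 3, " "]})
dframe = dframe.replace(r'^\s*$', 0, regex=True)  # blanks -> 0 instead of NaN
dframe["cards_plus1"] = pd.to_numeric(dframe["cards_plus1"])
print(dframe["cards_plus1"].cumsum().tolist())  # [2, 5, 8, 10, 13, 13]
```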

Related

Pandas: How to replace values of Nan in column based on another column?

I have a dataset as below:
import math
import numpy as np
import pandas as pd

dict = {
"A": [math.nan,math.nan,1,math.nan,2,math.nan,3,5],
"B": np.random.randint(1,5,size=8)
}
dt = pd.DataFrame(dict)
My desired output is: wherever column A has a NaN, replace it with twice the value of column B in the same row. So, given that, below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My desired output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is below; it does not work (the chained indexing assigns to a copy):
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case, fillna does it:
df.A.fillna(df.B*2, inplace=True)
df
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
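An equivalent alternative without in-place mutation, using np.where on the same example data (with a fixed B column for reproducibility, since the question uses random values):

```python
import numpy as np
import pandas as pd

dt = pd.DataFrame({"A": [np.nan, np.nan, 1, np.nan, 2, np.nan, 3, 5],
                   "B": [1, 1, 3, 2, 3, 1, 1, 3]})
# where A is NaN, take 2*B; otherwise keep A
dt["A"] = np.where(dt["A"].isna(), 2 * dt["B"], dt["A"])
print(dt["A"].tolist())  # [2.0, 2.0, 1.0, 4.0, 2.0, 2.0, 3.0, 5.0]
```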

Python Pandas: difference of column values insert into new column

I have a Pandas dataframe that looks like the following:
c1 c2 c3 c4
p1 q1 r1 20
p2 q2 r2 10
p3 q3 r1 30
The Desired output looks like this.
c1 c2 c3 c4 NewColumn(c1.1)
p1 q1 r1 20 0
p2 q2 r2 10 p2-p1
p3 q3 r1 30 p3-p2
The shape of my dataset is (333650, 665). I want to do that for all columns. Is there any way to achieve this?
The code I am using:
data = pd.read_csv('Mydataset.csv')
i = 0
j = 1
while j < len(data['columnname']):
    j = data['columnname'][i+1] - data['columnname'][i]
    i += 1  # next value of the column
    j += 1  # next value of the new column
print(j)
Is this what you want? It finds the difference between consecutive rows of a particular column using the shift method and assigns it to a new column.
Note that I am using the data from Dave.
df['New Column'] = df.a.sub(df.a.shift()).fillna(0)
a b c New Column
0 1 1 1 0.0
1 2 1 4 1.0
2 3 2 9 1.0
3 4 3 16 1.0
4 5 5 25 1.0
5 6 8 36 1.0
For multiple columns, this may suffice:
M = df.diff().fillna(0).add_suffix('_1')
#concatenate along the columns axis
pd.concat([df,M], axis = 1)
a b c a_1 b_1 c_1
0 1 1 1 0.0 0.0 0.0
1 2 1 4 1.0 0.0 3.0
2 3 2 9 1.0 1.0 5.0
3 4 3 16 1.0 1.0 7.0
4 5 5 25 1.0 2.0 9.0
5 6 8 36 1.0 3.0 11.0
You want the diff function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html
df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
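diff also accepts a periods argument, which may be useful if the difference should span more than one row; a small sketch on a hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6]})
# difference with the row two positions back, leading NaNs filled with 0
print(df['a'].diff(2).fillna(0).tolist())  # [0.0, 0.0, 2.0, 2.0, 2.0, 2.0]
```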

How to find the difference between multiple columns of a given data frame and save the result as a separate data frame

I have a data frame as below:
df = pd.DataFrame({'A':[1,4,7,1,4,7],'B':[2,5,8,2,5,8],'C':[3,6,9,3,6,9],'D':[1,2,3,1,2,3]})
A B C D
0 1 2 3 1
1 4 5 6 2
2 7 8 9 3
3 1 2 3 1
4 4 5 6 2
5 7 8 9 3
How can I find the difference between columns A and B and save it as AB, and do the same with C and D (saved as CD), within the data frame?
Expected output:
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
I tried using
d = dict(A='AB', B='AB', C='CD', D='CD')
df.groupby(d, axis=1).diff()
as explained here, this works well for sum(), but does not work as expected for diff(). Can someone please explain why?
The difference is that diff is not an aggregation like sum: instead of collapsing each group to a single value, it returns two new columns per group, the first filled with NaN and the second with the values.
So a possible solution here is to remove the all-NaN columns with DataFrame.dropna:
d = dict(A='AB', B='AB', C='CD', D='CD')
df1 = df.rename(columns=d).groupby(level=0, axis=1).diff().dropna(axis=1, how='all')
print (df1)
AB CD
0 1.0 -2.0
1 1.0 -4.0
2 1.0 -6.0
3 1.0 -2.0
4 1.0 -4.0
5 1.0 -6.0
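For just two fixed pairs, a plain column subtraction avoids the groupby machinery entirely; a sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7, 1, 4, 7], 'B': [2, 5, 8, 2, 5, 8],
                   'C': [3, 6, 9, 3, 6, 9], 'D': [1, 2, 3, 1, 2, 3]})
# AB = B - A, CD = D - C, matching the expected output above
df1 = pd.DataFrame({'AB': df['B'] - df['A'], 'CD': df['D'] - df['C']})
print(df1['CD'].tolist())  # [-2, -4, -6, -2, -4, -6]
```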

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row, the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
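The same result can also be had in one line without rolling, by adding the column to a shifted copy of itself (a sketch on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
# each row plus the next row; the last row has no successor, so it becomes NaN
df['f'] = df['d'] + df['d'].shift(-1)
print(df['f'].tolist())  # [3.0, 5.0, 7.0, 9.0, nan]
```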
df.cumsum() is the opposite operation of df.diff().
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling (due to a MultiIndex or similar), you can try using .cumsum() and then .diff(2) to subtract the cumulative sum from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get the inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The catch is that the first n+1 results will be NaN.
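The identity behind this trick can be checked directly: cumsum().diff(k) matches rolling(k).sum() everywhere except the leading NaN region (diff produces k leading NaNs, rolling only k-1):

```python
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
for k in (2, 3):
    a = s.cumsum().diff(k)   # c[i] - c[i-k] == s[i-k+1] + ... + s[i]
    b = s.rolling(k).sum()
    assert a.iloc[k:].equals(b.iloc[k:])
```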

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with a rolling count of historical data that is close to the peak. I wonder if there is a way to simplify, or ideally vectorise, the calculation.
This is my code, written in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]],
                  columns=list('ABC'))
df['5D_Peak'] = df['C'].rolling(window=5, center=False).max()
for i in range(5, len(df.A)):
    val = 0
    for j in range(i-5, i):
        if df.loc[j, 'C'] > df.loc[i, '5D_Peak'] - 2 and df.loc[j, 'C'] < df.loc[i, '5D_Peak'] + 2:
            val += 1
    df.loc[i, '5D_Close_to_Peak_Count'] = val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
# the window within which to search for "close-to-peak" values
lkp_rng = 5
# how close is close?
closeness_measure = 2
# function to count the number of "close-to-peak" values in the window
fc = lambda x: np.count_nonzero(np.where(x >= x.max() - closeness_measure))
# apply fc to the column you choose
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_rng, center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 3.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 2.0
I am guessing what you mean by "historical data".
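To reproduce the question's exact expected output (count over the previous five rows only, strictly within ±2 of the current row's 5D_Peak), here is a vectorized sketch using NumPy's sliding_window_view (requires NumPy >= 1.20):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]],
                  columns=list('ABC'))
df['5D_Peak'] = df['C'].rolling(window=5).max()

c = df['C'].to_numpy()
peak = df['5D_Peak'].to_numpy()
win = np.lib.stride_tricks.sliding_window_view(c, 5)   # win[k] = c[k:k+5]
counts = np.full(len(df), np.nan)
idx = np.arange(5, len(df))
# the lookback window for row i is c[i-5:i] = win[i-5];
# count values strictly within 2 of that row's peak
counts[idx] = np.sum(np.abs(win[idx - 5] - peak[idx, None]) < 2, axis=1)
df['5D_Close_to_Peak_Count'] = counts
print(df['5D_Close_to_Peak_Count'].tolist()[5:])  # [0.0, 1.0, 2.0, 2.0]
```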
