Pandas: reversing a cumulative sum - python

I've got a simple pandas Series, like this one:
st
0 74
1 91
2 105
3 121
4 136
5 157
The data in this Series is the result of a cumulative sum, so I was wondering if a pandas function could "undo" the process and return a new Series like:
st result
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
result[0] = st[0], and afterwards result[i] = st[i] - st[i-1].
It seemed very simple (and maybe I missed a post), but I didn't find anything...

Use Series.diff, replace the first missing value with the original value via Series.fillna, and then cast to integers if necessary:
df['res'] = df['st'].diff().fillna(df['st']).astype(int)
print (df)
st res
0 74 74
1 91 17
2 105 14
3 121 16
4 136 15
5 157 21
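As a quick sanity check (a minimal sketch, assuming the df built above), cumulatively summing the reconstructed column reproduces the original one:
print((df['res'].cumsum() == df['st']).all())   # True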

concat result of apply in python

I am trying to apply a function to a column of a dataframe.
After getting multiple results as dataframes, I want to concat them all into one.
Why does the first option work but the second does not?
import numpy as np
import pandas as pd
def testdf(n):
    test = pd.DataFrame(np.random.randint(0, n*100, size=(n*3, 3)), columns=list('ABC'))
    test['index'] = n
    return test
test = pd.DataFrame({'id': [1,2,3,4]})
testapply = test['id'].apply(func=testdf)
#option 1
pd.concat([testapply[0],testapply[1],testapply[2],testapply[3]])
#option2
pd.concat([testapply])
pd.concat expects a sequence of pandas objects, but your second option passes a sequence containing a single pd.Series object that itself holds multiple dataframes, so no concatenation takes place - you just get that Series back as is. To fix your second approach, use unpacking:
print(pd.concat([*testapply]))
A B C index
0 91 15 91 1
1 93 85 91 1
2 26 87 74 1
0 195 103 134 2
1 14 26 159 2
2 96 143 9 2
3 18 153 35 2
4 148 146 130 2
5 99 149 103 2
0 276 150 115 3
1 232 126 91 3
2 37 242 234 3
3 144 73 81 3
4 96 153 145 3
5 144 94 207 3
6 104 197 49 3
7 0 93 179 3
8 16 29 27 3
0 390 74 379 4
1 78 37 148 4
2 350 381 260 4
3 279 112 260 4
4 115 387 173 4
5 70 213 378 4
6 43 37 149 4
7 240 399 117 4
8 123 0 47 4
9 255 172 1 4
10 311 329 9 4
11 346 234 374 4
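An equivalent fix (a sketch relying only on standard pandas behaviour) is to turn the Series of dataframes into a plain list before concatenating, which hands pd.concat a real sequence of DataFrame objects:
pd.concat(testapply.tolist())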

How to reorder rows by moving multiple separated rows X rows further down, in python with either pandas or numpy

I have a very long dataframe with hundreds of rows. I want to select the rows containing one keyword in one of the columns and move each whole row 18 places further down. Since there are too many, using reindex and doing it manually would take too long.
As an example, for this df I would like to move the rows with the word "Base" in column A three rows down, after "Three":
A B C
Base 572 55
One 654 196
Two 2 156
Three 154 123
Base 78 45
One 251 78
Two 5 56
Three 321 59
Base 48 45
One 5 12
Two 531 231
Three 51 123
So, I want it to look like:
A B C
One 654 196
Two 2 156
Three 154 123
Base 572 55
One 251 78
Two 5 56
Three 321 59
Base 78 45
One 5 12
Two 531 231
Three 51 123
Base 48 45
I am new to programming, so I would appreciate your help!
First create an extra dummy column to act as your sorting key. In this case, as far as I understood you:
# key is "<occurrence number within the A value>:<position of the A value in ord>"
ord = ["One", "Two", "Three", "Base"]
df["sorting_key"] = df.groupby("A").cumcount().map(str) + ":" + df["A"].apply(ord.index).map(str)
Then just sort by it:
df.sort_values("sorting_key")
Result:
A B C sorting_key
1 One 654 196 0:0
2 Two 2 156 0:1
3 Three 154 123 0:2
0 Base 572 55 0:3
5 One 251 78 1:0
6 Two 5 56 1:1
7 Three 321 59 1:2
4 Base 78 45 1:3
9 One 5 12 2:0
10 Two 531 231 2:1
11 Three 51 123 2:2
8 Base 48 45 2:3
Then, to reindex it and drop the dummy column:
df.sort_values("sorting_key").reset_index(drop=True).drop(columns="sorting_key")
Output:
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
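Note that the string key above sorts lexicographically, so it would misorder groups once a value occurs ten or more times ("10:0" sorts before "2:0"). A variant of the same idea (a sketch, assuming the df and ord list from above) sorts on two numeric helper columns instead:
df.assign(grp=df.groupby("A").cumcount(), pos=df["A"].apply(ord.index)) \
  .sort_values(["grp", "pos"]).drop(columns=["grp", "pos"]).reset_index(drop=True)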
You could do the following:
import numpy as np
import pandas as pd

# create mask identifying the "Base" rows
mask = df.A.eq("Base")
# DataFrame with the non-Base rows, reindexed from 0
non_base = df[~mask].reset_index(drop=True)
# DataFrame with the Base rows
base = df[mask]
# shift each Base row's index to the position it should occupy in the result
base.index = base.index + (3 - np.arange(len(base)))
# concat and sort by index
result = pd.concat([base, non_base], sort=True).sort_index().reset_index(drop=True)
print(result)
Output
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
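The question actually asks for a shift of 18 places on the real data; the same mask-based idea can be wrapped in a small helper. This is a sketch generalizing the answer above (move_down is a hypothetical helper, not a pandas function), and it assumes df has the default 0..n-1 integer index:
import numpy as np
import pandas as pd

def move_down(df, mask, k):
    # rows selected by mask move k positions further down;
    # all other rows keep their relative order
    moved = df[mask].copy()
    rest = df[~mask].reset_index(drop=True)
    # target position = original position + k, corrected for the
    # selected rows that were removed above it
    moved.index = moved.index + (k - np.arange(len(moved)))
    # a stable sort keeps a moved row ahead of an unmoved row with the same index
    return (pd.concat([moved, rest], sort=True)
              .sort_index(kind="mergesort")
              .reset_index(drop=True))

result = move_down(df, df.A.eq("Base"), 3)   # use k=18 for the real data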

find the minimum value of a column between two occurrences of a value in another column in a Python data frame

I have stock price data containing Open, High, Low, Close prices on a daily basis. I am creating a new column "signal", which will take the values "signal" or "none" based on some conditions.
Every time df['signal'] == "signal", we have to compare it with the previous 3 occurrences of df['signal'] == "signal". Let us imagine the current occurrence to be the 4th signal. So, the previous occurrence of df['signal'] == "signal" would be the 3rd signal, the one before that the 2nd signal, and the signal previous to that would be the 1st signal.
I need to check whether the minimum value of df['Low'] between signal 4 and signal 3 is GREATER THAN the minimum value of df['Low'] between signal 1 and signal 2.
If it is greater, I need a new column df['Trade'] set to "Buy".
Sample data
No Open High Low Close signal Trade
1 75 95 65 50 signal
2 78 94 74 77 none
3 83 91 81 84 none
4 91 101 88 93 signal
5 104 121 95 103 none
6 101 111 99 105 none
7 97 108 95 101 signal
8 103 113 102 106 none
9 108 128 105 114 signal BUY
10 104 114 99 102 none
11 110 130 105 115 signal BUY
12 112 122 110 115 none
13 118 145 112 123 none
14 123 143 71 133 signal NONE
15 130 150 120 140 none
In the sample data above, in Line no 9, df['Trade'] == "BUY" happens since the minimum value of df['Low'] = 95 between this df['signal'] = "signal" and the previous df['signal'] = "signal" IS GREATER THAN the minimum value of df['Low'] = 65 between the previous two occurrences of df['signal'] = "signal".
Similarly, in Line no 14, df['Trade'] = "None" happens because the minimum value of df['Low'] = 71 between this signal and the previous signal is NOT GREATER THAN the minimum value of df['Low'] = 99 between the previous two signals.
I need help with the code to implement this.
import pandas as pd
import numpy as np
import bisect as bs
df = pd.read_csv("Nifty.csv")
cols = ['No', 'Low', 'signal']
df['5EMA'] = df['Close'].ewm(span=5).mean()
df['10EMA'] = df['Close'].ewm(span=10).mean()
condition1 = df['5EMA'].shift(1) < df['10EMA'].shift(1)
condition2 = df['5EMA'] > df['10EMA']
df['signal'] = np.where(condition1 & condition2, 'signal', None)
df1 = pd.concat([df[cols], df.loc[df.signal=='signal', cols].assign(signal='temp')]) \
        .sort_values(['No', 'signal'], ascending=[1, 0])
df1['g'] = (df1.signal == 'signal').cumsum()
df1['Low_min'] = df1.groupby('g').Low.transform('min')
s = df1.groupby('g').Low.min()
buy = s[s.shift(1) > s.shift(3)].index.tolist()
m1 = df1.signal.eq('signal') & df1.g.gt(3)
m2 = df1.g.isin(buy) & m1
df1['trade'] = np.select([m2, m1], ['Buy', 'None'], '')
df['trade'] = ''
df.trade.update(df1.loc[df1.signal=='signal',"trade"])
print(df)
Your problem can be simplified after some extra temporary rows are added. I set up a new dataframe containing only the required fields from the original df, and cloned all rows labelled 'signal' but renamed them to 'temp' with df.loc[df.signal=='signal', cols].assign(signal='temp'). The sorted rows are then group-labelled using the 'signal' flag and cumsum(). See the code below:
str="""No Open High Low Close signal
1 75 95 65 50 signal
2 78 94 74 77 none
3 83 91 81 84 none
4 91 101 88 93 signal
5 104 121 95 103 none
6 101 111 99 105 none
7 97 108 95 101 signal
8 103 113 102 106 none
9 108 128 105 114 signal
10 104 114 99 102 none
11 110 130 105 115 signal
12 112 122 110 115 none
13 118 145 112 123 none
14 123 143 71 133 signal
15 130 150 120 140 none"""
df = pd.read_csv(pd.io.common.StringIO(str), sep='\s+')
# cols which are used in this task
cols = ['No', 'Low', 'signal']
# create a new dataframe, cloned all 'signal' rows but rename signal to 'temp', sort the rows
df1 = pd.concat([df[cols], df.loc[df.signal=='signal', cols].assign(signal='temp')]) \
        .sort_values(['No', 'signal'], ascending=[1, 0])
# set up group-number with cumsum() and get min() value from each group
df1['g'] = (df1.signal == 'signal').cumsum()
# the following field just for reference, no need for calculation
df1['Low_min'] = df1.groupby('g').Low.transform('min')
The new dataframe df1 will look like the listing shown further below. Except for the first and last groups, every group now starts with a 'signal' row and ends with a 'temp' row (which is also a 'signal').
Based on your description, for Line no 9 (the first item in df1.g == 4), we can check df1.loc[df1.g == 3, "Low_min"] against df1.loc[df1.g == 1, "Low_min"].
If we have the following:
s = df1.groupby('g').Low.min()
the list of buy groups should satisfy s.shift(1) > s.shift(3):
buy = s[s.shift(1) > s.shift(3)].index.tolist()
So, let's set up conditions:
# m1: row marked with signal
# skip the first 3 groups which do not have enough signals
m1 = df1.signal.eq('signal') & df1.g.gt(3)
# m2: m1 plus must be in the buy list
m2 = df1.g.isin(buy) & m1
df1['trade'] = np.select([m2, m1], ['Buy', 'None'], '')
#In [36]: df1
#Out[36]:
# No Low signal g Low_min trade
#0 1 65 temp 0 65
#0 1 65 signal 1 65
#1 2 74 none 1 65
#2 3 81 none 1 65
#3 4 88 temp 1 65
#3 4 88 signal 2 88
#4 5 95 none 2 88
#5 6 99 none 2 88
#6 7 95 temp 2 88
#6 7 95 signal 3 95
#7 8 102 none 3 95
#8 9 105 temp 3 95
#8 9 105 signal 4 99 Buy
#9 10 99 none 4 99
#10 11 105 temp 4 99
#10 11 105 signal 5 71 Buy
#11 12 110 none 5 71
#12 13 112 none 5 71
#13 14 71 temp 5 71
#13 14 71 signal 6 71 None
#14 15 120 none 6 71
After we have df1.trade, we can update the original dataframe:
# set up column `trade` with EMPTY as default and update
# the field based on df1.trade (using the index)
df['trade'] = ''
df.trade.update(df1.loc[df1.signal=='signal',"trade"])
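Since the 'signal' rows in df1 keep their original row labels from df, the final update can also be written as an explicit .loc assignment; a small sketch (not part of the original answer), assuming df['trade'] has already been initialised to '' as above:
sig_trades = df1.loc[df1.signal == 'signal', 'trade']
df.loc[sig_trades.index, 'trade'] = sig_trades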

python pandas: Grouping dataframe by ranges

I have a dataframe object with date and calltime columns.
I was trying to build a histogram based on the second column, e.g.
df.groupby('calltime').head(10).plot(kind='hist', y='calltime')
I got the following histogram:
The thing is that I want more detail for the first bar. The 0-2500 range itself is huge, and all the data is hidden there... Is there a way to split the grouping into smaller ranges, e.g. by 50 or something like that?
UPD
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
15 1491928756421240 122
16 1491928756421375 28
17 1491928756421416 158
18 1491928756421587 65
19 1491928756421667 108
20 1491928756421790 55
21 1491928756421858 145
22 1491928756422018 37
23 1491928756422068 63
24 1491928756422145 57
25 1491928756422214 43
26 1491928756422270 73
27 1491928756422357 90
28 1491928756422460 72
29 1491928756422546 77
... ... ...
9845 1491928759997328 670
9846 1491928759998255 372
9848 1491928759999116 659
9849 1491928759999897 369
9850 1491928760000380 746
9851 1491928760001245 823
9852 1491928760002189 634
9853 1491928760002869 335
9856 1491928760003929 4162
9865 1491928760009368 531
Use bins:
s = pd.Series(np.abs(np.random.randn(100)) ** 3 * 2000)
s.hist(bins=20)
Or you can use pd.cut to produce your own custom bins.
pd.cut(
    s, [-np.inf] + [100 * i for i in range(10)] + [np.inf]
).value_counts(sort=False).plot.bar()
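Applied to the question's data, the same pd.cut idea breaks the crowded first bar into 50-wide buckets; a sketch assuming df is the frame with the calltime column shown above (values above 2500 fall into the final open-ended bin):
pd.cut(
    df['calltime'], bins=list(range(0, 2501, 50)) + [np.inf]
).value_counts(sort=False).plot.bar()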

Set columns of DataFrame to sum of columns of another in pandas

I have a DataFrame, call it "values" (constructed in the code below).
I would like to create another, call it "sums", in which each column contains the sum of the columns of "values" from that column through to the end.
I would like to create this without looping through the entire DataFrame, data point by data point. I have been trying with .apply() as seen below, but I keep getting the error: unsupported operand type(s) for +: 'int' and 'datetime.date'
In [26]: values = pandas.DataFrame({0:[96,54,27,28],
    ...:     1:[55,75,32,37], 2:[54,99,36,46], 3:[35,77,0,10], 4:[62,25,0,25],
    ...:     5:[0,66,0,89], 6:[0,66,0,89], 7:[0,0,0,0], 8:[0,0,0,0]})
In [28]: sums = values.copy()
In [29]: sums.iloc[:,:] = ''
In [31]: for column in sums:
    ...:     sums[column].apply(sum(values.loc[:,column:]))
    ...:
Traceback (most recent call last):
File "<ipython-input-31-030442e5005e>", line 2, in <module>
sums[column].apply(sum(values.loc[:,column:]))
File "C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\pandas\core\series.py", line 2220, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\src\inference.pyx", line 1088, in pandas.lib.map_infer (pandas\lib.c:63043)
TypeError: 'numpy.int64' object is not callable
In [32]: for column in sums:
    ...:     sums[column] = sum(values.loc[:,column:])
In [33]: sums
Out[33]:
0 1 2 3 4 5 6 7 8
0 36 36 35 33 30 26 21 15 8
1 36 36 35 33 30 26 21 15 8
2 36 36 35 33 30 26 21 15 8
3 36 36 35 33 30 26 21 15 8
Is there a way to do this without looping over each point individually?
Without looping, you can reverse your dataframe, take the cumulative sum along each row, and then re-reverse it:
>>> values.iloc[:,::-1].cumsum(axis=1).iloc[:,::-1]
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
You can use the .cumsum() method to get the cumulative sum. The problem is that it operates from left to right, whereas you need it from right to left.
So we will reverse your data frame, use cumsum(), then put the columns back into the proper order.
import pandas as pd
values = pd.DataFrame({0: [96, 54, 27, 28],
                       1: [55, 75, 32, 37], 2: [54, 99, 36, 46], 3: [35, 77, 0, 10],
                       4: [62, 25, 0, 25], 5: [0, 66, 0, 89], 6: [0, 66, 0, 89],
                       7: [0, 0, 0, 0], 8: [0, 0, 0, 0]})
values[values.columns[::-1]].cumsum(axis=1).reindex_axis(values.columns, axis=1)
# returns:
0 1 2 3 4 5 6 7 8
0 302 206 151 97 62 0 0 0 0
1 462 408 333 234 157 132 66 0 0
2 95 68 36 0 0 0 0 0 0
3 324 296 259 213 203 178 89 0 0
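Note that reindex_axis has been deprecated and later removed in newer pandas releases; the same column reordering can be expressed with plain selection or reindex (a sketch, assuming the values frame defined above):
values[values.columns[::-1]].cumsum(axis=1)[values.columns]
# or, equivalently
values[values.columns[::-1]].cumsum(axis=1).reindex(columns=values.columns)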
