Updating Pandas DataFrame column conditionally using other columns - python

With a DataFrame like the one below, how do I set c1len equal to zero when c1pos equals zero? I would then like to do the same for c2len/c2pos. Is there an easy way to do it without creating a bunch of columns to arrive at the desired answer?
distance c1pos c1len c2pos c2len daysago
line_date
2013-06-22 7.00 9 0.0 9 6.4 27
2013-05-18 8.50 6 4.6 7 4.9 62
2012-12-31 8.32 5 4.6 5 2.1 200
2012-12-01 8.00 7 7.1 6 8.6 230
2012-11-03 7.00 7 0.0 7 2.7 258
2012-10-15 7.00 7 0.0 8 5.2 277
2012-09-22 8.32 10 10.1 8 4.1 300
2012-09-15 9.00 10 12.5 9 12.1 307
2012-08-18 7.00 8 0.0 8 9.2 335
2012-08-02 9.00 5 3.5 5 2.2 351
2012-07-14 12.00 3 4.5 3 3.5 370
2012-06-16 8.32 7 3.7 7 5.1 398

I don't think any of your rows actually satisfy those conditions, but this will work.
This creates a boolean mask that is True for the rows where the column in question (e.g. c2pos) is 0, then sets c2len to 0 for those rows.
In [15]: df.loc[df.c2pos==0,'c2len'] = 0
In [16]: df.loc[df.c1pos==0,'c1len'] = 0
In [17]: df
Out[17]:
distance c1pos c1len c2pos c2len daysago
2013-06-22 7.00 9 0.0 9 6.4 27
2013-05-18 8.50 6 4.6 7 4.9 62
2012-12-31 8.32 5 4.6 5 2.1 200
2012-12-01 8.00 7 7.1 6 8.6 230
2012-11-03 7.00 7 0.0 7 2.7 258
2012-10-15 7.00 7 0.0 8 5.2 277
2012-09-22 8.32 10 10.1 8 4.1 300
2012-09-15 9.00 10 12.5 9 12.1 307
2012-08-18 7.00 8 0.0 8 9.2 335
2012-08-02 9.00 5 3.5 5 2.2 351
2012-07-14 12.00 3 4.5 3 3.5 370
2012-06-16 8.32 7 3.7 7 5.1 398
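
Since none of the rows above has a zero in c1pos or c2pos, the assignment leaves that frame unchanged. A minimal sketch on a small made-up frame (the values are hypothetical, chosen so that some rows do match) shows the effect, and loops over the column pairs instead of repeating the line:

```python
import pandas as pd

# Hypothetical data: two rows have a zero "pos" value so the mask matches something.
df = pd.DataFrame({
    "c1pos": [9, 0, 5],
    "c1len": [0.0, 4.6, 4.6],
    "c2pos": [0, 7, 5],
    "c2len": [6.4, 4.9, 2.1],
})

# Apply the same mask-and-assign step to each (pos, len) pair.
for pos_col, len_col in [("c1pos", "c1len"), ("c2pos", "c2len")]:
    df.loc[df[pos_col] == 0, len_col] = 0

print(df)
```

No helper columns are created; each assignment only overwrites the rows selected by the mask.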

Related

Create a new column in a DataFrame equal to the differenced series

I want to create a new column, diff_2, equal to the row-to-row difference of a series in another column.
The following is my dataframe:
df=pd.DataFrame({
'series_1' : [10.1, 15.3, 16, 12, 14.5, 11.8, 2.3, 7.7,5,10],
'series_2' : [9.6,10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17, 8.16]
})
It has the following display:
series_1 series_2
0 10.1 9.60
1 15.3 10.40
2 16.0 11.20
3 12.0 3.30
4 14.5 6.00
5 11.8 4.00
6 2.3 1.94
7 7.7 15.44
8 5.0 6.17
9 10.0 8.16
Goal
Is to get the following output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
My code
To reach the desired output I used the following code and it worked:
diff_2 = [np.nan]
l = len(df)
for i in range(1, l):
    diff_2.append(df['series_2'][i] - df['series_2'][i-1])
df['diff_2'] = diff_2
Issue with my code
I have replicated a simplified DataFrame here; the real one I am working on is extremely large, and my code took almost 9 minutes to run!
I want an alternative that produces the output quickly.
Any suggestions will be highly appreciated, thanks.
Here is one way to do it, using diff:
# create a new column from the difference between consecutive rows using diff()
df['diff_2']=df['series_2'].diff()
df
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
You might want to add the following line of code:
df["diff_2"] = df["series_2"].sub(df["series_2"].shift(1))
to achieve your goal output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
That is a built-in pandas feature, so it should be optimized for good performance.
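
As a quick sanity check, the sketch below rebuilds series_2 from the question and confirms that the explicit loop, diff(), and the sub/shift form all give the same column. The variable names via_loop, via_diff and via_shift are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "series_2": [9.6, 10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17, 8.16],
})

# Original approach: explicit Python loop over the rows.
via_loop = [np.nan]
for i in range(1, len(df)):
    via_loop.append(df["series_2"][i] - df["series_2"][i - 1])

# Vectorised alternatives from the answers.
via_diff = df["series_2"].diff()
via_shift = df["series_2"].sub(df["series_2"].shift(1))

print(np.allclose(via_loop, via_diff, equal_nan=True))
print(np.allclose(via_diff, via_shift, equal_nan=True))
```

Both vectorised forms avoid per-row Python indexing, which is where the loop spends most of its time on a large frame.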

correlation matrix with group-by and sort

I am trying to calculate a correlation matrix with groupby and sort. I have 100 companies from 11 industries. I would like to group by industry, sort by total assets (atq), and then calculate the correlation of data.pr_multi in that order. However, when I sort and group, the order is lost and the result comes out in alphabetical order.
The data I use:
index datafqtr tic pr_multi atq industry
0 2018Q1 A NaN 8698.0 4
1 2018Q2 A -0.0856845728151735 8784.0 4
2 2018Q3 A 0.0035103320774146 8349.0 4
3 2018Q4 A -0.0157732687260246 8541.0 4
4 2018Q1 AAL NaN 53280.0 5
5 2018Q2 AAL -0.2694380292532717 52622.0 5
The code I use:
data1=data18.sort_values(['atq'],ascending=False).groupby('industry').head()
df = data1.pivot_table('pr_multi', ['datafqtr'], 'tic')
# calculate correlation matrix using inbuilt pandas function
correlation_matrix = df.corr()
correlation_matrix.head()
IIUC, you want to calculate the correlation between the order based on the groupby and the pr_multi column. use:
data1=data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
np.corrcoef(data1.reset_index()['level_1'], data18['pr_multi'].astype(float).fillna(0))
Output:
array([[ 1. , -0.44754795],
[-0.44754795, 1. ]])
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df.groupby('name')[['col1','col2']].corr() # you can put as many desired columns here
Output:
y x
name
a y 1.000000 0.974467
a x 0.974467 1.000000
b y 1.000000 0.975120
b x 0.975120 1.000000
The data is like this:
name col1 col2
0 a 13.7 7.8
1 a -14.7 -9.7
2 a -3.4 -0.6
3 a 7.4 3.3
4 a -5.3 -1.9
5 a -8.3 -2.3
6 a 8.9 3.7
7 a 10.0 7.9
8 a 1.8 -0.4
9 a 6.7 3.1
10 a 17.4 9.9
11 a 8.9 7.7
12 a -3.1 -1.5
13 a -12.2 -7.9
14 a 7.6 4.9
15 a 4.2 2.3
16 a -15.3 -5.6
17 a 9.9 6.7
18 a 11.0 5.2
19 a 5.7 5.1
20 a -0.3 -0.6
21 a -15.0 -8.7
22 a -10.6 -5.7
23 a -16.0 -9.1
24 b 16.7 8.5
25 b 9.2 8.2
26 b 4.7 3.4
27 b -16.7 -8.7
28 b -4.8 -1.5
29 b -2.6 -2.2
30 b 16.3 9.5
31 b 15.8 9.8
32 b -10.8 -7.3
33 b -5.4 -3.4
34 b -6.0 -1.8
35 b 1.9 -0.6
36 b 6.3 6.1
37 b -14.7 -8.0
38 b -16.1 -9.7
39 b -10.5 -8.0
40 b 4.9 1.0
41 b 11.1 4.5
42 b -14.8 -8.5
43 b -0.2 -2.8
44 b 6.3 1.7
45 b -14.1 -8.7
46 b 13.8 8.9
47 b -6.2 -3.0
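
One likely reason the order "reverses" in the question is that pivot_table returns its columns in sorted (alphabetical) order, whatever order the rows were in beforehand. A minimal sketch of one way around this, on made-up data that reuses the question's column names, is to compute the desired ticker order separately and reindex the correlation matrix with it. Ordering by mean atq within each industry is an assumption about what "sort by total assets" should mean.

```python
import pandas as pd

# Hypothetical data using the question's column names.
data = pd.DataFrame({
    "datafqtr": ["2018Q1", "2018Q2", "2018Q3"] * 3,
    "tic": ["A"] * 3 + ["AAL"] * 3 + ["B"] * 3,
    "pr_multi": [0.01, -0.02, 0.03, 0.02, -0.01, 0.05, -0.03, 0.04, 0.01],
    "atq": [8698.0, 8784.0, 8349.0, 53280.0, 52622.0, 52000.0, 100.0, 120.0, 110.0],
    "industry": [4, 4, 4, 5, 5, 5, 4, 4, 4],
})

# Desired order: by industry, then by descending mean total assets.
order = (
    data.groupby(["industry", "tic"])["atq"].mean()
        .reset_index()
        .sort_values(["industry", "atq"], ascending=[True, False])["tic"]
        .tolist()
)

# pivot_table sorts the ticker columns alphabetically...
wide = data.pivot_table(values="pr_multi", index="datafqtr", columns="tic")

# ...so restore the desired order on the correlation matrix afterwards.
corr = wide.corr().reindex(index=order, columns=order)
print(corr)
```

The correlation values themselves do not depend on column order; only the layout of the matrix changes.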

Python code for rolling window regression by groups

I would like to perform a rolling window regression for panel data over a period of 12 months and get the monthly intercept fund wise as output. My data has Funds (ID) with monthly returns.
Please help me with the Python code for this.
In statsmodels there is rolling OLS. You can use that with groupby
Sample code:
import pandas as pd
import numpy as np
from statsmodels.regression.rolling import RollingOLS
# Read data & adding "intercept" column
df = pd.read_csv('sample_rolling_regression_OLS.csv')
df['intercept'] = 1
# Groupby then apply RollingOLS
df.groupby('name')[['y', 'intercept', 'x']].apply(lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params)
Sample data: or you can download at: https://www.dropbox.com/s/zhklsg5cmfksufm/sample_rolling_regression_OLS.csv?dl=0
name y x intercept
0 a 13.7 7.8 1
1 a -14.7 -9.7 1
2 a -3.4 -0.6 1
3 a 7.4 3.3 1
4 a -5.3 -1.9 1
5 a -8.3 -2.3 1
6 a 8.9 3.7 1
7 a 10.0 7.9 1
8 a 1.8 -0.4 1
9 a 6.7 3.1 1
10 a 17.4 9.9 1
11 a 8.9 7.7 1
12 a -3.1 -1.5 1
13 a -12.2 -7.9 1
14 a 7.6 4.9 1
15 a 4.2 2.3 1
16 a -15.3 -5.6 1
17 a 9.9 6.7 1
18 a 11.0 5.2 1
19 a 5.7 5.1 1
20 a -0.3 -0.6 1
21 a -15.0 -8.7 1
22 a -10.6 -5.7 1
23 a -16.0 -9.1 1
24 b 16.7 8.5 1
25 b 9.2 8.2 1
26 b 4.7 3.4 1
27 b -16.7 -8.7 1
28 b -4.8 -1.5 1
29 b -2.6 -2.2 1
30 b 16.3 9.5 1
31 b 15.8 9.8 1
32 b -10.8 -7.3 1
33 b -5.4 -3.4 1
34 b -6.0 -1.8 1
35 b 1.9 -0.6 1
36 b 6.3 6.1 1
37 b -14.7 -8.0 1
38 b -16.1 -9.7 1
39 b -10.5 -8.0 1
40 b 4.9 1.0 1
41 b 11.1 4.5 1
42 b -14.8 -8.5 1
43 b -0.2 -2.8 1
44 b 6.3 1.7 1
45 b -14.1 -8.7 1
46 b 13.8 8.9 1
47 b -6.2 -3.0 1
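
If statsmodels is not available, the rolling intercept of a one-regressor model can also be sketched with pandas alone, using slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x) over each window. This is a swapped-in technique, not the RollingOLS call above; the helper name rolling_intercept, the window of 4 and the shortened data are assumptions for illustration (the question asks for 12 months).

```python
import numpy as np
import pandas as pd

# First six rows of each group from the sample data above.
df = pd.DataFrame({
    "name": ["a"] * 6 + ["b"] * 6,
    "y": [13.7, -14.7, -3.4, 7.4, -5.3, -8.3, 16.7, 9.2, 4.7, -16.7, -4.8, -2.6],
    "x": [7.8, -9.7, -0.6, 3.3, -1.9, -2.3, 8.5, 8.2, 3.4, -8.7, -1.5, -2.2],
})

def rolling_intercept(group, window):
    """Intercept of y ~ intercept + x over a trailing window (hypothetical helper)."""
    x, y = group["x"], group["y"]
    slope = x.rolling(window).cov(y) / x.rolling(window).var()
    return y.rolling(window).mean() - slope * x.rolling(window).mean()

window = 4  # illustrative; use 12 for twelve monthly observations
# Compute per group so that a window never spans two funds.
df["alpha"] = pd.concat(
    [rolling_intercept(g, window) for _, g in df.groupby("name")]
)
print(df)
```

The first window - 1 rows of every group are NaN because a full window is not yet available there.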

delete consecutive rows conditionally pandas

I have a df with columns (A, B, C, D, E). I want to:
1) Compare consecutive rows
2) If the absolute difference between consecutive E values is <= 1 AND the absolute difference between consecutive C values is > 7, delete the row with the lowest C value.
Sample Data:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 98.0 8.2 13.0 193.7 5.5
5 20.5 9.6 17.0 157.3 5.3
6 32.9 5.4 24.5 45.9 79.8
Desired result:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.9 79.8
Row 4 was deleted when compared with row 3. Row 5 is now row 4 and it was deleted when compared to row 3.
This code returns the results as boolean (not df with values) and does not satisfy all the conditions.
df = (abs(df.E.diff(-1)) <= 1) & (abs(df.C.diff(-1)) > 7.)
The result of the code:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Any help appreciated.
Using shift() to compare the rows, and a while loop to iterate until no new change happens:
while True:
    rows = len(df)
    df = df[~((abs(df.E - df.E.shift(1)) <= 1) & (abs(df.C - df.C.shift(1)) > 7))]
    df.reset_index(inplace=True, drop=True)
    if rows == len(df):
        break
It produces the desired output:
A B C D E
0 94.5 4.3 26.0 79.00 NaN
1 34.0 8.8 23.0 58.00 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.90 79.8
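
The loop can be sketched as a self-contained function on the sample data. The function name drop_close_rows and its parameters are illustrative. Note that, like the loop above, it always drops the later row of a matching pair; on this data that is also the row with the lower C value, but that would not hold in general.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [94.5, 34.0, 54.2, 42.2, 98.0, 20.5, 32.9],
    "B": [4.3, 8.8, 5.4, 3.5, 8.2, 9.6, 5.4],
    "C": [26.0, 23.0, 25.5, 26.0, 13.0, 17.0, 24.5],
    "D": [79.0, 58.0, 9.91, 4.91, 193.7, 157.3, 45.9],
    "E": [np.nan, 54.5, 50.2, 5.1, 5.5, 5.3, 79.8],
})

def drop_close_rows(frame, e_tol=1, c_gap=7):
    """Repeat the filter until a pass removes nothing (hypothetical helper)."""
    while True:
        e_close = frame["E"].diff().abs() <= e_tol  # E within tolerance of previous row
        c_far = frame["C"].diff().abs() > c_gap     # C jumps by more than the gap
        kept = frame[~(e_close & c_far)].reset_index(drop=True)
        if len(kept) == len(frame):
            return kept
        frame = kept

result = drop_close_rows(df)
print(result)
```

Two passes remove rows here: the first drops original row 4, and the second drops original row 5 once it sits next to row 3.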

python exponential moving average

I would like to calculate the exponential moving average of my data. As usual, there are a few different ways to implement it in Python, and before using any of them I wanted to understand and verify them. The result is very surprising: none of them agree!
Below I use the TA-Lib EMA as well as the pandas ewm function. I have also included one from Excel, using the formula [data now - EMA (previous)] x multiplier + EMA (previous), with multiplier = 0.1818.
Can someone explain how they are calculated? Why do they all give different results? Which one is correct?
import pandas as pd
from talib import MA, EMA  # assumed import for the TA-Lib MA and EMA functions

df = pd.DataFrame({"Number": [x for x in range(1,7)]*5})
data = df["Number"]
df["TA_MA"] = MA(data, timeperiod = 5)
df["PD_MA"] = data.rolling(5).mean()
df["TA_EMA"] = EMA(data, timeperiod = 5)
df["PD_EMA_1"] = data.ewm(span=5, adjust=False).mean()
df["PD_EMA_2"] = data.ewm(span=5, adjust=True).mean()
Number TA_MA PD_MA TA_EMA PD_EMA_1 PD_EMA_2 Excel_EMA
0 1 NaN NaN NaN 1.000000 1.000000 NaN
1 2 NaN NaN NaN 1.333333 1.600000 NaN
2 3 NaN NaN NaN 1.888889 2.263158 NaN
3 4 NaN NaN NaN 2.592593 2.984615 NaN
4 5 3.0 3.0 3.000000 3.395062 3.758294 3.00
5 6 4.0 4.0 4.000000 4.263374 4.577444 3.55
6 1 3.8 3.8 3.000000 3.175583 3.310831 3.08
7 2 3.6 3.6 2.666667 2.783722 2.856146 2.89
8 3 3.4 3.4 2.777778 2.855815 2.905378 2.91
9 4 3.2 3.2 3.185185 3.237210 3.276691 3.11
10 5 3.0 3.0 3.790123 3.824807 3.857846 3.45
11 6 4.0 4.0 4.526749 4.549871 4.577444 3.91
12 1 3.8 3.8 3.351166 3.366581 3.378804 3.38
13 2 3.6 3.6 2.900777 2.911054 2.917623 3.13
14 3 3.4 3.4 2.933852 2.940703 2.945145 3.11
15 4 3.2 3.2 3.289234 3.293802 3.297299 3.27
16 5 3.0 3.0 3.859490 3.862534 3.865443 3.58
17 6 4.0 4.0 4.572993 4.575023 4.577444 4.02
18 1 3.8 3.8 3.381995 3.383349 3.384424 3.47
19 2 3.6 3.6 2.921330 2.922232 2.922811 3.21
20 3 3.4 3.4 2.947553 2.948155 2.948546 3.17
21 4 3.2 3.2 3.298369 3.298770 3.299077 3.32
22 5 3.0 3.0 3.865579 3.865847 3.866102 3.63
23 6 4.0 4.0 4.577053 4.577231 4.577444 4.06
24 1 3.8 3.8 3.384702 3.384821 3.384915 3.50
25 2 3.6 3.6 2.923135 2.923214 2.923265 3.23
26 3 3.4 3.4 2.948756 2.948809 2.948844 3.19
27 4 3.2 3.2 3.299171 3.299206 3.299233 3.33
28 5 3.0 3.0 3.866114 3.866137 3.866160 3.64
29 6 4.0 4.0 4.577409 4.577425 4.577444 4.07
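
The columns in the table appear to differ only in how the recursion EMA_t = EMA_(t-1) + alpha * (x_t - EMA_(t-1)) is seeded and which alpha is used. The sketch below is one reading of the numbers, not a statement about library internals: the helper ema_recursive is hypothetical, it needs only NumPy and pandas, and it reproduces PD_EMA_1 (seeded with the first value, alpha = 2/(5+1)), values consistent with TA_EMA (seeded with the mean of the first 5 values, same alpha), and values consistent with Excel_EMA (same seed, but multiplier 0.1818, which equals 2/(10+1), i.e. a 10-period span).

```python
import numpy as np
import pandas as pd

data = pd.Series([x for x in range(1, 7)] * 5, dtype=float)

def ema_recursive(values, alpha, seed_len=1):
    """Recursive EMA seeded with the mean of the first seed_len values."""
    out = np.full(len(values), np.nan)
    out[seed_len - 1] = np.mean(values[:seed_len])
    for t in range(seed_len, len(values)):
        out[t] = out[t - 1] + alpha * (values[t] - out[t - 1])
    return out

x = data.to_numpy()
alpha_5 = 2 / (5 + 1)

first_value_seed = ema_recursive(x, alpha_5)           # compare with PD_EMA_1
sma_seed = ema_recursive(x, alpha_5, seed_len=5)       # compare with TA_EMA
excel_like = ema_recursive(x, 0.1818, seed_len=5)      # compare with Excel_EMA

pandas_ema = data.ewm(span=5, adjust=False).mean().to_numpy()
print(np.allclose(first_value_seed, pandas_ema))
print(sma_seed[4:8])
print(excel_like[4:8].round(2))
```

With adjust=True, pandas instead divides a weighted sum by the sum of the weights (1 - alpha)^i, which is why PD_EMA_2 differs at the start and converges to the other span-5 columns as the early weights decay.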
