Merging different length dataframes in Python/pandas

I have 2 dataframes:
df1
aa gg pm
1 3.3 0.5
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5
6 3.3 0.5
7 11.1 3.0
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3
8 3.3 0.3
and df2:
aa gg st
1 3.3 in
2 0.3 in
5 1.3 in
7 11.1 in
8 5.3 in
I would like to merge these two dataframes on cols aa and gg to get results like:
aa gg pm st
1 3.3 0.5 in
1 0.0 4.7
1 9.3 0.2
2 0.3 0.6 in
2 14.0 91.0
3 13.0 31.0
4 13.1 64.0
5 1.3 0.5 in
6 3.3 0.5
7 11.1 3.0 in
7 11.3 24.0
8 3.2 0.0
8 5.3 0.3 in
8 3.3 0.3
I want to map the col st values onto df1 based on cols aa and gg. Please let me know how to do this.

You can multiply the float columns by 1000 or 10000, convert them to integers, and then use these new columns for the join:
# scale the float key by 1000 and cast to int so the join keys compare exactly
df1['gg_int'] = df1['gg'].mul(1000).astype(int)
df2['gg_int'] = df2['gg'].mul(1000).astype(int)
# left join on the integer key; drop df2's float gg to avoid duplicate columns
df = df1.merge(df2.drop('gg', axis=1), on=['aa','gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print(df)
aa gg pm st
0 1 3.3 0.5 in
1 1 0.0 4.7 NaN
2 1 9.3 0.2 NaN
3 2 0.3 0.6 in
4 2 14.0 91.0 NaN
5 3 13.0 31.0 NaN
6 4 13.1 64.0 NaN
7 5 1.3 0.5 in
8 6 3.3 0.5 NaN
9 7 11.1 3.0 in
10 7 11.3 24.0 NaN
11 8 3.2 0.0 NaN
12 8 5.3 0.3 in
13 8 3.3 0.3 NaN
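If you'd rather not add a helper integer column, a hedged alternative sketch is to round the float key on both sides before merging; after rounding to the same number of decimals, equal values share the same float representation, so the join keys compare exactly (this assumes gg is meaningful to 3 decimal places):
# round the float key on both sides, then do the same left join
left = df1.assign(key=df1['gg'].round(3))
right = df2.assign(key=df2['gg'].round(3)).drop('gg', axis=1)
df = left.merge(right, on=['aa', 'key'], how='left').drop('key', axis=1)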

Related

Python code for rolling window regression by groups

I would like to perform a rolling window regression on panel data over a period of 12 months and get the monthly intercept, fund-wise, as output. My data has Funds (ID) with monthly returns.
Please help me with the Python code for this.
In statsmodels there is RollingOLS. You can use it with groupby.
Sample code:
import pandas as pd
import numpy as np
from statsmodels.regression.rolling import RollingOLS
# Read data & adding "intercept" column
df = pd.read_csv('sample_rolling_regression_OLS.csv')
df['intercept'] = 1
# Groupby then apply RollingOLS
df.groupby('name')[['y', 'intercept', 'x']].apply(
    lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params
)
Sample data (you can also download it at https://www.dropbox.com/s/zhklsg5cmfksufm/sample_rolling_regression_OLS.csv?dl=0):
name y x intercept
0 a 13.7 7.8 1
1 a -14.7 -9.7 1
2 a -3.4 -0.6 1
3 a 7.4 3.3 1
4 a -5.3 -1.9 1
5 a -8.3 -2.3 1
6 a 8.9 3.7 1
7 a 10.0 7.9 1
8 a 1.8 -0.4 1
9 a 6.7 3.1 1
10 a 17.4 9.9 1
11 a 8.9 7.7 1
12 a -3.1 -1.5 1
13 a -12.2 -7.9 1
14 a 7.6 4.9 1
15 a 4.2 2.3 1
16 a -15.3 -5.6 1
17 a 9.9 6.7 1
18 a 11.0 5.2 1
19 a 5.7 5.1 1
20 a -0.3 -0.6 1
21 a -15.0 -8.7 1
22 a -10.6 -5.7 1
23 a -16.0 -9.1 1
24 b 16.7 8.5 1
25 b 9.2 8.2 1
26 b 4.7 3.4 1
27 b -16.7 -8.7 1
28 b -4.8 -1.5 1
29 b -2.6 -2.2 1
30 b 16.3 9.5 1
31 b 15.8 9.8 1
32 b -10.8 -7.3 1
33 b -5.4 -3.4 1
34 b -6.0 -1.8 1
35 b 1.9 -0.6 1
36 b 6.3 6.1 1
37 b -14.7 -8.0 1
38 b -16.1 -9.7 1
39 b -10.5 -8.0 1
40 b 4.9 1.0 1
41 b 11.1 4.5 1
42 b -14.8 -8.5 1
43 b -0.2 -2.8 1
44 b 6.3 1.7 1
45 b -14.1 -8.7 1
46 b 13.8 8.9 1
47 b -6.2 -3.0 1
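If the sample CSV isn't at hand, here is a minimal self-contained sketch with synthetic panel data (the fund names, slope, and noise scale below are made up for illustration):
import numpy as np
import pandas as pd
from statsmodels.regression.rolling import RollingOLS
# synthetic panel: two funds with 24 monthly observations each
rng = np.random.default_rng(0)
df = pd.DataFrame({'name': ['a'] * 24 + ['b'] * 24, 'x': rng.normal(size=48)})
df['y'] = 1.5 * df['x'] + rng.normal(scale=0.5, size=48)
df['intercept'] = 1
# rolling OLS per fund; .params is a DataFrame with 'intercept' and 'x' columns
params = df.groupby('name')[['y', 'intercept', 'x']].apply(
    lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params
)
print(params['intercept'])  # the fund-wise rolling intercepts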

Number Of Rows Since Positive/Negative in Pandas

I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
# label each run of same-signed values with its own group id
s = np.sign(df['MACD']).diff().ne(0).cumsum()
# count rows within each run; mask the first run, which has no prior sign flip
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
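To see why this works, a quick look at the helper series s for the same df (each run of same-signed values gets its own label, which groupby then counts within):
import numpy as np
import pandas as pd
df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})
s = np.sign(df['MACD']).diff().ne(0).cumsum()
print(s.tolist())  # [1, 1, 2, 2, 2, 3, 4, 4]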

Python exponential moving average

I would like to calculate the exponential moving average of my data. As usual, there are a few different ways to implement it in Python, and before I use any of them I would like to understand (verify) them. The result is very surprising: none of them are the same!
Below I use the TA-Lib EMA as well as the pandas ewm function. I have also included one from Excel, using the formula [data now - EMA (previous)] x multiplier + EMA (previous), with multiplier = 0.1818.
Can someone explain how they are calculated? Why do they all have different results? Which one is correct?
import pandas as pd
from talib import MA, EMA

df = pd.DataFrame({"Number": [x for x in range(1,7)]*5})
data = df["Number"].astype(float)  # TA-Lib expects float64 input
df["TA_MA"] = MA(data, timeperiod=5)
df["PD_MA"] = data.rolling(5).mean()
df["TA_EMA"] = EMA(data, timeperiod=5)
df["PD_EMA_1"] = data.ewm(span=5, adjust=False).mean()
df["PD_EMA_2"] = data.ewm(span=5, adjust=True).mean()
Number TA_MA PD_MA TA_EMA PD_EMA_1 PD_EMA_2 Excel_EMA
0 1 NaN NaN NaN 1.000000 1.000000 NaN
1 2 NaN NaN NaN 1.333333 1.600000 NaN
2 3 NaN NaN NaN 1.888889 2.263158 NaN
3 4 NaN NaN NaN 2.592593 2.984615 NaN
4 5 3.0 3.0 3.000000 3.395062 3.758294 3.00
5 6 4.0 4.0 4.000000 4.263374 4.577444 3.55
6 1 3.8 3.8 3.000000 3.175583 3.310831 3.08
7 2 3.6 3.6 2.666667 2.783722 2.856146 2.89
8 3 3.4 3.4 2.777778 2.855815 2.905378 2.91
9 4 3.2 3.2 3.185185 3.237210 3.276691 3.11
10 5 3.0 3.0 3.790123 3.824807 3.857846 3.45
11 6 4.0 4.0 4.526749 4.549871 4.577444 3.91
12 1 3.8 3.8 3.351166 3.366581 3.378804 3.38
13 2 3.6 3.6 2.900777 2.911054 2.917623 3.13
14 3 3.4 3.4 2.933852 2.940703 2.945145 3.11
15 4 3.2 3.2 3.289234 3.293802 3.297299 3.27
16 5 3.0 3.0 3.859490 3.862534 3.865443 3.58
17 6 4.0 4.0 4.572993 4.575023 4.577444 4.02
18 1 3.8 3.8 3.381995 3.383349 3.384424 3.47
19 2 3.6 3.6 2.921330 2.922232 2.922811 3.21
20 3 3.4 3.4 2.947553 2.948155 2.948546 3.17
21 4 3.2 3.2 3.298369 3.298770 3.299077 3.32
22 5 3.0 3.0 3.865579 3.865847 3.866102 3.63
23 6 4.0 4.0 4.577053 4.577231 4.577444 4.06
24 1 3.8 3.8 3.384702 3.384821 3.384915 3.50
25 2 3.6 3.6 2.923135 2.923214 2.923265 3.23
26 3 3.4 3.4 2.948756 2.948809 2.948844 3.19
27 4 3.2 3.2 3.299171 3.299206 3.299233 3.33
28 5 3.0 3.0 3.866114 3.866137 3.866160 3.64
29 6 4.0 4.0 4.577409 4.577425 4.577444 4.07
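For what it's worth, the columns appear to differ mainly in the smoothing factor and the seed value. Below is a minimal sketch of the two recurrences, assuming alpha = 2 / (span + 1); note the Excel multiplier 0.1818 corresponds to span 10, not the span 5 used elsewhere, which explains most of that column's drift:
import pandas as pd
data = pd.Series([x for x in range(1, 7)] * 5, dtype=float)
alpha = 2 / (5 + 1)  # span = 5 -> alpha = 0.3333
# pandas ewm(span=5, adjust=False): seeded with the first data point
ema = [data.iloc[0]]
for x in data.iloc[1:]:
    ema.append(alpha * x + (1 - alpha) * ema[-1])
# matches the PD_EMA_1 column above
# TA-Lib style: seeded with the SMA of the first timeperiod points
ta = [None] * 4 + [data.iloc[:5].mean()]
for x in data.iloc[5:]:
    ta.append(alpha * x + (1 - alpha) * ta[-1])
# matches the TA_EMA column above; adjust=True instead weights every past
# point by (1 - alpha)**i and normalizes, so it only converges to adjust=False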

Numpy Separating CSV into columns

I'm trying to use a CSV imported from bballreference.com. But as you can see, the comma-separated values all end up in a single column rather than being split into separate columns. Using NumPy or pandas, what would be the easiest way to fix this? I've googled to no avail.
[screenshot: the CSV loaded in Jupyter, with each row's values in a single column]
I don't know how to post a CSV file in a clean way, but here it is:
",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"
It seems that the first two rows of your CSV file are headers, but by default pd.read_csv assumes only the first row is the header.
Also, the beginning and trailing quotes make pd.read_csv treat the text in between as a single field/column.
You could try the following: remove the beginning and trailing quotes, and then read both header rows:
bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')
Following is how you could use Python to remove the beginning and trailing quotes:
# open 'quotes.csv' in read mode and 'no_quotes.csv' in write mode
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line; each line is a string ending in '\n'
    for line in in_file:
        # strip the newline, slice off the first and last characters
        # (the wrapping quotes), then write the line back with a fresh newline
        out_file.write(line.rstrip('\n')[1:-1] + '\n')

# read_csv on 'no_quotes.csv' with both header rows
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()
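Note that header=[0, 1] produces a two-level column index. If you prefer flat names, one possible sketch (the 'Unnamed' filter drops the blank cells pandas generates for the first header row):
# collapse the two header rows into single flat column names
bbal.columns = ['_'.join(part for part in map(str, col) if not part.startswith('Unnamed'))
                for col in bbal.columns]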
Consider reading the CSV in as a text file and stripping the beginning/end quotes from each line, since those quotes tell the parser that everything in between is a single value. Then use the built-in StringIO to read the cleaned text straight into a dataframe instead of saving it to disk and re-importing.
Additionally, skip the first row of repeated Totals and Per Game labels, and even the last aggregate row, since you can recompute that with pandas.
from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')
print(df)
Output
Rk Player Age G GS MP FG FGA 3P 3PA ... PTS FG% 3P% FT% MP.1 PTS.1 TRB.1 AST.1 STL.1 BLK.1
0 1.0 Kevin Durant\duranke01 29.0 5 5.0 182 54 107 9 28 ... 139 0.505 0.321 0.815 36.5 27.8 7.4 4.8 1.4 1.2
1 2.0 Klay Thompson\thompkl01 27.0 5 5.0 183 38 99 12 43 ... 99 0.384 0.279 1.000 36.7 19.8 6.4 1.8 0.2 0.4
2 3.0 Stephen Curry\curryst01 29.0 4 3.0 125 32 67 15 34 ... 98 0.478 0.441 1.000 31.2 24.5 5.3 3.5 2.0 0.5
3 4.0 Draymond Green\greendr01 27.0 5 5.0 186 27 55 8 20 ... 74 0.491 0.400 0.800 37.1 14.8 11.8 10.0 2.4 1.6
4 5.0 Andre Iguodala\iguodan01 34.0 5 4.0 140 14 29 4 12 ... 39 0.483 0.333 0.583 27.9 7.8 5.0 3.4 2.0 0.4
5 6.0 Quinn Cook\cookqu01 24.0 4 0.0 58 12 27 0 10 ... 30 0.444 0.000 0.750 14.4 7.5 2.3 1.0 0.3 0.0
6 7.0 Kevon Looney\looneke01 21.0 5 0.0 113 12 17 0 0 ... 28 0.706 NaN 0.500 22.6 5.6 5.8 1.0 0.8 0.2
7 8.0 Shaun Livingston\livinsh01 32.0 5 0.0 79 11 27 0 0 ... 26 0.407 NaN 1.000 15.9 5.2 1.2 2.4 0.0 0.2
8 9.0 David West\westda01 37.0 5 0.0 40 8 14 0 0 ... 16 0.571 NaN NaN 7.9 3.2 1.4 2.6 0.4 0.8
9 10.0 Nick Young\youngni01 32.0 4 2.0 41 3 11 3 10 ... 11 0.273 0.300 0.667 10.2 2.8 1.0 0.3 0.3 0.0
10 11.0 JaVale McGee\mcgeeja01 30.0 3 1.0 19 3 8 0 1 ... 6 0.375 0.000 NaN 6.2 2.0 2.0 0.0 0.0 0.3
11 12.0 Zaza Pachulia\pachuza01 33.0 2 0.0 8 1 2 0 0 ... 4 0.500 NaN 0.500 4.2 2.0 3.0 0.0 1.0 0.0
12 13.0 Jordan Bell\belljo01 23.0 4 0.0 23 1 4 0 0 ... 3 0.250 NaN 0.500 5.8 0.8 1.5 1.3 0.5 0.5
13 14.0 Damian Jones\jonesda03 22.0 1 0.0 3 0 1 0 0 ... 2 0.000 NaN 1.000 3.2 2.0 0.0 0.0 0.0 0.0
[14 rows x 30 columns]

Calculating the accumulated summation of clustered data in a pandas data frame

Given the following data frame:
index value
1 0.8
2 0.9
3 1.0
4 0.9
5 nan
6 nan
7 nan
8 0.4
9 0.9
10 nan
11 0.8
12 2.0
13 1.4
14 1.9
15 nan
16 nan
17 nan
18 8.4
19 9.9
20 10.0
…
in which the 'value' data is separated into a number of clusters by NaN values. Is there any way I can calculate quantities such as the accumulated summation or the mean of the clustered data? For example, I want to calculate the accumulated sum and generate the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 0
11 0.8 0.8
12 2.0 2.8
13 1.4 4.2
14 1.9 6.1
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
…
Any suggestions?
Also, as a simple extension of the problem: if two clusters of data are close enough, say separated by only 1 NaN, we consider them one cluster, so that we get the following data frame:
index value cumsum
1 0.8 0.8
2 0.9 1.7
3 1.0 2.7
4 0.9 3.6
5 nan 0
6 nan 0
7 nan 0
8 0.4 0.4
9 0.9 1.3
10 nan 1.3
11 0.8 2.1
12 2.0 4.1
13 1.4 5.5
14 1.9 7.4
15 nan 0
16 nan 0
17 nan 0
18 8.4 8.4
19 9.9 18.3
20 10.0 28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off by finding the parts of value that we want to treat as zero:
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)
to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)
produces
>>> df
index value cumsum cumsum_skip1
0 1 0.8 0.8 0.8
1 2 0.9 1.7 1.7
2 3 1.0 2.7 2.7
3 4 0.9 3.6 3.6
4 5 NaN 0.0 0.0
5 6 NaN 0.0 0.0
6 7 NaN 0.0 0.0
7 8 0.4 0.4 0.4
8 9 0.9 1.3 1.3
9 10 NaN 0.0 1.3
10 11 0.8 0.8 2.1
11 12 2.0 2.8 4.1
12 13 1.4 4.2 5.5
13 14 1.9 6.1 7.4
14 15 NaN 0.0 0.0
15 16 NaN 0.0 0.0
16 17 NaN 0.0 0.0
17 18 8.4 8.4 8.4
18 19 9.9 18.3 18.3
19 20 10.0 28.3 28.3
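Since the question also mentions per-cluster means, the same cluster labels can drive any groupby aggregation, for example:
# broadcast each cluster's mean back onto its rows (all-NaN clusters become 0)
n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cluster_mean"] = df["value"].groupby(clusters).transform('mean').fillna(0)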
