Python pivot table: iterate through each value

I have an example of the data below:
Temperature Voltage Data
25 3.3 2
25 3.3 2.5
25 3.3 3.7
25 3.3 3.5
25 3.3 2.7
25 3.45 1.9
25 3.45 1.7
25 3.45 1.5
25 3.45 2
25 3.45 2.9
105 3.3 3
105 3.3 3.5
105 3.3 4.7
105 3.3 4.5
105 3.3 3.7
105 3.45 2.5
105 3.45 2.3
105 3.45 2.1
105 3.45 3.3
105 3.45 4
I would like to iterate through each row to calculate the difference between 2 consecutive data points, then count how many times that difference is equal to or greater than 1.
Then, print out the number of times that happens per Temperature per Voltage.
Thank you,
Victor

Edit: added np.abs to make sure the absolute value of the difference is taken.
You can use pandas diff for that, and then np.where for the condition:
import numpy as np
import pandas as pd
data = {
'Temperature': [25,25,25,25,25,25,25,25,25,25,105,105,105,105,105,105,105,105,105,105],
'Voltage': [3.3,3.3,3.3,3.3,3.3,3.45,3.45,3.45,3.45,3.45,3.3,3.3,3.3,3.3,3.3,3.45,3.45,3.45,3.45,3.45],
'Data': [2,2.5,3.7,3.5,2.7,1.9,1.7,1.5,2,2.9,3,3.5,4.7,4.5,3.7,2.5,2.3,2.1,3.3,4]
}
df = pd.DataFrame(data)
df['difference'] = df['Data'].diff(1)
df['flag'] = np.where(np.abs(df['difference']) >= 1,'More than 1','Less than one')
print(df)
Output:
Temperature Voltage Data difference flag
0 25 3.30 2.0 NaN Less than one
1 25 3.30 2.5 0.5 Less than one
2 25 3.30 3.7 1.2 More than 1
3 25 3.30 3.5 -0.2 Less than one
4 25 3.30 2.7 -0.8 Less than one
5 25 3.45 1.9 -0.8 Less than one
6 25 3.45 1.7 -0.2 Less than one
7 25 3.45 1.5 -0.2 Less than one
8 25 3.45 2.0 0.5 Less than one
9 25 3.45 2.9 0.9 Less than one
10 105 3.30 3.0 0.1 Less than one
11 105 3.30 3.5 0.5 Less than one
12 105 3.30 4.7 1.2 More than 1
13 105 3.30 4.5 -0.2 Less than one
14 105 3.30 3.7 -0.8 Less than one
15 105 3.45 2.5 -1.2 More than 1
16 105 3.45 2.3 -0.2 Less than one
17 105 3.45 2.1 -0.2 Less than one
18 105 3.45 3.3 1.2 More than 1
19 105 3.45 4.0 0.7 Less than one
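To finish with the requested count per Temperature per Voltage, and to avoid differencing across the group boundaries above, a follow-up sketch against the same df:
# Diff within each (Temperature, Voltage) group: the first row of each
# group becomes NaN instead of a cross-group difference
df['difference'] = df.groupby(['Temperature', 'Voltage'])['Data'].diff()
# Count |difference| >= 1 per Temperature/Voltage pair via a pivot table
counts = df.assign(jump=np.abs(df['difference']) >= 1).pivot_table(
    index='Temperature', columns='Voltage', values='jump', aggfunc='sum')
print(counts)
# Voltage      3.30  3.45
# Temperature
# 25              1     0
# 105             1     1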

Related

correlation matrix with group-by and sort

I am trying to calculate a correlation matrix with groupby and sort. I have 100 companies from 11 industries. I would like to group by industry and sort by total assets (atq), and then calculate the correlation of data.pr_multi in this order. However, when I sort and group by, it reverts and calculates in alphabetical order.
A sample of the data:
index  datafqtr  tic             pr_multi      atq  industry
0      2018Q1    A                    NaN   8698.0         4
1      2018Q2    A    -0.0856845728151735   8784.0         4
2      2018Q3    A     0.0035103320774146   8349.0         4
3      2018Q4    A    -0.0157732687260246   8541.0         4
4      2018Q1    AAL                  NaN  53280.0         5
5      2018Q2    AAL  -0.2694380292532717  52622.0         5
The code I use:
data1 = data18.sort_values(['atq'], ascending=False).groupby('industry').head()
df = data1.pivot_table('pr_multi', ['datafqtr'], 'tic')
# calculate correlation matrix using inbuilt pandas function
correlation_matrix = df.corr()
correlation_matrix.head()
IIUC, you want to calculate the correlation between the within-industry ordering from the groupby and the pr_multi column. Use:
import numpy as np

data1 = data18.groupby('industry')['atq'].apply(lambda x: x.sort_values(ascending=False))
# 'level_1' is the original row index restored by reset_index after the groupby sort
np.corrcoef(data1.reset_index()['level_1'], data18['pr_multi'].astype(float).fillna(0))
Output:
array([[ 1.        , -0.44754795],
       [-0.44754795,  1.        ]])
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df.groupby('name')[['col1','col2']].corr() # you can put as many desired columns here
Output:
              col1      col2
name
a    col1  1.000000  0.974467
     col2  0.974467  1.000000
b    col1  1.000000  0.975120
     col2  0.975120  1.000000
The data is like this:
name col1 col2
0 a 13.7 7.8
1 a -14.7 -9.7
2 a -3.4 -0.6
3 a 7.4 3.3
4 a -5.3 -1.9
5 a -8.3 -2.3
6 a 8.9 3.7
7 a 10.0 7.9
8 a 1.8 -0.4
9 a 6.7 3.1
10 a 17.4 9.9
11 a 8.9 7.7
12 a -3.1 -1.5
13 a -12.2 -7.9
14 a 7.6 4.9
15 a 4.2 2.3
16 a -15.3 -5.6
17 a 9.9 6.7
18 a 11.0 5.2
19 a 5.7 5.1
20 a -0.3 -0.6
21 a -15.0 -8.7
22 a -10.6 -5.7
23 a -16.0 -9.1
24 b 16.7 8.5
25 b 9.2 8.2
26 b 4.7 3.4
27 b -16.7 -8.7
28 b -4.8 -1.5
29 b -2.6 -2.2
30 b 16.3 9.5
31 b 15.8 9.8
32 b -10.8 -7.3
33 b -5.4 -3.4
34 b -6.0 -1.8
35 b 1.9 -0.6
36 b 6.3 6.1
37 b -14.7 -8.0
38 b -16.1 -9.7
39 b -10.5 -8.0
40 b 4.9 1.0
41 b 11.1 4.5
42 b -14.8 -8.5
43 b -0.2 -2.8
44 b 6.3 1.7
45 b -14.1 -8.7
46 b 13.8 8.9
47 b -6.2 -3.0
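If you only need the off-diagonal value per group (the correlation of col1 with col2) from the output above, a small follow-up sketch using xs on the MultiIndex result:
corr = df.groupby('name')[['col1', 'col2']].corr()
# Take the 'col1' rows from the second index level, then read the 'col2'
# column: one correlation value per group
print(corr.xs('col1', level=1)['col2'])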

Python code for rolling window regression by groups

I would like to perform a rolling window regression for panel data over a period of 12 months and get the monthly intercept fund wise as output. My data has Funds (ID) with monthly returns.
Could you please help me with the Python code for this?
statsmodels provides rolling OLS (RollingOLS). You can use it together with groupby.
Sample code:
import pandas as pd
import numpy as np
from statsmodels.regression.rolling import RollingOLS
# Read data & adding "intercept" column
df = pd.read_csv('sample_rolling_regression_OLS.csv')
df['intercept'] = 1
# Groupby then apply RollingOLS
df.groupby('name')[['y', 'intercept', 'x']].apply(lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params)
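Since the goal is the monthly intercept fund-wise, a follow-up sketch, assuming the result of the groupby-apply above is captured in a variable:
params = df.groupby('name')[['y', 'intercept', 'x']].apply(
    lambda g: RollingOLS(g['y'], g[['intercept', 'x']], window=6).fit().params)
# The 'intercept' column holds the rolling intercept for each group and row;
# the first window-1 rows of each group are NaN
intercepts = params['intercept']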
Sample data (or download at https://www.dropbox.com/s/zhklsg5cmfksufm/sample_rolling_regression_OLS.csv?dl=0):
name y x intercept
0 a 13.7 7.8 1
1 a -14.7 -9.7 1
2 a -3.4 -0.6 1
3 a 7.4 3.3 1
4 a -5.3 -1.9 1
5 a -8.3 -2.3 1
6 a 8.9 3.7 1
7 a 10.0 7.9 1
8 a 1.8 -0.4 1
9 a 6.7 3.1 1
10 a 17.4 9.9 1
11 a 8.9 7.7 1
12 a -3.1 -1.5 1
13 a -12.2 -7.9 1
14 a 7.6 4.9 1
15 a 4.2 2.3 1
16 a -15.3 -5.6 1
17 a 9.9 6.7 1
18 a 11.0 5.2 1
19 a 5.7 5.1 1
20 a -0.3 -0.6 1
21 a -15.0 -8.7 1
22 a -10.6 -5.7 1
23 a -16.0 -9.1 1
24 b 16.7 8.5 1
25 b 9.2 8.2 1
26 b 4.7 3.4 1
27 b -16.7 -8.7 1
28 b -4.8 -1.5 1
29 b -2.6 -2.2 1
30 b 16.3 9.5 1
31 b 15.8 9.8 1
32 b -10.8 -7.3 1
33 b -5.4 -3.4 1
34 b -6.0 -1.8 1
35 b 1.9 -0.6 1
36 b 6.3 6.1 1
37 b -14.7 -8.0 1
38 b -16.1 -9.7 1
39 b -10.5 -8.0 1
40 b 4.9 1.0 1
41 b 11.1 4.5 1
42 b -14.8 -8.5 1
43 b -0.2 -2.8 1
44 b 6.3 1.7 1
45 b -14.1 -8.7 1
46 b 13.8 8.9 1
47 b -6.2 -3.0 1

Python, pd dataframe extract values based on condition raises error

I have the following dataframe of NBA player stats:
print(self.df)
Name PTS REB AST \
(updated to: , 2020-02-24 19:39:00)
0 James Harden 35.2 6.4 7.4
1 Giannis Antetokounmpo 30.0 13.6 5.8
2 Trae Young 30.0 4.4 9.2
3 Bradley Beal 29.6 4.4 6.0
4 Damian Lillard 29.5 4.4 7.9
... ... ... ... ...
261 Jerome Robinson 3.1 1.7 1.1
262 Goga Bitadze 3.1 2.0 0.5
263 Javonte Green 3.0 1.7 0.5
264 Semi Ojeleye 2.9 1.9 0.5
265 Matthew Dellavedova 2.5 1.1 2.6
STL BLK FGM FGA FG% 3PM 3PA \
(updated to: , 2020-02-24 19:39:00)
0 1.7 1.0 10.1 23.1 43.9 4.6 12.8
1 1.1 1.1 11.1 20.1 55.2 1.5 4.8
2 1.2 0.1 9.3 20.8 44.5 3.5 9.5
3 1.1 0.4 10.1 22.2 45.3 2.6 8.0
4 1.0 0.3 9.4 20.4 46.0 3.9 10.0
... ... ... ... ... ... ... ...
261 0.3 0.2 1.2 3.5 34.1 0.5 1.7
262 0.1 0.7 1.3 2.6 48.2 0.1 0.6
263 0.5 0.1 1.2 2.3 51.1 0.1 0.6
264 0.3 0.1 1.0 2.4 39.5 0.5 1.5
265 0.3 0.0 0.9 2.7 32.3 0.2 1.4
3P% FTM FTA FT%
(updated to: , 2020-02-24 19:39:00)
0 35.9 10.4 12.0 86.8
1 31.1 6.4 10.4 61.5
2 37.4 7.9 9.3 85.5
3 32.0 6.9 8.1 84.4
4 39.3 6.8 7.7 88.9
... ... ... ... ...
261 29.5 0.3 0.4 57.1
262 15.4 0.5 0.7 69.0
263 26.1 0.6 0.9 63.9
264 35.0 0.5 0.5 88.9
265 15.9 0.5 0.6 89.3
[266 rows x 15 columns]
I'm trying to narrow the df down to the rows where two columns are above their means, but extracting values based on the condition raises the following error.
def get_stat(self):
    pts_fgm_df = self.df.head(n=120)
    rslt_df = pts_fgm_df.loc[pts_fgm_df['PTS'] > pts_fgm_df['PTS'].mean() & pts_fgm_df['FG%'] > pts_fgm_df.mean()]
    print(rslt_df)
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
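For context: & binds more tightly than > in Python, so the condition is parsed as pts_fgm_df['PTS'] > (pts_fgm_df['PTS'].mean() & pts_fgm_df['FG%']) > pts_fgm_df.mean(), and bitwise-and between a float and a Series of floats raises this TypeError. A minimal reproduction on a toy frame (hypothetical data):
import pandas as pd
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
# Raises the TypeError: parsed as toy['a'] > (toy['a'].mean() & toy['b']) > toy['b'].mean()
# toy[toy['a'] > toy['a'].mean() & toy['b'] > toy['b'].mean()]
# Fix: parenthesize each comparison
print(toy[(toy['a'] > toy['a'].mean()) & (toy['b'] > toy['b'].mean())])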
My eventual solution:
top_df = self.df.head(n=120)
mean_pts = top_df['PTS'].mean()
mean_fgp = top_df['FG%'].mean()
rslt_df = top_df[
(top_df['PTS'] >= mean_pts) &
(top_df['FG%'] >= mean_fgp)
]
return rslt_df
My problem was that writing the logic inline made it hard to read.
# So the solution is to first give every statement a variable name.
mean_pts = top_df['PTS'].mean()
mean_fgp = top_df['FG%'].mean()
pts = top_df['PTS']
fgp = top_df['FG%']
Then filter based on them:
# Which makes this a lot clearer to see missing brackets and such.
rslt_df = top_df[
(pts >= mean_pts) &
(fgp >= mean_fgp)
]
return rslt_df

How can I split columns with regex to move trailing CAPS into a separate column?

I'm trying to split a column using regex, but can't seem to get the split right. I want to take all the trailing CAPS (runs of 2-4 capital letters) and move them into a separate column. However, the split only leaves the 'Name' column populated while the 'Team' column is blank.
Here's my code:
import pandas as pd
url = "https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgAssists/dir/desc"
df = pd.read_html(url)[0].join(pd.read_html(url)[1])
df[['Name','Team']] = df['Name'].str.split('[A-Z]{2,4}', expand=True)
I want this:
print(df.head(5).to_string())
RK Name POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron JamesLA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky RubioPHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka DoncicDAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben SimmonsPHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae YoungATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
to become this:
print(df.head(5).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 LeBron James LA SF 35 35.1 24.9 9.6 19.7 48.6 2.0 6.0 33.8 3.7 5.5 67.7 7.9 11.0 1.3 0.5 3.7 28 9 26.10
1 2 Ricky Rubio PHX PG 30 32.0 13.6 4.9 11.9 41.3 1.2 3.7 31.8 2.6 3.1 83.7 4.6 9.3 1.3 0.2 2.5 12 1 16.40
2 3 Luka Doncic DAL SF 32 32.8 29.7 9.6 20.2 47.5 3.1 9.4 33.1 7.3 9.1 80.5 9.7 8.9 1.2 0.2 4.2 22 11 31.74
3 4 Ben Simmons PHIL PG 36 35.4 14.9 6.1 10.8 56.3 0.1 0.1 40.0 2.7 4.6 59.0 7.5 8.6 2.2 0.7 3.6 19 3 19.49
4 5 Trae Young ATL PG 34 35.1 28.9 9.3 20.8 44.8 3.5 9.4 37.5 6.7 7.9 85.0 4.3 8.4 1.2 0.1 4.8 11 1 23.47
str.split treats the matched CAPS as a delimiter and discards them, which is why 'Team' ends up blank. You may extract the data into two columns instead, using a regex like ^(.*?)([A-Z]+)$ or ^(.*[^A-Z])([A-Z]+)$:
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
This will keep all up to the last char that is not an uppercase letter in Group "Name" and the last uppercase letters in Group "Team".
See regex demo #1 and regex demo #2
Details
^ - start of a string
(.*?) - Capturing group 1: any zero or more chars other than line break chars, as few as possible
or
(.*[^A-Z]) - any zero or more chars other than line break chars, as many as possible, up to the last char that is not an ASCII uppercase letter (granted the subsequent patterns match) (note that this pattern implies there is at least 1 char before the last uppercase letters)
([A-Z]+) - Capturing group 2: one or more ASCII uppercase letters
$ - end of string.
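As a self-contained illustration with hypothetical sample names, str.extract keeps both capture groups, whereas str.split discards the matched delimiter:
import pandas as pd
s = pd.Series(['LeBron JamesLA', 'Ricky RubioPHX', 'Ben SimmonsPHIL'])
out = s.str.extract(r'^(.*?)([A-Z]+)$')
out.columns = ['Name', 'Team']
print(out)
#            Name  Team
# 0  LeBron James    LA
# 1   Ricky Rubio   PHX
# 2   Ben Simmons  PHIL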
I have made a few alterations to the function; you might need to import the re package.
It's a bit manual, but I hope this will suffice. Have a great day!
import re

df_obj_skel = dict()
df_obj_skel['Name'] = list()
df_obj_skel['Team'] = list()
for index, row in df.iterrows():
    Name = row['Name']
    # Grab the trailing run of 2-4 capital letters as the team code
    Findings = re.search('[A-Z]{2,4}$', Name)
    Refined_Team = Findings[0]
    # Strip the team code off the end to recover the player name
    Refined_Name = re.sub(Refined_Team + "$", "", Name)
    df_obj_skel['Team'].append(Refined_Team)
    df_obj_skel['Name'].append(Refined_Name)
df_final = pd.DataFrame(df_obj_skel)
print(df_final)

binning two dimensional data by its index in python

How would I bin some data based on the index of the data, in Python 3?
Let's say I have the following data
1 0.5
3 0.6
5 0.7
6 0.8
8 0.9
10 1
11 1.1
12 1.2
14 1.3
15 1.4
17 1.5
18 1.6
19 1.7
20 1.8
22 1.9
24 2
25 2.1
28 2.2
31 2.3
35 2.4
How would I take this data and bin both columns such that each bin has n values in it, then average the numbers in each bin and output the averages?
For example, if I wanted to bin the values by 4,
I would take the first four data points:
1 0.5
3 0.6
5 0.7
6 0.8
and the averages of these would be: 3.75 0.65
I would continue down the columns by taking the next set of four, and so on,
until I have averaged all of the sets of four to get this:
3.75 0.65
10.25 1.05
16 1.45
21.25 1.85
29.75 2.25
How can I do this using Python?
Based on numpy reshape (note this assumes the row count is an exact multiple of the bin size):
pd.DataFrame([np.mean(x.reshape(len(df)//4,-1),axis=1) for x in df.values.T]).T
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
You can "bin" the index into groups of 4 and call groupby in the index.
df.groupby(df.index // 4).mean()
0 1
0 3.75 0.65
1 10.25 1.05
2 16.00 1.45
3 21.25 1.85
4 29.75 2.25
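Both approaches generalize to any bin size n. A minimal end-to-end sketch, assuming the data above is in a whitespace-separated text file (hypothetical filename):
import pandas as pd
# Read the two whitespace-separated columns shown above (no header row)
df = pd.read_csv('data.txt', sep=r'\s+', header=None)  # hypothetical file
n = 4  # bin size
print(df.groupby(df.index // n).mean())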
