I am trying to take a data frame that has time series values along each row and compute the percent change.
For example here is the data:
77 70 105
50 25 50
15 20 10
This is the required result:
-0.1 0.5
-0.5 1
0.33 -0.5
You can use df.pct_change along axis 1, then drop the resulting all-NaN first column with df.dropna.
df
0 1 2
0 77 70 105
1 50 25 50
2 15 20 10
df.pct_change(axis=1).dropna(axis=1)
1 2
0 -0.090909 0.5
1 -0.500000 1.0
2 0.333333 -0.5
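A minimal, self-contained version of the steps above, using the question's numbers and explicit axis keywords (newer pandas versions require the keyword form):

```python
import pandas as pd

# Reproduce the example frame from the question.
df = pd.DataFrame([[77, 70, 105],
                   [50, 25, 50],
                   [15, 20, 10]])

# Percent change across each row, then drop the all-NaN first column.
result = df.pct_change(axis=1).dropna(axis=1)
print(result)
```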
I'm looking to grab the stats table from this example link:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50
...however when I grab it, there's an extra column-header level that I can't seem to get rid of, and the bottom row has a useless "Page size" string that I'd also like to remove.
I've provided an example code below for testing, along with some attempts to fix the issue, but to no avail.
from pandas import read_html, set_option
#set_option('display.max_rows', 20)
#set_option('display.max_columns', None)
# Extract the table from the provided link
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
print(table_of_interest)
# Attempt 1 - https://stackoverflow.com/questions/68385659/in-pandas-python-how-do-i-get-rid-of-the-extra-column-header-with-index-numbers
df = table_of_interest.iloc[1:,:-1]
print(df)
# Attempt 2 - https://stackoverflow.com/questions/71379513/remove-extra-column-level-from-dataframe
df = table_of_interest.rename_axis(columns=None)
print(df)
Both attempts result in the same output. I want to get rid of that top "1 Page size: select 14 items in 1 pages" column header. How?
You could try as follows:
from pandas import read_html
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
# keep only level 1 from the original MultiIndex cols
table_of_interest.columns = [col[1] for col in table_of_interest.columns]
# drop the last row (the "Page size" footer)
table_of_interest = table_of_interest.iloc[:-1]
print(table_of_interest)
# Team G PA HR R RBI SB ... SLG wOBA xwOBA wRC+ BsR Off Def WAR
0 1 STL 13 41 3 9 9 1 ... .613 .413 NaN 173 0.1 3.6 0.4 0.6
1 2 NYM 16 42 0 5 5 0 ... .400 .388 NaN 157 -0.6 2.2 -0.3 0.3
2 3 MIL 15 40 0 4 4 1 ... .424 .346 NaN 122 0.2 1.2 -0.2 0.2
3 4 CHC 16 35 1 5 5 0 ... .483 .368 NaN 140 -0.5 1.0 -0.2 0.2
4 5 HOU 13 38 2 3 3 1 ... .514 .334 NaN 119 0.1 0.7 -0.1 0.2
5 6 CIN 14 38 1 6 6 0 ... .371 .301 NaN 85 0.0 -0.7 -0.1 0.1
6 7 CLE 14 37 0 1 1 1 ... .242 .252 NaN 66 0.2 -1.4 0.3 0.0
7 8 ARI 18 34 1 4 3 0 ... .231 .276 NaN 77 0.0 -1.1 -0.2 0.0
8 9 SDP 15 36 0 2 2 1 ... .172 .243 NaN 57 0.2 -1.6 -0.1 -0.1
9 10 WSN 15 35 1 1 1 0 ... .313 .243 NaN 51 -0.1 -2.3 -0.3 -0.1
10 11 ATL 14 36 1 3 2 0 ... .226 .227 NaN 44 0.0 -2.6 -0.1 -0.2
11 12 KCR 13 31 0 3 3 0 ... .214 .189 NaN 30 -0.4 -3.1 0.0 -0.2
12 13 LAA 15 31 0 1 1 0 ... .207 .183 NaN 21 0.0 -3.1 0.0 -0.2
13 14 PIT 18 32 0 0 0 0 ... .200 .209 NaN 34 -0.4 -3.0 -0.6 -0.3
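Since the live page can change, here is an offline sketch of the same cleanup on a hand-built frame with a similar two-level header (the footer text and the three column names are stand-ins for what read_html actually returns); get_level_values(1) is equivalent to the list comprehension above:

```python
import pandas as pd

# Simulate the scraped table: a two-level column header plus a footer row.
cols = pd.MultiIndex.from_tuples(
    [("Page size: select 14 items in 1 pages", c) for c in ("Team", "G", "PA")])
table = pd.DataFrame(
    [["STL", 13, 41],
     ["NYM", 16, 42],
     ["Page size: select 14 items in 1 pages", None, None]],
    columns=cols)

# Keep only the second header level, then drop the trailing footer row.
table.columns = table.columns.get_level_values(1)
table = table.iloc[:-1]
print(table)
```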
Here is data frame df1, from which I take the A column as a Series.
df1
A B
0 10 SLC
1 20 MNS
2 60 LLK
3 40 GNT
4 22 VJZ
5 06 NLR
I have differenced the series with the code below.
df1['difference'] = df1['A'].diff().fillna(0)
df1
A B difference
0 10 SLC 0 <<---- place 10-20 = -10 value here
1 20 MNS -10 <<---- place 20-60 = -40 value here
2 60 LLK -40 <<---- place 60-40 = 20 value here
3 40 GNT 20 ..............
4 22 VJZ 18 ..............
5 06 NLR 16 ..............
How do I place the difference between 10 and 20 at the 0th index of the 'difference' column, and so on?
Change the default periods=1 to -1 to take the difference with the following row; fillna then fills the last row, which has no successor, with its own value:
df1['difference'] = df1['A'].diff(-1).fillna(df1['A'])
print (df1)
A B difference
0 10 SLC -10.0
1 20 MNS -40.0
2 60 LLK 20.0
3 40 GNT 18.0
4 22 VJZ 16.0
5 6 NLR 6.0
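Put together as a runnable sketch of the question's frame:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [10, 20, 60, 40, 22, 6],
                    'B': ['SLC', 'MNS', 'LLK', 'GNT', 'VJZ', 'NLR']})

# diff(-1) subtracts the *following* row; the last row has no successor,
# so fillna keeps its own value there instead of NaN.
df1['difference'] = df1['A'].diff(-1).fillna(df1['A'])
print(df1)
```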
I want to plot a density graph, so I need to extract two values from the data to use as coordinates, plus a third value for the density. I have a text file that I want to read; when one value in a row matches a condition, I want to save two other values from that row. I have something that works for one of the conditions, but it saves the row values within a nested list. Is there a more pythonic way to do this? I think it would make plotting easier later.
Data:
Accum EdgeThr NumberOfBlobs durationMin Vol Perom X Y
50 0 0 0.03 0 0 0 0
50 2 0 0.03 0 0 0 0
50 4 0 0.03 0 0 0 0
50 6 0 0.03 0 0 0 0
50 8 2 0.03 27.833133599054975 0.0 1032.0 928.0
50 10 2 0.03 27.833133599054975 0.0 1032.0 928.0
46 30 2 0.17 27.833133599054975 0.0 968.0 962.0
46 32 2 0.17 27.833133599054975 0.0 1028.0 1020.0
46 34 2 0.17 27.833133599054975 0.0 978.0 1122.0
46 36 2 0.17 27.833133599054975 0.0 1000.0 1080.0
46 38 2 0.18 27.833133599054975 0.0 1010.0 1062.0
Code:
import pandas as pd
# load data as a pandas dataframe
df = pd.read_csv('dummy.txt', sep='\t', lineterminator='\r')
# to find the rows matching one condition ==2
blob2 = []
for index, row in df.iterrows():
    temp = [row['Accum'], row['EdgeThr']]
    if row['NumberOfBlobs'] == 2:
        blob2.append(temp)
    print(index, row['Accum'], row['EdgeThr'], row['NumberOfBlobs'])
print(blob2)
df[df['NumberOfBlobs'] == 2] will select all rows that fulfill the condition.
df[df['NumberOfBlobs'] == 2][['Accum', 'EdgeThr']] will select those two columns.
Here is an example:
import pandas as pd
import numpy as np
N = 10
df = pd.DataFrame({'Accum': np.random.randint(40, 50, N), 'EdgeThr': np.random.randint(0, 50, N),
'NumberOfBlobs': np.random.randint(0, 2, N) * 2})
blobs2 = df[df['NumberOfBlobs'] == 2][['Accum', 'EdgeThr']]
Example of df:
Accum EdgeThr NumberOfBlobs
0 42 44 2
1 47 32 0
2 45 9 2
3 48 15 2
4 44 6 0
5 42 24 0
6 46 20 0
7 46 9 0
8 40 36 0
9 41 3 0
blobs2:
Accum EdgeThr
0 42 44
2 45 9
3 48 15
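As a side note, the same selection can be done in one step with .loc, which avoids chaining df[cond][cols]; the small frame below is a hypothetical subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Accum': [50, 50, 46],
                   'EdgeThr': [8, 10, 30],
                   'NumberOfBlobs': [2, 0, 2]})

# Row condition and column list in a single .loc call.
blobs2 = df.loc[df['NumberOfBlobs'] == 2, ['Accum', 'EdgeThr']]
print(blobs2)
```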
Suppose I have a df with values as:
user_id sub_id score
39 16 1
39 4 1
40 1 3
40 2 3
40 3 3
and
user_id score
39 2
40 30
So I want to divide columns based on key user_id, such that my result should be as:
user_id sub_id score
39 16 0.5
39 4 0.5
40 1 0.1
40 2 0.1
40 3 0.1
I have tried the div operation but it is not working as I need: it only divides the first appearance and gives me NaN for the rest.
Is there any direct pandas operation or do I need to group both df's and then do the division?
Thanks
You need to divide using div by a Series created with map:
df1['score'] = df1['score'].div(df1['user_id'].map(df2.set_index('user_id')['score']))
print (df1)
user_id sub_id score
0 39 16 0.5
1 39 4 0.5
2 40 1 0.1
3 40 2 0.1
4 40 3 0.1
Detail:
print (df1['user_id'].map(df2.set_index('user_id')['score']))
0 2
1 2
2 30
3 30
4 30
Name: user_id, dtype: int64
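A self-contained sketch rebuilding both frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [39, 39, 40, 40, 40],
                    'sub_id': [16, 4, 1, 2, 3],
                    'score': [1, 1, 3, 3, 3]})
df2 = pd.DataFrame({'user_id': [39, 40], 'score': [2, 30]})

# map looks up each row's user total in df2; div then divides element-wise.
df1['score'] = df1['score'].div(df1['user_id'].map(df2.set_index('user_id')['score']))
print(df1)
```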
With reference to the test data below and the function I use to identify values within a variable thresh of each other: can anyone please help me modify this to produce the desired output I have shown?
Test data
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11],
     'DDD': [98, 50, 25, 10, 10, 11],
     'EEE': [103, 50, 25, 10, 10, 11]})
Function:
thresh = 5

def closeCols2(df):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
Data Before function applied:
AAA BBB CCC DDD EEE
0 4 10 100 98 103
1 5 20 50 50 50
2 6 30 25 25 25
3 7 40 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Current series output after applied:
df2.apply(closeCols2, axis=1)
0 103
1 50
2 25
3 10
4 11
5 11
dtype: int64
Desired output is a dataframe showing all values within thresh and a nan for any not within thresh
AAA BBB CCC DDD EEE
0 nan nan 100 98 103
1 nan nan 50 50 50
2 nan 30 25 25 25
3 7 nan 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Use mask together with sub: apply closeCols2 along axis=1 to get each row's maximum of the close values, then subtract it along axis=0 and mask whatever lies further than thresh from it.
df2.mask(df2.sub(df2.apply(closeCols2, axis=1), axis=0).abs() > thresh)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
note:
I'd redefine closeCols to include thresh as a parameter. Then you could pass it in the apply call.
def closeCols2(df, thresh):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value

df2.apply(closeCols2, axis=1, thresh=5)
extra credit
I vectorized and embedded your closeCols for some mind-numbing fun. Notice there is no apply.
numpy broadcasting gets all combinations of columns subtracted from each other.
np.abs and <= 5 flag the pairs of values within the threshold.
sum(-1): I arranged the broadcasting such that the differences of, say, row 0, column AAA with all of row 0 are laid out across the last dimension; the -1 in sum(-1) says to sum across that last dimension.
<= 1: every value is within 5 of itself, so a value with at least one close neighbor has a count greater than 1. Thus, we mask everything whose count is less than or equal to one.
v = df2.values
df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
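The whole vectorized approach as a runnable sketch; counts is my name for the intermediate sum discussed above, not from the original:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'AAA': [4, 5, 6, 7, 9, 10],
                    'BBB': [10, 20, 30, 40, 11, 10],
                    'CCC': [100, 50, 25, 10, 10, 11],
                    'DDD': [98, 50, 25, 10, 10, 11],
                    'EEE': [103, 50, 25, 10, 10, 11]})
thresh = 5

v = df2.values
# counts[i, j] = number of columns in row i within `thresh` of v[i, j];
# every value is within thresh of itself, so an isolated value counts 1.
counts = (np.abs(v[:, :, None] - v[:, None]) <= thresh).sum(-1)

# Mask values whose only close neighbor is themselves.
result = df2.mask(counts <= 1)
print(result)
```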