Converting a "read_html" result into a proper DataFrame - python

I'm looking to grab the stats table from this example link:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50
...however, when I grab it, there's an extra column header that I can't seem to get rid of, and the bottom row has a useless "Page size" string that I'd also like to remove.
I've provided example code below for testing, along with some attempted fixes, but to no avail.
from pandas import read_html, set_option
#set_option('display.max_rows', 20)
#set_option('display.max_columns', None)
# Extract the table from the provided link
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
print(table_of_interest)
# Attempt 1 - https://stackoverflow.com/questions/68385659/in-pandas-python-how-do-i-get-rid-of-the-extra-column-header-with-index-numbers
df = table_of_interest.iloc[1:,:-1]
print(df)
# Attempt 2 - https://stackoverflow.com/questions/71379513/remove-extra-column-level-from-dataframe
df = table_of_interest.rename_axis(columns=None)
print(df)
Both attempts print the same thing: the top "1 Page size: select 14 items in 1 pages" column header is still there. How do I get rid of it?

You could try as follows:
from pandas import read_html
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
# keep only level 1 from the original MultiIndex cols
table_of_interest.columns = [col[1] for col in table_of_interest.columns]
# get rid of the last row (the "Page size" footer)
table_of_interest = table_of_interest.iloc[:-1]
print(table_of_interest)
# Team G PA HR R RBI SB ... SLG wOBA xwOBA wRC+ BsR Off Def WAR
0 1 STL 13 41 3 9 9 1 ... .613 .413 NaN 173 0.1 3.6 0.4 0.6
1 2 NYM 16 42 0 5 5 0 ... .400 .388 NaN 157 -0.6 2.2 -0.3 0.3
2 3 MIL 15 40 0 4 4 1 ... .424 .346 NaN 122 0.2 1.2 -0.2 0.2
3 4 CHC 16 35 1 5 5 0 ... .483 .368 NaN 140 -0.5 1.0 -0.2 0.2
4 5 HOU 13 38 2 3 3 1 ... .514 .334 NaN 119 0.1 0.7 -0.1 0.2
5 6 CIN 14 38 1 6 6 0 ... .371 .301 NaN 85 0.0 -0.7 -0.1 0.1
6 7 CLE 14 37 0 1 1 1 ... .242 .252 NaN 66 0.2 -1.4 0.3 0.0
7 8 ARI 18 34 1 4 3 0 ... .231 .276 NaN 77 0.0 -1.1 -0.2 0.0
8 9 SDP 15 36 0 2 2 1 ... .172 .243 NaN 57 0.2 -1.6 -0.1 -0.1
9 10 WSN 15 35 1 1 1 0 ... .313 .243 NaN 51 -0.1 -2.3 -0.3 -0.1
10 11 ATL 14 36 1 3 2 0 ... .226 .227 NaN 44 0.0 -2.6 -0.1 -0.2
11 12 KCR 13 31 0 3 3 0 ... .214 .189 NaN 30 -0.4 -3.1 0.0 -0.2
12 13 LAA 15 31 0 1 1 0 ... .207 .183 NaN 21 0.0 -3.1 0.0 -0.2
13 14 PIT 18 32 0 0 0 0 ... .200 .209 NaN 34 -0.4 -3.0 -0.6 -0.3
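Equivalently, the outer header level can be dropped with droplevel instead of rebuilding the columns by hand; a minimal sketch, assuming the same two-level column header as above:
# drop the outer level (the "Page size" header) of the MultiIndex columns
table_of_interest.columns = table_of_interest.columns.droplevel(0)
# drop the trailing "Page size" footer row
table_of_interest = table_of_interest.iloc[:-1]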

Related

Grouping & aggregating on level 1 index & assigning different aggregation functions using pandas

I have a dataframe df:
2019 2020 2021 2022
A 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
B 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
C 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
D 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
I am trying to group by the level 1 index (the values 1, 2, 3, 4) and assign different aggregation functions to them, so that 1 is aggregated by sum, 2 by mean, and so on. The end result would look like this:
2019 2020 2021 2022
1 40 ... ... ... # sum
2 5 ... ... ... # mean
3 0.3 ... ... ... # mean
4 2000 ... ... ... # sum
I tried:
df.groupby(level = 1).agg({'1':'sum', '2':'mean', '3':'sum', '4':'mean'})
But I get an error that none of 1, 2, 3, 4 are in the columns (which is true, they are index values, not columns), so I am not sure how to proceed with this problem.
You could use apply with a custom function as follows:
import numpy as np

aggs = {1: np.sum, 2: np.mean, 3: np.mean, 4: np.sum}

def f(x):
    # look up the aggregation for this group; default to sum
    func = aggs.get(x.name, np.sum)
    return func(x)

df.groupby(level=1).apply(f)
The above code falls back to sum by default, so 1 and 4 could be removed from aggs without changing the result. This way, only the groups that should be handled differently from the rest need to be specified.
Result:
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0
Just in case you were after avoiding explicit loops: slice, group by the index, and aggregate conditionally.
df1 = (
    df.groupby([df.index.get_level_values(level=1)]).agg(
        lambda x: x.sum() if x.index.get_level_values(level=1).isin([1, 4]).any() else x.mean())
)
df1
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0
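Yet another option is to aggregate the two kinds of groups separately and concatenate; a minimal sketch, assuming the two-level index shown in the question:
import pandas as pd

# rows whose level-1 index is 1 or 4 get summed, the rest get averaged
mask = df.index.get_level_values(1).isin([1, 4])
sums = df[mask].groupby(level=1).sum()
means = df[~mask].groupby(level=1).mean()
result = pd.concat([sums, means]).sort_index()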

Pandas drop and update rows and columns based on column value

Here is a sample csv file of cricket scores:
>>> df
venue ball run extra wide noball
0 a 0.1 0 1 NaN NaN
1 a 0.2 4 0 NaN NaN
2 a 0.3 1 5 5.0 NaN
3 a 0.4 1 0 NaN NaN
4 a 0.5 1 1 NaN 1.0
5 a 0.6 2 1 NaN NaN
6 a 0.7 6 2 1.0 1.0
7 a 0.8 0 0 NaN NaN
8 a 0.9 1 1 NaN NaN
9 a 1.1 2 2 NaN NaN
10 a 1.2 1 0 NaN NaN
11 a 1.3 6 1 NaN NaN
12 a 1.4 0 2 NaN 2.0
13 a 1.5 1 0 NaN NaN
14 a 1.6 2 0 NaN NaN
15 a 1.7 0 1 NaN NaN
16 a 0.1 0 5 NaN NaN
17 a 0.2 4 0 NaN NaN
18 a 0.3 1 1 NaN NaN
19 a 0.4 3 0 NaN NaN
20 a 0.5 0 0 NaN NaN
21 a 0.6 0 2 2.0 NaN
22 a 0.7 6 1 NaN NaN
23 a 1.1 4 0 NaN NaN
From this dataframe I want to update the ball value, generate 2 new columns, and drop 4 entire columns. The conditions are:
when "wide" or "noball" is null, crun = crun + run + extra, accumulating until ball restarts at 0.1 (i.e. per group)
when "wide" or "noball" is not null, the concurrent ball value won't be incremented and the row will be dropped after the crun calculation, with crun = crun + run + extra still applied. This also continues until ball restarts at 0.1. Let me break it down for row indexes 0 to 8:
0.1: "wide" or "noball" is null, crun = 1
0.2: "wide" or "noball" is null, crun = 1 + 4 = 5
0.3: "wide" or "noball" is not null (removed)
0.4: "wide" or "noball" is null (becomes 0.3), crun = 5 + 1 + 5 + 1 = 12
0.5: "wide" or "noball" is not null (removed)
0.6: "wide" or "noball" is null (becomes 0.4), crun = 12 + 1 + 1 + 2 + 1 = 17
0.7: "wide" or "noball" is not null (removed)
0.8: "wide" or "noball" is null (becomes 0.5), crun = 17 + 6 + 2 = 25
0.9: "wide" or "noball" is null (becomes 0.6), crun = 25 + 1 + 1 = 27
Finally, a "total" column will be created that holds the max of crun within each group (again delimited by ball = 0.1). Then the "run", "extra", "wide", "noball" columns should be dropped.
The output I want:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.3 12 45
3 a 0.4 17 45
4 a 0.5 25 45
5 a 0.6 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.4 42 45
10 a 1.5 44 45
11 a 1.6 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.6 23 27
18 a 1.1 27 27
I find it too complex, please help. Here is the code I tried:
df = pd.read_csv("data.csv")
gr = df.groupby(df.ball.eq(0.1).cumsum())
df["crun"] = gr.runs.cumsum()
df["total"] = gr.current_run.transform("max")
df = df.drop(['run', 'extra', 'wide', 'noball'], axis=1)
Alrighty. This was a fun one.
(I tried to add comments for clarity.)
Note: "ball," "run," "extra," "wide," and "noball" are all numeric fields.
Note Note: This all assumes your initial DataFrame is under a variable named df.
# Create target groupings by ball value.
df["target_groups"] = df.loc[df["ball"] == 0.1].groupby(level=-1).ngroup()
df["target_groups"].fillna(method="ffill", inplace=True)
# --- Create subgroups --- #
df["target_subgroups"] = df["ball"].astype(int)
# Add field for the sum of run and extra
df["run_extra"] = df[["run", "extra"]].sum(axis=1)
# Apply groupby() and cumsum() as follows to get the cumulative sum
# of each ball group for run and extra.
df["crun"] = df.groupby(["target_groups"])["run_extra"].cumsum()
# Create dataframe for max crun value of each group
group_max_df = df.groupby(["target_groups"])["crun"].max().to_frame().reset_index()
# Merge both of the DataFrames with the given suffixes. The first
# (empty) suffix just prevents crun from having a suffix added, which
# would be an additional step to remove.
# You could probably use .join() in a similar manner.
df = pd.merge(df, group_max_df,
              on=["target_groups"],
              suffixes=("", "_total"),
              sort=False)
# Rename your new total field.
df.rename(columns={"crun_total": "total"}, inplace = True)
# Apply your wide and noball condition here.
df = df[(df["wide"].isna()) & (df["noball"].isna())].copy()
# -- Reset `ball` column -- #
# Add temp column with static value
df["tmp_ball"] = 0.1
# Generate a cumulative ball count per subgroup, then wrap it with
# modulo 0.6 so the count restarts after the sixth legal ball.
df.loc[:, "ball"] = df.groupby(["target_subgroups"])["tmp_ball"].cumsum() % 0.6
# Find rows where ball == 0.0 and set those to 0.6
df.loc[df["ball"] == 0.0, "ball"] = 0.6
# Add ball and target_subgroups columns to get final ball value.
df["ball"] = df["ball"] + df["target_subgroups"]
# Reset your main index, if desired
df.reset_index(drop=True, inplace=True)
# Select only desired field for output.
df = df.loc[:, ["venue","ball","crun","total"]].copy()
Output of df:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.4 12 45
3 a 0.6 17 45
4 a 0.8 25 45
5 a 0.9 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.5 42 45
10 a 1.6 44 45
11 a 1.7 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.7 23 27
18 a 1.1 27 27
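As a cross-check, the groupby idea from the question can also be patched into a much shorter version; a minimal sketch, assuming a data.csv with the columns shown above (unlike the full answer, this keeps the original ball numbering instead of renumbering the surviving balls):
import pandas as pd

df = pd.read_csv("data.csv")
# start a new group every time the over restarts at ball 0.1
g = df["ball"].eq(0.1).cumsum()
# cumulative runs (run + extra) per group, computed before dropping any rows
df["crun"] = (df["run"] + df["extra"]).groupby(g).cumsum()
df["total"] = df["crun"].groupby(g).transform("max")
# drop wide/noball deliveries and the now-redundant columns
out = df[df["wide"].isna() & df["noball"].isna()]
out = out.drop(columns=["run", "extra", "wide", "noball"]).reset_index(drop=True)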

Find the percent change between every two column values in pandas

I am trying to take a data frame that has time series values along the rows and get the % change from one column to the next.
For example here is the data:
77 70 105
50 25 50
15 20 10
This is the required result:
-0.1 0.5
-0.5 1
0.33 -0.5
You can use df.pct_change across axis=1 and then df.dropna to drop the all-NaN first column. Note that pct_change's first positional argument is periods, not axis, so pass axis by keyword.
df
0 1 2
0 77 70 105
1 50 25 50
2 15 20 10
df.pct_change(axis=1).dropna(axis=1)
1 2
0 -0.090909 0.5
1 -0.500000 1.0
2 0.333333 -0.5
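For reference, a self-contained version using the data from the question:
import pandas as pd

df = pd.DataFrame([[77, 70, 105], [50, 25, 50], [15, 20, 10]])
# percent change of each column relative to the column to its left;
# the first column becomes all-NaN and is dropped
result = df.pct_change(axis=1).dropna(axis=1)
print(result)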

Flatten DataFrame by group with columns creation in Pandas

I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.
Use this:
df_out = df.set_index([df.groupby('Id_household').cumcount()+1,
                       'Id_household',
                       'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN
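The same reshape can also be written with pivot; a minimal sketch, assuming pandas >= 1.1 (where pivot accepts a list of index columns):
# number the children within each household, then spread them into columns
out = (df.assign(n=df.groupby('Id_household').cumcount() + 1)
         .pivot(index=['Id_household', 'Age_Father'], columns='n', values='Age_child')
         .add_prefix('Age_child_')
         .reset_index())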

How to shift every element in a column by n rows in a dataframe using python?

I have a dataframe df that looks like below:
No A B value
1 23 36 1
2 45 23 1
3 34 12 2
4 22 76 NaN
...
I would like to shift each of the values in the "value" column down by 2 rows relative to the previous one; the first row's value should not be shifted.
I have already tried a normal shift, which directly shifts everything by 2:
df['value'] = df['value'].shift(2)
I expect the result below:
No A B value
1 23 36 1
2 45 23 NaN
3 34 12 NaN
4 22 76 1
5 10 12 NaN
6 34 2 NaN
7 21 11 2
...
In your case:
import numpy as np
import pandas as pd

df['Newvalue'] = pd.Series(df.value.values, index=np.arange(len(df)) * 3)
df
Out[41]:
No A B value Newvalue
0 1 23 36 1.0 1.0
1 2 45 23 1.0 NaN
2 3 34 12 2.0 NaN
3 4 22 76 NaN 1.0
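The trick is index alignment: the new Series carries index labels 0, 3, 6, ..., so its values land on those rows of df and every other row gets NaN. A sketch generalizing this to an arbitrary gap (space_out is a hypothetical helper, not a pandas function):
import numpy as np
import pandas as pd

def space_out(s, gap):
    # hypothetical helper: place each value of s (gap + 1) rows apart,
    # starting at row 0; index alignment fills the rows in between with
    # NaN when the result is assigned back to the original frame
    return pd.Series(s.to_numpy(), index=np.arange(len(s)) * (gap + 1))

df['Newvalue'] = space_out(df['value'], 2)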
