Pandas drop and update rows and columns based on column value - python

Here is a sample CSV file of cricket scores:
>>> df
venue ball run extra wide noball
0 a 0.1 0 1 NaN NaN
1 a 0.2 4 0 NaN NaN
2 a 0.3 1 5 5.0 NaN
3 a 0.4 1 0 NaN NaN
4 a 0.5 1 1 NaN 1.0
5 a 0.6 2 1 NaN NaN
6 a 0.7 6 2 1.0 1.0
7 a 0.8 0 0 NaN NaN
8 a 0.9 1 1 NaN NaN
9 a 1.1 2 2 NaN NaN
10 a 1.2 1 0 NaN NaN
11 a 1.3 6 1 NaN NaN
12 a 1.4 0 2 NaN 2.0
13 a 1.5 1 0 NaN NaN
14 a 1.6 2 0 NaN NaN
15 a 1.7 0 1 NaN NaN
16 a 0.1 0 5 NaN NaN
17 a 0.2 4 0 NaN NaN
18 a 0.3 1 1 NaN NaN
19 a 0.4 3 0 NaN NaN
20 a 0.5 0 0 NaN NaN
21 a 0.6 0 2 2.0 NaN
22 a 0.7 6 1 NaN NaN
23 a 1.1 4 0 NaN NaN
From this dataframe I want to update the ball value, generate 2 new columns and drop 4 entire columns. The conditions are:
When "wide" or "noball" is null, crun = crun + run + extra, accumulating until the next ball = 0.1 (i.e. restarting for each innings).
When "wide" or "noball" is not null, the current ball value won't be incremented and the row will be dropped after the crun calculation (crun = crun + run + extra still applies). This also continues until the next ball = 0.1.
Let me break it down for row index 0 to 8:
0.1 "wide" or "noball" is null and crun = 1
0.2 "wide" or "noball" is null and crun = 1+4 = 5
0.3 "wide" or "noball" is not null (removed)
0.4 "wide" or "noball" is null (becomes 0.3) and crun = 5+1+5+1 = 12
0.5 "wide" or "noball" is not null (removed)
0.6 "wide" or "noball" is null (becomes 0.4) and crun = 12+1+1+2+1 = 17
0.7 "wide" or "noball" is not null (removed)
0.8 "wide" or "noball" is null (becomes 0.5) and crun = 17+6+2 = 25
0.9 "wide" or "noball" is null (becomes 0.6) and crun = 25+1+1 = 27
Finally, a "total" column will be created which holds the max of crun within each ball = 0.1 group. Then the "run", "extra", "wide", "noball" columns should be dropped.
The output I want:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.3 12 45
3 a 0.4 17 45
4 a 0.5 25 45
5 a 0.6 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.4 42 45
10 a 1.5 44 45
11 a 1.6 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.6 23 27
18 a 1.1 27 27
I find it too complex; please help. The code I tried:
df = pd.read_csv("data.csv")
gr = df.groupby(df.ball.eq(0.1).cumsum())
df["crun"] = gr.runs.cumsum()
df["total"] = gr.current_run.transform("max")
df = df.drop(['run', 'extra', 'wide', 'noball'], axis=1)

Alrighty. This was a fun one.
(I tried to add comments for clarity.)
Note: "ball," "run," "extra," "wide," and "noball" are all numeric fields.
Note Note: This all assumes your initial DataFrame is under a variable named df.
# Create target groupings by ball value.
df["target_groups"] = df.loc[df["ball"] == 0.1].groupby(level=-1).ngroup()
df["target_groups"].fillna(method="ffill", inplace=True)
# --- Create subgroups --- #
df["target_subgroups"] = df["ball"].astype(int)
# Add field for the sum of run and extra
df["run_extra"] = df[["run", "extra"]].sum(axis=1)
# Apply groupby() and cumsum() as follows to get the cumulative sum
# of each ball group for run and extra.
df["crun"] = df.groupby(["target_groups"])["run_extra"].cumsum()
# Create dataframe for max crun value of each group
group_max_df = df.groupby(["target_groups"])["crun"].max().to_frame().reset_index()
# Merge the two DataFrames with the given suffixes. The empty first suffix
# just prevents crun from having a suffix added, which would be an additional
# step to undo.
# You could probably use .join() in a similar manner.
df = pd.merge(df, group_max_df,
              on=["target_groups"],
              suffixes=("", "_total"),
              sort=False
              )
# Rename your new total field.
df.rename(columns={"crun_total": "total"}, inplace = True)
# Apply your wide and noball condition here.
df = df[(df["wide"].isna()) & (df["noball"].isna())].copy()
# -- Reset `ball` column -- #
# Add temp column with static value
df["tmp_ball"] = 0.1
# Generate cumulative sum by subgroup.
# Set `ball` to modulo 0.6
df.loc[:, "ball"] = df.groupby(["target_subgroups"])["tmp_ball"].cumsum() % 0.6
# Find rows where ball == 0.0 and set those to 0.6
df.loc[df["ball"] == 0.0, "ball"] = 0.6
# Add ball and target_subgroups columns to get final ball value.
df["ball"] = df["ball"] + df["target_subgroups"]
# Reset your main index, if desired
df.reset_index(drop=True, inplace=True)
# Select only desired field for output.
df = df.loc[:, ["venue","ball","crun","total"]].copy()
Output of df:
venue ball crun total
0 a 0.1 1 45
1 a 0.2 5 45
2 a 0.4 12 45
3 a 0.6 17 45
4 a 0.8 25 45
5 a 0.9 27 45
6 a 1.1 31 45
7 a 1.2 32 45
8 a 1.3 39 45
9 a 1.5 42 45
10 a 1.6 44 45
11 a 1.7 45 45
12 a 0.1 5 27
13 a 0.2 9 27
14 a 0.3 11 27
15 a 0.4 14 27
16 a 0.5 14 27
17 a 0.7 23 27
18 a 1.1 27 27
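One caveat with the ball renumbering above (my note, not part of the answer): the cumsum-modulo trick is sensitive to floating-point error, and because the cumulative sum is grouped only by target_subgroups, overs with the same number in different innings share one running count. A sketch that renumbers the remaining legal deliveries 0.1, 0.2, ... within each (innings, over) pair, matching the desired output, could replace the tmp_ball / modulo block, assuming target_groups and target_subgroups are still present at that point:
# Assumption: run this instead of the tmp_ball / cumsum-modulo block,
# while target_groups and target_subgroups still exist in df.
ball_in_over = df.groupby(["target_groups", "target_subgroups"]).cumcount() + 1
df["ball"] = df["target_subgroups"] + ball_in_over / 10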

Related

Assistance with converting a "read_html" result into a proper DataFrame

I'm looking to grab the stats table from this example link:
https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50
...however when I grab it, there's an extra column header that I can't seem to get rid of, and the bottom row has some useless "Page size" string that I'd also like to get rid of.
I've provided an example code below for testing, along with some attempts to fix the issue, but to no avail.
from pandas import read_html, set_option
#set_option('display.max_rows', 20)
#set_option('display.max_columns', None)
# Extract the table from the provided link
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
print(table_of_interest)
# Attempt 1 - https://stackoverflow.com/questions/68385659/in-pandas-python-how-do-i-get-rid-of-the-extra-column-header-with-index-numbers
df = table_of_interest.iloc[1:,:-1]
print(df)
# Attempt 2 - https://stackoverflow.com/questions/71379513/remove-extra-column-level-from-dataframe
df = table_of_interest.rename_axis(columns=None)
print(df)
Both attempts result in output that still has the extra header. I want to get rid of that top "1 Page size: select 14 items in 1 pages" column header. How?
You could try as follows:
from pandas import read_html
url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
# keep only level 1 from the original MultiIndex cols
table_of_interest.columns = [col[1] for col in table_of_interest.columns]
# get rid of the last row (the "Page size" footer)
table_of_interest = table_of_interest.iloc[:-1]
print(table_of_interest)
# Team G PA HR R RBI SB ... SLG wOBA xwOBA wRC+ BsR Off Def WAR
0 1 STL 13 41 3 9 9 1 ... .613 .413 NaN 173 0.1 3.6 0.4 0.6
1 2 NYM 16 42 0 5 5 0 ... .400 .388 NaN 157 -0.6 2.2 -0.3 0.3
2 3 MIL 15 40 0 4 4 1 ... .424 .346 NaN 122 0.2 1.2 -0.2 0.2
3 4 CHC 16 35 1 5 5 0 ... .483 .368 NaN 140 -0.5 1.0 -0.2 0.2
4 5 HOU 13 38 2 3 3 1 ... .514 .334 NaN 119 0.1 0.7 -0.1 0.2
5 6 CIN 14 38 1 6 6 0 ... .371 .301 NaN 85 0.0 -0.7 -0.1 0.1
6 7 CLE 14 37 0 1 1 1 ... .242 .252 NaN 66 0.2 -1.4 0.3 0.0
7 8 ARI 18 34 1 4 3 0 ... .231 .276 NaN 77 0.0 -1.1 -0.2 0.0
8 9 SDP 15 36 0 2 2 1 ... .172 .243 NaN 57 0.2 -1.6 -0.1 -0.1
9 10 WSN 15 35 1 1 1 0 ... .313 .243 NaN 51 -0.1 -2.3 -0.3 -0.1
10 11 ATL 14 36 1 3 2 0 ... .226 .227 NaN 44 0.0 -2.6 -0.1 -0.2
11 12 KCR 13 31 0 3 3 0 ... .214 .189 NaN 30 -0.4 -3.1 0.0 -0.2
12 13 LAA 15 31 0 1 1 0 ... .207 .183 NaN 21 0.0 -3.1 0.0 -0.2
13 14 PIT 18 32 0 0 0 0 ... .200 .209 NaN 34 -0.4 -3.0 -0.6 -0.3
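A shorter equivalent (my sketch, not part of the answer above) is to drop the unwanted top level of the MultiIndex columns instead of rebuilding them:
from pandas import read_html

url = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2022&month=1000&season1=2022&ind=0&team=0,ts&rost=0&age=0&filter=&players=0&startdate=2022-04-07&enddate=2022-04-07&page=1_50"
table_of_interest = read_html(url)[-2]
# drop level 0 of the two-level header, keeping level 1
table_of_interest.columns = table_of_interest.columns.droplevel(0)
# drop the trailing "Page size" row
table_of_interest = table_of_interest.iloc[:-1]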

Grouping & aggregating on level 1 index & assigning different aggregation functions using pandas

I have a dataframe df:
2019 2020 2021 2022
A 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
B 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
C 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
D 1 10 15 15 31
2 5 4 7 9
3 0.3 0.4 0.4 0.7
4 500 600 70 90
I am trying to group by the level 1 index (1, 2, 3, 4) and assign a different aggregation function to each of those indexes, so that 1 is aggregated by sum, 2 by mean, and so on. The end result would look like this:
2019 2020 2021 2022
1 40 ... ... # sum
2 5 ... ... # mean
3 0.3 ... ... # mean
4 2000 ... ... # sum
I tried:
df.groupby(level = 1).agg({'1':'sum', '2':'mean', '3':'sum', '4':'mean'})
But I get an error that none of 1, 2, 3, 4 are in the columns (which they are not), so I am not sure how I should proceed with this problem.
You could use apply with a custom function as follows:
import numpy as np

aggs = {1: np.sum, 2: np.mean, 3: np.mean, 4: np.sum}

def f(x):
    func = aggs.get(x.name, np.sum)
    return func(x)

df.groupby(level=1).apply(f)
The above code uses sum by default, so 1 and 4 could be removed from aggs without changing the results. In this way, only groups that should be handled differently from the rest need to be specified.
Result:
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0
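For example, per the note above, this reduced mapping (my illustration) produces the same result, because missing keys fall back to np.sum inside f:
aggs = {2: np.mean, 3: np.mean}  # 1 and 4 fall back to the np.sum default
df.groupby(level=1).apply(f)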
Just in case you were after avoiding explicit loops: slice and group by the index, and aggregate conditionally.
df1 = (
    df.groupby([df.index.get_level_values(level=1)]).agg(
        lambda x: x.sum() if x.index.get_level_values(level=1).isin([1, 4]).any() else x.mean())
)
df1
2019 2020 2021 2022
1 40.0 60.0 60.0 124.0
2 5.0 4.0 7.0 9.0
3 0.3 0.4 0.4 0.7
4 2000.0 2400.0 280.0 360.0

Select value from DataFrame based on interval in pandas [duplicate]

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3'],
                    'start': [0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000],
                    'end': [100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000],
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2, 0.1, 0.5, 0.5, 0.9, 0.3]})
I want to match each 'Chr' and 'position' in df1 against the 'Chr' and interval in df2 where the position falls between 'start' and 'end', and then add the 'logr' and 'seg' columns to df1.
My desired output is:
df3 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250],
                    'logr': [3, 4, 10, 11, 18, 13, "NA"],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.3, 0.1, "NA"]})
Thank you in advance.
Use DataFrame.merge with an outer join to get all combinations, then filter with Series.between and boolean indexing, using DataFrame.pop to extract (and drop) the helper columns, and finally a left join to add back the missing rows:
df3 = df1.merge(df2, on='Chr', how='outer')
#between is inclusive by default (>=, <=); use parameter inclusive=False for (>, <)
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
#if need one inclusive and another interval not (e.g. >, <=)
#df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
df3 = df1.merge(df3, how='left')
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
1 1 500 4.0 0.5
2 2 1030 10.0 0.2
3 2 2005 11.0 0.1
4 3 3575 18.0 0.3
5 3 50 13.0 0.1
6 4 250 NaN NaN
Another solution:
df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
#if different closed intervals
#df3 = df3[(df3['position'] > s) & (df3['position'] <= e) | s.isna() | e.isna()]
print (df3)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Try using pd.merge() and np.where():
import pandas as pd
import numpy as np
res_df = pd.merge(df1, df2, on=['Chr'], how='outer')
res_df['check_between'] = np.where((res_df['position'] >= res_df['start']) & (res_df['position'] <= res_df['end']), True, False)
df3 = res_df[(res_df['check_between'] == True) |
             (res_df['start'].isnull()) |
             (res_df['end'].isnull())].copy()
df3.drop(['check_between', 'start', 'end'], axis=1, inplace=True)
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
Do a left merge with indicator=True. Next, query checks whether position lies between start and end, or whether the _merge value is left_only. Finally, drop the unwanted columns:
df1.merge(df2, 'left', indicator=True).query('(start<=position<=end) | _merge.eq("left_only")') \
   .drop(['start', 'end', '_merge'], axis=1)
Out[364]:
Chr position logr seg
0 1 50 3.0 0.2
6 1 500 4.0 0.5
12 2 1030 10.0 0.2
18 2 2005 11.0 0.1
24 3 3575 18.0 0.3
25 3 50 13.0 0.1
30 4 250 NaN NaN
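Yet another sketch (my addition, not one of the answers above): look positions up row by row against an IntervalIndex built per chromosome. It is slower than a merge on large frames, but it never builds the full cross product:
import numpy as np
import pandas as pd

def lookup(row):
    # candidate intervals on the same chromosome
    sub = df2[df2['Chr'] == row['Chr']]
    # closed='left' is an arbitrary choice here; no sample position sits on a boundary
    iv = pd.IntervalIndex.from_arrays(sub['start'], sub['end'], closed='left')
    hits = sub[iv.contains(row['position'])]
    if hits.empty:
        return pd.Series({'logr': np.nan, 'seg': np.nan})
    return pd.Series({'logr': hits['logr'].iloc[0], 'seg': hits['seg'].iloc[0]})

df3 = pd.concat([df1, df1.apply(lookup, axis=1)], axis=1)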

Find pct change of every two columns' values in pandas

I am trying to take a data frame that has time series values on the row axis and get % change.
For example here is the data:
77 70 105
50 25 50
15 20 10
This is the required result:
-0.1 0.5
-0.5 1
0.33 -0.5
You can use df.pct_change over axis 1 and df.dropna.
df
0 1 2
0 77 70 105
1 50 25 50
2 15 20 10
df.pct_change(axis=1).dropna(axis=1)
1 2
0 -0.090909 0.5
1 -0.500000 1.0
2 0.333333 -0.5
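If it helps to see what pct_change is doing here, an equivalent manual sketch (my addition) shifts along the columns and divides:
shifted = df.shift(1, axis=1)                       # each column's left neighbour
result = ((df - shifted) / shifted).dropna(axis=1)  # drop the first, all-NaN column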

python pandas pivot_table column level one wrong name

I have the following table:
ID Metric Level Level(% Change) Level(Diff)
Index
0 2016 A 10 NaN NaN
1 2017 A 15 0.5 5
2 2018 A 20 0.3 5
3 2016 B 40 NaN NaN
4 2017 B 45 0.2 5
5 2018 B 50 0.1 5
I'd like to get the following:
A_Level B_Level A_Level(% Change) B_Level(% Change) A_Level(Diff) B_Level(Diff)
Index
2016 10 40 NaN NaN NaN NaN
2017 15 45 0.5 0.2 5 5
2018 20 50 0.3 0.1 5 5
I tried:
df = pd.pivot_table(df, index = 'ID', values = ['Level','Level(% Change)','Level(Diff)'], columns = ['Metric'])
df.columns = df.columns.map('_'.join)
However I only get the following table:
Level_A Level_B Level_A Level_B Level_A Level_B
Index
2016 10 40 NaN NaN NaN NaN
2017 15 45 0.5 0.2 5 5
2018 20 50 0.3 0.1 5 5
Basically, the data in the pivot is correct, but the labels in the first column level are wrong: there is only 'Level', while 'Level(% Change)' and 'Level(Diff)' are missing. I would also like to get 'A_Level' instead of 'Level_A'.
Thank you in advance
Use a list comprehension with a and b swapped and f-strings:
df = pd.pivot_table(df,
                    index = 'ID',
                    values = ['Level','Level(% Change)','Level(Diff)'],
                    columns = ['Metric'])
df.columns = [f'{b}_{a}' for a, b in df.columns]
Or add DataFrame.swaplevel:
df.columns = df.swaplevel(0,1, axis=1).columns.map('_'.join)
print (df)
A_Level B_Level A_Level(% Change) B_Level(% Change) A_Level(Diff) \
ID
2016 10 40 NaN NaN NaN
2017 15 45 0.5 0.2 5.0
2018 20 50 0.3 0.1 5.0
B_Level(Diff)
ID
2016 NaN
2017 5.0
2018 5.0
Alternatively, you could melt the data to do the concatenation:
(df.melt(id_vars=['ID', 'Metric'])
 .assign(header=lambda x: x.Metric + '_' + x.variable)
 .pivot_table(index='ID', columns='header', values='value'))
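One note on the melt approach (my addition): pivot_table aggregates with mean by default, which is a no-op here because every (ID, header) pair is unique; an explicit aggfunc makes that intent clearer:
(df.melt(id_vars=['ID', 'Metric'])
 .assign(header=lambda x: x.Metric + '_' + x.variable)
 .pivot_table(index='ID', columns='header', values='value', aggfunc='first'))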
