Comparing Pandas DataFrame rows against two threshold values

I have two DataFrames shown below. The DataFrames in reality are larger than the sample below.
df1
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 max min location
0 0010 20 22 21 23 26 26 20 NY
1 0011 30 25 23 31 33 33 23 CA
2 0012 67 68 68 69 65 69 67 GA
3 0013 34 33 31 30 35 35 31 MO
4 0014 44 42 40 39 50 50 39 WA
df2
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
4 0025 41 42 40 39 50 WA
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
7 0034 40 41 39 39 50 WA
The idea is to compare each row of df2 against the max and min values specified in df1. Which thresholds apply depends on the match in the location column. Any row with at least one cost value outside the range defined by min and max should be placed in a separate DataFrame. Please note that the number of cost columns varies.

Solution
# Merge the dataframes on location to append the min/max columns to df2
df3 = df2.merge(df1[['location', 'max', 'min']], on='location', how='left')
# select the cost like columns
cost = df3.filter(like='cost')
# Check whether the cost values satisfy the interval condition
mask = cost.ge(df3['min'], axis=0) & cost.le(df3['max'], axis=0)
# filter the rows where one or more values in row do not satisfy the condition
df4 = df2[~mask.all(axis=1)]
Result
print(df4)
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
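To try the solution without the original files, here is a minimal self-contained sketch. The frames are typed in from the sample tables above; storing route_no as strings and passing only the location, max and min columns of df1 are assumptions made for brevity. The names outside_rows and within_rows are illustrative.

```python
import pandas as pd

# Thresholds per location, copied from the df1 sample (cost columns omitted).
df1 = pd.DataFrame({
    "location": ["NY", "CA", "GA", "MO", "WA"],
    "max": [26, 33, 69, 35, 50],
    "min": [20, 23, 67, 31, 39],
})

# Rows to check, copied from the df2 sample (route_no kept as text: an assumption).
df2 = pd.DataFrame({
    "route_no": ["0020", "0021", "0023", "0022", "0025", "0030", "0032", "0034"],
    "cost_h1": [19, 31, 66, 34, 41, 19, 37, 40],
    "cost_h2": [27, 22, 67, 33, 42, 26, 31, 41],
    "cost_h3": [21, 23, 68, 31, 40, 20, 31, 39],
    "cost_h4": [24, 30, 70, 30, 39, 24, 20, 39],
    "cost_h5": [20, 33, 65, 35, 50, 20, 35, 50],
    "location": ["NY", "CA", "GA", "MO", "WA", "NY", "MO", "WA"],
})

# Attach the matching min/max to every df2 row; a left merge keeps df2's row order.
df3 = df2.merge(df1, on="location", how="left")

# filter(like="cost") picks up however many cost columns exist.
cost = df3.filter(like="cost")
inside = cost.ge(df3["min"], axis=0) & cost.le(df3["max"], axis=0)

# Use a plain boolean array so the mask is applied by position, not by index label.
all_inside = inside.all(axis=1).to_numpy()
outside_rows = df2[~all_inside]
within_rows = df2[all_inside]
print(outside_rows)
```

Only routes 0025 and 0034 (both WA) stay within their thresholds; every other sample row has at least one value outside the range.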

Related

Using Python Update the maximum value in each row dataframe with the sum of [column with maximum value] and [column name threshold]

Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 100 110 5
21 60 70 80 55 57 8
32 12 43 57 87 98 9
41 99 23 45 65 78 12
This is the demo DataFrame.
For each row, I want to find the maximum among three countries (INDIA, GERMANY, US), add that row's Threshold value to the maximum, and write the result back into the DataFrame.
For example:
max[US,INDIA,GERMANY] = max[US,INDIA,GERMANY] + threshold
After this operation, the DataFrame should be updated as follows:
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
11 40 30 20 105 110 5
21 60 78 80 55 57 8
32 12 43 57 96 98 9
41 111 23 45 65 78 12
I tried to achieve this with a for loop, but it takes too long to execute:
df_max = df_final[['US','INDIA','GERMANY']].idxmax(axis=1)
for ind in df_final.index:
    column = df_max[ind]
    df_final[column][ind] = df_final[column][ind] + df_final['Threshold'][ind]
Please help me with this. Thanks in advance!

The first solution compares the per-row maximum with all values of the selected columns, then multiplies the resulting mask by Threshold and adds it to the original columns:
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
                   .mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
Or use numpy: get the column names with idxmax, compare them with an array built from the list cols, multiply by Threshold, and add to the original columns:
import numpy as np
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
                   df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 30 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
The two solutions differ when a row contains multiple maximum values.
The first solution adds the threshold to every maximum; the second adds it only to the first maximum.
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 100 20 100 110 5 <-changed data double 100
1 21 60 70 80 55 57 8
2 32 12 43 57 87 98 9
3 41 99 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += (df_final[cols].eq(df_final[cols].max(axis=1), axis=0)
.mul(df_final['Threshold'], axis=0))
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 105 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
cols = ['US','INDIA','GERMANY']
df_final[cols] += ((np.array(cols) == df_final[cols].idxmax(axis=1).to_numpy()[:, None]) *
df_final['Threshold'].to_numpy()[:, None])
print (df_final)
Day US INDIA JAPAN GERMANY AUSTRALIA Threshold
0 11 40 105 20 100 110 5
1 21 60 78 80 55 57 8
2 32 12 43 57 96 98 9
3 41 111 23 45 65 78 12
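The tie behaviour described above can be checked with a small self-contained sketch. Note that it swaps in plain numpy comparisons (max for "all ties", argmax for "first tie") rather than reusing the eq/idxmax expressions from the answer; the function name add_threshold_to_max and its ties parameter are hypothetical. The data is the modified sample with the doubled 100 in the first row.

```python
import numpy as np
import pandas as pd

def add_threshold_to_max(df, cols, ties="all"):
    """Return a copy of df with Threshold added to the row-wise maximum of cols.

    ties="all" bumps every column that equals the row maximum;
    any other value bumps only the first maximum (in the order of cols).
    """
    out = df.copy()
    values = out[cols].to_numpy()
    if ties == "all":
        mask = values == values.max(axis=1, keepdims=True)
    else:
        mask = np.zeros(values.shape, dtype=bool)
        # argmax returns the position of the first maximum in each row
        mask[np.arange(len(out)), values.argmax(axis=1)] = True
    out[cols] = values + mask * out["Threshold"].to_numpy()[:, None]
    return out

df_final = pd.DataFrame({
    "Day": [11, 21, 32, 41],
    "US": [40, 60, 12, 99],
    "INDIA": [100, 70, 43, 23],
    "JAPAN": [20, 80, 57, 45],
    "GERMANY": [100, 55, 87, 65],
    "AUSTRALIA": [110, 57, 98, 78],
    "Threshold": [5, 8, 9, 12],
})
cols = ["US", "INDIA", "GERMANY"]

all_ties = add_threshold_to_max(df_final, cols, ties="all")
first_tie = add_threshold_to_max(df_final, cols, ties="first")
print(all_ties[cols])
print(first_tie[cols])
```

Because the function works on a copy, both variants can be compared side by side without rebuilding the input.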

Not able to view CSV from Python Webscrape

I am new to Python and am following a web scraping tutorial. I am having trouble getting my CSV file written to the appropriate folder; basically, I am not able to view the resulting CSV. Does anyone have a solution to this problem?
import pandas as pd
import re
from bs4 import BeautifulSoup
import requests
#Pulling in website source code#
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs= {'class':re.compile('row-player-10-')})
for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects', index = False, sep =',', encoding='utf-8')

I've checked your code and found a problem in this part:
for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects', index = False, sep =',', encoding='utf-8')
Use this instead (loop variable renamed from players to player, and the output path given a drive colon and a .csv file name):
for player in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in player.findall('td')]
    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns
    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C:\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index = False, sep =',', encoding='utf-8')
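The point about the file name can be tried offline. The sketch below is only an illustration: it assumes a temporary directory as a stand-in for the Coding Projects folder, uses result.csv as the file name, and builds a two-row frame from values that appear in the output listing on this page.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the sample output; the column selection is arbitrary.
final_dataframe = pd.DataFrame({
    "PLAYER": ["J.D. Martinez", "Paul Goldschmidt"],
    "BA": [0.351, 0.347],
})

# Stand-in for the target folder (assumption: any writable directory works).
out_dir = Path(tempfile.mkdtemp())

# to_csv expects the path of the file to create, so append a file name.
out_file = out_dir / "result.csv"
final_dataframe.to_csv(out_file, index=False, sep=",", encoding="utf-8")

# Read the file back to confirm that something viewable was written.
round_trip = pd.read_csv(out_file)
print(out_file)
print(round_trip)
```

Building the path with pathlib also avoids hand-typed separators, which is where a missing drive colon or backslash tends to slip in.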

A few issues:
1. As stated in the previous answer, you need to change your for loop to for player in players:. You can't use the same name for the loop variable and for the collection you are looping through.
2. You shouldn't use . in your variable names, as you have in temp.df. The dot indicates attribute or method access. Use an underscore instead: temp_df.
3. You never define final_df, then try to use it in your pd.concat().
4. You never define columns and then try to use it (and temp.df = columns would overwrite your temp_df as well). What you want instead is temp_df.columns = columns. But note that you still need to define columns.
5. Your find_all() for the players is incorrect in that you're searching for a class that contains row-player-10-. There is no class with that. It is row player-10. Very subtle difference, but it's the difference between returning no elements and returning 50 elements.
6. stats = [stat.get_text() for stat in players.findall('td')] again needs to reference player from the for loop, as mentioned in 1). There are also a few syntax things to change to actually pull out the text; in particular, the method is find_all, not findall. So that should be [stat.text for stat in player.find_all('td')].
7. You use pd.concat to join temp_df to final_df within your loop. You can do that, provided you create an initial final_dataframe or final_df (you use two different variable names; I'm not sure which you really wanted), but that will lead to repeating the headers/column names in it and require an extra step. What I would rather do is store each temp_df in a list and then, after looping through all the players, concat the list of DataFrames into a final one.
So here is the full code:
import pandas as pd
import re
from bs4 import BeautifulSoup
import requests
#Pulling in website source code#
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs= {'class':re.compile('.*row player-10-.*')})
columns = soup.find('tr', {'class':'colhead'})
columns = [x.text for x in columns.find_all('td')]
#Initialize a list of dataframes
final_df_list = []
# Loop through the players
for player in players:
    ##Pulling stats for each players
    stats = [stat.text for stat in player.find_all('td')]
    ##Create a data frame for the single player stats
    temp_df = pd.DataFrame(stats).transpose()
    temp_df.columns = columns
    #Put temp_df in a list of dataframes
    final_df_list.append(temp_df)
##Join your list of single players stats
final_dataframe = pd.concat(final_df_list, ignore_index=True)
print(final_dataframe)
##Write to a file path (drive colon and file name included), not just a folder
final_dataframe.to_csv(r'C:\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index = False, sep =',', encoding='utf-8')
Output:
print(final_dataframe)
PLAYER YRS G AB R H ... HR RBI BB SO SB CS BA
0 1 J.D. Martinez 11 54 211 38 74 ... 8 28 24 55 0 0 .351
1 2 Paul Goldschmidt 11 62 236 47 82 ... 16 56 35 50 3 0 .347
2 3 Xander Bogaerts 9 62 232 39 77 ... 6 31 23 50 3 0 .332
3 4 Rafael Devers 5 63 258 53 85 ... 16 40 18 49 1 0 .329
4 5 Manny Machado 10 63 244 46 80 ... 11 43 29 46 7 1 .328
5 6 Jeff McNeil 4 61 216 30 70 ... 4 32 16 27 2 0 .324
6 7 Ty France 3 63 249 29 79 ... 10 41 18 40 0 0 .317
7 8 Bryce Harper 10 58 225 46 71 ... 15 46 24 48 7 2 .316
8 9 Yordan Alvarez 3 57 205 39 64 ... 17 45 31 38 0 1 .312
9 10 Aaron Judge 6 61 232 53 72 ... 25 49 31 66 4 0 .310
10 11 Jose Ramirez 9 59 222 40 68 ... 16 62 34 19 11 3 .306
11 12 Andrew Benintendi 6 61 226 23 68 ... 2 22 24 37 0 0 .301
12 13 Michael Brantley 13 55 207 23 62 ... 4 21 28 24 1 1 .300
13 14 Trea Turner 7 62 242 32 72 ... 8 47 21 48 13 2 .298
14 15 J.P. Crawford 5 59 216 28 64 ... 5 16 28 37 3 1 .296
15 16 Dansby Swanson 6 64 234 39 69 ... 9 37 23 70 9 2 .295
16 17 Mike Trout 11 57 201 44 59 ... 18 38 30 64 0 0 .294
17 Josh Bell 6 65 235 33 69 ... 8 39 28 37 0 1 .294
18 19 Santiago Espinal 2 63 219 25 64 ... 5 31 18 40 3 2 .292
19 20 Trey Mancini 5 58 217 25 63 ... 6 25 24 47 0 0 .290
20 21 Austin Hays 4 60 228 33 66 ... 9 37 18 41 1 3 .289
21 22 Eric Hosmer 11 59 222 23 64 ... 4 29 22 38 0 0 .288
22 23 Freddie Freeman 12 62 241 40 69 ... 5 34 32 43 6 0 .286
23 24 C.J. Cron 8 64 249 36 71 ... 14 44 16 74 0 0 .285
24 Tommy Edman 3 63 246 52 70 ... 7 26 26 45 15 2 .285
25 26 Starling Marte 10 54 222 40 63 ... 7 34 10 45 8 5 .284
26 27 Ian Happ 5 61 209 30 59 ... 7 31 34 50 5 1 .282
27 28 Pete Alonso 3 64 239 41 67 ... 18 59 26 56 2 1 .280
28 29 Lourdes Gurriel Jr. 4 58 206 21 57 ... 3 25 15 41 2 1 .277
29 30 Nathaniel Lowe 3 58 217 25 60 ... 8 24 15 57 1 1 .276
30 31 Mookie Betts 8 60 245 53 67 ... 17 40 27 47 6 1 .273
31 32 Jose Abreu 8 59 224 34 61 ... 9 30 33 42 0 0 .272
32 Amed Rosario 5 53 217 31 59 ... 1 16 10 31 7 1 .272
33 Ke'Bryan Hayes 2 57 213 26 58 ... 2 22 26 53 7 3 .272
34 35 Nolan Arenado 9 61 229 28 62 ... 11 41 25 31 0 2 .271
35 George Springer 8 58 218 39 59 ... 12 33 20 51 4 1 .271
36 37 Ryan Mountcastle 2 53 211 28 57 ... 12 35 11 57 2 0 .270
37 Vladimir Guerrero Jr. 3 62 233 34 63 ... 16 39 27 45 0 1 .270
38 39 Cesar Hernandez 9 65 271 37 73 ... 0 16 17 55 2 2 .269
39 Ketel Marte 7 61 223 33 60 ... 4 22 22 45 4 0 .269
40 Connor Joe 2 60 238 32 64 ... 5 16 32 52 3 2 .269
41 42 Brandon Nimmo 6 57 209 36 56 ... 4 21 27 44 0 1 .268
42 Thairo Estrada 3 59 205 34 55 ... 4 26 14 31 9 1 .268
43 44 Shohei Ohtani 4 63 243 42 64 ... 13 37 24 67 7 5 .263
44 45 Randy Arozarena 3 61 233 30 61 ... 7 31 14 58 12 5 .262
45 46 Nelson Cruz 17 60 222 29 58 ... 7 36 25 50 2 0 .261
46 Hunter Dozier 5 55 203 25 53 ... 6 21 15 50 1 2 .261
47 48 Kyle Tucker 4 58 204 24 53 ... 12 39 31 41 11 1 .260
48 Bo Bichette 3 63 265 35 69 ... 10 33 17 65 4 3 .260
49 50 Charlie Blackmon 11 57 232 29 60 ... 10 33 17 41 2 1 .259
[50 rows x 16 columns]
Lastly, tables are a great way to learn how to use BeautifulSoup because of their structure. But I do want to point out that pandas can parse <table> tags for you with less code:
import pandas as pd
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
final_dataframe = pd.read_html(url, header=1)[0]
final_dataframe = final_dataframe[final_dataframe['PLAYER'].ne('PLAYER')]
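The "collect in a list, then concat once" pattern from the full code can be tried without a network connection. In this sketch the parsed_rows list stands in for the text pulled out of each table row (an assumption; the three rows and the reduced column set are copied from the output above), and the last lines show an alternative that builds the frame in a single call.

```python
import pandas as pd

# Stand-ins for what the scraping loop would produce (assumed, reduced columns).
columns = ["PLAYER", "YRS", "BA"]
parsed_rows = [
    ["J.D. Martinez", "11", ".351"],
    ["Paul Goldschmidt", "11", ".347"],
    ["Xander Bogaerts", "9", ".332"],
]

# Pattern from the answer: one small frame per player, stored in a list.
final_df_list = []
for stats in parsed_rows:
    temp_df = pd.DataFrame(stats).transpose()
    temp_df.columns = columns
    final_df_list.append(temp_df)

# A single concat after the loop instead of growing a frame on every iteration.
final_dataframe = pd.concat(final_df_list, ignore_index=True)

# Alternative: when all rows are already plain lists, build the frame directly.
direct = pd.DataFrame(parsed_rows, columns=columns)

print(final_dataframe)
```

Both routes give the same table here; the list-of-frames version is mainly useful when each row needs its own per-row processing first.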

Find the total % of each value in its respective index level [duplicate]

I'm trying to find each value's percentage of the total within its respective index level; however, my current attempt produces NaN values.
df = pd.DataFrame({"one": np.arange(0, 20), "two": np.arange(20, 40)}, index=[np.array([np.zeros(10), np.ones(10)]).flatten(), np.arange(80, 100)])
DataFrame:
one two
0.0 80 0 20
81 1 21
82 2 22
83 3 23
84 4 24
85 5 25
86 6 26
87 7 27
88 8 28
89 9 29
1.0 90 10 30
91 11 31
92 12 32
93 13 33
94 14 34
95 15 35
96 16 36
97 17 37
98 18 38
99 19 39
Aim:
To see the % total of a column 'one' within its respective level.
Current attempted code:
for loc in df.index.get_level_values(0):
    df.loc[loc, 'total'] = df.loc[loc, :] / df.loc[loc, :].sum()

IIUC, use:
df['total'] = df['one'].div(df.groupby(level=0)['one'].transform('sum'))
output:
one two total
0 80 0 20 0.000000
81 1 21 0.022222
82 2 22 0.044444
83 3 23 0.066667
84 4 24 0.088889
85 5 25 0.111111
86 6 26 0.133333
87 7 27 0.155556
88 8 28 0.177778
89 9 29 0.200000
1 90 10 30 0.068966
91 11 31 0.075862
92 12 32 0.082759
93 13 33 0.089655
94 14 34 0.096552
95 15 35 0.103448
96 16 36 0.110345
97 17 37 0.117241
98 18 38 0.124138
99 19 39 0.131034
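A self-contained sketch of the same idea, for readers who want to run it end to end. The index is rebuilt with np.repeat (an assumption made for readability; the question builds it from zeros and ones), and level_totals is an illustrative name.

```python
import numpy as np
import pandas as pd

# Two-level index: outer level 0/1 (ten rows each), inner level 80..99.
index = [np.repeat([0, 1], 10), np.arange(80, 100)]
df = pd.DataFrame({"one": np.arange(20), "two": np.arange(20, 40)}, index=index)

# transform('sum') returns one value per row: the sum of that row's outer group.
level_totals = df.groupby(level=0)["one"].transform("sum")

# Element-wise division gives each value's share of its own group's total.
df["total"] = df["one"] / level_totals

print(df.head(3))
print(df.groupby(level=0)["total"].sum())
```

Since transform keeps the original index, the division aligns row by row and the shares within each outer level add up to 1.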

Venn Diagram for each row in DataFrame

I have a set of data that looks like this:
Exp # ID Q1 Q2 All IDs Q1 unique Q2 unique Overlap Unnamed: 8
0 1 58 32 58 58 14 40 18 18
1 2 55 38 44 55 28 34 10 10
2 4 95 69 83 95 37 51 32 32
3 5 92 68 84 92 31 47 37 37
4 6 0 0 0 0 0 0 0 0
5 7 71 52 65 71 27 40 25 25
6 8 84 69 69 84 39 39 30 30
7 10 65 35 63 65 17 45 18 18
8 11 90 72 72 90 39 39 33 33
9 14 88 84 80 88 52 48 32 32
10 17 89 56 75 89 30 49 26 26
11 19 83 56 70 83 32 46 24 24
12 20 94 72 83 93 35 46 37 37
13 21 73 57 56 73 38 37 19 19
For each exp #, I want to make a Venn diagram with the values Q1 Unique, Q2 Unique, and Overlap.
I have tried a couple of things, the below code has gotten me the closest:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib_venn import venn2, venn2_circles
import csv
import pandas as pd
import numpy as np
val_path = r"C:\Users\lawashburn\Documents\DIA\DSD First Pass\20220202_Acquisition\Overlap_Values.csv"
val_tab = pd.read_csv(val_path)
exp_num = val_tab['Exp #']
cols = ['Q1 unique','Q2 unique', 'Overlap']
df = pd.DataFrame()
df ['Exp #'] = exp_num
df['combined'] = val_tab[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
print(df)
exp_no = df['Exp #'].tolist()
combined = df['combined'].tolist()
#combined = [int(i) for i in combined]
print(combined)
for a in exp_no:
    plt.figure(figsize=(4,4))
    plt.title(a)
    for b in combined:
        v = venn2(subsets=(b), set_labels = ('Q1', 'Q2'), set_colors=('purple','skyblue'), alpha=0.7)
        v.get_label_by_id('A').set_text('Q1')
        c = venn2_circles(subsets=(b))
        plt.show()
        plt.savefig(a + 'output.png')
This generates a DataFrame:
Exp # combined
0 1 14,40,18
1 2 28,34,10
2 4 37,51,32
3 5 31,47,37
4 6 0,0,0
5 7 27,40,25
6 8 39,39,30
7 10 17,45,18
8 11 39,39,33
9 14 52,48,32
10 17 30,49,26
11 19 32,46,24
12 20 35,46,37
13 21 38,37,19
However, I think I run into the issue when I export the combined column into a list:
['14,40,18', '28,34,10', '37,51,32', '31,47,37', '0,0,0', '27,40,25', '39,39,30', '17,45,18', '39,39,33', '52,48,32', '30,49,26', '32,46,24', '35,46,37', '38,37,19']
As after this I get the error:
numpy.core._exceptions.UFuncTypeError: ufunc 'absolute' did not contain a loop with signature matching types dtype('<U8') -> dtype('<U8')
How should I proceed from here? I would like 13 separate Venn Diagrams, and to export each of them into a separate .png file.
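One possible way forward, offered as a sketch rather than a tested fix: the error message suggests venn2 is receiving strings such as '14,40,18', so the values can be kept as integers and passed as a (Q1 unique, Q2 unique, Overlap) tuple per experiment. The helper names venn_subsets and save_venn_diagrams, the output file name pattern, and the choice to skip all-zero rows are assumptions. The plotting function needs matplotlib and matplotlib_venn installed and is only defined here, not called.

```python
import pandas as pd

def venn_subsets(val_tab):
    """Map each 'Exp #' to an integer tuple (Q1 unique, Q2 unique, Overlap)."""
    cols = ["Q1 unique", "Q2 unique", "Overlap"]
    return {
        int(exp): tuple(int(v) for v in row)
        for exp, row in zip(val_tab["Exp #"], val_tab[cols].to_numpy())
    }

def save_venn_diagrams(val_tab):
    """Draw one diagram per experiment and save each to its own PNG file."""
    # Imported here so venn_subsets can be used without the plotting packages.
    from matplotlib import pyplot as plt
    from matplotlib_venn import venn2

    for exp, subsets in venn_subsets(val_tab).items():
        if sum(subsets) == 0:
            continue  # assumption: an all-zero row has nothing to draw
        plt.figure(figsize=(4, 4))
        plt.title(f"Exp {exp}")
        venn2(subsets=subsets, set_labels=("Q1", "Q2"),
              set_colors=("purple", "skyblue"), alpha=0.7)
        # Save before closing; the file name pattern is hypothetical.
        plt.savefig(f"{exp}_output.png")
        plt.close()

# Three rows copied from the table in the question.
sample = pd.DataFrame({
    "Exp #": [1, 2, 6],
    "Q1 unique": [14, 28, 0],
    "Q2 unique": [40, 34, 0],
    "Overlap": [18, 10, 0],
})
subsets = venn_subsets(sample)
print(subsets)
```

Calling save_venn_diagrams(val_tab) on the frame read from the CSV would then loop once over the experiments instead of nesting two loops.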

Pandas - merging start/end time ranges with short gaps

Say I have a series of start and end times for a given event:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(1,5,30).cumsum().reshape(-1, 2), columns = ["start", "end"])
start end
0 2 6
1 7 8
2 12 14
3 18 20
4 24 25
5 26 28
6 29 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I'd like to merge time ranges with a gap less than or equal to n, so for n = 1 the result would be:
fn(df, n = 1)
start end
0 2 8
2 12 14
3 18 20
4 24 33
7 35 36
8 39 41
9 44 45
10 48 50
11 53 54
12 58 59
13 62 63
14 65 68
I can't seem to find a way to do this with pandas without iterating and building up the result line-by-line. Is there some simpler way to do this?

You can subtract the shifted end values from the start values, compare the gaps with N to build a mask, create group labels with a cumulative sum, and pass them to groupby to aggregate the minimal start and maximal end:
N = 1
g = df['start'].sub(df['end'].shift())
df = df.groupby(g.gt(N).cumsum()).agg({'start':'min', 'end':'max'})
print (df)
start end
0 2 8
1 12 14
2 18 20
3 24 33
4 35 36
5 39 41
6 44 45
7 48 50
8 53 54
9 58 59
10 62 63
11 65 68
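The same grouping idea can be wrapped in a function that also keeps the original label of the first row of each merged range, as in the desired output in the question. This is a sketch under two assumptions: the rows are sorted by start and the ranges do not overlap. The function name merge_close_ranges is hypothetical, and the data is typed in from the table above instead of being regenerated from the random seed.

```python
import pandas as pd

def merge_close_ranges(df, n):
    """Merge consecutive [start, end] rows whose gap to the previous row is <= n."""
    # Gap between this row's start and the previous row's end (NaN for the first row).
    gap = df["start"] - df["end"].shift()
    # A gap larger than n opens a new group; cumsum turns the flags into group ids.
    group_id = gap.gt(n).cumsum()
    merged = df.groupby(group_id.to_numpy()).agg(
        start=("start", "min"), end=("end", "max")
    )
    # Relabel each merged range with the index label of its first original row.
    merged.index = df.index[~group_id.duplicated().to_numpy()]
    return merged

df = pd.DataFrame({
    "start": [2, 7, 12, 18, 24, 26, 29, 35, 39, 44, 48, 53, 58, 62, 65],
    "end":   [6, 8, 14, 20, 25, 28, 33, 36, 41, 45, 50, 54, 59, 63, 68],
})
result = merge_close_ranges(df, n=1)
print(result)
```

Returning a new frame instead of overwriting df makes it easy to try several values of n on the same input.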
