Delete rows in pandas that match the header - Python
I'm fairly new to pandas and have a question.
I read a table from an HTML site and set my header according to the table on the website.
df = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)
Now I have my dataframe with a matching header, BUT some rows are duplicates of the header, like in the example below.
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG
1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06 253
2 John Tavares, C NYI 82 38 48 86 5 46 1.05 278
...
10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95 264
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG
14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88 268
I know that it's possible to delete duplicate rows with pandas, but is it possible to delete rows that are duplicates of the header or of a specific row?
Hope you can help me out!
You can use boolean indexing:
df = df[df.PLAYER != 'PLAYER']
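As a self-contained illustration (a tiny made-up frame, not the scraped ESPN table), the comparison builds a boolean mask and keeps only the rows whose PLAYER cell is real data:

import pandas as pd

# Made-up frame with a stray header row mixed into the data
df = pd.DataFrame({'PLAYER': ['Jamie Benn, LW', 'PLAYER', 'John Tavares, C'],
                   'PTS': ['87', 'PTS', '86']})
print(df[df.PLAYER != 'PLAYER'])  # the repeated-header row is filtered out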
If you also need to remove the rows with PP in the PLAYER column, use isin, as shown after the printed output below.
Notice: I add [0] to the end of read_html because it returns a list of dataframes, and you need to select the first item of that list:
df = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2', header = 1)[0]
print (df)
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G \
0 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
1 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
2 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
3 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
4 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
5 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
6 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
7 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
8 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
9 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
10 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
11 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
12 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
13 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
14 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
15 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
...
...
mask = df['PLAYER'].isin(['PLAYER', 'PP'])
print (df[~mask])
RK PLAYER TEAM GP G A PTS +/- PIM PTS/G SOG \
0 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06 253
1 2 John Tavares, C NYI 82 38 48 86 5 46 1.05 278
2 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09 237
3 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00 395
4 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99 221
5 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95 153
6 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08 280
7 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97 158
8 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93 226
9 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95 264
12 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92 182
13 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90 279
14 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89 101
15 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88 268
16 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94 203
17 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87 202
18 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85 261
19 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01 212
20 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91 191
21 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87 304
...
...
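One caveat worth adding, as a hedged sketch (toy data, not the ESPN table): because the scraped columns contained strings like 'PLAYER' and 'PP', the surviving values are still object dtype after the junk rows are dropped, and pd.to_numeric converts them back:

import pandas as pd

# Toy frame mimicking the post-cleanup state: numbers stored as strings
df = pd.DataFrame({'PLAYER': ['Jamie Benn, LW', 'John Tavares, C'],
                   'PTS': ['87', '86']})
df['PTS'] = pd.to_numeric(df['PTS'], errors='coerce')  # any leftovers become NaN
print(df.dtypes)  # PTS is now numeric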
Related
How to perform operations with columns from different datasets with different indexation?
The goal

A bit of background, to get familiar with the variables and understand what the problem is: floor, square, matc and volume are tables (dataframes) that all share the column "id" (which simply goes from 1 to 100), so every row is unique; floor and square also share the column "room_name"; volume is generally equivalent to floor, except that all rows whose rooms ("room_name") have no values in the "square" column of the square dataframe were dropped. This implies that some values of "id" are missing.

That done, I needed to create a new column in the volume dataframe, which would consist of the multiplication of one of its own columns with two other columns from the matc and square dataframes.

The problem

This seemingly simple interaction turned out to be quite difficult, because the columns I am working with are of different lengths (except for square and matc, which match) and I need to align them by "id". To make matters worse, when called directly as volume['coefLoosening'] (please note that coefLoosening does not originate from floor and is added after the table is created), it returns a series with its own index and no way to relate it to "id".

What I tried

Whilst trying to solve the issue, I came up with this abomination:

volume = volume.merge(pd.DataFrame({"id": matc.loc[matc["id"].isin(volume["id"])]["id"], "tempCoef": volume['coefLoosening'] * matc.loc[matc["id"].isin(volume["id"])]['width'] * square.loc[square["id"].isin(volume["id"])]['square']}), how = "left", on = ["id"])

This, however, misaligns the "id" column completely, somehow creating more rows. For instance, this is what the merged result returns:

index  id     tempCoef
0      1.0    960.430612244898
1      2.0    4665.499999999999
2      NaN    NaN
3      4.0    2425.44652173913
4      5.0    5764.964210526316
5      6.0    55201.68727272727
6      NaN    NaN
7      NaN    NaN
8      NaN    NaN
9      10.0   1780.7208791208789
10     11.0   6075.385074626865
11     12.0   10400.94
12     13.0   31.378285714285713
13     NaN    NaN
14     NaN    NaN
15     NaN    NaN
16     17.0   10505.431451612903
17     18.0   1208.994845360825
18     NaN    NaN
19     NaN    NaN
20     21.0   568.8900000000001
21     22.0   4275.416470588235
22     NaN    NaN
23     NaN    NaN
24     25.0   547.04
25     26.0   2090.666111111111
26     27.0   2096.88406779661
27     NaN    NaN
28     29.0   8324.566547619048
29     NaN    NaN
30     NaN    NaN
31     NaN    NaN
32     33.0   2459.8314736842103
33     34.0   2177.778461538461
34     35.0   166.1257142857143
35     36.0   1866.8492307692304
36     37.0   3598.1470588235293
37     38.0   21821.709411764703
38     NaN    NaN
39     40.0   2999.248
40     41.0   980.3136
41     42.0   2641.3503947368426
42     NaN    NaN
43     44.0   25829.878148148146
44     45.0   649.3632
45     46.0   10895.386666666667
46     NaN    NaN
47     NaN    NaN
48     49.0   825.9879310344828
49     50.0   15951.941666666671
50     51.0   2614.9343434343436
51     52.0   2462.30625
52     NaN    NaN
53     NaN    NaN
54     55.0   1366.8287671232877
55     56.0   307.38
56     57.0   11601.975
57     58.0   1002.5415730337081
58     59.0   2493.4532432432434
59     60.0   981.7482608695652
61     62.0   NaN
63     64.0   NaN
65     66.0   NaN
67     68.0   NaN
73     74.0   NaN
75     76.0   NaN
76     77.0   NaN
77     78.0   NaN
78     79.0   NaN
80     81.0   NaN
82     83.0   NaN
84     85.0   NaN
88     89.0   NaN
89     90.0   NaN
90     91.0   NaN
92     93.0   NaN
94     95.0   NaN
95     96.0   NaN
97     98.0   NaN
98     99.0   NaN
99     100.0  NaN

For clarity: none of the columns used in the operation contain NaNs themselves.
This is what volume["coefLoosening"] returns:

0     1.020408
1     1.515152
2     2.000000
3     4.347826
4     5.263158
5     9.090909
6     1.162791
7     1.149425
8     1.851852
9     1.098901
10    1.492537
11    2.083333
12    1.428571
13    1.010101
14    1.562500
15    3.448276
16    1.612903
17    1.030928
18    33.333333
19    1.000000
20    1.123596
21    1.960784
22    2.127660
23    2.857143
24    1.369863
25    1.111111
26    1.694915
27    1.492537
28    1.190476
29    1.818182
30    1.612903
31    12.500000
32    1.052632
33    3.846154
34    2.040816
35    1.098901
36    2.941176
37    2.941176
38    2.857143
39    1.111111
40    1.333333
41    1.315789
42    3.703704
43    3.703704
44    2.000000
45    33.333333
46    12.500000
47    1.149425
48    1.724138
49    4.166667
50    1.010101
51    1.041667
52    1.162791
53    3.225806
54    1.369863
55    1.666667
56    4.545455
57    1.123596
58    1.351351
59    2.173913

And finally, this is what volume["id"] returns (to compare to the result of the «abomination»):

0     1
1     2
2     4
3     5
4     6
5     10
6     11
7     12
8     13
9     17
10    18
11    21
12    22
13    25
14    26
15    27
16    29
17    33
18    34
19    35
20    36
21    37
22    38
23    40
24    41
25    42
26    44
27    45
28    46
29    49
30    50
31    51
32    52
33    55
34    56
35    57
36    58
37    59
38    60
39    62
40    64
41    66
42    68
43    74
44    76
45    77
46    78
47    79
48    81
49    83
50    85
51    89
52    90
53    91
54    93
55    95
56    96
57    98
58    99
59    100

Some thoughts

I believe part of the problem is how pandas returns columns (as series with their own default indexation), and I don't know how to work around that. Another source of the problem might be the way the .loc[] method returns its result. In the case of matc.loc[matc["id"].isin(volume["id"])]['width'] it is:

0     15.98
1     36.12
3     32.19
4     18.54
5     98.96
9     64.56
10    58.20
11    55.08
12    3.84
16    77.31
17    15.25
20    63.21
21    76.32
24    10.52
25    54.65
26    95.46
28    79.67
32    57.01
33    27.54
34    7.36
35    36.44
36    23.64
37    78.98
39    92.19
40    31.26
41    61.71
43    70.07
44    10.91
45    4.24
48    7.35
49    46.70
50    97.69
51    32.03
54    13.50
55    42.30
56    94.71
57    37.49
58    57.86
59    50.29
61    18.18
63    88.26
65    4.28
67    28.89
73    4.05
75    22.37
76    52.20
77    98.29
78    72.98
80    6.07
82    35.80
84    64.16
88    23.60
89    45.05
90    21.14
92    31.21
94    46.04
95    7.15
97    27.70
98    31.93
99    79.62

which is shifted by -1, and I don't see a way to change this manually. So, any ideas? Maybe there is an already-answered analogue of this question (I tried to search before asking, but found nothing)?
Data Minimal columns of tables required to replicate this (because stack overflow does not allow files to be uploaded) volume: index,id,room_name,coefLoosening 0,1,6,1.0204081632653061 1,2,7,1.5151515151515151 2,4,3,2.0 3,5,7,4.3478260869565215 4,6,4,5.2631578947368425 5,10,7,9.090909090909092 6,11,5,1.1627906976744187 7,12,4,1.1494252873563218 8,13,1,1.8518518518518516 9,17,3,1.0989010989010988 10,18,3,1.4925373134328357 11,21,3,2.0833333333333335 12,22,7,1.4285714285714286 13,25,3,1.0101010101010102 14,26,6,1.5625 15,27,6,3.4482758620689657 16,29,4,1.6129032258064517 17,33,2,1.0309278350515465 18,34,2,33.333333333333336 19,35,5,1.0 20,36,4,1.1235955056179776 21,37,2,1.9607843137254901 22,38,6,2.127659574468085 23,40,5,2.857142857142857 24,41,6,1.36986301369863 25,42,3,1.1111111111111112 26,44,2,1.6949152542372883 27,45,4,1.4925373134328357 28,46,2,1.1904761904761905 29,49,5,1.8181818181818181 30,50,4,1.6129032258064517 31,51,2,12.5 32,52,3,1.0526315789473684 33,55,6,3.846153846153846 34,56,5,2.0408163265306123 35,57,5,1.0989010989010988 36,58,4,2.941176470588235 37,59,5,2.941176470588235 38,60,5,2.857142857142857 39,62,7,1.1111111111111112 40,64,7,1.3333333333333333 41,66,7,1.3157894736842106 42,68,3,3.7037037037037033 43,74,5,3.7037037037037033 44,76,4,2.0 45,77,3,33.333333333333336 46,78,4,12.5 47,79,5,1.1494252873563218 48,81,5,1.7241379310344829 49,83,4,4.166666666666667 50,85,2,1.0101010101010102 51,89,4,1.0416666666666667 52,90,1,1.1627906976744187 53,91,2,3.2258064516129035 54,93,2,1.36986301369863 55,95,1,1.6666666666666667 56,96,4,4.545454545454546 57,98,7,1.1235955056179776 58,99,7,1.3513513513513513 59,100,5,2.1739130434782608 matc: index,id,width 0,1,15.98 1,2,36.12 2,3,63.41 3,4,32.19 4,5,18.54 5,6,98.96 6,7,5.65 7,8,97.42 8,9,50.88 9,10,64.56 10,11,58.2 11,12,55.08 12,13,3.84 13,14,75.87 14,15,96.51 15,16,42.08 16,17,77.31 17,18,15.25 18,19,81.43 19,20,98.71 20,21,63.21 21,22,76.32 22,23,22.59 23,24,30.79 24,25,10.52 25,26,54.65 26,27,95.46 27,28,49.93 28,29,79.67 29,30,45.0 30,31,59.14 31,32,62.25 32,33,57.01 33,34,27.54 34,35,7.36 35,36,36.44 36,37,23.64 37,38,78.98 38,39,47.8 39,40,92.19 40,41,31.26 41,42,61.71 42,43,93.11 43,44,70.07 44,45,10.91 45,46,4.24 46,47,35.39 47,48,99.1 48,49,7.35 49,50,46.7 50,51,97.69 51,52,32.03 52,53,48.61 53,54,33.44 54,55,13.5 55,56,42.3 56,57,94.71 57,58,37.49 58,59,57.86 59,60,50.29 60,61,77.98 61,62,18.18 62,63,3.42 63,64,88.26 64,65,48.66 65,66,4.28 66,67,20.78 67,68,28.89 68,69,27.17 69,70,57.48 70,71,59.07 71,72,12.63 72,73,22.06 73,74,4.05 74,75,22.3 75,76,22.37 76,77,52.2 77,78,98.29 78,79,72.98 79,80,49.37 80,81,6.07 81,82,28.85 82,83,35.8 83,84,66.74 84,85,64.16 85,86,33.64 86,87,66.36 87,88,34.51 88,89,23.6 89,90,45.05 90,91,21.14 91,92,97.27 92,93,31.21 93,94,13.04 94,95,46.04 95,96,7.15 96,97,47.87 97,98,27.7 98,99,31.93 99,100,79.62 square: index,id,room_name,square 0,1,5,58.9 1,2,3,85.25 2,3,5,90.39 3,4,3,17.33 4,5,2,59.08 5,6,4,61.36 6,7,2,29.02 7,8,2,59.63 8,9,6,98.31 9,10,4,25.1 10,11,3,69.94 11,12,7,90.64 12,13,4,5.72 13,14,6,29.96 14,15,4,59.06 15,16,1,41.85 16,17,7,84.25 17,18,4,76.9 18,19,1,17.2 19,20,4,60.9 20,21,1,8.01 21,22,2,28.57 22,23,1,65.07 23,24,1,20.24 24,25,7,37.96 25,26,7,34.43 26,27,3,12.96 27,28,6,80.96 28,29,5,87.77 29,30,2,95.67 30,31,1,10.4 31,32,1,30.96 32,33,6,40.99 33,34,7,20.56 34,35,5,11.06 35,36,4,46.62 36,37,3,51.75 37,38,4,93.94 38,39,5,62.64 39,40,6,29.28 40,41,3,23.52 41,42,6,32.53 42,43,1,33.3 43,44,3,99.53 44,45,5,29.76 45,46,7,77.09 46,47,1,71.31 47,48,2,59.22 48,49,1,65.18 49,50,7,81.98 
50,51,7,26.5 51,52,3,73.8 52,53,6,78.52 53,54,6,69.67 54,55,6,73.91 55,56,6,4.36 56,57,5,26.95 57,58,2,23.8 58,59,2,31.89 59,60,1,8.98 60,61,1,88.76 61,62,5,88.75 62,63,4,44.94 63,64,4,81.13 64,65,5,48.39 65,66,3,55.63 66,67,7,46.28 67,68,3,40.85 68,69,7,54.37 69,70,3,14.01 70,71,6,20.13 71,72,2,90.67 72,73,3,4.28 73,74,4,56.18 74,75,3,74.8 75,76,5,10.34 76,77,6,15.94 77,78,2,29.4 78,79,6,60.8 79,80,3,13.05 80,81,3,49.46 81,82,1,75.76 82,83,1,84.27 83,84,5,76.36 84,85,3,75.98 85,86,7,77.81 86,87,2,56.34 87,88,1,43.93 88,89,5,30.64 89,90,5,55.78 90,91,5,88.26 91,92,6,15.11 92,93,1,20.64 93,94,2,5.08 94,95,1,82.31 95,96,4,76.92 96,97,1,53.47 97,98,2,2.7 98,99,7,77.12 99,100,4,29.43 floor: index,id,room_name 0,1,6 1,2,7 2,3,12 3,4,3 4,5,7 5,6,4 6,7,8 7,8,11 8,9,10 9,10,7 10,11,5 11,12,4 12,13,1 13,14,11 14,15,12 15,16,9 16,17,3 17,18,3 18,19,9 19,20,12 20,21,3 21,22,7 22,23,8 23,24,12 24,25,3 25,26,6 26,27,6 27,28,10 28,29,4 29,30,10 30,31,9 31,32,11 32,33,2 33,34,2 34,35,5 35,36,4 36,37,2 37,38,6 38,39,11 39,40,5 40,41,6 41,42,3 42,43,11 43,44,2 44,45,4 45,46,2 46,47,9 47,48,12 48,49,5 49,50,4 50,51,2 51,52,3 52,53,9 53,54,10 54,55,6 55,56,5 56,57,5 57,58,4 58,59,5 59,60,5 60,61,12 61,62,7 62,63,12 63,64,7 64,65,11 65,66,7 66,67,12 67,68,3 68,69,8 69,70,11 70,71,12 71,72,8 72,73,12 73,74,5 74,75,11 75,76,4 76,77,3 77,78,4 78,79,5 79,80,12 80,81,5 81,82,12 82,83,4 83,84,8 84,85,2 85,86,8 86,87,8 87,88,9 88,89,4 89,90,1 90,91,2 91,92,9 92,93,2 93,94,12 94,95,1 95,96,4 96,97,8 97,98,7 98,99,7 99,100,5
IIUC, you overcomplicated things. The whole point of merging on id is that you don't need to filter the other dataframes beforehand on id with loc and isin like you tried to do; merge does that for you. You can multiply square and width on square_df (matc_df would also work, since they have the same length and ids). Then merge this new column into volume_df (which filters the multiplied result to only the ids found in volume_df) and multiply it again:

square_df['square*width'] = square_df['square'] * matc_df['width']
df = volume_df.merge(square_df[['id', 'square*width']], on='id', how='left')
df['result'] = df['coefLoosening'] * df['square*width']

Output df:

    id  room_name  coefLoosening  square*width  result
0   1   6          1.020408       941.2220      960.430612
1   2   7          1.515152       3079.2300     4665.500000
2   4   3          2.000000       557.8527      1115.705400
3   5   7          4.347826       1095.3432     4762.361739
4   6   4          5.263158       6072.1856     31958.871579
5   10  7          9.090909       1620.4560     14731.418182
6   11  5          1.162791       4070.5080     4733.148837
7   12  4          1.149425       4992.4512     5738.449655
8   13  1          1.851852       21.9648       40.675556
9   17  3          1.098901       6513.3675     7157.546703
10  18  3          1.492537       1172.7250     1750.335821
11  21  3          2.083333       506.3121      1054.816875
12  22  7          1.428571       2180.4624     3114.946286
13  25  3          1.010101       399.3392      403.372929
14  26  6          1.562500       1881.5995     2939.999219
15  27  6          3.448276       1237.1616     4266.074483
16  29  4          1.612903       6992.6359     11278.445000
17  33  2          1.030928       2336.8399     2409.113299
18  34  2          33.333333      566.2224      18874.080000
19  35  5          1.000000       81.4016       81.401600
20  36  4          1.123596       1698.8328     1908.800899
21  37  2          1.960784       1223.3700     2398.764706
22  38  6          2.127660       7419.3812     15785.917447
23  40  5          2.857143       2699.3232     7712.352000
24  41  6          1.369863       735.2352      1007.171507
25  42  3          1.111111       2007.4263     2230.473667
26  44  2          1.694915       6974.0671     11820.452712
27  45  4          1.492537       324.6816      484.599403
28  46  2          1.190476       326.8616      389.120952
29  49  5          1.818182       479.0730      871.041818
30  50  4          1.612903       3828.4660     6174.945161
31  51  2          12.500000      2588.7850     32359.812500
32  52  3          1.052632       2363.8140     2488.225263
33  55  6          3.846154       997.7850      3837.634615
34  56  5          2.040816       184.4280      376.383673
35  57  5          1.098901       2552.4345     2804.873077
36  58  4          2.941176       892.2620      2624.300000
37  59  5          2.941176       1845.1554     5426.927647
38  60  5          2.857143       451.6042      1290.297714
39  62  7          1.111111       1613.4750     1792.750000
40  64  7          1.333333       7160.5338     9547.378400
41  66  7          1.315789       238.0964      313.284737
42  68  3          3.703704       1180.1565     4370.950000
43  74  5          3.703704       227.5290      842.700000
44  76  4          2.000000       231.3058      462.611600
45  77  3          33.333333      832.0680      27735.600000
46  78  4          12.500000      2889.7260     36121.575000
47  79  5          1.149425       4437.1840     5100.211494
48  81  5          1.724138       300.2222      517.624483
49  83  4          4.166667       3016.8660     12570.275000
50  85  2          1.010101       4874.8768     4924.117980
51  89  4          1.041667       723.1040      753.233333
52  90  1          1.162791       2512.8890     2921.963953
53  91  2          3.225806       1865.8164     6018.762581
54  93  2          1.369863       644.1744      882.430685
55  95  1          1.666667       3789.5524     6315.920667
56  96  4          4.545455       549.9780      2499.900000
57  98  7          1.123596       74.7900       84.033708
58  99  7          1.351351       2462.4416     3327.623784
59  100 5          2.173913       2343.2166     5093.949130
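To make the pattern easy to test without the asker's files, here is a minimal sketch of the same merge-then-multiply approach with tiny made-up frames (three ids, one of them deliberately missing from volume_df):

import pandas as pd

square_df = pd.DataFrame({'id': [1, 2, 3], 'square': [10.0, 20.0, 30.0]})
matc_df = pd.DataFrame({'id': [1, 2, 3], 'width': [2.0, 3.0, 4.0]})
volume_df = pd.DataFrame({'id': [1, 3], 'coefLoosening': [1.5, 2.0]})  # id 2 absent

# same length and id order, so plain column multiplication lines up
square_df['square*width'] = square_df['square'] * matc_df['width']

# merge aligns on id, so only the ids present in volume_df are picked up
df = volume_df.merge(square_df[['id', 'square*width']], on='id', how='left')
df['result'] = df['coefLoosening'] * df['square*width']
print(df)
#    id  coefLoosening  square*width  result
# 0   1            1.5          20.0    30.0
# 1   3            2.0         120.0   240.0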
Not able to view CSV from Python Webscrape
I am new to Python and am doing a web scraping tutorial. I am having trouble getting my CSV file into the appropriate folder. Basically, I am not able to view the resulting CSV. Does anyone have a solution to this problem?

import pandas as pd
import re
from bs4 import BeautifulSoup
import requests

#Pulling in website source code#
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs= {'class':re.compile('row-player-10-')})
for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]

    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns

    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)

print(final_dataframe)
final_dataframe.to_csv(r'C\Users\19794\OneDrive\Desktop\Coding Projects', index = False, sep =',', encoding='utf-8')
I've checked your code and found one issue, here:

for players in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in players.findall('td')]
    ...

You have to use this instead (players changed to player inside the loop, and a file name with a .csv extension added to the path):

for player in players:
    ##Pulling stats for each players
    stats = [stat.get_text() for stat in player.findall('td')]

    ##Create a data frame for the single player stats
    temp.df = pd.DataFrame(stats).transpose()
    temp.df = columns

    ##Join single players stats with the overall dataset
    final_dataframe = pd.concat([final_df,temp_df], ignore_index=True)

print(final_dataframe)
final_dataframe.to_csv(r'C:\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index = False, sep =',', encoding='utf-8')
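A hedged side note on the path itself: building the destination with pathlib and creating the folder first makes the failure mode easier to spot, since to_csv can then never point at a bare directory. The folder is the asker's; the file name result.csv is just the one suggested above, and final_dataframe is assumed to exist from the snippet before this:

from pathlib import Path

out_dir = Path(r'C:\Users\19794\OneDrive\Desktop\Coding Projects')
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
final_dataframe.to_csv(out_dir / 'result.csv', index=False, encoding='utf-8')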
A few issues:

1. As stated in the previous solution, your for loop needs to change to for player in players: — you can't reuse the same variable as the one you are looping through.
2. You shouldn't use . in your variable names, as in temp.df; that indicates attribute access. Use an underscore instead: temp_df.
3. You never define final_df, then try to use it in your pd.concat().
4. You never define columns, then try to use it (and temp.df = columns would overwrite your temp_df as well). What you want instead is temp_df.columns = columns, but note you still need to define columns.
5. Your find_all() for the players is incorrect in that you're searching for a class that contains row-player-10-. There is no class with that. It is row player-10. Very subtle difference, but it's the difference between returning no elements and 50 elements.
6. stats = [stat.get_text() for stat in player.findall('td')] again needs to reference player from the for loop, as mentioned in 1. In fact there are a few syntax issues in there to change to actually pull out the text, so it should be [stat.text for stat in player.find_all('td')].
7. You pd.concat the temp_df onto a final_df within your loop. You can do that (provided you create an initial final_dataframe or final_df — you use two different variable names, and I'm not sure which you really wanted), but it will repeat the headers/column names and require an extra step. What I would rather do is store each temp_df in a list, and only after the loop has gone through all the players, concat the list of dataframes into the final one.

So here is the full code:

import pandas as pd
import re
from bs4 import BeautifulSoup
import requests

#Pulling in website source code#
url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

#Pulling in player rows
##Identify Player Rows
players = soup.find_all('tr', attrs= {'class':re.compile('.*row player-10-.*')})
columns = soup.find('tr', {'class':'colhead'})
columns = [x.text for x in columns.find_all('td')]

#Initialize a list of dataframes
final_df_list = []

# Loop through the players
for player in players:
    ##Pulling stats for each players
    stats = [stat.text for stat in player.find_all('td')]

    ##Create a data frame for the single player stats
    temp_df = pd.DataFrame(stats).transpose()
    temp_df.columns = columns

    #Put temp_df in a list of dataframes
    final_df_list.append(temp_df)

##Join your list of single players stats
final_dataframe = pd.concat(final_df_list, ignore_index=True)
print(final_dataframe)
final_dataframe.to_csv(r'C:\Users\19794\OneDrive\Desktop\Coding Projects\result.csv', index = False, sep =',', encoding='utf-8')

Output:

print(final_dataframe)
        PLAYER                 YRS  G   AB   R   H  ... HR RBI BB SO SB CS   BA
0   1   J.D. Martinez          11   54  211  38  74 ...  8  28 24 55  0  0 .351
1   2   Paul Goldschmidt       11   62  236  47  82 ... 16  56 35 50  3  0 .347
2   3   Xander Bogaerts         9   62  232  39  77 ...  6  31 23 50  3  0 .332
3   4   Rafael Devers           5   63  258  53  85 ... 16  40 18 49  1  0 .329
4   5   Manny Machado          10   63  244  46  80 ... 11  43 29 46  7  1 .328
5   6   Jeff McNeil             4   61  216  30  70 ...  4  32 16 27  2  0 .324
6   7   Ty France               3   63  249  29  79 ... 10  41 18 40  0  0 .317
7   8   Bryce Harper           10   58  225  46  71 ... 15  46 24 48  7  2 .316
8   9   Yordan Alvarez          3   57  205  39  64 ... 17  45 31 38  0  1 .312
9   10  Aaron Judge             6   61  232  53  72 ... 25  49 31 66  4  0 .310
10  11  Jose Ramirez            9   59  222  40  68 ... 16  62 34 19 11  3 .306
11  12  Andrew Benintendi       6   61  226  23  68 ...  2  22 24 37  0  0 .301
12  13  Michael Brantley       13   55  207  23  62 ...  4  21 28 24  1  1 .300
13  14  Trea Turner             7   62  242  32  72 ...  8  47 21 48 13  2 .298
14  15  J.P. Crawford           5   59  216  28  64 ...  5  16 28 37  3  1 .296
15  16  Dansby Swanson          6   64  234  39  69 ...  9  37 23 70  9  2 .295
16  17  Mike Trout             11   57  201  44  59 ... 18  38 30 64  0  0 .294
17      Josh Bell               6   65  235  33  69 ...  8  39 28 37  0  1 .294
18  19  Santiago Espinal        2   63  219  25  64 ...  5  31 18 40  3  2 .292
19  20  Trey Mancini            5   58  217  25  63 ...  6  25 24 47  0  0 .290
20  21  Austin Hays             4   60  228  33  66 ...  9  37 18 41  1  3 .289
21  22  Eric Hosmer            11   59  222  23  64 ...  4  29 22 38  0  0 .288
22  23  Freddie Freeman        12   62  241  40  69 ...  5  34 32 43  6  0 .286
23  24  C.J. Cron               8   64  249  36  71 ... 14  44 16 74  0  0 .285
24      Tommy Edman             3   63  246  52  70 ...  7  26 26 45 15  2 .285
25  26  Starling Marte         10   54  222  40  63 ...  7  34 10 45  8  5 .284
26  27  Ian Happ                5   61  209  30  59 ...  7  31 34 50  5  1 .282
27  28  Pete Alonso             3   64  239  41  67 ... 18  59 26 56  2  1 .280
28  29  Lourdes Gurriel Jr.     4   58  206  21  57 ...  3  25 15 41  2  1 .277
29  30  Nathaniel Lowe          3   58  217  25  60 ...  8  24 15 57  1  1 .276
30  31  Mookie Betts            8   60  245  53  67 ... 17  40 27 47  6  1 .273
31  32  Jose Abreu              8   59  224  34  61 ...  9  30 33 42  0  0 .272
32      Amed Rosario            5   53  217  31  59 ...  1  16 10 31  7  1 .272
33      Ke'Bryan Hayes          2   57  213  26  58 ...  2  22 26 53  7  3 .272
34  35  Nolan Arenado           9   61  229  28  62 ... 11  41 25 31  0  2 .271
35      George Springer         8   58  218  39  59 ... 12  33 20 51  4  1 .271
36  37  Ryan Mountcastle        2   53  211  28  57 ... 12  35 11 57  2  0 .270
37      Vladimir Guerrero Jr.   3   62  233  34  63 ... 16  39 27 45  0  1 .270
38  39  Cesar Hernandez         9   65  271  37  73 ...  0  16 17 55  2  2 .269
39      Ketel Marte             7   61  223  33  60 ...  4  22 22 45  4  0 .269
40      Connor Joe              2   60  238  32  64 ...  5  16 32 52  3  2 .269
41  42  Brandon Nimmo           6   57  209  36  56 ...  4  21 27 44  0  1 .268
42      Thairo Estrada          3   59  205  34  55 ...  4  26 14 31  9  1 .268
43  44  Shohei Ohtani           4   63  243  42  64 ... 13  37 24 67  7  5 .263
44  45  Randy Arozarena         3   61  233  30  61 ...  7  31 14 58 12  5 .262
45  46  Nelson Cruz            17   60  222  29  58 ...  7  36 25 50  2  0 .261
46      Hunter Dozier           5   55  203  25  53 ...  6  21 15 50  1  2 .261
47  48  Kyle Tucker             4   58  204  24  53 ... 12  39 31 41 11  1 .260
48      Bo Bichette             3   63  265  35  69 ... 10  33 17 65  4  3 .260
49  50  Charlie Blackmon       11   57  232  29  60 ... 10  33 17 41  2  1 .259

[50 rows x 16 columns]

Lastly, tables are a great way to learn how to use BeautifulSoup because of their structure, but I do want to throw out there that pandas can parse <table> tags for you with less code:

import pandas as pd

url = 'https://www.espn.com/mlb/history/leaders/_/breakdown/season/year/2022'
final_dataframe = pd.read_html(url, header=1)[0]
final_dataframe = final_dataframe[final_dataframe['PLAYER'].ne('PLAYER')]
Adding columns to dataframe that depends on a existing column and its qcut bin values
I have a dataframe that looks like below:

dataframe1 =

Ind  ID   T1   T2   T3   T4   T5
0    Q1   100  121  43   56   78
1    Q2   23   43   56   76   87
2    Q3   345  56   76   78   98
3    Q4   21   32   34   45   56
4    Q5   45   654  567  78   90
5    Q6   123  32   45   56   67
6    Q7   23   24   25   26   27
7    Q8   32   33   34   35   36
8    Q9   123  124  125  126  127
9    Q10  56   56   56   56   56
10   Q11  76   77   78   79   80
11   Q12  87   87   87   87   87
12   Q13  90   90   90   90   90
13   Q14  43   44   45   46   47
14   Q15  23   24   25   26   27
15   Q16  51   52   53   54   55
16   Q17  67   67   67   67   67
17   Q18  87   87   87   87   87
18   Q19  90   91   92   93   94
19   Q20  23   24   25   26   27

Now, I have applied qcut to column 'T1' to get bins by using:

pd.qcut(data_data['T1'].rank(method = 'first'),10,labels = list(range(1,11)))

which gives me:

0     9
1     1
2     10
3     1
4     4
5     9
6     2
7     3
8     10
9     5
10    6
11    7
12    8
13    4
14    2
15    5
16    6
17    7
18    8
19    3

Now, I want to get the mean of all bin 5 values, so that I can add another column to dataframe1 named 'T1_FOLD' that is simply ((individual 'T1' values) - (the mean of the bin 5 values)). How can I do that?
Filter column T1 with DataFrame.loc and boolean indexing, get the mean, and subtract it from column T1:

s = pd.qcut(data_data['T1'].rank(method = 'first'),10,labels = list(range(1,11)))
data_data['T1_FOLD'] = data_data['T1'] - data_data.loc[s == 5, 'T1'].mean()
print (data_data)

    ID   T1   T2   T3   T4   T5   T1_FOLD
0   Q1   100  121  43   56   78   46.5
1   Q2   23   43   56   76   87   -30.5
2   Q3   345  56   76   78   98   291.5
3   Q4   21   32   34   45   56   -32.5
4   Q5   45   654  567  78   90   -8.5
5   Q6   123  32   45   56   67   69.5
6   Q7   23   24   25   26   27   -30.5
7   Q8   32   33   34   35   36   -21.5
8   Q9   123  124  125  126  127  69.5
9   Q10  56   56   56   56   56   2.5
10  Q11  76   77   78   79   80   22.5
11  Q12  87   87   87   87   87   33.5
12  Q13  90   90   90   90   90   36.5
13  Q14  43   44   45   46   47   -10.5
14  Q15  23   24   25   26   27   -30.5
15  Q16  51   52   53   54   55   -2.5
16  Q17  67   67   67   67   67   13.5
17  Q18  87   87   87   87   87   33.5
18  Q19  90   91   92   93   94   36.5
19  Q20  23   24   25   26   27   -30.5
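If more than one bin's mean is ever needed, groupby on the bin labels computes all of them at once. A small sketch with made-up T1 values (only the column names mirror the question; the data is synthetic):

import numpy as np
import pandas as pd

data_data = pd.DataFrame({'T1': np.arange(1, 21) * 3})  # 3, 6, ..., 60
s = pd.qcut(data_data['T1'].rank(method='first'), 10, labels=list(range(1, 11)))

bin_means = data_data['T1'].groupby(s).mean()  # one mean per bin label
data_data['T1_FOLD'] = data_data['T1'] - bin_means.loc[5]  # subtract bin 5's mean
print(bin_means.loc[5])  # 28.5 for this toy data: the mean of 27 and 30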
compare two data frames and delete columns based on lookup table
I have two data frames:

df1:

   A   B   C   D   E   F
0  63  9   56  23  41  0
1  40  35  69  98  47  45
2  51  95  55  36  10  34
3  25  11  67  83  49  89
4  91  10  43  73  96  95
5  2   47  8   30  46  9
6  37  10  33  8   45  20
7  40  88  6   29  46  79
8  75  87  49  76  0   69
9  92  21  86  91  46  41

df2:

   A  B  C  D  E  F
0  0  0  0  1  1  0

I want to delete columns in df1 based on the values in df2 (a lookup table). Wherever df2 has a 1, I have to delete that column in df1, so my final output should look like:

   A   B   C   F
0  63  9   56  0
1  40  35  69  45
2  51  95  55  34
3  25  11  67  89
4  91  10  43  95
5  2   47  8   9
6  37  10  33  20
7  40  88  6   79
8  75  87  49  69
9  92  21  86  41
Assuming len(df1.columns) == len(df2.columns):

df1.loc[:, ~df2.loc[0].astype(bool).values]

   A   B   C   F
0  63  9   56  0
1  40  35  69  45
2  51  95  55  34
3  25  11  67  89
4  91  10  43  95
5  2   47  8   9
6  37  10  33  20
7  40  88  6   79
8  75  87  49  69
9  92  21  86  41

If the columns aren't the same, but df2 has a subset of the columns in df1, then:

df1.reindex(df2.columns[~df2.loc[0].astype(bool)], axis=1)

Or with drop, similar to @student's method:

df1.drop(df2.columns[df2.loc[0].astype(bool)], axis=1)

   A   B   C   F
0  63  9   56  0
1  40  35  69  45
2  51  95  55  34
3  25  11  67  89
4  91  10  43  95
5  2   47  8   9
6  37  10  33  20
7  40  88  6   79
8  75  87  49  69
9  92  21  86  41
columns can do intersection:

df1[df1.columns.intersection(df2.columns[~df2.iloc[0].astype(bool)])]

Out[354]:
   A   B   C   F
0  63  9   56  0
1  40  35  69  45
2  51  95  55  34
3  25  11  67  89
4  91  10  43  95
5  2   47  8   9
6  37  10  33  20
7  40  88  6   79
8  75  87  49  69
9  92  21  86  41
You can try with drop to drop the columns:

remove_col = df2.columns[(df2 == 1).any()]   # get columns with any value 1
df1.drop(remove_col, axis=1, inplace=True)   # drop the columns in the original dataframe

Or, in one line:

df1.drop(df2.columns[(df2 == 1).any()], axis=1, inplace=True)
The following can be an easily understandable solution:

df1.loc[:, df2.loc[0] != 1]

Output:

   A   B   C   F
0  63  9   56  0
1  40  35  69  45
2  51  95  55  34
3  25  11  67  89
4  91  10  43  95
5  2   47  8   9
6  37  10  33  20
7  40  88  6   79
8  75  87  49  69
9  92  21  86  41

loc can be used for selecting rows or columns with a boolean or conditional lookup: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
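Pulling the answers together, a runnable end-to-end sketch with small made-up frames shaped like the question's (two rows, four columns, drop flags in df2):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})
df2 = pd.DataFrame({'A': [0], 'B': [0], 'C': [1], 'D': [1]})  # 1 means drop

keep = ~df2.loc[0].astype(bool)              # boolean Series per column name
print(df1.loc[:, keep.values])               # positional mask (first answer)
print(df1.drop(columns=df2.columns[~keep]))  # name-based drop (other answers)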
How to apply a function to rows of two pandas DataFrames
There are two pandas DataFrames, say dfx and dfy, of the same shape and with exactly the same column and row indices. I want to apply a function to the corresponding rows of these two DataFrames. In other words, suppose we have a function as follows:

def fun(row_x, row_y):
    ...  # a function of the corresponding rows

Let index be the common index of dfx and dfy. I want to compute in pandas the following list/Series (pseudo-code):

[fun(dfx[i], dfy[i]) for i in index]

With the code below, I make a grouped, two-level-indexed DataFrame, but then I do not know how to apply agg in the proper way:

dfxy = pd.concat({'dfx':dfx, 'dfy':dfy})
dfxy = dfxy.swaplevel(0,1,axis=0).sort_index(level=0)
grouped = dfxy.groupby(level=0)
In [19]: dfx = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('abcde'))
         dfx
Out[19]:
    a   b   c   d   e
0   3  44   8  55  95
1  26   5  18  34  10
2  20  20  91  15   8
3  83   7  50  47  27
4  97  65  10  94  93
5  44   6  70  60   4
6  38  64   8  67  92
7  44  21  42   6  12
8  30  98  34   7  79
9  76   7  14  58   5

In [4]: dfy = pd.DataFrame(data = np.random.randint(0 , 100 , 50).reshape(10 ,-1) , columns=list('fghij'))
        dfy
Out[4]:
    f   g   h   i   j
0  82  48  29  54  78
1   7  31  78  38  30
2  90  91  43   8  40
3  52  88  13  87  39
4  41  88  90  51  91
5  55   4  94  62  98
6  31  23   4  59  93
7  87  12  33  77   0
8  25  99  39  23   1
9   7  50  46  39  66

In [13]: dfxy = pd.concat({'dfx':dfx, 'dfy':dfy} , axis = 1)
         dfxy
Out[13]:
   dfx                     dfy
    a   b   c   d   e       f   g   h   i   j
0  20  76   5  98  38      82  48  29  54  78
1  39  36   9   3  74       7  31  78  38  30
2  43  12  50  72  14      90  91  43   8  40
3  89  41  95  91  86      52  88  13  87  39
4  33  30  55  64  94      41  88  90  51  91
5  89  84  48   1  60      55   4  94  62  98
6  68  40  27  10  63      31  23   4  59  93
7  33  10  86  89  67      87  12  33  77   0
8  56  89   0  70  67      25  99  39  23   1
9  48  58  98  18  24       7  50  46  39  66

def f(x , y):
    return pd.Series(data = [np.mean(x) , np.mean(y)] , index=['x_mean' , 'y_mean'])

In [17]: dfxy.apply( lambda x : f(x['dfx'] , x['dfy']) , axis = 1)
Out[17]:
   x_mean  y_mean
0    47.4    58.2
1    32.2    36.8
2    38.2    54.4
3    80.4    55.8
4    55.2    72.2
5    56.4    62.6
6    41.6    42.0
7    57.0    41.8
8    56.4    37.4
9    49.2    41.6
Could this be what you are looking for?

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: dfx = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1), columns=['index', 'a', 'b', 'c', 'd'])
In [4]: dfy = pd.DataFrame(data=np.random.randint(0,100,50).reshape(10,-1), columns=['index', 'a', 'b', 'c', 'd'])
In [5]: dfy['index'] = dfx['index']
In [6]: print(dfx)
   index   a   b   c   d
0     25  41  46  18  98
1      0  21   9  20  29
2     18  78  63  94  70
3     86  71  71  95  64
4     23  33  19  34  29
5     69  10  91  19  42
6     92  68  60  12  58
7     74  49  22  74   1
8     47  35  56  41  80
9     93  20  44  16  49
In [7]: print(dfy)
   index   a   b   c   d
0     25  28  35  96  89
1      0  44  94  50  43
2     18  18  39  75  45
3     86  18  87  72  88
4     23   2  28  24   4
5     69  53  55  55  40
6     92   0  52  54  91
7     74   8   1  96  59
8     47  74  21   7   7
9     93  42  83  42  60
In [8]: print(dfx.merge(dfy, on='index'))
   index  a_x  b_x  c_x  d_x  a_y  b_y  c_y  d_y
0     25   41   46   18   98   28   35   96   89
1      0   21    9   20   29   44   94   50   43
2     18   78   63   94   70   18   39   75   45
3     86   71   71   95   64   18   87   72   88
4     23   33   19   34   29    2   28   24    4
5     69   10   91   19   42   53   55   55   40
6     92   68   60   12   58    0   52   54   91
7     74   49   22   74    1    8    1   96   59
8     47   35   56   41   80   74   21    7    7
9     93   20   44   16   49   42   83   42   60
In [9]: def my_function(x):
   ...:     return sum(x)
   ...:
In [10]: print(dfx.merge(dfy, on='index').drop('index', axis=1).apply(my_function, axis=1))
0    451
1    310
2    482
3    566
4    173
5    365
6    395
7    310
8    321
9    356
dtype: int64
In [11]: print(pd.DataFrame({'my_function': dfx.merge(dfy, on='index').\
    ...:                     drop('index', axis=1).apply(my_function, axis=1),
    ...:                     'index': dfx['index']}))
   index  my_function
0     25          451
1      0          310
2     18          482
3     86          566
4     23          173
5     69          365
6     92          395
7     74          310
8     47          321
9     93          356
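For completeness, the question's pseudo-code can also be taken almost literally: when the two frames share an index, a plain comprehension over that index already yields [fun(row_x, row_y) for ...]. A minimal sketch with tiny made-up frames (fun here is just an illustrative sum of both rows):

import numpy as np
import pandas as pd

dfx = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
dfy = pd.DataFrame(np.arange(6, 12).reshape(3, 2), columns=['a', 'b'])

def fun(row_x, row_y):
    # any row-wise function of the two aligned rows
    return row_x.sum() + row_y.sum()

# shared index, so dfx.loc[i] and dfy.loc[i] are the corresponding rows
result = pd.Series([fun(dfx.loc[i], dfy.loc[i]) for i in dfx.index],
                   index=dfx.index)
print(result)  # 14, 22, 30 for this toy data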