I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.
| yyyy_mm_dd | id1 | id2 | cost |
|------------|-----|------|-------|
| 2020-01-01 | 23 | 7253 | 5003 |
| 2020-01-01 | 23 | 7743 | 30340 |
| 2020-01-02 | 23 | 7253 | 450 |
| 2020-01-02 | 23 | 7743 | 4500 |
| ... | ... | ... | ... |
| 2021-01-01 | 23 | 7253 | 5675 |
| 2021-01-01 | 23 | 134 | 1030 |
| 2021-01-01 | 23 | 3445 | 564 |
| 2021-01-01 | 23 | 4534 | 345 |
| ... | ... | ... | ... |
I would like to calculate (1) the summed cost grouped by quarter and id1, and (2) the growth % compared to the same quarter in the previous year.
I have grouped and calculated the summed cost like so:
grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(
        sf.sum('cost').alias('cost')
    )
)
But I am unsure how to get the growth compared to the previous year. Expected output based on the above sample:
| year_quarter | id1 | cost | cost_growth |
|--------------|-----|------|-------------|
| 202101 | 23 | 7614 | -81 |
It would also be nice to set cost_growth to 0 if the id1 has no rows in the previous year's quarter.
Edit: Below is an attempt to make the comparison but I get an error that there is no attribute prev_value:
grouped_quarterly = (
    df
    .withColumn('year_quarter', sf.year(sf.col('yyyy_mm_dd')) * 100 + sf.quarter(sf.col('yyyy_mm_dd')))
    .groupby('id1', 'year_quarter')
    .agg(
        sf.sum('cost').alias('cost')
    )
)

w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(grouped_quarterly.cost).over(w))
    .withColumn('diff', sf.when(sf.isnull(grouped_quarterly.cost - grouped_quarterly.prev_value), 0).otherwise(grouped_quarterly.cost - grouped_quarterly.prev_value))
)
Edit #2: The window function seems to take the previous quarter, regardless of year. This means my prev_value column is the previous quarter rather than the same quarter from the previous year:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001 | 73 |
| 222 | 202002 | 246 |
| 222 | 202003 | 525 |
| 222 | 202004 | -27 |
| 222 | 202101 | 380 |
w = Window.partitionBy('id1').orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|------|
| 222 | 202001 | 73 | null | 0 |
| 222 | 202002 | 246 | 73 | 173 |
| 222 | 202003 | 525 | 246 | 279 |
| 222 | 202004 | -27 | 525 | -522 |
| 222 | 202101 | 380 | -27 | 407 |
Edit #3: Using the quarter in the partitioning results in a null prev_value for all rows:
grouped_quarterly.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|------|
| 222 | 202001 | 73 |
| 222 | 202002 | 246 |
| 222 | 202003 | 525 |
| 222 | 202004 | -27 |
| 222 | 202101 | 380 |
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), 2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn('diff', sf.when(sf.isnull(sf.col('cost') - sf.col('prev_value')), 0).otherwise(sf.col('cost') - sf.col('prev_value')))
)
growth.where(sf.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff |
|-----|--------------|------|------------|-------|
| 222 | 202001 | 73 | null | 0 |
| 222 | 202002 | 246 | null | 0 |
| 222 | 202003 | 525 | null | 0 |
| 222 | 202004 | -27 | null | 0 |
| 222 | 202101 | 380 | null | 0 |
Try using the quarter in the partitioning as well, so that lag will give you the value in the same quarter last year:
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
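For completeness, here is a minimal sketch of how that window could produce the cost_growth column from the question (assuming `sf` is the `pyspark.sql.functions` alias used above and `grouped_quarterly` is the aggregated frame); the growth % falls back to 0 when there is no row for the same quarter of the previous year:

```python
from pyspark.sql import Window
import pyspark.sql.functions as sf

# Partition by id1 and the last two digits of year_quarter (the quarter),
# so lag() steps back exactly one year within the same quarter.
w = Window.partitionBy(sf.col('id1'), sf.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')

growth = (
    grouped_quarterly
    .withColumn('prev_value', sf.lag(sf.col('cost')).over(w))
    .withColumn(
        'cost_growth',
        sf.when(sf.col('prev_value').isNull() | (sf.col('prev_value') == 0), sf.lit(0))
          .otherwise(sf.round((sf.col('cost') - sf.col('prev_value')) / sf.col('prev_value') * 100, 0))
    )
    .drop('prev_value')
)
```

With the sample data this gives -81 for id1 = 23 in 202101, matching the expected output.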
I ran hyperopt for 5000 iterations and got the following results:
2022-01-10 19:38:31,370 - freqtrade.optimize.hyperopt - INFO - Best result:
1101 trades. Avg profit 0.23%. Total profit 25.48064438 BTC (254.5519Σ%). Avg duration 888.1 mins.
with values:
{ 'roi_p1': 0.011364434095803464,
'roi_p2': 0.04123147845715937,
'roi_p3': 0.10554480985209454,
'roi_t1': 105,
'roi_t2': 47,
'roi_t3': 30,
'rsi-enabled': True,
'rsi-value': 9,
'sell-rsi-enabled': True,
'sell-rsi-value': 94,
'sell-trigger': 'sell-bb_middle1',
'stoploss': -0.42267640639979365,
'trigger': 'bb_lower2'}
2022-01-10 19:38:31,371 - freqtrade.optimize.hyperopt - INFO - ROI table:
{ 0: 0.15814072240505736,
30: 0.05259591255296283,
77: 0.011364434095803464,
182: 0}
Result for strategy BBRSI
================================================== BACKTESTING REPORT =================================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:----------|------------:|---------------:|---------------:|-------------------:|:----------------|---------:|-------:|
| ETH/BTC | 11 | -1.30 | -14.26 | -1.42732928 | 3 days, 4:55:00 | 0 | 1 |
| LUNA/BTC | 17 | 0.60 | 10.22 | 1.02279906 | 15:46:00 | 9 | 0 |
| SAND/BTC | 37 | 0.30 | 11.24 | 1.12513532 | 6:16:00 | 14 | 1 |
| MATIC/BTC | 24 | 0.47 | 11.35 | 1.13644340 | 12:20:00 | 10 | 0 |
| ADA/BTC | 24 | 0.24 | 5.68 | 0.56822170 | 21:05:00 | 5 | 0 |
| BNB/BTC | 11 | -1.09 | -11.96 | -1.19716109 | 3 days, 0:44:00 | 2 | 1 |
| XRP/BTC | 20 | -0.39 | -7.71 | -0.77191523 | 1 day, 5:48:00 | 1 | 1 |
| DOT/BTC | 9 | 0.50 | 4.54 | 0.45457736 | 4 days, 1:13:00 | 4 | 0 |
| SOL/BTC | 19 | -0.38 | -7.16 | -0.71688463 | 22:47:00 | 3 | 1 |
| MANA/BTC | 29 | 0.38 | 11.16 | 1.11753320 | 10:25:00 | 9 | 1 |
| AVAX/BTC | 27 | 0.30 | 8.15 | 0.81561432 | 16:36:00 | 11 | 1 |
| GALA/BTC | 26 | -0.52 | -13.45 | -1.34594702 | 15:48:00 | 9 | 1 |
| LINK/BTC | 21 | 0.27 | 5.68 | 0.56822170 | 1 day, 0:06:00 | 5 | 0 |
| TOTAL | 275 | 0.05 | 13.48 | 1.34930881 | 23:42:00 | 82 | 8 |
================================================== SELL REASON STATS ==================================================
| Sell Reason | Count |
|:--------------|--------:|
| roi | 267 |
| force_sell | 8 |
=============================================== LEFT OPEN TRADES REPORT ===============================================
| pair | buy count | avg profit % | cum profit % | total profit BTC | avg duration | profit | loss |
|:---------|------------:|---------------:|---------------:|-------------------:|:------------------|---------:|-------:|
| ETH/BTC | 1 | -14.26 | -14.26 | -1.42732928 | 32 days, 4:00:00 | 0 | 1 |
| SAND/BTC | 1 | -4.65 | -4.65 | -0.46588544 | 17:00:00 | 0 | 1 |
| BNB/BTC | 1 | -14.23 | -14.23 | -1.42444977 | 31 days, 13:00:00 | 0 | 1 |
| XRP/BTC | 1 | -8.85 | -8.85 | -0.88555957 | 18 days, 4:00:00 | 0 | 1 |
| SOL/BTC | 1 | -10.57 | -10.57 | -1.05781765 | 5 days, 14:00:00 | 0 | 1 |
| MANA/BTC | 1 | -3.17 | -3.17 | -0.31758065 | 17:00:00 | 0 | 1 |
| AVAX/BTC | 1 | -12.58 | -12.58 | -1.25910300 | 7 days, 9:00:00 | 0 | 1 |
| GALA/BTC | 1 | -23.66 | -23.66 | -2.36874608 | 7 days, 12:00:00 | 0 | 1 |
| TOTAL | 8 | -11.50 | -91.97 | -9.20647144 | 12 days, 23:15:00 | 0 | 8 |
I have followed the tutorial accurately, but I don't know what I am doing wrong here.
I'm trying to create a new column in a DataFrame and populate it with values stored in a different DataFrame, by comparing the values of columns that both DataFrames share. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2, filled with the values from the dates column in df1, after matching on the team and week columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
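One straightforward way to do this without any Python-level loops (a sketch, assuming the two DataFrames are named df1 and df2 as above) is a left merge on the two shared key columns:

```python
import pandas as pd

# Build a (team, week) -> dates lookup from df1, dropping duplicate keys,
# then left-merge so every row of df2 picks up the matching date.
dates_lookup = df1[['team', 'week', 'dates']].drop_duplicates(subset=['team', 'week'])
df2 = df2.merge(dates_lookup, on=['team', 'week'], how='left')
```

A left merge keeps every row of df2 and scales to large frames far better than nested loops; rows with no (team, week) match simply get NaN in dates.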
I have a Pandas Series produced by df.column.value_counts().sort_index().
| N Months | Count |
|------|------|
| 0 | 15 |
| 1 | 9 |
| 2 | 78 |
| 3 | 151 |
| 4 | 412 |
| 5 | 181 |
| 6 | 543 |
| 7 | 175 |
| 8 | 409 |
| 9 | 594 |
| 10 | 137 |
| 11 | 202 |
| 12 | 170 |
| 13 | 446 |
| 14 | 29 |
| 15 | 39 |
| 16 | 44 |
| 17 | 253 |
| 18 | 17 |
| 19 | 34 |
| 20 | 18 |
| 21 | 37 |
| 22 | 147 |
| 23 | 12 |
| 24 | 31 |
| 25 | 15 |
| 26 | 117 |
| 27 | 8 |
| 28 | 38 |
| 29 | 23 |
| 30 | 198 |
| 31 | 29 |
| 32 | 122 |
| 33 | 50 |
| 34 | 60 |
| 35 | 357 |
| 36 | 329 |
| 37 | 457 |
| 38 | 609 |
| 39 | 4744 |
| 40 | 1120 |
| 41 | 591 |
| 42 | 328 |
| 43 | 148 |
| 44 | 46 |
| 45 | 10 |
| 46 | 1 |
| 47 | 1 |
| 48 | 7 |
| 50 | 2 |
My desired output is:
| bin | Total |
|-------|--------|
| 0-13 | 3522 |
| 14-26 | 793 |
| 27-50 | 9278 |
I tried df.column.value_counts(bins=3).sort_index() but got
| bin | Total |
|---------------------------------|-------|
| (-0.051000000000000004, 16.667] | 3634 |
| (16.667, 33.333] | 1149 |
| (33.333, 50.0] | 8810 |
I can get the correct result with
a = df.column.value_counts().sort_index()[:14].sum()
b = df.column.value_counts().sort_index()[14:27].sum()
c = df.column.value_counts().sort_index()[27:].sum()
print(a, b, c)
Output: 3522 793 9278
But I am wondering if there is a pandas method that can do what I want. Any advice is very welcome. :-)
You can use pd.cut:
pd.cut(df['N Months'], [0,13, 26, 50], include_lowest=True).value_counts()
Update: you should be able to pass custom bins to value_counts:
df['N Months'].value_counts(bins = [0,13, 26, 50])
Output:
N Months
(-0.001, 13.0] 3522
(13.0, 26.0] 793
(26.0, 50.0] 9278
Name: Count, dtype: int64
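If you also want the human-readable bin labels from the desired output instead of interval notation, pd.cut accepts a labels argument (a sketch, assuming the raw column is df['N Months'] as in the answer above):

```python
import pandas as pd

# Same three bins, but labelled; include_lowest keeps 0 in the first bin.
bins = pd.cut(df['N Months'], bins=[0, 13, 26, 50],
              labels=['0-13', '14-26', '27-50'], include_lowest=True)
print(bins.value_counts().sort_index())
```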
I'm a total newbie at Python and have what I think is a pretty complex problem. I'd like to parse two tables from a website for about 80 URLs; an example of one of the pages: https://www.sports-reference.com/cfb/players/sam-darnold-1.html
I need the first table, "Passing", and the second table, "Rushing and Receiving", from each of the 80 URLs (I know how to get the first and second table). The problem is that I need them for all 80 URLs in one CSV.
This is my code so far and how the data looks:
import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']

urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# scrape elements
dataframes = []
try:
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        # print(soup)
        table = soup.find_all('table')[0]  # Find the first "table" tag in the page
        rows = table.find_all("tr")
        cy_data = []
        for row in rows:
            cells = row.find_all("td")
            cells = cells[0:14]
            cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
        dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
except:
    pass

data = pd.concat(dataframes)
data.to_csv('testcsv3.csv', sep=',')
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| 1 | | | | | | | | | | | | | | | |
| 2 | | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 |
| 3 | | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 |
| 4 | | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 |
| 5 | | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 6 | | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 |
| 7 | | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 |
| 8 | | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 1 | | | | | | | | | | | | | | | |
| 2 | | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 |
| 3 | | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 |
| 4 | | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
| 5 | | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 |
| 6 | | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 |
| 7 | | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
And this is how I'd like the data to look. Note that the player name is missing from each grouping (ideally it can be derived from the URL), and I've added the second table, which I need help appending:
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate | School | Conf | Class | Pos | G | Att | Yds | Avg | TD |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Russell Wilson | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 | North Carolina State | ACC | FR | QB | 11 | 150 | 467 | 6.7 | 3 |
| 3 | Russell Wilson | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 | North Carolina State | ACC | SO | QB | 12 | 129 | 300 | 6.8 | 2 |
| 4 | Russell Wilson | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 | North Carolina State | ACC | JR | QB | 13 | 190 | 560 | 7.1 | 5 |
| 5 | Russell Wilson | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | Big Ten | SR | QB | 14 | 210 | 671 | 7.3 | 7 |
| 6 | Russell Wilson | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 | Overall | | | | | | | | |
| 7 | Russell Wilson | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 | North Carolina State | | | | | | | | |
| 8 | Russell Wilson | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | | | | | | | | |
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Cam Newton | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 | Florida | SEC | FR | QB | 5 | 210 | 456 | 7.1 | 2 |
| 3 | Cam Newton | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 | Florida | SEC | SO | QB | 1 | 212 | 478 | 4.5 | 5 |
| 4 | Cam Newton | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | SEC | JR | QB | 14 | 219 | 481 | 6.7 | 6 |
| 5 | Cam Newton | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 | Overall | | | | | | | 3.4 | 7 |
| 6 | Cam Newton | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 | Florida | | | | | | | | |
| 7 | Cam Newton | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | | | | | | | | |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
So basically I want to append the second table (only the columns mentioned) to the end of the first table and add the player name (read from the URL) to each row.
import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']
COLUMNS2 = ['School', 'Conf', 'Class', 'Pos', 'G', 'Att', 'Yds', 'Avg', 'TD', 'Rec', 'Yds', 'Avg', 'TD', 'Plays', 'Yds', 'Avg', 'TD']

urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# scrape elements
dataframes = []
dataframes2 = []
for url in urls:
    print(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # First table ("Passing")
    table = soup.find_all('table')[0]
    rows = table.find_all("tr")
    cy_data = []
    for row in rows:
        cells = row.find_all("td")
        cells = cells[0:len(COLUMNS)]
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    cy_data = pd.DataFrame(cy_data, columns=COLUMNS)

    # Create a Player column in the first position, derived from the URL
    cy_data.insert(0, 'Player', url)
    cy_data['Player'] = (cy_data['Player'].str.split('/').str[5].str.split('-').str[0].str.title()
                         + ' '
                         + cy_data['Player'].str.split('/').str[5].str.split('-').str[1].str.title())
    dataframes.append(cy_data)

    # Second table ("Rushing and Receiving")
    table2 = soup.find_all('table')[1]
    rows2 = table2.find_all("tr")
    cy_data2 = []
    for row2 in rows2:
        cells2 = row2.find_all("td")
        cells2 = cells2[0:len(COLUMNS2)]  # keep as many cells as there are column names
        cy_data2.append([cell.text for cell in cells2])
    cy_data2 = pd.DataFrame(cy_data2, columns=COLUMNS2)
    cy_data2.insert(0, 'Player', url)
    cy_data2['Player'] = (cy_data2['Player'].str.split('/').str[5].str.split('-').str[0].str.title()
                          + ' '
                          + cy_data2['Player'].str.split('/').str[5].str.split('-').str[1].str.title())
    dataframes2.append(cy_data2)

data = pd.concat(dataframes).reset_index()
data2 = pd.concat(dataframes2).reset_index()
data3 = data.merge(data2, on=['index', 'Player'], suffixes=('', ' '))

# Filter out the empty header rows
data3 = data3.loc[data3['School'].notnull()].drop('index', axis=1)
display(data, data2, data3)
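As an alternative sketch (not part of the answer above), pandas.read_html can often parse these tables directly and avoids the manual cell loop. This assumes the same urls list, that the first two tables on each page are Passing and Rushing & Receiving, and that lxml or html5lib is installed:

```python
import pandas as pd

frames = []
for url in urls:
    tables = pd.read_html(url)            # one DataFrame per <table> on the page
    passing, rushing = tables[0], tables[1]

    # Some sports-reference tables use a two-row header; flatten it if present
    for t in (passing, rushing):
        if isinstance(t.columns, pd.MultiIndex):
            t.columns = [c[-1] for c in t.columns]

    # Derive "First Last" from the URL slug, e.g. .../players/russell-wilson-1.html
    player = url.split('/')[5].rsplit('-', 1)[0].replace('-', ' ').title()

    combined = passing.join(rushing, rsuffix='_rush')   # align season rows side by side
    combined.insert(0, 'Player', player)
    frames.append(combined)

data = pd.concat(frames, ignore_index=True)
data.to_csv('all_players.csv', index=False)
```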
I have a dataframe like below
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 2 | 77 | 105 | 3 | 12 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 1 | 21 | 145 | 1 | 9 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I want to keep an entire group if any of the items in the list item_list = [128, 129, 130] is present in that group, after grouping by 'InvoiceNo' and 'CategoryNo'.
My desired output is below:
+-----------+------------+---------------+------+-----+-------+
| InvoiceNo | CategoryNo | Invoice Value | Item | Qty | Price |
+-----------+------------+---------------+------+-----+-------+
| 1 | 1 | 77 | 128 | 1 | 10 |
| 1 | 1 | 77 | 101 | 1 | 11 |
| 1 | 3 | 77 | 129 | 2 | 10 |
| 2 | 2 | 21 | 130 | 1 | 12 |
+-----------+------------+---------------+------+-----+-------+
I know how to filter a DataFrame using isin(), but I'm not sure how to do it with groupby(). So far I have tried the below:
import pandas as pd
df = pd.read_csv('data.csv')
item_list = [128,129,130]
df.groupby(['InvoiceNo','CategoryNo'])['Item'].isin(item_list)
but nothing happens. Please guide me on how to solve this issue.
You can do something like this:
s = (df['Item'].isin(item_list)
     .groupby([df['InvoiceNo'], df['CategoryNo']])
     .transform('any')
)
df[s]
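Since the question asks specifically about groupby(), the same result can also be written with groupby().filter(), which is a bit more readable but typically slower on large frames; a sketch:

```python
# Keep only the (InvoiceNo, CategoryNo) groups that contain at least one item from item_list
out = df.groupby(['InvoiceNo', 'CategoryNo']).filter(
    lambda g: g['Item'].isin(item_list).any()
)
```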