sportsreference API wrong data issue - python

I'm using the sportsreference API to pull some data, but I'm not sure whether I'm doing something wrong or there is an issue with the API. When I pull the data I need, it always says the away team won, even for games where that is not true.
Code snippet:
from datetime import datetime

import pandas as pd
from sportsreference.nba.boxscore import Boxscores
from sportsreference.nba.boxscore import Boxscore

# Select range of dates to get boxscores from (year, month, day)
games = Boxscores(datetime(2017, 10, 17), datetime(2017, 10, 20))

# Get boxscore abbreviations to get more detailed game boxscores
boxscore_abvs = []
for key in games.games.keys():
    for i in range(len(games.games[key])):
        boxscore_abvs.append(games.games[key][i]['boxscore'])

# Get more detailed boxscores
df = pd.DataFrame()
for abv in boxscore_abvs:
    game_data = Boxscore(abv)
    temp_df = game_data.dataframe
    df = df.append(temp_df)
Sample of the wrong output from df (the Cavs won this game, but the API reports the Celtics):
away_assist_percentage away_assists away_block_percentage away_blocks away_defensive_rating ... losing_name pace winner winning_abbr winning_name
201710170CLE 66.7 24 6.6 4 102.7 ... Cleveland Cavaliers 99.3 Away BOS Boston Celtics

It's a known issue caused by the site changing its HTML layout. Seems like it should be fixed in the 0.6.0 release: https://github.com/roclark/sportsreference/pull/506.
In the meantime, you can install from git to get the fixed version:
pip install --force-reinstall git+https://github.com/roclark/sportsreference.git#master
With that, I get the correct result:
Boxscore('201710170CLE').dataframe[['away_points', 'home_points', 'winning_name']]
# away_points home_points winning_name
# 201710170CLE 99 102 Cleveland Cavaliers
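Until the fix lands, you can also recompute the winner yourself from the point columns, which are scraped correctly (a minimal sketch; winner_check is my own column name, the others are from the dataframe above):
import numpy as np

# derive the winner from the point totals instead of trusting the broken column
df['winner_check'] = np.where(df['home_points'] > df['away_points'], 'Home', 'Away')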

Any way to download data with custom queries from a URL in Python?

I want to download data from the USDA site with custom queries, instead of manually selecting the queries on the website; I'm wondering how to do this more handily in Python. I used requests to access the URL and read the content, but it's not obvious to me how to pass the queries, apply the selection, and download the data as CSV. Does anyone know an easy way to do this in Python?
This is my current attempt, using the URL of the report I want to query:
import io
import requests
import pandas as pd

url = "https://www.marketnews.usda.gov/mnp/ls-report-retail?&repType=summary&portal=ls&category=Retail&species=BEEF&startIndex=1"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
Before reading the requested data into pandas, I need to pass the following queries for the correct data selection:
Category = "Retail"
Report Type = "Item"
Species = "Beef"
Region(s) = "National"
Start Dates = "2020-01-01"
End Date = "2021-02-08"
It's not obvious to me how to pass these queries with the request and then download the filtered data as CSV. Is there an efficient way of doing this in Python? Any thoughts? Thanks.
A few details:
the simplest format is text rather than HTML; I got the URL from the HTML page's text-download link
requests.get(params=) takes a dict, so build the query up and pass it in - no need to assemble the complete URL string yourself
the text is clearly space-delimited, with a minimum of two spaces between columns
import io
import requests
import pandas as pd

url = "https://www.marketnews.usda.gov/mnp/ls-report-retail"
p = {"repType": "summary", "species": "BEEF", "portal": "ls", "category": "Retail", "format": "text"}
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
         Date         Region Feature Rate Outlets Special Rate Activity Index
0  02/05/2021       NATIONAL       69.40%  29,200       20.10%         81,650
1  02/05/2021      NORTHEAST       75.00%   5,500        3.80%         17,520
2  02/05/2021      SOUTHEAST       70.10%   7,400       28.00%         23,980
3  02/05/2021        MIDWEST       75.10%   6,100       19.90%         17,430
4  02/05/2021  SOUTH CENTRAL       57.90%   4,900       26.40%          9,720
5  02/05/2021      NORTHWEST       77.50%   1,300        2.50%          3,150
6  02/05/2021      SOUTHWEST       63.20%   3,800       27.50%          9,360
7  02/05/2021         ALASKA       87.00%     200         .00%            290
8  02/05/2021         HAWAII       46.70%     100         .00%            230
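The date and region filters from the question drop into the same params dict; a sketch, assuming the repDate/endDate/region parameter names that show up in the report URL (see the next answer):
# extend the query with the question's date range and region
p.update({"region": "NATIONAL", "repDate": "01/01/2020", "endDate": "02/08/2021"})
r = requests.get(url, params=p)
df = pd.read_csv(io.StringIO(r.text), sep=r"\s\s+", engine="python")
df.to_csv("beef_retail_summary.csv", index=False)  # filename is arbitrary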
Just put the query data in the URL - it's actually a REST API:
To add more query data, as @mullinscr said, you can change the values on the left and press submit, then look for the query's name in the URL (for example, the start date is called repDate).
If you hover over the Download as XML link, you will also discover that you can specify the download format using format=<format_name>. Parsing the tabular data as XML with pandas might be easier, so I would append format=xml at the end as well.
category = "Retail"
report_type = "Item"
species = "BEEF"
regions = "NATIONAL"
start_date = "01-01-2019"
end_date = "01-01-2021"
# the site uses "/" in dates, URL-encoded as "%2F", so swap in the encoded form
start_date = start_date.replace("-", "%2F")
end_date = end_date.replace("-", "%2F")
url = f"https://www.marketnews.usda.gov/mnp/ls-report-retail?runReport=true&portal=ls&startIndex=1&category={category}&repType={report_type}&species={species}&region={regions}&repDate={start_date}&endDate={end_date}&compareLy=No&format=xml"
# parse with pandas, etc...
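To sketch that last step (assuming the XML comes back as a flat list of records that pandas can read directly; worth checking against the real response):
import io
import requests
import pandas as pd

r = requests.get(url)  # the url built above, with format=xml
df = pd.read_xml(io.StringIO(r.text))  # needs pandas >= 1.3 and lxml
df.to_csv("beef_retail.csv", index=False)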

Best Python looping system for merging pandas DataFrame rows for export

I am a self-taught data science student, currently doing my first big Python portfolio project in several steps. The first step is using pandas to work with IMDb (Internet Movie Database)'s rather oddly structured .tsv files, in an effort to create a fully searchable big-data repository of all IMDb data (the officially supported searches, and even APIs like OMDb (Open Movie Database), don't allow the kinds of detailed queries I need for the larger project).
IMDb's public files include all data on movies, TV shows, episodes, actors, directors, crew, the whole business, scattered rather haphazardly across seven massive .tsv files. I've confirmed that pandas can in fact read in all of this data and that my computer's memory can handle it, but what I want to do is merge the seven files into a single DataFrame object, which can then be exported to (preferably) a SQL database, or even a huge spreadsheet or another, larger TSV file.
Each entity in the database (movie, actor, individual TV episode) has a tconst identifier, a string, which one file labels "titleId" and every other file labels "tconst". I'm going to need to rename titleId to tconst when I read that file in; this is one of several challenges I haven't got to yet.
import pandas as pd

# set pandas formatting parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)

# read in the data tables provided
showbiz_core = pd.read_table("name.basics.tsv", sep='\t')

# temporary hack - print the entire dataframe as a test
print(showbiz_core)
This works, but I'm not sure exactly how to proceed next. I want to import each of the other tsv files to reconstruct the IMDb database locally. This means I don't want duplicate tconst strings; rather, new information about a tconst entry (like a film) should end up appended to it as new columns.
Should I be looking to do a "for i in [new file]" type loop somehow? How would you go about this?
The IMDb files are actually highly structured, and looping is always a bad approach for merging data:
structure the data sourcing - I used wget rather than downloading manually
the files are large, so work with a subset for modelling purposes; I just used popular movies and actors as the driver
the comma-separated columns inside the tsv files are really sub-tables - treat them as such (I build a reference entity nmi to do this)
there are other associative relationships there as well (primaryProfession, genres)
finally, join (merge) everything together from OMDb and IMDb, taking the first row where many items associate to a title
I have left the data as tsv; clearly it would be very simple to put it into a database using the to_sql() method (see the sketch after the transform step below). The main point is the sourcing and transformation, aka ETL, which has become an unfashionable term. This could be further supplemented with web scraping; I looked at Box Office Mojo, but that would require selenium to scrape as it's dynamic HTML.
IMDb sourcing
import requests, json, re, urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import wget, gzip
from pathlib import Path
import numpy as np

# find what IMDb has to give ...
resp = requests.get("https://datasets.imdbws.com")
soup = BeautifulSoup(resp.content.decode(), "html.parser")

files = {}
for f in soup.find_all("a", href=True):
    if f["href"].endswith('gz'):
        u = urllib.parse.urlparse(f["href"])
        fn = Path().cwd().joinpath(u.path.strip("/"))
        files[Path(fn.stem).stem] = fn.name
        if not fn.is_file():
            wget.download(f["href"])
IMDb transform
Set alldata=True on the first run to prep the data; on the second run set it to False and you have a manageable subset.
alldata = False
subsetdata = True

dfs = {}
# work with a subset of data to speed up modelling and iterations. Take a few major actors and titles
# as criteria to build a manageable representative set of data
l = ["Tom Hanks", "Will Smith", "Clint Eastwood", "Leonardo DiCaprio", "Johnny Depp", "Meryl Streep", "Bruce Willis"]
tm = {'tconst': ['tt0111161', 'tt0468569', 'tt1375666', 'tt0137523', 'tt0110912', 'tt0109830', 'tt0944947', 'tt0133093', 'tt0120737', 'tt0167260', 'tt0068646'],
      'averageRating': [9.3, 9.0, 8.8, 8.8, 8.9, 8.8, 9.3, 8.7, 8.8, 8.9, 9.2],
      'numVotes': [2275837, 2237966, 1997918, 1805137, 1777920, 1752954, 1699318, 1630083, 1618100, 1602417, 1570167]}

# work with subset for modelling purpose
k = "name.basics"
if alldata:
    dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
    if subsetdata:
        # manage down size of nmi
        dfs[k] = dfs[k].loc[(dfs[k]["primaryName"].isin(l)
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][0])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][1])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][2])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][3])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][4])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][5])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][6])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][7])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][8])
                             | dfs[k]["knownForTitles"].str.contains(tm["tconst"][9])
                             )
                            & dfs[k]["knownForTitles"].str.contains("tt")]
        dfs[k].to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False)
else:
    dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")

# birth year is a lot but getting data issues...
# dfs[k] = dfs[k].dropna(subset=["primaryProfession","birthYear"])

# comma separated - not good for joins and merges. rename for consistency
dfs["nmi"] = (dfs["name.basics"].loc[:, ["nconst", "knownForTitles"]]
              .assign(knownForTitles=lambda x: x["knownForTitles"].str.split(","))
              .explode("knownForTitles")
              ).rename(columns={"knownForTitles": "tconst"}).drop_duplicates()

# already extracted known titles so can drop and de-dup - e.g. Tom Hanks
dfs[k] = dfs[k].drop(columns=["knownForTitles"]).drop_duplicates()
for k in [k for k in files.keys() if k not in ["name.basics", "omdb.titles"]]:
    if alldata:
        dfs[k] = pd.read_csv(gzip.open(files[k]), sep="\t").replace({"\\N": np.nan})
        if k == "title.akas":
            dfs[k] = dfs[k].rename(columns={"titleId": "tconst"})
        # subset titles to those we have names for
        if subsetdata:
            c = "tconst" if k != "title.episode" else "parentTconst"
            try:
                (dfs[k].loc[dfs[k][c].isin(dfs["nmi"]["tconst"])]
                 .to_csv(f"{files[k]}_subset.tsv", sep="\t", index=False))
            except KeyError as e:
                print(k, dfs[k].columns, e)
    else:
        dfs[k] = pd.read_csv(f"{files[k]}_subset.tsv", sep="\t")
    dfs[k] = dfs[k].astype({c: "Int64" for c in dfs[k].columns}, errors="ignore")

dfs["name.and.titles"] = dfs["nmi"].merge(dfs["name.basics"], on="nconst").merge(dfs["title.basics"], on="tconst")
OMDB sourcing
omdbcols = ['Title', 'Year', 'Rated', 'Released', 'Runtime', 'Genre', 'Director', 'Writer', 'Actors', 'Plot', 'Language', 'Country', 'Awards', 'Poster', 'Ratings', 'Metascore', 'imdbRating', 'imdbVotes', 'imdbID', 'Type', 'DVD', 'BoxOffice', 'Production', 'Website', 'Response']
omdbk = "omdb.titles"
files[omdbk] = f"{omdbk}.tsz"
if not Path().cwd().joinpath(files[omdbk]).is_file():
    dfs[omdbk] = pd.DataFrame(columns=omdbcols)
else:
    dfs[omdbk] = pd.read_csv(files[omdbk], sep="\t", thousands=",")
    dfs[omdbk] = dfs[omdbk].astype({c: "Int64" for c in dfs[omdbk].columns}, errors="ignore")

apikey = "xxx"  # your own OMDb API key

k = "title.basics"
# limited to 1000 API calls a day, so only fetch what has not been fetched already
for tconst in dfs[k].loc[~(dfs[k]["tconst"].isin(dfs[omdbk]["imdbID"]))]["tconst"].values:
    # tt0109830 movie Forrest Gump
    # http://www.omdbapi.com/?i=tt3896198&apikey=xxx
    params = {"apikey": apikey, "i": tconst, "plot": "full"}
    res = requests.get("http://www.omdbapi.com/", params=params)
    if res.status_code != 200:
        print("breached API limit")
        break
    else:
        dfs[omdbk] = pd.concat([dfs[omdbk], pd.json_normalize(res.json())])
        dfs[omdbk].to_csv(files[omdbk], index=False, sep="\t")
sample analysis
# The Dark Knight tt0468569
# Game of Thrones tt0944947
# for demo purposes - just pick the first association when there are many
mask = dfs[omdbk]["imdbID"].isin(["tt0468569", "tt0944947"])
demo = (dfs[omdbk].loc[mask]
        .rename(columns={c: f"OMDB{c}" for c in dfs[omdbk].columns})
        .rename(columns={"OMDBimdbID": "tconst"})
        .merge(dfs["title.basics"], on="tconst")
        .merge(dfs["title.ratings"], on="tconst")
        .merge(dfs["title.akas"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.crew"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.principals"].groupby("tconst", as_index=False).first(), on="tconst")
        .merge(dfs["title.episode"].groupby("parentTconst", as_index=False).first(),
               left_on="tconst", right_on="parentTconst", how="left", suffixes=("", "_ep"))
        .merge(dfs["nmi"]
               .merge(dfs["name.basics"], on="nconst")
               .groupby(["tconst"], as_index=False).first(), on="tconst", suffixes=("", "_name"))
        ).T
output
0 1
OMDBTitle The Dark Knight Game of Thrones
OMDBYear 2008 2011–2019
OMDBRated PG-13 TV-MA
OMDBReleased 18 Jul 2008 17 Apr 2011
OMDBRuntime 152 min 57 min
OMDBGenre Action, Crime, Drama, Thriller Action, Adventure, Drama, Fantasy, Romance
OMDBDirector Christopher Nolan NaN
OMDBWriter Jonathan Nolan (screenplay), Christopher Nolan (screenplay), Christopher Nolan (story), David S. Goyer (story), Bob Kane (characters) David Benioff, D.B. Weiss
OMDBActors Christian Bale, Heath Ledger, Aaron Eckhart, Michael Caine Peter Dinklage, Lena Headey, Emilia Clarke, Kit Harington
OMDBLanguage English, Mandarin English
OMDBCountry USA, UK USA, UK
OMDBAwards Won 2 Oscars. Another 153 wins & 159 nominations. Won 1 Golden Globe. Another 374 wins & 602 nominations.
OMDBPoster https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_SX300.jpg https://m.media-amazon.com/images/M/MV5BYTRiNDQwYzAtMzVlZS00NTI5LWJjYjUtMzkwNTUzMWMxZTllXkEyXkFqcGdeQXVyNDIzMzcwNjc@._V1_SX300.jpg
OMDBRatings [{'Source': 'Internet Movie Database', 'Value': '9.0/10'}, {'Source': 'Rotten Tomatoes', 'Value': '94%'}, {'Source': 'Metacritic', 'Value': '84/100'}] [{'Source': 'Internet Movie Database', 'Value': '9.3/10'}]
OMDBMetascore 84 <NA>
OMDBimdbRating 9 9.3
OMDBimdbVotes 2234169 1679892
tconst tt0468569 tt0944947
OMDBType movie series
OMDBDVD 09 Dec 2008 NaN
OMDBBoxOffice $533,316,061 NaN
OMDBProduction Warner Bros. Pictures/Legendary NaN
OMDBWebsite <NA> <NA>
OMDBResponse 1 1
OMDBtotalSeasons <NA> 8
titleType movie tvSeries
primaryTitle The Dark Knight Game of Thrones
originalTitle The Dark Knight Game of Thrones
isAdult 0 0
startYear 2008 2011
endYear <NA> 2019
runtimeMinutes 152 57
genres Action,Crime,Drama Action,Adventure,Drama
averageRating 9 9.3
numVotes 2237966 1699318
ordering_x 10 10
title The Dark Knight Taht Oyunları
region GB TR
language en tr
types imdbDisplay imdbDisplay
attributes fake working title literal title
isOriginalTitle 0 0
directors nm0634240 nm0851930,nm0551076,nm0533713,nm0336241,nm1888967,nm1047532,nm0764601,nm0007008,nm0617042,nm0787687,nm0687964,nm0070474,nm1125275,nm0638354,nm0002399,nm0806252,nm0755261,nm0887700,nm0590889
writers nm0634300,nm0634240,nm0333060,nm0004170 nm1125275,nm0552333,nm1888967,nm4984276,nm2643685,nm7260047,nm2977599,nm0961827,nm0260870
ordering_y 10 10
nconst nm0746273 nm0322513
category producer actor
job producer creator
characters ["Bruce Wayne"] ["Jorah Mormont"]
parentTconst NaN tt0944947
tconst_ep NaN tt1480055
seasonNumber <NA> 1
episodeNumber <NA> 1
nconst_name nm0000198 nm0000293
primaryName Gary Oldman Sean Bean
birthYear 1958 1959
deathYear 1998 2020
primaryProfession actor,soundtrack,producer actor,producer,animation_department
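Once everything is merged like this, the detailed queries the OP is after become plain pandas filters; for instance (a sketch against the name.and.titles frame built in the transform step, using column names from the source files):
# all titles a given actor is known for, with basic title info
nt = dfs["name.and.titles"]
print(nt.loc[nt["primaryName"] == "Tom Hanks", ["primaryName", "primaryTitle", "startYear"]])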

Nested string in a list - need to split nested string to help turn it into a dataframe

I'm working on a web scraping project with NBA stats. When I scrape, I can get all of the information, but all of the stats come back as one string, which, turned into a dataframe, puts all the stats in one column. I'm attempting to split this string and put the pieces back in their own nested area.
I am scraping from https://stats.nba.com/players/traditional/?sort=PTS&dir=-1 using selenium, because I am planning on clicking through all of the pages.
Here is the code I have so far - the function I'm working on. In the last line I would like to replace z[2] with the split version I've created, but when I try z[2] = z[2].split(' ') I get the error AttributeError: 'list' object has no attribute 'split'.
new_split = []
for i in player:
    player_stats.append(i.text.split('\n'))
for z in player_stats:
    new_split.append(z[2].split(' '))
The official NBA website offers its data in JSON format, which is always desirable when web scraping. (I've since updated the URL in my code; it's still the same API, which returns information for all 457 players, so there is no need to use selenium to navigate to the other pages.)
import requests
import json

# url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2019-20&SeasonType=Regular+Season&StatCategory=PTS"
url = "https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision=&Weight="

response = requests.get(url)
response.raise_for_status()
data = json.loads(response.text)

players = []
for player_data in data["resultSet"]["rowSet"]:
    player = dict(zip(data["resultSet"]["headers"], player_data))
    players.append(player)

for player in players[:10]:
    print(f"{player['PLAYER']} ({player['TEAM_ABBREVIATION']}) is rank {player['RANK']} with a GP of {player['GP']}")
Output:
James Harden (HOU) is rank 1 with a GP of 18
Giannis Antetokounmpo (MIL) is rank 2 with a GP of 19
Luka Doncic (DAL) is rank 3 with a GP of 18
Bradley Beal (WAS) is rank 4 with a GP of 17
Trae Young (ATL) is rank 5 with a GP of 18
Damian Lillard (POR) is rank 6 with a GP of 18
Karl-Anthony Towns (MIN) is rank 7 with a GP of 16
Anthony Davis (LAL) is rank 8 with a GP of 18
Brandon Ingram (NOP) is rank 9 with a GP of 15
LeBron James (LAL) is rank 10 with a GP of 19
Note: I have no idea what a "GP" is - I just picked that for demonstration. If you inspect the JSON resource in Chrome's network logger (the response from the new URL looks exactly the same, except some of the headers are different, like "TEAM" -> "TEAM_ABBREVIATION"), you can see the values - which you're struggling to extract out of one giant string - nicely separated into individual elements. The code above creates key-value pairs out of the headers ("PLAYER_ID", "RANK", etc., found in data["resultSet"]["headers"]) and these values.
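Since the end goal is a dataframe anyway, the same headers/rowSet pair feeds straight into pandas (a small sketch reusing the data parsed above):
import pandas as pd

# one row per player, one named column per header
df = pd.DataFrame(data["resultSet"]["rowSet"], columns=data["resultSet"]["headers"])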
If the second column is a string, you could try to split this string into a list, turn each element of the list into a series, and then concat this new data frame with the first two columns of the original data frame.
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
Example:
df = pd.DataFrame({"0": [1, 2],
"1": ["Name1", "Name2"],
"2":[["HOU 30 80"], ["LA 30 50"]]})
df_stats = df["2"].apply(lambda x: x[0].split(" ")).apply(pd.Series)
df_end = pd.concat([df[["0","1"]].reset_index(drop=True), df_stats], axis=1)
0 1 0 1 2
0 1 Name1 HOU 30 80
1 2 Name2 LA 30 50

How to scrape all the td and tr data from an NFL schedule

I am scraping data from espn.com for the upcoming NFL schedule. However, I am only able to get the first line of the table and not the rest of it. I believe it is because of the structure of the HTML, where each date has a different 'td'. I can get Thursday's game data, but not the rest:
Thursday, September 5
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Green Bay
Chicago
8:20 PM NBC Tickets as low as $290 Soldier Field, Chicago
Sunday, September 8
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Tennessee
Cleveland
1:00 PM CBS Tickets as low as $121 FirstEnergy Stadium, Cleveland
Cincinnati
Seattle
4:05 PM CBS Tickets as low as $147 CenturyLink Field, Seattle
New York
Dallas
4:25 PM FOX Tickets as low as $50 AT&T Stadium, Arlington
Foxboro
Monday, September 9
MATCHUP TIME (ET) NAT TV TICKETS LOCATION
Houston
New Orleans
7:10 PM ESPN Tickets as low as $112 Mercedes-Benz Superdome, New Orleans
Denver
Oakland
10:20 PM ESPN Tickets as low as $72 Oakland Coliseum, Oakland
I used BeautifulSoup and was easily able to get the data, but parsing it has been a challenge.
I tried just continuing with a for loop, but I get a StopIteration traceback. After reading a previous article about that traceback, I realized I need to try a different solution to the problem.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
import pandas as pd

main_url = 'http://www.espn.com/nfl/schedule'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, 'lxml')

table = soup.find('table')
rows = table.find_all('tr')
rows = iter(rows)
df = [td.text for td in next(rows).find_all('td') if td.text]
df2 = [td.text for td in next(rows).find_all('td') if td.text]
I believe the problem lies in this line:
table = soup.find('table')
The page actually contains three table elements with the class="schedule" attribute, but the code uses find() instead of find_all(), which is why you ended up with only the contents of the first table. Handle that part correctly and you'll be good to go. I'm not very familiar with the comprehension syntax used to fill up the lists, so the code below uses the good old for-loop style.
# list to store the rows
df = []

# collect all the tables
tables = soup.find_all('table', class_="schedule")
for table in tables:
    rows = table.find_all('tr')  # search within this table, not the whole soup
    row_item = []
    for row in rows:
        # collect all 'td' elements from the row & append them to 'row_item'
        data_items = row.find_all('td')
        for data_item in data_items:
            row_item.append(data_item.text)
        # append the completed row to 'df'
        df.append(row_item)
        row_item = []
print(df)
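From there, the list of row lists drops straight into pandas if you want a dataframe (a sketch; rows of unequal length are padded with NaN, and column names would have to come from the header row):
import pandas as pd

schedule_df = pd.DataFrame(df)  # one column per td cell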
If you're trying to pull <table> tags, you can use pandas' .read_html() to do that. It returns a list of dataframes, one per table, which you can then stack into a single frame:
import pandas as pd

url = 'http://www.espn.com/nfl/schedule'
tables = pd.read_html(url)
df = pd.concat(tables, ignore_index=True)

filtering pandas over a list and sending email

I have a pandas DataFrame like the one below:
Tweets
0 RT @cizzorz: THE CHILLER TRAP *TEMPLE RUN* OBS...
1 Disco Domination receives a change in order to...
2 It's time for the Week 3 #FallSkirmish Trials!...
3 Dance your way to victory in the new Disco Dom...
4 Patch v6.02 is available now with a return fro...
5 Downtime for patch v6.02 has begun. Find out a...
6 💀⛏️... soon
7 Launch into patch v6.02 Wednesday, October 10!...
8 Righteous Fury.\n\nThe Wukong and Dark Vanguar...
9 RT @wbgames: WB Games is happy to bring #Fortn...
I also have a list, like the one below:
my_list = ['Launch', 'Dance', 'Issue']
Now I want to filter the rows that have a matching word from my_list, get the whole row, and send it as an email or to Slack.
For example, I should get the row below as output, because it has the word "Dance" in it:
3 Dance your way to victory in the new Disco Dom..
I tried the code below to filter, but every time it gives me empty values:
data[data['Tweets'].str.contains('my_list')]
Also, I only want to send the email, with the matching row as the body, if there are matching words from the list; otherwise I don't want to send anything.
This will get it done:
import pandas as pd
import numpy as np
from io import StringIO

s = '''
"RT @cizzorz: THE CHILLER TRAP *TEMPLE RUN* OBS..."
"Disco Domination receives a change in order to..."
"It's time for the Week 3 #FallSkirmish Trials!..."
"Dance your way to victory in the new Disco Dom..."
"Patch v6.02 is available now with a return fro..."
"Downtime for patch v6.02 has begun. Find out a..."
"💀⛏️... soon"
"Launch into patch v6.02 Wednesday, October 10!..."
"Righteous Fury.\n\nThe Wukong and Dark Vanguar..."
"RT @wbgames: WB Games is happy to bring #Fortn... plane 5 [20 , 12, 30]"
'''
ss = StringIO(s)
df = pd.read_csv(ss, sep=r'\s+', names=['Data'])

my_list = ['Launch', 'Dance', 'Issue']

# OR together one contains() mask per word in the list
cond = df.Data.str.contains(my_list[0])
for x in my_list[1:]:
    cond = cond | df.Data.str.contains(x)
df[cond]
Alternatively, join the list into a single regex pattern and use regex=True:
Ex:
data[data['Tweets'].str.contains("|".join(my_list), regex=True)]
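Neither snippet covers the sending part of the question. A minimal sketch with the standard library's smtplib (the server, credentials, and addresses are placeholders to fill in; matches comes from either filter above):
import smtplib
from email.message import EmailMessage

matches = data[data['Tweets'].str.contains("|".join(my_list), regex=True)]
if not matches.empty:  # only send when something matched
    msg = EmailMessage()
    msg["Subject"] = "Matching tweets"
    msg["From"] = "me@example.com"  # placeholder
    msg["To"] = "you@example.com"  # placeholder
    msg.set_content("\n".join(matches['Tweets']))
    with smtplib.SMTP("smtp.example.com", 587) as server:  # placeholder server
        server.starttls()
        server.login("user", "password")  # placeholders
        server.send_message(msg)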
