How can I scrape tables that seem to be hidden by jQuery? - Python

I'm trying to scrape these words with their meanings from this website. I scraped the first table, but even after revealing word list 2 by clicking on it, bs4 can't find that table (or any of the other hidden tables). Is there anything different I'm meant to do for toggled/hidden elements like this?
Here's what I used to access the first table:
import requests
from bs4 import BeautifulSoup

root = "https://www.graduateshotline.com/gre-word-list.html#x2"
content = requests.get(root).text
soup = BeautifulSoup(content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'tablex border1'})[0]
print(table)

The hidden word lists aren't in the initial HTML at all; the page's JavaScript loads each one on demand from its own URL, such as https://www.graduateshotline.com/gre/load.php?file=list2.html. Request that URL directly and let pandas.read_html parse the table:
import pandas as pd

df = pd.read_html('https://www.graduateshotline.com/gre/load.php?file=list2.html',
                  attrs={'class': 'tablex border1'})[0]
print(df)
Output:
0 1
0 multifarious varied; motley; greatly diversified
1 substantiation giving facts to support (statement)
2 feud bitter quarrel over a long period of time
3 indefatigability not easily exhaustible; tirelessness
4 convoluted complicated;coiled; twisted
.. ... ...
257 insensible unconscious; unresponsive; unaffected
258 gourmand a person who is devoted to eating and drinking...
259 plead address a court of law as an advocate
260 morbid diseased; unhealthy (e.g.. about ideas)
261 enmity hatred being an enemy
[262 rows x 2 columns]
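Since each hidden list appears to be served from its own load.php URL, here is a short sketch that collects several lists into one frame. The URL pattern and the number of lists (1-5 here) are assumptions you should verify against the site, e.g. in the browser's network tab:
import pandas as pd

frames = []
for n in range(1, 6):  # assumed number of lists; adjust after checking the site
    url = f'https://www.graduateshotline.com/gre/load.php?file=list{n}.html'
    frames.append(pd.read_html(url, attrs={'class': 'tablex border1'})[0])

all_words = pd.concat(frames, ignore_index=True)
all_words.columns = ['word', 'meaning']
print(all_words.shape)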

Related

Struggling to grab data from baseball reference

I'm trying to grab the batting-against tables for all pitchers found on this page.
I believe the problem lies with the data being behind a comment.
For the sake of the example, I'd like to find, say, Sandy Alcantara's home runs allowed.
import requests
from bs4 import BeautifulSoup as bs
url="https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page=requests.get(url)
soup=bs(page.content,"html.parser")
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    print(td)
This prints a lot of team data, but doesn't print the pitcher data.
How can I cleverly get it to print the pitcher data? Ideally, I'd have it in a list or something.
object[0]=Rk
object[1]=Name
object[4]=IP
object[13]=HR
The problem with extracting the table content is that the table itself is stored inside an HTML comment.
After you have fetched the web page and loaded it into BeautifulSoup, you can solve this scraping issue with the following steps:
gather the div tagged id = 'all_players_batting_pitching', which contains your table
extract the table from the comment using the decode_contents function, then reload that text into a new soup
extract each record of the table by looking for the tr tag, keeping the td values at positions [0, 3, 12] (Name, IP and HR, matching the code below; the row's link and a running counter supply the href and Rk columns)
load your values into a pandas DataFrame, ready to be used
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# fetching web page
url = "https://www.baseball-reference.com/leagues/majors/2022-batting-pitching.shtml"
page = requests.get(url)
# extracting table from html
soup = bs(page.content,"html.parser")
table = soup.find(id = 'all_players_batting_pitching')
tab_text = table.decode_contents().split('--')[1].strip()
tab_soup = bs(tab_text,"html.parser")
# extracting records from table
records = []
for i, row in enumerate(tab_soup.find_all('tr')):
    record = [ele.text.strip() for j, ele in enumerate(row.find_all('td')) if j in [0, 3, 12]]
    if record != []:
        records.append([row.a['href']] + [i] + record)
# load the records into a DataFrame (columns match the output below)
df = pd.DataFrame(records, columns=['href', 'Rk', 'Name', 'IP', 'HR'])
Output:
href Rk Name IP HR
0 /players/a/abbotco01.shtml 1 Cory Abbott 48.0 12
1 /players/a/abreual01.shtml 2 Albert Abreu 38.2 5
2 /players/a/abreual01.shtml 3 Albert Abreu 8.2 2
3 /players/a/abreual01.shtml 4 Albert Abreu 4.1 1
4 /players/a/abreual01.shtml 5 Albert Abreu 25.2 2
... ... ... ... ... ...
1063 /players/z/zastrro01.shtml 1106 Rob Zastryzny* 1.0 0
1064 /players/z/zastrro01.shtml 1107 Rob Zastryzny* 3.0 0
1065 /players/z/zerpaan01.shtml 1108 Angel Zerpa* 11.0 2
1066 /players/z/zeuchtj01.shtml 1109 T.J. Zeuch 10.2 5
1067 /players/z/zimmebr02.shtml 1110 Bruce Zimmermann* 73.2 21
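To answer the Sandy Alcantara example from the question: once the records are in the DataFrame built at the end of the code above, his home runs allowed is just a row filter. The exact spelling of the name in the Name column is an assumption about the scraped text; adjust if the site decorates names:
alcantara = df[df['Name'].str.contains('Sandy Alcantara', na=False)]
print(alcantara[['Name', 'IP', 'HR']])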

Python categorize data in excel based on key words from another excel sheet

I have two Excel sheets; one has four different categories with keywords listed under each. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and DataFrames to compare them, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to Pandas.
Here is an example:
Category sheet:

Service     Experience
fast        bad
slow        easy

Data Sheet:

Review #    Location    Review
1           New York    "The service was fast!"
2           Texas       "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code; note that I am using a simple example. In the example below I am trying to find the review data that would match the Customer Service list of keywords.
import pandas as pd
# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")
# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frame for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])
for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
(?P<Service>fast|slow|quick)
Then use str.extractall to get the matches, aggregate them into a summary, and join the result back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! [fast] [easy]
1 2 Texas Overall it was a bad experience for me [] [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! fast easy
1 2 Texas Overall it was a bad experience for me bad
Alternatively, for an existence test, use any on level=0 (on newer pandas versions, where the level argument was removed from any, use .notna().groupby(level=0).any() instead):
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).any(level=0)
)
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
Review # Location Review Service Experience
0 1 New York The service was fast and easy! True True
1 2 Texas Overall it was a bad experience for me False True
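For reference, here is a minimal, self-contained sketch of the extractall approach with the two sheets built inline instead of read from Excel; the column names mirror the example above and are otherwise illustrative:
import pandas as pd

# stand-ins for the two Excel sheets
cat = pd.DataFrame({'Service': ['fast', 'slow'],
                    'Experience': ['bad', 'easy']})
reviews = pd.DataFrame({'Review #': [1, 2],
                        'Location': ['New York', 'Texas'],
                        'Review': ['The service was fast and easy!',
                                   'Overall it was a bad experience for me']})

# one named group per category column
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])

# True/False per category for each review
flags = reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
print(reviews.join(flags).fillna(False))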

Why can't I webscrape the table that I want?

I am new to BeautifulSoup and I wanted to try out some web scraping. For my little project, I wanted to get the Golden State Warriors' win rate from Wikipedia. I was planning to get the table that had that information and make it into a pandas DataFrame so I could graph it over the years. However, my code selects the Table Key table instead of the Seasons table. I know this is because they are the same type of table (wikitable), but I don't know how to solve this problem. I am sure that there is an easy explanation that I am missing. Can someone please explain how to fix my code and explain how I could choose which tables to web scrape in the future? Thanks!
import urllib.request
from bs4 import BeautifulSoup

c_data = "https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons" #wikipedia page
c_page = urllib.request.urlopen(c_data)
c_soup = BeautifulSoup(c_page, "lxml")
c_table = c_soup.find('table', class_='wikitable') #this is the problem
c_year = []
c_rate = []
for row in c_table.findAll('tr'): #setup for dataframe
    cells = row.findAll('td')
    if len(cells) == 13:
        c_year.append(cells[0].find(text=True))  # list.append returns None, so don't reassign
        c_rate.append(cells[9].find(text=True))
print(c_year, c_rate)
Use pd.read_html to get all the tables. This function returns a list of dataframes - tables[0] through tables[17], in this case.
import pandas as pd
# read tables
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons')
print(len(tables))
>>> 18
tables[0]
0 1
0 AHC NBA All-Star Game Head Coach
1 AMVP All-Star Game Most Valuable Player
2 COY Coach of the Year
3 DPOY Defensive Player of the Year
4 Finish Final position in division standings
5 GB Games behind first-place team in division[b]
6 Italics Season in progress
7 Losses Number of regular season losses
8 EOY Executive of the Year
9 FMVP Finals Most Valuable Player
10 MVP Most Valuable Player
11 ROY Rookie of the Year
12 SIX Sixth Man of the Year
13 SPOR Sportsmanship Award
14 Wins Number of regular season wins
# display all dataframes in tables
for i, table in enumerate(tables):
    print(f'Table {i}')
    display(table)
    print('\n')
Select specific table
df_i_want = tables[x] # x is the specified table, 0 indexed
# delete tables
del(tables)
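To avoid counting table indices in the future, pd.read_html also takes a match argument that keeps only the tables whose text contains the given string. The string below is an assumption about what appears in the Seasons table, so pick anything unique to the table you actually want:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_Golden_State_Warriors_seasons'
# keep only tables containing the text 'Win%' (assumed to appear in the Seasons table)
seasons = pd.read_html(url, match='Win%')[0]
print(seasons.head())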

How to Automatically select data on webpage and download the resulting xls file using Python

I am new to Python. I am trying to scrape the data on the page:
For example:
Category: grains
Organic: No
Commodity: Coarse
SubCommodity: Corn
Publications: Daily
Location: All
Refine Commodity: All
Dates: 07/31/2018 - 08/01/2019
Is there a way for Python to select these options on the webpage, click Run, then click Download as Excel and store the Excel file?
Is that possible? I am new to coding and need some guidance here.
Currently, I have entered the selections manually and used Beautiful Soup to scrape the table from the resulting page. However, it takes a lot of time since the table is spread over more than 200 pages.
Using the query you defined as an example, I input the query manually and found the following URL for the Excel (really HTML) format:
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=excel'
In the URL are parameters we can set in Python, and we could easily make a loop to change them. For now, let me just get into the example of actually getting this data. I use pandas.read_html to read this HTML result and populate a DataFrame, which can be thought of as a table with columns and rows.
import pandas as pd
# use URL defined earlier
# url = '...'
df_lst = pd.read_html(url, header=1)
Now df_lst is a list of DataFrame objects containing the desired data. For your particular example, this results in 30674 rows and 11 columns:
>>> df_lst[0].columns
Index([u'Report Date', u'Location', u'Class', u'Variety', u'Grade Description',
u'Units', u'Transmode', u'Low', u'High', u'Pricing Point',
u'Delivery Period'],
dtype='object')
>>> df_lst[0].head()
Report Date Location Class Variety Grade Description Units Transmode Low High Pricing Point Delivery Period
0 07/31/2018 Blytheville, AR YELLOW NaN US NO 2 Bushel Truck 3.84 3.84 Country Elevators Cash
1 07/31/2018 Helena, AR YELLOW NaN US NO 2 Bushel Truck 3.76 3.76 Country Elevators Cash
2 07/31/2018 Little Rock, AR YELLOW NaN US NO 2 Bushel Truck 3.74 3.74 Mills and Processors Cash
3 07/31/2018 Pine Bluff, AR YELLOW NaN US NO 2 Bushel Truck 3.67 3.67 Country Elevators Cash
4 07/31/2018 Denver, CO YELLOW NaN US NO 2 Bushel Truck-Rail 3.72 3.72 Terminal Elevators Cash
>>> df_lst[0].shape
(30674, 11)
Now, back to the point I made about the URL parameters--using Python, we can run through lists and format the URL string to our liking. For instance, iterating through 20 years of the given query can be done by modifying the URL to have numbers corresponding to positional arguments in Python's str.format() method. Here's a full example below:
import datetime
import pandas as pd
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    #print(df_lst[0].head()) # uncomment to see first five rows
    #print(df_lst[0].shape) # uncomment to see DataFrame shape
Be careful with pd.read_html. I've modified my answer with a header keyword argument to pd.read_html() because the multi-indexing made it a pain to get results. By giving a single row index as the header, it's no longer a multi-index, and data indexing is easy. For instance, I can get corn class using this:
df_lst[0]['Class']
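Row-level filtering works the same way; for example, to keep only the rows whose Class is YELLOW (a value taken from the sample output above):
yellow = df_lst[0][df_lst[0]['Class'] == 'YELLOW']
print(yellow.shape)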
Compiling all the reports into one large file is also easy with Pandas. Since we have a DataFrame, we can use the pandas.to_csv function to export our data as a CSV (or any other file type you want, but I chose CSV for this example). Here's a modified version with the additional capability of outputting a CSV:
import datetime
import pandas as pd
# URL
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
# CSV output file and flag
csv_out = 'myreports.csv'
flag = True
# Start and end dates
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
# Iterate through dates and get report from URL
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    print(df_lst[0].head())   # show first five rows
    print(df_lst[0].shape)    # show DataFrame shape
    # Save to big CSV
    if flag is True:
        # 0th iteration, so write header and overwrite existing file
        df_lst[0].to_csv(csv_out, header=True, mode='w')  # change mode to 'wb' if Python 2.7
        flag = False
    else:
        # Subsequent iterations should append to file and not add new header
        df_lst[0].to_csv(csv_out, header=False, mode='a')  # change mode to 'ab' if Python 2.7
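If memory allows, an alternative to appending on every iteration is to collect the frames in a list and write a single CSV at the end with pandas.concat; a sketch, reusing the same format-string URL defined above:
import datetime
import pandas as pd

# url is the same format string defined in the block above
start = [datetime.date(2018 - i, 7, 31) for i in range(20)]
end = [datetime.date(2019 - i, 8, 1) for i in range(20)]

frames = []
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    frames.append(pd.read_html(url_get, header=1)[0])

pd.concat(frames, ignore_index=True).to_csv('myreports.csv', index=False)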
Your particular query generates at least 1227 pages of data, so I trimmed it down to one location, Arkansas (AR), from 07/31/2018 - 08/01/2019, which generates 47 pages of data. The XML size was 500KB.
You can semi-automate it like this:
>>> end_day='01'
>>> start_day='31'
>>> start_month='07'
>>> end_month='08'
>>> start_year='2018'
>>> end_year='2019'
>>> link = f"https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={end_month}%2F{end_day}%2F{end_year}&commDetail=All&endDateWeekly={end_month}%2F{end_day}%2F{end_year}&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={start_month}%2F{start_day}%2F{start_year}+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear={end_year}&repDateWeekly={start_month}%2F{start_day}%2F{start_year}&_wrange=1&endDateWeeklyGrain=&repYear={end_year}&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate={start_month}%2F{start_day}%2F{start_year}&endDate={end_month}%2F{end_day}%2F{end_year}&format=xml"
>>> link
'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&endDateWeekly=08%2F01%2F2019&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear=2019&repDateWeekly=07%2F31%2F2018&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=xml'
>>> import urllib.request
>>> with urllib.request.urlopen(link) as response:
...     html = response.read()
...
Loading the HTML could take a hot minute with large queries.
If you for some reason wished to process the entire data set, you could repeat this process, but you may wish to look into techniques specifically optimized for big data, perhaps a solution involving pandas together with numexpr (which accelerates and parallelizes the evaluation of numerical expressions).
You can find the data used in this answer here - which you can download as an xml.
First, import the XML parser:
>>> import xml.etree.ElementTree as ET
You can either parse the document you downloaded from the website in Python (ET.parse expects a file name or file object, so for the in-memory bytes use ET.fromstring, which returns the root element directly)
>>> report = ET.fromstring(html)
or parse a manually downloaded file
>>> tree = ET.parse('report.xml')
>>> report = tree.getroot()
you can then do stuff like this:
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> for el in report[0]:
... print(el)
...
<Element 'reportDate' at 0x7f902adcf368>
<Element 'location' at 0x7f902ac814f8>
<Element 'classStr' at 0x7f902ac81548>
<Element 'variety' at 0x7f902ac81b88>
<Element 'grade' at 0x7f902ac29cc8>
<Element 'units' at 0x7f902ac29d18>
<Element 'transMode' at 0x7f902ac29d68>
<Element 'bidLevel' at 0x7f902ac29db8>
<Element 'deliveryPoint' at 0x7f902ac29ea8>
<Element 'deliveryPeriod' at 0x7f902ac29ef8>
More info on parsing xml is here.
You're going to want to learn some Python, but hopefully you can make sense of the following. Luckily, there are many free Python tutorials online; here is a quick snippet to get you started.
#lets find the lowest bid on a certain day
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> report[0][7]
<Element 'bidLevel' at 0x7f902ac29db8>
>>> report[0][7][0]
<Element 'lowPrice' at 0x7f902ac29e08>
>>> report[0][7][0].text
'3.84'
#how many low bids are there?
>>> len(report)
1216
#get an average of the lowest bids...
>>> low_bid_list = [float(bid[7][0].text) for bid in report]
>>> low_bid_list
[3.84, 3.76, 3.74, 3.67, 3.65, 3.7, 3.5, 3.7, 3.61,...]
>>> sum = 0
>>> for el in low_bid_list:
...     sum = sum + el
...
>>> sum
4602.599999999992
>>> sum/len(report)
3.7850328947368355
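If you end up preferring pandas for the analysis, the parsed XML can also be flattened into a DataFrame. This sketch assumes the element layout shown above, where an element such as bidLevel is the only one with nested children:
import pandas as pd

rows = []
for record in report:
    row = {}
    for el in record:
        if len(el):                     # nested element such as bidLevel
            for sub in el:
                row[sub.tag] = sub.text
        else:
            row[el.tag] = el.text
    rows.append(row)

df = pd.DataFrame(rows)
print(df[['reportDate', 'location', 'lowPrice']].head())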

scraping table from site - finding blank cells, python

Here's the site I'm working with: http://www.fantasypros.com/mlb/probable-pitchers.php
What I want to do is run the code every day and have it return a list of the pitchers that are pitching that day, so just the first column. Here's what I have so far.
from bs4 import BeautifulSoup
import requests

url = 'http://www.fantasypros.com/mlb/probable-pitchers.php'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find('table', {'class': 'table table-condensed'})
table2 = table.find('tbody') #this finds just the rows with pitchers (excludes dates)
daysOnPage = []
for row in table.findAll('th'):
    daysOnPage.append(row.text)
daysOnPage.pop(0)
#print(daysOnPage)
pitchers = []
for row in table2.findAll('a', {'class': 'available mpb-available'}):
    pitchers.append(row.text)
This returns a list of every pitcher on the page. If every cell in the table were always filled, I could do something like deleting every nth player, but that seems pretty inelegant, and it also doesn't work since you never know which cells will be blank. I've looked through the table2.prettify() output but I can't find anything that indicates where a blank cell is coming from.
Thanks for the help.
Edit: Tinkering a little bit, I've figured this much out:
for row in table2.find('tr'):
    for a in row.findAll('a', {'class': 'available mpb-available'}):
        pitchers.append(a.text)
    continue
That prints the first row of pitchers, which is also a problem I was going to tackle later. Why does the continue not make it iterate through the rows?
When I hear table, I think pandas. You can have pandas.read_html do the parsing for you, then use pandas.Series.dropna to return only valid values.
In [1]: import pandas as pd
In [2]: dfs = pd.read_html('http://www.fantasypros.com/mlb/probable-pitchers.php')
In [3]: df = dfs[0].head(10) # get the first dataframe and just use the first 10 teams for this example
In [4]: print(df['Thu Aug 6']) # Selecting just one day by label
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
2 NaN
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
6 NaN
7 NaN
8 NaN
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
In [5]: active = df['Thu Aug 6'].dropna() # now just drop any fields that are NaNs
In [6]: print(active)
0 #WSHJ. Hellickson(7-6)SP 124
1 MIAM. Wisler(5-1)SP 306
3 #NYYE. Rodriguez(6-3)SP 177
4 SFK. Hendricks(4-5)SP 51
5 STLM. Lorenzen(3-6)SP 300
9 KCB. Farmer(0-2)SP 270
Name: Thu Aug 6, dtype: object
I suppose the last thing you'll want to do is parse the strings in the table to get just the pitcher's name; a sketch of that follows the CSV example below.
If you want to write the Series to a csv, you can do so directly by:
In [7]: active.to_csv('active.csv')
This gives you a csv that looks something like this:
0,#WSHJ. Hellickson(7-6)SP 126
1,MIAM. Wisler(5-1)SP 306
3,#NYYE. Rodriguez(6-3)SP 179
4,SFK. Hendricks(4-5)SP 51
5,STLM. Lorenzen(3-6)SP 301
9,KCB. Farmer(0-2)SP 267
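As for pulling just the pitcher's name out of strings like '#WSHJ. Hellickson(7-6)SP 124', a regular expression works. This sketch assumes the cells keep the initial-dot-surname-then-record format shown above:
import re

def pitcher_name(cell):
    # e.g. '#WSHJ. Hellickson(7-6)SP 124' -> 'J. Hellickson'
    m = re.search(r"([A-Z]\. [A-Za-z' .-]+)\(", str(cell))
    return m.group(1).strip() if m else None

names = active.map(pitcher_name)
print(names.tolist())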
