Beautifulsoup Python loops - python

I have this code that returns None for each row, can someone help me?
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Scrape the UFC COVID-19 table: one <tr> per fighter.
website = 'https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find('table', {'class': 'p-data-table'}).find('tbody').find_all('tr')

name = []
reported_date = []
card = []
card_date = []
opponent = []
resolution = []

for result in results:
    # Bug fix: the loop variable is `result`, not `i` (NameError), and
    # list.append() always returns None — which is why the original
    # printed None for every row.  Append first, then print the value.
    cells = result.find_all('td')
    name.append(cells[0].get_text())
    print(name[-1])

You can use pandas directly and get all the columns up to the last two:
import pandas

# pandas.read_html parses every <table> on the page; the first one is the
# COVID list.  Slice off the last two columns, which are not needed here.
URL = "https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19"
frame = pandas.read_html(URL)[0].iloc[:, :-2]
print(frame.to_string())
Output (truncated):
Fighter Reported Card Card Date Opponent Resolution
0 Rani Yahya 7/31/2021 UFC Vegas 33 7/31/2021 Kyung Ho Kang Fight scratched
1 Amanda Nunes 7/29/2021 UFC 265 8/7/2021 Julianna Pena Fight scratched
2 Amanda Ribas 5/23/2021 UFC Vegas 28 6/5/2021 Angela Hill Fight scratched
3 Jack Hermansson 5/19/2021 UFC 262 5/17/2021 Edmen Shahbazyan Rescheduled for UFC Vegas 27 - May 22

Related

Web scraping a table through multiple pages with a single link

I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r

base_URL = 'https://www.nba.com/players'


def scrape_webpage(URL):
    """Fetch URL and print the player-card texts found in the static HTML.

    NOTE(review): this only sees the first page — the remaining records are
    rendered client-side, so they never appear in the raw HTML.
    """
    player_names = []
    resp = r.get(URL)
    print(f'{resp.status_code}')
    document = bs(resp.content, 'html.parser')
    cards = document.find_all('div', class_='flex flex-col lg:flex-row')
    for card in cards:
        player_names.append(card.get_text().strip())
    print(player_names)


scrape_webpage(base_URL)
The player data is embedded inside a <script> element on the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd

url = "https://www.nba.com/players"

# The full roster is embedded as a JSON blob inside a <script> tag on the
# page; pull it out with a regex and decode it.
page_text = requests.get(url).text
blob = re.search(r'({"props":.*})', page_text).group(0)
payload = json.loads(blob)

# uncomment to print all data:
# print(json.dumps(payload, indent=4))

df = pd.DataFrame(payload["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
PERSON_ID
PLAYER_LAST_NAME
PLAYER_FIRST_NAME
PLAYER_SLUG
TEAM_ID
TEAM_SLUG
IS_DEFUNCT
TEAM_CITY
TEAM_NAME
TEAM_ABBREVIATION
JERSEY_NUMBER
POSITION
HEIGHT
WEIGHT
COLLEGE
COUNTRY
DRAFT_YEAR
DRAFT_ROUND
DRAFT_NUMBER
ROSTER_STATUS
FROM_YEAR
TO_YEAR
PTS
REB
AST
STATS_TIMEFRAME
PLAYER_LAST_INITIAL
HISTORIC
0
1630173
Achiuwa
Precious
precious-achiuwa
1610612761
raptors
0
Toronto
Raptors
TOR
5
F
6-8
225
Memphis
Nigeria
2020
1
20
1
2020
2021
9.1
6.5
1.1
Season
A
False
1
203500
Adams
Steven
steven-adams
1610612763
grizzlies
0
Memphis
Grizzlies
MEM
4
C
6-11
265
Pittsburgh
New Zealand
2013
1
12
1
2013
2021
6.9
10
3.4
Season
A
False
2
1628389
Adebayo
Bam
bam-adebayo
1610612748
heat
0
Miami
Heat
MIA
13
C-F
6-9
255
Kentucky
USA
2017
1
14
1
2017
2021
19.1
10.1
3.4
Season
A
False
3
1630583
Aldama
Santi
santi-aldama
1610612763
grizzlies
0
Memphis
Grizzlies
MEM
7
F-C
6-11
215
Loyola-Maryland
Spain
2021
1
30
1
2021
2021
4.1
2.7
0.7
Season
A
False
4
200746
Aldridge
LaMarcus
lamarcus-aldridge
1610612751
nets
0
Brooklyn
Nets
BKN
21
C-F
6-11
250
Texas-Austin
USA
2006
1
2
1
2006
2021
12.9
5.5
0.9
Season
A
False

Scrapy: unable to locate table or scrape data in table

For a group project, I am trying to scrape Salaries table within https://www.basketball-reference.com/players/a/allenra02.html.
I have tried multiple CSS and Xpath selectors such as
#all_salaries > tbody > tr:nth-child(1)
#all_salaries > tbody
#all_salaries > tbody > tr:nth-child(1) > td.right
#all_salaries
//*[#id="all_salaries"]/tbody/tr[1]/td[3]
//*[#id="all_salaries"]/tbody
//*[#id="all_salaries"]
Code look as follows:
def start_requests(self):
    """Yield the initial request for the player's page."""
    start_urls = ['https://www.basketball-reference.com/players/a/allenra02.html']
    for url in start_urls:
        # Bug fix: the original callback referenced self.parse_season,
        # which is not defined in the shown spider — the parsing method
        # is parse_player.  (Confirm parse_season doesn't exist elsewhere.)
        yield scrapy.Request(url=url, callback=self.parse_player)
def parse_player(self, response):
    # Bug fix: the original selector string was missing its closing
    # quote ('#all_salaries > tbody) — a SyntaxError.  Note this still
    # matches nothing on the raw response, because the salary table is
    # inside an HTML comment (see the EDIT below).
    response.css('#all_salaries > tbody')
I tried printing it out, but it keeps returning an empty list.
Other tables seem fine, except this one.
EDIT:
My final solution looks something like
# The salary table is shipped inside an HTML comment and injected later by
# JavaScript, so extract the comment's raw text and re-parse it.
regex = re.compile(r'<!--(.*)-->', re.DOTALL)
# Bug fix: XPath attribute tests are written @id, not #id — the '#' form
# is an XPath syntax error and the query would raise instead of matching.
salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').get()
if salaries:
    # Pull the commented-out HTML out of the <!-- --> wrapper.
    salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').re(regex)[0]
    # Build a fresh Selector over the hidden markup so CSS selectors work.
    salaries_sel = scrapy.Selector(text=salaries, type="html")
    all_salaries = salaries_sel.css('#all_salaries > tbody > tr').extract()
You can use BeautifulSoup to pull out the comments then parse the table with pandas. I chose to only pull out the salary table, but you can get all the tables in the comments this way.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = "https://www.basketball-reference.com/players/a/allenra02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# basketball-reference hides several tables inside HTML comments (they are
# injected by JavaScript), so collect every comment node first.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'all_salaries'})[0])
            break
        # Bug fix: the bare `except:` also swallowed SystemExit and
        # KeyboardInterrupt.  pd.read_html raises ValueError when a
        # comment contains no table matching attrs, so catch only that.
        except ValueError:
            continue

print(tables[0].to_string())
Output:
Season Team Lg Salary
0 1996-97 Milwaukee Bucks NBA $1,785,000
1 1997-98 Milwaukee Bucks NBA $2,052,360
2 1998-99 Milwaukee Bucks NBA $2,320,000
3 1999-00 Milwaukee Bucks NBA $9,000,000
4 2000-01 Milwaukee Bucks NBA $10,130,000
5 2001-02 Milwaukee Bucks NBA $11,250,000
6 2002-03 Milwaukee Bucks NBA $12,375,000
7 2003-04 Seattle SuperSonics NBA $13,500,000
8 2004-05 Seattle SuperSonics NBA $14,625,000
9 2005-06 Seattle SuperSonics NBA $13,223,140
10 2006-07 Seattle SuperSonics NBA $14,611,570
11 2007-08 Boston Celtics NBA $16,000,000
12 2008-09 Boston Celtics NBA $18,388,430
13 2009-10 Boston Celtics NBA $18,776,860
14 2010-11 Boston Celtics NBA $10,000,000
15 2011-12 Boston Celtics NBA $10,000,000
16 2012-13 Miami Heat NBA $3,090,000
17 2013-14 Miami Heat NBA $3,229,050
18 Career (may be incomplete) NaN $184,356,410
It's because that table is actually commented out in the original source code and later added via javascript. Have a look here on how to get the comment contents: Scrapy: Extract commented (hidden) content

scraping basketball results and associate related competition to each match

I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
import requests
from bs4 import BeautifulSoup

# Bug fix: the URL must be a string literal — the original
# `url = http://...` is a SyntaxError.
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to add competition to each row I scrape
I want to create a table and each row is the match results (competition, home team, away team, score...)
Selenium
Try this (selenium):
# Render the schedule page with Selenium (the tables are built by
# JavaScript, so plain requests would not see them), then walk the tables:
# header tables name a competition, the tables after them hold its matches.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time

res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)  # give the page's JavaScript time to populate the tables
page = driver.page_source
driver.close()
soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        # A table WITH a class attribute is a competition header; remember
        # its name so following result tables can be tagged with it.
        competition = table.select_one('a b font').text
    else:
        # Result table: matches are consecutive pairs of tr.b1 rows —
        # even-indexed rows are the home side, odd-indexed the away side.
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])  # odds detail link
df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'
                                ])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
And saves a Res.csv that looks like this:
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests

# The AJAX endpoint returns XML in which each match is an <h> element whose
# text is one '^'-separated record.
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

res = []
for node in soup.find_all('h'):
    fields = node.text.split('^')
    res.append([f'{fields[1]}', f'{fields[8]}', f'{fields[10]}',
                f'{fields[11]}', f'{fields[12]}'])

df = pd.DataFrame(res, columns=['competition', 'home', 'away',
                                'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
And saves a Res.csv that looks like this:
If you do not want the index column you can simply add index=False to df.to_csv('Res.csv') so it looks like this df.to_csv('Res.csv', index=False)
Note selenium: You need selenium and geckodriver and in this code geckodriver is set to be imported from c:/program/geckodriver.exe
The selenium version is slower but has no need to fetch and find the XML file with devtools
This page uses JavaScript to load data but requests/BeautifulSoup can't run JavaScript.
So you have two options.
First: you can use Selenium to control real web browser which can run JavaScript. It can be better when page use complex JavaScript code to generate data - but this slower because it needs to run web browser which has to render page and run JavaScript.
Second: you can try to use DevTools in Firefox/Chrome (tab Network, filter XHR) to find the URL used by JavaScript/AJAX (XHR) to get data from the server, and use that URL with requests. Often you can get JSON data, which can be converted to a Python list/dictionary, and then you don't need BeautifulSoup to scrape the data. It is faster, but sometimes the page uses JavaScript code that is hard to replicate in Python.
I choose second method.
I found it reads data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but it gives XML data so it still needs BeautifulSoup (or lxml) to scrape data.
import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
reply = requests.get(url)
document = BS(reply.text, 'html.parser')

# Every <h> element is one match; its text is '^'-separated.  Fields 8/11
# are the home team and score, 10/12 the away team and score.
for record in document.find_all('h'):
    fields = record.text.split('^')
    #print(fields)
    print(fields[8], fields[11])
    print(fields[10], fields[12])
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---

Scraping of the BBB site converting JSON to a DataFrame

I would like to put this information into a dataframe and then export to excel. So far tutorials in python produce table errors. No luck converting the JSON data to a data frame.
Any tips would be very helpful.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Bug fix: `%matplotlib inline` is an IPython/Jupyter magic, not Python —
# in a plain .py file it is a SyntaxError, so it must stay commented out.
# %matplotlib inline
from urllib.request import urlopen
import bs4
import requests, re, json

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=10126-000&find_id=357_10126-000_alias&find_text=roofing&find_type=Category&page=1&touched=1', headers=headers)

# The search results are embedded in the page as a JSON assignment to
# PRELOADED_STATE__; extract the object text and decode it.
p = re.compile(r'PRELOADED_STATE__ = (.*?);')
data = json.loads(p.findall(r.text)[0])
results = [(item['businessName'],
            ' '.join([item['address'], item['city'], item['state'], item['postalcode']]),
            item['phone'])
           for item in data['searchResult']['results']]
print(results)
import re
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get('https://www.bbb.org/search?find_country=USA&find_entity=10126-000&find_id=357_10126-000_alias&find_text=roofing&find_type=Category&page=1&touched=1', headers=headers)

# The listing data is a JSON object assigned to PRELOADED_STATE__ inside an
# inline <script>; grab the object with a regex and decode it.
state_re = re.compile(r'PRELOADED_STATE__ = (.*?);')
state = json.loads(state_re.findall(page.text)[0])

rows = []
for item in state['searchResult']['results']:
    address = ' '.join([item['address'], item['city'], item['state'], item['postalcode']])
    rows.append((item['businessName'], address, item['phone']))

df = pd.DataFrame(rows, columns=['Business Name', 'Address', 'Phone'])
print(df)
df.to_csv('data.csv')
Prints:
Business Name Address Phone
0 Trinity Roofing, LLC Stilwell KS 66085-8238 [(913) 432-4425, (303) 699-7999]
1 Trinity Roofing, LLC 14241 E 4th Ave Ste 5-300 Aurora CO 80011-8733 [(913) 432-4425, (303) 699-7999]
2 CMR Construction & Roofing of Texas, LLC 12500 E US Highway 40, Ste. B1 Independence MO... [(855) 376-6326, (855) 766-3267]
3 All-Star Home Repairs LLC 1806 Grove Ave Richmond VA 23220-4506 [(804) 405-9337]
4 MadSky Roofing & Restoration, LLC Bank of America Center, 16th Floor 1111 E. Mai... [(855) 623-7597]
5 Robert Owens Roofing Bealeton VA 22712-9706 [(540) 878-3544]
6 Proof Seal of Athens PO Box 80732 Canton OH 447080732 [(330) 685-6363]
7 Proof Seal of Athens Athens OH 45701-1847 [(330) 685-6363]
8 Tenecela General Services Corp 57 Anderson St Lowell MA 01852-5357 None
9 Water Tight Roofing & Siding 57 Whitehall Way Hyannis MA 02601-2149 [(508) 364-8323]
10 Tenecela General Services Corp 745 Broadway St Fl 2 Lowell MA 01854-3137 None
11 Just In Time Roofing & Contracting, LLC ----- Ft Worth TX 76102 [(888) 666-3122, (254) 296-8016, (888) 370-3331]
12 Paramount Construction of Southerntier NY Inc. 323 Fluvanna Ave. Jamestown NY 14701 [(716) 487-0093]
13 Paramount Construction of Southerntier NY Inc. P O Box 488 Falconer NY 14733 [(716) 487-0093]
14 Paramount Construction of Southerntier NY Inc. 1879 Lyndon Boulevard Falconer NY 14733 [(716) 487-0093]
And saves data.csv (screenshot from LibreOffice):

Missing data not being scraped from Hansard

I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK House of Parliament. This is the precise link I'm trying to scrape: in a nutshell, I want to scrape every "mention" container on this page and the following 50 pages after that.
But I find that when my scraper is "finished," it's only collected data on 990 containers and not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I only set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought that this was perhaps due to the fact that I wasn't giving the URLs enough time to load, so I added some delays in the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
# Hansard's results are paginated 1..51; page 0 serves the same content as
# page 1, which is why the original collected duplicated/missing rows.
pages = np.arange(1, 52)
topics = []
houses = []
names = []
dates = []

for page in pages:
    # Bug fix: the original URL contained a literal space before
    # "searchTerm", which corrupts the query string.
    hansard_url = ("https://hansard.parliament.uk/search/Contributions?"
                   "searchTerm=%22civilian%20casualties%22"
                   "&startDate=01%2F01%2F1988%2000%3A00%3A00"
                   "&endDate=07%2F14%2F2020%2000%3A00%3A00")
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    # Bug fix: reusing the loop variable `page` for the response object
    # clobbered the page number.
    response = get(full_url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2, 10))  # be polite to the server between requests
    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")
        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)
        date = mention.find('div', class_="").text
        dates.append(date)
    time.sleep(randint(2, 10))

# Bug fix: the original had a stray trailing ')' after this call — a
# SyntaxError.
hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics})
print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())
hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in helping me solve this problem is appreciated.
The server returns an empty container for page 48, so the total is 1000 results across pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://hansard.parliament.uk/search/Contributions'
# Query parameters passed to requests, which handles the URL-encoding.
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}

all_data = []
for page_no in range(1, 52):
    params['page'] = page_no
    print('Page {}...'.format(page_no))
    html = requests.get(url, params=params).content
    soup = BeautifulSoup(html, 'html.parser')
    hits = soup.find_all('div', class_="result contribution")
    if not hits:
        # Page 48 comes back empty from the server, hence 1000 rows total.
        print('Empty container!')
    for hit in hits:
        portcullis = hit.find("img")["alt"]
        if portcullis == "Lords Portcullis":
            chamber = "House of Lords"
        elif portcullis == "Commons Portcullis":
            chamber = "House of Commons"
        else:
            chamber = "N/A"
        all_data.append({
            'Date': hit.find('div', class_="").get_text(strip=True),
            'House': chamber,
            'Speaker': hit.find('div', class_="secondaryTitle").text,
            'Topic': hit.div.span.text,
        })

df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]

Categories