How to scrape a website using BeautifulSoup - Python

I am trying to scrape the website https://remittanceprices.worldbank.org/en/corridor/Australia/China for the Fee field.
import requests
from bs4 import BeautifulSoup

url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a, b, c, d, e = row[2], row[15], row[18], row[21], row[25]
    #print(a, b, c, d, e, sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}').format(row)
But I am getting an AttributeError with the above code.
Can anyone help me out?

The problem is that you are calling .format() on the result of print(), not on the string. .format() is a str method, while print() returns None, so try:
import requests
from bs4 import BeautifulSoup

url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a, b, c, d, e = row[2], row[15], row[18], row[21], row[25]
    #print(a, b, c, d, e, sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}'.format(row))
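As a side note, on Python 3.6+ an f-string avoids the precedence slip entirely (a minimal sketch using the same field indices as above):

for row in rows:
    # index into the split row exactly as before
    print(f"{row[2]}|{row[15]}|{row[18]}|{row[21]}|{row[25]}")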

Related

I am getting the following JSON error when trying to web scrape dates from Trustpilot with BS4 - Python

I am using the following script to scrape user review data from the Trustpilot website (https://ca.trustpilot.com/review/www.hellofresh.ca) for some analysis of user sentiment. I expect to scrape the
Date, Star Rating, and Review Content,
but when I run the code I get the following error. Can anyone explain why?
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
import json
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup

# headers is defined elsewhere in the original script; a plain browser User-Agent is assumed here
headers = {"User-Agent": "Mozilla/5.0"}

stars = []
dates = []
comments = []
results = []

with requests.Session() as s:
    for num in range(1, 2):
        url = "https://ca.trustpilot.com/review/www.hellofresh.ca?page={}".format(num)
        r = s.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for star in soup.find_all("section", {"class": "review__content"}):
            # Get rating value
            rating = star.find("div", {"class": "star-rating star-rating--medium"}).find('img').get('alt')
            # Get date value
            #date_json = json.loads(star.find('script').text)
            #date = date_json['publishedDate']
            date_tag = star.select("div.review-content-header__dates > script")
            date = json.loads(date_tag[0].text)
            dt = datetime.strptime(date['publishedDate'], "%Y-%m-%dT%H:%M:%SZ")
            # Get comment
            comment = star.find("div", class_="review-content__body").text
            stars.append(rating)
            dates.append(dt)
            comments.append(comment)
            data = {"Rating": rating, "Review": comment, "Dates": date}
            results.append(data)
        time.sleep(2)

print(results)
To get the JSON data, use the .string attribute instead of .text:
...
date = json.loads(date_tag[0].string)

>>> print(date)
{'publishedDate': '2021-01-04T21:57:34+00:00', 'updatedDate': None, 'reportedDate': None}
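The reason .text comes back empty here is most likely that recent BeautifulSoup releases (4.9+) store the contents of <script> tags in a special Script string class that get_text()/.text skips by default, while .string still returns it. A minimal sketch of the difference, using a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = '<script type="application/ld+json">{"publishedDate": "2021-01-04T21:57:34+00:00"}</script>'
tag = BeautifulSoup(html, 'html.parser').find('script')

print(repr(tag.text))    # typically '' on bs4 4.9+, so json.loads() fails with "Expecting value"
print(repr(tag.string))  # the raw JSON payload, which json.loads() can parse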

Beautiful Soup: how to select <a href> and <td> elements with whitespace

I'm trying to use BeautifulSoup to select the date, url, description, and additional url from a table, and am having trouble accessing them given the odd whitespace:
So far I've written:
import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
test1 = soup.findAll("td", {"nowrap": "nowrap"})
test2 = [item.text.strip() for item in test1]
With bs4 4.7.1 you can use :has and nth-of-type, in combination with next_sibling, to get those columns:
from bs4 import BeautifulSoup
import requests, re

def make_soup(url):
    the_page = requests.get(url)
    soup_data = BeautifulSoup(the_page.content, "html.parser")
    return soup_data

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')

releases = []
links = []
dates = []
descs = []
addit_urls = []

for i in soup.select('td:nth-of-type(1):has([href^="/litigation/litreleases/"])'):
    sib_sib = i.next_sibling.next_sibling.next_sibling.next_sibling
    releases += [i.a.text]
    links += [i.a['href']]
    dates += [i.next_sibling.next_sibling.text.strip()]
    descs += [re.sub(r'\t+|\s+', ' ', sib_sib.text.strip())]
    addit_urls += ['N/A' if sib_sib.a is None else sib_sib.a['href']]

result = list(zip(releases, links, dates, descs, addit_urls))
print(result)
Unfortunately there is no class or id HTML attribute to quickly identify the table to scrape; after experimentation I found it was the table at index 4.
Next we separate the header from the data, which still contains table rows that are just separators for quarters. We can skip over these with a try-except block, since those rows only contain one table data tag.
I noticed that the description is separated by tabs, so I split the text on \t.
For the urls, I used .get('href') rather than ['href'] since not every anchor tag has an href attribute from my experience scraping. This avoids errors should that case occur. Finally the second anchor tag does not always appear, so this is wrapped in a try-except block as well.
data = []
table = soup.find_all('table')[4]  # target the specific table
header, *rows = table.find_all('tr')

for row in rows:
    try:
        litigation, date, complaint = row.find_all('td')
    except ValueError:
        continue  # ignore quarter rows
    id = litigation.text.strip().split('-')[-1]
    date = date.text.strip()
    desc = complaint.text.strip().split('\t')[0]
    lit_url = litigation.find('a').get('href')
    try:
        comp_url = complaint.find('a').get('href')
    except AttributeError:
        comp_url = None  # complaint url is optional
    info = dict(id=id, date=date, desc=desc, lit_url=lit_url, comp_url=comp_url)
    data.append(info)

Problems with web scraping (William Hill-UFC Odds)

I'm creating a web scraper that will let me get the odds of upcoming UFC fights on William Hill (https://sports.williamhill.com/betting/en-gb/ufc). I'm using Beautiful Soup but have not yet been able to successfully scrape the needed data.
I need the fighters names and their odds.
I've attempted a variety of methods to try get the data, trying to scrape different tags etc., but nothing happens.
def scrape_data():
    data = requests.get("https://sports.williamhill.com/betting/en-gb/ufc")
    soup = BeautifulSoup(data.text, 'html.parser')
    links = soup.find_all('a', {'class': 'btmarket__name btmarket__name--featured'}, href=True)
    for link in links:
        links.append(link.get('href'))
    for link in links:
        print(f"Now currently scraping link: {link}")
        data = requests.get(link)
        soup = BeautifulSoup(data.text, 'html.parser')
        time.sleep(1)
        fighters = soup.find_all('p', {'class': "btmarket__name"})
        c = fighters[0].text.strip()
        d = fighters[1].text.strip()
        f1.append(c)
        f2.append(d)
        odds = soup.find_all('span', {'class': "betbutton_odds"})
        a = odds[0].text.strip()
        b = odds[1].text.strip()
        f1_odds.append(a)
        f2_odds.append(b)
    return None
I would expect it to be exported to a CSV file. I'm currently using Morph.io to host and run the scraper, but it returns nothing.
If correct, it would output:
Fighter1Name:
Fighter2Name:
F1Odds:
F2Odds:
For every available fight.
Any help would be greatly appreciated.
The HTML returned has different attributes and values, so you need to inspect the actual response.
When writing out to CSV you will want to prepend "'" to the odds to stop them being treated as fractions or dates. See the commented-out alternatives in the code below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://sports.williamhill.com/betting/en-gb/ufc')
soup = bs(r.content, 'lxml')

results = []

for item in soup.select('.btmarket:has([data-odds])'):
    match_name = item.select_one('.btmarket__name[title]')['title']
    odds = [i['data-odds'] for i in item.select('[data-odds]')]
    row = {'event-starttime': item.select_one('[datetime]')['datetime'],
           'match_name': match_name,
           'home_name': match_name.split(' vs ')[0],
           #'home_odds': "'" + str(odds[0]),
           'home_odds': odds[0],
           'away_name': match_name.split(' vs ')[1],
           'away_odds': odds[1]
           #'away_odds': "'" + str(odds[1])
           }
    results.append(row)

df = pd.DataFrame(results, columns=['event-starttime', 'match_name', 'home_name', 'home_odds', 'away_name', 'away_odds'])
print(df.head())

# write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig', index=False)

Web scraping the contents of tables

Hi, I am trying to use Python and Beautiful Soup to scrape a webpage. There are various tables in the page with results that I want, but I am struggling to:
1) find the right table
2) find the right two cells
3) write the cells 1 and 2 into a dictionary key and value, respectively.
So far, after making a request and parsing the HTML, I use:
import requests
from bs4 import BeautifulSoup

URL = 'someurl.com'

def datascrape(url):
    page = requests.get(url)
    print("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

soup = datascrape(URL)

results = {}
for row in soup.findAll('tr'):
    aux = row.findAll('td')
    try:
        if "Status" in (aux.stripped_strings):
            key = (aux[0].strings)
            value = (aux[1].string)
            results[key] = value
    except:
        pass
print(results)
Unfortunately "results" is always empty. I am really not sure where I am going wrong. Could anyone enlighten me please?
I'm not sure why you're using findAll() instead of find_all(), as I'm fairly new to web scraping, but nevertheless I think this gives you the output you're looking for.
import requests
from bs4 import BeautifulSoup

URL = 'http://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.html'

def datascrape(url):
    page = requests.get(url)
    print("requesting page")
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

soup = datascrape(URL)

results = {}
table_rows = soup.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    try:
        for i in row:
            if "Status" in i:
                key = row[0].strip()
                value = row[1].strip()
                results[key] = value
            else:
                pass
    except:
        pass
print(results)
Hope this helps!
If you are just after the Status and the Not Applicable value, you can use positional nth-of-type CSS selectors. This does depend on the position being the same across pages.
import requests
from bs4 import BeautifulSoup

url = 'https://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
tdCells = [item.text.strip() for item in soup.select('table:nth-of-type(2) tr:nth-of-type(1) td')]
results = {tdCells[0]: tdCells[1]}
print(results)
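If the table position does turn out to vary between report pages, a hedged alternative is to locate the cell by its text rather than by position (a sketch against the same page; it assumes the "Status" label sits in its own <td>):

import requests
from bs4 import BeautifulSoup

url = 'https://sitem.herts.ac.uk/aeru/bpdb/Reports/2070.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# find the <td> whose text mentions "Status", then take the next <td> as its value
status_cell = soup.find('td', string=lambda s: s and 'Status' in s)
if status_cell is not None:
    results = {status_cell.text.strip(): status_cell.find_next('td').text.strip()}
    print(results)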

BS4 Not Locating Element in Python

I am somewhat new to Python and can't for the life of me figure out why the following code isn’t pulling the element I am trying to get.
Here is my code:
from urllib.request import urlopen
import bs4 as bs

# all_players and player_pbp_data are defined earlier in the script
for player in all_players:
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]

    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test

    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')

    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
It currently returns:
--> for td in table.find_all('td'):
player_pbp_data.append(td.get_text()) #if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating through the children of the wrapper object returns
<div class="table_outer_container"> as part of the tree.
Thanks!
Make sure that table contains the data you expect.
For example, https://www.basketball-reference.com/players/a/abdulka01.html doesn't seem to contain a div with id='all_advanced_pbp'.
Try to explicitly pass the html instead:
bs.BeautifulSoup(the_html, 'html.parser')
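A minimal sketch of that check, meant to sit inside the existing for player in all_players loop, so a missing section just skips the player instead of raising AttributeError:

req = urlopen(player_url)
soup = bs.BeautifulSoup(req, 'lxml')

wrapper = soup.find('div', id='all_advanced_pbp')
if wrapper is None:
    # this player's page has no advanced play-by-play section
    print(f"no 'all_advanced_pbp' div on {player_url}")
    continue

table = wrapper.find('div', class_='table_outer_container')
if table is None:
    continue

for td in table.find_all('td'):
    player_pbp_data.append(td.get_text())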
I tried to extract data from the URL you gave but did not get the full DOM. I then tried accessing the page in a browser with and without JavaScript: the site needs JavaScript to load some data, but pages like the player list do not. The simple way to get dynamic data is to use Selenium.
This is my test code
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

player_pbp_data = []

def get_list(t="a"):
    with requests.Session() as se:
        url = "https://www.basketball-reference.com/players/{}/".format(t)
        req = se.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        with open("a.html", "wb") as f:
            f.write(req.text.encode())
        table = soup.find("div", class_="table_wrapper setup_long long")
        players = {player.a.text: "https://www.basketball-reference.com" + player.a["href"] for player in table.find_all("th", class_="left ")}
        return players

def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")

get_each_player()
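A possible way to wire the two helpers together (a sketch; it assumes get_list() returns the players dict as above and that chromedriver is available):

players = get_list("a")
for name, url in players.items():
    print("scraping", name)
    get_each_player(url)
print(len(player_pbp_data), "cells collected")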
