Not all HTML seems to be retrieved using BeautifulSoup - python

I'm quite new to using BeautifulSoup. I have tried to retrieve a few elements from the following webpage: https://www.booking.com/hotel/tz/zuri-zanzibar.html
from bs4 import BeautifulSoup
import requests

url = "https://www.booking.com/hotel/tz/zuri-zanzibar.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

for elem in soup.find_all("div", {"class": "facilitiesChecklistSection"}):
    try:
        print(elem.attrs['data-section-id'])
    except KeyError:
        # some sections have no data-section-id attribute
        continue
I get the following IDs, whereas there should be many more:
13
-2
2
7
11
16
22
21
23
25
26
34
1
Would you know why?
Also, I don't get anything back when I try:
soup.find_all("div", {"class": "hp_location_block__map_container bui-spacer--larger"})
I'd like to retrieve the map location.
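Two things are worth checking here. When class_ is given a string containing a space, BeautifulSoup matches the exact class attribute value rather than each class separately, so a CSS selector is the more reliable way to require both classes. And pages like this often render such blocks with JavaScript, in which case the markup never appears in the raw HTML that requests receives. A minimal sketch of both checks:

import requests
from bs4 import BeautifulSoup

url = "https://www.booking.com/hotel/tz/zuri-zanzibar.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# A CSS selector matches both classes regardless of order
print(soup.select("div.hp_location_block__map_container.bui-spacer--larger"))

# If this prints False, the map container is not in the static HTML at all
# and is presumably rendered client-side, out of BeautifulSoup's reach.
print("hp_location_block__map_container" in r.text)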

Related

Trouble scraping sports table - python

I'm having a lot of trouble scraping results off a sporting table in Python.
I am new to scraping and have tried everything I can find online.
The website is https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38
and I'm just trying to get the team name and tries.
Any suggestions would be appreciated.
Make sure you have the required packages:
If you are using Anaconda, go to the Anaconda command line and type:
> pip install beautifulsoup4
> pip install requests
Now you can try the scraping library BeautifulSoup: give it your link, specify the name of the div you want from the HTML source code of the website, and the library will fetch all the matching data. Example:
import requests
from bs4 import BeautifulSoup

# Fetch the page
page = requests.get('https://www.imdb.com/title/tt7286456/criticreviews?ref_=tt_ov_rt')
# Create the soup variable by passing the page HTML to BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
file = open("data.txt", "w")  # don't forget to create a data.txt file in the same directory as your file.py
# Match every div with class_='summary' using find_all()
for x in soup.find_all('div', class_='summary'):
    print(x)            # print the scraped data
    file.write(x.text)  # store the data in data.txt
    file.write("\n")
file.close()
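As a small aside, the same loop can be written with a with block (just a sketch of the variant), which closes the file automatically and makes the explicit file.close() unnecessary:

with open("data.txt", "w") as file:
    for x in soup.find_all('div', class_='summary'):
        print(x)                   # print the scraped data
        file.write(x.text + "\n")  # store the data in data.txt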
More details in my scraping project: https://github.com/mehdimaaref7/Scrapping-Sentiment-Analysis/blob/master/big_data.py
The data.txt file is optional; you can just use print(x) to display the data you need to scrape.
No need to iterate. Just get the JSON, then let pandas parse it into a DataFrame:
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.nrl.com/stats/teams/?competition=111&season=2022&stat=38'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The table data is embedded as JSON in the q-data attribute of this div
data = soup.find('div', {'id': 'vue-stats-detail'})['q-data']
jsonData = json.loads(data)
df = pd.DataFrame(jsonData['totalStats']['leaders'])
Output:
print(df)
teamNickName ... played
0 Storm ... 24
1 Roosters ... 24
2 Rabbitohs ... 24
3 Panthers ... 24
4 Eels ... 24
5 Cowboys ... 24
6 Sharks ... 24
7 Broncos ... 24
8 Sea Eagles ... 24
9 Raiders ... 24
10 Titans ... 24
11 Dragons ... 24
12 Knights ... 24
13 Warriors ... 24
14 Bulldogs ... 24
15 Wests Tigers ... 24
[16 rows x 5 columns]
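Since the question only asks for the team name and tries, the DataFrame can then be sliced down to those two columns. A sketch, with the caveat that 'tries' is an assumed column name (stat=38 selects tries in the URL, but the key in the JSON payload may differ), so inspect df.columns first:

print(df.columns)  # check what the payload actually calls each column
print(df[['teamNickName', 'tries']])  # 'tries' is an assumed column name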

Python 3 Web Scrape & Beautiful Soup Tag Attribute

I am practicing on Beautiful Soup and am after a product's price, description and item number. The first two are text and are easy to get. The third is an attribute of the tag, data-trade-price, as seen below:
<div class="price-group display-metro has-promo-price medium ng-scope" ng-class="{'has-trade-price': ShowTrade}" data-trade-price="221043">
I am after the numbers such as 221043, which are loaded in by the page, i.e. all 24 item numbers matching all 24 products.
My code is:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.supercheapauto.com.au/store/car-care/wash-wax-polish/1021762?page=1&pageSize=24&sort=-ProductSummaryPurchasesWeighted%2C-ProductSummaryPurchases')
soup = BeautifulSoup(r.text, 'lxml')
results = soup.find_all('div', class_='details')
for result in results:
    try:
        SKU = result.select_one("data-trade-price")
    except AttributeError:
        SKU = "N/A"
    DESC = result.find('div', class_='title').text.strip().upper()
    PRICE = result.find('span', class_='currency').text.strip().upper()
    print(SKU, '\t', DESC, '\t', PRICE)
What is the syntax to get the item number from the soup?
Sorry, I am after the syntax that can iterate through the page of 24 products and recover the 24 different item numbers. The example given was to show the part of the attribute value I was after. I ran the given answer and it works, but I am unsure how to integrate it into the code above, as the variations I have tried do not work. Any suggestions?
You can access the attribute just like a dictionary.
Ex:
from bs4 import BeautifulSoup
s = """<div class="price-group display-metro has-promo-price medium ng-scope" ng-class="{'has-trade-price': ShowTrade}" data-trade-price="221043"<\div>"""
soup = BeautifulSoup(s, "html.parser")
print( soup.find("div", class_="price-group display-metro has-promo-price medium ng-scope").attrs["data-trade-price"] )
or
print( soup.find("div", class_="price-group display-metro has-promo-price medium ng-scope")["data-trade-price"] )
Output:
221043
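To fold this into the loop from the question, something like the following sketch may work. It assumes each 'details' div contains a descendant div carrying the data-trade-price attribute, which the question's snippet suggests but whose page structure is not shown here:

for result in soup.find_all('div', class_='details'):
    # find any descendant div that has the attribute at all, then read it
    price_div = result.find('div', attrs={'data-trade-price': True})
    SKU = price_div['data-trade-price'] if price_div else 'N/A'
    print(SKU)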

web-scraping using python ('NoneType' object is not iterable)

I am new to Python and web scraping. I am trying to scrape a website (the link is the url in the code below). I am getting the error "'NoneType' object is not iterable" on the last line of the code. Could anyone point out what could have gone wrong?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://labtestsonline.org/tests-index'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# Function to get hyperlinks for all test components
hyperlinks = []
def parseUrl(url):
    global hyperlinks
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    for a in soup.findAll('div', {'class': 'field-content'}):
        a = a.find('a')
        href = urljoin(url, a.get('href'))
        hyperlinks.append(href)

parseUrl(url)

# Function to get the header and common questions for each test component
def header(url):
    page = requests.get(url).content
    soup = BeautifulSoup(page, 'lxml')
    h = []
    commonquestions = []
    for head in soup.find('div', {'class': 'field-item'}).find('h1'):
        heading = head.get_text()
        h.append(heading)
    for q in soup.find('div', {'id': 'Common_Questions'}):
        questions = q.get_text()
        commonquestions.append(questions)

for i in range(0, len(hyperlinks)):
    header(hyperlinks[i])
Below is the traceback:
<ipython-input-50-d99e0af6db20> in <module>()
      1 for i in range(0, len(hyperlinks)):
----> 2     header(hyperlinks[i])

<ipython-input-49-15ac15f9071e> in header(url)
      5     soup = BeautifulSoup(page, 'lxml')
      6     h = []
----> 7     for head in soup.find('div',{'class':'field-item'}).find('h1'):
      8         heading = head.get_text()
      9         h.append(heading)

TypeError: 'NoneType' object is not iterable
soup.find('div',{'class':'field-item'}).find('h1') is returning None. First check whether the lookup returned anything before looping over it.
Something like:
heads = soup.find('div', {'class': 'field-item'}).find('h1')
if heads:
    for head in heads:
        # remaining code
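Note that the outer find can come back empty as well (in which case .find('h1') would raise an AttributeError instead), so a fuller guard, sketched along the lines of the question's header function, might be:

container = soup.find('div', {'class': 'field-item'})
heads = container.find('h1') if container else None
if heads:
    for head in heads:
        heading = head.get_text()
        h.append(heading)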
Try this. It should solve the issues you are having at the moment. I used a CSS selector to get the job done.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

link = 'https://labtestsonline.org/tests-index'
page = requests.get(link)
soup = BeautifulSoup(page.content, 'lxml')
for a in soup.select('.field-content a'):
    new_link = urljoin(link, a.get('href'))  # joining relative urls so they can be reused
    response = requests.get(new_link)  # sending another http request
    sauce = BeautifulSoup(response.text, 'lxml')
    for item in sauce.select("#Common_Questions .field-item"):
        print(item.text)
    print("<<<<<<<<<>>>>>>>>>>>")

Soup.find_all is only returning some of the results in Python 3.5.1

I'm trying to get all of the urls for thumbnails from my webpage that have the class "thumb", but soup.find_all is only printing the most recent 22 or so.
Here is the code:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a", {'class': "thumb"})
for link in links:
    print(link.get("href"))
I think you meant to ask about following the pagination and grabbing all the links into a list. Here is an implementation of that idea: use the offset parameter and grab links until there are no more links present, incrementing the offset by 24 (the number of links per page):
import requests
from bs4 import BeautifulSoup

offset = 0
links = []
with requests.Session() as session:
    while True:
        r = session.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit?offset=%d" % offset)
        soup = BeautifulSoup(r.content, "html.parser")
        new_links = [link["href"] for link in soup.find_all("a", {'class': "thumb"})]

        # no more links - break the loop
        if not new_links:
            break

        links.extend(new_links)
        print(len(links))
        offset += 24

print(links)

Parsing error with Beautiful Soup 4 and Python

I need to get the list of the rooms from this website: http://www.studentroom.ch/en/dynasite.cfm?dsmid=106547
I'm using Beautiful Soup 4 in order to parse the page.
This is the code I have written so far:
from bs4 import BeautifulSoup
import urllib

pageFile = urllib.urlopen("http://studentroom.ch/dynasite.cfm?dsmid=106547")
pageHtml = pageFile.read()
pageFile.close()

soup = BeautifulSoup("".join(pageHtml))
roomsNoFilter = soup.find('div', {"id": "ImmoListe"})
rooms = roomsNoFilter.table.find_all('tr', recursive=False)

for room in rooms:
    print room
    print "----------------"

print len(rooms)
For now I'm trying to get only the rows of the table,
but I get only 7 rows instead of 78 (or 77).
At first I thought I was receiving only partial HTML, but I printed the whole HTML and I'm receiving it correctly.
There are no AJAX calls that load new rows after the page has loaded...
Could someone please help me find the error?
This is working for me:
soup = BeautifulSoup(pageHtml)
div = soup.select('#ImmoListe')[0]
table = div.select('table > tbody')[0]
k = 0
for room in table.find_all('tr'):
    if 'onmouseout' in str(room):
        print room
        k = k + 1
print "Total ", k
Let me know the status.
