How to scrape Amazon IMG Alt Attributes with BeautifulSoup? - python

I'm having a really hard time trying to scrape Amazon; my code works on the average page, but Amazon is really frustrating.
I know I could use find_all, but I'm using this generator approach to "keep the flow" and get text and img alt values simultaneously (see this related question: Multiple conditions in BeautifulSoup: Text=True & IMG Alt=True).
This is my code:
import requests
from bs4 import BeautifulSoup, Tag

url = "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Foot-Arch-Supports/zgbs/hpc/3780091"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

def get_raw_text(s):
    for t in s.contents:
        if isinstance(t, Tag):
            if t.name == 'img' and 'alt' in t.attrs:
                yield t['alt']
            yield from get_raw_text(t)

for text in get_raw_text(soup):
    print(text)
and I get nothing.

Try changing the HTML parser to html5lib. First install it with pip install html5lib, then try again with this:
import requests
from bs4 import BeautifulSoup, Tag

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Foot-Arch-Supports/zgbs/hpc/3780091"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')

def get_raw_text(s):
    for t in s.contents:
        if isinstance(t, Tag):
            if t.name == 'img' and 'alt' in t.attrs:
                yield t['alt']
            yield from get_raw_text(t)

for text in get_raw_text(soup):
    print(text)
Output:
Dr. Scholl’s Tri-Comfort Insoles // Comfort for Heel, Arch and Ball of Foot with Targeted Cushioning and Arch Support…
Dr. Scholl’s Sport Insoles // Superior Shock Absorption and Arch Support to Reduce Muscle Fatigue and Stress on Lower…
Copper Compression Copper Arch Support - 2 Plantar Fasciitis Braces/Sleeves. Guaranteed Highest Copper Content. Foot…
Dr. Scholl’s Extra Support Insoles // Superior Shock Absorption and Reinforced Arch Support for Big & Tall Men To Reduce…
Arch Support,3 Pairs Compression Fasciitis Cushioned Support Sleeves, Plantar Fasciitis Foot Relief Cushions for Plantar…
LLSOARSS Plantar Fasciitis Feet Sandal with Arch Support - Best Orthotic flip Flops for Flat Feet,Heel Pain- for Women
Pcssole’s 3/4 Orthotics Shoe Insoles High Arch Supports Shoe Insoles for Plantar Fasciitis, Flat Feet, Over-Pronation…
and so on...
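As a dependency-free cross-check of the same idea, collecting img alt attributes can be sketched with the standard library's html.parser module; the markup below is an invented miniature, not Amazon's real page structure:

```python
from html.parser import HTMLParser

class AltCollector(HTMLParser):
    """Collect the alt attribute of every <img> tag encountered."""
    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if "alt" in attrs:
                self.alts.append(attrs["alt"])

# Illustrative markup only -- not Amazon's actual HTML.
html = '<div><img src="a.jpg" alt="Arch Support Insole"><img src="b.jpg"></div>'
collector = AltCollector()
collector.feed(html)
print(collector.alts)
```

This sidesteps the parser-tolerance question entirely, though for a page as malformed as Amazon's a forgiving parser like html5lib remains the safer bet.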

How do I grab a title and link from a website?

I'm currently working on a web scraper for this website (https://www.allabolag.se). I would like to grab the title and link of every result on the page, and I'm currently stuck.
<a data-v-4565614c="" href="/5566435201/grenspecialisten-forvaltning-ab">Grenspecialisten Förvaltning AB</a>
This is an example from the website; I would like to grab the href and the text >Grenspecialisten Förvaltning AB<, as they contain the link and the title. How would I go about doing that?
The code I currently have looks like this:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'}
url = 'https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
questions = soup.findAll('div', {'class': 'tw-flex'})
for item in questions:
    title = item.find('a', {''}).text
    print(title)
Any help would be greatly appreciated!
Best regards :)
The results are embedded in the page as JSON. To decode them, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
}
url = "https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

data = json.loads(
    soup.find(attrs={":search-result-default": True})[":search-result-default"]
)

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for result in data:
    print(
        "{:<50} {}".format(
            result["jurnamn"], "https://www.allabolag.se/" + result["linkTo"]
        )
    )
Prints:
Grenspecialisten Förvaltning AB https://www.allabolag.se/5566435201/grenspecialisten-forvaltning-ab
Peab Fastighetsutveckling Syd AB https://www.allabolag.se/5566998430/peab-fastighetsutveckling-syd-ab
BayWa r.e. Nordic AB https://www.allabolag.se/5569701377/baywa-re-nordic-ab
Kronetorp Park Projekt AB https://www.allabolag.se/5567196539/kronetorp-park-projekt-ab
SVENSKA HUSCOMPAGNIET AB https://www.allabolag.se/5568155583/svenska-huscompagniet-ab
Byggnadsaktiebolaget Gösta Bengtsson https://www.allabolag.se/5561081869/byggnadsaktiebolaget-gosta-bengtsson
Tectum Byggnader AB https://www.allabolag.se/5562903582/tectum-byggnader-ab
Winthrop Engineering and Contracting AB https://www.allabolag.se/5592128176/winthrop-engineering-and-contracting-ab
SPI Global Play AB https://www.allabolag.se/5565082897/spi-global-play-ab
Trelleborg Offshore & Construction AB https://www.allabolag.se/5560557711/trelleborg-offshore-construction-ab
M.J. Eriksson Entreprenad AB https://www.allabolag.se/5567043814/mj-eriksson-entreprenad-ab
Solix Group AB https://www.allabolag.se/5569669574/solix-group-ab
Gripen Betongelement AB https://www.allabolag.se/5566646427/gripen-betongelement-ab
BLS Construction AB https://www.allabolag.se/5569814345/bls-construction-ab
We Construction AB https://www.allabolag.se/5590705116/we-construction-ab
Helsingborgs Fasad & Kakel AB https://www.allabolag.se/5567814248/helsingborgs-fasad-kakel-ab
Gat & Kantsten Sverige AB https://www.allabolag.se/5566564919/gat-kantsten-sverige-ab
Bjärno Byggsystem AB https://www.allabolag.se/5566743190/bjarno-byggsystem-ab
Bosse Sandells Bygg Aktiebolag https://www.allabolag.se/5564391158/bosse-sandells-bygg-aktiebolag
Econet Vatten & Miljöteknik AB https://www.allabolag.se/5567388953/econet-vatten-miljoteknik-ab
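The core trick here, JSON stored inside an HTML attribute, can be sketched without third-party libraries using the stdlib html.parser. The attribute name and fields mirror the answer above, but the markup is a made-up one-result miniature:

```python
import json
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Grab the value of a given attribute from the first tag that carries it."""
    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        if self.value is None:
            for name, value in attrs:
                if name == self.attr_name:
                    self.value = value

# Miniature stand-in for the real page: one result embedded as a JSON attribute.
html = """<div :search-result-default='[{"jurnamn": "Grenspecialisten Förvaltning AB", "linkTo": "5566435201/grenspecialisten-forvaltning-ab"}]'></div>"""

grabber = AttrGrabber(":search-result-default")
grabber.feed(html)
data = json.loads(grabber.value)
print(data[0]["jurnamn"])
```

The Vue-style `:attribute` name is just an ordinary attribute to the parser, which is why both bs4 and the stdlib parser can read it.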

Web-scraping: Accessing text information within a large list

Example: https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771
I am trying to access the number of garage spaces for several real estate listings. The problem is that the number of garage spaces isn't always at the 9th position in the list; on some pages it appears earlier, on others later.
garage = info[9].strip().replace('\n','')[15]
where
info = soup.find_all('ul', {'class': "list-default"})
info = [t.text for t in info]
and
header = {"user agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}
page = requests.get(url, headers = header)
page.reason
requests.utils.default_user_agent()
soup = bs4.BeautifulSoup(page.text, 'html5lib')
What is the best way for me to obtain how many garage spaces a house listing has?
You can use the CSS selector li:contains("Garage Spaces:"), which will find the <li> tag containing the text "Garage Spaces:". (Note that newer versions of soupsieve deprecate :contains() in favor of the equivalent :-soup-contains().)
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://www.realtor.com/realestateandhomes-detail/20013-Hazeltine-Pl_Ashburn_VA_20147_M65748-31771'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15"}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

garage_spaces = soup.select_one('li:contains("Garage Spaces:")')
if garage_spaces:
    garage_spaces = garage_spaces.text.split()[-1]
    print('Found Garage spaces! num =', garage_spaces)
Prints:
Found Garage spaces! num = 2
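When the label can appear anywhere in the list, the same label-anchored extraction can also be sketched with a plain regex over the page text; the snippet below is an invented miniature of a listing, not realtor.com's real markup:

```python
import re

# Invented sample: the "Garage Spaces" entry can sit at any position in the list.
html = """
<ul class="list-default">
  <li>Heating: Forced Air</li>
  <li>Garage Spaces: 2</li>
  <li>Cooling: Central</li>
</ul>
"""

# Anchor on the label text rather than a positional index.
match = re.search(r"Garage Spaces:\s*(\d+)", html)
if match:
    print("Found Garage spaces! num =", match.group(1))
```

Anchoring on the label rather than on `info[9]` keeps the scraper working no matter where the entry lands on a given page.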

Python web scrape numerical weather data

I am attempting to print the int value of the current outside air temperature (55).
Any tips on what I am doing wrong? (Sorry, not a lot of wisdom here!)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt

# this is used at the end with plotting results to current hour
h = dt.datetime.now().hour

r = requests.get('https://www.google.com/search?q=weather+duluth')
soup = BeautifulSoup(r.text, 'html.parser')

stuff = []
for item in soup.select('vk_bk sol-tmp'):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    # print(item)  # this is weather data
    stuff.append(item)
The current outdoor temperature on that page is tied to a specific div class (shown in a screenshot in the original post, not reproduced here).
If I attempt to print stuff, I just get an empty list back.
Adding a User-Agent header should give the expected result:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://www.google.com/search?q=weather%20duluth', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find("span", {"class": "wob_t"}).text)

How do I parse two elements that are stuck together?

I want to get the rating and numVotes from zomato.com, but unfortunately it seems like the elements are stuck together. It's hard to explain, but I made a quick video showcasing what I mean.
https://streamable.com/sdh0w
entire code: https://pastebin.com/JFKNuK2a
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1", headers=headers)
content = response.content
bs = BeautifulSoup(content, "html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    rating = zomato_container.find('div', {'class': 'search_result_rating'})
    # numVotes = zomato_container.find("div", {"class": "rating-votes-div"})
    print("rating: ", rating.get_text().strip())
    # print("numVotes: ", numVotes.text())
You can use the re module to parse the vote count:
import re
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1", headers=headers)
content = response.content
bs = BeautifulSoup(content, "html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    print('name:', zomato_container.select_one('.result-title').get_text(strip=True))
    print('rating:', zomato_container.select_one('.rating-popup').get_text(strip=True))
    votes = ''.join(re.findall(r'\d', zomato_container.select_one('[class^="rating-votes"]').text))
    print('votes:', votes)
    print('*' * 80)
Prints:
name: The Original Ghirardelli Ice Cream and Chocolate...
rating: 4.9
votes: 344
********************************************************************************
name: Tadich Grill
rating: 4.6
votes: 430
********************************************************************************
name: Delfina
rating: 4.8
votes: 718
********************************************************************************
...and so on.
Alternatively, if you don't want to use re, you can use str.split():
votes = zomato_container.select_one('[class^="rating-votes"]').get_text(strip=True).split()[0]
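Both approaches can be tried in isolation on a sample string; the text below is a stand-in for what get_text(strip=True) yields on the votes element:

```python
import re

# Stand-in for .get_text(strip=True) on the rating-votes element.
text = "344 votes"

# Approach 1: keep every digit in the string.
votes_re = ''.join(re.findall(r'\d', text))

# Approach 2: split on whitespace and take the first token.
votes_split = text.split()[0]

print(votes_re, votes_split)
```

The regex version also survives formats like "1,344 votes" (it drops the comma), while the split version returns the token comma and all; which behavior you want depends on the site's formatting.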
According to the requirements in your clip, you should alter your selectors to be more specific, so as to target the appropriate child elements rather than the parent. At present, by targeting the parent you are also getting the unwanted extra child. To get the appropriate ratings element you can use a CSS attribute-value selector with the starts-with operator. The selector
[class^=rating-votes-div]
says: match elements whose class attribute value starts with rating-votes-div. For example:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1", headers=headers)
content = response.content
bs = BeautifulSoup(content, "html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    name = zomato_container.select_one('.result-title').text.strip()
    rating = zomato_container.select_one('.rating-popup').text.strip()
    numVotes = zomato_container.select_one('[class^=rating-votes-div]').text
    print('name: ', name)
    print('rating: ', rating)
    print('votes: ', numVotes)

Unsure how to web-scrape a specific value that could be in several different places

So I've been working on a web-scraping program and have been having some difficulties with one of the last bits.
There is this website that shows records of in-game fights like so:
Example 1: https://zkillboard.com/kill/44998359/
Example 2: https://zkillboard.com/kill/44917133/
I am trying to always scrape the full information of the player who scored the killing blow. That means their name, their corporation name, and their alliance name.
The information for the above examples are:
Example 1: Name = Happosait, Corp. = Arctic Light Inc., Alliance = Arctic Light
Example 2: Name = Lord Veninal, Corp. = Sniggerdly, Alliance = Pandemic Legion
While the "Final Blow" is always listed in the top right with the name, the name does not have the corporation and alliance with it as well. The full information is always listed below in the right-hand column, "## Involved", but their location in that column depends on how much damage they did in the fight, so it is not always on top, or anywhere specific for that matter.
So while I can get their names with:
kbPilotName = soup.find_all('td', style="text-align: center;")[0].find_all('a', href=re.compile('/character/'))[0].img.get('alt')
How can I get the rest of their information?
There is a textarea element containing all the data you are looking for. It's all one block of text, but it's structured, so you can parse it however you like. Here is an example using a regex:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://zkillboard.com/kill/44998359/'
pattern = re.compile(r"(?s)Name: (.*?)Security: (.*?)Corp: (.*?)Alliance: (.*?)")

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    data = soup.select('form.form textarea#eft')[0].text
    for name, security, corp, alliance in pattern.findall(data):
        print(name.strip())
Prints:
Happosait (laid the final blow)
Baneken
Perkel
Tibor Vherok
Kheo Dons
Kayakka
Lina Ectelion
Jay Burner
Zalamus
Draacan Ferox
Luwanii
Jousen Momaki
Varcuntis Morannear
Grimm K-Man
Wob'Niar
Godfrey Silvarna
Quintus Corvus
Shadow Altair
Sieren
Isha Vir
Argyrosdraco
Jack None
Strixi
Alternative solution (parsing the "involved" page):
import requests
from bs4 import BeautifulSoup

url = 'https://zkillboard.com/kill/44998359/'
involved_url = 'https://zkillboard.com/kill/44998359/involved/'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    session.get(url)
    response = session.get(involved_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for row in soup.select('table.table tr.attacker'):
        name, corp, alliance = row.select('td.pilot > a')
        print(name.text, corp.text, alliance.text)
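The regex technique can be exercised offline on a synthetic block shaped like the textarea content. Note the pattern here is a hypothetical tightening of the one in the answer: by anchoring each field to its own line, the Corp and Alliance groups capture real values instead of matching lazily:

```python
import re

# Synthetic stand-in for the structured textarea text (invented values).
data = """Name: Happosait (laid the final blow)
Security: 0.0
Corp: Arctic Light Inc.
Alliance: Arctic Light

Name: Baneken
Security: 0.1
Corp: Some Corp
Alliance: Some Alliance
"""

# (.*) stops at end of line by default, so each field is one line.
pattern = re.compile(r"Name: (.*)\nSecurity: (.*)\nCorp: (.*)\nAlliance: (.*)")
for name, security, corp, alliance in pattern.findall(data):
    print(name.strip(), '|', corp.strip(), '|', alliance.strip())
```

With lazy groups and (?s) as in the original, the trailing Alliance group matches the empty string, which is fine when only the name is printed but not if you need the corp and alliance too.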
