Unable to parse two fields from a webpage using requests module - python

I'm trying to scrape two fields, product_title and item_code, from this webpage using the requests module. When I execute the script below, I always get an AttributeError instead of the result because the data I'm after is not in the page source.
However, I've come across several solutions on here that fetch data from JavaScript-rendered sites even when the data isn't in the page source, so I suppose there should be some way to grab the two fields from this webpage using requests.
import requests
from bs4 import BeautifulSoup

link = 'https://www.sainsburys.co.uk/gol-ui/Product/persil-small---mighty-non-bio-laundry-liquid-21l-60-washes'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[data-test-id='pd-product-title']").get_text(strip=True)
    item_code = soup.select_one("span#productSKU").get_text(strip=True)
    print(product_title, item_code)
Expected output:
Persil Non-Bio Laundry Liquid 1.43L
Item code: 7637944
How can I fetch the two fields from that site using requests?

Actually, the website is calling an API behind the scenes, so you can use that directly to get the data:
import requests

r = requests.get('https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product?filter[product_seo_url]=gb%2Fgroceries%2Fpersil-small---mighty-non-bio-laundry-liquid-21l-60-washes&include[ASSOCIATIONS]=true&include[PRODUCT_AD]=citrus')
products = r.json()['products']
for each_product in products:
    print(f"Item code: {each_product['product_uid']}")
    print(each_product['name'])

# Item code: 7637944
# Persil Non-Bio Laundry Liquid 1.43L
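If the long query string is hard to read, the same call can be expressed with a params dict, letting requests do the URL encoding. A minimal sketch; the api and params names are just for illustration, and it assumes the endpoint accepts the percent-encoding requests applies to the bracketed keys:

import requests

api = 'https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product'
params = {
    # Same query as above, just spelled out; requests percent-encodes the values
    'filter[product_seo_url]': 'gb/groceries/persil-small---mighty-non-bio-laundry-liquid-21l-60-washes',
    'include[ASSOCIATIONS]': 'true',
    'include[PRODUCT_AD]': 'citrus',
}
r = requests.get(api, params=params)
for each_product in r.json()['products']:
    print(f"Item code: {each_product['product_uid']}")
    print(each_product['name'])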

Related

scraper returning empty when trying to scrape in beautiful soup

Hi, I want to scrape domain names and their prices, but it's returning an empty result and I don't know why.
import requests
from bs4 import BeautifulSoup

url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)
Try the following approach to get the domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in the page=1 query parameter at the end of the link (see the pagination sketch after the output below).
import requests
from bs4 import BeautifulSoup

link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'

payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url, params=payload)
    for key, val in resp.json()['triggers'].items():
        if not key.startswith('domain'):
            continue
        container = val['extraUrlParams']
        print(container['pr1nm'], container['pr1pr'])
The output looks like (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195
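A sketch of the multi-page variant mentioned above, assuming you want pages 1 through 3 and that every page embeds the same amp-iframe; widen the range as needed:

import requests
from bs4 import BeautifulSoup

base = 'https://www.brandbucket.com/styles/6-letter-domain-names?page={}'
url = 'https://www.brandbucket.com/amp/ga'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    for page in range(1, 4):  # assumed page range; adjust as needed
        res = s.get(base.format(page))
        soup = BeautifulSoup(res.text, 'html.parser')
        payload = {
            '__amp_source_origin': 'https://www.brandbucket.com',
            'dp': soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0],
        }
        resp = s.get(url, params=payload)
        for key, val in resp.json()['triggers'].items():
            if not key.startswith('domain'):
                continue
            container = val['extraUrlParams']
            print(container['pr1nm'], container['pr1pr'])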
Check the status code of the response. When I tested, the web server returned 403, and because of that there is no "domainCardDetail" div in the response.
The reason for this is that the website is protected by Cloudflare.
There are some advanced ways to bypass this.
The following solution is very simple if you do not need to scrape at scale. Otherwise, you may want to use "cloudscraper", "Selenium", or another method that executes the site's JavaScript.
Open the developer console.
Go to the "Network" tab. Make sure the boxes are ticked as in the picture below.
https://i.stack.imgur.com/v0KTv.png
Refresh the page.
Copy the JSON result and parse it in Python (see the sketch below the screenshots).
https://i.stack.imgur.com/odX5S.png
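A minimal parsing sketch for the last step, assuming you pasted the copied JSON into a file called response.json. The field names below are hypothetical, so adjust them to whatever the copied response actually contains:

import json

# Load the JSON copied from the browser's Network tab
with open('response.json') as f:
    data = json.load(f)

# Hypothetical structure: replace 'domains', 'name' and 'price'
# with the keys present in your copied response
for item in data.get('domains', []):
    print(item['name'], item['price'])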

webscraping python not showing all tags

I'm new to webscraping. I was trying to make a script that gets data from a balance sheet (here the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I look at the source code in my browser, I'm able to find the tag and the correct value. But once I write a script with bs4, I don't get anything.
I'm trying to get information from the balance sheet: Products, Services, Cost of sales... and the data contained in table 1. (I'm sorry, but I can't post the image. Anyway, it's the first table you see scrolling down.)
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data, "lxml")
names = soup_data.find_all("td")
for name in names:
    print(name)
Thanks for your time.
Try the URL below instead: the /ix?doc= viewer loads the filing via JavaScript, while the direct Archives URL serves the raw HTML. Also include the headers to get the data.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.
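For example, a minimal sketch that builds on soup_data above and prints the text of every non-empty table cell:

# Walk every <td> and keep only cells that actually contain text
for td in soup_data.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        print(text)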

Can't use image ids in order to make them qualified image links

I'm trying to scrape all the image links from this webpage using the requests module. When I use this link, I can only scrape the image links up to the point where the rest of the content loads while scrolling downward. However, if I use this link, I can get all the image ids by incrementing the last number attached to that link. The problem is I can't reuse those ids to build full-fledged image links.
I've tried with:
import requests
from bs4 import BeautifulSoup

url = 'https://stocksnap.io/api/search-photos/phone/relevance/desc/1'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
    r = s.get(url)
    for item in r.json()['results']:
        print(item['img_id'])
How can I grab all the image links from the landing page of that website?
PS: the first few sponsored image links should be ignored, as they are not included in the API either.
Inspecting the page shows the image URLs are constructed from the ID and the first two keywords obtained from the API:
import requests

url = 'https://stocksnap.io/api/search-photos/phone/relevance/desc/{}'

page = 1
while True:
    data = requests.get(url.format(page)).json()
    if not data['results']:
        break
    for r in data['results']:
        print('https://stocksnap.io/photo/{}-{}-{}'.format(r['keywords'][0], r['keywords'][1], r['img_id']))
    page += 1
Prints:
...
https://stocksnap.io/photo/iphone-cellphone-LNXYMM77SS
https://stocksnap.io/photo/business-technology-OGLUHZAPGF
https://stocksnap.io/photo/samsung-android-7ZALGLUAAW
https://stocksnap.io/photo/apple-macbook-55A6840521
https://stocksnap.io/photo/woman-talking-54C3E9FE9D
https://stocksnap.io/photo/samsung-galaxy-BB3307280A
https://stocksnap.io/photo/parc-bench-3D99A31C0C
https://stocksnap.io/photo/iphone-cellphone-E2C541A7DC
https://stocksnap.io/photo/iphone-mockup-167A645BDC
https://stocksnap.io/photo/mac-keyboard-BA9AFFE0BF
https://stocksnap.io/photo/sony-android-EB939B3311
https://stocksnap.io/photo/iphone-cellphone-B962ABCAC7
https://stocksnap.io/photo/building-man-D49A8BB4AE
https://stocksnap.io/photo/technology-computer-C9B37875B9
https://stocksnap.io/photo/iphone-cellphone-381F0FD1EE
https://stocksnap.io/photo/work-bag-96E1A8F1CB
https://stocksnap.io/photo/iphone-phone-70FE8C00C9
https://stocksnap.io/photo/iphone-mockup-9FCDF4E1F5
https://stocksnap.io/photo/young-girl-BE8BA006E6
https://stocksnap.io/photo/young-girl-7174B21D56
https://stocksnap.io/photo/man-woman-6XELVX8KAN
https://stocksnap.io/photo/nexus-smartphones-UAXILBRNUL
EDIT: To get .jpg links, the same method applies:
import requests

url = 'https://stocksnap.io/api/search-photos/phone/relevance/desc/{}'

page = 1
while True:
    data = requests.get(url.format(page)).json()
    if not data['results']:
        break
    for r in data['results']:
        print('https://cdn.stocksnap.io/img-thumbs/280h/{}-{}_{}.jpg'.format(r['keywords'][0], r['keywords'][1], r['img_id']))
    page += 1
Prints:
...
https://cdn.stocksnap.io/img-thumbs/280h/iphone-cellphone_B962ABCAC7.jpg
https://cdn.stocksnap.io/img-thumbs/280h/building-man_D49A8BB4AE.jpg
https://cdn.stocksnap.io/img-thumbs/280h/technology-computer_C9B37875B9.jpg
https://cdn.stocksnap.io/img-thumbs/280h/iphone-cellphone_381F0FD1EE.jpg
https://cdn.stocksnap.io/img-thumbs/280h/work-bag_96E1A8F1CB.jpg
https://cdn.stocksnap.io/img-thumbs/280h/iphone-phone_70FE8C00C9.jpg
https://cdn.stocksnap.io/img-thumbs/280h/iphone-mockup_9FCDF4E1F5.jpg
https://cdn.stocksnap.io/img-thumbs/280h/young-girl_BE8BA006E6.jpg
https://cdn.stocksnap.io/img-thumbs/280h/young-girl_7174B21D56.jpg
https://cdn.stocksnap.io/img-thumbs/280h/man-woman_6XELVX8KAN.jpg
https://cdn.stocksnap.io/img-thumbs/280h/nexus-smartphones_UAXILBRNUL.jpg
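If you also want to save the images to disk, a short follow-up sketch; the URL here is just one of the thumbnails printed above:

import os
import requests

img_url = 'https://cdn.stocksnap.io/img-thumbs/280h/iphone-cellphone_B962ABCAC7.jpg'
resp = requests.get(img_url)
resp.raise_for_status()

# Derive the local filename from the last path segment of the URL
with open(os.path.basename(img_url), 'wb') as f:
    f.write(resp.content)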

site data not populated as browser, despite rendering with html-requests

I am experimenting with requests-html on various sites, and I am having trouble extracting the price of a stock on this particular site:
https://www.morningstar.com/stocks/xnys/BABA/quote
I am using requests-html and calling html.render() to render the JavaScript.
Despite this, the data doesn't seem to be populated as it is in the browser.
from requests_html import HTMLSession
import requests_html
from bs4 import BeautifulSoup as bs

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
requests_html.DEFAULT_USER_AGENT = user_agent

def get_request(url):
    session = HTMLSession()
    print(url)
    res = session.get(url)
    try:
        res.raise_for_status()
    except Exception:
        raise ValueError('Dead link')
    return res

def mstar():
    url = 'https://www.morningstar.com/stocks/xnys/BABA/quote'
    res = get_request(url)
    res.html.render()
    price = res.html.find('div#message-box-price.message-partial.fill.up')[0].text
    print(price)
    price = res.html.find('div.message-partial.fill.up')[0].text
    print(price)
    change = res.html.find('div#message-box-percentage')[0].text
    print(change)
The expected outcome is this data:
262.20
4.26 | 1.65%
However, I am just getting back symbols (- or %) but no actual prices.
Any suggestions?
Thank you.
The data is generated by a JSON API and then dynamically inserted into the website via JavaScript, hence python requests cannot see it. You can verify it yourself by doing a curl https://www.morningstar.com/stocks/xnys/baba/quote and trying to find the 1.65% in it -- it is not there, simply because it is not in the HTML source.
I would suggest using selenium instead, parsing the data along these lines:
from selenium.webdriver.common.by import By

elements = driver.find_elements(By.CSS_SELECTOR, "div.message-partial.fill.up")
for element in elements:
    print(element.text)
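Since the price is inserted after page load, an explicit wait is safer than reading the element immediately. A minimal sketch, assuming the div#message-box-price selector from the question is still correct:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.morningstar.com/stocks/xnys/BABA/quote')

# Wait up to 10 seconds for the price element to become visible
price = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'div#message-box-price'))
)
print(price.text)
driver.quit()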

How do I send an embed message that contains multiple links parsed from a website to a webhook?

I want my embed message to look like this, but mine only returns one link.
Here's my code:
import requests
from bs4 import BeautifulSoup
from discord_webhook import DiscordWebhook, DiscordEmbed

url = 'https://www.solebox.com/Footwear/Basketball/Lebron-X-JE-Icon-QS-variant.html'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links = "https://www.solebox.com/{0}".format(aid)

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.set_author(name='Brand')
embed.set_thumbnail(url="Image")
embed.set_footer(text='Footer')
embed.set_timestamp()
embed.add_embed_field(name='Sizes', value='US{0}'.format(size))
embed.add_embed_field(name='Links', value='[Links]({0})'.format(product_links))
webhook.add_embed(embed)
webhook.execute()
This will most likely get you the results you want. product_links is a string, meaning that every iteration of your for loop just overwrites the product_links variable with a new string. If you declare a list before the loop and append each link to it, you will most likely get what you wanted.
Note: I had to use a different URL from that site. The one specified in the question was no longer available. I also had to use a different User-Agent header, as the one the asker put up continuously fed me a 403 error.
Additional note: the URLs that your code logic returns lead nowhere. I feel that you'll need to work that one through, since I don't know exactly what you're trying to do; however, I feel that this answers the question of why you were only getting one link.
import requests
from bs4 import BeautifulSoup

url = 'https://www.solebox.com/Footwear/Basketball/Air-Force-1-07-PRM-variant-2.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}

r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

product_links = []  # Collect every link instead of overwriting one string
for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links.append("https://www.solebox.com/{0}".format(aid))
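To wire the collected list back into the embed, one option is to join the links into a single field value. A sketch, reusing the placeholder WebhookURL and embed setup from the question:

from discord_webhook import DiscordWebhook, DiscordEmbed

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')

# One field listing every collected link, one per line
embed.add_embed_field(name='Links', value='\n'.join(product_links))

webhook.add_embed(embed)
webhook.execute()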
