Python Web Scraping Product Price - python

I'm trying to web scrape the product price on this site: https://www.webhallen.com/se/product/232445-Logitech-C920-HD-Pro-Webcam
I tried using
price = str(soup.find('div', {"class": "add-product-to-cart"}))
and
price = soup.find(id="add-product-to-cart").get_text()
But unfortunately, I had no luck. The item returns no price. The price/text is stored in a span class.

The entire website is rendered by JavaScript, so you won't fetch anything with bs4. However, there's an API endpoint with all the data you need.
Here's how to get it:
import requests

with requests.Session() as session:
    response = session.get("https://www.webhallen.com/api/product/232445").json()
    print(response["product"]["price"]["price"])
Output:
1190.00
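That nested lookup will raise a KeyError if the API shape ever changes, so it can be worth guarding it. A minimal sketch, assuming the response keeps the shape shown above (the sample dict here is hand-written, not a live response):

```python
# Hand-written sample mimicking the assumed shape of the Webhallen API response
sample = {"product": {"price": {"price": "1190.00"}}}

def extract_price(data):
    """Walk the nested dict safely, returning None if any key is missing or malformed."""
    try:
        return float(data["product"]["price"]["price"])
    except (KeyError, TypeError, ValueError):
        return None

print(extract_price(sample))           # 1190.0
print(extract_price({"product": {}}))  # None
```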

API - Web Scrape

How to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, however I can't seem to get it right because I'm running into a 403 error.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to extract them.
Later I'll use these categories to iterate over products API.
API Category
P.S.: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
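If you still get a 403, it can help to inspect exactly what requests is about to send. This sketch prepares the request without sending it, so no network access is needed (same endpoint and headers as above):

```python
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
}
params = {'id_loja': '2691'}

# Build and prepare the request without sending it, to inspect the final URL and headers
req = requests.Request('GET', 'https://www.nagumo.com.br/api/b2c/page/menu',
                       params=params, headers=headers)
prepared = req.prepare()
print(prepared.url)  # https://www.nagumo.com.br/api/b2c/page/menu?id_loja=2691
print(prepared.headers['sm-token'])
```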
Not sure exactly what your issue is here, but if you want to see the content of the response and not just the 200/400 status, you need to add '.content' to your print.
E.g.
import requests

# Create session
s = requests.Session()

# Example connection variables, probably not required for your use case
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # Print the response status (e.g. 200)
#print(p.headers)
#print(p.content)   # Print the content of the response
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, just continue what you were doing to actively get your data.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can get your data from it based on CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of elements matching that 'selector', so a simple for loop appending each item to a list would suffice (for example, getting the links for each book on a bookstore site).
For example, if I was trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all items with this selector
for link in links:
    sites.append(link)
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect'), you can see the code for the page. Go to the HTML, find the data you want, then right-click it and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
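As a small illustration of that copy-selector workflow, here is a sketch run against a hand-written HTML snippet (the markup and the selector are invented for the example, not taken from any of the sites above):

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a page you inspected in the browser
html = """
<div class="books">
  <a class="book-link" href="/book/1">First Book</a>
  <a class="book-link" href="/book/2">Second Book</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A selector like this is roughly what "Copy -> Copy selector" produces
links = [a["href"] for a in soup.select("div.books > a.book-link")]
print(links)  # ['/book/1', '/book/2']
```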
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/

Web Scraping Stock Ticker Price from Yahoo Finance using BeautifulSoup

I'm trying to scrape Gold stock ticker from Yahoo! Finance.
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://finance.yahoo.com/quote/GC=F?p=GC=F')
soup = BeautifulSoup(response.text, 'lxml')
gold_price = soup.findAll("div", class_='My(6px) Pos(r) smartphone_Mt(6px)')[2].find_all('p').text
Whenever I run this it returns: list index out of range.
When I do print(len(soup)) it returns 4.
Any ideas?
Thank you.
You can make a direct request to the yahoo server. To locate the query URL you need to open Network tab via Dev tools (F12) -> Fetch/XHR -> find name: spark?symbols= (refresh page if you don't see any), find the needed symbol, and see the response (preview tab) on the newly opened tab on the right.
You can make direct requests to all of these links if the request method is GET, since POST methods are much more complicated.
You need the json and requests libraries; no need for bs4. Note that making a lot of such requests might get your IP blocked (or rate-limited), or you might stop getting responses because their system detects a bot, since a regular user won't make such requests to the server repeatedly. So you need to figure out how to work around that.
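One common way to stay under such limits is to space requests out and retry with a growing delay. A generic sketch of that idea (the flaky function below is a stand-in for a rate-limited request, not anything Yahoo-specific):

```python
import time

def fetch_with_backoff(fetch, retries=3, base_delay=1.0):
    """Call fetch(); on failure wait base_delay, 2*base_delay, ... then retry."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for a request that fails twice (e.g. rate limited) and then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = fetch_with_backoff(flaky, base_delay=0.01)
print(result)  # ok
```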
Update:
There's possibly a hard limit on how many requests can be made in an X period of time.
Code and example in the online IDE (contains full JSON response):
import requests, json
response = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=GC%3DF&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance').text
data_1 = json.loads(response)
gold_price = data_1['spark']['result'][0]['response'][0]['meta']['previousClose']
print(gold_price)
# 1830.8
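The path into that JSON is deep, so it is easy to mis-index. The sample below is a hand-trimmed stand-in with the same nesting as the real spark response, just to show the access pattern offline:

```python
import json

# Hand-trimmed stand-in mirroring the nesting of Yahoo's spark response
raw = '{"spark": {"result": [{"response": [{"meta": {"previousClose": 1830.8}}]}]}}'

data = json.loads(raw)
previous_close = data['spark']['result'][0]['response'][0]['meta']['previousClose']
print(previous_close)  # 1830.8
```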
P.S. I have a blog post about scraping the Yahoo! Finance home page, which is somewhat relevant.

get number of followers of twitter account by scraping twitter

I tried to get the number of followers of a given Twitter account by scraping twitter. I tried scraping with BeautifulSoup and XPath. But none of the code is working.
This is some of my sample testing code for it,
from bs4 import BeautifulSoup
import requests

url = "https://twitter.com/BarackObama"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
div_tag = soup.find_all('main', {"class": "css-1dbjc4n r-1habvwh r-16xksha r-1wbh5a2"})
When I try to see the content I scraped using the code below,
import requests
t=requests.get('https://twitter.com/BarackObama')
print(t.content)
It doesn't include any of the data, like the count of followers or anything.
Please help me to do this.
When your code parses the Twitter URL, it only gets the initial page: the page itself loads, but the values and other important data (including the follower count) are filled in afterwards, so they aren't in the HTML you fetched. There is a Twitter Python API instead, where you can get followers with api.GetFollowers().
The relevant API endpoint is followers/ids. Using TwitterAPI you can do the following:
from TwitterAPI import TwitterAPI, TwitterPager

api = TwitterAPI(YOUR_CONSUMER_KEY,
                 YOUR_CONSUMER_SECRET,
                 YOUR_ACCESS_TOKEN_KEY,
                 YOUR_ACCESS_TOKEN_SECRET)
count = 0
r = TwitterPager(api, 'followers/ids')
for item in r.get_iterator():
    count = count + 1
print(count)

how to pull the shipping price from banggood.com using beautifulsoup

I'm trying to get the shipping price from this link:
https://www.banggood.com/Xiaomi-Mi-Air-Laptop-2019-13_3-inch-Intel-Core-i7-8550U-8GB-RAM-512GB-PCle-SSD-Win-10-NVIDIA-GeForce-MX250-Fingerprint-Sensor-Notebook-p-1535887.html?rmmds=search&cur_warehouse=CN
But it seems that the "strong" is empty.
I've tried a few solutions, but all of them gave me an empty "strong".
I'm using BeautifulSoup in Python 3.
For example, this code led me to an empty "strong":
import requests
from bs4 import BeautifulSoup

client = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(client.content, 'lxml')
for child in soup.find("span", class_="free_ship").children:
    print(child)
The issue is that 'Free Shipping' is generated by JavaScript after the page loads, rather than being sent in the webpage itself.
The page might obtain the shipping price by performing an HTTP request after it has loaded, or the price may be hidden within the page.
You might be able to find the XHR request that pulls the shipping price using DevTools in Firefox or Chrome (the 'Network' tab), and use that to get the price.
Using the XHR, you can find that data:
import requests

url = 'https://m.banggood.com/ajax/product/dynamicPro/index.html'
payload = {
    'c': 'api',
    'sq': 'IY38TmCNgDhATYCmIDGxYisATHA7ANn2HwX2RNwEYrcAGAVgDNxawIQFhLpFhkOCuZFFxA'}
response = requests.get(url, params=payload).json()
data = response['result']
shipping = data['shipment']
for each in shipping.items():
    print(each)
print(shipping['shipCost'])
Output:
<b>Free Shipping</b>
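The shipment data comes back as a plain dict, and iterating .items() yields (key, value) pairs. A stand-in sketch with a hand-written dict (the 'shipMethod' key is invented for illustration; only 'shipCost' appears in the output above):

```python
# Hand-written stand-in for the API's shipment dict
# ('shipMethod' is an invented key; 'shipCost' matches the output above)
shipping = {'shipCost': '<b>Free Shipping</b>', 'shipMethod': 'Air Parcel'}

for key, value in shipping.items():
    print(key, value)

# .get avoids a KeyError if a key is ever absent from the response
cost = shipping.get('shipCost', 'unknown')
print(cost)  # <b>Free Shipping</b>
```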

Python Beautifulsoup - Scrape elements from "inspect"

I am trying to scrape some data from stockrow.com using BeautifulSoup.
However, there seem to be some differences between Inspect and View Source (I'm using Chrome, but I don't see that being a problem for Python).
This is resulting in some trouble, as the source code itself does not show any HTML tags such as h1. They do, however, show up when I use the Inspect tool.
The part I am trying to scrape (among other things), as shown using the Inspect tool:
<h1>Teva Pharmaceutical Industries Ltd<small>(TEVA)</small></h1>
My current code, printing an empty list:
import bs4 as bs
import urllib.request

class Stock:
    stockrow_url = "https://stockrow.com"
    url_suffix = "/financials/{}/annual"

    def __init__(self, ticker: str, stock_url=stockrow_url, url_suffix=url_suffix):
        # Stock ticker
        self.ticker = ticker.upper()
        # URLs for financial statements related to the ticker
        self.stock_url = stock_url + "/{}".format(self.ticker)
        sauce = urllib.request.urlopen(self.stock_url).read()
        soup = bs.BeautifulSoup(sauce, 'html.parser').h1
        print(soup)
        self.income_url = self.stock_url + url_suffix.format("income")
        self.balance_sheet_url = self.stock_url + url_suffix.format("balance")
        self.cash_flow_url = self.stock_url + url_suffix.format("cashflow")

teva = Stock("teva")
print(teva.get_income_statement())
The page is dynamically generated using JavaScript and cannot be handled by BeautifulSoup. You can capture the information using either Selenium and the like, or by looking for API calls.
In this case, you can get for TEVA, background information using
import json
import requests
hdr = {'User-Agent':'Mozilla/5.0'}
url = "https://stockrow.com/api/companies/TEVA.json?ticker=TEVA"
response = requests.get(url, headers=hdr)
info = json.loads(response.text)
info
Similarly, the income statement is hiding here:
url = 'https://stockrow.com/api/companies/TEVA/financials.json?ticker=TEVA&dimension=MRY&section=Income+Statement'
Using the same code as above but with this other URL will get you your income statement, in JSON format.
And you can take it from there. Search around - there is a lot of information available on this topic. Good luck.
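If you want to fetch other sections or tickers, the query string can be assembled from its parts instead of hard-coded. A sketch (the parameter names are taken from the URL above; whether other dimension or section values are accepted is untested):

```python
from urllib.parse import urlencode

def financials_url(ticker, section, dimension='MRY'):
    """Build a stockrow financials URL for the given ticker and statement section."""
    base = 'https://stockrow.com/api/companies/{}/financials.json'.format(ticker)
    query = urlencode({'ticker': ticker, 'dimension': dimension, 'section': section})
    return base + '?' + query

url = financials_url('TEVA', 'Income Statement')
print(url)
```

Note that urlencode takes care of escaping, turning the space in 'Income Statement' into 'Income+Statement' as seen in the URL above.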