Using web scraping to check if an item is in stock - python

I am creating a Python program that uses web scraping to check if an item is in stock. The code is a Python 3.9 script, using Beautiful Soup 4 and requests to scrape for the item's availability. I would eventually like to make the program search multiple websites and multiple links within each site so I don't have to have a bunch of scripts running at once. The expected result of the program is this:
200
0
In Stock
But I am getting:
200
[]
Out Of Stock
The '200' is the HTTP status code showing the script can reach the server; 200 is the expected result. The '0' is a boolean-style flag for whether the item is in stock, and the expected response is '0' for In Stock. I have given it both in-stock and out-of-stock items, and they both give the same response of 200 [] Out Of Stock. I have a feeling something is wrong with out_of_stock_divs inside check_item_in_stock, because that's where I am getting the [] result when it looks up the item's availability.
I had the code working correctly yesterday, but I kept adding features (like scraping multiple links and different websites), which broke it, and I can't get it back to a working state.
Here's the program code. (I based this code on Arya Boudaie's code from his website, https://aryaboudaie.com/. I removed his text notifications because I plan to have this running on a spare computer next to me and have it play a loud sound instead, which will be implemented later.)
from bs4 import BeautifulSoup
import requests
import time

def get_page_html(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content

def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("text", {"class": "product-inventory"})
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0

def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")

while True:
    check_inventory()
    time.sleep(60)

The product inventory status is located inside a <div> tag, not a <text> tag:
import requests
from bs4 import BeautifulSoup

def get_page_html(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content

def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("div", {"class": "product-inventory"})  # <--- changed "text" to "div"
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0

def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")

check_inventory()
Prints:
200
[<div class="product-inventory"><strong>In stock.</strong></div>]
In stock
Note: the site's HTML markup has probably changed in the past (and may change again), so I'd modify the check_item_in_stock function:
def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_div = soup.find("div", {"class": "product-inventory"})
    return out_of_stock_div.text == "In stock."
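Since the question mentions eventually checking multiple links and multiple sites, here is a minimal hedged sketch tying the above together: it guards against find() returning None when the markup changes, and loops over a list of URLs. The URL list, the exact "In stock." text, and the 60-second interval are assumptions carried over from the snippets above, not verified against the live site.

from bs4 import BeautifulSoup
import requests
import time

# Hypothetical watch list; replace with the product pages you actually track.
URLS = [
    "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974",
]

def get_page_html(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    return requests.get(url, headers=headers).content

def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, "html.parser")
    inventory_div = soup.find("div", {"class": "product-inventory"})
    # find() returns None if the class disappears, so guard before reading .text
    return inventory_div is not None and inventory_div.text.strip() == "In stock."

def check_inventory():
    for url in URLS:
        status = "In stock" if check_item_in_stock(get_page_html(url)) else "Out of stock"
        print(url, "->", status)

while True:
    check_inventory()
    time.sleep(60)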

You can probably do the legwork in a very readable and slightly more elegant manner with the lxml library:
import config  # your own module holding upstream_url and user_agent
import requests
from lxml import html

def in_stock(url: str = config.upstream_url) -> tuple:
    """Check the website for stock status."""
    page = requests.get(url, headers={'User-agent': config.user_agent})
    proc_html = html.fromstring(page.text)
    checkout_button = proc_html.get_element_by_id('addToCart')
    return (page.status_code, 'disabled' not in checkout_button.attrib['class'])
I would suggest using XPath to identify the element on the page that you want to examine. This makes it easy to change in the event of an upstream website update (outside of your control), since you only need to adjust the XPath string to reflect the upstream change:
# change me, if upstream web content changes
xpath_selector = r'''//button[@id='addToCart']'''
checkout_button = proc_html.xpath(xpath_selector)[0]
On a side note, stylistically, some purists would suggest avoiding side effects when writing functions (e.g. calling print() inside a function). Instead, you can return a tuple with the status code and the result; being able to return multiple values like this is a really nice feature of Python.
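A brief, hedged usage sketch of that idea: the caller decides what to do with the returned tuple, and in_stock() itself stays free of print() calls (the 60-second polling interval is just an assumption carried over from the question).

import time

# Hedged usage sketch: all output happens in the caller, not in in_stock().
while True:
    status_code, available = in_stock()
    print(f"HTTP {status_code}: {'In stock' if available else 'Out of stock'}")
    time.sleep(60)  # polling interval assumed from the question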

Maybe you know this already, but Git is your friend. Whenever you make a change, commit it and push it to GitHub or wherever you choose to host it. Others can clone it, so the code you wrote is retrievable from multiple places if it is cloned more than once.

Related

Python, Scraping BS4

There are a lot of posts about this subject, but I still can't achieve what I want, so here is my problem:
I am trying to extract a stock price from this site:
https://bors.e24.no/#!/instrument/NHY.OSE
and I would like to extract the price 57,12 from the inspected HTML:
<div class="number LAST" data-reactid=".g.1.2.0">
57,12</div>
Here is the code I tried, which generates AttributeError: 'NoneType' object has no attribute 'text'.
I also tried removing .text from the PRICE line, and the result is 'Price is: None'.
from bs4 import BeautifulSoup
import requests
url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
PRICE= soup.find('div', class_= "number LAST").text
print('Price is:',(PRICE))
Try this:
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
}
api_url = "https://bors.e24.no/server/components?columns=ITEM, LAST, BID, ASK, CHANGE, CHANGE_PCT, TURNOVER, LONG_NAME&itemSector=NHY.OSE&type=table"
data = requests.get(api_url, headers=headers).json()
print(data["rows"][0]["values"]["LAST"])
Output:
56.92
This happens because your
requests.get(url)
will not get all of the information on the page, including the price you are looking for, because the webpage loads some parts first and only then fetches more data. Because of that, trying to select the div with class "number LAST"
PRICE = soup.find('div', class_="number LAST").text
will throw an error, because that element doesn't exist yet at the time of the request.
There are some ways to fix this problem:
You can try to use libraries like Selenium, which is often recommended for scraping more dynamic pages that rely on JavaScript and API calls to load content.
You can open your developer tools and inspect the Network tab, where you might find the request that fetches the price you are trying to scrape.
I believe that in your case, after taking a look at the Network tab myself, the right URL to request could be 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history', which seems to return a dictionary with the price you are looking for.
import requests
url = 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history'
page = requests.get(url)
print(page.json()["rows"][0]["values"]["PRICE"])
If you are looking to scrape various links, you will need a way to dynamically change the previous link to one matching the others you want to crawl, which I guess means changing "NHY" and "ose" to values matching the other stocks you are looking for.
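As a hedged sketch of that idea, the ticker and exchange could become parameters and the query string could be built with requests' params argument. Whether the same filter/source format works for instruments outside the Oslo exchange is an assumption, not something verified here.

import requests

def last_price(ticker: str, exchange: str = "ose") -> float:
    # Hypothetical helper: rebuilds the history endpoint from the answer above
    # for an arbitrary ticker; requests handles the URL encoding.
    params = {
        "columns": "TIME, PRICE, VOLUME, BUYER, SELLER, ID",
        "filter": f"ITEM==s{ticker}",
        "limit": 5,
        "source": f"feed.{exchange}.trades.EQUITIES+PCC",
        "type": "history",
    }
    page = requests.get("https://bors.e24.no/server/components", params=params)
    return page.json()["rows"][0]["values"]["PRICE"]

print(last_price("NHY"))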

scraper returning empty when trying to scrape in beautiful soup

Hi, I want to scrape domain names and their prices, but it's returning an empty list and I don't know why.
from bs4 import BeautifulSoup
import requests
url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)
Try the following approach to get domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in page=1, which is located within link (a paging sketch follows the sample output below).
import requests
from bs4 import BeautifulSoup

link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'

payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url, params=payload)
    for key, val in resp.json()['triggers'].items():
        if not key.startswith('domain'):
            continue
        container = val['extraUrlParams']
        print(container['pr1nm'], container['pr1pr'])
Output is like this (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195
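To extend this to several listing pages, a hedged sketch could wrap the same logic in a small function and loop over page numbers; whether every page exposes the same amp-iframe is an assumption.

import requests
from bs4 import BeautifulSoup

def scrape_page(page_no):
    # Reuses the amp-iframe trick from the answer above for one listing page.
    link = f'https://www.brandbucket.com/styles/6-letter-domain-names?page={page_no}'
    url = 'https://www.brandbucket.com/amp/ga'
    payload = {'__amp_source_origin': 'https://www.brandbucket.com'}
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
        soup = BeautifulSoup(s.get(link).text, 'html.parser')
        payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
        for key, val in s.get(url, params=payload).json()['triggers'].items():
            if key.startswith('domain'):
                container = val['extraUrlParams']
                print(container['pr1nm'], container['pr1pr'])

for page_no in range(1, 4):  # first three pages
    scrape_page(page_no)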
Check the status code of the response. When I tested, the web server returned 403, and because of that there is no "domainCardDetail" div in the response.
The reason for this is that the website is protected by Cloudflare.
There are some advanced ways to bypass this.
The following solution is very simple if you do not need to scrape at scale. Otherwise, you may want to use "cloudscraper", "Selenium", or another method to get past the JavaScript challenge on the website (a hedged cloudscraper sketch follows the steps below).
Open the developer console
Go to "Network". Make sure ticks are clicked as below picture.
https://i.stack.imgur.com/v0KTv.png
Refresh the page.
Copy JSON result and parse it in Python
https://i.stack.imgur.com/odX5S.png
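If the manual copy-and-parse step becomes tedious, a minimal hedged sketch with the cloudscraper library (assuming the basic Cloudflare JavaScript challenge is the only obstacle) could look like this:

import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper behaves like requests but tries to solve the standard
# Cloudflare challenge before returning the response.
scraper = cloudscraper.create_scraper()
response = scraper.get('https://www.brandbucket.com/styles/6-letter-domain-names?page=1')
print(response.status_code)

soup = BeautifulSoup(response.text, 'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)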

How do I send an embed message that contains multiple links parsed from a website to a webhook?

I want my embed message to look like this, but mine only returns one link.
Here's my code:
import requests
from bs4 import BeautifulSoup
from discord_webhook import DiscordWebhook, DiscordEmbed

url = 'https://www.solebox.com/Footwear/Basketball/Lebron-X-JE-Icon-QS-variant.html'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

for tag in soup.find_all('a', class_="selectSize"):
    # There are multiple 'id's, resulting in more than one link
    aid = tag.get('id')
    # There are also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links = "https://www.solebox.com/{0}".format(aid)

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.set_author(name='Brand')
embed.set_thumbnail(url="Image")
embed.set_footer(text='Footer')
embed.set_timestamp()
embed.add_embed_field(name='Sizes', value='US{0}'.format(size))
embed.add_embed_field(name='Links', value='[Links]({0})'.format(product_links))
webhook.add_embed(embed)
webhook.execute()
This will most likely get you the results you want. product_links is a string, so every iteration of your for loop just overwrites the product_links variable with a new string. If you declare a list before the loop and append each link to it, you will most likely get what you wanted.
Note: I had to use a different URL from that site, as the one specified in the question was no longer available. I also had to use a different header, because the one the asker posted continuously fed me a 403 error.
Additional note: the URLs returned by your code logic lead nowhere. You'll need to work that out yourself, since I don't know exactly what you're trying to do; however, I feel this answers the question of why you were only getting one link.
import requests
from bs4 import BeautifulSoup

url = 'https://www.solebox.com/Footwear/Basketball/Air-Force-1-07-PRM-variant-2.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

product_links = []  # Create our list of product links
for tag in soup.find_all('a', class_="selectSize"):
    # There are multiple 'id's, resulting in more than one link
    aid = tag.get('id')
    # There are also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links.append("https://www.solebox.com/{0}".format(aid))
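With product_links collected as a list, every link can then go into a single embed field instead of only the last one. A hedged sketch follows; the link values and webhook URL below are hypothetical stand-ins for what the loop above appends.

from discord_webhook import DiscordWebhook, DiscordEmbed

# Hypothetical stand-in values for what the loop above would collect.
product_links = ["https://www.solebox.com/id-one", "https://www.solebox.com/id-two"]

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.add_embed_field(name='Links', value='\n'.join(f'[Link]({link})' for link in product_links))
webhook.add_embed(embed)
webhook.execute()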

This code should return the product title, but instead I am getting "None" in return

I am trying to make a price tracker for Amazon by following a YouTube tutorial. I am new to Python and web scraping. I wrote this code and it should return the product name, but instead it's giving me "None" as output. Can you please help me with this?
I tried different URLs, but it's still not working.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/57.36 (HTML, like Gecko) Chrome/75.0.30.100 Safari/537.4'}
page = requests.get(URL, headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
I was inspecting the returned HTML and realized that Amazon sends (somewhat malformed?) HTML that trips the default html.parser, but using lxml I was able to scrape the title just fine.
import requests
from bs4 import BeautifulSoup

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'
    })
    res.raise_for_status()
    return BeautifulSoup(res.text, 'lxml')

def parse_product_page(soup: BeautifulSoup) -> dict:
    title = soup.select_one('#productTitle').text.strip()
    return {
        'title': title
    }

if __name__ == "__main__":
    url = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'
    soup = make_soup(url)
    info = parse_product_page(soup)
    print(info)
output:
{'title': "Nike Men's Zoom Rival M 9 Track and Field Shoes"}
You can make your locator more specific using .select(). You need to change the parser as well.
Try this instead:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/'
page = requests.get(URL,headers={"User-Agent":'Mozilla/5.0'})
soup = BeautifulSoup(page.text, 'lxml')  # make sure you use the "lxml" or "html5lib" parser instead of "html.parser"
title = soup.select_one("h1 > #productTitle").get_text(strip=True)
print(title)
Output:
Nike Men's Zoom Rival M 9 Track and Field Shoes
Bot detection is pretty pervasive these days. No major site with any data worth mining, especially retail, is going to let you use requests on their site.
You're going to have to at the very least use Selenium / ChromeDriver to get a response from any reputable site. Even then, if they use something like Distil for bot detection, they will stop even Selenium.
Try a less popular site with Selenium, and you will get data back.
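For completeness, a minimal hedged Selenium sketch (assuming Chrome and a matching chromedriver are available, and that Amazon still renders a #productTitle element) might look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

# A real browser session is less likely to be served a bot-detection page
# than a bare requests call.
driver = webdriver.Chrome()
try:
    driver.get('https://www.amazon.com/Nike-Rival-Track-Field-Shoes/dp/B07HYNB7VV/')
    title = driver.find_element(By.ID, 'productTitle').text.strip()
    print(title)
finally:
    driver.quit()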

soup.select('.r a') in f'https://google.com/search?q={query}' brings back empty list in Python BeautifulSoup. **NOT A DUPLICATE**

The Situation:
The "I'm Feeling Lucky!" project in the "Automate the boring stuff with Python" ebook no longer works with the code he provided.
Specifically:
linkElems = soup.select('.r a')
What I have done:
I've already tried using the solution provided within this stackoverflow question
I'm also currently using the same search format.
Code:
import webbrowser, requests, bs4

def im_feeling_lucky():
    # Make search query look like Google's
    search = '+'.join(input('Search Google: ').split(" "))

    # Pull html from Google
    print('Googling...')  # display text while downloading the Google page
    res = requests.get(f'https://google.com/search?q={search}&oq={search}')
    res.raise_for_status()

    # Retrieve top search result link
    soup = bs4.BeautifulSoup(res.text, features='lxml')

    # Open a browser tab for each result.
    linkElems = soup.select('.r')  # Returns empty list
    numOpen = min(5, len(linkElems))
    print('Before for loop')
    for i in range(numOpen):
        webbrowser.open(f'http://google.com{linkElems[i].get("href")}')
The Problem:
The linkElems variable returns an empty list [] and the program doesn't do anything past that.
The Question:
Could somebody please guide me to the correct way of handling this and perhaps explain why it isn't working?
I had the same problem while reading that book and found a solution for it.
Replacing
soup.select('.r a')
with
soup.select('div#main > div > div > div > a')
will solve that issue.
The following is the code that will work:
import webbrowser, requests, bs4, sys

print('Googling...')
res = requests.get('https://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('div#main > div > div > div > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get("href"))
The above code takes its input from command-line arguments.
I took a different route. I saved the HTML from the request and opened that page, then inspected the elements. It turns out that the page is different when I open it natively in the Chrome browser compared to what my Python request is served. I identified the div with the class that appears to denote a result and substituted it for the .r - in my case it was .kCrYT.
#! python3
# lucky.py - Opens several Google Search results.
import requests, sys, webbrowser, bs4

print('Googling...')  # display text while the google page is downloading
url = 'http://www.google.com.au/search?q=' + ' '.join(sys.argv[1:])
url = url.replace(' ', '+')
res = requests.get(url)
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Get all of the 'a' tags after an element with the class 'kCrYT' (which are the results)
linkElems = soup.select('.kCrYT > a')

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open_new_tab('http://google.com.au' + linkElems[i].get('href'))
Different websites (for instance, Google) serve different HTML to different User-Agents (the User-Agent is how the web browser identifies itself to the website). Another solution to your problem is to use a browser User-Agent to ensure that the HTML you obtain from the website is the same as what you would get by using "view page source" in your browser. The following code just prints the list of Google search result URLs; it's not the same as the book you've referenced, but it's still useful to show the point.
#! python3
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4

print('Please enter your search term:')
searchTerm = input()
print('Googling...')  # display text while downloading the Google page
url = 'http://google.com/search?q=' + '+'.join(searchTerm.split())
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
res = requests.get(url, headers=headers)
res.raise_for_status()

# Retrieve top search results links.
soup = bs4.BeautifulSoup(res.content, 'html.parser')

# Open a browser tab for each result.
linkElems = soup.select('.r > a')  # Used '.r > a' instead of '.r a' because
numOpen = min(5, len(linkElems))   # there are many href after div class="r"
for i in range(numOpen):
    # webbrowser.open('http://google.com' + linkElems[i].get('href'))
    print(linkElems[i].get('href'))
There's actually no need to save the HTML file. One of the reasons the response output is different from the one you see in the browser is that no headers are being sent with the request; in this case a user-agent, which makes the request look like a "real" user visit (as already written by Cucurucho).
When no user-agent is specified (when using the requests library), it defaults to python-requests; Google recognizes this, blocks the request, and you receive different HTML with different CSS selectors. Check what your user-agent is.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
To grab CSS selectors more easily, have a look at the SelectorGadget extension, which lets you get a CSS selector by clicking on the desired element in your browser.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'how to create minecraft server',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# [:5] - first 5 results
# container with needed data: title, link, snippet, etc.
for result in soup.select('.tF2Cxc')[:5]:
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time figuring out how to bypass blocks from Google or which CSS selector is right for parsing the data; instead, you just pass the parameters (params) you want, iterate over the structured JSON, and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "how to create minecraft server",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Disclaimer: I work for SerpApi.
