Python, Scraping BS4

There are a lot of posts about this subject, but I still can't achieve what I want, so here is my problem:
I am trying to extract a stock price from this site:
https://bors.e24.no/#!/instrument/NHY.OSE
and I would like to extract the price, 57,12, from this element shown in the browser inspector:
<div class="number LAST" data-reactid=".g.1.2.0">
57,12</div>
Here is the code I tried. It raises AttributeError: 'NoneType' object has no attribute 'text'.
I also tried removing .text from the PRICE line, and the result is 'Price is: None'.
from bs4 import BeautifulSoup
import requests
url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
PRICE= soup.find('div', class_= "number LAST").text
print('Price is:',(PRICE))

Try this:
import requests
headers = {
    # The header name is 'User-Agent'; a 'user_agent' key is not a header that servers recognise.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
}
api_url = "https://bors.e24.no/server/components?columns=ITEM, LAST, BID, ASK, CHANGE, CHANGE_PCT, TURNOVER, LONG_NAME&itemSector=NHY.OSE&type=table"
data = requests.get(api_url, headers=headers).json()
print(data["rows"][0]["values"]["LAST"])
Output:
56.92
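If the API returns no rows (for example for an unknown ticker), the chained lookup above raises an exception. Here is a minimal sketch of a more defensive lookup, assuming the response keeps the rows / values / LAST shape used above. The name extract_last_price and the sample payload are hypothetical and only there so the sketch runs offline:

```python
from typing import Any, Mapping, Optional


def extract_last_price(payload: Mapping[str, Any]) -> Optional[float]:
    """Return the LAST value of the first row, or None if it is missing."""
    rows = payload.get("rows") or []
    if not rows:
        return None
    values = rows[0].get("values") or {}
    last = values.get("LAST")
    return float(last) if last is not None else None


# Hypothetical sample shaped like the response indexed above.
sample = {"rows": [{"values": {"ITEM": "NHY", "LAST": 56.92}}]}

print(extract_last_price(sample))        # 56.92
print(extract_last_price({"rows": []}))  # None
```

With a real response you would call extract_last_price(data) instead of indexing data directly.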

This happens because
requests.get(url)
does not get all the information in the page, including the price you are looking for. The page first loads a basic shell and only then fetches more data with JavaScript. Also, everything after the # in the URL is a fragment, which is never sent to the server. Because of that, trying to select the div with class "number LAST"
PRICE= soup.find('div', class_= "number LAST").text
throws an error, because that element does not exist in the HTML that requests received.
There are some ways to fix this problem:
You can use a library like Selenium, which is often recommended for scraping dynamic pages that rely on JavaScript and API calls to load content.
You can open your browser's developer tools and inspect the Network tab, where you might find the request that fetches the price you are trying to scrape.
I believe that in your case, after taking a look at the Network tab myself, the right URL to request could be 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history', which seems to return JSON (a dictionary once parsed) with the price you are looking for.
import requests
url = 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history'
page = requests.get(url)
print(page.json()["rows"][0]["values"]["PRICE"])
If you want to scrape several instruments, you will need to build this URL dynamically for each one. I guess that means changing "NHY" and "ose" to values that match the other stock you are looking for.
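Building the URL per instrument could look roughly like the sketch below. It assumes that only the ticker and the exchange part of the source parameter change between instruments, which I have not verified for anything other than NHY. The function name build_trades_url and the second ticker are hypothetical:

```python
from urllib.parse import urlencode

BASE_URL = "https://bors.e24.no/server/components"


def build_trades_url(ticker: str, exchange: str = "ose", limit: int = 5) -> str:
    """Build the trade-history URL for one ticker (assumed URL pattern)."""
    params = {
        "columns": "TIME, PRICE, VOLUME, BUYER, SELLER, ID",
        "filter": f"ITEM==s{ticker}",  # the "s" prefix is copied from the observed URL
        "limit": limit,
        "source": f"feed.{exchange}.trades.EQUITIES+PCC",
        "type": "history",
    }
    # urlencode percent-encodes "==" and "+" the same way the observed URL does.
    return f"{BASE_URL}?{urlencode(params)}"


for ticker in ("NHY", "EQNR"):  # the second ticker is only an illustration
    print(build_trades_url(ticker))
```

Each URL can then be passed to requests.get(...).json() exactly as in the snippet above.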

Related

scraper returning empty when trying to scrape in beautiful soup

Hi, I want to scrape domain names and their prices, but it's returning an empty result and I don't know why.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)

Try the following approach to get the domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in page=1, which is part of link.
import requests
from bs4 import BeautifulSoup
link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'
payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url,params=payload)
    for key,val in resp.json()['triggers'].items():
        if not key.startswith('domain'):continue
        container = val['extraUrlParams']
        print(container['pr1nm'],container['pr1pr'])
Output (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195

Check the status code of the response. When I tested, the web server returned 403, and because of that the response contains no "domainCardDetail" div.
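A minimal sketch of such a check, written against any object that has status_code and text attributes (which a requests response does). The helper name html_or_raise is hypothetical, and the two stand-in responses exist only so the sketch runs without network access:

```python
from types import SimpleNamespace


def html_or_raise(response) -> str:
    """Return response.text, or raise if the server did not answer 200."""
    if response.status_code != 200:
        raise RuntimeError(
            f"Request failed with HTTP {response.status_code}; "
            "the expected elements will not be in the body."
        )
    return response.text


# Stand-ins for requests responses, so the sketch runs offline.
ok = SimpleNamespace(status_code=200, text="<div class='domainCardDetail'></div>")
blocked = SimpleNamespace(status_code=403, text="Forbidden")

print(html_or_raise(ok))
try:
    html_or_raise(blocked)
except RuntimeError as exc:
    print(exc)
```

With requests itself, calling response.raise_for_status() right after requests.get(...) gives a similar effect for 4xx and 5xx answers.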

The reason for this is that the website is protected by Cloudflare.
There are some advanced ways to bypass this.
The following solution is very simple if you do not need to scrape in bulk. Otherwise, you may want to use "cloudscraper", "Selenium" or another method that can run the JavaScript on the website.
1. Open the developer console.
2. Go to "Network". Make sure the same options are ticked as in the picture below.
https://i.stack.imgur.com/v0KTv.png
3. Refresh the page.
4. Copy the JSON result and parse it in Python.
https://i.stack.imgur.com/odX5S.png
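Parsing the copied JSON might look like the sketch below. It assumes the copied text has the same triggers / extraUrlParams / pr1nm / pr1pr shape that the other answer reads from resp.json(). The pasted excerpt, its key names "domain0" and "domain1", and the function name parse_domains are hypothetical:

```python
import json

# Hypothetical excerpt pasted from the Network tab; the real payload is larger.
copied = """
{"triggers": {
  "domain0": {"extraUrlParams": {"pr1nm": "estita.com", "pr1pr": "2035"}},
  "domain1": {"extraUrlParams": {"pr1nm": "rendro.com", "pr1pr": "1675"}},
  "pageview": {"on": "visible"}
}}
"""


def parse_domains(raw: str) -> list:
    """Return (name, price) pairs from the copied JSON text."""
    data = json.loads(raw)
    result = []
    for key, val in data.get("triggers", {}).items():
        if not key.startswith("domain"):
            continue  # skip triggers that do not describe a domain
        params = val["extraUrlParams"]
        result.append((params["pr1nm"], params["pr1pr"]))
    return result


for name, price in parse_domains(copied):
    print(name, price)
```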

Get next page on Amazon.com using python requests library

I'm trying to scrape through an amazon offer. The code block below prints the titles of all results of the first page.
import requests
from bs4 import BeautifulSoup as BS
offer_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU?ref=deals_deals_deals-grid_slot-15_8454_dt_dcell_img_14_f3724fb9#'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get(offer_url, headers=headers)
html = response.text
soup = BS(html)
links = soup.find_all('a',class_='a-size-base a-link-normal fw-line3-break a-text-normal')
for link in links:
    title = link.text.strip() # remove surrounding spaces and linebreaks etc.
    print(title)
So far, so good. Now, how do I access the second page? Clicking on the Next page button (in German it's "Weiter") at the bottom of the page does not add a page argument like ?page=2 to the URL, so I cannot use the URL to access the next page.
The questions I have are: How do I access the content of the next page similarly to how I access the first page? Is there a POST request involved and if so: How do I figure out its params/data? How would I use requests to mimic pressing the Next Page button and get the respective page results?
The offer is scheduled to last until March 21st, 2021. Until then, the link provided in the code should be valid.
Maybe it's just a few lines of code, e.g. a tweak in my request. Thanks in advance! Have a wonderful day!
Edit:
Trying to fetch the second page using the following script only yields the results of the first page.
params = {"page":2}
html = requests.post('https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU', data=params, headers=headers).text

Using web scraping to check if an item is in stock

I am creating a Python program that uses web scraping to check if an item is in stock. The code is a Python 3.9 script, using Beautiful Soup 4 and requests to scrape for the item's availability. I would eventually like to make the program search multiple websites and multiple links within each site so I don't have to have a bunch of scripts running at once. The expected result of the program is this:
200
0
In Stock
But I am getting:
200
[]
Out Of Stock
The '200' shows whether the code can reach the server; 200 is the expected result. The '0' is a flag showing whether the item is in stock; the expected response is '0' for In Stock. I have given it both in-stock and out-of-stock items, and both give the same response: 200 [] Out Of Stock. I have a feeling something is wrong with out_of_stock_divs inside check_item_in_stock, because that is where I get the [] result when it looks up the availability of the item.
I had the code working correctly yesterday, but I kept adding features (like scraping multiple links and different websites) and that broke it, and I can't get it back to a working condition.
Here's the program code. (I based this code on Mr. Arya Boudaie's code on his website, https://aryaboudaie.com/. I got rid of his text notifications, though, because I plan to have this running on a spare computer next to me and have it play a loud sound, which will be implemented later.)
from bs4 import BeautifulSoup
import requests
import time  # needed for time.sleep() in the polling loop below
def get_page_html(url):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content
def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("text", {"class": "product-inventory"})
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0
def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")
while True:
    check_inventory()
    time.sleep(60)

The product inventory status is located inside a <div> tag, not a <text> tag:
import requests
from bs4 import BeautifulSoup
def get_page_html(url):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content
def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("div", {"class": "product-inventory"}) # <--- change "text" to div
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0
def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")
check_inventory()
Prints:
200
[<div class="product-inventory"><strong>In stock.</strong></div>]
In stock
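Note: The HTML markup of that site has probably changed since the code was written. I'd modify the check_item_in_stock function so that it checks the text of the div instead of only its presence:
def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_div = soup.find("div", {"class": "product-inventory"})
    # find() returns None when the div is missing, so guard before reading .text
    return out_of_stock_div is not None and out_of_stock_div.text.strip() == "In stock."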

You can probably achieve the legwork in a very readable and slightly more elegant manner with the lxml library:
import config  # your own settings module providing upstream_url and user_agent
import requests
from lxml import html
def in_stock(url: str = config.upstream_url) -> tuple:
    """Check the website for stock status."""
    page = requests.get(url, headers={'User-agent': config.user_agent})
    proc_html = html.fromstring(page.text)
    checkout_button = proc_html.get_element_by_id('addToCart')
    # requests exposes the HTTP status as status_code
    return (page.status_code, 'disabled' not in checkout_button.attrib['class'])
I would suggest using XPath to identify the element on the page that you want to examine. This makes it easy to change in the event of an upstream website update (outside of your control), as you only need to adjust the XPath string to reflect the upstream change:
# change me, if upstream web content changes
xpath_selector = r'''//button[@id='addToCart']'''
checkout_button = proc_html.xpath(xpath_selector)[0]
On a side note, stylistically, some purists would suggest avoiding side effects when writing functions (i.e. use of print() within a function). You can return a tuple with the status code and the result. This is a really nice feature in Python.
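To illustrate the tuple idea, and the goal of checking several links from one script, here is a rough sketch. The fetching and the stock test are passed in as functions, so the sketch runs offline. The name check_urls, the example.com URLs and the fake pages are all hypothetical; a real fetch function would wrap requests.get:

```python
from typing import Callable, Dict, Iterable, Tuple


def check_urls(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    is_in_stock: Callable[[str], bool],
) -> Dict[str, Tuple[int, bool]]:
    """Return {url: (status_code, in_stock)} without printing anything."""
    results = {}
    for url in urls:
        status_code, page_html = fetch(url)
        # Only trust the page content when the request succeeded.
        in_stock = status_code == 200 and is_in_stock(page_html)
        results[url] = (status_code, in_stock)
    return results


# Offline stand-ins for real pages.
fake_pages = {
    "https://example.com/a": (200, '<div class="product-inventory">In stock.</div>'),
    "https://example.com/b": (200, '<div class="product-inventory">Out of stock.</div>'),
    "https://example.com/c": (403, ""),
}

results = check_urls(
    fake_pages,
    fetch=lambda url: fake_pages[url],
    is_in_stock=lambda page_html: "In stock." in page_html,
)
for url, (status_code, in_stock) in results.items():
    print(url, status_code, "In stock" if in_stock else "Out of stock")
```

Printing happens only in the caller, so the same function can later drive a sound alert or a notification instead.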

Maybe you know this already, but Git is your friend. Whenever you make a change, commit it and push it to GitHub or wherever you choose to store it. That way you can always go back to a version that worked, and anyone who clones the repository has a copy of the code, so it is retrievable from multiple places.

How to scrape data in h4 with beautifulsoup?

I am trying to scrape the results data from this website (https://www.ufc.com/matchup/908/7717/post) and I am completely at a loss for why my proposed solution isn't working.
The outer html that I am trying to scrape is <h4 class="e-t5 winner">Jon Jones</h4>. I don't have a lot of experience with web scraping or HTML but all of the relevant information is contained in the h4 tag.
I have been successful in extracting the data from the h2 tag but I am confused as to why the same approach doesn't work for h4. For example, to extract the relevant data from <h2 class="field--name-name name_given red">Jon Jones <span class="field--field-rank rank"></span></h2> the following code works.
from requests import get
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
raw_html = get('https://www.ufc.com/matchup/908/7717/post', headers=headers)
html = BeautifulSoup(raw_html.content)
# this works
html.find_all('h2', attrs={'class': 'field--name-name name_given red'})[0].get_text().strip()
# this does not work?
html.find_all('h4', attrs={'class': 'e-t5 winner red'})
# this code gets me to the headers but not the actual listed data inside
html.find('div', attrs={'class': 'l-flex--4col-2to4'})
I am mostly confused as to why the above doesn't work and why the text I can see when inspecting the element in my browser, doesn't appear in the scraped HTML.

It is added dynamically. You can find the source in the network tab. Assuming there is always one winner you can use something like
import requests
r = requests.get('https://dvk92099qvr17.cloudfront.net/V1/908/Fnt.json').json()
winner = [fighter['FullName'] for fighter in r['FMLiveFeed']['Fights'][0]['Fighters'] if fighter['Outcome'] == 'Win'][0]
print(winner)
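If a fight can end without a winner (a draw or a no contest, for example), the [0] at the end raises IndexError. Here is a sketch that returns None instead, assuming the JSON keeps the FMLiveFeed / Fights / Fighters / Outcome shape used above. The function name find_winner, the trimmed sample, the placeholder opponent and the "Loss" value are hypothetical:

```python
from typing import Any, Mapping, Optional


def find_winner(feed: Mapping[str, Any], fight_index: int = 0) -> Optional[str]:
    """Return the winner's full name for one fight, or None if nobody won."""
    fighters = feed["FMLiveFeed"]["Fights"][fight_index]["Fighters"]
    return next(
        (f["FullName"] for f in fighters if f.get("Outcome") == "Win"),
        None,
    )


# Hypothetical, heavily trimmed sample of the feed.
sample = {"FMLiveFeed": {"Fights": [{"Fighters": [
    {"FullName": "Jon Jones", "Outcome": "Win"},
    {"FullName": "Other Fighter", "Outcome": "Loss"},
]}]}}

print(find_winner(sample))  # Jon Jones
```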

Web Scraping - No content displayed

I am trying to fetch the stock price of a company specified by the user. I am using requests to get the source code and BeautifulSoup to scrape it. I am fetching the data from google.com. I am trying to fetch only the last stock price (806.93 in the picture). When I run my script, it prints None; none of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page looks like this :

Looks like that source is from inspecting the element, not the actual page source. A couple of suggestions: use Google Finance to get rid of some noise; https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
<div id=market-data-div class="id-market-data-div nwp g-floatfix">
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
<div class="id-price-change nwp">
<span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
<span class="chg" id="ref_694653_cp">(1.21%)</span>
</span>
</div>
</div>
You should be able to pull the number out of that.

I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
, did a right click and chose "View Page Source", but did not see the code from your screenshot.
Then I typed out a section of the code in your screenshot, created a BeautifulSoup object from it, and ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>')
test_screenshot.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the HTML you are getting is not the HTML you expect to get.
I suggest using the Google Finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search for), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find("span",{'class':'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.

You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google recognises the request as coming from a bot rather than a "real" user and blocks it. You then receive different HTML, with different selectors and elements, and some sort of error. Setting a browser-like user-agent in the HTTP request headers makes the request look like a regular user visit.
Pass the user-agent in the request headers:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0'
}
params = {
    'q': 'alphabet inc class a stock',
    'gl': 'us'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want quickly, rather than figuring out why certain things don't work as expected and then maintaining the parser over time.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "alphabet inc class a stock",
    "gl": "us",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
