Is it even possible to web-scrape this website [unpredictable URL]? - python

This is the website in question (I want to extract the SMR Rating):
https://research.investors.com/stock-quotes/nasdaq-apple-inc-aapl.htm
If I have a list of stock names like AAPL, NVDA, TSM etc. and I want to iterate through them, how can I do it when the URL constantly changes in an unpredictable manner?
Take for example the same website with the ticker NVDA:
https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm
It's not possible to append the ticker name to the URL and be done with it. I searched for a hidden API and I got this:
https://research.investors.com/services/ChartService.svc/GetData
This endpoint gives me access to a JSON file, but it doesn't contain the desired SMR Rating. Apart from that, I couldn't find anything else that would lead to the SMR Rating. Is this simply impossible?
Here's what I have so far; I can't even get past the HTML-reading stage:
from bs4 import BeautifulSoup as bs
import json
import re
import pandas as pd
import requests
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36" ,
'referer':'https://www.google.com/'}
URL = "https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm"
page = requests.get(URL, headers = header)
soup = bs(page.content, "html.parser")
print(soup)
As you can see, I can't load the full HTML with Beautiful Soup, because the page assumes some form of robotic activity is taking place (error 405). Should I have specified different headers, or is it indeed the case that web scraping isn't allowed on this webpage?
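For what it's worth, a 405 usually means the server rejected the request before serving the page, so one thing worth trying is a fuller set of browser-like headers and a session that keeps cookies. This is only a sketch, under the assumption that the block is header/cookie based; if the site uses JavaScript-based bot detection, it will still fail:
import requests
from bs4 import BeautifulSoup as bs

session = requests.Session()
# A fuller set of browser-style headers; whether this satisfies the site's
# bot check is an assumption, not a guarantee.
session.headers.update({
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "referer": "https://www.google.com/",
})

URL = "https://research.investors.com/stock-quotes/nasdaq-nvidia-corp-nvda.htm"
page = session.get(URL)
print(page.status_code)  # if this is still 405/403, the block is not header-based
soup = bs(page.content, "html.parser")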

Related

Web scraping with BeautifulSoup returns no text even though it is in the HTML

I'm new to web scraping and to using BeautifulSoup. I need help, as I don't understand why my code returns no text when the text is visible in the inspect view on the website.
Here is my simple code:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.nummerplade.net/nummerplade/Dd97487.html")
soup = BeautifulSoup(source.text,"html.parser")
name = soup.find("span",id="debitorer_name1")
print(name)
The output of running my code is:
<span id="debitorer_name1"></span>
When I inspect the HTML on the website I can see the desired name I want to extract, but not when running my script. Can anyone help me solve this issue?
Thanks!
If you reload the site, you can see the data in the right-hand pane take a moment to appear: the page loads that data dynamically, so it will not be visible in the soup.
How to find the URL that returns the dynamic data:
Open the browser's Network tab, reload the site, and search for the data you want; this will show you the request URL that returns it.
Then open the Headers section of that request and copy the user-agent and referer into your own headers. The request returns the data as JSON, and you can extract whatever you need from it.
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "referer": "https://www.nummerplade.net/",
}
# This is the endpoint the page calls for the vehicle data; it returns JSON.
res = requests.get("https://data3.nummerplade.net/bilbogen2.php?stelnr=salza2bt3nh162519", headers=headers)
data = res.json()
The returned JSON contains the name you are after:
'Sebastian Carl Schwabe'
Unable to print the information in this div on a webpage? - Tried multiple methods - Python - BS4

Currently having some trouble attempting to pull the below text from the webpage:
"https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862"
I am using the below code and I'm trying to print the product name, product price and number available in stock.
I am easily able to print the name and price, but seem to be unable to print the # in stock.
I have tried using both StockInformation_stock__3OYkv & DefaultTemplate_product-stock-information__dFTUx but I am either presented with nothing, or the price again.
What am I doing wrong?
Thanks in advance.
import requests
from bs4 import BeautifulSoup
url = 'https://www.johnlewis.com/mulberry-bayswater-small-zipped-leather-handbag-summer-khaki/p5807862'
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
soup = BeautifulSoup(response.content, 'html.parser')
numInStock = soup.find(class_="StockInformation_stock__3OYkv").get_text().strip()
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
productPrice = soup.find(class_="ProductPrice_price__DcrIr").get_text().strip()
print (productName)
print (productPrice)
print (numInStock)
The webpage you chose has some dynamic elements, i.e. rapidly changing values such as the stock number. The page you pulled first delivers the more static elements such as the product name and price, then makes supplementary requests to different API URLs for the stock data (since it changes frequently). After the browser receives that supplemental data, it injects it into the original HTML page, which is why the frame for the name and price is there but the stock is not. In simple terms, the webpage is "still loading" when you grab it with a single request, and there are hundreds of other requests for images, files, and data that the browser would normally make to build the full page you see.
Fortunately, we only need one more request, which grabs the stock data.
To fix this, we make an additional request to the URL that serves the stock information. I am unsure how much you know about reverse engineering, but I'll touch on it lightly: after digging through the network requests, I found the stock data comes from a POST to https://www.johnlewis.com/fashion-ui/api/stock/v2 with the JSON body {"skus":["240280782"]} (skus being a list of product SKUs). The SKU itself is available in the webpage, so the full code to get the stock is as follows:
import requests
from bs4 import BeautifulSoup

url = 'https://www.johnlewis.com/longchamp-le-pliage-original-large-shoulder-bag/p5051141'
response = requests.get(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
soup = BeautifulSoup(response.content, 'html.parser')
numInStock = soup.find(class_="StockInformation_stock__3OYkv").get_text().strip()
productName = soup.find(id="confirmation-anchor-desktop").get_text().strip()
# Also find the SKU by extracting the digits out of the following snippet in the page source:
# ......"1,150.00"},"productId":"5807862","sku":"240280782","url":"https://www.johnlewis.com/mulberry-ba.....
sku = response.text.split('"sku":"')[1].split('"')[0]
# Supplemental request with the newfound SKU
response1 = requests.post('https://www.johnlewis.com/fashion-ui/api/stock/v2', headers={
    'authority': 'www.johnlewis.com',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'content-type': 'application/json',
    'accept': '*/*',
    'origin': 'https://www.johnlewis.com',
    'referer': url,  # referer matches the product page we requested
}, json={"skus": [sku]})
# Returns JSON like: {"stocks":[{"skuId":"240280782","stockQuantity":2,"availabilityStatus":"SKU_AVAILABLE","stockMessage":"Only 2 in stock online","lastUpdated":"2021-12-05T22:03:27.613Z"}]}
# Index the JSON for the stock quantity reported by the API
try:
    apiStockQuantity = response1.json()["stocks"][0]["stockQuantity"]
except (ValueError, KeyError, IndexError):
    print("There was an error getting the stock")
    apiStockQuantity = "NaN"
print(productName)
print(apiStockQuantity)
print(numInStock)
I also made sure to test it with other products. Since we simulate what the webpage itself does (step 1: get the page template; step 2: use data from the template to make additional requests to the server), it works for any product URL.
This is EXTREMELY fiddly and a pain: you need some knowledge of front end, back end, JSON, and parsing to pull it off.
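As a small aside, the SKU extraction above relies on string splitting; a regular expression can make it slightly more defensive. This is just a sketch, assuming the same "sku":"..." pattern keeps appearing in the page source:
import re

# Hypothetical, more defensive SKU lookup; falls back to None if the pattern is missing.
match = re.search(r'"sku":"(\d+)"', response.text)
sku = match.group(1) if match else None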

How can I get URLs from Oddsportal?

How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am pretty comfortable with it (more than with other languages, though not at a level I would really call comfortable).
After clicking on a link and inspecting the element, I can see that the links can be scraped; however, I am very new to this.
Please help.
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs

url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'
# The sport/country links on the results page carry a foo="f" attribute
a = soup.find_all('a', attrs={'foo': 'f'})
# This set will hold all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
Since you are new to web scraping, I suggest you go through these.
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers (a minimal Selenium sketch follows after this list).
Docs: https://selenium-python.readthedocs.io/
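Since the results sub-pages you mention (the #/page/2/ style URLs) are rendered by JavaScript, plain requests will not see them; a browser-automation tool such as Selenium can load them first. Below is only a minimal sketch, assuming Chrome and chromedriver are installed; the idea is to let the browser render a results page and then collect the link hrefs from the rendered DOM:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/')
time.sleep(5)  # crude wait for the JavaScript to render the results table

# Collect every link from the rendered page; filter further as needed
links = {a.get_attribute('href') for a in driver.find_elements(By.TAG_NAME, 'a') if a.get_attribute('href')}
print(len(links))
driver.quit()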

Python Scraping Expedia data by beautifulsoup

I'm trying to scrape hotel data from Expedia. For example, scraping all the hotel links in Cavendish, Canada, from 01/01/2020 to 01/03/2020. The problem is that I can only scrape 20 of them, while each place actually contains 200+. The sample webpage and its URL are:
https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020
Scraping code:
import lxml
import re
import requests
from bs4 import BeautifulSoup
import xlwt
import pandas as pd
import numpy as np
url = 'https://www.expedia.com/Hotel-Search?adults=2&destination=Cavendish%20Beach%2C%20Cavendish%2C%20Prince%20Edward%20Island%2C%20Canada&endDate=01%2F03%2F2020&latLong=46.504395%2C-63.439669&regionId=6261119&rooms=1&sort=RECOMMENDED&startDate=01%2F01%2F2020'
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'}
res = requests.get(url,headers=header)
soup = BeautifulSoup(res.content,'lxml')
t1 = soup.select('a.listing__link.uitk-card-link')
Every link is stored in an <a class="listing__link uitk-card-link" href=xxxxxxx> </a> inside an <li></li>, and there is no difference in the HTML structure between them. Can anyone explain this?
They use an API call to fetch the next 20 records, so there is no way to get those records from the initial HTML alone.
Here are the API details they use when you click on "Show More":
API LINK
They require API authentication for those calls to return data.
Note: simple requests-based scraping only works when the data isn't behind an AJAX call or an authentication step.

Span ID returns empty string when extracting price

I am trying to get the price from the element with this ID to show when I print it:
import requests
from bs4 import BeautifulSoup

URL = "https://www.futbin.com/20/player/75/ruud-gullit"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find(id="Player-card").get_text()
price = soup.find(id="ps-lowest-2").get_text()
print(price)
It should show the price of the player, but it only returns a "-".
That is because the page loads the price dynamically. The HTML you get with the scraper is different from what you see in your browser: your browser runs the JavaScript that loads the data, and the scraper does not.
Edit:
To go above and beyond for you, I would inspect the site's network traffic and capture which URL is called to get the player's pricing.
I see the URL: https://www.futbin.com/20/playerPrices?player=238434&rids=238433,214100&_=1572009060306
This will give you a JSON blob where you can find the price. Play with the arguments to get what you want.
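As a rough sketch of that idea (the exact structure of the JSON is something to confirm by inspecting the response yourself; the player/rids values below are just the ones copied from the URL above):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"}
# The prices endpoint spotted in the network tab; parameters copied from the captured request
params = {"player": "238434", "rids": "238433,214100"}
res = requests.get("https://www.futbin.com/20/playerPrices", headers=headers, params=params)
data = res.json()
# Print the blob and drill into it once you know which keys hold the PS price
print(data)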
