I'm trying to scrape an Amazon offer page. The code block below prints the titles of all results on the first page.
import requests
from bs4 import BeautifulSoup as BS

offer_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU?ref=deals_deals_deals-grid_slot-15_8454_dt_dcell_img_14_f3724fb9#'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}

response = requests.get(offer_url, headers=headers)
html = response.text
soup = BS(html, 'html.parser')  # specify a parser explicitly to avoid the bs4 warning

links = soup.find_all('a', class_='a-size-base a-link-normal fw-line3-break a-text-normal')
for link in links:
    title = link.text.strip()  # remove surrounding spaces, line breaks etc.
    print(title)
So far, so good. Now, how do I access the second page? Clicking the Next Page button at the bottom of the page (in German it's labelled "Weiter") does not add a page argument like ?page=2 to the URL through which I could access the next page.
The questions I have are: How do I access the content of the next page the same way I access the first page? Is there a POST request involved, and if so, how do I figure out its params/data? How would I use requests to mimic pressing the Next Page button and get the results of the respective page?
The offer is scheduled to last until March 21st, 2021. Until then, the link provided in the code should be valid.
Maybe it's just a few lines of code, e.g. a tweak in my request. Thanks in advance! Have a wonderful day!
Edit:
Trying to fetch the second page using the following script only yields the results of the first page.
params = {"page":2}
html = requests.post('https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU', data=params, headers=headers).text
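Pagination requests like this are usually easiest to reconstruct from the browser's developer tools: open the Network tab (filter by XHR/Fetch), click "Weiter", and inspect the request that fires. Below is a minimal sketch of replaying such a captured request; the endpoint path and form fields are placeholders, not Amazon's actual API:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Placeholder endpoint and payload -- substitute the URL and form fields
# that the Network tab shows for the real "Weiter" request.
captured_url = 'https://www.amazon.de/gp/promotion/ajax/PLACEHOLDER'
captured_form = {
    'promotionId': 'A2M8IJS74E2LMU',  # hypothetical field name
    'page': 2,                        # hypothetical field name
}
response = session.post(captured_url, data=captured_form)
print(response.status_code)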
There are a lot of posts about this subject, but I still haven't managed to achieve what I want, so here is my problem:
I am trying to extract a stock price from this site:
https://bors.e24.no/#!/instrument/NHY.OSE
and I would like to extract the price 57,12 from the "inspection" text:
<div class="number LAST" data-reactid=".g.1.2.0">
57,12</div>
Here is the code I tried, which generates AttributeError: 'NoneType' object has no attribute 'text'.
I also tried removing .text on the PRICE line, and the result is 'Price is: None'.
from bs4 import BeautifulSoup
import requests

url = 'https://bors.e24.no/#!/instrument/NHY.OSE'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
PRICE = soup.find('div', class_="number LAST").text
print('Price is:', PRICE)
Try this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
}

api_url = "https://bors.e24.no/server/components?columns=ITEM, LAST, BID, ASK, CHANGE, CHANGE_PCT, TURNOVER, LONG_NAME&itemSector=NHY.OSE&type=table"
data = requests.get(api_url, headers=headers).json()
print(data["rows"][0]["values"]["LAST"])
Output:
56.92
This happens because your
requests.get(url)
will not get all the information on the page, including the price you are looking for: the webpage loads some parts first and only then fetches more data via JavaScript. Because of that, trying to select the div with class "number LAST"
PRICE = soup.find('div', class_="number LAST").text
will throw an error, because that element does not exist yet.
There are some ways to fix this problem:
You can use libraries like Selenium, which is often recommended for scraping more dynamic pages that rely on JavaScript and API calls to load content.
You can open your developer tools and inspect the Network tab, where you might find the request that fetches the price you are trying to scrape.
I believe that in your case, after taking a look at the Network tab myself, the right URL to request could be 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history', which seems to return a dictionary with the price you are looking for.
import requests
url = 'https://bors.e24.no/server/components?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID&filter=ITEM%3D%3DsNHY&limit=5&source=feed.ose.trades.EQUITIES%2BPCC&type=history'
page = requests.get(url)
print(page.json()["rows"][0]["values"]["PRICE"])
If you are looking to scrape various instruments, you will need a way to dynamically change the previous link to match the other stocks you are trying to crawl, which I guess means changing "NHY" and "ose" to values that match the stock you are looking for (see the sketch below).
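A minimal sketch of that idea, assuming the ITEM==s<TICKER> filter and the feed.<exchange>.trades source follow the same pattern for other instruments (an assumption worth verifying in the Network tab per instrument):

import requests

def last_trade_price(ticker, exchange='ose'):
    # Build the history URL for an arbitrary ticker; the filter uses the
    # ITEM==s<TICKER> pattern seen in the captured request above.
    url = (
        'https://bors.e24.no/server/components'
        '?columns=TIME,+PRICE,+VOLUME,+BUYER,+SELLER,+ID'
        '&filter=ITEM%3D%3Ds' + ticker +
        '&limit=5'
        '&source=feed.' + exchange + '.trades.EQUITIES%2BPCC'
        '&type=history'
    )
    rows = requests.get(url).json()['rows']
    return rows[0]['values']['PRICE']

print(last_trade_price('NHY'))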
Hi, I want to scrape domain names and their prices, but my code returns an empty result and I don't know why.
import requests
from bs4 import BeautifulSoup

url = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
names = soup.findAll('div', {'class': "domainCardDetail"})
print(names)
Try the following approach to get domain names and their prices from that site. The script currently parses content from the first page only. If you wish to get content from other pages, change the page number in page=1, which is located within link (see the pagination sketch after the output below).
import requests
from bs4 import BeautifulSoup

link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page=1'
url = 'https://www.brandbucket.com/amp/ga'

payload = {
    '__amp_source_origin': 'https://www.brandbucket.com'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload['dp'] = soup.select_one("amp-iframe")['src'].split("list=")[1].split("&")[0]
    resp = s.get(url, params=payload)
    for key, val in resp.json()['triggers'].items():
        if not key.startswith('domain'): continue
        container = val['extraUrlParams']
        print(container['pr1nm'], container['pr1pr'])
Output are like (truncated):
estita.com 2035
rendro.com 1675
rocamo.com 3115
wzrdry.com 4315
prutti.com 2395
bymodo.com 3495
ethlax.com 2035
intezi.com 2035
podoxa.com 2430
rorror.com 3190
zemoxa.com 2195
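A sketch of the pagination idea mentioned above, assuming the listing pages only differ in the page query parameter:

# Hypothetical pagination wrapper around the script above.
base_link = 'https://www.brandbucket.com/styles/6-letter-domain-names?page={}'

for page_number in range(1, 4):  # e.g. the first three pages
    link = base_link.format(page_number)
    # ...then run the same Session/extraction logic shown above with this link.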
Check the status code of the response. When I tested, the web server returned a 403, and because of that there is no "domainCardDetail" div in the response.
The reason for this is that the website is protected by Cloudflare.
There are some advanced ways to bypass this.
The following solution is very simple if you do not need a massive amount of scraping. Otherwise, you may want to use "cloudscraper", "Selenium", or another method that can execute the JavaScript on the website.
Open the developer console.
Go to "Network". Make sure the ticks are set as in this picture: https://i.stack.imgur.com/v0KTv.png
Refresh the page.
Copy the JSON result and parse it in Python (see the sketch below): https://i.stack.imgur.com/odX5S.png
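As a minimal sketch of that last step, assuming the copied response is JSON (the key names below are placeholders; adjust them to the actual payload you copied):

import json

# Paste the JSON copied from the Network tab into this string.
# The structure here is a placeholder -- the real payload's keys may differ.
raw = '{"products": [{"name": "example.com", "price": 1995}]}'

data = json.loads(raw)
for product in data['products']:
    print(product['name'], product['price'])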
I'm new to python and html. I am trying to retrieve the number of comments from a page using requests and BeautifulSoup.
In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
<span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
226
</span>
Comments
</a>
When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:
import requests, bs4

url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
It returns this, same as the above but no 226.
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.
The page, and specifically the number of comments, is loaded and shown via JavaScript. But you don't have to use Selenium; make a request to the API behind it instead:
import requests

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}

    # visit main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)

    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
This page uses JavaScript to get the comment number; with JavaScript disabled, the span stays empty.
You can find the real URL which contains the number in Chrome's Developer tools.
Then you can mimic the request using @alecxe's code above.
I am trying to fetch the stock price of a company specified by user input. I am using requests to get the source code and BeautifulSoup to scrape it; I am fetching the data from google.com. I am trying to fetch only the last stock price (806.93 in the picture). When I run my script, it prints None; none of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page looks like this:
Looks like that source is from inspecting the element, not the actual page source. A couple of suggestions: use Google Finance to get rid of some noise; https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
  <div id=market-data-div class="id-market-data-div nwp g-floatfix">
    <div id=price-panel class="id-price-panel goog-inline-block">
      <div>
        <span class="pr">
          <span id="ref_694653_l">806.93</span>
        </span>
        <div class="id-price-change nwp">
          <span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
            <span class="chg" id="ref_694653_cp">(1.21%)</span>
          </span>
        </div>
      </div>
You should be able to pull the number out of that.
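For instance, a sketch with BeautifulSoup against the snippet above (the ref_* IDs look session-specific, so matching on the stable "pr" class seems safer):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.google.com/finance?q=googl')
soup = BeautifulSoup(page.content, 'lxml')

# The outer span carries the stable class "pr"; its nested span holds the price.
price = soup.find('span', class_='pr').get_text(strip=True)
print(price)  # e.g. 806.93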
I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
, did a right click and "View Page Source" but did not see the code that you screenshotted.
Then I typed out a section of your code screenshot and created a BeautifulSoup object with it and then ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>', 'html.parser')
test_screenshot.find('span', {'class': '_Rnb fmob_pr fac-l', 'data-symbol': 'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the code you are getting is not the code you expect to get.
I suggest using the Google Finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find("span",{'class':'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.
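As one illustration of that route, the third-party yfinance package (an assumption; possibly not the exact Yahoo Finance wrapper this answer had in mind) skips HTML scraping entirely:

import yfinance as yf  # pip install yfinance

ticker = yf.Ticker('GOOGL')
# history() returns a pandas DataFrame; the last row's Close is the latest price.
last_close = ticker.history(period='1d')['Close'].iloc[-1]
print(last_close)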
You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit, and you receive different HTML, with different selectors and elements, and some sort of an error. Passing a user-agent makes the request look like a real user visit by adding this information to the HTTP request headers.
Pass user-agent into request headers:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0'
}
params = {
    'q': 'alphabet inc class a stock',
    'gl': 'us'
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and grab the data you want quickly, rather than figuring out why certain things don't work as expected and then maintaining the scraper over time.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "alphabet inc class a stock",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.
Ok, I have been scratching my head on this for way too long. I am trying to retrieve the URL for an embedded video on a web page using the Beautiful Soup and requests modules in Python 2.7.6. I inspect the HTML in Chrome and can see the URL to the video, but when I get the page with requests and parse it with Beautiful Soup, I can't find the video node. From looking at the source, it looks like the video window is a nested HTML document. I have searched all over and can't find out why I can't retrieve this. If anyone could point me in the right direction I would greatly appreciate it. Thanks.
here is the url to one of the videos:
http://www.growingagreenerworld.com/episode125/
The problem is that there is an iframe containing the video tag, which is loaded asynchronously in the browser.
Good news is that you can simulate that behavior by making an additional request to the iframe URL passing the current page URL as a Referer.
Implementation:
import re

import requests
from bs4 import BeautifulSoup

url = 'http://www.growingagreenerworld.com/episode125/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

with requests.Session() as session:
    session.headers = headers
    response = session.get(url)
    soup = BeautifulSoup(response.content)

    # follow the iframe url
    response = session.get('http:' + soup.iframe['src'], headers={'Referer': url})
    soup = BeautifulSoup(response.content)

    # extract the video URL from the script tag
    print re.search(r'"url":"(.*?)"', soup.script.text).group(1)
Prints:
http://pdl.vimeocdn.com/43109/378/290982236.mp4?token2=1424891659_69f846779e96814be83194ac3fc8fbae&aksessionid=678424d1f375137f