OK, I have been scratching my head on this for way too long. I am trying to retrieve the URL of an embedded video on a web page using the Beautiful Soup and requests modules in Python 2.7.6. When I inspect the HTML in Chrome I can see the URL of the video, but when I fetch the page with requests and parse it with Beautiful Soup I can't find the video node. From looking at the source it appears the video player is a nested HTML document. I have searched all over and can't figure out why I can't retrieve it. If anyone could point me in the right direction I would greatly appreciate it. Thanks.
here is the url to one of the videos:
http://www.growingagreenerworld.com/episode125/
The problem is that there is an iframe containing the video tag, and the iframe is loaded asynchronously in the browser.
The good news is that you can simulate that behavior by making an additional request to the iframe URL, passing the current page URL as a Referer header.
Implementation:
import re
from bs4 import BeautifulSoup
import requests
url = 'http://www.growingagreenerworld.com/episode125/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
with requests.Session() as session:
    session.headers = headers
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # follow the iframe url, sending the original page as the Referer
    response = session.get('http:' + soup.iframe['src'], headers={'Referer': url})
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract the video URL from the script tag
    print re.search(r'"url":"(.*?)"', soup.script.text).group(1)
Prints:
http://pdl.vimeocdn.com/43109/378/290982236.mp4?token2=1424891659_69f846779e96814be83194ac3fc8fbae&aksessionid=678424d1f375137f
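If the goal is to save the video itself rather than just print its URL, a streamed download inside the same with block would work (a sketch; the output filename is arbitrary, and the token2/aksessionid parameters in the link expire, so it has to be fetched right away):

    video_url = re.search(r'"url":"(.*?)"', soup.script.text).group(1)
    response = session.get(video_url, stream=True, headers={'Referer': url})
    with open('episode125.mp4', 'wb') as f:  # arbitrary output name
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)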
Related
I'm trying to scrape through an amazon offer. The code block below prints the titles of all results of the first page.
import requests
from bs4 import BeautifulSoup as BS
offer_url = 'https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU?ref=deals_deals_deals-grid_slot-15_8454_dt_dcell_img_14_f3724fb9#'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
}
response = requests.get(offer_url, headers=headers)
html = response.text
soup = BS(html, 'html.parser')
links = soup.find_all('a', class_='a-size-base a-link-normal fw-line3-break a-text-normal')
for link in links:
    title = link.text.strip()  # remove surrounding spaces, line breaks, etc.
    print(title)
So far, so good. Now, how do I access the second page? Clicking the Next Page button at the bottom of the page (in German: "Weiter") does not add a page argument like ?page=2 to the URL through which I could access the next page.
The questions I have are: How do I access the content of the second page the same way I access the first? Is there a POST request involved, and if so, how do I figure out its params/data? How would I use requests to mimic pressing the Next Page button and get the respective results?
The offer is scheduled to last until March 21st, 2021. Until then, the link provided in the code should be valid.
Maybe it's just a few lines of code, e.g. a tweak in my request. Thanks in advance! Have a wonderful day!
Edit:
Trying to fetch the second page using the following script only yields the results of the first page.
params = {"page":2}
html = requests.post('https://www.amazon.de/gp/promotion/A2M8IJS74E2LMU', data=params, headers=headers).text
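The general way to answer the "how do I figure out its params/data" part: open Chrome DevTools, switch to the Network tab, click "Weiter", and inspect the request that fires; its method, URL, form data, and cookies are exactly what requests has to reproduce. A hedged sketch of replaying such a captured request (the endpoint path and form fields below are placeholders to be copied from DevTools, not Amazon's documented API):

import requests

session = requests.Session()
session.headers.update(headers)  # reuse the browser-like headers from above
session.get(offer_url)           # visit the offer page first to collect cookies

# both values below are placeholders: copy the real ones from DevTools
captured_url = 'https://www.amazon.de/gp/promotion/...'
captured_data = {'page': '2'}

response = session.post(captured_url, data=captured_data)
print(response.status_code)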
I've created a script using the requests and BeautifulSoup libraries to parse the links of some images from a webpage. The image links are visible when you use the selector [class^='cylindo-viewer-frame'] > img[src*='/frames/'] within the search bar (Ctrl + F) after inspecting the element. This is how they look in the DOM.
I know I can grab those image links using Selenium, but I would like to stick with the requests module. I've noticed several times that it is often possible to grab such dynamic content with requests alone. I've tried finding the links within the script tags and in the dev tools, but no luck.
Two of the expected links out of 32 are:
https://content.cylindo.com/api/v2/4616/products/657285/frames/5/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/7/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
This is how I've tried:
import requests
from bs4 import BeautifulSoup
link = 'https://www.ethanallen.com/on/demandware.store/Sites-ethanallen-us-Site/en_US/Product-Variation?pid=emersonQS&dwvar_emersonQS_Fabric=Q1031&dwvar_emersonQS_seatingSize=90sofa&step=2'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select(".cylindo-viewer-container li[class^='cylindo-viewer-frame'] > img[src*='/frames/']"):
        print(item.get("src"))
How can I grab those image links using requests?
Why should you use Selenium?
The website serves its content dynamically, which requests cannot handle, because the information you are trying to match is not in the response.
Take a look, it is not that hard ;)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.ethanallen.com/on/demandware.store/Sites-ethanallen-us-Site/en_US/Product-Variation?pid=emersonQS&dwvar_emersonQS_Fabric=Q1031&dwvar_emersonQS_seatingSize=90sofa&step=2"

driver.get(url)
sleep(2)  # give the JavaScript time to render the viewer

soup = BeautifulSoup(driver.page_source, 'lxml')
for item in soup.select(".cylindo-viewer-container li[class^='cylindo-viewer-frame'] > img[src*='/frames/']"):
    print(item.get("src"))

driver.close()
Output
https://content.cylindo.com/api/v2/4616/products/657285/frames/3/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/27/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/29/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/11/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/31/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/5/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
...
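That said, since the asker wanted to stay with requests: the links above differ only in the frame number, so once the account ID (4616), product ID (657285), and feature parameters are known, the URLs can be generated without a browser. A sketch under the assumption (inferred from the output above, not documented Cylindo behaviour) that the 32 images are the odd frames 1-63:

import requests

base = 'https://content.cylindo.com/api/v2/4616/products/657285/frames/{}/657285.JPG'
params = {'background': 'FFFFFF', 'feature': 'FABRIC:Q1031', 'size': '1268'}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    for frame in range(1, 64, 2):  # odd frames 1-63 -> 32 candidates
        r = s.head(base.format(frame), params=params)
        if r.ok:                   # keep only frames the CDN actually serves
            print(r.url)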
I have looked through the previous answers but none seemed to apply. I am building an open-source Quizlet scraper to extract all links from a class (e.g. https://quizlet.com/class/3675834/). In this case, the tag is a and the class is "UILink". But when I use the following code, the returned list does not contain the element I am looking for. Is it because of the JavaScript issue described here?
I tried the previous method of importing the folder as written here, but it does not contain the URLs.
How can I scrape these URLs?
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36"
}
url = 'https://quizlet.com/class/8536895/'
response = requests.get(url, verify=False, headers=headers)
soup = BeautifulSoup(response.text,'html.parser')
b = soup.find_all("a", class_="UILink")
You wouldn't be able to scrape dynamic webpages directly with just requests. What you see in the browser is the fully rendered page, which the browser takes care of.
To scrape data from these kinds of webpages, you can follow either of the approaches below.
Use requests-html instead of requests
pip install requests-html
scraper.py
from requests_html import HTMLSession
from bs4 import BeautifulSoup
session = HTMLSession()
url = 'https://quizlet.com/class/8536895/'
response = session.get(url)
response.html.render() # render the webpage
# access html page source with html.html
soup = BeautifulSoup(response.html.html, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))
Note: this uses a headless browser (Chromium) under the hood to render the page, so it can time out or be a little slow at times.
Use Selenium WebDriver
Use driver.get(url) to fetch the page, then pass the page source to Beautiful Soup via driver.page_source, as in the sketch below.
Note: run this in headless mode as well; there might be some latency at times.
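A minimal sketch of that Selenium approach for the same page (assuming chromedriver is on PATH; the selector is the one from the question):

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')          # no visible browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

driver.get('https://quizlet.com/class/8536895/')
sleep(2)  # give the JavaScript time to render the links

soup = BeautifulSoup(driver.page_source, 'html.parser')
b = soup.find_all("a", class_="UILink")
print(len(b))

driver.quit()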
I'm new to Python and HTML. I am trying to retrieve the number of comments from a page using requests and BeautifulSoup.
In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
    <span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
        226
    </span>
    Comments
</a>
When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:
import requests, bs4
url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
It returns this, the same as above but with no 226.
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.
The page, and specifically the number of comments, is loaded and shown via JavaScript. But you don't have to use Selenium; make a request to the API behind it instead:
import requests
with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}

    # visit the main page first to pick up the session cookies
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)

    # get the comments count from the API
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
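To pull just the number out of that payload, index into the dictionary shown above (a sketch based on the printed structure):

count = r.json()['comment_counts']['33519766']
print(count)  # 226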
This page uses JavaScript to get the comment number; this is what the page looks like with JavaScript disabled:
You can find the real URL, which contains the number, in Chrome's Developer Tools:
Then you can mimic the request using #alecxe's code.
Good day!
I am currently making a web scraper for the Alibaba website.
My problem is that the returned source code does not show some parts that I am interested in. The data is there when I check the source code in the browser, but I can't retrieve it with BeautifulSoup.
Any tips?
from urllib2 import urlopen  # needed for urlopen below (Python 2)
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)
I am interested in the highlighted part shown in the image from Chrome's Developer Tools. But when I tried writing the page to a text file, some parts, including the highlighted one, were nowhere to be found. Any tips? TIA!
You need to provide the User-Agent header at least.
Example using the requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup
def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except:
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get('href'))
Prints http://www.alibaba.com/catalogs/products/CID144/2.
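Since each catalog page exposes the same a.next link, the technique extends naturally to walking the whole listing (a sketch reusing make_soup from above; it stops when the link disappears or a request fails):

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
while url:
    soup = make_soup(url)
    if soup is None:
        break  # the request failed
    # ... process the current page here ...
    next_link = soup.select_one("a.next")
    url = next_link.get('href') if next_link else None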