I am trying to scrape Reuters image captions from certain pictures. I have searched with my parameters and have a search result with 182 pages. The 'PN=X' part at the end of each link is the page number. I have built a for loop to loop through the pages and scrape all the captions:
import re
import requests
from bs4 import BeautifulSoup

pages = ['https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=1',
'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=2',
'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=3',
'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN=4', ...]
complete_captions = []
for link in pages:
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    for element in soup.find_all(id=re.compile("CaptionLong_Lbl")):
        if not element.text.endswith('...'):
            complete_captions.append(element.text)
The code runs, but it returns the same captions regardless of the page it is given: it just repeats the same 47 results over and over again. Yet when I enter the pages into my browser, they are different from each other, so the script should give different results. Any idea how to fix this?
For this website, getting different results for each page is more complicated than just changing the page number in the URL and using requests.get(). The PN parameter sits after the # in the URL; a fragment is never sent to the server, so every request returns the same initial page, and the actual results are filled in by JavaScript, which requests does not execute.
A simpler approach in this case would be to use Selenium, for example:
from bs4 import BeautifulSoup
import re
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = webdriver.Firefox(options=options)

complete_captions = []
for page_number in range(1, 5):
    print(f"Page {page_number}")
    url = f'https://pictures.reuters.com/CS.aspx?VP3=SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688#/SearchResult&VBID=2C0BXZS52QWLHI&SMLS=1&RW=1920&RH=688&PN={page_number}'
    browser.get(url)
    time.sleep(1)  # give the JavaScript time to render the results
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for element in soup.find_all(id=re.compile("CaptionLong_Lbl")):
        if not element.text.endswith('...'):
            complete_captions.append(element.text)
            # print(element.text)
browser.quit()
Obviously, a different browser can be used.
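For instance, a headless Chrome setup along the same lines might look like this (a sketch; it assumes chromedriver is installed and on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # run Chrome without opening a visible window
browser = webdriver.Chrome(options=options)

The rest of the loop stays exactly the same.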
There is a paginated list of hyperlinks on this webpage: https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/.
The code I have created so far scrapes the relevant links from the first page, but I cannot figure out how to extract links from the subsequent pages (8 links per page, about 25 pages).
There does not seem to be a way to navigate between the pages using the URL.
from bs4 import BeautifulSoup
import urllib.request

# Scrape webpage
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

# Extract links
links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])

# Select relevant links, reformat, and drop duplicates
links = list(dict.fromkeys(["https://www.farmersforum.ie" + link for link in links if "/reports/Thurles" in link]))
Please advise on how I can do this using Python.
I've solved this with Selenium. Thank you.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

# Launch Chrome driver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Open webpage
driver.get("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")

# Loop through pages
allLnks = []
iStop = False
# Continue until we fail to find a page button
while not iStop:
    for ii in range(2, 12):
        try:
            # Click page button (note '@id', not '#id', in the XPath)
            driver.find_element_by_xpath('//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
        except:
            iStop = True
            break
        # Wait to load
        time.sleep(0.1)
        # Identify elements with tagname <a>
        lnks = driver.find_elements_by_tag_name("a")
        # Traverse list of links
        iiLnks = []
        for lnk in lnks:
            # Use get_attribute() to get each href and add it to the list
            iiLnks.append(lnk.get_attribute("href"))
        # Select relevant links (skipping anchors without an href), reformat, and drop duplicates
        iiLnks = list(dict.fromkeys([iiLnk for iiLnk in iiLnks if iiLnk and "/reports/Thurles" in iiLnk]))
        allLnks = allLnks + iiLnks
    # Advance to the next block of page buttons, unless we have run out of pages
    if not iStop:
        driver.find_element_by_xpath('//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[12]/a').click()
driver.quit()
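As a side note, the fixed time.sleep(0.1) can be flaky on slow connections. A sketch using Selenium's explicit waits instead (not part of the original solution) would wait until a report link is actually present:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one report link to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//a[contains(@href, "/reports/Thurles")]'))
)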
I'm trying to get some information about a product I'm interested in on Amazon.
I'm using the BeautifulSoup library for web scraping:
import requests
from bs4 import BeautifulSoup

# headers is assumed to be defined earlier (e.g. a dict with a User-Agent string)
URL = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
In the pic, the highlighted row is the one I want to select, but when I run my script I get 'None' every time. (Printing the entire output after the BeautifulSoup call gives me the entire HTML source, so I'm using the right URL.)
Any solutions?
You need to use the .text attribute to get the text of an element.
so change:
print(title)
to:
print(title.text)
Output:
EUR 1.153,00
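Note that soup.find() returns None when no matching element is found, which is exactly what you are seeing when Amazon serves a page without that span. A small sketch to guard against that before accessing .text:

title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
if title is not None:
    print(title.text.strip())
else:
    print("Price element not found - Amazon may have served a different page")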
I wouldn't use BeautifulSoup alone in this case. You can easily add Selenium to scrape the website:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
driver = webdriver.Safari()
driver.get(url)

html_content = driver.page_source
soup = BeautifulSoup(html_content, "html.parser")
title = soup.find('span', class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
If you can't use Safari, you have to download the webdriver for Chrome, Firefox, etc., but there is plenty of reading material on this topic.
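For example, a headless Firefox variant might look like this (a sketch; it assumes geckodriver is installed and on your PATH):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions

options = FirefoxOptions()
options.headless = True  # no visible browser window
driver = webdriver.Firefox(options=options)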
I have a sample website and I want to extract all the href links from it. It has two drop-downs, and once a selection is made it displays the results with a link to a manual to download.
It does not navigate to a different page; instead it shows the results on the same page. I have extracted the combinations from the drop-down lists, but I am unable to find the manual links.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
from bs4 import BeautifulSoup
import requests

url = "https://www.cars.com/"
driver = webdriver.Chrome('C:/Users/webdrivers/chromedriver.exe')
driver.get(url)
time.sleep(4)

selectYear = Select(driver.find_element_by_id("odl-selected-year"))
data = []
for yearOption in selectYear.options:
    yearText = yearOption.text
    selectYear.select_by_visible_text(yearText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_id("odl-selected-model"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([yearText, modelText])
        page = requests.get(url, headers=headers)  # headers is assumed to be defined earlier
        soup = BeautifulSoup(page.text, 'html.parser')
        content = soup.findAll('div', attrs={"class": "odl-results-container"})
        for i in content:
            x = i.findAll(['h3', 'span'])
            for y in x:
                print(y.get_text())
print does not show any data. How can I get the links to the manuals? Thanks in advance.
You need to click the button for each car model and year and then retrieve the rendered HTML page source from your Selenium webdriver, rather than fetching the page again with requests (which returns the initial page without your drop-down selections).
Add this in your inner loop:
button = driver.find_element_by_link_text("Select this vehicle")
button.click()
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
content = soup.findAll('a', attrs={"class": "odl-download-link"})
for i in content:
    print(i["href"])
This prints out:
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=6875&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O91668&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7126&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O134871&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7708&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O177941&VIN=&userMarket=GBR
...
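If you want to keep each link together with the year and model it belongs to rather than just printing it (a small variation on the loop above, not part of the original answer), you could extend the data rows instead:

for i in content:
    data.append([yearText, modelText, i["href"]])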
I am using the code provided below to create a list containing the titles of the videos in a public YouTube playlist. It works well for playlists containing fewer than 100 videos; for longer playlists, only the titles of the first 100 videos are added to the list. I think the reason for this behaviour is that when the same page is loaded in a browser, the first 100 videos are loaded and the remaining videos are loaded as you scroll down the page. Is there any way to get the titles of all the videos in a playlist?
from bs4 import BeautifulSoup as bs
import requests

url = "https://www.youtube.com/playlist?list=PLRdD1c6QbAqJn0606RlOR6T3yUqFWKwmX"
r = requests.get(url)
soup = bs(r.text, 'html.parser')
res = soup.find_all('tr', {'class': 'pl-video yt-uix-tile'})
titles = []
for video in res:
    titles.append(video.get('data-title'))
As you have correctly observed, only the first 100 videos are loaded; when the user scrolls down, AJAX calls are made to load the additional videos.
The easiest, but also most heavyweight, option to reproduce the AJAX calls is to use the Selenium webdriver. You can find the official Python documentation here.
I created the following script with the help of inputs from Abrogans. This gist was also helpful.
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Firefox()
url = "https://www.youtube.com/playlist?list=PLRdD1c6QbAqJn0606RlOR6T3yUqFWKwmX"
driver.get(url)

# Scroll to the bottom of the page so the remaining videos load via AJAX
elem = driver.find_element_by_tag_name('html')
elem.send_keys(Keys.END)
time.sleep(3)
elem.send_keys(Keys.END)

innerHTML = driver.execute_script("return document.body.innerHTML")
page_soup = bs(innerHTML, 'html.parser')
res = page_soup.find_all('span', {'class': 'style-scope ytd-playlist-video-renderer'})
titles = []
for video in res:
    if video.get('title') is not None:
        titles.append(video.get('title'))
driver.close()
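For very long playlists, two presses of END may not be enough. A sketch that keeps scrolling until the page height stops growing (an extension of the script above, not part of the original answer) could replace the two send_keys calls:

# Keep scrolling until the page stops getting longer,
# so even playlists with many hundreds of videos are fully loaded
last_height = driver.execute_script("return document.documentElement.scrollHeight")
while True:
    elem.send_keys(Keys.END)
    time.sleep(3)
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height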
Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

path = "C:\\Python27\\chromedriver\\chromedriver"
driver = webdriver.Chrome(executable_path=path)

# Open the site in Chrome
driver.get("http://www.thehindu.com/")
# 10 second delay to let the page load
time.sleep(10)

# Enter the keyword into the search box and submit
elem = driver.find_element_by_id("searchString")
elem.send_keys("unilever")
elem.send_keys(Keys.RETURN)
time.sleep(10)

# Problem here
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
print(soup)
Above is the code.
I want to scrape data from http://www.thehindu.com/: the script searches for the word "unilever" in the search box and is redirected to the results page.
Link for Search Page
Now my question is: how can I get the source code of the searched page?
Basically, I want the news related to "Unilever".
You can get text inside <body>:
body = driver.find_element_by_tag_name("body")
bodyText = body.get_attribute("innerText")
Then you can find your keyword in bodyText.
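For example, a minimal sketch of that check (case-insensitive) could be:

if "unilever" in bodyText.lower():
    print("'unilever' appears in the visible text of the results page")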