I've asked a similar question about navigating multiple pages with a static URL from https://ethnicelebs.com/all-celeb, and thanks for the help! But now I'd like to scrape the ethnicity information of every celebrity listed by clicking on each name. I can navigate all the pages now, but my code keeps scraping information from the very first page only.
I've tried the following:
url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()
driver.get(url)

while True:
    page = requests.post('https://ethnicelebs.com/all-celebs')
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        request_href = requests.get(href['href'])
        soup2 = BeautifulSoup(request_href.content)
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)

    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)
(Thanks to @Sureshmani!)
I expect the code to scrape each page while navigating instead of only the first page. How can I scrape the current page while it keeps navigating? Thanks!
I misunderstood your question due to the nested loop in the previous answer. The following code would work:
url = 'https://ethnicelebs.com/all-celeb'
driver = webdriver.Chrome()

while True:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for href in soup.find_all('a', href=True)[18:]:
        print('Found the URL:{}'.format(href['href']))
        driver.get(href['href'])
        soup2 = BeautifulSoup(driver.page_source, 'html.parser')
        for each in soup2.find_all('strong')[:-1]:
            print(each.text)

    Next_button = (By.XPATH, "//*[@title='Go to next page']")
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable(Next_button)).click()
    url = driver.current_url
    time.sleep(5)
In your code, you only send a request through Selenium once at the beginning and then use requests afterwards. To navigate and scrape pages at the same time, you should use only Selenium, as in the example above.
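For completeness, the example above assumes the usual imports, which are not shown in either snippet:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC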
There is a paginated list of hyperlinks on this webpage: https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/.
The code I have written so far scrapes the relevant links from the first page, but I cannot figure out how to extract links from the subsequent pages (8 links per page, about 25 pages).
There does not seem to be a way to navigate the pages using the URL.
from bs4 import BeautifulSoup
import urllib.request
# Scrape webpage
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
# Extract links
links = []
for link in soup.find_all('a', href=True):
    links.append(link['href'])
# Select relevant links, reformat, and drop duplicates
links = list(dict.fromkeys(["https://www.farmersforum.ie"+link for link in links if "/reports/Thurles" in link]))
Please advise on how I can do this using Python.
I've solved this with Selenium. Thank you.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

# Launch Chrome driver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Open webpage
driver.get("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")

# Loop through pages
allLnks = []
iStop = False
# Continue until we fail to find the button
while iStop == False:
    for ii in range(2, 12):
        try:
            # Click page
            driver.find_element_by_xpath('//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
        except:
            iStop = True
            break
        # Wait to load
        time.sleep(0.1)
        # Identify elements with tagname <a>
        lnks = driver.find_elements_by_tag_name("a")
        # Traverse list of links
        iiLnks = []
        for lnk in lnks:
            # Use get_attribute() to get all href values and add the links to the list
            iiLnks.append(lnk.get_attribute("href"))
        # Select relevant links, reformat, and drop duplicates
        iiLnks = list(dict.fromkeys([iiLnk for iiLnk in iiLnks if "/reports/Thurles" in iiLnk]))
        allLnks = allLnks + iiLnks
    # Advance to the next block of pages
    driver.find_element_by_xpath('//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[12]/a').click()
driver.quit()
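Note that the find_element_by_* helpers used above were removed in Selenium 4; with a current Selenium release the same lookups would be written with By locators instead. A minimal equivalent for the two calls inside the loop:

from selenium.webdriver.common.by import By

# Selenium 4 style: find_element(By.<strategy>, <locator>)
driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
lnks = driver.find_elements(By.TAG_NAME, "a")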
I have a sample website and I want to extract all the "href" links from it. It has two drop-downs, and once a selection is made it displays the results with a link to a manual to download.
It does not navigate to a different page; instead it shows the results on the same page. I have extracted the combinations from the drop-down lists, but when I try to extract the manual links I am unable to find them.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
from bs4 import BeautifulSoup
import requests

url = "https://www.cars.com/"
driver = webdriver.Chrome('C:/Users/webdrivers/chromedriver.exe')
driver.get(url)
time.sleep(4)

selectYear = Select(driver.find_element_by_id("odl-selected-year"))
data = []
for yearOption in selectYear.options:
    yearText = yearOption.text
    selectYear.select_by_visible_text(yearText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_id("odl-selected-model"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([yearText, modelText])
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        content = soup.findAll('div', attrs={"class": "odl-results-container"})
        for i in content:
            x = i.findAll(['h3', 'span'])
            for y in x:
                print(y.get_text())
The print does not show any data. How can I get the links to the manuals? Thanks in advance.
You need to click the button for each car model and year and then retrieve the rendered HTML page source from your Selenium webdriver rather than with requests.
Add this in your inner loop:
button = driver.find_element_by_link_text("Select this vehicle")
button.click()
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
content = soup.findAll('a', attrs={"class": "odl-download-link"})
for i in content:
    print(i["href"])
This prints out:
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=6875&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O91668&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7126&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O134871&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7708&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O177941&VIN=&userMarket=GBR
...
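Putting it together, the inner loop from the question would look roughly like this. This is only a sketch: the element id, link text, and class name are taken from the snippets above, and the one-second sleep is an assumption to let the results render:

for modelOption in selectModel.options:
    modelText = modelOption.text
    selectModel.select_by_visible_text(modelText)
    data.append([yearText, modelText])

    # Click "Select this vehicle" and parse the rendered page from Selenium,
    # not from requests, so the results container is actually present
    button = driver.find_element_by_link_text("Select this vehicle")
    button.click()
    time.sleep(1)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for link in soup.findAll('a', attrs={"class": "odl-download-link"}):
        print(link["href"])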
I'm trying to parse this url
First, I tried to use requests with bs4, but the resulting page differed from the content shown in the browser.
import requests
from bs4 import BeautifulSoup

cont = requests.get(path).content
soup = BeautifulSoup(cont, "html.parser")
print(soup.prettify())
Next I tried to use Selenium:
def render_page(path):
    driver = webdriver.PhantomJS()
    driver.get(path)
    time.sleep(3)
    r = driver.page_source
    return r

r = render_page(path)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
But it also returns different content from what the browser shows.
After that I tried adding this to my code:
js_code = "return document.getElementsByTagName('html').innerHTML"
your_elements = sel.execute_script(js_code)
but it didn't help.
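(As an aside, getElementsByTagName returns a collection, so the script above evaluates to undefined. A corrected call, assuming the webdriver object is named driver, would be something like:)

js_code = "return document.documentElement.outerHTML"
html = driver.execute_script(js_code)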
So, is there any way to get the content of the page, using requests, Selenium, or some other parser, that matches what the browser shows?
I need to extract tweets embedded in text articles. The problem with the pages I'm testing is that they load tweets in ~5 out of 10 runs. So I need to use Selenium to wait for the page to load but I cannot make it work. I followed steps from their official website:
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
options = webdriver.ChromeOptions()
options.add_argument("headless")
driver = webdriver.Chrome(executable_path='/Users/ME/Downloads/chromedriver', chrome_options=options)
driver.implicitly_wait(15)
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I cannot use the option of waiting for a certain element to appear, because I'm scanning different pages and not all of them have embedded tweets. So, to check whether Selenium actually works, I run the script above alongside the script below, which doesn't use Selenium, and compare their results:
import requests
from bs4 import BeautifulSoup

url = 'https://www.bbc.co.uk/news/world-us-canada-44648563'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
tweets_soup = [s.get_text() for s in soup.find_all('p', {'dir': 'ltr'})]
tweets = '\n'.join(tweets_soup)
print(tweets)
I will really appreciate the help of this wonderful community!
I'm crawling a news website to extract all links, including the archived ones, which is typical of a news website. The site here has a View More Stories button that loads more articles. Now, the code below
def find_urls():
    start_url = "e.vnexpress.net/news/business"
    r = requests.get("http://" + start_url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = soup.findAll('a')
    url_list = []
    for url in links:
        all_link = url.get('href')
        if all_link.startswith('http://e.vnexpress.net/news/business'):
            url_list.append(all_link)
    return set(url_list)
successfully loads quite a few URLs, but how do I load more? Here is a snippet of the button:
<a href="javascript:void(0)" id="vnexpress_folder_load_more" data-page="2"
data-cate="1003895">
View more stories
</a>
Can someone help me out? Thanks.
You can use a browser driven by Selenium to click the button until it disappears or becomes disabled. Finally, you can scrape the entire page with BeautifulSoup in one go.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# initializing browser
driver = webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get("http://e.vnexpress.net/news/news")

# run this till the button is present
elem = driver.find_element_by_id('vnexpress_folder_load_more')
elem.click()
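A fuller sketch of the click-until-gone loop described above might look like this. The button id comes from the question's HTML snippet and the section URL comes from the question; the two-second wait and the final BeautifulSoup pass are assumptions:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://e.vnexpress.net/news/business")

# Keep clicking "View more stories" until the button disappears or stops working
while True:
    try:
        driver.find_element_by_id('vnexpress_folder_load_more').click()
        time.sleep(2)  # give the newly loaded articles time to render
    except Exception:
        break

# Scrape the fully expanded page in one go with BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
urls = {a['href'] for a in soup.findAll('a', href=True)
        if a['href'].startswith('http://e.vnexpress.net/news/business')}
print(urls)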