How to get all the webpage elements - python

I am entirely new to web scraping and have been looking at a few YouTube videos and online tutorials to get me started.
So far, I have been trying to get all the webpage elements from the following website: https://www.letsride.co.uk/routes/search?sort_by=rating
Here is what I have so far:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

s = HTMLSession()
url = 'https://www.letsride.co.uk/routes/search?sort_by=rating'

def getdata(url):
    r = s.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

# for i in range(1, 103):
def getnextpage(soup):
    page = soup.find('ul', {'class': 'pagination'})
    return page

soup = getdata(url)
print(getnextpage(soup))
This prints:
<ul class="pagination">
<li class="disabled"><span>«</span></li>
<li class="active"><span>1</span></li>
<li>2</li>
<li>3</li>
<li>4</li>
<li>5</li>
<li>6</li>
<li>7</li>
<li>8</li>
<li class="disabled"><span>...</span></li>
<li>101</li>
<li>102</li>
<li>»</li>
</ul>
This is not exactly what I am looking for. I want to work through the HTML elements of every page, from the first page to the last, for example:
https://www.letsride.co.uk/routes/search?sort_by=rating&page=1
https://www.letsride.co.uk/routes/search?sort_by=rating&page=2
...
..
.
https://www.letsride.co.uk/routes/search?sort_by=rating&page=102
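One way to get there (a sketch, not tested against the live site) is to read the highest page number out of that pagination block and then loop over the page URLs, reusing the getdata function above:

def get_last_page(soup):
    # take the largest purely numeric entry in the pagination list
    pagination = soup.find('ul', {'class': 'pagination'})
    numbers = [int(li.get_text(strip=True))
               for li in pagination.find_all('li')
               if li.get_text(strip=True).isdigit()]
    return max(numbers) if numbers else 1

soup = getdata(url)
last_page = get_last_page(soup)

for page in range(1, last_page + 1):
    page_url = f'{url}&page={page}'
    page_soup = getdata(page_url)
    # ...extract whatever elements you need from page_soup here...
    print(page_url)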

You can use Selenium with Python to simulate a browser, load the site, and then click the button as many times as you want, or until the button is no longer there. I chose to do it only 10 times because the list seems to be almost endless.
Then I printed out all the URLs on the site, but you could just as easily store them in a list instead.
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver import ActionChains
import time

options = Options()
options.headless = False
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get("https://www.letsride.co.uk/routes/search?sort_by=rating")

load_more = True
#while load_more:
for i in range(10):
    time.sleep(0.2)
    try:
        load_more_btn = driver.find_element_by_xpath('/html/body/div[2]/section/div[2]/div/div[3]/div/a')
        load_more_btn.click()
    except:
        load_more = False

links = driver.find_elements_by_xpath("//a[@href]")
for link in links:
    print(link.get_attribute('href'))

You could use a cleaning function to get rid of non-URL elements: basically, you check each element against a variable that holds the canonical URL prefix (https://...).
I haven't tested this against your exact code, sorry; hopefully you can slot it in accordingly.
tester = "https://www.letsride.co.uk"  # modify this variable to suit your needs

def cleaner(data):
    clean_data = []
    for item in data:
        # keep only entries that start with the expected URL prefix
        if item[0:len(tester)] == tester:
            clean_data.append(item)
    return clean_data
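Tying this to the Selenium snippet above, a usage sketch (assuming links still holds the elements found earlier) would be:

hrefs = [link.get_attribute('href') for link in links]
route_links = cleaner(hrefs)
for route in route_links:
    print(route)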

Related

Extract all links from drop down list combination

I have a sample website and I want to extract all the "href links" from it. It has two drop-downs, and once a drop-down option is selected it displays results with a link to a manual to download.
It does not navigate to a different page; instead it shows the results on the same page. I have extracted the combinations of the drop-down lists, and I am now trying to extract the manual links but am unable to find them.
The code is as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
from bs4 import BeautifulSoup
import requests

url = "https://www.cars.com/"
driver = webdriver.Chrome('C:/Users/webdrivers/chromedriver.exe')
driver.get(url)
time.sleep(4)
selectYear = Select(driver.find_element_by_id("odl-selected-year"))
data = []
for yearOption in selectYear.options:
    yearText = yearOption.text
    selectYear.select_by_visible_text(yearText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_id("odl-selected-model"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([yearText, modelText])
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, 'html.parser')
        content = soup.findAll('div', attrs={"class": "odl-results-container"})
        for i in content:
            x = i.findAll(['h3', 'span'])
            for y in x:
                print(y.get_text())
The print does not show any data. How can I get the links for the manuals? Thanks in advance.
You need to click the button for each car model and year and then retrieve the rendered HTML page source from your Selenium webdriver rather than with requests.
Add this in your inner loop:
button = driver.find_element_by_link_text("Select this vehicle")
button.click()
page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')
content = soup.findAll('a', attrs={"class": "odl-download-link"})
for i in content:
    print(i["href"])
This prints out:
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=6875&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O91668&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7126&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O134871&VIN=&userMarket=GBR
http://www.fordservicecontent.com/Ford_Content/vdirsnet/OwnerManual/Home/Index?Variantid=7708&languageCode=EN&countryCode=USA&marketCode=US&bookcode=O177941&VIN=&userMarket=GBR
...

beautifulsoup find function returns "-" when retrieving text

I am trying to get the text value inside a span tag that has an id attribute, using BeautifulSoup. But it returns no text, only a '-'.
I have also tried scraping via the div tag with the class attribute and then navigating to the span tag using the findChildren() function, but it still returns a "-". Here is the HTML that I am trying to scrape from the website https://etherscan.io/tokens-nft.
<div class="row align-items-center">
<div class="col-md-4 mb-1 mb-md-0">Transfers:</div>
<div class="col-md-8"></div>
<span id="totaltxns">266,765</span><hr class="hr-space">
</div>
And here is my python code:
from urllib2 import Request, urlopen
from bs4 import BeautifulSoup as soup
import array

url = 'https://etherscan.io/tokens-nft'
response = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page_html = urlopen(response).read()
page_soup = soup(page_html, 'html.parser')

count = 0
total_nfts = 2691  # Hard-coded value
supply = []
totalAddr = []
transCount = []
row = []

print('All non-fungible tokens in order of Transfers')
for nfts in page_soup.find_all("a", class_='text-primary'):
    link = nfts.get('href')
    new_url = "https://etherscan.io/" + link
    name = nfts.text
    print('NFT ' + name)
    response2 = Request(new_url, headers={'User-Agent': 'Mozilla/5.0'})
    phtml = urlopen(response2).read()
    psoup = soup(phtml, 'html.parser')
    # Get tags
    tags = []
    #print('Tags')
    for allTags in psoup.find_all("a", class_='u-label u-label--xs u-label--secondary'):
        tags.append(allTags.text.encode("ascii"))
    count += 1
    if len(tags) != 0:
        print(tags)
    # Get total supply
    ts = psoup.find("span", class_="hash-tag text-truncate")
    ts = ts.text
    #print(ts)
    # Get holders
    holders = psoup.find("div", {"id": "ContentPlaceHolder1_tr_tokenHolders"})
    holders = holders.findChildren()[1].findChildren()[1].text
    #print(holders)
    # Get transfers/transactions
    print(psoup.find("span", attrs={"id": "totaltxns"}).text)
print('Total number of NFTS ' + str(count))
I have also tried:
transfers = psoup.find("span", attrs={"id":"totaltxns"})
but that doesn't work either.
The correct parsing should return 266,765.
To find the element by id you can use soup.find(id='your_id').
Try this:
from bs4 import BeautifulSoup as bs
html = '''
<div class="row align-items-center">
<div class="col-md-4 mb-1 mb-md-0">Transfers:</div>
<div class="col-md-8"></div>
<span id="totaltxns">266,765</span><hr class="hr-space">
</div>
'''
soup = bs(html, 'html.parser')
print(soup.find(id='totaltxns').text)
Outputs:
266,765
If you look at the page source for the link you've mentioned, the value in totaltxns is -. That's why it's returning -.
The value might just be populated with some javascript code on the page.
UPDATE
urlopen().read() simply returns the initial page source received from the server without any further client-side changes.
You can achieve your desired output using Selenium + Chrome WebDriver. The idea is we let the javascript in page run and parse the final page source.
Try this:
from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome # pip install selenium
from selenium.webdriver.chrome.options import Options
url='https://etherscan.io/token/0x629cdec6acc980ebeebea9e5003bcd44db9fc5ce'
#Make it headless i.e. run in backgroud without opening chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")
# use Chrome to get page with javascript generated content
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
browser.get(url)
page_source = browser.page_source
#Parse the final page source
soup = bs(page_source, 'html.parser')
print(soup.find(id='totaltxns').text)
Outputs:
995,632
More info on setting up webdriver + example is in another StackOverflow question here.

Print elements using dt class name selenium python

I am trying to write a simple scraper for Sales Navigator in LinkedIn, and this is the link I am trying to scrape. It shows the search results for specific filter options selected for account results.
The goal I am trying to achieve is to retrieve every company name among the search results. Upon inspecting the link elements carrying the company name (e.g. Facile.it, AGT International), I see the following markup, showing the dt class name:
<dt class="result-lockup__name">
<a id="ember208" href="/sales/company/2429831?_ntb=zreYu57eQo%2BSZiFskdWJqg%3D%3D" class="ember-view"> Facile.it
</a> </dt>
I basically want to retrieve those names and open the URL given in the href.
Note that all the company name links have the same dt class, result-lockup__name. The following script is an attempt to collect the list of all company names displayed in the search results, along with their elements.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

def scrape_accounts(url):
    url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    search_results = []
    search_results = driver.find_elements_by_class_name("result-lockup__name")
    print(search_results)

if __name__ == "__main__":
    scrape_accounts("lol")
However, the result prints an empty list. I am still learning how to scrape different parts of a web page and different elements, so I am not sure if I got this right. What would be the correct way to do this?
I'm afraid I can't get to the page that you're after, but I notice that you're importing BeautifulSoup without using it.
Try:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

def scrape_accounts(url=url):
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
    soup = BeautifulSoup(html, 'html.parser')
    search_results = soup.select('dt.result-lockup__name a')
    for link in search_results:
        print(link.text.strip(), link['href'])
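If the list still comes back empty, the results may simply not have rendered by the time the page source is read (the commented-out implicitly_wait hints at this). An explicit wait is one option; a sketch (assuming the result-lockup__name elements eventually appear) that could replace the plain driver.get(url) call inside scrape_accounts:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# wait up to 30 seconds for at least one result name to be present in the DOM
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "result-lockup__name"))
)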

Extract URL from a website including archived links

I'm crawling a news website to extract all links, including the archived ones, which is typical of a news website. The site here has a View More Stories button that loads more articles. Now this code below
import requests
from bs4 import BeautifulSoup

def find_urls():
    start_url = "e.vnexpress.net/news/business"
    r = requests.get("http://" + start_url)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    links = soup.findAll('a')
    url_list = []
    for url in links:
        all_link = url.get('href')
        if all_link.startswith('http://e.vnexpress.net/news/business'):
            url_list.append(all_link)
    return set(url_list)
successfully loads quite a few URLs, but how do I load more? Here is a snippet of the button:
<a href="javascript:void(0)" id="vnexpress_folder_load_more" data-page="2"
data-cate="1003895">
View more stories
</a>
Can someone help me out? Thanks.
You can use a browser automation tool like Selenium to click the button until it disappears or is disabled. Then you can scrape the entire expanded page with BeautifulSoup in one go.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# initializing browser
driver = webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get("http://e.vnexpress.net/news/news")

# run this till the button is present
elem = driver.find_element_by_id('vnexpress_folder_load_more')
elem.click()
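A sketch of the full loop described above, clicking until the button is gone and then handing the rendered page to BeautifulSoup in one go (the exception types are an assumption about how the button disappears or gets disabled):

from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException
import time

while True:
    try:
        elem = driver.find_element_by_id('vnexpress_folder_load_more')
        elem.click()
        time.sleep(1)  # give the newly loaded stories time to render
    except (NoSuchElementException, ElementNotInteractableException):
        break  # button is gone or no longer clickable

# now parse the fully expanded page in one pass
soup = BeautifulSoup(driver.page_source, 'html.parser')
article_links = {a.get('href') for a in soup.find_all('a')
                 if a.get('href', '').startswith('http://e.vnexpress.net/news/business')}
print(len(article_links))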

Python Selenium accessing HTML source - After Search

Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
path = "C:\\Python27\\chromedriver\\chromedriver"
driver = webdriver.Chrome(executable_path=path)
# Open Chrome
driver.get("http://www.thehindu.com/")
# 10 Second Delay
time.sleep(10)
elem = driver.find_element_by_id("searchString")
# Enter Keyword
elem.send_keys("unilever")
elem.send_keys(Keys.RETURN)
time.sleep(10)
# Problem Here
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
print soup
Above is the code.
I want to scrape data from "http://www.thehindu.com/". It searches for the word "unilever" in the search box and redirects to the results page (link for the search page).
Now my question is: how can I get the source code of the searched page?
Basically I want news related to "Unilever".
You can get text inside <body>:
body = driver.find_element_by_tag_name("body")
bodyText = body.get_attribute("innerText")
Then you can find your keyword in bodyText.
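For example, a minimal follow-up on the snippet above that prints every line of visible text mentioning the keyword:

# print every line of visible text that mentions the keyword
for line in bodyText.splitlines():
    if "unilever" in line.lower():
        print(line.strip())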
