Scrape website traffic from SEMrush using Beautiful Soup in Python

I'm trying to scrape website traffic from semrush.com.
My current code using BeautifulSoup is:
from bs4 import BeautifulSoup
import urllib.request  # a bare "import urllib" does not load the request submodule
req = urllib.request.Request('https://www.semrush.com/info/burton.com', headers={'User-Agent': 'Magic Browser'})
response = urllib.request.urlopen(req)
raw_data = response.read()
response.close()
soup = BeautifulSoup(raw_data, 'html.parser')  # name the parser explicitly
I've been trying data = soup.findAll("a", {"href":"/info/burton.com+(by+organic)"}) or data = soup.findAll("span", {"class":"sem-report-counter"}) without much luck.
I can see the numbers on the webpage that I would like to get. Is there a way to pull this information off? I'm not seeing it in the html I pull.

I went the extra mile and set up a working example of how you can use selenium to scrape that page. Install selenium and try it out!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
url = 'https://www.semrush.com/info/burton.com'  # your URL
options = Options()  # set up options
options.add_argument('--headless')  # run Chrome without a visible window
# note: the chromedriver path depends on where the binary is located on your machine
# (Selenium 4 syntax; older releases took executable_path= and chrome_options= instead)
driver = webdriver.Chrome(service=Service('/opt/ChromeDriver/chromedriver'), options=options)
driver.get(url)  # load the page
driver.implicitly_wait(1)  # poll for up to 1 second when locating elements
elements = driver.find_elements(By.XPATH, '//a[@href="/info/burton.com+(by+organic)"]')  # grab the links you wanted
for e in elements:
    print(e.get_attribute('text').strip())  # print text fields
driver.quit()  # close the driver when you're done
Output that I see in my terminal:
356K
6.5K
59.3K
$usd305K
Organic keywords
Organic
Top Organic Keywords
View full report
Organic Position Distribution

Related

I'm trying to pull the table values on a website, but an empty list appears

I want to extract data from this site using Python, but when I fetch the table with the requests and BeautifulSoup libraries, I get an empty list. Can you help me with this?
table in the URL
the website
import requests
from bs4 import BeautifulSoup
url2 = "https://www.mackolik.com/mac/trabzonspor-vs-sivasspor/karsilastirma/5x6r419402ucyya2zf0ehbqxg"
r = requests.get(url2)
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find_all('ul', class_="Opta-TabbedContent")
table
out: []
The page is loaded dynamically with JavaScript, so requests only receives the initial HTML, which does not contain the table. You need a browser automation tool such as Selenium, which can drive a headless browser:
from selenium.webdriver.chrome.options import Options
url = "https://www.mackolik.com/mac/trabzonspor-vs-sivasspor/karsilastirma/5x6r419402ucyya2zf0ehbqxg"
options = Options()
options.add_argument('--headless')
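The snippet above stops after configuring the options. Here is a minimal sketch of how the remaining steps might look, assuming Selenium 4 with Chrome available. `render_page`, `ClassCounter`, and `count_lists` are hypothetical helper names, the two HTML strings are stand-ins rather than markup taken from the real site, and the standard-library `html.parser` is used in place of BeautifulSoup so the parsing step can be tried without a browser.

```python
from html.parser import HTMLParser


def render_page(url):
    """Return the HTML of `url` after the browser has run the page's JavaScript."""
    # Imported lazily so the parsing helpers below work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser, even if loading fails


class ClassCounter(HTMLParser):
    """Count <ul> elements that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'ul' and self.css_class in classes:
            self.count += 1


def count_lists(html, css_class='Opta-TabbedContent'):
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Stand-in markup: what requests might see versus what the rendered page might contain.
static_html = '<div id="widget"></div>'
rendered_html = '<div id="widget"><ul class="Opta-TabbedContent Opta-Tabs"><li>Stats</li></ul></div>'
print(count_lists(static_html))    # 0
print(count_lists(rendered_html))  # 1
```

In real use, `count_lists(render_page(url))` would replace the stand-in strings. A count of 0 on the rendered page would suggest the widget had not finished loading when the HTML was read.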

Webscraping Live data

I am currently trying to scrape live stock market data from the yahoo finance page.
I am using bs4. My current issue is that whenever I run my script, it does not update properly to reflect the current price of the stock.
If anybody has any advice on how to change that it would be appreciated.
import requests
from bs4 import BeautifulSoup
while True:
    page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
    soup = BeautifulSoup(page.text, "html.parser")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(price)
This is not possible with bs4 alone.
This website uses JavaScript to update the page, and requests, urllib, and similar libraries only fetch the static HTML of the page; they do not execute JavaScript or AJAX calls.
PhantomJS or Selenium drive a real browser engine that runs the page's JavaScript, which makes dynamic websites scrapable.
Using Selenium, it can be done as follows:
from selenium import webdriver  # browser automation library
from selenium.webdriver.chrome.service import Service
import time
from bs4 import BeautifulSoup as soup
# configure the Chrome browser
chrome_options = webdriver.ChromeOptions()
# hide the Chrome window
chrome_options.add_argument("--headless")
# start Chrome once with these options (Selenium 4 syntax; adjust the chromedriver path for your machine)
# the original created a second driver with webdriver.Chrome(), which discarded the headless setting
driver = webdriver.Chrome(service=Service('C:/Users/shary/Downloads/chromedriver.exe'), options=chrome_options)
# URL we need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'
# load the page just as a normal Chrome window would
driver.get(url)
while True:
    # re-read the rendered HTML on every pass so the changing price is picked up
    html = driver.page_source
    page_soup = soup(html, features="lxml")
    price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
    print(price)
    time.sleep(5)
The comments are not the best, but I hope they make the code understandable. Otherwise, a YouTube tutorial will give you a better idea of what a Selenium bot does.
Hope this helps; it works perfectly for me.
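The loop above can be generalised a little. The sketch below separates the polling logic from the browser so it can be exercised without Selenium. `poll_prices` and `read_price` are hypothetical names, not part of any library, and the canned readings are invented for the demonstration. Unlike the answer's loop, this version only reports a reading when it differs from the previous one, which is a design choice rather than a requirement.

```python
import time


def poll_prices(read_price, interval=5.0, max_reads=None, sleep=time.sleep):
    """Yield a price each time it differs from the previous reading.

    read_price: zero-argument callable returning the current price text.
    max_reads:  stop after this many reads (None means poll forever).
    sleep:      injectable so tests can skip the real delay.
    """
    last = None
    reads = 0
    while max_reads is None or reads < max_reads:
        price = read_price()
        reads += 1
        if price != last:
            last = price
            yield price
        # Skip the pause after the final read.
        if max_reads is None or reads < max_reads:
            sleep(interval)


# Demonstration with canned readings instead of a live page.
fake_readings = iter(['0.6012', '0.6012', '0.6015', '0.6011'])
changes = list(poll_prices(lambda: next(fake_readings), max_reads=4, sleep=lambda s: None))
print(changes)  # ['0.6012', '0.6015', '0.6011']
```

With a live driver, `read_price` would wrap the two lines from the answer that read `driver.page_source` and look up the price element, so the browser is started once and only the rendered HTML is re-read on each pass.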

Trying to scrape from multiple pages with the same link

from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver
driver = webdriver.Chrome(r'C:\chromedriver.exe')
url ='https://www.sambav.com/hyderabad/doctors'
driver.get(url)
soup = BeautifulSoup(driver.page_source,'html.parser')
for links in soup.find_all('div',class_='sambavdoctorname'):
    link = links.find('a')
    print(link['href'])
driver.close()
I am trying to scrape this page; the URL is the same for every page of results. I want to extract the links from all of the pages, but the script produces no output and no error; the program simply ends.
If you inspect the site with your browser's developer tools (Chrome, Firefox, or any other), you will see that the page fetches its data from several sources while loading. One of these sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=" . Your code can be simplified, and there is no need to use Selenium:
import requests
r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
    print(BASE_URL_DOCTOR + item['uniqueName'])
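The same idea can be written so that the query parameters are explicit and the JSON handling can be tried offline. This is only a sketch: `build_search_url` and `doctor_links` are hypothetical helpers, the sample records are invented for illustration, and the only field assumed from the real API is `uniqueName`, as used in the answer.

```python
from urllib.parse import urlencode

API_URL = 'https://www.sambav.com/api/search/DoctorSearch'
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'


def build_search_url(city, search_text='', location=''):
    """Assemble the search endpoint URL with the same parameters as the answer."""
    query = urlencode({'searchText': search_text, 'city': city, 'location': location})
    return API_URL + '?' + query


def doctor_links(items):
    """Turn decoded JSON records into profile URLs, skipping records without a name."""
    return [BASE_URL_DOCTOR + item['uniqueName'] for item in items if item.get('uniqueName')]


print(build_search_url('Hyderabad'))
# https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=

# Invented records standing in for r.json().
sample = [{'uniqueName': 'dr-example-one'}, {'uniqueName': 'dr-example-two'}, {}]
print(doctor_links(sample))
```

Against the live site, the call would presumably be `doctor_links(requests.get(build_search_url('Hyderabad')).json())`, assuming the endpoint still returns a JSON list of records.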

How to bypass disclaimer while scraping a website

For work, I used to scrape the following website with "driver = webdriver.PhantomJS()". I was scraping the price and the date.
https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
This stopped working some days ago because of a disclaimer page that I have to accept first.
https://www.cash.ch/fonds-investor-disclaimer?redirect=fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf
Once I accept it, I can see the real content in the browser, but the driver apparently cannot: the printout is [], so it must still be on the disclaimer URL.
Please see code below.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
import os
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
#Swisscanto
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
s_swisscanto = BeautifulSoup(driver.page_source, 'lxml')
nav_sc = s_swisscanto.find_all('span', {"data-field-entry": "value"})
date_sc = s_swisscanto.find_all('span', {"data-field-entry": "datetime"})
print(nav_sc)
print(date_sc)
print("Done Swisscanton")
This should work (I think the button you want to click is "zustimmen"?):
driver = webdriver.PhantomJS()
driver.get("https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf")
accept_button = driver.find_element_by_link_text('zustimmen')  # the "agree" link on the disclaimer page
accept_button.click()
content = driver.page_source  # HTML of the page shown after accepting
More details here
python selenium click on button
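Before clicking, it can help to check whether the driver really landed on the disclaimer page. Here is a minimal sketch using only the standard library and the two URLs quoted in the question. `is_disclaimer` and `redirect_target` are hypothetical helper names, and treating the `redirect` parameter as a path relative to the site root is an assumption based on the URL shown.

```python
from urllib.parse import urlsplit, parse_qs, urljoin

DISCLAIMER_PATH = '/fonds-investor-disclaimer'


def is_disclaimer(url):
    """True if the URL points at the disclaimer page rather than the fund page."""
    return urlsplit(url).path == DISCLAIMER_PATH


def redirect_target(url):
    """Return the absolute URL the disclaimer would forward to, or None."""
    parts = urlsplit(url)
    values = parse_qs(parts.query).get('redirect')
    if not values:
        return None
    return urljoin(f'{parts.scheme}://{parts.netloc}/', values[0])


fund = 'https://www.cash.ch/fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf'
disclaimer = ('https://www.cash.ch/fonds-investor-disclaimer'
              '?redirect=fonds/swisscanto-ast-avant-bvg-portfolio-45-p-19225268/swc/chf')

print(is_disclaimer(disclaimer))            # True
print(is_disclaimer(fund))                  # False
print(redirect_target(disclaimer) == fund)  # True
```

In the scraper, the check would be applied to `driver.current_url` after `driver.get(...)`, and the "zustimmen" link would be clicked only when `is_disclaimer` returns True.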

Open a page programmatically in python

Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) ' 'AppleWebKit/535.1 (KHTML, like Gecko) ' 'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin
That page loads and displays much of its information with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it, you therefore have three options: use a browser that runs JavaScript and control it remotely; write the scraper itself in JavaScript; or deconstruct the site, figure out exactly what it loads with JavaScript and how, and see whether you can duplicate those calls.
You can use browser automation tools for this purpose.
For example, this simple Selenium script does the job.
from selenium import webdriver
from bs4 import BeautifulSoup
link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]  # table.contents is a plain list, so reach the span through table.span
print(vin)
BTW, table.contents[0] prints the entire span, including the span tags.
table.span.contents[0] prints only the VIN.
You could use Selenium, which drives a real browser. This works for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox()  # Get local session of Firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0")  # Load page
time.sleep(0.5)  # Let the page load
# Search for a "span" tag whose "id" attribute contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element(By.XPATH, "//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
print(e.text)
# Works for me: 4JGBF7BE9BA648275
browser.close()
You do not have to use Selenium.
Just make an additional get request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']
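A slightly more defensive variant of that lookup is sketched below: it builds the URL and reads the VIN from a decoded response, returning None when the expected keys are absent. The endpoint and the `car` / `vin` keys come from the answer above. `build_check_url`, `extract_vin`, and the sample payloads are hypothetical, and whether the endpoint still responds this way is not verified here.

```python
from urllib.parse import urlencode

CHECK_URL = 'https://www.clearvin.com/ads/iaai/check'


def build_check_url(stock_number):
    """Assemble the lookup URL; the empty vin parameter mirrors the answer's URL."""
    return CHECK_URL + '?' + urlencode({'stockNumber': stock_number, 'vin': ''})


def extract_vin(payload):
    """Return payload['car']['vin'] if present, otherwise None."""
    car = payload.get('car') if isinstance(payload, dict) else None
    if not isinstance(car, dict):
        return None
    return car.get('vin') or None


print(build_check_url('123456789'))
# https://www.clearvin.com/ads/iaai/check?stockNumber=123456789&vin=

# Sample payloads standing in for requests.get(url).json().
print(extract_vin({'car': {'vin': '4JGBF7BE9BA648275'}}))  # 4JGBF7BE9BA648275
print(extract_vin({'error': 'not found'}))                  # None
```

Returning None instead of raising KeyError lets the caller fall back to the Selenium approach when the request yields no VIN.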
