My web scraper using Python, BeautifulSoup, and Selenium returns "[]". Originally I was using only BeautifulSoup and had the same issue; since what I'm trying to scrape is weather data, I tried using Selenium.
Here are the code and HTML snippet. I'm really new to this, so thank you in advance :)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()
driver.get("https://wx.ikitesurf.com/spot/93670")
time.sleep(5)
windspeed = driver.find_elements_by_class_name("jw-spot-list-marker")
print(windspeed)
driver.close()
HTML:
<span class="jw-list-view-wind-desc">9 (g15) mph W</span>
The webpage I am trying to scrape can only be seen after login, so using a direct URL won't work. I need to scrape data while I am logged in using my Chrome browser. Then I need to get the value of the element from the page.
I have tried using the following code.
import requests
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
lastdatadate=[]
lastprocesseddate=[]
source = requests.get('webpage.com').text
content = driver.page_source
soup = bs(content, 'lxml')
#print(soup.prettify())
price = soup.find('span', attrs={'id':'calculatedMinRate'})
print(price.text)
You could still perform a login on the opened webdriver and fill in the input fields, as explained here: How to locate and insert a value in a text box (input) using Python Selenium?
Steps:
Fill in the input fields
Find the submit button and trigger a click event
Then add a sleep command; a few seconds should be enough
Afterwards you should be able to get the data, as in the sketch below.
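A minimal sketch of that flow. The field names ('username', 'password'), the submit-button XPath, and both URLs are assumptions for illustration; the real selectors must be taken from the actual login form:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time

driver = webdriver.Chrome()
driver.get('https://example.com/login')  # hypothetical login URL

# fill in the input fields (field names are assumed)
driver.find_element_by_name('username').send_keys('my_user')
driver.find_element_by_name('password').send_keys('my_password')
# find the submit button and trigger a click event
driver.find_element_by_xpath("//button[@type='submit']").click()
time.sleep(5)  # a few seconds for the post-login page to load

driver.get('https://example.com/protected-page')  # hypothetical target page
soup = bs(driver.page_source, 'lxml')
price = soup.find('span', attrs={'id': 'calculatedMinRate'})
print(price.text if price else 'element not found')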
Link that I am scraping: https://www.indusind.com/in/en/personal/cards/credit-card.html
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json, requests, re, sys
from selenium import webdriver
IndusInd_url = "https://www.indusind.com/in/en/personal/cards/credit-card.html"
html = requests.get(IndusInd_url)
soup = BeautifulSoup(html.content, 'lxml')
print(soup)
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
Using the above code I am trying to scrape the card titles, but unfortunately I am getting this output:
<html><body><p>This website is secured against online attacks. Your request was blocked due to suspicious behavior<br/>
<br/>
Client IP : 124.123.170.109<br/>
<br/>
Incident Time : 2021-02-24 06:28:10 UTC <br/>
<br/>
Incident ID : YDXx#m6g3nSFLvi5lGg4wgAAAf8<br/>
<br/>
If you feel it was a legitimate request, please contact the website owner for further investigation and remediation with a screenshot of this page.</p></body></html>
Is there any other alternative I can follow to scrape the details?
Any help is highly appreciated!
Please check this.
FYI: make sure you have the right driver (Firefox, Chrome, or whatever, with the matching version).
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url = 'https://www.indusind.com/in/en/personal/cards/credit-card.html'
# open the chrome driver
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
# pings the specified url
driver.get(url)
# sleep time to wait for t seconds to wait for page load
# replace 3 with any int value (int value in seconds)
time.sleep(3)
# gets the page source
pg = driver.page_source
# parse the page source with BeautifulSoup
soup = BeautifulSoup(pg, 'html.parser')
# get the titles of the card
for x in soup.select("#display-product-cards .text-primary"):
    print(x.get_text())
This can be achieved without BeautifulSoup.
I define the locator with the following XPath:
//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']
and use the expected condition presence_of_all_elements_located.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='webdrivers/chromedriver.exe')
driver.get('https://www.indusind.com/in/en/personal/cards/credit-card.html')
wait = WebDriverWait(driver, 20)
elements = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='display-product-cards']//a[@class='card-title text-primary' and text()!='']")))
for element in elements:
    print(element.get_attribute('innerHTML'))
driver.quit()
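A note on the design choice here: presence_of_all_elements_located returns as soon as the matching elements exist in the DOM, rather than always paying a fixed time.sleep, and WebDriverWait raises a TimeoutException after the 20 seconds if they never appear, so failures are explicit instead of silent empty results.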
I am very new to web scraping. I have the following URL:
https://www.bloomberg.com/markets/symbolsearch
So I use Selenium to type into the Symbol textbox and press Find Symbols to get the details. This is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
It returns the table. How can I retrieve it? I am pretty clueless.
Second question:
Can I do this without Selenium, as it is slowing things down? Is there a way to find an API which returns JSON?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox()
driver.get("https://www.bloomberg.com/markets/symbolsearch/")
element = driver.find_element_by_id("query")
element.send_keys("WMT:US")
driver.find_element_by_name("commit").click()
time.sleep(5)
url = driver.current_url
time.sleep(5)
parsed = requests.get(url)
soup = BeautifulSoup(parsed.content,'html.parser')
a = soup.findAll("table", { "class" : "dual_border_data_table" })
print(a)
Here is the complete code with which you can get the table you are looking for. Now do what you need to do after getting the table. Hope it helps.
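A hedged follow-up, not part of the original answer: since the driver has already rendered the results, the table can also be read straight from driver.page_source with pandas.read_html, which avoids the second fetch via requests (a fresh request does not share the browser's session):
import pandas as pd
# parse every table with that class from the rendered page source
# (requires lxml or html5lib to be installed as the parser backend)
tables = pd.read_html(driver.page_source, attrs={"class": "dual_border_data_table"})
if tables:
    print(tables[0])  # the first matching table as a DataFrame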
I'm using this code for scraping some data from the link https://website.grader.com/results/www.dubizzle.com
The code is below:
#!/usr/bin/python
import urllib
from bs4 import BeautifulSoup
from dateutil.parser import parse
from datetime import timedelta
import MySQLdb
import re
import pdb
import sys
import string
def getting_urls_of_all_pages():
    url_rent_flat = 'https://website.grader.com/results/dubizzle.com'
    every_property_in_a_page_data_extraction(url_rent_flat)

def every_property_in_a_page_data_extraction(url):
    htmlfile = urllib.urlopen(url).read()
    soup = BeautifulSoup(htmlfile, 'html.parser')
    print soup
    Sizeofweb = ""
    try:
        # find the element itself, then call get_text() on it
        Sizeofweb = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"})
        print Sizeofweb.get_text().encode("utf-8")
    except StandardError as e:
        error = "Error was {0}".format(e)
        print error

getting_urls_of_all_pages()
The part of the HTML which I am extracting is below.
Snap: https://www.dropbox.com/s/7dwbaiyizwa36m6/5.PNG?dl=0
Code:
<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">1.1</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">MB</span>
</div>
Problem: the website takes around 10-15 seconds to load the HTML source that contains the tags I want to extract, as mentioned in the code.
When the code uses the line htmlfile = urllib.urlopen(url).read() to load the HTML of the page, I think it loads the preload HTML of the link, which is what is there before those 10-15 seconds have passed.
How can I make the code pause and let the data load, so that after 15+ seconds the right HTML with the tags I want to extract is loaded into the program?
Someone recommended that I use Selenium. Here is the code, but I am not sure whether it can be integrated into my code and serve the purpose:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox()
driver.get(baseurl)
Maybe there is some AJAX, and that is why you are not getting the expected response with urllib. Selenium is a good solution to that problem.
For selenium use the following:
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get(baseurl)
time.sleep(15)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
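A hedged refinement: the snippet above imports WebDriverWait but never uses it. Instead of a fixed 15-second sleep, an explicit wait can block until the span from the question's HTML (identified by its data-reactid) is actually present:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('https://website.grader.com/results/dubizzle.com')
# wait up to 30 seconds for the size value to appear in the DOM
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "span[data-reactid='.0.0.3.0.0.3.$0.1.1.0']")))
soup = BeautifulSoup(driver.page_source, 'html.parser')
span = soup.find('span', {'data-reactid': '.0.0.3.0.0.3.$0.1.1.0'})
print(span.get_text())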
I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium.
I have the following code so far. It still somehow doesn't detect the JavaScript content (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom. (Inspect element shows the class as postText.)
Thanks for the help!
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup.BeautifulSoup(html_source)
comments = soup("div", {"class":"postText"})
print comments
There are some mistakes in your code that are fixed below. However, the class "postText" must exist elsewhere, since it is not defined in the original source code.
My revised version of your code was tested and is working on multiple websites.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source,'html.parser')
#class "postText" is not defined in the source code
comments = soup.findAll('div',{'class':'postText'})
print comments
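One hedged addition beyond the fixes above: Facebook's comments plugin normally renders inside an iframe, and an iframe's contents do not appear in the parent page's page_source. If that is the case here, you have to switch into the frame before grabbing the source. A sketch, where the iframe selector is an assumption that must be checked against the actual page:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
# hypothetical selector: inspect the page for the real comments iframe
iframe = browser.find_element_by_css_selector("iframe[id^='fb']")
browser.switch_to.frame(iframe)  # page_source now reflects the iframe document
soup = BeautifulSoup(browser.page_source, 'html.parser')
comments = soup.findAll('div', {'class': 'postText'})
print(comments)
browser.quit()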