Hi everyone.
I am working on a Python project that uses Selenium to scrape data, and I have to scrape the data every 5 minutes. The problem is that Selenium with ChromeDriver is very slow: a full run takes at least 30 minutes, so I can't get data every 5 minutes.
If you have experience in this field, please help me. If you can suggest other approaches (for example BeautifulSoup), I will be very happy.
Note: the site I want to get data from is rendered using JavaScript.
This is my source code; I am still testing it:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import time
driver = webdriver.Chrome()
driver.set_window_size(800, 600)

tickerNames = []
tickerPrice = []

finvizUrl = "https://finviz.com/screener.ashx?v=111&f=exch_nasd,geo_usa,sh_float_u10,sh_price_u10,sh_relvol_o2"
nasdaqUrl = "https://www.nasdaq.com/market-activity/stocks/"

def openPage(url):
    driver.get(url)

def exitDrive():
    driver.quit()

def getTickers():
    tickers = driver.find_elements_by_class_name('screener-link-primary')
    for ticker in tickers:
        tickerNames.append(ticker.text)
    return tickerNames

def comparePrice(tickers):
    for ticker in tickers:
        openPage(nasdaqUrl + ticker)
        # append instead of assigning into the (initially empty) list by index
        tickerPrice.append(driver.find_element_by_class_name('symbol-page-header__pricing-price').text)
    return tickerPrice

openPage(finvizUrl)
# comparePrice() needs the ticker list as its argument
print(comparePrice(getTickers()))
There seems to be an API on the Nasdaq site that you can query (found using the browser's network tools), so there isn't really any need to use Selenium for this. Here is an example that gets the data using requests:
import requests
import lxml.html
import time
FINVIZ_URL = "https://finviz.com/screener.ashx?v=111&f=exch_nasd,geo_usa,sh_float_u10,sh_price_u10,sh_relvol_o2"
NASDAQ_URL = "https://api.nasdaq.com/api/quote/{}/summary?assetclass=stocks"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15"
}
session = requests.Session()
session.headers.update(headers)
r = session.get(FINVIZ_URL)
# Get data using lxml xpath, but use whatever parser you want
x = lxml.html.fromstring(r.text)
stocks = x.xpath("//*[@class='screener-link-primary']/text()")

for stock in stocks:
    data = session.get(NASDAQ_URL.format(stock))
    print(f"INFO for {stock}")
    print(data.json())  # This might have the data you want
    # Sleep in case there is a rate limit (may not be needed)
    time.sleep(5)
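If you only need specific fields rather than the raw dump, you can index into the JSON once you have inspected its shape. A minimal standalone sketch; the data/summaryData key names are assumptions about this endpoint's usual shape, so verify them against what print(data.json()) actually shows:

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

resp = session.get("https://api.nasdaq.com/api/quote/AAPL/summary?assetclass=stocks")
payload = resp.json()

# Key names below are assumptions: verify against the raw payload.
summary = (payload.get("data") or {}).get("summaryData") or {}
for label, field in summary.items():
    print(label, field)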
Here's the website in question:
https://www.gurufocus.com/stock/AAPL
The part that interests me is the GF Score shown in the upper part of the page: I need to extract the strings 'GF Score' and '98/100'.
Firefox Inspector gives me span.t-h6 > span:nth-child(1) as a CSS selector, but I can't seem to fetch either the numbers or the descriptor.
Here's the code that I've used so far to extract the "GF Score" part:
import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')

soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)
As a result, I'm getting three empty lists.
The XPath was taken directly out of Chrome via the copy function, and so was the nth-child expression in the BS4 part.
Any suggestions as to what might be at fault here?
Unfortunately, getting the page with the requests library is impossible, and the API it loads data from requires a signature. There are two options:
Use the official API. It isn't free, but it is much more convenient and faster.
Use Selenium. It's free, but slow unless you fine-tune the element waits. The second problem is the Cloudflare protection: without rotating headers and/or IPs you will probably get banned fairly soon. Here is an example:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        # Wait until the page has rendered (the register dialog appears last)
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', text='GF Score:').findNext('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")
tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()

for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')
OUTPUT:
AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
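On the IP side, Selenium itself cannot rotate addresses, but Chrome accepts a proxy flag at startup. A minimal sketch, with a placeholder proxy address you would replace with a real one:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder address: point this at a proxy you actually control.
options.add_argument('--proxy-server=http://127.0.0.1:8080')
driver = webdriver.Chrome(options=options)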
One way you could get the desired value is to make the request to the page while pretending the request comes from a browser, and then extract the info you need from the JSON object inside the script HTML tag.
NOTE: please be warned that I couldn't parse the JSON object and extract the value by following the path
js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']
So, as an alternative (not a very good one, IMHO, but it works for this purpose), I decided to find certain markers in the JSON/string result and then extract the desired value by slicing the string (i.e. substring).
Full code:
import requests
from bs4 import BeautifulSoup
import json
geturl = r'https://www.gurufocus.com/stock/AAPL'
getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}

s = requests.Session()
# Use the session we just created (the original called requests.get directly)
r = s.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")
# This is the "<script>" element tag that contains the full JSON object
# with all the data.
scripts = soup.findAll("script")[1]
# Get only the JSON data:
js_data = scripts.get_text("", strip=True)
# -- Get the value from the "gf_score" string - by getting its position:
# Part I: is where the "gf_score" begins.
part_I = js_data.find("gf_score")
# Part II: is where the final position is declared - in this case AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")
# Build the desired result and print it:
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)
Result:
GF Score: 98/100
The data is dynamic. I think rank is what you are looking for, but that API requires authentication. Maybe you can use Selenium or Playwright to render the page?
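For what it's worth, a minimal Playwright sketch of that idea; the wait target is an assumption, so adjust it after inspecting the rendered page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://www.gurufocus.com/stock/AAPL/summary')
    # 'text=GF Score' is a guess at a stable anchor on the rendered page.
    page.wait_for_selector('text=GF Score')
    html = page.content()  # fully rendered HTML, ready for parsing
    browser.close()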
I'm trying my hand at some python code, and am having a hell of a time with Selenium. Any help you could offer would be super appreciated. Long story short, I'm trying to pull the average rating of a given movie from Letterboxd.com. For example:
https://letterboxd.com/film/the-dark-knight/
The value I'm looking for is the average rating to 2 decimal places, which you can see if you mouse over the rating that's displayed on the page:
[Image: "Average Rating 4.43" shown on mouseover]
In this case, the average rating is 4.43, and that's the number I'm trying to retrieve.
So far, I've managed to grab the 1-decimal-place version using driver.find_elements_by_class_name('average-rating'), which returns "4.4". But I need "4.43".
I can see the correct value in the developer tools. It appears twice. Once here:
<span class="average-rating" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
4.4
And again in what appears to be metadata:
<meta name="twitter:data2" content="4.43 out of 5">
Any suggestions on how I can grab that value correctly? Thanks so much!
Cheers,
Ari
There is another way you might want to consider, which gets the rating along with the count of users who voted for it, given that both are available in the page source within a script tag.
import re
import json
import requests
URL = 'https://letterboxd.com/film/the-dark-knight/'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(URL)

# The ratings live in a JSON-LD script block; grab the JSON after the CDATA marker.
data = json.loads(re.findall(r"CDATA[^{]+(.*)", r.text)[0])
rating = data['aggregateRating']['ratingValue']
user_voted = data['aggregateRating']['ratingCount']
print(rating, user_voted)
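An equivalent way to reach the same block without running a regex over the whole page is to locate the JSON-LD script tag explicitly. A sketch that reuses the response r from above and assumes the CDATA comment wrapper is still present:

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
ld_json = soup.find('script', type='application/ld+json').string
# Strip the surrounding /* <![CDATA[ */ ... /* ]]> */ comment wrapper.
ld_json = ld_json.split('*/', 1)[1].rsplit('/*', 1)[0]
data = json.loads(ld_json)
print(data['aggregateRating']['ratingValue'], data['aggregateRating']['ratingCount'])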
Please find the code below and let me know if you don't understand anything. To hover over the main rating you should use ActionChains.
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
import time
driver = webdriver.Chrome()
driver.get("https://letterboxd.com/film/the-dark-knight/")
wait = WebDriverWait(driver, 20)
time.sleep(10)
Main_Rating = driver.find_element_by_class_name('average-rating')
print("Main Rating is :- " + Main_Rating.text)
time.sleep(5)
ActionChains(driver).move_to_element(Main_Rating).perform()
More_Rating_info = driver.find_element_by_xpath('//div[@class="twipsy-inner"]').text
More_Message = More_Rating_info.split()
print("More Rating :- " + More_Message[3])
Try the code below, using Beautiful Soup and requests.
Benefits of using Beautiful Soup and requests:
Fast in terms of getting results.
Fewer errors.
Easier access to HTML tags.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
def extract_avg_rating():
    movie_name = 'the-dark-knight'
    url = 'https://letterboxd.com/film/' + movie_name
    session = requests.Session()
    url_response = session.get(url, verify=False)
    soup = bs(url_response.text, 'html.parser')
    # Target the meta tag by name rather than a hard-coded index,
    # which breaks as soon as the page layout changes.
    extracted_meta = soup.find('meta', attrs={'name': 'twitter:data2'})
    extracted_rating = extracted_meta.attrs['content'].split(' ')[0]
    print('Movie ' + movie_name + ' rating ' + extracted_rating)

extract_avg_rating()
In the code above you can put any film name into the movie_name parameter (for example lucky-grandma) and it will give you the accurate rating. The code is dynamic and will help you extract other movies' ratings and other information as well.
I'm new to web scraping and Python. I have written a script before that worked just fine. I'm doing basically the same thing in this one, but it runs much slower.
This is my code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import selenium
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
import time
start = time.time()
opp = Options()
opp.add_argument('-headless')
browser = webdriver.Firefox(executable_path = "/Users/0581279/Desktop/L&S/Watchlist/geckodriver", options=opp)
browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end-start)
Sometimes a single page can take up to 2 minutes to load, and I'm only scraping Bloomberg.
Any help would be appreciated :)
Using requests and BeautifulSoup you can scrape information easily and quickly. Here is code to get the Key Statistics for Bloomberg's MSGFINA:LX:
import requests
from bs4 import BeautifulSoup
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.119 Safari/537.36',
    'DNT': '1'
}
response = requests.get('https://www.bloomberg.com/quote/MSGFINA:LX', headers=headers)
page = BeautifulSoup(response.text, "html.parser")
key_statistics = page.select("div[class^='module keyStatistics'] div[class^='rowListItemWrap']")
for key_statistic in key_statistics:
    fieldLabel = key_statistic.select_one("span[class^='fieldLabel']")
    fieldValue = key_statistic.select_one("span[class^='fieldValue']")
    print("%s: %s" % (fieldLabel.text, fieldValue.text))
Selenium's speed is affected by several things:
If the site is slow, the Selenium script is slow.
If the internet connection performs poorly, the Selenium script is slow.
If the computer running the script performs poorly, the Selenium script is slow.
These situations are usually not in our hands, but the programming is. One way to increase speed is to block image loading (if we don't need the images), which reduces the runtime. In Chrome this is the way to block it:
opp.add_argument('--blink-settings=imagesEnabled=false')
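Since the original question used Firefox, it's worth noting that Firefox has no Blink flag; the equivalent there is a profile preference. A small sketch, assuming Selenium's Firefox Options class:

from selenium.webdriver.firefox.options import Options

opp = Options()
# 2 = block all images (Firefox permissions.default.image preference)
opp.set_preference('permissions.default.image', 2)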
Also, once the driver has opened the page you don't need BeautifulSoup to get the data; Selenium's own functions provide it. Try the code below; Selenium will be faster:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # Chrome options, since we use webdriver.Chrome below
import time

start = time.time()
opp = Options()
opp.add_argument('--blink-settings=imagesEnabled=false')
driver_path = r'Your driver path'
browser = webdriver.Chrome(executable_path=driver_path, options=opp)
browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
get_element = browser.find_elements_by_css_selector("span[class='fieldValue__2d582aa7']")
print(get_element[6].text)
browser.quit()
end = time.time()
print(end-start)
So I made some alterations to your code and could load it almost instantly. I used the Chrome driver, which I had installed, and ran the following code.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import selenium
import time
start = time.time()
browser = webdriver.Chrome("/Users/XXXXXXXX/Desktop/Programming/FacebookControl/package/chromedriver")
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end-start)
While testing they did block me lol, so you might want to change headers every once in a while. It printed the price as well.
chromedriver link: http://chromedriver.chromium.org/
Hope this helps.
output was this:
34.54
7.527994871139526
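On the "change headers" point: Chrome's user agent can only be set at startup, so a common pattern is to pick a fresh string per run. A sketch, with example UA strings you should swap for current ones:

import random
from selenium import webdriver

# Example strings only: substitute real, current user agents.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15',
]

options = webdriver.ChromeOptions()
options.add_argument('user-agent={}'.format(random.choice(USER_AGENTS)))
browser = webdriver.Chrome("/path/to/chromedriver", options=options)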
I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term like, say, "Andrew", the results have pagination, and since the request type is POST the URL does not change. The sessions also time out very quickly: so quickly that if I wait ten minutes and refresh the search URL I get a timeout error.
I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome Tools, I have found the headers, and from the Network tab I have also found the following form data that is passed from the search page to the results page:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the fields in caps are hidden. I have also managed to figure out the structure of the results.
My script thus far is really pathetic, as I am completely blank on what to do next. I still have to submit the form, analyze the pagination, and scrape the results, but I have absolutely no idea how to proceed.
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup
class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    ##TO DO
    ##submit form
    #get return URL
    #scrape results
    #analyze pagination

if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()
Any help would be dearly appreciated
I disabled Javascript and visited https://www.searchiqs.com/nybro/, and in that state the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process Javascript and so will never be able to submit the form.
For this kind of problem you can use Selenium, which simulates a full browser, with the disadvantage of being slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
usr = ""
pwd = ""
driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
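From here the search form itself can be driven the same way. A sketch of the next step; the element IDs below are hypothetical, derived from the posted ctl00$ContentPlaceHolder1$... field names the way ASP.NET usually renders them, so verify them in the inspector:

# IDs are guesses based on the form-field names in the question.
search_box = driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName")
search_box.send_keys("david")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()
# The paginated results can then be scraped from driver.page_source.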
Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize, and provided a user agent as well, but none of them could see the VIN.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) ' 'AppleWebKit/535.1 (KHTML, like Gecko) ' 'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin
That page loads and displays much of its information with Javascript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore either need a browser that runs Javascript and can be controlled remotely, or to write the scraper itself in Javascript, or to deconstruct the site, figure out exactly what it loads with Javascript and how, and see if you can duplicate those calls.
You can use browser automation tools for this purpose. For example, this simple Selenium script can do the work:
from selenium import webdriver
from bs4 import BeautifulSoup
link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]
print vin
BTW, table.contents[0] prints the entire span, including the span tags.
table.span.contents[0] prints only the VIN number.
You could use Selenium, which drives a real browser. This works for me:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0") # Load page
time.sleep(0.5) # Let the page load
# Search for a tag "span" with an attribute "id" which contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element_by_xpath("//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
e.text
# Works for me : u'4JGBF7BE9BA648275'
browser.close()
You do not have to use Selenium.
Just make an additional GET request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']
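Since nothing guarantees this undocumented endpoint keeps its current shape, it's worth guarding the call. A small sketch; the 'car'/'vin' key path comes from the answer above, everything else is defensive boilerplate:

import requests

stock_number = '123456789'
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)

resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors
payload = resp.json()
# Guard against the 'car' key disappearing if the endpoint changes.
vin = payload.get('car', {}).get('vin')
print(vin)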