I am attempting to retrieve data from pro-football-reference.com, but the site bans you if you request pages too quickly, so I downloaded all of the HTML pages to a data folder on my local machine.
The issue is that when I open one of these files using a Selenium webdriver, it takes about 95 seconds to open it and retrieve the HTML. Below is the code that I am running. The input 'box_score_file' is one of the local files I downloaded; on average the files are around 500 KB.
Is this just due to the size of my files, or is there a more efficient way to retrieve the HTML?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

def Parse_File(box_score_file):
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options=options)
    driver.get(f'file://{box_score_file}')  # open the downloaded HTML file
    html = driver.page_source
    driver.quit()  # close this Chrome instance so browsers don't pile up
    soup = BeautifulSoup(html, 'html.parser')
    # strip the duplicated header rows that pro-football-reference adds to its tables
    for s in soup.select('tr.over_header'):
        s.decompose()
    for s in soup.select('tr.thead'):
        s.decompose()
    return soup
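For context on where the time goes: a brand-new Chrome instance is launched inside Parse_File for every file, which is usually far more expensive than loading a 500 KB local page. Below is a minimal sketch of reusing a single driver across files; the parse_file helper and the file paths are hypothetical, not taken from the code above.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)  # launch Chrome once, up front

def parse_file(driver, box_score_file):
    # reuse the already-running browser for each local file
    driver.get(f'file://{box_score_file}')
    return BeautifulSoup(driver.page_source, 'html.parser')

# hypothetical list of downloaded box score pages
box_score_files = ['/data/box_score_1.html', '/data/box_score_2.html']
soups = [parse_file(driver, path) for path in box_score_files]
driver.quit()  # close the single browser once all files are parsed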
I've tried using the requests library, but since some of the tables in the HTML file are hidden behind JavaScript, requests can't retrieve all of the tables that I want to build a DataFrame from.
I've also tried Playwright, but whenever I use it I get the error ModuleNotFoundError: No module named 'pyee.asyncio'. Running 'pip install pyee' just makes Anaconda report that the requirement is already satisfied. The async Playwright API gives me the same 'No module named pyee.asyncio' error.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout
from bs4 import BeautifulSoup

def Parse_File(box_score_file):
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(f'file://{box_score_file}')  # local files need a file:// URL
            print(page.title())
            html = page.content()  # inner_html() needs a selector; content() returns the full page HTML
            browser.close()
    except PlaywrightTimeout:
        print(f"Timeout error on {box_score_file}")
        return None
    soup = BeautifulSoup(html, 'html.parser')
    for s in soup.select('tr.over_header'):
        s.decompose()
    for s in soup.select('tr.thead'):
        s.decompose()
    return soup
This is my first time posting, so please let me know if I can provide any additional information or details and thank you for your time.
I am currently trying to scrape live stock market data from the Yahoo Finance page.
I am using bs4. My current issue is that whenever I run my script, it does not update properly to reflect the current price of the stock.
If anybody has any advice on how to change that it would be appreciated.
import requests
from bs4 import BeautifulSoup

while True:
    page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
    soup = BeautifulSoup(page.text, "html.parser")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(price)
This is not possible with BS4 alone.
This website uses JavaScript to update the page, and urllib/requests only fetch the static HTML of the page, not content rendered by JavaScript or AJAX.
PhantomJS or Selenium drive a real browser that can run the JavaScript powering dynamic websites. Try using one of those :)
Using Selenium, it can be done as:
from selenium import webdriver
import time
from bs4 import BeautifulSoup as soup

# configure Chrome to run headless (no visible browser window)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")

# start Chrome with those options; point executable_path at your local chromedriver
driver = webdriver.Chrome(chrome_options=chrome_options,
                          executable_path='C:/Users/shary/Downloads/chromedriver.exe')

# URL we need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'

# open the page, just as you would in Chrome
driver.get(url)

while 1:
    # grab the rendered HTML repeatedly so you see the changing price
    html = driver.page_source
    page_soup = soup(html, features="lxml")
    price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
    print(price)
    time.sleep(5)
The comments aren't the best, but I hope you can follow it :) Otherwise, watch a YouTube tutorial to get a proper idea of what a Selenium bot does.
Hope this helps. It's working perfectly for me :)
I am creating an automated account creator in PyCharm and am facing a problem I have yet to find a good solution for. I want to get the reCAPTCHA site-key so I can pass the captcha to a solving service I have bought. I used requests.get, but it gives back None. My program uses Selenium, and after some thought I realized that even if requests.get worked, it would bring back a different key than the one currently displayed in my Selenium driver. I googled a lot and only found a module named Selenium-Requests, which doesn't support Edge. I am using Edge because it is the only browser everyone has and it doesn't require the developer version like Chrome and Firefox do.
In general, I haven't found a way to retrieve the key from within my driver.
This is the retrieve code:
registerurl = requests.get(url)
registerurlstring = ''.join(str(e) for e in registerurl)
soup = BeautifulSoup(registerurlstring, features="html5lib")
hidden_tags = soup.find({"id": "recaptcha-token"})
sitekey = hidden_tags
try:
    print('Sitekey = ', sitekey)
except:
    print('Sitekey = Not Found')
I am not sure whether this is what you are after or not. The recaptcha value is inside an iframe, so you have to target the src value of that iframe; you can then fetch that URL with the Python requests module and read the value of the hidden input.
import requests
from bs4 import BeautifulSoup
url='https://www.google.com/recaptcha/api2/anchor?ar=1&k=6Lc3HAsUAAAAACsN7CgY9MMVxo2M09n_e4heJEiZ&co=aHR0cHM6Ly9zaWdudXAuZXVuZS5sZWFndWVvZmxlZ2VuZHMuY29tOjQ0Mw..&hl=en&v=A1Aard-wURuGsXRGA7JMOqVO&theme=dark&size=invisible&badge=bottomright&cb=ezyy1frci5ms'
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input', attrs={"id": "recaptcha-token"})
print(hidden_tags['value'])
Output:
03AOLTBLQFd9hdHGmOesrT0xDcA8MkI6FGIiM3892Uws3aEWzPxUT8-U8IBEZHYzUEba2Jp9m3s9z_sz_fuij9OXZHABulFrI8YCD95kXV_H6xTO9vOubuZfzscleb6fdkkAE3IwUUSdTzPbXILy6SGLPI3LpPUptC1enZLIkQxQq9T8AEPPvCIsVgGe4jSE_l1jCWIRmBeBXsLgPLABZSq6ah6QWFfAngdC1rQaLMKWzLBmzh6ytEEGNYHmEG7P6UVtYcTI1IRIvq-ba-oGIUS1ELUb-1d3upQ29JWBtQ2t7_VNn237fguztf_FUDEHnAfHppUsrz-ZlkE00sMXFCuQ1XF6Qz7lH2j5g2z5KZQiODhRUBRRyd-ydjetz053bKRcgWpnNoZGNf1GBlW5inL9AtyYTkpruttw5sruAPuVgs5mrniQ5hrHNvfDIZKX905T2E21W2DsW1_07rItFYa-zkylMU83YXRQ
Hope this helps.
Updated code to get the iframe src value using webdriver.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://signup.eune.leagueoflegends.com/en/signup/index")
# grab the src of the recaptcha iframe from the live page
url = driver.find_element_by_css_selector("iframe[role='presentation']").get_attribute('src')
registerurl = requests.get(url)
soup = BeautifulSoup(registerurl.text, features="html5lib")
hidden_tags = soup.find('input', attrs={"id": "recaptcha-token"})
print(hidden_tags['value'])
I'm trying to scrape a website that shows a loading screen. When I browse the website it shows "loading..." for a second and then the content appears. The problem is that when I try to scrape it using Scrapy I get nothing (probably because of that loading screen). Can I solve this with Scrapy, or should I use some other tool?
Here's the link to the website if you want to take a look: https://www.graana.com/project/601/lotus-lake-towers
The page sends a GET request to fetch the information about the property, so you should mimic the same request in your code. (You can observe the GET call in the developer console under Network -> XHR.)
# -*- coding: utf-8 -*-
import scrapy

class GranaSpider(scrapy.Spider):
    name = 'grana'
    allowed_domains = ['www.graana.com']
    start_urls = ['https://www.graana.com/api/area/slug/601']

    def parse(self, response):
        # start_urls already issues the GET request for the API endpoint
        print(response.body)
        # convert the JSON response to an array and save it to your storage system
The output is in JSON format; convert it however is convenient for you.
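For instance, here is a minimal sketch of parsing that JSON body inside a Scrapy spider; the spider name and the 'area' key below are hypothetical, just for illustration.

import json
import scrapy

class GranaJsonSpider(scrapy.Spider):
    name = 'grana_json'  # hypothetical spider name
    allowed_domains = ['www.graana.com']
    start_urls = ['https://www.graana.com/api/area/slug/601']

    def parse(self, response):
        data = json.loads(response.text)  # the API endpoint returns JSON, not HTML
        # yield the parsed payload so Scrapy can export it, e.g. scrapy crawl grana_json -o out.json
        yield {'area': data}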
I know this question is old and already answered, but I wanted to share my solution after encountering a similar problem. The accepted answer was not helpful to me because I was not using Scrapy.
I wanted to scrape a website that first displays a loading page and then displays the actual page content.
Here's an example of such a website: the myjob.mu search results page used in the code below.
The requests library will not work for such websites; in my experience, requests.get(URL, headers=HEADERS) simply times out.
Solution
Use Selenium.
First you need to know approximately how long the loading page animation lasts. In the above website, it takes around 3 seconds.
The trick is to simply sleep your program for the duration of the animation after navigating to the website with driver.get(URL).
By the time the program finishes sleeping, the loading animation will be over so we can safely extract the HTML of the actual page content using driver.page_source.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
# the following options are only for setup purposes
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"
driver.get(URL)
time.sleep(5) # any number > 3 should work fine
html = driver.page_source
print(html)
The BeautifulSoup library can then be used to parse the HTML.
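For example, a minimal sketch of feeding the html variable from the snippet above into BeautifulSoup and printing a few elements as a sanity check (which tags you actually extract will depend on the page):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # 'html' is driver.page_source from the code above
print(soup.title.string)  # page title, to confirm the real content loaded
for link in soup.find_all('a')[:10]:  # first few links as a quick check
    print(link.get('href'))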
I'm practicing web scraping with Python at the moment and I've hit a problem. I want to scrape a website that lists the anime I've watched, but when I try to scrape it (via requests or Selenium) I only get around 30 of the 110 anime names on the page.
Here is my code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only goes up to an anime called 'Golden Time', even though there are 70 or more titles left on the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)

footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
# keep scrolling the footer into view until the page stops growing
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())

driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer =driver.find_element_by_css_selector("div.footer")
preY =0
while footer.rect['y']!=preY:
preY = footer.rect['y']
footer.location_once_scrolled_into_view
time.sleep(1)
print(str(driver.page_source))
This will iterate until all the anime is loaded and then gets the page source.
Let us know if this was helpful.
So, this is the gist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that Javascript is enabled and my Chrome version is fully up to date, and the URL listed takes one to a nonsecure website to "download" a new version of your browser, I think this is a spam site. Not sure if you were aware of that when posting so I won't flag as such, but I wanted you and others who come across this to be aware.
I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open in Chrome).
For that, I wrote the Chrome webdriver code below:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver
display = Display(visible=0, size=(800, 600))
display.start()
driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')
try:
    element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        print "Processing link - ", item['link']
        trend = item
        print item['description']
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print
        print s.find_element_by_tag_name('video').get_attribute('src')
        self.yahoo_video_trend.append(trend)
except:
    return
This works fine on my local system, but when I run it on my Azure server it does not return any result from s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update:
Please note that I have already tried requests and BeautifulSoup, but since Yahoo loads the HTML content dynamically from JSON, I could not get the video URL with them.
And yes, the Azure server is a simple Linux system with command-line access only, not an application.
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page ('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html'), testing with both IE and Chrome.
I used the developer tools to inspect the HTML code.
It seems that this page uses a Flash player to play the video, not an HTML5 video control.
For this reason, I suggest you check whether your code is using the right tag name.
If you have any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get chrome driver to work, but I did try the firefox driver and it worked fine. It was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, to see where the script is failing?
Change your code:
except:
return
try
do
except Exception,e: print str(e)
Send us the exception, so we can take a look.