Webscraping links not the same as manual browsing - python

I have scraped a site for 840 URLs.
When I rebuild the URLs to get more information, my Python scraper does not return the same data as I get when I click the links manually.
For example, when I visit this website: https://salesweb.civilview.com/Sales/SalesSearch
If I click on the first 'Details' link in the list, it takes me to a page with more information.
The link itself is a relative one: '/Sales/SaleDetails?PropertyId=254119896'
I scraped the 'Details' relative link and then rebuilt it into an absolute address.
The address becomes
https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119896
However, when I try to scrape this address, I get a totally different set of data and I am taken to a general landing page:
https://salesweb.civilview.com/
At first I thought I needed a headless browser to fix the problem, but now I am not sure.
Here is my code:
import time
from selenium import webdriver
baseurl = 'https://salesweb.civilview.com'
link = '/Sales/SaleDetails?PropertyId=254119946'
url1 = baseurl + link
driver = webdriver.PhantomJS()
driver.get(url1)
time.sleep(10)  # give the page time to finish loading before reading it
html = driver.page_source
driver.quit()

I found a workaround: if you first interact with the website, you can access the other URLs. Unfortunately, I have no idea why it works:
driver = webdriver.PhantomJS()
driver.get("https://salesweb.civilview.com/")
driver.find_element_by_link_text('Atlantic County, NJ').click()
driver.get("https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=254119946")
html = driver.page_source
print(html)
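One plausible explanation, which the original poster did not confirm, is that the site keeps the selected county in a session tied to a cookie, so a detail URL opened in a fresh session is redirected to the landing page. The sketch below rebuilds the absolute URL with urljoin and makes a warm-up request in the same session before requesting the detail page. The helper name fetch_detail and the warmup_url parameter are illustrative, and session is assumed to behave like a requests.Session.

```python
from urllib.parse import urljoin

BASE_URL = "https://salesweb.civilview.com"


def fetch_detail(session, relative_link, warmup_url, base_url=BASE_URL):
    """Visit warmup_url first, then fetch the detail page in the same session.

    session is assumed to be a requests.Session (or anything with a
    compatible get method), so cookies set by the first response are
    sent along with the second request.
    """
    # Warm-up request: lets the server set whatever session cookies it needs.
    session.get(warmup_url, timeout=30).raise_for_status()

    # Rebuild the absolute address from the scraped relative link.
    detail_url = urljoin(base_url, relative_link)
    response = session.get(detail_url, timeout=30)
    response.raise_for_status()
    return detail_url, response.text


# Usage (needs the third-party requests package and network access):
#   import requests
#   with requests.Session() as session:
#       url, html = fetch_detail(
#           session,
#           "/Sales/SaleDetails?PropertyId=254119946",
#           warmup_url=BASE_URL + "/",  # assumption: the page that sets the cookie
#       )
```

If the cookie is only set after choosing a county, as the click in the workaround suggests, warmup_url would have to be the address that county link points to.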

Related

Creating url for Bet365 In-Play Live Match Data Scrape with Python and Selenium

I am trying to get urls for each live match from https://www.348365365.com/#/IP/B1.
Here is a python script in which I am using Selenium to parse the main page which contains all live matches.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get('https://www.348365365.com/#/IP/B1')
time.sleep(10)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
The problem is that I cannot find the event ID. As an example, a URL should look like this: https://www.348365365.com/#/IP/EV15569134772C1. I need IDs like EV15569134772C1 to create the URL for each match, but they are not present in the page source.
It seems inaccessible with Selenium (the page loads indefinitely).
-> https://www.tutorialfor.com/questions-316541.htm
If you do manage to connect with Selenium, simulate clicks on the divs, retrieve the current URL, go back, and repeat.
Moreover, bet365 has been defending itself against web scraping for a long time.
From what I've seen, once the page is loaded, nothing more goes over the network. So the solution must be in the JS, HTML, and XHR files. Good luck with the reverse engineering :)
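The click, read, go back loop described above could look roughly like this. It is only a sketch: collect_urls_by_clicking and the locate callback are illustrative names, driver is assumed to be a Selenium WebDriver, and nothing here gets around the site's anti-scraping measures or the loading problem mentioned above.

```python
import time


def collect_urls_by_clicking(driver, locate, pause=1.0):
    """Click each element returned by locate(driver) and record the URL it leads to.

    locate is a caller-supplied function that returns the clickable elements,
    for example: lambda d: d.find_elements(By.CSS_SELECTOR, "div.some-class")
    (the selector is a placeholder, not taken from the real page).
    """
    urls = []
    total = len(locate(driver))
    for index in range(total):
        # Look the elements up again on every pass: references obtained
        # before a navigation are usually stale afterwards.
        elements = locate(driver)
        if index >= len(elements):
            break
        elements[index].click()
        time.sleep(pause)  # crude wait for the address bar to change
        urls.append(driver.current_url)
        driver.back()
        time.sleep(pause)
    return urls
```

Passing the lookup in as a function keeps the selector, which the thread never identifies, out of the loop itself.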

how to find URL of some elements of a webpage?

The webpage is: https://www.vpgame.com/market/gold?order_type=pro_price&order=desc&offset=0
As you can see, there are 25 items in the selling part of this page; clicking one opens a new tab that shows that item's details.
Now I want to write a program that gets those 25 item URLs and saves them in a list. My problem is that, as the page inspector shows, the items are <div> tags where I would expect <a> tags, and I can't find any 'href' attributes related to them.
# using selenium and driver = webdriver.Chrome()
link = driver.find_elements_by_tag_name('a')
link2 = [l.get_attribute('href') for l in link]
I thought I could do it with the code above, but I run into the problem I described. Any suggestions?
It looks like you are trying to scrape a page that is powered by React. There are no href attributes because JavaScript handles all the linking. Your best bet is to use Selenium to click each of the div objects, switch to the newly opened tab, and use something like this code to get the URL of the page it has taken you to:
import time
from selenium.webdriver.common.by import By
# assumes driver = webdriver.Chrome() has already loaded the page
links = driver.find_elements(By.CLASS_NAME, 'card-header')
urls = []
for link in links:
    link.click()  # opens the item in a new tab
    driver.switch_to.window(driver.window_handles[1])
    urls.append(driver.current_url)
    driver.close()  # close the new tab
    driver.switch_to.window(driver.window_handles[0])
    time.sleep(1)
Note that the code closes the new tab each time and goes back to the main tab. I added time.sleep() so it doesn't go too fast.
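The loop above assumes the new tab is always window_handles[1] and that it exists as soon as click() returns. A slightly more defensive sketch compares the window handles before and after the click and waits until a new one appears. The function name url_opened_by_click and its parameters are made up for this example; driver and element are assumed to be a Selenium WebDriver and WebElement.

```python
import time


def url_opened_by_click(driver, element, timeout=10.0, poll=0.2):
    """Click element, wait for the tab it opens, and return that tab's URL."""
    original = driver.current_window_handle
    before = set(driver.window_handles)
    element.click()

    # Poll until a handle shows up that was not there before the click.
    deadline = time.monotonic() + timeout
    while True:
        new_handles = [h for h in driver.window_handles if h not in before]
        if new_handles:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError("no new tab appeared after the click")
        time.sleep(poll)

    driver.switch_to.window(new_handles[0])
    try:
        return driver.current_url
    finally:
        # Close the new tab and return to the tab we started from.
        driver.close()
        driver.switch_to.window(original)
```

It would be called once per element, for example urls = [url_opened_by_click(driver, link) for link in links].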

Webscraping Live data

I am currently trying to scrape live stock market data from the Yahoo Finance page.
I am using bs4. My current issue is that whenever I run my script, the output does not update to reflect the current price of the stock.
If anybody has any advice on how to change that it would be appreciated.
import requests
from bs4 import BeautifulSoup
while True:
    page = requests.get("https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X")
    soup = BeautifulSoup(page.text, "html.parser")
    price = soup.find("div", {"class": "My(6px) Pos(r) smartphone_Mt(6px)"}).find("span").text
    print(price)
NOT POSSIBLE WITH BS4 ALONE
This website uses JavaScript to update the page, and requests, urllib, and similar libraries only fetch the HTML content of the page; they do not run JavaScript or load AJAX content.
PhantomJS or Selenium drive a real browser that can run the JavaScript code, which makes dynamic websites scrapable. Try using one of them :)
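To see why parsing alone is not enough, here is a small self-contained illustration using only the standard library. The HTML snippet is invented for the example (it is not Yahoo's markup): the price element is empty in the markup and is only filled in by a script, which an HTML parser never runs.

```python
from html.parser import HTMLParser

# Invented page: the div is empty until the script runs in a browser.
STATIC_HTML = """
<div id="price"></div>
<script>
  document.getElementById("price").textContent = "0.6123";
</script>
"""


class PriceExtractor(HTMLParser):
    """Collects the text found inside <div id="price">."""

    def __init__(self):
        super().__init__()
        self.inside_price = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "price":
            self.inside_price = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_price = False

    def handle_data(self, data):
        if self.inside_price:
            self.text += data


def extract_price(html):
    parser = PriceExtractor()
    parser.feed(html)
    return parser.text.strip()


# The parser sees the script only as text, so the price stays empty.
print(repr(extract_price(STATIC_HTML)))                     # ''
# Once a browser has run the script, the rendered HTML contains the value.
print(repr(extract_price('<div id="price">0.6123</div>')))  # '0.6123'
```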
Using Selenium, it can be done like this:
from selenium import webdriver  # the browser automation library
import time
from bs4 import BeautifulSoup as soup
# We are going to use the Chrome browser
chrome_options = webdriver.ChromeOptions()
# Hide the Chrome window
chrome_options.add_argument("--headless")
# Start Chrome with the options above (assumes chromedriver can be found on PATH)
driver = webdriver.Chrome(options=chrome_options)
# URL we need to open
url = 'https://nz.finance.yahoo.com/quote/NZDUSD=X?p=NZDUSD=X'
# Open the URL, just as you would in Chrome
driver.get(url)
while True:
    # Re-read the rendered HTML on every pass so the changing price is picked up
    html = driver.page_source
    page_soup = soup(html, features="lxml")
    price = page_soup.find("div", {"class": "D(ib) Mend(20px)"}).text
    print(price)
    time.sleep(5)
Not the best comments, but I hope you can follow it :) Otherwise, watch a YouTube tutorial to get a proper idea of what a Selenium bot does.
I hope this helps. It works perfectly for me :) Accept this answer if it helps you.

Can't get all titles from a list with Python WebScraping

I'm practicing web scraping with Python at the moment and I ran into a problem. I wanted to scrape a website that has a list of anime I have watched, but when I try to scrape it (via requests or Selenium) I only get around 30 of the 110 anime names on the page.
Here is my code with Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only goes up to an anime called 'Golden time', even though there are 70 or more left on the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())
driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    time.sleep(1)
print(str(driver.page_source))
This keeps scrolling until all the anime are loaded and then gets the page source.
Let us know if this was helpful.
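A variation on the same idea, sketched under the assumption that the page grows as you scroll: compare document.body.scrollHeight before and after each scroll and stop when it no longer changes. The max_rounds cap is an addition so the loop cannot run forever; scroll_until_stable is an illustrative name and driver is assumed to be a Selenium WebDriver.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazily loaded entries time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = new_height
    return max_rounds
```

After it returns, driver.page_source can be parsed as before.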
So, this is the gist of what I get when I load the page source:
AniList
window.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';
Sorry, AniList requires Javascript. Please enable Javascript or upgrade to a modern web browser (http://outdatedbrowser.com).
Sorry, AniList requires a modern browser. Please upgrade to a newer web browser (http://outdatedbrowser.com).
Since I know damn well that Javascript is enabled and my Chrome version is fully up to date, and the URL listed takes one to a nonsecure website to "download" a new version of your browser, I think this is a spam site. Not sure if you were aware of that when posting so I won't flag as such, but I wanted you and others who come across this to be aware.

Python: finding content in dynamically generated HTML

I am trying to get stock option prices from this website based on the series code (for example FMM1), but the content is generated dynamically after the page loads, so my Python Selenium script does not get the right source and cannot find the element. I can find it with "Inspect element", but not when I click "View source code".
This is my code:
import time
from selenium import webdriver
# Here, we open the website for options prices in Chrome
driver = webdriver.Chrome()
driver.get("http://www.bmfbovespa.com.br/pt_br/servicos/market-data/consultas/mercado-de-derivativos/precos-referenciais/precos-referenciais-bm-f-premios-de-opcoes/")
# Since the page is populated by JavaScript code *after* loading the page, we
# tell the browser to wait 10 seconds before getting the source html code
time.sleep(10)
html_file = driver.page_source # gets the html source of the page
print(html_file)
I have also tried the following, but it did not work:
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.ID,
"divContainerIframeBmf")))
The content is inside an iframe, so switch to it after the page loads:
driver.switch_to.frame(driver.find_element_by_xpath("//iframe"))
Then continue performing your operations on the page.
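As a rough sketch of why this helps: after switching into the frame, driver.page_source returns the frame's document instead of the outer page. The helper name page_source_inside_frame is made up for this example, and driver is assumed to be a Selenium WebDriver.

```python
def page_source_inside_frame(driver, frame):
    """Return the HTML of the document inside frame, then switch back out.

    frame can be anything driver.switch_to.frame accepts, such as the
    WebElement for the iframe.
    """
    driver.switch_to.frame(frame)
    try:
        return driver.page_source
    finally:
        # Always return to the top-level document so later lookups
        # are not silently scoped to the iframe.
        driver.switch_to.default_content()


# Usage with a real driver, assuming the first iframe is the right one:
#   frame = driver.find_element_by_xpath("//iframe")
#   html_file = page_source_inside_frame(driver, frame)
```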
