I'm trying to loop through Zillow pages and extract data. I know that the URL is being updated with a new page number after each iteration but the data extracted is as if the URL is still on page 1.
import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
next_page='https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/'
num_data1=pd.DataFrame(columns=['name','number'])
browser=webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')
while True:
    page = requests.get(next_page)
    contents = page.content
    soup = BeautifulSoup(contents, 'html.parser')
    number_p = soup.find_all('p', attrs={'class': 'ldb-phone-number'}, text=True)
    name_p = soup.find_all('p', attrs={'class': 'ldb-contact-name'}, text=True)
    number_p = pd.DataFrame(number_p, columns=['number'])
    name_p = pd.DataFrame(name_p, columns=['name'])
    num_data = number_p['number'].apply(lambda x: x.text.strip())
    nam_data = name_p['name'].apply(lambda x: x.text.strip())
    number_df = pd.DataFrame(num_data, columns=['number'])
    name_df = pd.DataFrame(nam_data, columns=['name'])
    num_data0 = pd.concat([number_df, name_df], axis=1)
    num_data1 = num_data1.append(num_data0)
    try:
        button = browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        next_page = str(browser.current_url)
    except IndexError:
        break
Replace page = requests.get(next_page) and the following contents = page.content with contents = browser.page_source, so that you parse the HTML of the page Chrome is actually on.
Basically what's happening is that you're going to the next page in Chrome, but then trying to load that page's URL with requests, which gets redirected back to page one by Zillow (probably because the request doesn't carry the browser's cookies or the appropriate request headers).
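A minimal sketch of the corrected loop, assuming the same class names and pagination selector as in the question (note that Selenium raises NoSuchElementException rather than IndexError when the "next" link is gone):

from selenium.common.exceptions import NoSuchElementException

while True:
    # Parse whatever page the Selenium-driven Chrome window is currently showing,
    # instead of re-requesting the URL with requests (which loses the session).
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    names = [p.text.strip() for p in soup.find_all('p', attrs={'class': 'ldb-contact-name'})]
    numbers = [p.text.strip() for p in soup.find_all('p', attrs={'class': 'ldb-phone-number'})]
    rows = pd.DataFrame(list(zip(names, numbers)), columns=['name', 'number'])
    num_data1 = pd.concat([num_data1, rows], ignore_index=True)
    try:
        # Advance to the next results page; the selector is the one from the question.
        browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
    except NoSuchElementException:
        break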
why not make your life easier and use the Zillow API instead of scraping? (do you even have permission to scrape their site?)
Is it possible to send a get request to a webdriver using selenium?
I want to scrape a website with an infinitely scrolling page and collect a substantial number of the objects on it. For this I use Selenium to open the website in a webdriver and scroll down the page until enough objects are visible.
However, I'd like to scrape the information on the page with BeautifulSoup since this is the most effective way in this case. If the GET request is sent in the normal way (see the code), the response only holds the first objects and not the objects from the scrolled-down page (which makes sense).
But is there any way to send a get request to an open webdriver?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import requests
import time
from bs4 import BeautifulSoup

# Opening the website in the webdriver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

# Loop for scrolling
scroll_start = 0
for i in range(100):
    scroll_end = scroll_start + 1080
    driver.execute_script(f'window.scrollTo({scroll_start}, {scroll_end})')
    time.sleep(2)
    scroll_start = scroll_end

# The get request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
You should probably find the endpoint the website is calling to fetch the data for the infinite scrolling.
Go to the website, open the Dev Tools, open the Network tab and find the HTTP request that fetches the content you're after; then you may be able to call it yourself. Just know that there are a lot of variables: are they using some sort of authorization for their API? Does it return JSON, XML, HTML, ...? Also, I am not sure whether this is fair use.
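A minimal sketch of that approach, assuming the endpoint takes a page parameter and returns JSON; the URL, parameter name and JSON key here are hypothetical, you would copy the real ones from the Network tab:

import requests

# Hypothetical endpoint copied from the Network tab; the real URL and
# parameters depend on the site you are scraping.
ENDPOINT = 'https://example.com/api/items'

items = []
for page in range(1, 11):
    # Ask for one "scroll batch" at a time, the same way the page's own
    # JavaScript does when you scroll down.
    resp = requests.get(ENDPOINT, params={'page': page}, timeout=10)
    resp.raise_for_status()
    items.extend(resp.json()['results'])  # adjust the key to the real JSON shape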
import requests
from bs4 import BeautifulSoup

URL = 'https://www.moneycontrol.com/india/stockpricequote/cigarettes/itc/ITC'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'html.parser')
# time.sleep(5)
var1 = float(soup.find('td', attrs={'class': 'espopn'}).get_text().replace(",", ""))
With this code I am able to get the value of var1, but the web page I am accessing doesn't show the real-time data the moment you land on it; it takes about a second after the page loads for the real-time value to update.
Because of that, the value I get in var1 is not the real-time value.
I wanted to know how I can wait after landing on the web page before doing the scraping.
Thanks in Advance.
1. The data is updated dynamically, so it is hard to get with bs4 alone; you can fetch it from the API the page itself uses. Here is how to find it.
2. Go to Chrome developer mode, open the Network tab, filter by XHR and reload the website; under the Name column you will find links, but there are a lot of them.
3. On the left there is a search box, so you can search for the price; it points you to the URL. Click on it, go to Headers, copy that URL and make the call using the requests module, as in the code below.
import requests
from bs4 import BeautifulSoup

# Call the same API the page uses to refresh its prices.
res = requests.get("https://api.moneycontrol.com/mcapi/v1/stock/get-stock-price?scIdList=ITC%2CVST%2CGPI%2CIWP540954%2CGTC&scId=ITC")
main_data = res.json()
print(main_data['data'][0])
Output:
{'companyName': 'ITC',
'lastPrice': '215.25',
'perChange': '-0.62',
'marketCap': '264947.87',
'scTtm': '19.99',
'perform1yr': '7.33',
'priceBook': '4.16'}
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/) which pulls up a box with a list of product categories (really just redirect URLs; if the box doesn't come up for you, click the button that says "Browse all products"). I tried using Python Requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect URLs. My code is as basic as it gets:
import requests

url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)

with open("laptops.txt", "w", encoding="utf-8") as outf:
    outf.write(page.text)
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (Network tab) I found that it reads them from the URL
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use it to get the links.
You may have to change country=pl&language=pl in the URL to get it in a different language.
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"

response = requests.get(url)
soup = BS(response.text, 'html.parser')

all_items = soup.find_all('a')
for item in all_items:
    print(item.text, item['href'])
BTW: another method is to use Selenium to control a real web browser, which can run JavaScript.
Try using the Selenium Chrome driver; it helps with handling dynamic data on websites and also supports things like clicking buttons, handling page refreshes, etc.
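A minimal sketch of that route, assuming the category links render as plain <a> tags once the page's JavaScript has run; the wait time and the filtering are guesses you would adjust after inspecting the page:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.dell.com/support/home/us/en/04/products/")
time.sleep(5)  # crude wait for the JavaScript-rendered product box to appear

# Parse the rendered HTML and list every link that has an href.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        print(link.get_text(strip=True), href)

driver.quit()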
Beginner guide to web scraping
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedriver.exe')
url = 'https://www.sambav.com/hyderabad/doctors'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')
for links in soup.find_all('div', class_='sambavdoctorname'):
    link = links.find('a')
    print(link['href'])

driver.close()
I am trying to scrape this page; the link stays the same on all pages. I am trying to extract the links from all the pages, but it's not giving any output nor showing any error, the program just ends.
If you inspect that website with the developer tools in your browser (Chrome, Firefox or whatever), you'll see that before the page loads it fetches data from a few sources. One of these sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=". Your code can be simplified (and there is no need to use Selenium):
import requests

r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')

BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
    print(BASE_URL_DOCTOR + item['uniqueName'])
I'm trying to scrape the number of followers from this page http://freelegalconsultancy.blogspot.co.uk/ but can't seem to pull it. I've tried using urllib, urllib2, urllib3, selenium and beautiful soup, but have had no luck pulling the followers. Here's what my code looks like currently:
import urllib2

url = "http://freelegalconsultancy.blogspot.co.uk/"
opener = urllib2.urlopen(url)
for item in opener:
    print item
How would I go about pulling the number of followers?
Try to use selenium code as below:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://freelegalconsultancy.blogspot.co.uk/')

# The follower count lives inside an iframe, so switch into it first.
driver.switch_to_frame(driver.find_element_by_xpath('//div[@id="followers-iframe-container"]/iframe'))
followers_text = driver.find_element_by_xpath('//div[@class="member-title"]').text
followers = int(followers_text.split('(')[1].split(')')[0])
The last line is kinda crude, so you can change it if you like.
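For example, a slightly cleaner way to pull the number out of text like "Members (76)" is a regular expression instead of the chained splits (just a sketch, the label format is assumed from the widget):

import re

# followers_text looks something like "Members (76)"; grab the digits in parentheses.
match = re.search(r'\((\d+)\)', followers_text)
followers = int(match.group(1)) if match else 0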