Hi guys
I have a problem scraping this dynamic site (https://kvartiry-bolgarii.ru/).
I need to get all the links to the home sale ads.
I used Selenium to load the page and get the links to the ads, and after that I scroll the page down to load new ads. After the new ads are loaded, I parse all the links on the page again and write them to the list.
But the data in the list is not updated, and the script keeps working with the links that were on the page before scrolling down.
By the way, I added a check so that the script keeps running until the last ad on the site (whose link I found out in advance) appears in the list.
How can I fix this?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests

options = webdriver.ChromeOptions()  # assumed; the snippet defines options elsewhere

def get_link_info():
    try:
        url = "https://kvartiry-bolgarii.ru/"
        driver = webdriver.Chrome(
            executable_path=r'C:\Users\kk\Desktop\scrape_house\drivers\chromedriver.exe',
            options=options
        )
        driver.get(url)

        req = requests.get(url)
        req.encoding = 'utf8'
        soup = BeautifulSoup(req.text, "lxml")

        articles = soup.find_all("div", class_="content")

        links_urls = []
        for article in articles:
            house_url = article.find("a").get("href")
            links_urls.append(house_url)
        #print(links_urls)

        first_link_number = links_urls[-2].split("-")[-1]
        first_link_number = first_link_number[1:]
        #print(first_link_number)

        last_link_number = links_urls[-1].split("-")[-1]
        last_link_number = last_link_number[1:]
        #print(last_link_number)

        html = driver.find_element_by_tag_name('html')
        html.send_keys(Keys.END)

        check = "https://kvartiry-bolgarii.ru/kvartira-v-elitnom-komplekse-s-unikalynym-sadom-o21751"
        for a in links_urls:
            if a != check:
                for article in articles:
                    house_url = article.find("a").get("href")
                    links_urls.append(house_url)
                html = driver.find_element_by_tag_name('html')
                html.send_keys(Keys.END)
                print(links_urls[-1])
            else:
                print(links_urls[0], links_urls[-1])
                print("all links are ready")
    except Exception as e:
        raise e
Some pointers: you don't need to mix Selenium, requests, and BeautifulSoup; Selenium alone is enough. When you are scrolling infinitely, you need to remove duplicate elements before adding them to your list.
You can try this. This should work.
from selenium import webdriver
import time

def get_link_info():
    all_links = []
    try:
        driver = webdriver.Chrome(executable_path='C:/chromedriver.exe')
        driver.get('https://kvartiry-bolgarii.ru/')
        time.sleep(3)
        old_links = set()  # Empty Set
        while True:
            # Scroll to get more ads
            driver.execute_script("window.scrollBy(0,3825)", "")
            # Wait for new ads to load
            time.sleep(8)
            links_divs = driver.find_elements_by_xpath('//div[@class="content"]//a')  # Find Elements
            ans = set(links_divs) - set(old_links)  # Remove old elements
            for link in ans:
                # Scroll to the link.
                driver.execute_script("arguments[0].scrollIntoView();", link)
                fir = link.get_attribute('href')
                all_links.append(fir)
            # Remove Duplicates
            old_links = links_divs
    except Exception as e:
        raise e

get_link_info()
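Note that the find_elements_by_* helpers used above have been removed in recent Selenium 4 releases; if you are on a newer version, the same lookup would be written with the By API (a small sketch, assuming the same driver and page):

from selenium.webdriver.common.by import By

# Selenium 4 style: the same XPath, but via the generic find_elements call
links_divs = driver.find_elements(By.XPATH, '//div[@class="content"]//a')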
I'm working on a project where I get posts from a couple of websites and show them on the main page of my own website, with filters that let users search for keywords and see the posts matching those keywords.
This is how the code works:
Here we build the customized URL of the site from our keyword and city filter.
def link_gen(city='', Kword=''):
    # for example.com
    urls = []
    if Kword != '':
        if city == '':
            url = f'https://www.example.com/search/with{Kword}'
            url = url.strip()
            url = url.replace(" ", "-")
            urls.append(url)
        else:
            url = f'https://www.example.com/search/with{Kword}in{city}'
            url = url.strip()
            url = url.replace(" ", "-")
            urls.append(url)
    else:
        if city != '':
            url = f'https://www.example.com/search/in{city}'
            url = url.strip()
            url = url.replace(" ", "-")
            urls.append(url)
        else:
            urls.append('none')
    return urls
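For example, it produces URLs like these (the example.com slug format is just the placeholder used in the question):

print(link_gen(city='Toronto', Kword='developer'))
# ['https://www.example.com/search/withdeveloperinToronto']

print(link_gen(Kword='web designer'))
# ['https://www.example.com/search/withweb-designer']

print(link_gen())
# ['none']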
This part is where we crawl the posts of the target website:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

# function for getting the title, link, icon, desc of all posts
def get_cards(urls):
    data = []
    # for example.com
    if urls[0] != 'none':
        # we use webdriver to get the site with dynamic components and design
        url = urls[0]
        options = Options()
        options.headless = True
        browser = webdriver.Firefox(options=options)
        browser.get(url)
        print("Headless Firefox Initialized")
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        jobs = soup.find_all('div', class_="job-list-item", limit=3)
        # looping through all the cards
        for job in jobs:
            # get the title, link, icon, desc
            title = job.find('a', class_="title vertical-top display-inline").text
            icon = job.find('img')['src']  # assumed: the original had an undefined name (tage_name_img) here
            link = job.find('a', class_="title vertical-top display-inline")['href']
            date = job.find('div', class_="date").text
            data.append(dict(
                title=title,
                icon=f'https://www.example.com/{icon}',
                link=f'https://www.example.com/{link}',
                date=date,
                site='example'
            ))
        browser.close()
    return data
The problem is that to get the posts and the dynamically rendered tags on these websites I needed to use Selenium; I can't use session.get(url) because it won't return all the tags.
With Selenium it takes forever to return the posts, even though I only crawl 3 posts.
I think the webdriver uses a lot of resources.
I ran out of RAM when I tried to run it locally.
Any suggestion would be much appreciated.
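One way to cut the resource usage would be to launch a single headless browser and reuse it for every URL instead of starting a new Firefox per call; a minimal sketch of that idea, assuming the same example.com markup as above (the helper names here are mine):

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

def make_browser():
    # one headless Firefox instance shared by every request
    options = Options()
    options.headless = True
    return webdriver.Firefox(options=options)

def get_cards_reusing(browser, urls):
    data = []
    for url in urls:
        if url == 'none':
            continue
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        for job in soup.find_all('div', class_="job-list-item", limit=3):
            link_tag = job.find('a', class_="title vertical-top display-inline")
            data.append({'title': link_tag.text, 'link': link_tag['href']})
    return data

browser = make_browser()
try:
    cards = get_cards_reusing(browser, link_gen(city='Toronto', Kword='developer'))
finally:
    browser.quit()  # always release the browser so its memory is freed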
I am trying to scrape this site, https://franchisedisclosure.gov.au/Register, with Playwright, and the URL doesn't change after you click on the next button. How do I solve this pagination problem?
Here's my code
from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')
    page.is_visible('tbody')
    html = page.inner_html('table.table.table-hover')
    soup = bs(html, 'html.parser')
    table = soup.find('tbody')
    rows = table.findAll('tr')

    names = []
    industry = []
    Locations = []

    for row in rows:
        info = row.findAll('td')
        name = info[0].text.strip()
        industry = info[1].text.strip()
        Locations = info[2].text.strip()
I've checked online and every solution I see involves the URL changing. And for some reason you can't make requests to the site's API; Postman said something about the parameters not being sent.
With some small adjustments you can get it. Let's try this:
from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright
import time

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=100)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')

    names = []
    industry = []
    Locations = []

    # When you click to the next page, an element with the text "Loading..." appears on screen, so we save that selector
    loading_icon = "//strong[text()='Loading...']"
    # This is the "next page" button
    next_page_locator = "//ul[@class='pagination']/li[3]"

    # We select the option of 50 elements per page
    page.select_option("#perPageCount", value="50")
    # We wait for the loading icon to become visible and then hidden, which means the new list is fully loaded
    page.wait_for_selector(loading_icon, state="visible")
    page.wait_for_selector(loading_icon, state="hidden")
    time.sleep(1)

    # We loop until the "Next page" button is disabled, which means there are no more pages to paginate
    while "disabled" not in page.get_attribute(selector=next_page_locator, name="class"):
        # We get the info you wanted
        page.is_visible('tbody')
        html = page.inner_html('table.table.table-hover')
        soup = bs(html, 'html.parser')
        table = soup.find('tbody')
        rows = table.findAll('tr')
        for row in rows:
            info = row.findAll('td')
            # store the scraped values
            names.append(info[0].text.strip())
            industry.append(info[1].text.strip())
            Locations.append(info[2].text.strip())
        # Once we have the info, we click the next page button and wait for the loading element to become visible and then hidden
        page.click(next_page_locator)
        page.wait_for_selector(loading_icon, state="visible")
        page.wait_for_selector(loading_icon, state="hidden")
        time.sleep(1)
Thanks for the great question... and answer. In addition to (or instead of) using the loading_icon, you could also use "networkidle", expanding on @Jaky Ruby's answer by adding page.wait_for_load_state(state="networkidle"). I often use the networkidle option to check that the next page has finished loading, although I've read somewhere that it's not necessarily best practice... but it works quite often.
from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright
import time

url = 'https://franchisedisclosure.gov.au/Register'

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=100)
    page = browser.new_page()
    page.goto(url)
    page.locator("text=I agree to the terms of use").click()
    page.locator("text=Continue").click()
    page.wait_for_load_state('domcontentloaded')

    names = []
    industry = []
    Locations = []

    # When you click to the next page, an element with the text "Loading..." appears on screen, so we save that selector
    loading_icon = "//strong[text()='Loading...']"
    # This is the "next page" button
    next_page_locator = "//ul[@class='pagination']/li[3]"

    # We select the option of 50 elements per page
    page.select_option("#perPageCount", value="50")
    # We wait for the loading icon to become visible and then hidden, which means the new list is fully loaded
    page.wait_for_selector(loading_icon, state="visible")
    page.wait_for_selector(loading_icon, state="hidden")
    page.wait_for_load_state(state="networkidle")
    time.sleep(1)

    # We loop until the "Next page" button is disabled, which means there are no more pages to paginate
    while "disabled" not in page.get_attribute(selector=next_page_locator, name="class"):
        # We get the info you wanted
        page.is_visible('tbody')
        html = page.inner_html('table.table.table-hover')
        soup = bs(html, 'html.parser')
        table = soup.find('tbody')
        rows = table.findAll('tr')
        for row in rows:
            info = row.findAll('td')
            # store the scraped values
            names.append(info[0].text.strip())
            industry.append(info[1].text.strip())
            Locations.append(info[2].text.strip())
        # Once we have the info, we click the next page button and wait for the loading element to become visible and then hidden
        page.click(next_page_locator)
        page.wait_for_selector(loading_icon, state="visible")
        page.wait_for_selector(loading_icon, state="hidden")
        time.sleep(1)
When I was trying to scrape data from Sephora and Ulta using BeautifulSoup, I could get the HTML content of the page. But when I then tried to use lxml to parse it with XPath, I didn't get any output. Using the same XPath in Selenium, I could get the output.
Using BeautifulSoup
for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])
    my_url = df['product_url'].iloc[i]
    My_url = ureq(my_url)
    my_html = My_url.read()
    My_url.close()
    soup = BeautifulSoup(my_html, 'html.parser')
    dom = et.HTML(str(soup))
    # price
    try:
        price = (dom.xpath('//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span/text()'))
        df['price'].iloc[i] = price
    except:
        pass
Using Selenium
lst=[]
urls=df['product_url']
for url in urls[:599]:
time.sleep(1)
driver.get(url)
time.sleep(2)
try:
prize=driver.find_element('xpath','//*[#id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
except:
pass
lst.append([prize])
pz=None
dt=None
Does anyone know why I can't get the content when I parse with lxml using the same XPath that works in Selenium? Thanks so much in advance.
Sample link for Ulta: https://www.ulta.com/p/coco-mademoiselle-eau-de-parfum-spray-pimprod2015831
Sample link for Sephora: https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168&icid2=products
1. About the XPath
driver.find_element('xpath', '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
I'm a bit surprised that the Selenium code works for your Sephora links - the link you provided redirects to a productnotcarried page, but at this link (for example), that XPath has no matches. You can use //p[@data-comp="Price "]//span/b instead.
Actually, even for Ulta, I prefer //*[@class="ProductHero__content"]//*[@class="ProductPricing"]/span just for human readability, although it looks even better written as a CSS selector:
prize = driver.find_element("css selector", '*.ProductHero__content *.ProductPricing>span').text
[Coding for both sites - Selenium]
To account for both sites, you could set up something like this reference dictionary:
xRef = {
    'www.ulta.com': '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span',
    'www.sephora.com': '//p[@data-comp="Price "]//span/b'
}
# for url in urls[:599]:... ################ REST OF CODE #############
and then use it accordingly
# from urllib.parse import urlsplit
# lst, urls, xRef = ....
# for url in urls[:599]:
    # sleep...driver.get...sleep...
    try:
        uxrKey = urlsplit(url).netloc
        prize = driver.find_element('xpath', xRef[uxrKey]).text
    except:
        # pass # you'll just be repeating whatever you got in the previous loop for prize
        # [also, if this happens in the first loop, an error will be raised at lst.append([prize])]
        prize = None  # 'MISSING' # '' #
    ################ REST OF CODE #############
2. Limitations of Scraping with bs4+requests
I don't know what et and ureq are, but the response from requests.get can be parsed without them; although [afaik] bs4 doesn't have any XPath support, CSS selectors can be used with .select.
price = soup.select('.ProductHero__content .ProductPricing>span') # for Ulta
price = soup.select('p[data-comp~="Price"] span>b') # for Sephora
Although that's enough for Sephora, there's another issue: the prices on Ulta pages are loaded with JS, so the parent of the price span is empty.
3. [Suggested Solution] Extracting from JSON inside script Tags
For both sites, product data can be found inside script tags, so this function can be used to extract price from either site:
# import json

############ LONGER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        s, sj = scriptTag.get_text(), json.loads(scriptTag.get_text())
        while s:
            sPair = s.split('"@type"', 1)[1].split(':', 1)[1].split(',', 1)
            t, s = sPair[0].strip(), sPair[1]
            try:
                if t == '"Product"': return sj['offers']['price']  # Ulta
                elif t == '"Organization"': return sj['offers'][0]['price']  # Sephora
                # elif.... # can add more options
                # else.... # can add a default
            except: continue
    except: return None
#######################################

############ SHORTER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        sj = json.loads(scriptTag.get_text())
        try: return sj['offers']['price']  # Ulta
        except: pass
        try: return sj['offers'][0]['price']  # Sephora
        except: pass
        # try...except: pass # can try more options
    except: return None
#######################################
and you can use it with your BeautifulSoup code:
# from requests_html import HTMLSession # IF you use it instead of requests
# def getPrice_fromScript....

for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])  # takes too long [for me]
    # response = HTMLSession().get(df['product_url'].iloc[i]) # is faster [for me]

    ## error handling, just in case ##
    if response.status_code != 200:
        errorMsg = f'Failed to scrape [{response.status_code} {response.reason}] - '
        print(errorMsg, df['product_url'].iloc[i])
        continue  # skip to next loop/url

    soup = BeautifulSoup(response.content, 'html.parser')
    pList = [p.strip() for p in [
        getPrice_fromScript(s) for s in soup.select('script[type="application/ld+json"]')[:5]  # [1:2]
    ] if p and p.strip()]
    if pList: df['price'].iloc[i] = pList[0]
(The price should be in the second script tag with type="application/ld+json", but this is searching the first 5 just in case....)
Note: requests.get was being very slow when I was testing these codes, especially for Sephora, so I ended up using HTMLSession().get instead.
I am trying to scrape the Backcountry.com review section. The site uses a dynamic "load more" section, i.e. the URL doesn't change when you want to load more reviews. I am using the Selenium webdriver to interact with the button that loads more reviews, and BeautifulSoup to scrape the reviews.
I was able to successfully interact with the load more button and load all the available reviews. I was also able to scrape the initial reviews that appear before you use the load more button.
IN SUMMARY: I can interact with the load more button and I can scrape the initial reviews, but I cannot scrape all the reviews that become available after I load them all.
I have tried changing the HTML tags to see if that makes a difference. I have tried increasing the sleep time in case the scraper didn't have enough time to complete its job.
# URL and request code for BeautifulSoup
url_filter_bc = 'https://www.backcountry.com/msr-miniworks-ex-ceramic-water-filter?skid=CAS0479-CE-ONSI&ti=U2VhcmNoIFJlc3VsdHM6bXNyOjE6MTE6bXNy'
res_filter_bc = requests.get(url_filter_bc, headers={'User-agent': 'notbot'})

# Function that scrapes the reviews
def scrape_bc(request, website):
    newlist = []
    soup = BeautifulSoup(request.content, 'lxml')
    newsoup = soup.find('div', {'id': 'the-wall'})
    reviews = newsoup.find('section', {'id': 'wall-content'})
    for row in reviews.find_all('section', {'class': 'upc-single user-content-review review'}):
        newdict = {}
        newdict['review'] = row.find('p', {'class': 'user-content__body description'}).text
        newdict['title'] = row.find('h3', {'class': 'user-content__title upc-title'}).text
        newdict['website'] = website
        newlist.append(newdict)
    df = pd.DataFrame(newlist)
    return df
# Function that uses Selenium, combined with the scraper function, to output a pandas DataFrame
def full_bc(url, website):
    driver = connect_to_page(url, headless=False)
    request = requests.get(url, headers={'User-agent': 'notbot'})
    time.sleep(5)
    full_df = pd.DataFrame()
    while True:
        try:
            loadMoreButton = driver.find_element_by_xpath("//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']")
            time.sleep(2)
            loadMoreButton.click()
            time.sleep(2)
        except:
            print('Done Loading More')
            # full_json = driver.page_source
            temp_df = pd.DataFrame()
            temp_df = scrape_bc(request, website)
            full_df = pd.concat([full_df, temp_df], ignore_index=True)
            time.sleep(7)
            driver.quit()
            break
    return full_df
I expect a pandas DataFrame with 113 rows and three columns.
I am getting a pandas DataFrame with 18 rows and three columns.
Ok, you clicked loadMoreButton and loaded more reviews. But you keep feeding scrape_bc the same request content that you downloaded once, completely separately from Selenium.
Replace requests.get(...) with driver.page_source, and make sure you take driver.page_source inside the loop, right before the scrape_bc(...) call:
request = driver.page_source
temp_df = pd.DataFrame()
temp_df = scrape_bc(request, website)
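Note that scrape_bc currently parses request.content, so once you pass it driver.page_source (a plain HTML string) that line changes too. A minimal sketch of how the two functions could fit together after the fix, reusing the question's selectors (passing the driver in directly is my simplification, since connect_to_page isn't shown):

from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_bc(html, website):
    # parse the HTML string rendered by Selenium
    soup = BeautifulSoup(html, 'lxml')
    wall = soup.find('div', {'id': 'the-wall'}).find('section', {'id': 'wall-content'})
    rows = wall.find_all('section', {'class': 'upc-single user-content-review review'})
    return pd.DataFrame([{
        'review': row.find('p', {'class': 'user-content__body description'}).text,
        'title': row.find('h3', {'class': 'user-content__title upc-title'}).text,
        'website': website,
    } for row in rows])

def full_bc(url, website, driver):
    driver.get(url)
    time.sleep(5)
    while True:
        try:
            driver.find_element_by_xpath(
                "//a[@class='btn js-load-more-btn btn-secondary pdp-wall__load-more-btn']").click()
            time.sleep(2)
        except Exception:
            break  # no more "load more" button
    # grab the page source only AFTER all the clicks, so every review is in the DOM
    df = scrape_bc(driver.page_source, website)
    driver.quit()
    return df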
Python knowledge: beginner
I managed to create a script to scrape contact information. The flow I followed, since I am a beginner, is to extract all the first-level links and copy them to a text file, which is then used in link = browser.find_element_by_link_text(str(link_text)). Scraping of the contact details has been confirmed working (based on a separate run). The problem is that after clicking the first links, the script won't go on to click the links inside them, so it cannot scrape the contact info.
What is wrong with my script? Please bear in mind I am a beginner, so my script is a little manual and lengthy.
Thanks very much!!!
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import requests
from bs4 import BeautifulSoup
import urllib
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import csv, time, lxml

######################### open file list ####################################
testfile = open("category.txt")  # this is where I saved the category
readfile = testfile.read()
readfilesplit = readfile.split("\n")
############################### end ###################################

################### open browser ###############################
browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')
####################### end ###################################

link_texts = readfilesplit
for link_text in link_texts:
    link = browser.find_element_by_link_text(str(link_text))
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
    link.click()  # click link
    time.sleep(5)
    print "-------------------------------------------------------------------------------------------------"
    print("Getting listings for '%s'" % link_text)

    ################# get list name #######################
    urlNoList = 'http://aucklandtradesmen.co.nz/home-mainmenu-1.html'
    r = requests.get(browser.current_url)

    if (urlNoList != browser.current_url):
        soup = BeautifulSoup(r.content, 'html.parser')
        g_data = soup.find_all("div", {"class": "listing-summary"})
        pageRange = soup.find_all("span", {"class": "xlistings"})
        pageR = [pageRange[0].text]
        pageMax = str(pageR)[-4:-2]  # get max item for lists
        X = str(pageMax).replace('nd', '0')
        # print "Number of listings: ", X
        Y = int(X)  # convert string to int
        print "Number of listings: ", Y

        for item in g_data:
            try:
                listingNames = item.contents[1].text
                lstList = []
                lstList[len(lstList):] = [listingNames]
                replStr = re.sub(r"u'", "'", str(lstList))  # strip u' char
                replStr1 = re.sub(r"\s+'", "'", str(replStr))  # strip space and '
                replStr2 = re.sub(r"\sFeatured", "", str(replStr1))  # strip Featured string
                print "Cleaned string: ", replStr2

                ################ SCRAPE INFO ################
                ################### This is where the code is not executing #######################
                count = 0
                while (count < Y):
                    for info in replStr2:
                        link2 = browser.find_element_by_link_text(str(info))
                        time.sleep(10)
                        link2.click()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#rating-msg")))
                        print "count", count
                        count += 1
                        print("Contact info for: '%s'" % link_text)
                        r2 = requests.get(browser.current_url)
                        soup2 = BeautifulSoup(r2.content, 'html.parser')
                        g_data2 = soup.find_all("div", {"class": "fields"})
                        for item2 in g_data2:
                            # print item.contents[0]
                            print item2.contents[0].text
                            print item2.contents[1].text
                            print item2.contents[2].text
                            print item2.contents[3].text
                            print item2.contents[4].text
                            print item2.contents[5].text
                            print item2.contents[6].text
                            print item2.contents[7].text
                            print item2.contents[8].text
                        browser.back()
                        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".add-listing")))
                ################### END ---- This is where the code is not executing END ---#######################
                ############ END SCRAPE INFO ####################
            except NoSuchElementException:
                browser.back()
                WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
    else:
        browser.back()
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
        print "Number of listings: 0"

    browser.back()
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "pagenav")))
By the way, this is some of the output:
-------------------------------------------------------------------------------------------------
Getting listings for 'Plumbers'
Number of listings: 5
Cleaned string: ['Hydroflame Plumbing & Gas Ltd']
Cleaned string: ['Osborne Plumbing Ltd']
Cleaned string: ['Plumbers Auckland Central']
Cleaned string: ['Griffiths Plumbing']
Cleaned string: ['Plumber Auckland']
-------------------------------------------------------------------------------------------------
Getting listings for 'Professional Services'
Number of listings: 2
Cleaned string: ['North Shore Chiropractor']
Cleaned string: ['Psychotherapy Werks - Rob Hunter']
-------------------------------------------------------------------------------------------------
Getting listings for 'Property Maintenance'
Number of listings: 7
Cleaned string: ['Auckland Tree Services']
Cleaned string: ['Bob the Tree Man']
Cleaned string: ['Flawless House Washing & Drain Unblocking']
Cleaned string: ['Yardiez']
Cleaned string: ['Build Corp Apartments Albany']
Cleaned string: ['Auckland Trellis']
Cleaned string: ['Landscape Design']
What I would do is change the logic somewhat. Here's the logic flow I would suggest you use. This will eliminate writing the links out to a file and speed up the script. (A rough sketch of this flow is shown after the list.)
1. Navigate to http://aucklandtradesmen.co.nz/
2. Grab all elements using CSS selector "#index a" and store the "href" attribute of each in an array of strings (links to each category page)
3. Loop through the href array
   3.1. Navigate to the href
      3.1.1. Grab all elements using CSS selector "div.listing-summary a" and store the .text of each (company names)
      3.1.2. If an element .by_link_text("Next") exists, click it and return to 3.1.1.
If you want business contact info off of the company pages, you would want to store the href in 3.1.1 as well, and then loop through that list and grab what you want off each page.
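A rough Selenium sketch of that flow, using the CSS selectors suggested above (treat it as an outline under those assumptions rather than tested code, since the element names on the live site may differ):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://aucklandtradesmen.co.nz/')

# 2. collect the href of every category link
category_links = [a.get_attribute('href')
                  for a in browser.find_elements_by_css_selector('#index a')]

company_pages = []
for href in category_links:
    # 3.1. open the category page
    browser.get(href)
    while True:
        # 3.1.1. store the company name and href of each listing
        for a in browser.find_elements_by_css_selector('div.listing-summary a'):
            company_pages.append((a.text, a.get_attribute('href')))
        # 3.1.2. follow pagination while a "Next" link exists
        next_links = browser.find_elements_by_link_text('Next')
        if not next_links:
            break
        next_links[0].click()

# then visit each stored company page and pull the contact fields you need
for name, page_url in company_pages:
    browser.get(page_url)
    # ... grab the contact info here ...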
Okay, I found a solution after thinking about @jeffC's suggestion: extract the href values and append them to the base URL, which is http://aucklandtradesmen.co.nz. So, for example, if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html, I tell the browser to navigate to that URL, and then I can do whatever I want on the current page.
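In other words, something along these lines (a tiny illustration of that approach, assuming browser is the same webdriver instance as in the script above):

base = 'http://aucklandtradesmen.co.nz'
href = '/home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html'
browser.get(base + href)  # navigate straight to the listing page, then scrape whatever you need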
extract the href values and append it to the base url which is http://aucklandtradesmen.co.nz, so for example the if the extracted href is /home-mainmenu-1/alarms-a-security/armed-alarms-ltd-.html,and tell browser to navigate to that URL..and then I can do whatever I want in the current page..